• Q ML and student T-test

    From Cosine@21:1/5 to All on Sun Dec 29 13:45:57 2019
    Hi:

    When we want to evaluate the performance of two machine-learning algorithms, we use the same set of data. Specifically, we take algorithm-1, divide the dataset into a training set and a testing set, train algorithm-1, and then test it. We do the same for algorithm-2.

    Now we need to compare the performance of the two algorithms. If we use Student's t-test, should we use the independent one or the paired one? Why?

    Thank you,

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Duffy@21:1/5 to Cosine on Mon Dec 30 03:11:39 2019
    Cosine <asecant@gmail.com> wrote:

    When we want to evaluate the performance of two machine-learning algorithms, we use the same set of data. Specifically, we take algorithm-1, divide the dataset into a training set and a testing set, train algorithm-1, and then test it. We do the same for algorithm-2.

    Now we need to compare the performance of the two algorithms. If we use Student's t-test, should we use the independent one or the paired one? Why?

    Where do your standard errors come from? Why not K-fold cross-validation
    or leave-one-out?
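
    A minimal sketch of that fold-wise set-up, assuming scikit-learn-style classifiers (the two models and the dataset below are placeholders, not anything from this thread): score both algorithms on the same K folds, so each fold yields a matched pair of accuracies.

    from scipy import stats
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)       # placeholder dataset
    algo1 = LogisticRegression(max_iter=5000)        # placeholder "algorithm-1"
    algo2 = DecisionTreeClassifier(random_state=0)   # placeholder "algorithm-2"

    acc1, acc2 = [], []
    for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        # same train/test split for both algorithms, so the fold scores come in pairs
        acc1.append(accuracy_score(y[te], algo1.fit(X[tr], y[tr]).predict(X[te])))
        acc2.append(accuracy_score(y[te], algo2.fit(X[tr], y[tr]).predict(X[te])))

    # paired comparison across the 10 folds; fold scores are not truly independent,
    # so treat the p-value as a rough guide rather than an exact one
    print(stats.ttest_rel(acc1, acc2))

    Swapping ttest_rel for ttest_ind gives the independent-samples version the original question asks about.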

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich Ulrich@21:1/5 to All on Mon Dec 30 14:14:43 2019
    On Sun, 29 Dec 2019 13:45:57 -0800 (PST), Cosine <asecant@gmail.com>
    wrote:

    Hi:

    When we want to evaluate the performance of two machine-learning algorithms, we use the same set of data. Specifically, we take algorithm-1, divide the dataset into a training set and a testing set, train algorithm-1, and then test it. We do the same for algorithm-2.

    Now we need to compare the performance of the two algorithms. If we use Student's t-test, should we use the independent one or the paired one? Why?


    Use the paired test, since the data exist as pairs.
    And look at the correlation (and 2x2 table, if that
    is the form of prediction).

    If there is not a high correlation, then you can
    get better results by combining (1) and (2) --
    average score, if continuous; by reporting "agreed
    results" and "mixed answer" for Yes/No.

    --
    Rich Ulrich

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Cosine@21:1/5 to All on Wed Jan 1 03:29:25 2020
    On Tuesday, December 31, 2019 at 3:14:48 AM UTC+8, Rich Ulrich wrote:
    On Sun, 29 Dec 2019 13:45:57 -0800 (PST), Cosine
    wrote:

    Hi:

    When we want to evaluate the performance of two machine-learning algorithms, we use the same set of data. Specifically, we take algorithm-1, divide the dataset into a training set and a testing set, train algorithm-1, and then test it. We do the same for algorithm-2.

    Now we need to compare the performance of the two algorithms. If we use Student's t-test, should we use the independent one or the paired one? Why?


    Use the paired test, since the data exist as pairs.
    And look at the correlation (and 2x2 table, if that
    is the form of prediction).

    If there is not a high correlation, then you can
    get better results by combining (1) and (2) --
    average score, if continuous; by reporting "agreed
    results" and "mixed answer" for Yes/No.


    Thank you for replying.

    First, I'd like to clarify my questions to avoid potential misunderstandings.

    By testing the performance of the two algorithms, I mean we would like to find which algorithm (or, more generally, which method) performs better. Say, we might want to test whether X-ray imaging or Y-ray imaging is better at identifying a particular diseased condition.

    A further question: is it true that as long as we test the two algorithms/methods on the same data set, i.e., the same group of patients, then the t-test we use for the comparison must be the paired one?

    Say, even if we use different cross-validation methods for the two algorithms, must we still use a paired t-test, since the data come from the same group of patients?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich Ulrich@21:1/5 to All on Wed Jan 1 15:57:15 2020
    On Wed, 1 Jan 2020 03:29:25 -0800 (PST), Cosine <asecant@gmail.com>
    wrote:

    On Tuesday, December 31, 2019 at 3:14:48 AM UTC+8, Rich Ulrich wrote:
    On Sun, 29 Dec 2019 13:45:57 -0800 (PST), Cosine
    wrote:

    Hi:

    When we want to evaluate the performance of two machine-learning algorithms, we use the same set of data. Specifically, we take algorithm-1, divide the dataset into a training set and a testing set, train algorithm-1, and then test it. We do the same for algorithm-2.

    Now we need to compare the performance of the two algorithms. If we use Student's t-test, should we use the independent one or the paired one? Why?


    Use the paired test, since the data exist as pairs.
    And look at the correlation (and 2x2 table, if that
    is the form of prediction).

    If there is not a high correlation, then you can
    get better results by combining (1) and (2) --
    average score, if continuous; by reporting "agreed
    results" and "mixed answer" for Yes/No.


    Thank you for replying.

    First, I'd like to clarify my questions to avoid potential misunderstandings.

    By testing the performance of the two algorithms, I mean we would like to find which algorithm (or, more generally, which method) performs better. Say, we might want to test whether X-ray imaging or Y-ray imaging is better at identifying a particular diseased condition.

    A further question: is it true that as long as we test the two algorithms/methods on the same data set, i.e., the same group of patients, then the t-test we use for the comparison must be the paired one?

    "must" is a strong word. If the data exist "paired",
    the paired test is the proper one, which gives the
    correct test. There are circumstances (forced choice:
    Left/Right/neither, negative r, t-test comparison of L
    vs R) where the paired test has less power.

    Your pairs should have highly correlated results,
    so the paired test gives more power, and the r is
    added information. The 2x2 table is also potentially
    informative. I suppose that "more power" is /not/
    what somebody wants if they are trying to support
    an inferior method.

    Especially if one method is grossly superior to the other,
    and the per-patient scores were inconvenient to obtain
    and rearrange, I might be satisfied with looking at the
    t-test comparison of "percent success" without pairing;
    also, if I was looking at many, many methods. That may
    be the case for "machine-learning" experiments. I don't
    know what is expected for publication in that area.



    Say, even if we use different cross-validation methods for the two algorithms, must we still use a paired t-test, since the data come from the same group of patients?

    If you use two different cross-validation methods, your
    t-test implicitly tests cross-validation methods (artifact)
    (leave-one-out vs. split-sample, say) in addition to
    testing X-ray vs. Y-ray.

    That is - you /could/ perform a test of two C-V methods,
    performed for Y-ray alone (say). - Other things being
    equal, the "Leave one out" performs better than split
    sample, since each decision is based on a larger N.
    That is true whether you use paired- or nonpaired-t tests.
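
    To see the artifact, one could hold the algorithm fixed and vary only the validation scheme; any gap between the two estimates is then due to the scheme alone, and that gap would leak into an X-ray vs. Y-ray comparison that mixed schemes. A sketch with an arbitrary model and dataset (leave-one-out is slow, one fit per case):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

    X, y = load_breast_cancer(return_X_y=True)     # placeholder dataset
    model = LogisticRegression(max_iter=5000)      # the same algorithm throughout

    split_acc = cross_val_score(model, X, y, cv=KFold(n_splits=2, shuffle=True, random_state=0))
    loo_acc = cross_val_score(model, X, y, cv=LeaveOneOut())

    # any difference here reflects the validation method (split-sample vs. leave-one-out) only
    print(split_acc.mean(), loo_acc.mean())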

    --
    Rich Ulrich

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Cosine@21:1/5 to All on Mon Jan 13 16:10:43 2020
    How do we check to verify if the type of t-test we used is correct? Say, in the situation discussed, we should use the paired t-test, not the independent one, but how do we verify this?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Jones@21:1/5 to Cosine on Tue Jan 14 19:19:57 2020
    Cosine wrote:


    How do we check to verify if the type of t-test we used is correct?
    Say, in the situation discussed, we should use the paired t-test, not
    the independent one, but how do we verify this?

    In principle, you should know from the way you structured the
    experiment which test to do. More pragmatically, the reason for doing a
    paired t-test (when it is appropriate) is that the variance of the
    difference of the means is much smaller than it would be if the means
    were independent. The reason for the smaller variance might be
    identified as arising from a positive correlation between the
    individual values in the pairs. A negative correlation would be
    unusual but possible, depending on circumstances. The t-test should use
    an estimate of the variance of the difference that does estimate the
    correct variance.

    So, if you only have the one sample set, you can either:
    (i) compare numerically the two versions of the estimates of the
    variance of the difference in the two versions of the test (or the
    estimates of the standard deviations that appear as the divisors in the
    two versions of the t-statistic);
    (ii) investigate the correlation of the individual values in the pairs.

    If you have more than one sample set, or are prepared to subsample the
    sample set, you could do something more extensive.
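
    A sketch of checks (i) and (ii) on simulated paired scores; the identity behind (i) is Var(mean difference) = (Var(x) + Var(y) - 2 Cov(x, y)) / n for paired data, versus (Var(x) + Var(y)) / n when the pairing is ignored:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0.80, 0.05, size=30)          # method 1 per-case scores (simulated)
    y = x + rng.normal(0.02, 0.02, size=30)      # method 2 scores, strongly tied to method 1
    n = len(x)

    # (i) the two estimates of the variance of the difference of the means
    var_paired = np.var(x - y, ddof=1) / n
    var_unpaired = (np.var(x, ddof=1) + np.var(y, ddof=1)) / n
    print(var_paired, var_unpaired)              # the paired estimate is much smaller here

    # (ii) correlation of the individual values in the pairs
    print(np.corrcoef(x, y)[0, 1])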

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich Ulrich@21:1/5 to dajhawk18@@googlemail.com on Wed Jan 15 00:12:41 2020
    On Tue, 14 Jan 2020 19:19:57 +0000 (UTC), "David Jones" <dajhawk18@@googlemail.com> wrote:

    Cosine wrote:


    How do we check to verify if the type of t-test we used is correct?
    Say, in the situation discussed, we should use the paired t-test, not
    the independent one, but how do we verify this?

    In principle, you should know from the way you structured the
    experiment which test to do. More pragmatically, the reason for doing a paired t-test (when it is appropriate) is that the variance of the
    difference of the means is much smaller than it would be if the means
    were independent. The reason for the smaller variance might be
    identified as arising from a positive correlation between the
    individual values in the pairs. A negative correlation would be
    unusual but possible, depending on circumstances. The t-test should use
    an estimate of the variance of the difference that does estimate the
    correct variance.

    Yes. As I wrote at the start of my earlier post, the paired
    test is the /correct/ test, with the right error term, when
    the data are paired.


    So, if you only have the one sample set, you can either:
    (i) compare numerically the two versions of the estimates of the
    variance of the difference in the two versions of the test (or the
    estimates of the standard deviations that appear as the divisors in the
    two versions of the t-statistic);
    (ii) investigate the correlation of the individual values in the pairs.

    David, I don't know why you are offering a choice of
    comparing error terms. Using the wrong test is not going
    to be justified by saying it is "more powerful (although
    it is wrong)." Look at the r to get the general idea.

    If you want to justify using a more powerful test, argue for
    using 10% or 20% cutoff value, instead of misusing the 5%.


    "Convenience" is the main, best excuse, though not an
    entirely good one, for using the wrong test. Maybe, to
    display a whole slew of results.

    Beyond that, I can half-way imagine having both sorts of
    tests within one set of analyses, and wanting to use group-tests
    throughout in order to make the apparent effect sizes
    commensurable. Effect sizes get complicated to report
    in mixed models, with both within- and between-effects.
    And I have sympathy for the attempts to explain them.


    I'm just about /always/ curious about the size (and sign) of
    the correlation. And that will tell you whether you are
    gaining or losing power.


    If you have more than one sample set, or are prepared to subsample the
    sample set, you could do something more extensive.

    --
    Rich Ulrich

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Jones@21:1/5 to Rich Ulrich on Wed Jan 15 11:23:44 2020
    Rich Ulrich wrote:

    On Tue, 14 Jan 2020 19:19:57 +0000 (UTC), "David Jones" <dajhawk18@@googlemail.com> wrote:

    Cosine wrote:


    How do we check to verify if the type of t-test we used is correct?
    Say, in the situation discussed, we should use the paired t-test,
    not the independent one, but how do we verify this?

    In principle, you should know from the way you structured the
    experiment which test to do. More pragmatically, the reason for
    doing a paired t-test (when it is appropriate) is that the variance
    of the difference of the means is much smaller than it would be if
    the means were independent. The reason for the smaller variance
    might be identified as arising from a positive correlation between
    the individual values in the pairs. A negative correlation would be unusual but possible, depending on circumstances. The t-test should
    use an estimate of the variance of the difference that does
    estimate the correct variance.

    Yes. As I wrote at the start of my earlier post, the paired
    test is the correct test, with the right error term, when
    the data are paired.


    So, if you only have the one sample set, you can either:
    (i) compare numerically the two versions of the estimates of the
    variance of the difference in the two versions of the test (or the estimates of the standard deviations that appear as the divisors in
    the two versions of the t-statistic);
    (ii) investigate the correlation of the individual values in the
    pairs.

    David, I don't know why you are offering a choice of
    comparing error terms. Using the wrong test is not going
    to be justified by saying it is "more powerful (although
    it is wrong)." Look at the r to get the general idea.

    If you want to justify using a more powerful test, argue for
    using 10% or 20% cutoff value, instead of misusing the 5%.


    "Convenience" is the main, best excuse, though not an
    entirely good one, for using the wrong test. Maybe, to
    display a whole slew of results.

    Beyond that, I can half-way imagine having both sorts of
    tests within one set of analyses, and wanting to use group-tests
    throughout in order to make the apparent effect sizes
    commensurable. Effect sizes get complicated to report
    in mixed models, with both within- and between-effects.
    And I have sympathy for the attempts to explain them.


    I'm just about always curious about the size (and sign) of
    the correlation. And that will tell you whether you are
    gaining or losing power.


    If you have more than one sample set, or are prepared to subsample
    the sample set, you could do something more extensive.

    I was trying to help the OP understand the difference between the
    two tests, which essentially lies in the estimates of the variance of
    the difference of the means. The effect of the estimated-variance is
    all in the null distribution of the test statistic, not really on the
    "power of the test", since there is no point in thinking about power if
    the size of the test is not what you think it is.

    If one considers the case where sample sizes are very large, then the
    t-tests are scaled so that the null distribution is close to N(0,1) if
    all the assumptions are valid. If the assumptions are not valid (but
    there really is no difference in the means), then the true distribution
    of the test statistic will be close to N(0,b) for some b, rather than
    the N(0,1) one might think it is.
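
    One way to see the N(0,b) point is a quick simulation, assuming equal variances and strongly positively correlated pairs with no true mean difference: the "independent" t statistic then has a null spread of roughly sqrt(1 - rho) rather than 1, so referring it to N(0,1) (or to the usual t distribution) is wrong.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    rho, n, reps = 0.8, 50, 20000
    cov = [[1.0, rho], [rho, 1.0]]

    t_indep = np.empty(reps)
    for i in range(reps):
        xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)   # paired data, equal means
        t_indep[i] = stats.ttest_ind(xy[:, 0], xy[:, 1]).statistic

    # empirical spread of the misused "independent" statistic vs. the 1 it is assumed to have
    print(t_indep.std(), np.sqrt(1 - rho))       # both come out near 0.45, not 1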

    In the case that there really is no point in having paired-up samples,
    then the two estimates of the variance-of-the-difference both estimate
    the "correct variance (of the difference)". One estimate of the
    variance would be based on twice as many degrees of freedom as the
    other, and hence is notionally "better". Formally, the tests of the two
    types would both be valid in the sense of each having its null
    distribution determined correctly as t with either (n-1) or 2(n-1)
    degrees of freedom. I suppose in this limited case one would then go
    on to consider the "power" of the tests.

    Going back to the OP's question here (How do we check to verify if the
    type of t-test we used is correct?) .... if there is a radical difference
    in the two estimates of variance, or if there is even a moderate
    correlation in the paired values, then one can see the reason why the
    paired test is more appropriate.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)