• Q multiple comparison

    From Cosine@21:1/5 to All on Thu Aug 27 18:59:01 2020
    Hi:

    Suppose we have 3 new methods of medical screening and we want to know: 1) whether any of them performs better than the existing standard method, and 2) the order of their performance, i.e., which is the best, the 2nd, and the 3rd.

    We test them on the same set of samples and use the following metrics to evaluate their performance: accuracy (AC), sensitivity (SE), and specificity (SP).

    Now we have many comparisons to do, and it seems that this would raise an issue of false positives.

    One way to address this issue is to divide the alpha value by the number of tests, imposing a more stringent criterion on each individual test with respect to false positives. That is, instead of using the original alpha (e.g., 5%), we use the corrected one:
    alpha1 = alpha0/N, where N is the number of tests.

    Then we conduct Student's t-test to see whether any of the comparisons is statistically significant.
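
    For concreteness, a rough sketch in Python of what we have in mind; the per-sample scores, the method names M1/M2/M3, and the choice of N = 3 new-vs-standard comparisons are all made-up placeholders, not real data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_samples = 200

    # Hypothetical per-sample scores for the standard method and the 3 new
    # methods, measured on the same samples (hence paired t-tests).
    standard = rng.normal(0.70, 0.10, n_samples)
    new_methods = {name: rng.normal(0.72, 0.10, n_samples)
                   for name in ("M1", "M2", "M3")}

    alpha0 = 0.05
    N = len(new_methods)   # here we count only the new-vs-standard comparisons
    alpha1 = alpha0 / N    # Bonferroni-corrected per-test alpha

    for name, scores in new_methods.items():
        t, p = stats.ttest_rel(scores, standard)   # paired t-test
        print(f"{name}: t = {t:.2f}, p = {p:.4f}, significant: {p < alpha1}")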

    But now we have some questions:

    1) what is the value of N?
    2) by reducing the alpha from alpha0 to alpha1, we have made each test harder to declare significant; wouldn't this increase the rate of false negatives? If so, how do we resolve this issue?

  • From Bruce Weaver@21:1/5 to Cosine on Fri Aug 28 06:06:08 2020
    On Thursday, August 27, 2020 at 9:59:03 PM UTC-4, Cosine wrote:
    [Cosine's original post quoted in full; snipped.]

    Before you proceed with t-tests, I suggest that you take a look at the book by Robert G. Newcombe:

    https://www.routledge.com/Confidence-Intervals-for-Proportions-and-Related-Measures-of-Effect-Size/Newcombe/p/book/9781439812785

    Notice that there is an entire chapter on screening and diagnostic tests that includes these sections:

    Background
    Sensitivity and Specificity
    Positive and Negative Predictive Values
    Trade-Off between Sensitivity and Specificity: The ROC Curve
    Simultaneous Comparison of Sensitivity and Specificity between Two Tests

    And the Support Material tab on that web-page has a link to a zip file. It contains an Excel workbook in which Newcombe has implemented most of the methods he describes.

    As to your concern about maintaining control over the error rate, a false discovery rate (FDR) approach might make more sense than a Bonferroni correction.
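
    For what it is worth, here is a rough sketch of a Benjamini-Hochberg FDR adjustment in Python via statsmodels; the p-values below are made-up placeholders, not results from your data:

    from statsmodels.stats.multitest import multipletests

    # Hypothetical p-values, e.g. one per new-method-vs-standard comparison.
    pvals = [0.012, 0.030, 0.004]

    reject, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    for p, p_adj, rej in zip(pvals, pvals_adj, reject):
        print(f"raw p = {p:.3f}, BH-adjusted p = {p_adj:.3f}, reject H0: {rej}")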

    HTH.

  • From David Jones@21:1/5 to Bruce Weaver on Fri Aug 28 15:47:29 2020
    Bruce Weaver wrote:

    [Bruce Weaver's reply, including Cosine's original post, quoted in full; snipped.]

    In these modern times, one need not restrict oneself to text-book
    methodology. By all means, use text-book methods as a guide to what
    might be done, but one can use random simulations to overcome worries
    about the various approximations involved, e.g. in the Bonferroni
    correction, and "approximate normality". Thus, construct a final test
    statistic, possibly as the minimum of the observed p-values, or otherwise,
    and evaluate the null distribution of that final test statistic based
    on simulations from a reasonable "no difference/common" distribution.
    Simulations can also be used to give one a direct approach to
    understanding the power of whatever test you choose.

    Similarly, one can use simulations for the second question ("the
    order of their performances, i.e., the best, the 2nd, and the 3rd"),
    by simulating from distributions with a known amount of difference.
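
    To make the idea concrete, a rough sketch in Python of using the minimum observed p-value as the final statistic, calibrated by simulation from a common "no difference" distribution; the normal model and all numbers are placeholders rather than a recommendation:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_samples, n_methods, n_sims = 200, 3, 2000

    def min_p(standard, methods):
        # Minimum paired-t p-value over all new-vs-standard comparisons.
        return min(stats.ttest_rel(m, standard).pvalue for m in methods)

    # "Observed" data (placeholders).
    standard = rng.normal(0.70, 0.10, n_samples)
    methods = [rng.normal(0.73, 0.10, n_samples) for _ in range(n_methods)]
    observed = min_p(standard, methods)

    # Null distribution: simulate every method from one common distribution.
    null_stats = []
    for _ in range(n_sims):
        s0 = rng.normal(0.70, 0.10, n_samples)
        ms = [rng.normal(0.70, 0.10, n_samples) for _ in range(n_methods)]
        null_stats.append(min_p(s0, ms))

    p_global = np.mean(np.array(null_stats) <= observed)
    print(f"observed min p = {observed:.4f}, simulated global p = {p_global:.4f}")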

  • From Rich Ulrich@21:1/5 to All on Sat Aug 29 03:20:39 2020
    On Thu, 27 Aug 2020 18:59:01 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:

    Hi:

    Suppose we have 3 new methods of medical screening and we want to know: 1) whether any of them performs better than the existing standard method, and 2) the order of their performance, i.e., which is the best, the 2nd, and the 3rd.

    It has lately come to the attention of everyone following the
    coronavirus that COST and EASE of ADMINISTRATION are also relevant
    to whether a screening method should be considered to
    perform "better."


    We test them on the same set of samples and use the following metrics to evaluate their performance: accuracy (AC), sensitivity (SE), and specificity (SP).

    See the references that Bruce gives. "Accuracy" is a combination
    of sensitivity and specificity, with a sliding scale across a range
    of cutoffs. The graph of that scale is called the ROC curve.

    What is recognized less often is that measures of reliability are
    also conditioned by the characteristics of the particular
    sample, starting with the variability of the range of scores
    observed on the measures.


    Now we have many comparisons to do, and it seems that this would raise an issue of false positives.

    One way to address this issue is to divide the alpha value by the number of tests, imposing a more stringent criterion on each individual test with respect to false positives. That is, instead of using the original alpha (e.g., 5%), we use the corrected one:
    alpha1 = alpha0/N, where N is the number of tests.

    One convention I like for comparing multiple groups is to perform
    the overall, repeated-measures F-test first. If that test is not "significant", then no further testing is performed.

    I forget what tests are used for comparing ROC curves, or whether
    they follow the F distribution or chi-squared.

    I suppose it might end up as a t, for two groups, if it is simply
    the Area under the Curve. One test is said to be "stochastically
    dominant" over the other if it gives better results across the whole
    range (which need not be the case).
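
    One simple option, if the comparison really is just Area under the Curve on the same samples, is a paired bootstrap of the AUC difference (DeLong's test is another common choice). A rough sketch in Python, with made-up labels and scores:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(5)
    n = 400
    y = rng.binomial(1, 0.3, n)                     # 1 = diseased
    score_a = rng.normal(0.0, 1.0, n) + 1.0 * y     # method A scores
    score_b = rng.normal(0.0, 1.0, n) + 0.8 * y     # method B scores

    obs_diff = roc_auc_score(y, score_a) - roc_auc_score(y, score_b)

    diffs = []
    for _ in range(2000):
        idx = rng.integers(0, n, n)                 # resample cases, keep pairing
        if len(np.unique(y[idx])) < 2:              # AUC needs both classes present
            continue
        diffs.append(roc_auc_score(y[idx], score_a[idx])
                     - roc_auc_score(y[idx], score_b[idx]))

    ci_lo, ci_hi = np.percentile(diffs, [2.5, 97.5])
    print(f"AUC diff = {obs_diff:.3f}, 95% bootstrap CI = [{ci_lo:.3f}, {ci_hi:.3f}]")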


    Then we conduct Student's t-test to see whether any of the comparisons is statistically significant.

    Paired t-testing, surely. That is usually contrasted with Student's
    t-test, where the latter is the grouped (not paired) test.

    Using paired t-tests is a proper follow-up for group comparisons
    after an overall repeated-measures F shows a difference.
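
    A rough sketch of that workflow in Python (statsmodels' repeated-measures ANOVA, then paired t-tests against the standard); the data frame below is a made-up placeholder:

    import numpy as np
    import pandas as pd
    from scipy import stats
    from statsmodels.stats.anova import AnovaRM

    rng = np.random.default_rng(2)
    n_subjects = 50
    methods = ["standard", "M1", "M2", "M3"]

    # Long format: one row per (subject, method) with that subject's score.
    df = pd.DataFrame(
        [(s, m, rng.normal(0.70 + 0.02 * i, 0.10))
         for s in range(n_subjects) for i, m in enumerate(methods)],
        columns=["subject", "method", "score"],
    )

    # Overall repeated-measures F-test across the 4 methods.
    print(AnovaRM(df, depvar="score", subject="subject", within=["method"]).fit())

    # Follow-up paired t-tests, only against the standard method.
    std = df[df["method"] == "standard"].sort_values("subject")["score"].to_numpy()
    for m in methods[1:]:
        x = df[df["method"] == m].sort_values("subject")["score"].to_numpy()
        t, p = stats.ttest_rel(x, std)
        print(f"{m} vs standard: t = {t:.2f}, p = {p:.4f}")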

    How many t-tests are you proposing? You have 3 tests if
    you are comparing each of the 3 new methods only to the
    standard method.

    What decision do you want to make? If you are only
    interested in finding something superior to Standard,
    you could prescribe a 1-tailed test... though many people
    do NOT like one-tailed testing, as a matter of principle or
    of superstition.


    But now we have some questions:

    1) what is the value of N?

    You have said that N is the number of tests. As I said, if
    you are only /interested in/ the comparisons to Standard,
    you have 3 tests. How serious were you about ranking them?

    - You are unlikely to get results that show one test much
    better than Standard, and another test even better than THAT.
    So people will be apt to show the ranking (and expect to see it)
    without any stringent tests between the others.


    2) by reducing the alpha from alpha0 to alpha1, we have made each test harder to declare significant; wouldn't this increase the rate of false negatives? If so, how do we resolve this issue?


    If you want to be strict in preserving the error level, you plan
    your experiment in advance with an N large enough that any difference
    large enough to be interesting will provide a good test result,
    for whichever test results you intend to present.

    Larger N is how you "resolve the issue" of insufficient power.

    --
    Rich Ulrich

  • From Cosine@21:1/5 to All on Sat Aug 29 14:35:20 2020
    Hi:

    Let's focus on finding out whether a new method of screening is better than the others.

    Suppose we have 3 new methods: S1, S2, and S3, and we use only sensitivity (SE) to compare their performance.

    We let alpha = 0.05.

    We evaluate the 3 methods individually to get the average SE of each: SE1, SE2, and SE3.

    We conduct 3 t-tests on the differences between the average SEs: DSE12, DSE13, and DSE23.

    We let alpha_C = alpha/3.

    We determine the p-values as: p_DSE12, p_DSE13, and p_DSE23.

    Now we have some questions below:

    1. If we have: p_DSE12 < alpha_C/2 <- This means we did a two-sided t-test and the result is significant.

    But the two-sided t-test shows no directional preference; it only indicates that there is a difference and that this difference is significant. So it seems that we should use a one-sided t-test, right? That is, define the statistic as DSE12 = SE2 - SE1 for
    the one-sided test and use alpha_C instead of alpha_C/2.

    2. If we have: p_DSE12 < alpha_C/2 or p_DSE12 < alpha_C <- significant.

    In this case, would it still be possible that SE1 >= SE2?

    Why or why not?
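
    To illustrate what we mean in questions 1 and 2, here is a rough sketch in Python with made-up 0/1 sensitivity indicators; a paired t-test on 0/1 data is used purely for illustration (a paired test for proportions would be more conventional):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n = 150

    # Made-up per-diseased-case indicators (1 = case detected) for S1 and S2.
    se1 = rng.binomial(1, 0.80, n)
    se2 = rng.binomial(1, 0.86, n)

    alpha_C = 0.05 / 3

    # Two-sided paired t-test: no direction implied.
    t2, p2 = stats.ttest_rel(se2, se1)
    # One-sided paired t-test of H1: SE2 > SE1; compare its p directly to alpha_C.
    t1, p1 = stats.ttest_rel(se2, se1, alternative="greater")

    print(f"two-sided p = {p2:.4f}, one-sided p = {p1:.4f}, alpha_C = {alpha_C:.4f}")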

    3. Could we conduct a statistical test to show that:

    S1 performs better than S2, S1 performs better than S3, and the level of the superiority of S1 against S2 is higher than that of S1 against S3.

  • From David Duffy@21:1/5 to Cosine on Sun Aug 30 04:02:09 2020
    Cosine <asecant@gmail.com> wrote:
    Hi:

    Let's focus on finding out whether a new method of screening is better than the others.

    Suppose we have 3 new methods: S1, S2, and S3, and we use only sensitivity (SE) to compare their performance.

    The point of ROC curves is that most tests have a quantitative outcome,
    so one can set the sensitivity to 100% by accepting everything as
    diseased. My simple-minded way of thinking comes from knowing that Youden's
    Index (Sens+Spec-1) is a correlation coefficient - it was at one time
    fashionable as a summary of the overall goodness of a test, fixing a good
    cut in your ROC, and allowing simple comparisons across tests.
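
    A rough sketch of picking a cut by Youden's Index from an ROC curve with scikit-learn; the labels and scores below are made-up placeholders:

    import numpy as np
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(4)
    y_true = rng.binomial(1, 0.3, 500)            # 1 = diseased
    scores = rng.normal(0.0, 1.0, 500) + y_true   # diseased cases score higher

    fpr, tpr, thresholds = roc_curve(y_true, scores)
    youden_j = tpr - fpr                          # = Sens + Spec - 1
    best = np.argmax(youden_j)
    print(f"best cut = {thresholds[best]:.3f}, Youden J = {youden_j[best]:.3f}")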

    3. Could we conduct a statistical test to show that:

    S1 performs better than S2, S1 performs better than S3, and the level
    of the superiority of S1 against S2 is higher than that of S1 against S3.

    Yes, you could construct a global likelihood-based test. Because of the
    constraint S1 > S2 > S3, the null distribution would be a mixture -
    so I would do some kind of simulation/randomization test. My favourite
    book on such methods is
    Noreen (1989), Computer-Intensive Methods for Testing Hypotheses,
    which Google Scholar shows is available online.

  • From Rich Ulrich@21:1/5 to All on Sun Aug 30 14:02:40 2020
    On Sat, 29 Aug 2020 14:35:20 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:


    3. Could we conduct a statistical test to show that:

    S1 performs better than S2, S1 performs better than S3, and the level of the superiority of S1 against S2 is higher than that of S1 against S3.

    To make this interesting, I have to assume something unstated:
    S1 > S2 > S3 in the ordering of means.

    Can the test for S1 > S2 be "more significant" than S1 > S3?

    Yes. For unmatched samples, this quandary arises when the
    sample size for S3 is relatively small, IF the test takes sample
    sizes into account. That is why the original precautions about
    when to apply those tests say that the sample sizes should be
    the same, or very nearly so. There are versions of tests that
    ignore the different Ns by "assuming" they all are equal (using
    the geometric mean of the Ns, probably). That prevents such an
    outcome of testing ... by hiding the detail, which seems to
    be "bad science reporting."

    For matched samples, Ns are the same by definition, but the
    two-sample correlations may differ, sometimes by a lot. That
    is why the recommended follow-ups (or multiple tests) use the
    specific r's instead of using the intraclass r. When these r's
    are used in defining the standard error for the difference in
    means, it is, indeed, possible to have that odd testing result.

    For tests across time, I like the Test for Linearity, with a check
    done on the size of the contribution of Nonlinear elements.
    It is nice to see 90% of the sums of squares accounted for by
    linearity, and then (for my Ns) I can say that the rest is noise.

    For Repeated measures data collected across time, the r's
    typically decline as the time-gap increases, so the power of a test
    between more distant times is less, and you might see that result
    if you insisted on testing it.

    For Repeated-measures-tests between alternate methods, the r's
    might depend on the similarity of methods. I was once tasked
    with analyzing several methods, where two of the Methods were
    rating scales that differed only in the inclusion of one new item.



    --
    Rich Ulrich
