#### Q multiple comparison

• From Cosine@21:1/5 to All on Thu Aug 27 18:59:01 2020
Hi:

Suppose we have 3 new methods of medical screening and we want to know whether: 1) any of them perform better than the existing standard method, and 2) the order of their performances, i.e., the best, the 2nd, and the 3rd.

We test them by using the same set of samples and we use the following metrics for evaluating their performances: accuracy (AC), sensitivity (SE), and specificity (SP).

Now we have many comparisons to do, and it seems that this would raise an issue of false positives.

One way to solve this issue is to divide the alpha value by the number of tests, imposing a more stringent criterion on each test regarding false positives. That is, instead of using the original alpha (e.g., 5%), we use the corrected one:
alpha1 = alpha0/N, where N is the number of tests.

Then we conduct Student's t-test to see if any of the comparisons would be statistically significant.

But now we have some questions:

1) what is the value of N?
2) by reducing the alpha from alpha0 to alpha1, we have made each of the tests more difficult to reach significance; wouldn't this increase the rate of false negatives? If so, how do we resolve this issue?
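A minimal sketch of the Bonferroni rule just described, with hypothetical p-values purely to show the mechanics:

```python
# Bonferroni correction: compare each of the N p-values to alpha0/N.
p_values = [0.004, 0.03, 0.06]  # hypothetical p-values from N = 3 t-tests
alpha0 = 0.05
N = len(p_values)
alpha1 = alpha0 / N  # corrected per-test threshold

significant = [p < alpha1 for p in p_values]
print(alpha1)       # ~0.0167
print(significant)  # [True, False, False]
```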

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
• From Bruce Weaver@21:1/5 to Cosine on Fri Aug 28 06:06:08 2020
On Thursday, August 27, 2020 at 9:59:03 PM UTC-4, Cosine wrote:
> [original post quoted in full; snipped]

Before you proceed with t-tests, I suggest that you take a look at the book by Robert G. Newcombe:

https://www.routledge.com/Confidence-Intervals-for-Proportions-and-Related-Measures-of-Effect-Size/Newcombe/p/book/9781439812785

Notice that there is an entire chapter on screening and diagnostic tests that includes these sections:

Background
Sensitivity and Specificity
Positive and Negative Predictive Values
Trade-Off between Sensitivity and Specificity: The ROC Curve
Simultaneous Comparison of Sensitivity and Specificity between Two Tests

And the Support Material tab on that web-page has a link to a zip file. It contains an Excel workbook in which Newcombe has implemented most of the methods he describes.

As to your concern about maintaining control over the error rate, a false discovery rate (FDR) approach might make more sense than a Bonferroni correction.
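For concreteness, a hedged sketch of the Benjamini-Hochberg step-up procedure that an FDR approach typically means (plain Python; the p-values and the FDR level q are illustrative):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a reject/keep flag for each p-value under the
    Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(p_values)
    # Sort p-values, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k/m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis whose p-value ranks at or below k_max.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

print(benjamini_hochberg([0.01, 0.04, 0.03]))  # [True, True, True]
```

Note that with these three p-values a Bonferroni cut at 0.05/3 would reject only the first hypothesis; the step-up procedure rejects all three.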

HTH.

• From David Jones@21:1/5 to Bruce Weaver on Fri Aug 28 15:47:29 2020
Bruce Weaver wrote:

> [original post and book recommendation snipped]
>
> As to your concern about maintaining control over the error rate, a
> false discovery rate (FDR) approach might make more sense than a
> Bonferroni correction.

In these modern times, one need not restrict oneself to text-book
methodology. By all means, use text-book methods as a guide to what
might be done, but one can use random simulations to overcome worries
about the various approximations involved, e.g. in the Bonferroni
correction, and "approximate normality". Thus, construct a final test
statistic, possibly as the minimum of the observed p-values, or otherwise,
and evaluate the null distribution of that final test statistic based
on simulations from a reasonable "no difference/common" distribution.
Simulations can also be used to give one a direct approach to
understanding the power of whatever test you choose.

Similarly, one can use simulations for the second question ("the
order of their performances, i.e., the best, the 2nd, and the 3rd."),
by simulating from distributions with a known amount of difference.
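A sketch of the min-p simulation idea, under the simplifying assumption that each null p-value is Uniform(0,1) (true for a continuous test statistic; with real data one would simulate the raw measurements from the common "no difference" distribution instead):

```python
import random

random.seed(1)
n_tests = 3     # e.g. three pairwise comparisons of sensitivity
n_sims = 20000  # number of simulated "no difference" datasets

# Null distribution of the final statistic min(p1, ..., pN):
min_ps = sorted(min(random.random() for _ in range(n_tests))
                for _ in range(n_sims))

# Empirical 5% critical value for the min-p statistic.
crit = min_ps[int(0.05 * n_sims)]
print(round(crit, 3))
# Under independence the exact value is 1 - 0.95**(1/3) ~ 0.0170,
# slightly larger (less conservative) than Bonferroni's 0.05/3 ~ 0.0167.
```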

• From Rich Ulrich@21:1/5 to All on Sat Aug 29 03:20:39 2020
On Thu, 27 Aug 2020 18:59:01 -0700 (PDT), Cosine <asecant@gmail.com>
wrote:

> Hi:
>
> Suppose we have 3 new methods of medical screening and we want to know whether: 1) any of them perform better than the existing standard method, and 2) the order of their performances, i.e., the best, the 2nd, and the 3rd.

It has lately come to the attention of everyone following coronavirus
that COST and EASE of ADMINISTRATION are also relevant
to whether a screening method should be considered to
perform "better."

> We test them by using the same set of samples and we use the following metrics for evaluating their performances: accuracy (AC), sensitivity (SE), and specificity (SP).

See the references that Bruce gives. "Accuracy" is a combination
of sensitivity and specificity, with a sliding scale across a range
of cutoffs. The graph of that scale is called the ROC curve.

What is recognized less often is that measures of reliability are
also conditioned by (characteristics of) the particular
sample, starting with the variability of the range of scores
observed on the measures.

> Now we have many comparisons to do, and it seems that this would raise an issue of false positive.
>
> One way to solve this issue is to divide the alpha value by the number of tests to impose more stringent criteria on each of the tests regarding the false positive. That is, instead of using the original alpha (e.g., 5%), we use the corrected one:
> alpha1 = alpha0/N; where N is the number of tests.

One convention I like for comparing multiple groups is to perform
the overall, repeated-measures F-test first. If that test is not
"significant", then no further testing is performed.

I forget what tests are used for comparing ROC curves.

I suppose it might end up as a t, for two groups, if it is simply
the Area under the Curve. One test is said to be "stochastically
dominant" over the other if it gives better results across the whole
range (which need not be the case).

> Then we conduct the student t-test to see if any of the tests would be statistically significant.

Paired t-testing, surely. That is usually contrasted with Student's
t-test, where the latter is the grouped (not paired) test.

Using paired t-tests is proper followup for group comparisons
after an overall repeated-measures F shows a difference.

How many t-tests are you proposing? You have 3 tests if
you are comparing each of the 3 new methods only to the
standard method.

What decision do you want to make? If you are only
interested in finding something superior to Standard,
you could prescribe a 1-tailed test... though many people
do NOT like one-tailed testing, as a matter of principle or
of superstition.

> But now we have some questions:
>
> 1) what is the value of N?

You have said that N is the number of tests. As I said, if
you are only /interested in/ the comparisons to Standard,
you have 3 tests. How serious were you about ranking them?

- You are unlikely to get results that show one test much
better than Standard, and another test even better than THAT.
So people will be apt to show (and expect to see) the ranking
without any stringent tests between the others.

> 2) by reducing the alpha from alpha0 to alpha1, we have made each of the test more difficult to be significant, wouldn't this increase the rate of false-negative? If so, how do we resolve this issue?

If you want to be strict in preserving the error level, a sample
planned large enough to detect a difference of interesting size will
provide a good test result for whichever test results you intend
to present.

Larger N is how you "resolve the issue" of insufficient power.
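Rich's closing point - that a larger N is what buys back the power lost to the alpha correction - can be checked directly by simulation. A sketch assuming SciPy is available; the effect size, SD, and sample sizes are hypothetical:

```python
import numpy as np
from scipy import stats

def power_paired_t(n, delta=0.03, sd=0.05, alpha=0.05 / 3,
                   n_sims=2000, seed=42):
    """Simulated power of a paired t-test on n paired differences with
    true mean delta and SD sd, at a Bonferroni-corrected alpha."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        diff = rng.normal(delta, sd, size=n)  # simulated paired differences
        _, p = stats.ttest_1samp(diff, 0.0)   # two-sided paired t-test
        hits += p < alpha
    return hits / n_sims

pw30 = power_paired_t(30)
pw100 = power_paired_t(100)
print(pw30, pw100)  # power rises with N
```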

--
Rich Ulrich

• From Cosine@21:1/5 to All on Sat Aug 29 14:35:20 2020
Hi:

Let's focus on determining whether a new method of screening is better than the others.

Suppose we have 3 new methods: S1, S2, and S3, and we use only the sensitivity (SE) for comparing the performance.

We let alpha = 0.05.

We evaluate the 3 methods individually to get the average SEs: SE1, SE2, and SE3.

We conduct 3 t-tests on the statistics defined as the differences between the average SEs: DSE12, DSE13, and DSE23.

We let alpha_C = alpha/3.

We determine the p-values as: p_DSE12, p_DSE13, and p_DSE23.

Now we have some questions below:

1. If we have: p_DSE12 < alpha_C/2 <- This means we did a two-sided t-test and the result is significant.

But the two-sided t-test shows no directional preference; it only indicates that there is a difference and that this difference is significant. So it seems that we should use a one-sided t-test, right? That is, let the statistic be DSE12 = SE2 - SE1 for the one-sided test and use alpha_C instead of alpha_C/2.

2. If we have: p_DSE12 < alpha_C/2 or p_DSE12 < alpha_C <- significant.

In this case, would it still be possible that SE1 >= SE2?

Why or why not?

3. Could we conduct a statistical test to show that:

S1 performs better than S2, S1 performs better than S3, and the level of the superiority of S1 against S2 is higher than that of S1 against S3.
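On question 1 above: the relationship between the two-sided and one-sided paired t-tests can be seen directly with SciPy (the per-case sensitivities are simulated and hypothetical; `alternative='greater'` requires SciPy >= 1.6):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-case sensitivities for S1, with S2 built to be
# slightly better on each case.
se1 = rng.normal(0.80, 0.05, size=30)
se2 = se1 + rng.normal(0.05, 0.02, size=30)

# Two-sided paired t-test: detects a difference in either direction.
t_two, p_two = stats.ttest_rel(se2, se1)

# One-sided paired t-test of H1: mean(SE2 - SE1) > 0; its p-value
# is compared against alpha_C directly, not alpha_C/2.
t_one, p_one = stats.ttest_rel(se2, se1, alternative='greater')

# When the observed t is positive, p_one is exactly p_two / 2, so the
# two decision rules agree; the one-sided test just states the
# direction up front.
print(p_one, p_two)
```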

• From David Duffy@21:1/5 to Cosine on Sun Aug 30 04:02:09 2020
Cosine <asecant@gmail.com> wrote:
> Hi:
>
> Let's make sure about finding if a new method of screening is better than the others.
>
> Suppose we have 3 new methods: S1, S2, and S3, and we use only the sensitivity (SE) for comparing the performance.

The point of ROC curves is that most tests have a quantitative outcome,
so one can set the sensitivity to 100% by accepting everything as
disease. My simple-minded way of thinking comes from knowing Youden's
Index (Sens+Spec-1) is a correlation coefficient - it was at one time
fashionable as a summary of overall goodness of a test, fixing a good
cut in your ROC, and allowing simple comparisons across tests.
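Youden's index at each candidate cutoff can be computed directly from the raw scores; a small sketch with made-up data (NumPy assumed available):

```python
import numpy as np

# Hypothetical continuous test scores with binary disease status.
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])
disease = np.array([0, 0, 1, 0, 1, 1, 0, 1])  # 1 = diseased

best_j, best_cut = -1.0, None
for cut in np.unique(scores):
    pred = scores >= cut  # call "positive" at or above the cutoff
    sens = (pred & (disease == 1)).sum() / (disease == 1).sum()
    spec = (~pred & (disease == 0)).sum() / (disease == 0).sum()
    j = sens + spec - 1   # Youden's index at this cutoff
    if j > best_j:
        best_j, best_cut = j, cut

print(best_cut, best_j)  # best cutoff 0.35, J = 0.5 for these data
```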

> 3. Could we conduct a statistical test to show that:
>
> S1 performs better than S2, S1 performs better than S3, and the level
> of the superiority of S1 against S2 is higher than that of S1 against S3.

Yes, you could construct a global likelihood-based test. Because of
the constraint S1 > S2 > S3, the null distribution would be a mixture -
so I would do some kind of simulation/randomization test. My favourite
book on such methods is
Noreen (1989), Computer-Intensive Methods for Testing Hypotheses,
which Google Scholar shows is available online.

• From Rich Ulrich@21:1/5 to All on Sun Aug 30 14:02:40 2020
On Sat, 29 Aug 2020 14:35:20 -0700 (PDT), Cosine <asecant@gmail.com>
wrote:

> 3. Could we conduct a statistical test to show that:
>
> S1 performs better than S2, S1 performs better than S3, and the level of the superiority of S1 against S2 is higher than that of S1 against S3.

To make this interesting, I have to assume something unstated,
S1 > S2 > S3 in ordering of means.

Can the test for S1 > S2 be "more significant" than S1 > S3?

Yes. For unmatched samples, this quandary arises when the
sample size for S3 is relatively small - IF the test takes sample
sizes into account. That is why the original precautions about
when to apply those tests say that the sample sizes should be
the same, or very nearly so. There are versions of tests that
ignore the different Ns by "assuming" they all are equal (using
the geometric mean of N, probably). That prevents such an
outcome of testing ... by hiding the detail.
For matched samples, Ns are the same by definition, but the
two-sample correlations may differ, sometimes by a lot. That
is why the recommended followups (or multiple tests) use the
specific r's instead of using the intraclass r. When these r's
are used in defining the standard error for the difference in
means, it is, indeed, possible to have that odd testing result.

For tests across time, I like the Test for Linearity, with a check
done on the size of the contribution of Nonlinear elements.
It is nice to see 90% of the sums of squares accounted for by
linearity, and then (for my Ns) I can say that the rest is noise.

For Repeated measures data collected across time, the r's
typically decline as the time-gap increases, so the power of a test
between more distant times is less, and you might see that result
if you insisted on testing it.

For Repeated-measures-tests between alternate methods, the r's
might depend on the similarity of methods. I was once tasked
with analyzing several methods, where two of the Methods were
rating scales that differed only in the inclusion of one new item.

--
Rich Ulrich
