Subject: Test how good a new method is

From Cosine@21:1/5 to All on Fri Sep 13 04:45:09 2019
Hi:

Given a method taken as the golden reference, how do we statistically show that a new method is better than the reference one?

One way I can think of is to define a statistical variable, I, and then conduct a hypothesis test of whether avg( I_new - I_ref ) > 0, which would mean that, on average, the new method performs better than the reference one.
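That paired-difference idea can be sketched in a few lines of Python. This is a minimal illustration with simulated data: the scores, sample size, and effect size are all hypothetical, chosen only to show the shape of a one-sided paired t-test on per-case performance differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-case performance scores (e.g. accuracy on 30 test cases),
# measured with BOTH methods on the SAME cases, so the test is paired.
I_ref = rng.normal(loc=0.80, scale=0.05, size=30)
I_new = I_ref + rng.normal(loc=0.03, scale=0.02, size=30)  # simulated improvement

# One-sided paired t-test of H1: mean(I_new - I_ref) > 0
result = stats.ttest_rel(I_new, I_ref, alternative="greater")
print(f"mean difference = {np.mean(I_new - I_ref):.4f}")
print(f"t = {result.statistic:.3f}, one-sided p = {result.pvalue:.4g}")
```

The pairing matters: evaluating both methods on the same cases removes the between-case variability from the comparison.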

But how do we define such a statistical variable? Also, I have heard that people sometimes conduct hypothesis tests on multiple variables. In that case, how do we decide whether or not the new method is better than the reference one?

Thank you,

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
From Rich Ulrich@21:1/5 to All on Sun Sep 15 02:57:20 2019
On Fri, 13 Sep 2019 04:45:09 -0700 (PDT), Cosine <asecant@gmail.com>
wrote:

> Hi:
>
> Given a method taken as the golden reference, how do we statistically show that a new method is better than the reference one?

Pragmatically, I can offer two examples: one based on
strong measurement assumptions, and one based on
retreating to the original sort of data that makes a clinical
predictor "golden".

Measurement assumption: If we assume that the underlying
variable varies smoothly across time, then we can compare
the "jitter" - noise - in two time series of (closely spaced)
measurements.
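One way to operationalize that comparison, as a sketch under the stated smoothness assumption (the series, noise levels, and names below are all simulated/hypothetical): take first differences of each closely spaced series so the smooth signal largely cancels, then test whether the two sets of differences have different spread, e.g. with Levene's test, which is robust to non-normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 500)
signal = np.sin(t)  # the smoothly varying underlying variable

# Two measurement methods observing the same signal with different noise levels
series_ref = signal + rng.normal(scale=0.10, size=t.size)
series_new = signal + rng.normal(scale=0.05, size=t.size)

# First differences: the smooth signal mostly cancels, leaving the "jitter"
jitter_ref = np.diff(series_ref)
jitter_new = np.diff(series_new)

# Levene's test for unequal spread of the two jitter samples
stat, p = stats.levene(jitter_ref, jitter_new)
print(f"std(jitter): ref={jitter_ref.std():.4f}, new={jitter_new.std():.4f}, p={p:.3g}")
```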

Retreat to basic data: A clinical indicator (say) is "golden" if it
predicts the eventual occurrence of some event (death?).
An alternative that proves to be more accurate in the long run (cross-validation studies) becomes the new gold standard.

In practice, this may use ROC curves, which balance false
negatives against false positives. For instance, the TB scratch
test is "positive" for a reaction larger than some conventional
and specific size. There is one ROC curve based on size. Do you
call it a different "method" (I don't) if you use a different cutoff?

A totally different method would have a different curve, and
it might reflect a different phenomenon. For instance, detecting
active TB bacteria is not the same as detecting TB antibodies.
So: What is the purpose of the test?
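To make the ROC idea concrete, here is a small sketch (the scores, labels, and effect sizes are hypothetical) that compares two scoring "methods" on the same cases via the area under the ROC curve, computed from the rank-based identity AUC = P(random positive outscores random negative). A different cutoff only moves you along one curve; a genuinely different method produces a different curve, and hence a different AUC.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney identity:
    the probability that a random positive case outscores a random
    negative case, counting exact ties as 1/2."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # all pairwise comparisons; fine for small illustrative samples
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (pos.size * neg.size)

rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=200)

# Two hypothetical diagnostic scores on the same 200 cases:
# method A separates the classes more strongly than method B.
score_a = labels * 1.5 + rng.normal(size=200)
score_b = labels * 0.5 + rng.normal(size=200)

auc_a = auc(score_a, labels)
auc_b = auc(score_b, labels)
print(f"AUC method A = {auc_a:.3f}")
print(f"AUC method B = {auc_b:.3f}")
```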

> One way I can think of is to define a statistical variable, I, and then conduct a hypothesis test of whether avg( I_new - I_ref ) > 0, which would mean that, on average, the new method performs better than the reference one.

I don't follow.

> But how do we define such a statistical variable? Also, I have heard that people sometimes conduct hypothesis tests on multiple variables. In that case, how do we decide whether or not the new method is better than the reference one?

The difficulties raised by "gold standards" are logical, not
statistical. If you have more than one purpose in mind, it is
very easy to suggest that there might be more than one
"gold standard" test to meet the several purposes.

--
Rich Ulrich

From Cosine@21:1/5 to All on Tue Sep 24 13:29:43 2019
Let's consider testing the effects of different types of fertilizer on the yield of a given type of crop. We want to know which type of fertilizer would make a given area of land produce the maximum amount of crop.

Suppose we have 4 types of fertilizer. Does that mean we need to conduct C(5,2) pairwise Student's t-tests to determine the best fertilizer?

From Rich Ulrich@21:1/5 to All on Wed Sep 25 00:19:08 2019
On Tue, 24 Sep 2019 13:29:43 -0700 (PDT), Cosine <asecant@gmail.com>
wrote:

> Let's consider testing the effects of different types of fertilizer on the yield of a given type of crop. We want to know which type of fertilizer would make a given area of land produce the maximum amount of crop.

This looks like a different question from the one you
originally asked, where "new method" included the notion of
a golden standard or reference, something to be improved upon.

> Suppose we have 4 types of fertilizer. Does that mean we need to conduct C(5,2) pairwise Student's t-tests to determine the best fertilizer?

You probably want to frame your hypothetical design with
something other than Farming problems. "Latin Squares"
are designs from 90 years ago for comparing outcomes in
farming while controlling for /some/ outside factors; and
there are other design variations that protect against
other confounding factors you may get with plots of land.

If you want to compare 5 samples that have equal
sample sizes, use an ANOVA with 5 groups. A post-hoc test
like Tukey's Honestly Significant Difference would let you
make statements about which ones may be superior to
which others, and which ones should so far be grouped
together.

The HSD compensates for the fact that multiple tests
are being performed, so it is more conservative than
a single t-test. Doing the testing in one ANOVA gains
robustness by estimating the error-variance from all
five of the groups, instead of only the two used in a
particular comparison.
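As a concrete sketch of that workflow, with simulated yields and hypothetical group labels (this uses `scipy.stats.tukey_hsd`, available in SciPy 1.8 and later):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Five hypothetical fertilizer groups with equal sample sizes (yield per plot);
# group "D" is simulated to be genuinely better than the rest.
groups = {
    "A": rng.normal(50, 5, size=20),
    "B": rng.normal(51, 5, size=20),
    "C": rng.normal(50, 5, size=20),
    "D": rng.normal(58, 5, size=20),
    "E": rng.normal(49, 5, size=20),
}

# One-way ANOVA across all five groups: the error variance is pooled
# from every group, not just the pair being compared.
f, p = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {f:.2f}, p = {p:.3g}")

# Tukey's HSD: all C(5,2) = 10 pairwise comparisons at once,
# with the family-wise error rate held at the nominal level.
hsd = stats.tukey_hsd(*groups.values())
print(hsd)
```

`hsd.pvalue` is a 5x5 matrix of adjusted pairwise p-values; in this simulation the comparisons involving group "D" are the ones that come out significant.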

--
Rich Ulrich
