• Q right way to interpret a test with multiple metrics

    From Cosine@21:1/5 to All on Fri Mar 17 17:47:56 2023
    Hi:

    We can easily find studies in the literature that use more than one performance metric for a hypothesis test without explicitly and clearly stating what hypothesis the study aims to test. Often the paper only states that it intends to test
    whether a newly developed object (an algorithm, drug, device, technique, etc.) performs better than some chosen benchmarks. Then the paper presents tables summarizing the results of many comparisons, picks out those comparisons in which some
    performance metric is better and statistically significant, and finally claims that the new object is successful because it has some favorable, statistically significant results.

    This looks odd. Shouldn't we clearly define the hypothesis before conducting any tests? For example, shouldn't we define success of the object as "all the chosen metrics show better results"? Otherwise, why would we test so many metrics
    instead of only one?

    The aforementioned approach amounts to this: we do not know what will happen, so let's pick some commonly used metrics and test whether we can get some of them to show favorable and significant results.

    Anyway, what are the correct or rigorous ways to conduct tests with multiple metrics?

  • From David Jones@21:1/5 to Cosine on Sat Mar 18 01:25:44 2023
    Cosine wrote:

    Hi:

    We can easily find studies in the literature that use more than
    one performance metric for a hypothesis test without explicitly and
    clearly stating what hypothesis the study aims to test. Often the
    paper only states that it intends to test whether a newly developed
    object (an algorithm, drug, device, technique, etc.) performs better
    than some chosen benchmarks. Then the paper presents tables
    summarizing the results of many comparisons, picks out those
    comparisons in which some performance metric is better and
    statistically significant, and finally claims that the new object is
    successful because it has some favorable, statistically significant
    results.

    This looks odd. Shouldn't we clearly define the hypothesis before
    conducting any tests? For example, shouldn't we define success of the
    object as "all the chosen metrics show better results"? Otherwise, why
    would we test so many metrics instead of only one?

    The aforementioned approach amounts to this: we do not know what
    will happen, so let's pick some commonly used metrics and test whether
    we can get some of them to show favorable and significant results.

    Anyway, what are the correct or rigorous ways to conduct tests
    with multiple metrics?

    You might want to search for the terms "multiple testing" and
    "Bonferroni correction".

  • From Rich Ulrich@21:1/5 to dajhawkxx@nowherel.com on Sat Mar 18 14:48:27 2023
    On Sat, 18 Mar 2023 01:25:44 -0000 (UTC), "David Jones" <dajhawkxx@nowherel.com> wrote:

    Cosine wrote:

    Hi:

    We can easily find studies in the literature that use more than
    one performance metric for a hypothesis test without explicitly and
    clearly stating what hypothesis the study aims to test.

    That sounds like a journal with reviewers who are not doing their job.
    A new method may have better sensitivity or specificity, making it
    useful as a second test. If it is cheaper/easier, that virtue might
    justify slight inferiority. If it is more expensive, there should be
    a gain in accuracy to justify its application (or, it deserves further development).
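
    For concreteness, with made-up counts, those two quantities are just:

        # Sensitivity and specificity from a hypothetical 2x2 confusion table.
        tp, fn = 90, 10   # diseased cases: detected vs. missed
        tn, fp = 80, 20   # healthy cases: correctly cleared vs. false alarms

        sensitivity = tp / (tp + fn)   # 0.90 -- share of true cases detected
        specificity = tn / (tn + fp)   # 0.80 -- share of non-cases cleared
        print(sensitivity, specificity)

    and a new method can win on one of them while losing on the other.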

    Often the
    paper only states that it intends to test whether a newly developed
    object (an algorithm, drug, device, technique, etc.) performs better
    than some chosen benchmarks. Then the paper presents tables
    summarizing the results of many comparisons, picks out those
    comparisons in which some performance metric is better and
    statistically significant, and finally claims that the new object is
    successful because it has some favorable, statistically significant
    results.

    This looks odd. Shouldn't we clearly define the hypothesis before
    conducting any tests? For example, shouldn't we define success of
    the object as "all the chosen metrics show better results"?
    Otherwise, why would we test so many metrics instead of only one?

    The aforementioned approach amounts to this: we do not know what
    will happen, so let's pick some commonly used metrics and test whether
    we can get some of them to show favorable and significant results.

    I am not comfortable with your use of the word 'metrics' -- I like
    to think of improving the metrics of a scale by taking a power
    transformation, such as a square root for Poisson counts.

    Or, your metric for measuring 'size' might be area, volume, weight....
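
    A quick simulation sketch of that idea, assuming Poisson counts: the
    raw variance grows with the mean, while the square-root scale has
    roughly constant variance (about 1/4).

        import numpy as np

        rng = np.random.default_rng(2)
        for mean in (9, 25, 100):
            x = rng.poisson(mean, size=100_000)
            # Raw variance tracks the mean; sqrt-scale variance stays near 0.25.
            print(mean, x.var().round(1), np.sqrt(x).var().round(2))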



    Anyway, what are the correct or rigorous ways to conduct tests
    with multiple metrics?

    You might want to search for the terms "multiple testing" and
    "Bonferroni correction".

    That answers the final question -- assuming that you do have
    some stated hypothesis or goal.

    --
    Rich Ulrich

  • From David Jones@21:1/5 to Rich Ulrich on Sun Mar 19 10:58:42 2023
    Rich Ulrich wrote:

    On Sat, 18 Mar 2023 01:25:44 -0000 (UTC), "David Jones" <dajhawkxx@nowherel.com> wrote:

    Cosine wrote:


    Anyway, what are the correct or rigorous ways to conduct tests
    with multiple metrics?

    You might want to search for the terms "multiple testing" and
    "Bonferroni correction".

    That answers the final question -- assuming that you do have
    some stated hypothesis or goal.

    Not quite. The "Bonferroni correction" is an approximation, and one
    needs to think about that more deeply than just noting that it
    approximates 1-(1-p)^n. The formula 1-(1-p)^n is exact if all the
    test-statistics are statistically independent, and it is conservative
    if there is positive dependence (and so "OK"). But, theoretically, it
    might be wildly wrong if there is negative dependence.
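
    As a rough check, one can simulate correlated test statistics under the
    global null and look at the realized family-wise error rate; a minimal
    Python sketch (assuming an equicorrelated normal model, with the
    correlation rho picked arbitrarily) is:

        import numpy as np
        from scipy.stats import norm

        rng = np.random.default_rng(0)
        m, alpha, rho, n_sim = 10, 0.05, 0.8, 20_000

        # Equicorrelated multivariate normal scores under the global null.
        cov = np.full((m, m), rho) + (1 - rho) * np.eye(m)
        z = rng.multivariate_normal(np.zeros(m), cov, size=n_sim)
        p = 2 * norm.sf(np.abs(z))        # two-sided p-values

        # Family-wise error rate using the Bonferroni threshold alpha/m:
        # it stays at or below alpha, and well below it when rho is large.
        fwer = (p < alpha / m).any(axis=1).mean()
        print(fwer)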

  • From David Jones@21:1/5 to Rich Ulrich on Sun Mar 19 11:26:16 2023
    Rich Ulrich wrote:

    On Sat, 18 Mar 2023 01:25:44 -0000 (UTC), "David Jones" <dajhawkxx@nowherel.com> wrote:

    Cosine wrote:

    Hi:

    We can easily find studies in the literature that use more than
    one performance metric for a hypothesis test without explicitly and
    clearly stating what hypothesis the study aims to test.

    That sounds like a journal with reviewers who are not doing their job.
    A new method may have better sensitivity or specificity, making it
    useful as a second test. If it is cheaper/easier, that virtue might
    justify slight inferiority. If it is more expensive, there should be
    a gain in accuracy to justify its application (or, it deserves further development).

    Often the paper only states that it intends to test whether a newly
    developed object (an algorithm, drug, device, technique, etc.)
    performs better than some chosen benchmarks. Then the paper presents
    tables summarizing the results of many comparisons, picks out those
    comparisons in which some performance metric is better and
    statistically significant, and finally claims that the new object is
    successful because it has some favorable, statistically significant
    results.

    This looks odd. Shouldn't we clearly define the hypothesis before
    conducting any tests? For example, shouldn't we define success of
    the object as "all the chosen metrics show better results"?
    Otherwise, why would we test so many metrics instead of only one?

    The aforementioned approach amounts to this: we do not know what
    will happen, so let's pick some commonly used metrics and test whether
    we can get some of them to show favorable and significant results.

    I am not comfortable with your use of the word 'metrics' -- I like
    to think of improving the metrics of a scale by taking a power
    transformation, such as a square root for Poisson counts.

    Or, your metric for measuring 'size' might be area, volume, weight....



    Anyway, what are the correct or rigorous ways to conduct tests
    with multiple metrics?

    You might want to search for the terms "multiple testing" and
    "Bonferroni correction".

    That answers the final question -- assuming that you do have
    some stated hypothesis or goal.

    My other answer concentrated on the case where you put all attention on
    the null hypothesis of "no effect of any kind", but one could also think
    of finding out whether any of the alternatives on which the
    test-statistics are based are of any importance, and if so, which one(s).

    In theory the "Bonferroni correction" approach doesn't deal with this.
    One would presumably need to go back to estimates of effect sizes. But,
    if the plan was to do further experiments targeted at getting better
    estimates of particular effects, how do you choose how many, and which,
    effects to investigate further? The original experiment might suggest
    the one with the smallest p-value, but that might just be a chance
    event, with some other one actually being better.
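
    A small simulation sketch (with made-up true effects, measured in
    standard-error units) shows how often the smallest p-value points at
    something other than the genuinely best effect:

        import numpy as np

        rng = np.random.default_rng(1)
        # Hypothetical true effects for five candidate effects;
        # the first one is genuinely the strongest.
        true_effects = np.array([0.5, 0.4, 0.4, 0.3, 0.0])
        n_sim = 20_000

        # One experiment per row: observed z-statistic = true effect + noise.
        z = true_effects + rng.standard_normal((n_sim, true_effects.size))

        # How often does the smallest p-value (largest z) pick the effect
        # that really is the largest? Far less often than one might hope.
        hit_rate = (z.argmax(axis=1) == true_effects.argmax()).mean()
        print(hit_rate)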

  • From Rich Ulrich@21:1/5 to dajhawkxx@nowherel.com on Sun Mar 19 20:48:47 2023
    On Sun, 19 Mar 2023 10:58:42 -0000 (UTC), "David Jones" <dajhawkxx@nowherel.com> wrote:

    Rich Ulrich wrote:

    On Sat, 18 Mar 2023 01:25:44 -0000 (UTC), "David Jones"
    <dajhawkxx@nowherel.com> wrote:

    Cosine wrote:


    Anyway, what are the correct or rigorous ways to conduct tests
    with multiple metrics?

    You might want to search for the terms "multiple testing" and
    "Bonferroni correction".

    That answers the final question -- assuming that you do have
    some stated hypothesis or goal.

    Not quite. The "Bonferroni correction" is an approximation, and one

    The sufficient answer started with "search for the terms" -- you
    should find much more than how to apply the Bonferroni correction.

    Multiple testing is also a broad topic. The original question was
    not very specific, but there should be a GOAL, something about
    making some /decision/ or reaching a conclusion.

    Here's some open-ended thinking about an open-ended question.

    I think I can usually work a decision into some hypothesis; but
    "p-level of 0.05" is a convention of social science research. Not
    every hypothesis merits that test.

    Some areas (tests for new atomic particles, for instance) use far more
    stringent nominal levels ... I think the official logic incorporates
    "Bonferroni"-type considerations. But for decisions in general,
    in other areas, we sometimes settle for "50%" (or worse).
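
    As a rough illustration (taking the familiar "five sigma" discovery
    convention as the stringent example, and an arbitrary number of
    implicit comparisons):

        from scipy.stats import norm

        # One-sided p-value corresponding to a 5-sigma threshold.
        p_5sigma = norm.sf(5.0)       # about 2.9e-7

        # For comparison, the per-test level a Bonferroni-style argument
        # would demand for an overall 0.05 error rate across, say,
        # 1000 implicit comparisons ("looks").
        m = 1000
        print(p_5sigma, 0.05 / m)     # ~2.9e-7 vs. 5e-5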


    needs to think about that more deeply than just noting that it
    approximates 1-(1-p)^n. The formula 1-(1-p)^n is exact if all the
    test-statistics are statistically independent, and it is conservative
    if there is positive dependence (and so "OK"). But, theoretically, it
    might be wildly wrong if there is negative dependence.

    --
    Rich Ulrich
