• Q: differences between these two ways of comparison

    From Cosine@21:1/5 to All on Fri Aug 11 18:28:50 2023
    Hi:

    Suppose we have 5 algorithms: A, B, C, D, and E, and we did the following two kinds of performance comparison. Each comparison compares two algorithms' values of a given performance metric, M.

    Kind-1:

    M_A > M_B, M_A > M_C, M_A > M_D, and M_A > M_E

    Then we claim that A performs better than the other 4 algorithms.

    Kind-2:

    M_A > M_B, M_A > M_C, M_A > M_D, M_A > M_E,
    M_B > M_C, M_B > M_D, M_B > M_E,
    M_C > M_D, M_C > M_E, and
    M_D > M_E

    Then, we claim that A performs best among all 5 algorithms.
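    For concreteness, a minimal Python sketch (purely illustrative) of how many comparisons each kind entails:

        import itertools

        algorithms = ["A", "B", "C", "D", "E"]

        # Kind-1: A against each of the others -- n - 1 = 4 comparisons
        kind1 = [("A", other) for other in algorithms[1:]]

        # Kind-2: every unordered pair -- C(5, 2) = 10 comparisons
        kind2 = list(itertools.combinations(algorithms, 2))

        print(len(kind1), len(kind2))  # 4 10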

  • From Rich Ulrich@21:1/5 to All on Sat Aug 12 00:09:50 2023
    On Fri, 11 Aug 2023 18:28:50 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:

    >Hi:
    >
    >Suppose we have 5 algorithms: A, B, C, D, and E, and we did the following two kinds of performance comparison. Each comparison compares two algorithms' values of a given performance metric, M.
    >
    >Kind-1:
    >
    >M_A > M_B, M_A > M_C, M_A > M_D, and M_A > M_E
    >
    >Then we claim that A performs better than the other 4 algorithms.

    It seems that you are describing the RESULT of a set
    of comparisons. The two 'kinds' would be: A versus each of the
    others, and "all comparisons among them."

    You should say, "on these test data" and "better on M than ..."
    and "performed" (past tense).


    >Kind-2:
    >
    >M_A > M_B, M_A > M_C, M_A > M_D, M_A > M_E,
    >M_B > M_C, M_B > M_D, M_B > M_E,
    >M_C > M_D, M_C > M_E, and
    >M_D > M_E
    >
    >Then, we claim that A performs best among all 5 algorithms.


    I would state that A performed better (on M) than the rest, and also
    the rest were strictly ordered in how well they performed.

    --
    Rich Ulrich

  • From Cosine@21:1/5 to All on Fri Aug 11 23:31:41 2023
    Rich Ulrich wrote on Saturday, August 12, 2023 at 12:10:06 PM [UTC+8]:
    >On Fri, 11 Aug 2023 18:28:50 -0700 (PDT), Cosine
    >wrote:
    >>Hi:
    >>
    >>Suppose we have 5 algorithms: A, B, C, D, and E, and we did the following two kinds of performance comparison. Each comparison compares two algorithms' values of a given performance metric, M.
    >>
    >>Kind-1:
    >>
    >>M_A > M_B, M_A > M_C, M_A > M_D, and M_A > M_E
    >>
    >>Then we claim that A performs better than the other 4 algorithms.
    >
    >It seems that you are describing the RESULT of a set
    >of comparisons. The two 'kinds' would be: A versus each of the
    >others, and "all comparisons among them."
    >
    >You should say, "on these test data" and "better on M than ..."
    >and "performed" (past tense).
    >
    >>Kind-2:
    >>
    >>M_A > M_B, M_A > M_C, M_A > M_D, M_A > M_E,
    >>M_B > M_C, M_B > M_D, M_B > M_E,
    >>M_C > M_D, M_C > M_E, and
    >>M_D > M_E
    >>
    >>Then, we claim that A performs best among all 5 algorithms.
    >
    >I would state that A performed better (on M) than the rest, and also
    >the rest were strictly ordered in how well they performed.
    >
    >--
    >Rich Ulrich

    In other words, if the purpose is only to demonstrate that A performed better on M than the other 4 algorithms,
    we only need to do the first kind of comparison. We do the second kind only if we want to demonstrate the ordering.

    By the way, it seems that to reach the desired conclusion, both kinds of comparison require doing multiple comparisons.

    The first kind requires 4 (= 5 - 1) and the second requires C(5,2) = 10.

    Therefore, if we use the Bonferroni correction, the significance level will be corrected to alpha/(n-1) and alpha/C(n,2), respectively.

    If we use more than one metric, e.g., M_1 to M_m, then we need to further divide the previous alphas by m, right?

    But wouldn't the corrected alpha value become too small, especially for larger values of n and m?
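    A small sketch of the arithmetic in question, assuming alpha = 0.05 and, for illustration, m = 7 metrics:

        from math import comb

        alpha, n, m = 0.05, 5, 7  # illustrative values

        kind1_tests = n - 1       # 4 comparisons against A
        kind2_tests = comb(n, 2)  # 10 pairwise comparisons

        # Bonferroni: divide alpha by the number of tests in the family
        print(alpha / kind1_tests)        # 0.0125
        print(alpha / kind2_tests)        # 0.005
        print(alpha / (kind2_tests * m))  # ~0.000714 once m metrics are added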

  • From Rich Ulrich@21:1/5 to All on Sat Aug 12 15:24:07 2023
    On Fri, 11 Aug 2023 23:31:41 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:

    >Rich Ulrich wrote on Saturday, August 12, 2023 at 12:10:06 PM [UTC+8]:
    >>On Fri, 11 Aug 2023 18:28:50 -0700 (PDT), Cosine
    >>wrote:
    >>>Hi:
    >>>
    >>>Suppose we have 5 algorithms: A, B, C, D, and E, and we did the following two kinds of performance comparison. Each comparison compares two algorithms' values of a given performance metric, M.
    >>>
    >>>Kind-1:
    >>>
    >>>M_A > M_B, M_A > M_C, M_A > M_D, and M_A > M_E
    >>>
    >>>Then we claim that A performs better than the other 4 algorithms.
    >>
    >>It seems that you are describing the RESULT of a set
    >>of comparisons. The two 'kinds' would be: A versus each of the
    >>others, and "all comparisons among them."
    >>
    >>You should say, "on these test data" and "better on M than ..."
    >>and "performed" (past tense).
    >>
    >>>Kind-2:
    >>>
    >>>M_A > M_B, M_A > M_C, M_A > M_D, M_A > M_E,
    >>>M_B > M_C, M_B > M_D, M_B > M_E,
    >>>M_C > M_D, M_C > M_E, and
    >>>M_D > M_E
    >>>
    >>>Then, we claim that A performs best among all 5 algorithms.
    >>
    >>I would state that A performed better (on M) than the rest, and also
    >>the rest were strictly ordered in how well they performed.
    >>
    >>--
    >>Rich Ulrich
    >
    >In other words, if the purpose is only to demonstrate that A performed better on M than the other 4 algorithms,
    >we only need to do the first kind of comparison. We do the second kind only if we want to demonstrate the ordering.
    >
    >By the way, it seems that to reach the desired conclusion, both kinds of comparison require doing multiple comparisons.
    >
    >The first kind requires 4 (= 5 - 1) and the second requires C(5,2) = 10.

    Before you take on 'multiple comparisons' and p-levels, you ought
    to have a Decision to be made, or a question: What do you have
    here? Making a statement about what happens to fit the sample
    best does not require assumptions; drawing inferences to elsewhere
    does require assumptions.

    Who or what does your sample /represent/? Where do the algorithms
    come from? (and how do they differ?). What are you hoping to
    generalize to?

    I can imagine that your second set of results could be a summary
    of step-wise regression, where Metric is the R-squared and A is
    the result after multiple steps. Each step shows an increase in
    R-squared, by definition. Ta-da!
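    A quick illustrative simulation (synthetic data, numpy only) makes the point: adding even pure-noise predictors never lowers R-squared on the fitting sample.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 50
        y = rng.normal(size=n)

        X = np.ones((n, 1))  # start with intercept only
        for step in range(1, 6):
            # add one pure-noise predictor per "step"
            X = np.hstack([X, rng.normal(size=(n, 1))])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ beta
            r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
            print(f"step {step}: R^2 = {r2:.3f}")  # never decreases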

    The hazards of step-wise regression are well-advertised by now.
    I repeated Frank Harrell's commentary multiple times in the stats
    Usenet groups, and others picked it up. I can add: When there
    are dozens of candidate variables to Enter, each step is apt to
    provide a WORSE algorithm when applied to a separate sample for
    validation. Sensible algorithms usually require the application of
    good sense by the developers -- instead of over-capitalizing on
    chance in a model built on limited data.

    If you have huge data, then you should also pay attention to
    robustness and generalizability across sub-populations, rather
    than focus on p-levels for the whole shebang.


    >Therefore, if we use the Bonferroni correction, the significance level will be corrected to alpha/(n-1) and alpha/C(n,2), respectively.

    In my experience, I talked people out of corrections many times
    by cleaning up their questions. Bonferroni fits best when you
    have /independent/ questions of equal priority. And when you
    have a reason to pay heed to family-wise error.


    >If we use more than one metric, e.g., M_1 to M_m, then we need to further divide the previous alphas by m, right?
    >
    >But wouldn't the corrected alpha value become too small, especially for larger values of n and m?

    If you don't have any idea what you are looking for, one common
    procedure is to proclaim the effort 'exploratory' and report
    the nominal levels.



    --
    Rich Ulrich

  • From Cosine@21:1/5 to All on Sat Aug 12 16:03:52 2023
    Hmm, let's start by asking or clarifying the research questions then.

    Many machine learning papers I have read use a set of metrics to show that the developed algorithm performs best compared to a set of benchmarks.

    Typically, the authors list metrics like accuracy, sensitivity, specificity, the area under the receiver operating characteristic curve (AUC), recall, F1-score, Dice score, etc.

    Next, the authors list 4-6 published algorithms as benchmarks. These algorithms have similar designs and are designed for the same purpose as the developed one, e.g., segmentation, classification, or detection/diagnosis.

    Then the authors run the developed algorithm and the benchmarks using the same dataset to get the values of each of the metrics listed.

    Next, the authors conduct the statistical analysis by comparing the values of the metrics to demonstrate that the developed algorithm is the best and, sometimes, to rank the algorithms (the developed one and all the benchmarks).

    Finally, the authors pick out those results showing favorable comparisons and claim these as the contribution(s) of the developed algorithm.

    It looks to me as if the authors are doing statistical tests that compare multiple algorithms on multiple metrics to establish the final (single or multiple) contribution(s) of the developed algorithm.
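    As a hypothetical sketch of that pipeline, with two stand-in classifiers on synthetic data and a few of the scikit-learn metrics named above:

        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.linear_model import LogisticRegression
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import (accuracy_score, recall_score,
                                     f1_score, roc_auc_score)

        X, y = make_classification(n_samples=500, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        # Stand-ins for "the developed algorithm" and one benchmark
        models = {"developed": RandomForestClassifier(random_state=0),
                  "benchmark": LogisticRegression(max_iter=1000)}

        for name, model in models.items():
            model.fit(X_tr, y_tr)
            pred = model.predict(X_te)
            prob = model.predict_proba(X_te)[:, 1]
            print(name,
                  accuracy_score(y_te, pred),
                  recall_score(y_te, pred),
                  f1_score(y_te, pred),
                  roc_auc_score(y_te, prob))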

  • From Rich Ulrich@21:1/5 to All on Mon Aug 14 19:03:25 2023
    On Sat, 12 Aug 2023 16:03:52 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:

    >Hmm, let's start by asking or clarifying the research questions then.
    >
    >Many machine learning papers I have read use a set of metrics to show that the developed algorithm performs best compared to a set of benchmarks.
    >
    >Typically, the authors list metrics like accuracy, sensitivity, specificity, the area under the receiver operating characteristic curve (AUC), recall, F1-score, Dice score, etc.
    >
    >Next, the authors list 4-6 published algorithms as benchmarks. These algorithms have similar designs and are designed for the same purpose as the developed one, e.g., segmentation, classification, or detection/diagnosis.

    Okay. You are outside the scope of what I have read.
    Whatever I read about machine learning, decades ago, was
    far more primitive or preliminary than this. I can offer a note
    or two on 'reading' such papers.


    >Then the authors run the developed algorithm and the benchmarks using the same dataset to get the values of each of the metrics listed.
    >
    >Next, the authors conduct the statistical analysis by comparing the values of the metrics to demonstrate that the developed algorithm is the best and, sometimes, to rank the algorithms (the developed one and all the benchmarks).

    Did the statistics include p-values?

    The comparison I can think of is the demonstrations I have
    seen about 'statistical tests' offered for consideration. That is,
    authors are comparing (say, too simplistically) Student's t-test to
    a t-test for unequal variances, or to a t on rank-orders.

    Here, everyone can inspect the tests and imagine when they
    will differ; randomized samples are created which feature various
    aspects of non-normality, for various combinations of Ns. What is
    known is that the tests will differ -- a 5% test does not 'reject'
    2.5% at each end, when computed on 10,000 generated samples,
    when its assumptions are intentionally violated.

    What is interesting is how MUCH they differ, and how much more
    they differ for smaller N or for smaller alpha.
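    A sketch of that kind of demonstration, with invented settings: hold H0 true, violate the equal-variance assumption with unequal Ns, and count how often each nominal 5% test rejects over 10,000 generated samples.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        n1, n2, reps, alpha = 10, 40, 10_000, 0.05
        reject_student = reject_welch = 0

        for _ in range(reps):
            # H0 is true (equal means); variances and Ns are very unequal
            x = rng.normal(0, 1, n1)
            y = rng.normal(0, 4, n2)
            reject_student += stats.ttest_ind(x, y, equal_var=True).pvalue < alpha
            reject_welch += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha

        # Student's t drifts well away from the nominal 5%;
        # the unequal-variance (Welch) test stays close to it.
        print(reject_student / reps, reject_welch / reps)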



    >Finally, the authors pick out those results showing favorable comparisons and claim these as the contribution(s) of the developed algorithm.
    >
    >It looks to me as if the authors are doing statistical tests that compare multiple algorithms on multiple metrics to establish the final (single or multiple) contribution(s) of the developed algorithm.

    So, what I know (above) won't apply if you have to treat the
    algorithms as 'black-box' operations -- you can't predict when
    an algorithm will perform at its best.

    I think I would be concerned about the generality of the test
    bank, and the legitimacy/credibility of the authors.

    I can readily imagine a situation like with the 'meta-analyses'
    that I read in the 1990s: You need a good statistician and a
    good subject-area scientist to create a good meta-analysis, and
    most of the ones I read had neither.

    --
    Rich Ulrich

  • From Cosine@21:1/5 to All on Tue Aug 15 08:10:14 2023
    Well, let's consider a more classical problem.

    Regarding English teaching methods for high school students, suppose we develop a new method (A1) and want to demonstrate whether it performs better than other methods (A2, A3, and A4) by comparing the average scores of the experimental classes using the different
    methods. Each comparison uses a paired t-test. Since each comparison is independent of the others, the Bonferroni-corrected significance level is alpha_original/(4-1).

    Suppose we want to investigate whether the developed method (A1) is better than the other methods (A2, A3, and A4) for English, Spanish, and German; then the corrected alpha = alpha_original/(4-1)/3.
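    A sketch of that procedure under invented scores (scipy's paired test is ttest_rel; the score lifts below are arbitrary):

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        alpha = 0.05 / (4 - 1)  # Bonferroni over the 3 comparisons against A1
        # with English, Spanish, and German all tested: (4-1)*3 = 9 tests,
        # so alpha = 0.05 / 9

        # Hypothetical paired class-average scores, one entry per pairing unit
        scores = {m: rng.normal(70 + lift, 5, size=20)
                  for m, lift in [("A1", 4), ("A2", 0), ("A3", 1), ("A4", 2)]}

        for other in ["A2", "A3", "A4"]:
            t, p = stats.ttest_rel(scores["A1"], scores[other])
            print(f"A1 vs {other}: p = {p:.4f} (significant: {p < alpha})")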

  • From Rich Ulrich@21:1/5 to All on Sat Aug 19 00:25:49 2023
    On Tue, 15 Aug 2023 08:10:14 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:

    >Well, let's consider a more classical problem.
    >
    >Regarding English teaching methods for high school students, suppose we
    >develop a new method (A1) and want to demonstrate whether it performs
    >better than other methods (A2, A3, and A4) by comparing the average
    >scores of the experimental classes using the different methods. Each
    >comparison uses a paired t-test. Since each comparison is independent of
    >the others, the Bonferroni-corrected significance level is alpha_original/(4-1).

    It took me a bit to figure out how this was a classical problem,
    especially with paired t-tests -- I've never read that literature
    in particular. 'Paired' on individuals does not work because you
    can't teach the same material to the same student in two ways
    from the same starting point.

    Maybe I got it. 'Teachers' account for so much variance in
    learning that the same teacher needs to teach two methods
    to two different classes. 'Teachers' are the units of analysis,
    comparing success for pairs of methods.

    Doing this would be similar to what I've read a little more about,
    testing two methods of clinical intervention. What also seems
    similar for both is that the PI wants to know that the teacher/
    clinician can and will properly administer the Method without too
    much contamination.


    >Suppose we want to investigate whether the developed method (A1) is
    >better than the other methods (A2, A3, and A4) for English, Spanish, and
    >German; then the corrected alpha = alpha_original/(4-1)/3.

    From my own consulting world, 'power of analysis' was always
    a major concern. So I must mention that there is a very good
    reason that studies usually compare only TWO methods if they
    want a firm answer: More than two comparisons will require
    larger Ns for the same power, and funding agencies (US, now)
    typically care whether the analysis is adequately powered. So if cost/size
    is a problem, there won't be four Methods or four Languages.

    For the combined experiment, I bring up what I said before:
    Are you sure you are asking the question you want? (or that
    you need?)

    One way to compose a simple design would be to look at the
    two-way analysis of Method x Language. The main effect for
    Method would matter, and the interaction of Method x Language
    would say that they don't work the same. A main effect for
    Language would mainly be confusing.
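    A sketch of that two-way analysis on hypothetical data, using the statsmodels formula API:

        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf
        from statsmodels.stats.anova import anova_lm

        rng = np.random.default_rng(0)

        # Hypothetical class-average scores for each Method x Language cell
        rows = [{"method": m, "language": lang,
                 "score": rng.normal(70, 5) + (3 if m == "A1" else 0)}
                for m in ["A1", "A2", "A3", "A4"]
                for lang in ["English", "Spanish", "German"]
                for _ in range(10)]
        df = pd.DataFrame(rows)

        # Main effects for Method and Language, plus their interaction
        model = smf.ols("score ~ C(method) * C(language)", data=df).fit()
        print(anova_lm(model, typ=2))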

    Beyond that, there is what I mentioned before, Are you sure
    that family-wise alpha error deserves to be protected?

    For educational methods -- or clinical ones -- being 'just as good'
    may be fine if the teachers and students like it better. In fact, for
    drug treatments (which I never dealt with on this level), NIH
    had some (maybe confusing) prescriptions for how to 'show
    equivalence'.

    I say '(confusing)' because I do remember reading some criticism
    and contradictory advice -- when I read about it, 20 years ago.
    (I hope they've figured it out by now.)

    --
    Rich Ulrich
