• Q what conclusions could be drawn from data, p-value and CI

    From Cosine@21:1/5 to All on Thu Apr 29 00:08:06 2021
    We conducted a test on two groups (A and B). We used a
    15-item scale to measure the results. A cut-off score of
    6 (scores ranging from 0 to 15, with a higher score being
    indicative of a stronger reaction) was set to differentiate
    individuals with a clinical reaction from normal individuals.

    The null hypothesis is that the two groups have no difference.
    The alternative hypothesis is that the reaction of the members
    of Group A is greater than that of Group B.

    We defined the difference = score of A - score of B.
    We chose alpha = 0.05.

    We got the following data summarized in the table below.

    Case  Mean Difference  P-value  95% CI          N
    1     0.15             0.001    0.05 to 0.25    2000
    2     2.10             0.005    1.25 to 2.95    1200
    3     1.30             0.089    -2.10 to 3.70   400

    In addition to the following analysis, what else could we
    draw from the data?

    Case-1:
    P-value < alpha -> significant
    95% CI entirely > 0 -> A > B

    Case-2:
    the same as Case-1

    Case-3:
    P-value > alpha -> not significant
    95% CI contains 0 -> not sure whether A > B or A < B
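    As a side note, the Case-1 logic (p < alpha together with a 95% CI entirely above 0) can be sketched with a normal-approximation z-test. The numbers below (a mean difference of 0.15 with a standard error of 0.05) are illustrative assumptions, not values taken from the table:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def summarize(diff, se, z_crit=1.96):
    """Two-sided p-value and 95% CI for a mean difference (normal approx.)."""
    z = diff / se
    p_two_sided = 2.0 * (1.0 - phi(abs(z)))
    ci = (diff - z_crit * se, diff + z_crit * se)
    return p_two_sided, ci

# Illustrative numbers (NOT the table's): diff = 0.15, SE = 0.05
p, (lo, hi) = summarize(0.15, 0.05)
print(p < 0.05, lo > 0)   # both True: significant, and the CI lies above 0
```

    The one-sided test in the problem statement would halve the p-value, but the two-sided version matches the two-sided 95% CI.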

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From duncan smith@21:1/5 to Cosine on Thu Apr 29 16:07:47 2021
    On 29/04/2021 08:08, Cosine wrote:
    [original problem statement and table quoted; snipped]


    It looks like some kind of class exercise / assignment. I'd say look at
    the results carefully, and think what other information you'd like to
    have in order to make sense of it. I'd have several questions to ask,
    starting with exactly what these 3 cases are (I'm pretty sure what
    they're not).

    Duncan

  • From Cosine@21:1/5 to All on Thu Apr 29 09:29:21 2021
    Cosine wrote on Thursday, 29 April 2021 at 3:08:08 PM [UTC+8]:
    [original problem statement and table quoted; snipped]

    We could intuitively connect the P-value inference with the CI inference: P-value < alpha <=> reject H0 <=> the (1-alpha) CI does not contain 0.
    But is there a formal way to prove the latter part, i.e., making the inference via the CI?

    We could also draw a conclusion about clinical significance if we had additional information on a clinically meaningful value. We could then
    say that the result is clinically significant if 1) the CI contains that clinically meaningful value, and 2) the CI is narrow enough. Nevertheless,
    are there objective ways to determine whether the CI is too wide?
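    A quick numerical check of that claimed equivalence (stated for a two-sided test and a two-sided interval at the same alpha; it is exact because both conditions reduce to |diff/SE| > z_crit):

```python
import random
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

Z_CRIT = 1.959964  # Phi^{-1}(0.975), i.e. alpha = 0.05 two-sided

random.seed(0)
for _ in range(10_000):
    diff = random.uniform(-3.0, 3.0)       # hypothetical mean difference
    se = random.uniform(0.1, 2.0)          # hypothetical standard error
    p = 2.0 * (1.0 - phi(abs(diff / se)))  # two-sided p-value
    reject = p < 0.05
    ci_excludes_zero = (diff - Z_CRIT * se > 0) or (diff + Z_CRIT * se < 0)
    assert reject == ci_excludes_zero      # p < alpha  <=>  0 outside the 95% CI
print("equivalence holds on all simulated cases")
```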

  • From Rich Ulrich@21:1/5 to All on Thu Apr 29 13:57:55 2021
    On Thu, 29 Apr 2021 00:08:06 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:

    [problem statement quoted; snipped]

    Case  Mean Difference  P-value  95% CI          N
    1     0.15             0.001    0.05 to 0.25    2000
    2     2.10             0.005    1.25 to 2.95    1200
    3     1.30             0.089    -2.10 to 3.70   400

    In addition to the following analysis, what else could we
    draw from the data?

    Bad reporting. Is the N a total, for equal group sizes?

    Whatever the "cases" are, they are vastly different in SD.
    Perhaps Case 1 has scores near zero for everyone. Or: it would make more
    sense if Case 1 happened to report "Average item score"
    whereas the others reported "Scale Total". That would make
    the adjusted line for Case 1 read
    1 2.25 0.001 0.75-3.75 2000

    I haven't done calculations to be sure, but that does
    seem like a large SE (on all three) for the reported Ns and
    a 15 point scale.


    Then too, some numbers have to be wrong. For Case
    3, the mean difference is the midpoint of (-1.1, 3.7), not
    of the reported (-2.1, 3.7). I assume -1.1 is correct.

    But, more seriously, the test results (CI) are inconsistent with
    the reported p-values. The SE for each comparison, the
    denominator of the t-test, is about 1/4th the range of the
    CI. Using that as a close approximation gives me t-tests of
    3.0, 4.94, and 1.08, respectively. The t for Case 2
    is clearly the largest, yet its reported p-value (0.005) is larger
    than Case 1's (0.001), which is inconsistent.
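    That back-of-envelope check can be reproduced in a few lines; it assumes SE ~ (CI width)/4 and uses the corrected lower bound of -1.10 for Case 3:

```python
# (case, mean difference, CI lower, CI upper); Case 3 uses the corrected -1.10
cases = [
    (1, 0.15, 0.05, 0.25),
    (2, 2.10, 1.25, 2.95),
    (3, 1.30, -1.10, 3.70),
]

for case, diff, lo, hi in cases:
    se = (hi - lo) / 4          # SE approximated as a quarter of the CI width
    t = diff / se               # approximate t statistic
    print(f"Case {case}: SE ~ {se:.3f}, t ~ {t:.2f}")
# Case 1: t ~ 3.00; Case 2: t ~ 4.94; Case 3: t ~ 1.08
```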



    [Case-1/2/3 analysis quoted; snipped]

    If this is a homework assignment, as Duncan suggests,
    you should give credit where credit is due.

    --
    Rich Ulrich

  • From David Jones@21:1/5 to Cosine on Thu Apr 29 17:20:59 2021
    Cosine wrote:

    Cosine wrote on Thursday, 29 April 2021 at 3:08:08 PM [UTC+8]:
    [original problem statement, table, and analysis quoted; snipped]

    We could intuitively connect the P-value inference with the CI
    inference: P-value < alpha <=> reject H0 <=> the (1-alpha) CI does not
    contain 0. But is there a formal way to prove the latter part,
    i.e., making the inference via the CI?


    A standard, and best, way of constructing a confidence interval in any
    general situation is to define the confidence interval to contain
    exactly those values for which the significance test that the true
    value is that particular value is not rejected. This is standard stuff
    in any reliable textbook or statistics course.
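    That test-inversion construction can be sketched numerically. The sketch assumes a normal z-test and illustrative Case-2-like numbers (difference 2.10, SE 0.425); scanning candidate true values and keeping those not rejected recovers the usual diff +/- 1.96*SE interval:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

xbar, se, alpha = 2.10, 0.425, 0.05   # illustrative Case-2-like numbers

def p_value(theta0):
    """Two-sided p-value for H0: true difference = theta0."""
    return 2.0 * (1.0 - phi(abs((xbar - theta0) / se)))

# The CI is the set of theta0 NOT rejected at level alpha (here: a fine grid)
grid = [i / 1000 for i in range(0, 5000)]
accepted = [th for th in grid if p_value(th) >= alpha]
lo, hi = min(accepted), max(accepted)
print(f"inverted-test CI: ({lo:.3f}, {hi:.3f})")   # matches xbar +/- 1.96*se
```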


    We could also draw a conclusion about clinical significance if we
    had additional information on a clinically meaningful value. We could
    then say that the result is clinically significant if 1) the CI
    contains that clinically meaningful value, and 2) the CI is
    narrow enough. Nevertheless, are there objective ways to determine
    whether the CI is too wide?

    You need to revise this to say that you have a result of clinical
    importance if the confidence interval contains only values that are
    large enough to be medically useful, and NO OTHERS. That last
    stipulation replaces your concern about the confidence interval being
    too wide.

  • From Cosine@21:1/5 to All on Thu Apr 29 11:44:58 2021
    Rich Ulrich wrote on Friday, 30 April 2021 at 1:58:02 AM [UTC+8]:
    [Rich Ulrich's reply quoted in full; snipped]

    This has nothing to do with homework whatsoever.

    The table came from Table I of the following paper:

    Aarts, S., B. Winkens and M. van den Akker (2012). "The insignificance of statistical significance." European Journal of General Practice 18(1): 50-52.

    But the 95% CI of Case 3 was printed there as: 21.10-3.70.

  • From Rich Ulrich@21:1/5 to All on Thu Apr 29 21:42:51 2021
    On Thu, 29 Apr 2021 11:44:58 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:

    [earlier exchange quoted in full; snipped]

    This has nothing to do with homework whatsoever.

    The table came from Table I of the following paper:

    Aarts, S., B. Winkens and M. van den Akker (2012). "The insignificance of statistical significance." European Journal of General Practice 18(1): 50-52.

    Without looking, I would guess that I correctly nailed the
    distinction of Case 1 vs. 2 and 3. And they were trying to make
    a point which turns out to be a point about incompetent readers.

    I'm reminded of an article I read, maybe 1985, that documented
    the surprisingly high error rate for footnotes to scientific studies.
    (That is, where references cited gave the wrong page, named the
    journal wrong, or whatever.) The next issue of the journal
    included a note that apologized for three errors in the footnotes
    of that article.

    Or, to the point: I don't have much respect for people who talk
    about "the insignificance of statistical significance".
    It doesn't surprise me a bit that they carelessly screwed up a table
    both logically and typographically, because such people are not
    careful people.


    But the 95% CI of case 3 was printed as: 21.10-3.70.

    Okay. You guessed wrong on the correction: it was -1.10,
    not 21.10 or -2.10.

    --
    Rich Ulrich

  • From David Duffy@21:1/5 to Cosine on Fri Apr 30 05:49:50 2021
    Cosine <asecant@gmail.com> wrote:
    In addition to the following analysis, what else could we
    draw from the data?

    A different way of thinking about what a P-value is telling you is via
    the literature on estimation or calibration of posterior P-values, e.g.
    Sellke et al (2001). The argument is most easily seen for a result with
    P=0.05 when you have set alpha=0.05: if what you saw is the true effect
    size, then you have only about a 50% chance of getting a significant
    result if you repeated exactly the same study (same N etc.).

    For simple states of affairs,

    "...here is the basic and surprising conclusion for normal testing, first established (theoretically) by Berger and Sellke (1987). Suppose it is
    known, a priori, that about 50% of the drugs tested have a negligible
    effect. (We shortly consider the more general case.) Then:

    "1. Of the Di for which the p value ~ .05, at least 23% (and typically
    close to 50%) will have negligible effect.

    "2. Of the Di for which the p value ~ .01, at least 7% (and typically
    close to 15%) will have negligible effect.

    If H0 and H1 have equal prior probabilities of 1/2, Sellke et al give

    alpha(p) = 1/(1 + 1/(-e p log(p)))

    as the posterior probability of H0, and as a frequentist calibration of
    p. This is only simple for "precise" alternative hypotheses, obviously.
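    Plugging the two p-values discussed in the quoted passage into that formula (a minimal check of the calibration, not part of the original posts):

```python
import math

def calibrated_alpha(p):
    """Sellke et al. calibration: lower bound on Pr(H0 | data), equal priors.
    Valid for p < 1/e, where -e * p * log(p) < 1."""
    return 1.0 / (1.0 + 1.0 / (-math.e * p * math.log(p)))

print(round(calibrated_alpha(0.05), 3))  # 0.289 -- the "p ~ .05" regime
print(round(calibrated_alpha(0.01), 3))  # 0.111 -- the "p ~ .01" regime
```

    So even a p-value of 0.01 leaves a posterior probability for H0 of roughly 11% under these assumptions, which is the point the quoted passage is making.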

    Relatedly, in genetic linkage analysis, where we set the critical
    alpha to 0.0003 (chosen because there are 22 (pairs of) chromosomes),
    the power to replicate a *true* finding using the same size and type
    dataset (with P close to 0.0003) is ~20% (obtained via simulations).

    You can think about the three results in your example and the
    "replication crisis" through this lens.

  • From Cosine@21:1/5 to All on Fri Apr 30 05:42:17 2021
    How do we determine if the width of the CI is adequate or too wide?

    The corrected data of Table I is given below:

    Case  Mean Difference  P-value  95% CI          N
    1     0.15             0.001    0.05 to 0.25    2000
    2     2.10             0.005    1.25 to 2.95    1200
    3     1.30             0.089    -1.10 to 3.70   400

    For the data provided by the above paper, the author wrote:

    Let us reconsider the above-mentioned hypothetical study. The null hypothesis states that the mean difference between females and males on the
    GDS-15 (scale ranging from 0 to 15) is zero. Hence, if zero is detected in the 95% CI, the null hypothesis is not rejected. Examples of possible
    study results, using an alpha of 5%, are displayed in Table I. ...
    Example 2 is not only statistically significant but also clinically relevant; the difference between females and males on the GDS-15 is
    approximately two whole points. Moreover, *the confidence interval is quite narrow, which indicates that the sample size is large enough* to
    make a proper judgement.

    What is the basis for the author to make this judgment?

    The author also wrote:

    Example 3 is not statistically significant. *The confidence interval in this example is very large (almost six points)*, which makes it
    difficult to draw any firm conclusions. Since the confidence interval in this example includes both negative and positive values, it is not yet
    clear if there is a difference between these two groups (if females report more depressive symptoms than males or vice versa). Consequently,
    this study should be repeated using a larger sample size, which will decrease the width of the confidence interval.

    Again, why could the author make this statement? What did it mean by "almost six points"?

  • From Rich Ulrich@21:1/5 to All on Sat May 1 19:35:59 2021
    On Fri, 30 Apr 2021 05:42:17 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:

    How do we determine if the width of the CI is adequate or too wide?

    The corrected data of Table I is given below:

    Case  Mean Difference  P-value  95% CI          N
    1     0.15             0.001    0.05 to 0.25    2000
    2     2.10             0.005    1.25 to 2.95    1200
    3     1.30             0.089    -1.10 to 3.70   400



    Here's some computation showing Cohen's d for each Case.
    Cohen's d is the usual recommendation for two-group
    comparisons of effect size. That seems very relevant to the
    reported title of that paper.

    Cohen's d = (m1-m2) / s_w for the Means and Within SD.

    The s_w can be recovered from the t-test: note, the t is
    incorporated in the computation of the CI, approximately
    +/- 2 (easier than 1.96) for the 95% CI.

    t-test: t = (m1-m2) / s_diff, where I compute the standard error of
    the difference using the common s_w; for Case 3, N = 400 taken as 200+200:

    The variance of a difference is equal to the sum of the variances,
    thus,

    s_diff= sqrt( s_w**2 /200 + s_w**2 /200)
    = sqrt( 2* s_w**2 /200)
    = s_w /10

    Or, s_w= 10* s_diff .

    For Case 3, the range for +/- 1.96 is about 4* s_diff.
    For Case 3, the range is 4.8, so that s_diff is 1.2.
    Thus s_w is computed as 10 times that, or 12.

    Cohen's d would be a "small" effect, 0.11 (from 1.3/12); but
    that is less relevant than the fact that "12" is impossible as the
    SD for scores between (0,15). If all scores sit at 0 and
    15, equally distributed, the maximum SD of 7.5 is achieved,
    as you get by re-scaling a 0-1 variable to 0-15.

    Computations for Cases 1 and 2 get s_w's of 1.12 and 7.36
    (nearly the max of 7.5); and Cohen's d's, respectively, of 0.13
    and 0.29. Case 2 has a moderate difference.
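    The recovery of s_w and Cohen's d above can be reproduced in a few lines, under the same assumptions (equal group sizes, SE ~ CI width / 4, and the corrected -1.10 lower bound for Case 3):

```python
import math

# (case, mean difference, CI lower, CI upper, total N); Case 3 corrected
cases = [
    (1, 0.15, 0.05, 0.25, 2000),
    (2, 2.10, 1.25, 2.95, 1200),
    (3, 1.30, -1.10, 3.70, 400),
]

for case, diff, lo, hi, n_total in cases:
    n = n_total // 2                 # assume equal group sizes
    s_diff = (hi - lo) / 4           # SE of the difference ~ quarter CI width
    s_w = s_diff * math.sqrt(n / 2)  # invert s_diff = s_w * sqrt(2/n)
    d = diff / s_w                   # Cohen's d
    print(f"Case {case}: s_w ~ {s_w:.2f}, d ~ {d:.2f}")
# Case 1: s_w ~ 1.12, d ~ 0.13; Case 2: s_w ~ 7.36, d ~ 0.29; Case 3: s_w ~ 12.00, d ~ 0.11
```

    The Case 3 result reproduces the impossible s_w of 12 for a 0-15 scale.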

    I don't like to criticize a paper from a distance, that is, without
    actually reading it. I'm using the numbers and description,
    as given.

    Am I all confused, and screwing up? or is this example, as
    it has been presented, totally bad?


    [paper quote repeated; snipped. The underlined phrase: "the confidence
    interval is quite narrow, which indicates that the sample size is large
    enough"]

    What is the basis for the author to make this judgment?

    Knowing the subject matter (almost) always matters.

    Females rate higher on typical depression scales (U.S.)
    because of non-depressive artifacts, like TALKING more
    with people about everything, including mood. Women
    also see doctors more often, which is not entirely accounted
    for by pregnancy or menstruation. Thus, such results as
    these should be followed by showing that there are items that
    /matter/ that are relevant and differ.




    The author also wrote:
    [paper quote repeated; snipped. The underlined phrase: "The confidence
    interval in this example is very large (almost six points)"]

    Again, why could the author make this statement? What did it mean by
    almost 6 points?

    That's what he calls 4.8. "Clumsy" makes many mistakes;
    "careless" fails to catch them.


    [remainder of the paper quote; snipped]


    They should have started with real data.

    --
    Rich Ulrich
