• Q number of samples and metrics

    From Cosine@21:1/5 to All on Tue Aug 2 15:00:35 2022
    Hi:

    How do we determine the minimal number of samples required for a statistical experiment? For example, we found on the internet that "N < 30" is considered a set with a small number of samples. But how do we decide if the number of samples is too small?
    For example, are 2, 3, ..., 10 samples too small? Why is that? Any theory to support the decision? Likewise, what is the theory behind deciding that "N < 30" is a set with a small number of samples?

    Next, let's consider the number of metrics (e.g., accuracy and specificity) analyzed in the experiment. If we use too many metrics, it would be considered that we are fishing the dataset. But again, how do we determine the proper number of metrics
    analyzed in the experiment?

  • From Rich Ulrich@21:1/5 to All on Tue Aug 2 23:11:09 2022
    On Tue, 2 Aug 2022 15:00:35 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:

    Hi:

    How do we determine the minimal number of samples required for a
    statistical experiment?

    This is called "power analysis." The statistical procedure uses the distribution of the non-central F or whatever. I suggest Jacob
    Cohen's book for an introduction that goes beyond simply presenting
    the tables that can be used for lookup.

    It was in the 1980s when NIMH started requiring power analyses
    as part of our research grants (psychiatric medicine).

    Which statistical test (F, t, etc.)? What alpha-error? What
    beta-error (chance of missing an effect), given what assumed
    effect size? ... for what N? Power is equal to (1 - beta).

    Thus, a power analysis might include a table that shows the
    power obtained by using specific Ns with specific underlying
    effects for the test we are using.

    For a two-tailed t-test, at 5% (fixed-format table):

                        N needed, for assumed effect sizes (d)
                          0.5       0.6       0.8
        power    60%    < n's >   < n's >   < n's >
                 80%    < n's >   < n's >   < n's >
                 90%    < n's >   < n's >   < n's >
                 95%    < n's >   < n's >   < n's >

    Greater power implies larger N; larger effect
    size implies smaller N.
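    As a rough illustration, assuming the test is a two-sample t-test and the
    Python statsmodels package is available, the n's in a table of that shape
    can be computed like this:

      # Sketch: per-group N required for a two-sided, two-sample t-test at
      # alpha = 0.05, across assumed effect sizes (Cohen's d) and target power.
      from statsmodels.stats.power import TTestIndPower

      analysis = TTestIndPower()
      effect_sizes = [0.5, 0.6, 0.8]           # assumed effect sizes (d)
      power_levels = [0.60, 0.80, 0.90, 0.95]  # target power = 1 - beta

      print("power   " + "   ".join(f"d={d}" for d in effect_sizes))
      for power in power_levels:
          ns = [analysis.solve_power(effect_size=d, alpha=0.05, power=power,
                                     alternative='two-sided')
                for d in effect_sizes]
          print(f"{power:>5.0%}   " + "   ".join(f"{n:6.1f}" for n in ns))

    (TTestIndPower assumes two independent groups of equal size; statsmodels'
    TTestPower covers the one-sample or paired case.)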

    In our area, 80% was the minimum for most studies.
    If there are multiple hypotheses, the same table shows how
    likely the study will "detect" the various effect sizes with
    a nominal test of that size.


    For example, we found on the internet that "N
    < 30" is considered a set with a small number of samples. But how do
    we decide if the number of samples is too small? For example, are 2,
    3, ..., 10 samples too small? Why is that?

    I had a friend who did lab work on cells. He told me that his
    typical N was 3: The only effect sizes he was interested in were
    the HUGE ones. If he used just one or two, then a weird result
    might be lab error; two similar weird results showed that he had
    something.

    Any theory to support the
    decision? Likewise, what is the theory behind deciding that "N < 30" is
    a set with a small number of samples?

    Next, let's consider the number of metrics (e.g., accuracy and
    specificity) analyzed in the experiment. If we use too many metrics,
    it would be considered that we are fishing the dataset. But again, how
    do we determine the proper number of metrics analyzed in the
    experiment?

    I think you are confusing two other discussions here. Metrics
    like "specificity and sensitivity" are not assessed by statistical
    tests like the t-test; they are found with a sample large enough to
    give a small-enough standard deviation.
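    As a back-of-the-envelope sketch (treating specificity as a simple
    proportion and using the normal approximation), the N needed for a given
    precision follows from the standard error of a proportion:

      # Sketch: smallest N (of true-negative cases) whose normal-approximation
      # 95% CI for a proportion such as specificity has half-width <= margin.
      import math

      def n_for_proportion(p_expected: float, margin: float, z: float = 1.96) -> int:
          # SE = sqrt(p * (1 - p) / n); solve z * SE <= margin for n
          return math.ceil(z ** 2 * p_expected * (1 - p_expected) / margin ** 2)

      # e.g., to pin down a specificity expected to be near 0.90 to within +/- 0.05:
      print(n_for_proportion(0.90, 0.05))   # about 139 negative cases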

    "Multiple variables" opens a discussion that starts with setting
    up your experiment: Have FEW "main" hypotheses. There can be
    sub-hypotheses; there can be other, "frankly exploratory" results.

    One approach for several variables is to use Bonferroni correction
    for multiple tests; the Power Table then might have to refer to a
    "nominal" alpha of 2.5% or what-not, to correct for multiple tests.

    Another approach is to do a 'multivariate analysis' that tests
    several hypotheses at once; that gets into other discussions of
    how to properly consider multiple tests, since the OVERALL test
    does not tell you about the relative import of different variables.

    I've always recommended creating "composite scores" that
    combine the main criteria -- if you can't just pick a single score.
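    One simple way to build such a composite (a sketch of one possibility,
    not necessarily the method used in those studies) is to z-score each
    criterion and average:

      # Sketch: a composite score formed by standardizing each criterion
      # (z-scores) and averaging, so no single scale dominates by its units.
      import numpy as np

      def composite_score(criteria: np.ndarray) -> np.ndarray:
          # criteria: shape (n_subjects, n_criteria); returns one score per subject
          z = (criteria - criteria.mean(axis=0)) / criteria.std(axis=0, ddof=1)
          return z.mean(axis=1)

      ratings = np.array([[10., 3.2, 85.],    # hypothetical subjects x criteria
                          [12., 2.8, 90.],
                          [ 9., 3.9, 70.]])
      print(composite_score(ratings))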

    I took part in a multi-million dollar study, several hundred patients
    followed for two years, a dozen rating scales collected at multiple
    time points ... where the main criterion for treatment success was
    whether a patient had to be withdrawn from the trial because
    re-hospitalization was imminent.

    --
    Rich Ulrich

  • From Bruce Weaver@21:1/5 to Cosine on Tue Sep 6 12:56:51 2022
    I'm about a month late to this party, but I have a couple of thoughts. See below.

    On Tuesday, August 2, 2022 at 6:00:37 PM UTC-4, Cosine wrote:
    Hi:

    How do we determine the minimal number of samples required for a statistical experiment? For example, we found on the internet that "N < 30" is considered a set with a small number of samples. But how do we decide if the number of samples is too small?
    For example, are 2, 3, ..., 10 samples too small? Why is that? Any theory to support the decision? Likewise, what is the theory behind deciding that "N < 30" is a set with a small number of samples?

    Are you talking about the central limit theorem (CLT) and the so-called "rule of 30"? If so, remember that the shape of the sampling distribution of the mean depends on both the shape of the raw score (population) distribution and the sample size. If
    the population of raw scores is normal, the sampling distribution of the mean will be normal for any sample size (even n=1, in which case, it will be an exact copy of the normal population distribution). How large n must be to ensure that the sampling
    distribution of the mean is approximately normal depends on the shape of the population distribution. For many variables that are not too asymmetrical, n=30 may be enough. But for some other variables, it will not be enough.
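    A quick simulation makes that interplay concrete (a sketch assuming numpy;
    the right-skewed exponential population is just one example):

      # Sketch: how the sampling distribution of the mean depends on both the
      # population shape and n, using a right-skewed (exponential) population.
      import numpy as np

      rng = np.random.default_rng(0)
      population = rng.exponential(scale=1.0, size=1_000_000)   # skewness ~ 2

      for n in (2, 10, 30, 100):
          # 20,000 sample means, each from a sample of size n
          means = rng.choice(population, size=(20_000, n)).mean(axis=1)
          skew = np.mean((means - means.mean()) ** 3) / means.std() ** 3
          print(f"n = {n:3d}: skewness of the sample means = {skew:.2f}")

    The skewness of the sample means shrinks roughly as 1/sqrt(n), so how large
    n must be before "approximately normal" is reached depends on how skewed the
    population is to start with.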

    If this is what you were asking about, you may find some of the following discussion interesting:

    https://stats.stackexchange.com/questions/2541/what-references-should-be-cited-to-support-using-30-as-a-large-enough-sample-siz


    Next, let's consider the number of metrics (e.g., accuracy and specificity) analyzed in the experiment. If we use too many metrics, it would be considered that we are fishing the dataset. But again, how do we determine the proper number of metrics
    analyzed in the experiment?

    You talk about accuracy and specificity. But I wonder if you are really just talking about having multiple dependent (or outcome) variables--i.e., the so-called multiplicity problem. If you are, I recommend two 2005 Lancet articles by Schulz and Grimes
    (links below). For me, they are two of the most thoughtful articles I have read on the multiplicity problem. HTH.

    https://pubmed.ncbi.nlm.nih.gov/15866314/
    https://pubmed.ncbi.nlm.nih.gov/15885299/

  • From Rich Ulrich@21:1/5 to bweaver@lakeheadu.ca on Wed Sep 7 13:26:23 2022
    On Tue, 6 Sep 2022 12:56:51 -0700 (PDT), Bruce Weaver
    <bweaver@lakeheadu.ca> wrote:

    I'm about a month late to this party, but I have a couple of thoughts. See below.

    Bruce - This does not indicate that you saw the long reply from me.

    I talked about power analysis; also, the multiplicity problem. What you
    add about normality is good. And the references.



    On Tuesday, August 2, 2022 at 6:00:37 PM UTC-4, Cosine wrote:
    Hi:

    How do we determine the minimal number of samples required for a statistical experiment? For example, we found on the internet that "N < 30" is considered a set with a small number of samples. But how do we decide if the number of samples is too small?
    For example, are 2, 3, ..., 10 samples too small? Why is that? Any theory to support the decision? Likewise, what is the theory behind deciding that "N < 30" is a set with a small number of samples?

    Are you talking about the central limit theorem (CLT) and the so-called "rule of 30"? If so, remember that the shape of the sampling distribution of the mean depends on both the shape of the raw score (population) distribution and the sample size. If
    the population of raw scores is normal, the sampling distribution of the mean will be normal for any sample size (even n=1, in which case, it will be an exact copy of the normal population distribution). How large n must be to ensure that the sampling
    distribution of the mean is approximately normal depends on the shape of the population distribution. For many variables that are not too asymmetrical, n=30 may be enough. But for some other variables, it will not be enough.

    If this is what you were asking about, you may find some of the following discussion interesting:

    https://stats.stackexchange.com/questions/2541/what-references-should-be-cited-to-support-using-30-as-a-large-enough-sample-siz

    Interesting comments.



    Next, let's consider the number of metrics (e.g., accuracy and specificity) analyzed in the experiment. If we use too many metrics, it would be considered that we are fishing the dataset. But again, how do we determine the proper number of metrics
    analyzed in the experiment?

    You talk about accuracy and specificity. But I wonder if you are really just talking about having multiple dependent (or outcome) variables--i.e., the so-called multiplicity problem. If you are, I recommend two 2005 Lancet articles by Schulz and
    Grimes (links below). For me, they are two of the most thoughtful articles I have read on the multiplicity problem. HTH.

    https://pubmed.ncbi.nlm.nih.gov/15866314/

    This Abstract annoys me a little. FOCUS. Any /randomized/
    trial generally has a purpose, an aim, an intention or goal that
    should be reduced to one hypothesis (or two, not more than
    three). In the NIMH grants that I worked on, NIMH review
    insisted on planning the central testing in advance -- I hope.
    Beyond the main test, there were 'confirmatory' and descriptive
    analyses; and exploratory results.

    However, the PIs of those grants had some freedom in what
    they reported, so they might need this advice, "Respect your
    a-priori planning." This Abstract does not mention that.

    A friend who worked at VA, which funded its own studies, once
    complained (1990s, maybe -- could be different now) that the VA
    research structure was overly iron-clad in that respect; that is,
    it was really tough for a PI to write up any test that had not been
    described in the proposal.


    https://pubmed.ncbi.nlm.nih.gov/15885299/

    That Abstract reads nicely enough.

    --
    Rich Ulrich

  • From Bruce Weaver@21:1/5 to Rich Ulrich on Wed Sep 7 13:25:34 2022
    On Wednesday, September 7, 2022 at 1:26:30 PM UTC-4, Rich Ulrich wrote:
    On Tue, 6 Sep 2022 12:56:51 -0700 (PDT), Bruce Weaver
    <bwe...@lakeheadu.ca> wrote:

    I'm about a month late to this party, but I have a couple of thoughts. See below.
    Bruce - This does not indicate that you saw the long reply from me.

    I talked about power analysis; also, the multiplicity problem. What you
    add about normality is good. And the references.
    --- snip ---

    Hi Rich. I had seen your post, but clearly skimmed through it too quickly, because I missed that you had talked about multiplicity. Sorry about that.

    Your comment in your later reply about writing up tests that were not in the proposal reminded me of this recent article, which I think is very good.

    Hollenbeck, J. R., & Wright, P. M. (2017). Harking, sharking, and tharking: Making the case for post hoc analysis of scientific data. Journal of Management, 43(1), 5-18. https://journals.sagepub.com/doi/full/10.1177/0149206316679487

    I don't know if that link will work for everyone, but it might.

    Cheers,
    Bruce

  • From Rich Ulrich@21:1/5 to bweaver@lakeheadu.ca on Wed Sep 7 18:49:18 2022
    On Wed, 7 Sep 2022 13:25:34 -0700 (PDT), Bruce Weaver
    <bweaver@lakeheadu.ca> wrote:

    On Wednesday, September 7, 2022 at 1:26:30 PM UTC-4, Rich Ulrich wrote:
    On Tue, 6 Sep 2022 12:56:51 -0700 (PDT), Bruce Weaver
    <bwe...@lakeheadu.ca> wrote:

    I'm about a month late to this party, but I have a couple of thoughts. See below.
    Bruce - This does not indicate that you saw the long reply from me.

    I talked about power analysis; also, the multiplicity problem. What you
    add about normality is good. And the references.
    --- snip ---

    Hi Rich. I had seen your post, but clearly skimmed through it too quickly, because I missed that you had talked about multiplicity. Sorry about that.

    Your comment in your later reply about writing up tests that were not in the proposal reminded me of this recent article, which I think is very good.

    Hollenbeck, J. R., & Wright, P. M. (2017). Harking, sharking, and tharking: Making the case for post hoc analysis of scientific data. Journal of Management, 43(1), 5-18. https://journals.sagepub.com/doi/full/10.1177/0149206316679487

    I don't know if that link will work for everyone, but it might.


    The link works for me.

    Very good article, for offsetting some bad practices.

    The ongoing difficulty is providing a high quality of
    tharking. Breakthrough? Bad guess?

    What already happens for health findings is that the
    slight hint of a Big Deal gets grabbed and publicized ...
    and, a few years later, when the hypothesis proves to
    be a bust, it becomes One More Example of 'Scientists misled
    us again, and they don't really know anything.'

    --
    Rich Ulrich
