• Q interpretation of statistically negative values

    From Cosine@21:1/5 to All on Sat Sep 24 00:29:03 2022
    Hi:

    When doing statistical analysis, we often compute the mean and standard error (SE) of the sample. Then we examine the cumulative probability of an interval centered at the mean and extending some number of SEs in the positive and negative directions.
    However, such an interval sometimes includes negative values. How do we interpret this kind of result if the variable, by definition, must always be positive, e.g., age, weight, height, or salary?

  • From Rich Ulrich@21:1/5 to All on Sat Sep 24 13:47:05 2022
    On Sat, 24 Sep 2022 00:29:03 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:

    > Hi:
    >
    > When doing statistical analysis, we often compute the mean and
    > standard error (SE) of the sample. Then we examine the cumulative
    > probability of an interval centered at the mean and extending some
    > number of SEs in the positive and negative directions. However,
    > such an interval sometimes includes negative values. How do we
    > interpret this kind of result if the variable, by definition, must
    > always be positive, e.g., age, weight, height, or salary?

    Q: What does it mean when the computed confidence interval
    extends beyond the range of the variable?

    A: The assumptions for constructing and using the CI as an accurate
    indicator have not been met. And if one tail is obviously too long,
    the other tail is often too short, which might be a concern.

    In my experience, I have seen people confused by CIs on proportions
    when the intervals went beyond 0 or 100%. The statistical literature
    contains several alternatives for those CIs, which vary the
    assumptions about the underlying distribution ("logistic"?) and
    construct intervals that are more precise and legitimate. (Note:
    approximations can be easier to compute than exact answers.)
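
    One of those alternatives is the Wilson score interval. A minimal
    sketch, assuming Python with SciPy (the choice of the Wilson
    interval here is illustrative, not a claim about which alternative
    was meant above):

        from scipy.stats import norm

        def wald_ci(k, n, level=0.95):
            # Naive normal-approximation (Wald) CI for a proportion;
            # its endpoints can fall outside [0, 1].
            z = norm.ppf(1 - (1 - level) / 2)
            p = k / n
            half = z * (p * (1 - p) / n) ** 0.5
            return p - half, p + half

        def wilson_ci(k, n, level=0.95):
            # Wilson score CI; always contained in [0, 1].
            z = norm.ppf(1 - (1 - level) / 2)
            p = k / n
            center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
            half = (z / (1 + z**2 / n)) * (
                p * (1 - p) / n + z**2 / (4 * n**2)) ** 0.5
            return center - half, center + half

        print(wald_ci(1, 20))    # about (-0.05, 0.15): dips below zero
        print(wilson_ci(1, 20))  # about (0.01, 0.24): stays inside [0, 1]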

    For natural measures which have a large range and are never zero,
    starting with the log transformation is often appropriate: Transform;
    get the average; back-transform if you prefer the original units.
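
    A minimal sketch of that recipe, assuming Python with NumPy and
    SciPy (the salary figures are made up for illustration):

        import numpy as np
        from scipy.stats import t

        def log_scale_ci(x, level=0.95):
            # Transform, average, back-transform: a CI for the geometric
            # mean of strictly positive data. Exponentiating the
            # endpoints guarantees they stay positive.
            logs = np.log(x)
            m = logs.mean()
            se = logs.std(ddof=1) / np.sqrt(len(logs))
            half = t.ppf(1 - (1 - level) / 2, df=len(logs) - 1) * se
            return np.exp(m - half), np.exp(m + half)

        salaries = np.array([28e3, 31e3, 35e3, 42e3, 55e3, 61e3, 90e3, 240e3])
        print(log_scale_ci(salaries))  # both endpoints positive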

    For well-behaved distributions, transformations to achieve "equal
    interval" (in the measurement space of whatever matters) will
    usually give good CIs.
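
    One common way to hunt for such a transformation is the Box-Cox
    family, of which the log is the lambda = 0 member. A minimal sketch,
    assuming Python with SciPy (the gamma sample is simulated just to
    have something positive and skewed):

        import numpy as np
        from scipy.stats import boxcox, t
        from scipy.special import inv_boxcox

        rng = np.random.default_rng(3)
        x = rng.gamma(shape=2.0, scale=5.0, size=80)  # positive, skewed

        y, lam = boxcox(x)  # lambda chosen by maximum likelihood
        m = y.mean()
        se = y.std(ddof=1) / np.sqrt(len(y))
        half = t.ppf(0.975, df=len(y) - 1) * se
        # Back-transform the endpoints to the original units.
        print(inv_boxcox(m - half, lam), inv_boxcox(m + half, lam))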

    For data in hand whose distribution is not well-behaved, you might be
    well-advised to switch from the mean to the median as your central
    measure, and to use some version of ranges instead of the standard
    deviation/error. Bootstrap methods are used in some problems to
    overcome the "oddness" of distributions.

    --
    Rich Ulrich

  • From Cosine@21:1/5 to All on Sat Sep 24 16:11:09 2022
    Hi:

    Thank you for replying.

    However, different transformations distort the original number line in different ways.

    For example, while the log function transforms the original non-negative number line [0, inf] to the full number line [-inf, inf], it "expands" the part [0, 1] to [-inf, 0]. If we use another nonlinear transformation, we will get a different distortion. After all, we only require the transformation to be one-to-one.

    Since the width of the confidence interval represents a cumulative proportion, would the choice of transformation affect the determination of statistical significance?

  • From Rich Ulrich@21:1/5 to All on Sat Sep 24 23:00:05 2022
    On Sat, 24 Sep 2022 16:11:09 -0700 (PDT), Cosine <asecant@gmail.com>
    wrote:

    > Hi:
    >
    > Thank you for replying.
    >
    > However, different transformations distort the original number
    > line in different ways.

    That does not deserve a "However,"....

    Yes, you will compute different values when formulas use
    different assumptions. As I wrote,

    * * For well-behaved distributions, transformations to achieve "equal
    interval" (in the measurement space of whatever matters) will
    usually give good CIs. * *



    > For example, while the log function transforms the original
    > non-negative number line [0, inf] to the full number line
    > [-inf, inf], it "expands" the part [0, 1] to [-inf, 0]. If we use
    > another nonlinear transformation, we will get a different
    > distortion. After all, we only require the transformation to be
    > one-to-one.

    I don't take the log of zero. Undefined, not -inf.

    Also note: Some people misconstrue "equal intervals." Wealth is
    measured in dollars; 'dollars' are seen (erroneously) to make the
    factor linear and equal-interval when /measured/ in dollars. But
    adding a million dollars is a grossly different contribution to
    'wealth' depending on the start -- there are unequal intervals
    at the extremes. Think of the variables as 'latent factors' for
    what you are interested in, and imagine what makes equal intervals
    for that factor. Like 'wealth' or whatever, the available units are
    often misleading.


    > Since the width of the confidence interval represents a cumulative
    > proportion, would the choice of transformation affect the
    > determination of statistical significance?

    If you want a statement about cumulative proportions, the
    safe way is to use rank-order. The range from the 40th to
    the 60th percentile (for instance) will be a 95% CI for the
    median, for some easily computed N.
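
    A minimal sketch of that computation, assuming Python with SciPy
    (the exponential sample is simulated for illustration):

        import numpy as np
        from scipy.stats import binom

        def median_ci(x, level=0.95):
            # Distribution-free CI for the median from order statistics:
            # take the largest k with P(Binomial(n, 1/2) <= k) <= alpha/2;
            # the (k+1)-th and (n-k)-th order statistics then bracket the
            # median with at least the requested coverage.
            x = np.sort(np.asarray(x))
            n = len(x)
            alpha = 1 - level
            k = int(binom.ppf(alpha / 2, n, 0.5))
            if binom.cdf(k, n, 0.5) > alpha / 2:
                k -= 1
            if k < 0:
                raise ValueError("sample too small for this level")
            return x[k], x[n - k - 1]

        rng = np.random.default_rng(1)
        sample = rng.exponential(scale=10.0, size=100)
        # For n = 100 this brackets the median with the 40th and 61st
        # order statistics -- roughly the 40th-60th percentile range.
        print(median_ci(sample))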

    "Statistical significance" (to me) implies testing, rather than
    presenting CIs. If you don't have 'equal intervals' in the
    sense I describe above, your testing will be deficient to some
    extent.

    Does it matter? The usual tests are pretty robust against
    moderate distortion of scaling, when you use the usual 5% test
    size (actual size remains in the range 4-6%). ANOVA tests at
    0.001 on moderately skewed distributions are often wrong
    by five-fold or more.
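
    That claim is easy to probe by simulation. A minimal sketch,
    assuming Python with NumPy and SciPy, using a one-sample t test on
    lognormal data as a stand-in for the ANOVA case:

        import numpy as np
        from scipy.stats import ttest_1samp

        rng = np.random.default_rng(2)
        true_mean = np.exp(0.5)  # mean of a lognormal(0, 1) variable
        reps, hits_05, hits_001 = 20000, 0, 0
        for _ in range(reps):
            x = rng.lognormal(0.0, 1.0, size=30)
            p = ttest_1samp(x, popmean=true_mean).pvalue
            hits_05 += p < 0.05
            hits_001 += p < 0.001
        # Estimated actual size at each nominal level; the relative
        # inflation is typically far worse at 0.001 than at 0.05.
        print(hits_05 / reps, hits_001 / reps)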

    Extremely fat tails or far outliers mess up p-values even at the
    5% size. This is why cleaning your data takes at least 90% of
    the time of a competent data analyst hired for a job: We
    want to know for ourselves that the means will be meaningful,
    et cetera. That usually means fixing stuff, or writing cautions
    at the end.

    --
    Rich Ulrich
