• COVID and statistics

    From Steve Pope@21:1/5 to All on Mon Mar 30 21:41:57 2020
    I hope everyone is well. This one is long and possibly boring.

    Since we are now apparently all epidemiologists, I thought I'd give
    it a try along with everyone else. I have three technical
    questions below.

    USA-total and USA-by-state statistics are available at the COVID
    Tracking Project, among other places. I've seen a lot of actors
    draw conclusions from such data using pure chartology - no
    statistical analysis.

    Typically they want to show that the epidemic is growing at different
    rates in different locations (which it is). But since the epidemic also
    got started at different times in different locations, they use a
    heuristic "zero day".

    For example, the number of test positives reached 1.75 per million
    population on March 7 in California, but not until March 9 in the
    USA overall. Using this two-day offset, the chartist can slide the
    curves under comparison on top of each other for visual discussion.
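
    In Python terms the heuristic amounts to something like the sketch
    below. (The threshold and the series are made-up placeholders, not
    the real COVID Tracking Project numbers.)

    import numpy as np

    # The "zero day" heuristic: the first day a series crosses an
    # arbitrary per-capita threshold (1.75 per million in the example
    # above).  All data here are hypothetical.
    THRESHOLD = 1.75

    def zero_day_index(positives_per_million, threshold=THRESHOLD):
        """Index of the first day at or above the threshold."""
        above = np.nonzero(positives_per_million >= threshold)[0]
        return above[0] if above.size else None

    # Toy series: cumulative positives per million, one entry per day.
    ca  = np.array([0.5, 1.0, 1.8, 3.5, 7.0, 14.0])   # crosses at index 2
    usa = np.array([0.2, 0.4, 0.9, 1.9, 4.0,  8.5])   # crosses at index 3

    offset = zero_day_index(usa) - zero_day_index(ca)
    print(offset)   # the chartist slides one curve by this many days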

    (Before we get sidetracked here: I know there are huge numbers
    of confounders; my questions are limited to just the mathematics
    at hand.)

    (1) First question. Is there a more scientific way of selecting this
    "zero day" than an arbitrarily chosen threshold heuristic?

    Taking the above data and dates, I tested the hypothesis that
    the test positives have been growing at a different exponential rate
    in California than in the USA overall. I did this by first taking the
    log of the cumulative test positives, taking the first-order difference
    of that, and plugging the result into a plain-vanilla Student's t test.
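
    Concretely, the procedure is along these lines (the cumulative counts
    below are made-up placeholders standing in for the real data):

    import numpy as np
    from scipy import stats

    # Hypothetical cumulative test positives over the same run of days.
    ca_cum  = np.array([70, 95, 130, 180, 250, 340, 470, 640, 880, 1200])
    usa_cum = np.array([400, 560, 790, 1100, 1550, 2200, 3100, 4300,
                        6000, 8400])

    # Daily growth rate in "decades per day" = first difference of log10.
    ca_rate  = np.diff(np.log10(ca_cum))
    usa_rate = np.diff(np.log10(usa_cum))

    # Plain two-sample Student's t test on the daily growth rates
    # (pooled variance, scipy's default).
    t_stat, p_value = stats.ttest_ind(ca_rate, usa_rate)
    print(t_stat, p_value)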

    The result is that, with 99.4% probability, the California curve is
    slower than the USA curve, at 0.087 decades per day as opposed to
    0.120 decades per day. (A rate of 0.087 decades per day corresponds
    to a doubling roughly every 3.5 days, versus roughly every 2.5 days
    at 0.120 decades per day.)

    (2) Second question. A friend who works in health sciences suggests
    a Student's t test is not a good choice and that I should use an ARIMA
    model to "correct for autocorrelation on the regressions".
    Myself, for this simple hypothesis, I'm not seeing it, but is my
    friend's observation valid?

    (Meta-question: is it common to whiten the data before performing
    a statistical test, and if so, why?)

    (Other meta-question: doesn't an ARIMA tool, after it's done munging
    and massaging the data series, perform a statistical test such as
    Student's t, or some more general test, anyway?)
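
    For concreteness, here is one way I can read my friend's suggestion,
    using statsmodels. This is only my interpretation, not his actual
    recipe, and the growth-rate series are made-up placeholders.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical daily growth rates (decades/day) for CA and the USA
    # over the same dates, standing in for diff(log10(cumulative)).
    ca_rate  = np.array([0.14, 0.12, 0.10, 0.09, 0.08, 0.09, 0.08, 0.07])
    usa_rate = np.array([0.15, 0.14, 0.13, 0.12, 0.12, 0.11, 0.12, 0.11])

    # Model the day-by-day difference in growth rates as a constant plus
    # AR(1) noise, then ask whether the constant is zero.  The AR(1) term
    # is what "corrects for autocorrelation".
    rate_diff = ca_rate - usa_rate            # paired by date
    model = sm.tsa.statespace.SARIMAX(rate_diff, order=(1, 0, 0), trend="c")
    res = model.fit(disp=False)

    # The summary reports a z test on the intercept -- so, to the other
    # meta-question: yes, after the munging and massaging the ARIMA
    # machinery does run a Student-t-like test on the coefficient anyway.
    print(res.summary())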

    On to the next question: I do want to improve upon my result. One
    problem I have is that the early part of the series is noisier than
    the later part, since averaging over 6000 positive tests (reported in
    California on March 29) suppresses noise far more than averaging over
    70 positives (reported on March 7). (This is under the assumption that
    there is a fixed noise component on each individual test result.)

    What I'd like to do is some form of Maximal Ratio Combining to fix
    this. I have not yet done it, but I can figure it out; roughly, the
    idea is the sketch below.
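
    A weighted-least-squares version of the comparison, with each day's
    growth-rate estimate weighted by its count (weight proportional to
    1/variance under a rough Poisson-counting-noise assumption), would
    look something like this. Again, all numbers are made up.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical daily growth rates and the positive counts behind them.
    ca_rate   = np.array([0.14, 0.12, 0.10, 0.09, 0.08, 0.09, 0.08, 0.07])
    usa_rate  = np.array([0.15, 0.14, 0.13, 0.12, 0.12, 0.11, 0.12, 0.11])
    ca_count  = np.array([  70,  110,  170,  260,  400,  620,  950, 1500])
    usa_count = np.array([ 400,  620,  950, 1500, 2300, 3600, 5600, 8700])

    # Stack the two series and regress growth rate on a CA indicator,
    # weighting each day by its count.  Late, high-count days then count
    # for more, which is the maximal-ratio-combining idea.
    rate   = np.concatenate([ca_rate, usa_rate])
    is_ca  = np.concatenate([np.ones_like(ca_rate), np.zeros_like(usa_rate)])
    weight = np.concatenate([ca_count, usa_count])

    X = sm.add_constant(is_ca)
    res = sm.WLS(rate, X, weights=weight).fit()
    print(res.summary())   # the x1 row tests CA's rate against the USA's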

    (3) Third question: is there a standard way of doing a statistical
    test when the SNR in the data series is evolving over time?

    Thanks much if you have gotten this far.

    Steve

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From pulpo@21:1/5 to Steve Pope on Mon Mar 30 17:04:39 2020
    On Tuesday, March 31, 2020 at 6:42:00 AM UTC+9, Steve Pope wrote:
    > (1) First question. Is there a more scientific way of selecting this
    > "zero day" than an arbitrarily chosen threshold heuristic?

    [...]

    > (2) Second question. A friend who works in health sciences suggests
    > a Student's t test is not a good choice and that I should use an ARIMA
    > model to "correct for autocorrelation on the regressions".

    [...]

    > (3) Third question: is there a standard way of doing a statistical
    > test when the SNR in the data series is evolving over time?

    I think a Generalized Additive Model on the log of the data could help
    to determine whether the traces of two different states are different
    or not. You could incorporate the autocorrelation in the model and also
    compute the derivatives to see if it's getting slower or not. Check
    this out:
    http://jacolienvanrij.com/Tutorials/GAMM.html#summed-effects-with-without-random-effects
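
    A rough sketch of what I mean, in Python with the pygam package (the
    linked tutorial is R/mgcv, and the data below are made up, so treat
    this only as the shape of the model, not a recipe):

    import numpy as np
    from pygam import LinearGAM, s, f

    # Hypothetical log10 cumulative positives for two regions over the
    # same 20 days.
    days    = np.arange(20)
    log_usa = 2.6 + 0.120 * days + 0.03 * np.random.randn(20)
    log_ca  = 1.8 + 0.087 * days + 0.03 * np.random.randn(20)

    # Long format: column 0 = day, column 1 = region code (0 USA, 1 CA).
    X = np.column_stack([np.tile(days, 2), np.repeat([0, 1], 20)])
    y = np.concatenate([log_usa, log_ca])

    # Smooth of day plus a region factor.  A by-region smooth (as in the
    # linked mgcv tutorial) would let the curve *shapes* differ too, and
    # its derivative tells you whether one trace is slowing down.
    gam = LinearGAM(s(0) + f(1)).fit(X, y)
    gam.summary()

    (The autocorrelation part is handled more naturally by the R tooling
    in the link; the sketch above only covers the smooth-plus-factor
    structure.)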

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)