Forum: >>> Magnum BBS <<<

COVID and statistics

From Steve Pope@21:1/5 to All on Mon Mar 30 21:41:57 2020

I hope everyone is well. This one is long and possibly boring.

Since we are apparently all epidemiologists now, I thought I'd give
it a try along with everyone else. I have three technical
questions below.

USA total, and USA by state statistics are available at the COVID
Tracking Project among other places. I've seen a lot of actors
draw conclusions from such data using pure chartology - no
statistical analysis.

Typically they want to show that the epidemic is growing at different
rates in different locations (which it is). But since the epidemic also
got started at different times in different locations, they use a
heuristic "zero day".

For example, the number of test positives reached 1.75 per million
population on March 7 in California, but not until March 9 in USA
overall. Using this two day offset, the chartist can slide the curves
under comparison on top of each other for visual discussion.
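A minimal sketch of that alignment heuristic, for illustration only. The function name, the two series, and the populations are made up here (they are not COVID Tracking Project figures); only the 1.75 per million threshold comes from the post.

```python
from datetime import date, timedelta

def zero_day(dates, cumulative, population, threshold_per_million=1.75):
    """Return the first date on which cumulative positives per million
    population reach the threshold, or None if they never do."""
    for day, count in zip(dates, cumulative):
        if count / population * 1e6 >= threshold_per_million:
            return day
    return None

# Synthetic series, purely hypothetical (not real data).
start = date(2020, 3, 1)
days = [start + timedelta(days=i) for i in range(10)]
region_a = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
region_b = [1, 1, 2, 4, 8, 16, 32, 64, 128, 256]
pop_a = pop_b = 10_000_000  # hypothetical populations

za = zero_day(days, region_a, pop_a)
zb = zero_day(days, region_b, pop_b)
offset_days = (zb - za).days  # how far to slide region_b back onto region_a
```

The result depends entirely on the chosen threshold, which is what makes this a heuristic rather than an estimate.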

(Before we get sidetracked here: I know there are huge numbers
of confounders. My questions are limited to just the mathematics
at hand.)

(1) First question. Is there a more scientific way of selecting this
"zero day" other than by a randomly-chosen heuristic?

Taking the above data and dates, I tested the hypothesis that
the test positives have been growing at a different exponential rate
in California than in USA overall. I did this by first taking the
log of the cumulative test positives, then taking the first-order
difference of this, and plugging the result into a plain vanilla
Student's t-test.
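The procedure just described might look roughly like the sketch below. It assumes a pooled two-sample test was meant (the post does not say whether the test was two-sample or paired), and the counts are made up, not the real California or USA figures.

```python
import math
from statistics import mean, variance

def log_growth_rates(cumulative):
    """First difference of log10(cumulative counts), in decades per day."""
    logs = [math.log10(c) for c in cumulative]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]

def pooled_t(x, y):
    """Two-sample Student's t statistic (equal variances assumed) and its
    degrees of freedom."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, nx + ny - 2

# Made-up cumulative counts, NOT the real California or USA figures.
slower = [100, 120, 150, 180, 225, 270, 340]
faster = [100, 135, 175, 240, 310, 420, 560]

t_stat, dof = pooled_t(log_growth_rates(slower), log_growth_rates(faster))
```

A negative `t_stat` means the first series is growing more slowly on average. The p-value would then come from a t distribution with `dof` degrees of freedom, via a table or a statistics package.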

The result is that, with 99.4% confidence, the California curve is
growing more slowly than the USA curve, at 0.087 decades per day as
opposed to 0.120 decades per day.
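For readers more used to doubling times, a rate in decades per day converts as log10(2) divided by the rate. A small sketch (the function name is just illustrative):

```python
import math

def doubling_time_days(decades_per_day):
    """Days needed for the count to double at a given log10 growth rate."""
    return math.log10(2) / decades_per_day

california = doubling_time_days(0.087)  # roughly 3.5 days
usa = doubling_time_days(0.120)         # roughly 2.5 days
```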

(2) Second question. A friend who works in health sciences suggests
that a Student's t-test is not a good choice and that I should use an
ARIMA model to "correct for autocorrelation on the regressions".
For this simple hypothesis I don't see the need myself, but is my
friend's observation valid?
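One quick way to see whether the concern applies is to estimate the lag-1 autocorrelation of the differenced series. The sketch below also applies the common AR(1) rule of thumb n(1 - r)/(1 + r) for the effective number of independent samples. That rule is an approximation brought in here for illustration, not something from the friend's suggestion, and the series is made up.

```python
from statistics import mean

def lag1_autocorrelation(x):
    """Sample autocorrelation of x at lag 1."""
    m = mean(x)
    num = sum((a - m) * (b - m) for a, b in zip(x, x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den

def effective_sample_size(x):
    """Rule-of-thumb effective sample size for an AR(1)-like series."""
    r = lag1_autocorrelation(x)
    return len(x) * (1 - r) / (1 + r)

# Made-up daily growth rates that drift steadily downward (not real data).
series = [0.14, 0.13, 0.13, 0.12, 0.11, 0.11, 0.10, 0.09, 0.09, 0.08]

r1 = lag1_autocorrelation(series)
n_eff = effective_sample_size(series)
```

If `r1` is clearly positive, `n_eff` falls well below the nominal sample size, and a t-test that treats the points as independent will overstate its confidence.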

(Meta-question: is it common to whiten the data before performing
a statistical test, and if so, why?)

(Other meta-question: doesn't an ARIMA tool, after it's done munging
and massaging the data series, perform a statistical test such as
Student's t, or a more generalized statistical test, anyway?)

On to the next question: I do want to improve upon my result. One
problem I have is that the early part of the series is noisier than
the later part, since there is more noise averaging over 6000
positive tests (reported in California on March 29) than over 70
positives (reported on March 7). (This is under the assumption that
there is a fixed noise component on each individual test result.)

What I'd like to do is some form of Maximal Ratio Combining to fix this.
I have not yet done this, but I can figure it out.

(3) Third question: is there a standard way of doing a statistical
test when the SNR in the data series is evolving over time?
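The usual statistical counterpart of maximal ratio combining is inverse-variance weighting (weighted least squares). A minimal sketch, assuming the noise variance of each daily rate scales as one over the count on that day. That scaling is an assumption built on the fixed-noise-per-test idea above, and the counts are made up.

```python
import math
from statistics import mean, stdev

def weighted_mean_and_se(values, weights):
    """Weighted mean and its standard error, assuming var(values[i]) is
    proportional to 1 / weights[i] with an unknown common scale."""
    n = len(values)
    total = sum(weights)
    m = sum(w * v for w, v in zip(weights, values)) / total
    # Estimate the common scale from the weighted residuals.
    scale = sum(w * (v - m) ** 2 for w, v in zip(weights, values)) / (n - 1)
    return m, math.sqrt(scale / total)

# Made-up cumulative counts, not real data.
cumulative = [70, 95, 130, 170, 230, 300, 400]
logs = [math.log10(c) for c in cumulative]
rates = [b - a for a, b in zip(logs, logs[1:])]

# ASSUMPTION: noise variance of each daily rate scales as 1 / count.
weights = cumulative[1:]

w_mean, w_se = weighted_mean_and_se(rates, weights)
plain_mean = mean(rates)
plain_se = stdev(rates) / math.sqrt(len(rates))
```

With equal weights this reduces to the ordinary mean and standard error, so the unweighted t-test is the special case of constant SNR.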

Thanks much if you have gotten this far.

Steve

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From pulpo@21:1/5 to Steve Pope on Mon Mar 30 17:04:39 2020

I think a Generalized Additive Model on the log of the data could help
determine whether the curves of two different states differ. You could
incorporate the autocorrelation in the model and also compute the
derivatives to see whether growth is slowing. Check this out:
http://jacolienvanrij.com/Tutorials/GAMM.html#summed-effects-with-without-random-effects

