• Differences between R and SPSS on probit analysis

    From bmf2doa@gmail.com@21:1/5 to All on Thu Feb 23 12:12:41 2017
    Hi!!!
    I'm working on the effects of alternative larvicides on Aedes aegypti. Right now, I am doing a binary mortality response with a single explanatory variable (dose) on 4 concentrations of one larvicide (+ control). Our university is fond of SPSS, and I
    have learned to conduct the basic probit model with it, including a natural logarithm transformation on my dosis data.
    Not so long ago, I've started working with R, and through a combination of the 'glm' and 'dose.p' functions, I get the same slope and intercept, as well as LD50 calculations. Nevertheless, the standard errors and Z-scores calculated through the Probit
    model in SPSS comes out completely different in R. Additionally, the 95% confidence intervals for the LD50 come out differently between the two programs. I really don't have a clue on how I am getting the same slopes, intercepts and LD50's, but totally
    different SE, Z, and 95% CI. Can anybody help me so I can get the same results in R??
    I'll pass you the script and hypothetical data:
    dose <- c(6000, 4500, 3000, 1500, 0)
    total <- c(100, 100, 100, 100, 100)
    affected <- c(91, 82, 69, 49, 0)
    finney71 <- data.frame(dose, total, affected)
    fm1 <- glm(affected/total ~ log(dose),
    family=binomial(link = probit), data=finney71[finney71$dose != 0, ])
    xp1 <- dose.p(fm1, p=c(0.5,0.9))
    xp.ci <- xp1 + attr(xp1, "SE") %*% matrix(qnorm(1 - 0.05/2)*c(-1,1), nrow=1) EAUS.Aa <- exp(cbind(xp1, attr(xp1, "SE"), xp.ci[,1], xp.ci[,2])) dimnames(EAUS.Aa)[[2]] <- c("LD", "SE", "LCL","UCL")
    So, this is the regression results I get with R:
    summary(fm1)
    Deviance Residuals:
    1 2 3 4
    0.06655 -0.02814 -0.06268 0.03474
    Coefficients:
    Estimate Std. Error z value
    (Intercept) -6.8940 10.7802 -0.640
    log(dose) 0.9333 1.3441 0.694
    |z|)
    (Intercept) 0.522
    log(dose) 0.487
    (Dispersion parameter for binomial family taken to be 1)
    Null deviance: 0.513878 on 3 degrees of freedom
    Residual deviance: 0.010356 on 2 degrees of freedom
    AIC: 6.5458
    Number of Fisher Scoring iterations: 5
    And the LD50 and CI transformed:
    print(EAUS.Aa)
    LD SE LCL UCL
    p = 0.5: 1614.444 3.207876 164.3822 15855.91
    p = 0.9: 6373.473 3.764879 474.1600 85669.72
    These are the values I get on SPSS (just replacing the values on R output) : Coefficients:
    Estimate Std. Error z value
    (Intercept) -6.8940 1.082 -6.373
    (dose) 2.149 0.311 6.918
    And the LD50 and CI transformed:
    LD LCL UCL
    p = 0.5: 1614.444 1198.932 1953.120
    p = 0.9: 6373.473 5145.767 9013.354
    So, please if somebody can help me with this, I'd be grateful. If working with those functions won't do it, I'll use another, the one you recommend.
    Thank you very much!
    PD. I've already googled it but there's no satisfactory answer.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Duffy@21:1/5 to All on Fri Feb 24 23:40:55 2017
    bmf2doa@gmail.com wrote:
    (+ control). Our university is fond of SPSS, and I have learned to
    conduct the basic probit model with it
    I'll pass you the script and hypothetical data:
    dose <- c(6000, 4500, 3000, 1500, 0)
    total <- c(100, 100, 100, 100, 100)
    affected <- c(91, 82, 69, 49, 0)
    R MASS::dose.p
    LD SE LCL UCL
    p = 0.5: 1614.444 3.207876 164.3822 15855.91
    SPSS
    LD LCL UCL
    p = 0.5: 1614.444 1198.932 1953.120

    I'm too lazy to work out where your SPSS programming went wrong, but
    with such a small dataset, don't you think the CIs should be wide rather
    than narrow?

    predict(fm1, newdata=data.frame(dose=1614.444), type="response", se.fit=TRUE) $fit 0.5 $se.fit 0.4339901

    or, using some Monte Carlo based CI estimate,
    library("HelpersMG")
    LD50(finney71[1:4,], equation="probit")
    The LD50 is 7.387 SE 1.231
    The lower limit of transitional range of doses is 5.625 SE 0.929
    The higher limit of transitional range of doses is 9.149 SE 1.278

    Incidentally, when we are log transforming a range of values including 0,
    it is common to try and include this data point (fudge up to 0.5 or 1).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich Ulrich@21:1/5 to davidD@qimr.edu.au on Fri Feb 24 21:11:47 2017
    On Fri, 24 Feb 2017 23:40:55 +0000 (UTC), David Duffy
    <davidD@qimr.edu.au> wrote:

    bmf2doa@gmail.com wrote:
    (+ control). Our university is fond of SPSS, and I have learned to
    conduct the basic probit model with it
    I'll pass you the script and hypothetical data:
    dose <- c(6000, 4500, 3000, 1500, 0)
    total <- c(100, 100, 100, 100, 100)
    affected <- c(91, 82, 69, 49, 0)
    R MASS::dose.p
    LD SE LCL UCL
    p = 0.5: 1614.444 3.207876 164.3822 15855.91
    SPSS
    LD LCL UCL
    p = 0.5: 1614.444 1198.932 1953.120

    I'm too lazy to work out where your SPSS programming went wrong, but
    with such a small dataset, don't you think the CIs should be wide rather
    than narrow?

    predict(fm1, newdata=data.frame(dose=1614.444), type="response", se.fit=TRUE) >$fit 0.5 $se.fit 0.4339901

    or, using some Monte Carlo based CI estimate,
    library("HelpersMG")
    LD50(finney71[1:4,], equation="probit")
    The LD50 is 7.387 SE 1.231
    The lower limit of transitional range of doses is 5.625 SE 0.929
    The higher limit of transitional range of doses is 9.149 SE 1.278

    Incidentally, when we are log transforming a range of values including 0,
    it is common to try and include this data point (fudge up to 0.5 or 1).

    Are you saying that the 0 is being ignored in both analyses, since
    log(0) is undefined?

    I have an ignorant question. Or observation, leading to a question.

    I've never run R, and I've never run probit. I Googled to
    try to find what "finney71( ) referred to, and that led to an
    ambiguous result: An example using R (which I do not use)
    seemed to show, in a plot, that the "0" was scored as 0 while
    the logarithm was used for the other values. - and there was
    a warning that "outlying" values could create non-robust SEs
    and models. - and the model would depend on the scaling
    of the dose variable, you might note, which /should/ be
    a very dubious feature.

    Well, for the data above, that would lead to a strong effect:
    all the log(dose) values are in a narrow range, compared to 0;
    and the 0-dose result is 0 affected, compared to 49%-91%
    for the rest. So, by chance, it might fit a fairly linear model.
    But, still....

    So, question: What does the "Finney71( )" mean? - beyond
    the fact that it refers to /something/ in Finney's fine 1971 book
    on Probit analysis.

    --
    Rich Ulrich

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Duffy@21:1/5 to Rich Ulrich on Sat Feb 25 06:34:03 2017
    Rich Ulrich <rich.ulrich@comcast.net> wrote:
    On Fri, 24 Feb 2017 23:40:55 +0000 (UTC), David Duffy
    <davidD@qimr.edu.au> wrote:

    bmf2doa@gmail.com wrote:
    (+ control). Our university is fond of SPSS, and I have learned to
    conduct the basic probit model with it
    I'll pass you the script and hypothetical data:
    dose <- c(6000, 4500, 3000, 1500, 0)
    total <- c(100, 100, 100, 100, 100)
    affected <- c(91, 82, 69, 49, 0)
    R MASS::dose.p
    LD SE LCL UCL
    p = 0.5: 1614.444 3.207876 164.3822 15855.91
    SPSS
    LD LCL UCL
    p = 0.5: 1614.444 1198.932 1953.120

    Are you saying that the 0 is being ignored in both analyses, since
    log(0) is undefined?
    So, question: What does the "Finney71( )" mean? - beyond

    I presume it's an example.

    Nothing like Usenet to parade one's mistakes (meaning mine)! I should
    downvote myself. The correct R code should have been:
    finney71$ldose <- log(dose)
    finney71$ldose[dose==0] <- 0

    fm1 <- glm(cbind(affected, total-affected) ~ ldose,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    family=binomial(link=probit),data=finney71)
    ld50 <- dose.p(fm1)
    exp(ld50 + c(0,-1.96,1.96)*attr(ld50,"SE"))
    [1] 1614.444 1284.708 2028.811

    Closer to the SPSS result. Again, showing my ignorance,
    including/excluding the zero dose made no difference if logdose was set
    to zero (recall 0/100 affected, so more important if nonzero base rate).

    Cheers, David Duffy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Smith@21:1/5 to All on Tue Feb 28 08:30:39 2017
    I notice that the results from R are for log-dose. Results from SPSS are for dose. The results have to be different.

    Results from R
    Estimate Std. Error z value
    (Intercept) -6.8940 10.7802 -0.640
    log(dose) 0.9333 1.3441 0.694

    Results from SPSS
    Estimate Std. Error z value
    (Intercept) -6.8940 1.082 -6.373
    (dose) 2.149 0.311 6.918


    Results from R give the same coefficients that I get from Stata (see below) but the standard errors are bigger in R by a factor of 10 and the z values are smaller proportionally.

    I believe that the reason is that R is not taking into account the sample sizes that you have for each rate. If R is treating the sample size as 1 instead of 100 then this would result in a standard error bigger by a factor of 10, which is about what you
    have compared with Stata.

    For the intercept your results show a similar difference between R and SPSS. This suggests that SPSS is accounting for the frequency counts. I was not able to reproduce your results for SPSS.

    If you use the proper frequency weights then all your confidence intervals will be shorter including those for the LD-50 estimate.

    Does R permit the specification of a frequency weight? If not you will need a data set with 500 observations and lots of duplicates.

    Results from Stata for log-dose using logistic regression (400 observations)

    Coef. Std. Err. z
    constant -6.893981 1.08167 -6.37
    lndose .9332907 .1349154 6.92

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Duffy@21:1/5 to David Smith on Tue Feb 28 22:59:02 2017
    David Smith <davidwsmithphd@gmail.com> wrote:
    I notice that the results from R are for log-dose. Results from SPSS are for dose. The results have to be different.

    Results from Stata for log-dose using logistic regression (400 observations)

    Coef. Std. Err. z
    constant -6.893981 1.08167 -6.37
    lndose .9332907 .1349154 6.92

    As I pointed out above, the R syntax for the model was incorrect. Once corrected, one obtains for the probit link:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept) -6.8940 1.0780 -6.395 1.60e-10 ***
    log(dose) 0.9333 0.1344 6.944 3.82e-12 ***

    I am guessing you meant probit rather than logistic regression.
    The logit link gives
    Coefficients:
    Estimate Std. Error z value Pr(>|z|)
    (Intercept) -11.4797 1.8312 -6.269 3.64e-10 ***
    log(dose) 1.5539 0.2301 6.753 1.45e-11 ***

    If the dose are not logged, the coefficients are wildly different from this,
    so I agree I don't know exactly what model was fitted in SPSS, and given the
    OP never responded...

    Cheers, David Duffy.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From erburgess83@gmail.com@21:1/5 to bmf...@gmail.com on Thu Jun 22 10:12:15 2017
    On Thursday, February 23, 2017 at 2:12:42 PM UTC-6, bmf...@gmail.com wrote:
    Hi!!!
    I'm working on the effects of alternative larvicides on Aedes aegypti. Right now, I am doing a binary mortality response with a single explanatory variable (dose) on 4 concentrations of one larvicide (+ control). Our university is fond of SPSS, and I
    have learned to conduct the basic probit model with it, including a natural logarithm transformation on my dosis data.
    Not so long ago, I've started working with R, and through a combination of the 'glm' and 'dose.p' functions, I get the same slope and intercept, as well as LD50 calculations. Nevertheless, the standard errors and Z-scores calculated through the Probit
    model in SPSS comes out completely different in R. Additionally, the 95% confidence intervals for the LD50 come out differently between the two programs. I really don't have a clue on how I am getting the same slopes, intercepts and LD50's, but totally
    different SE, Z, and 95% CI. Can anybody help me so I can get the same results in R??
    I'll pass you the script and hypothetical data:
    dose <- c(6000, 4500, 3000, 1500, 0)
    total <- c(100, 100, 100, 100, 100)
    affected <- c(91, 82, 69, 49, 0)
    finney71 <- data.frame(dose, total, affected)
    fm1 <- glm(affected/total ~ log(dose),
    family=binomial(link = probit), data=finney71[finney71$dose != 0, ])
    xp1 <- dose.p(fm1, p=c(0.5,0.9))
    xp.ci <- xp1 + attr(xp1, "SE") %*% matrix(qnorm(1 - 0.05/2)*c(-1,1), nrow=1) EAUS.Aa <- exp(cbind(xp1, attr(xp1, "SE"), xp.ci[,1], xp.ci[,2])) dimnames(EAUS.Aa)[[2]] <- c("LD", "SE", "LCL","UCL")
    So, this is the regression results I get with R:
    summary(fm1)
    Deviance Residuals:
    1 2 3 4
    0.06655 -0.02814 -0.06268 0.03474
    Coefficients:
    Estimate Std. Error z value
    (Intercept) -6.8940 10.7802 -0.640
    log(dose) 0.9333 1.3441 0.694
    |z|)
    (Intercept) 0.522
    log(dose) 0.487
    (Dispersion parameter for binomial family taken to be 1)
    Null deviance: 0.513878 on 3 degrees of freedom
    Residual deviance: 0.010356 on 2 degrees of freedom
    AIC: 6.5458
    Number of Fisher Scoring iterations: 5
    And the LD50 and CI transformed:
    print(EAUS.Aa)
    LD SE LCL UCL
    p = 0.5: 1614.444 3.207876 164.3822 15855.91
    p = 0.9: 6373.473 3.764879 474.1600 85669.72
    These are the values I get on SPSS (just replacing the values on R output) : Coefficients:
    Estimate Std. Error z value
    (Intercept) -6.8940 1.082 -6.373
    (dose) 2.149 0.311 6.918
    And the LD50 and CI transformed:
    LD LCL UCL
    p = 0.5: 1614.444 1198.932 1953.120
    p = 0.9: 6373.473 5145.767 9013.354
    So, please if somebody can help me with this, I'd be grateful. If working with those functions won't do it, I'll use another, the one you recommend.
    Thank you very much!
    PD. I've already googled it but there's no satisfactory answer.

    Hopefully you've solved this problem already, but if not, here is the solution.

    You've used family=binomial in your model, which fixes the dispersion parameter at 1. Based on your ratio of residual deviance to residual deviance df, you are under-dispersed. The assumption that the dispersion ratio is 1 is what it throwing your
    results off. Instead, use family=quasibinomial. This allows the dispersion parameter to be estimated, which in turn affects the variance estimate, SE, etc. Give it a try by running one model with family=binomial and one model with family = quasibinomial.
    Notice the SE for the slopes and intercepts. This in turn should change your goodness of fit, and 95% CI's to almost exactly align with the SPSS results, as well as any other pertinent result SPSS spits out. Any minor differences in numbers are likely
    due to rounding under the hood of each piece of software and are of no theoretical concern.

    Hope that helps you and anybody else having this issue!

    Sincerely,

    Edwin R. Burgess IV, PhD

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)