• Q: same samples but regression in different manners

    From Cosine@21:1/5 to All on Tue Jan 22 04:38:10 2019
    Hi:

    Suppose we have a set of IID samples of size N. Would the performance of each of the following ways of doing the regression be different? Which one would be more reliable (closer to the true situation), and why? Assume the same model is used for
    each regression, e.g., the same linear model or the same polynomial model.

    (1) Use all the N samples for regression to determine the parameters of the model.

    (2) Randomly partition the original samples into 10 sub-sets, each of N/10 samples. Do the regression for each of the 10 sub-sets. Take the average of the 10 regression results as the final result.

    (3) Randomly draw a sub-set of N/10 samples from the original set and conduct the regression. Put the samples of that sub-set back, draw another sub-set of N/10 samples, and conduct the regression again. Repeat this process until we again have 10
    regression models. Take the average of these models as the final model.
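
    A minimal numerical sketch of these three procedures (hypothetical code, not part of the original thread; it assumes a straight-line model fitted by ordinary least squares with numpy, and the sample size and coefficients are made up for illustration):

        import numpy as np

        rng = np.random.default_rng(0)

        # Hypothetical setup: y = 2 + 3*x + noise, N = 1000 IID samples.
        N = 1000
        beta_true = np.array([2.0, 3.0])
        x = rng.uniform(0.0, 10.0, N)
        X = np.column_stack([np.ones(N), x])
        y = X @ beta_true + rng.normal(0.0, 1.0, N)

        def ols(Xs, ys):
            # Ordinary least-squares fit; returns (intercept, slope).
            return np.linalg.lstsq(Xs, ys, rcond=None)[0]

        # (1) One regression on all N samples.
        b1 = ols(X, y)

        # (2) Partition into 10 disjoint sub-sets of N/10, fit each, average the fits.
        parts = np.array_split(rng.permutation(N), 10)
        b2 = np.mean([ols(X[idx], y[idx]) for idx in parts], axis=0)

        # (3) Draw 10 sub-sets of N/10, returning the samples between draws,
        #     fit each, average the fits.
        draws = [rng.choice(N, N // 10, replace=False) for _ in range(10)]
        b3 = np.mean([ols(X[idx], y[idx]) for idx in draws], axis=0)

        for name, b in [("(1) all N", b1), ("(2) 10-way split", b2), ("(3) 10 draws", b3)]:
            print(name, b, "|error| =", np.linalg.norm(b - beta_true))

    Averaged over many such runs, the full-sample fit would typically come out closest to the true coefficients, which is the point made in the replies below.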

  • From David Jones@21:1/5 to David Jones on Thu Jan 24 00:02:18 2019
    David Jones wrote:

    Cosine wrote:

    Hi:

    Suppose we have a set of IID samples of size N. Would the
    performance of each of the following ways of doing the regression be
    different? Which one would be more reliable (closer to the true
    situation), and why? Assume the same model is used for each
    regression, e.g., the same linear model or the same polynomial model.

    (1) Use all the N samples for regression to determine the parameters
    of the model.

    (2) Randomly partition the original samples into 10 sub-sets, each of
    N/10 samples. Do the regression for each of the 10 sub-sets. Take the
    average of the 10 regression results as the final result.

    (3) Randomly draw a sub-set of N/10 samples from the original set and
    conduct the regression. Put the samples of that sub-set back, draw
    another sub-set of N/10 samples, and conduct the regression again.
    Repeat this process until we again have 10 regression models. Take
    the average of these models as the final model.

    The answer to this is that, provided you have the correct model, you
    cannot do better than the single overall "regression" ... that is
    assuming that your "regression" methodology coincides with maximum
    likelihood estimation for the given model. This follows from the
    theory of sufficient statistics. That theory says that your result
    should depend on the data only through the sufficient statistics.
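
    As a sketch of that sufficiency argument, assuming the ordinary normal linear model (my notation, not the poster's):

        \[
          y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_N),
        \]
        with the design matrix $X$ treated as fixed, the likelihood depends on
        the data only through
        \[
          T(y) = \bigl( X^\top y,\; y^\top y \bigr),
        \]
        so $T(y)$ is sufficient for $(\beta, \sigma^2)$.  The full-sample
        least-squares / maximum-likelihood estimate
        \[
          \hat\beta = (X^\top X)^{-1} X^\top y
        \]
        is a function of $T(y)$ alone, whereas an average of sub-set fits also
        depends on how the sample happened to be split.  By the Rao--Blackwell
        theorem, replacing such an estimator by its conditional expectation
        given $T(y)$ cannot increase its mean squared error.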

    However, if you don't know that you have the correct model structure
    and it is open to selection as part of the model-fitting process, then
    your outline goes a very small way in the direction of
    "cross-validation", where a fit made on a sub-sample is tested on the
    rest of the sample ... but cross-validation is really very different
    from what you suggest.
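
    For contrast, a bare-bones sketch of k-fold cross-validation (hypothetical code; the fit/predict callables are placeholders, and the usage line assumes the numpy setup sketched after the original question): each sub-sample fit is scored on the data held out from it, and the averaged out-of-sample error is used to compare candidate model structures rather than to produce an averaged fit.

        import numpy as np

        def kfold_cv_mse(X, y, fit, predict, k=10, seed=0):
            # Average held-out squared error over k folds; fit/predict are
            # user-supplied callables for whichever candidate model is being judged.
            folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
            errors = []
            for held_out in folds:
                train = np.setdiff1d(np.arange(len(y)), held_out)
                model = fit(X[train], y[train])          # fit on the other k-1 folds
                errors.append(np.mean((predict(model, X[held_out]) - y[held_out]) ** 2))
            return float(np.mean(errors))                # compare across model structures

        # e.g. kfold_cv_mse(X, y, ols, lambda b, Xs: Xs @ b) with the objects
        # from the earlier sketch.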

    An alternative use of leave-some-out analyses is in looking for
    outliers and leverage points, which again relate to not wanting to
    rely on a given model.

    I have come across one early mention of making use of fits on
    sub-samples of the overall data set: see the section "The Nair and
    Shrivastava method" on page 54 of the book "Analysis of Straight-Line
    Data", by Forman S. Acton, Dover Publications, NY, 1966 (SBN
    486-61747-5) ... which is a corrected version of the title published
    by Wiley in 1959.

    An alternative viewpoint is that of least-squares regression, where
    the estimates are specifically formulated to be the optimal linear
    combination of the observed y-values. Your suggested procedures would
    be a different linear combination (being a linear combination of a
    linear combination) ... and so they cannot be better than the optimal
    linear combination.
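
    In symbols (again my notation, assuming the sub-set fits are ordinary least-squares fits on a fixed design), the Gauss-Markov theorem makes that comparison precise:

        \[
          \hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y = A y,
          \qquad
          \bar\beta = \frac{1}{10} \sum_{k=1}^{10} (X_k^\top X_k)^{-1} X_k^\top y_k = B y ,
        \]
        so, for a given split, both estimators are linear in $y$ and both are
        unbiased for $\beta$.  The Gauss--Markov theorem says that among all
        linear unbiased estimators $Cy$ the least-squares choice $Ay$ has the
        smallest covariance matrix, i.e.
        \[
          \operatorname{Var}(\bar\beta) - \operatorname{Var}(\hat\beta_{\mathrm{OLS}})
          \ \text{is positive semidefinite},
        \]
        with equality only when $B = A$.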

  • From David Jones@21:1/5 to Cosine on Wed Jan 23 23:34:57 2019
    Cosine wrote:

    Hi:

    Suppose we have a set of IID samples of size N. Would the
    performance of each of the following ways of doing the regression be
    different? Which one would be more reliable (closer to the true
    situation), and why? Assume the same model is used for each
    regression, e.g., the same linear model or the same polynomial model.

    (1) Use all the N samples for regression to determine the parameters
    of the model.

    (2) Randomly partition the original samples into 10 sub-sets, each of
    N/10 samples. Do the regression for each of the 10 sub-sets. Take the
    average of the 10 regression results as the final result.

    (3) Randomly draw a sub-set of N/10 samples from the original set and
    conduct the regression. Put the samples of that sub-set back, draw
    another sub-set of N/10 samples, and conduct the regression again.
    Repeat this process until we again have 10 regression models. Take
    the average of these models as the final model.

    The answer to this is that, provided you have the correct model, you
    cannot do better than the single overall "regression" ... that is
    assuming that your "regression" methodology coincides with maximum
    likelihood estimation for the given model. This follows from the
    theory of sufficient statistics. That theory says that your result
    should depend on the data only through the sufficient statistics.

    However, if you don't know that you have the correct model structure
    and it is open to selection as part of the model-fitting process, then
    your outline goes a very small way in the direction of
    "cross-validation", where a fit made on a sub-sample is tested on the
    rest of the sample ... but cross-validation is really very different
    from what you suggest.

    An alternative use of leave-some-out analyses is in looking for
    outliers and leverage points, which again relate to not wanting to rely
    on a given model.

    I have come across one early mention of making use of fits on
    sub-samples of the overall data set: see the section "The Nair and
    Shrivastava method" on page 54 of the book "Analysis of Straight-Line
    Data", by Forman S. Acton, Dover Publications, NY, 1966 (SBN
    486-61747-5) ... which is a corrected version of the title published
    by Wiley in 1959.
