Cosine wrote:
Hi:
Suppose we have a set of IID samples with size N. Would the
performance of each of the following ways of regression be different?
Which one would be more reliable (closer to the true situation) and
why? Assume the same model is used for regression, e.g., all the
same linear model or the same polynomial model.
(1) Use all the N samples for regression to determine the parameters
of the model.
(2) randomly resample the original samples to 10 sub-sets, each of
N/10 samples. Do the regression for each of the 10 sub-sets. Take the
average of the 10 regression results as the final result.
(3) randomly resample the original samples to form a sub-set with
N/10 samples. Conduct the regression. Put those samples of the
sub-set back, and resample a sub-set with N/10 samples. Conduct the
regression. Repeat this process so that we again have 10 sets of
regression models. Take the average of these models as the final
model.
The answer to this is that, provided you have the correct model, you
cannot do better than the single overall "regression" ... that is
assuming that your "regression" methodology coincides with maximum
likelihood estimation for the given model. This follows from the
theory of sufficient statistics. That theory says that your result
should depend on the data only through the sufficient statistics.
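
To make the comparison concrete, here is a rough Monte Carlo sketch in
Python with NumPy. Everything in it is an assumption chosen for
illustration and not part of the original question: the straight-line
model, the true coefficients, Gaussian noise, N = 100, and the names
fit_ols and compare_schemes. It estimates the mean squared error of the
fitted coefficients under each of the three schemes.

```python
import numpy as np


def fit_ols(x, y):
    """Least-squares fit of y = b0 + b1*x; returns [b0, b1]."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def compare_schemes(n=100, k=10, trials=1000, seed=0):
    """Mean squared error of [b0, b1] for schemes (1), (2) and (3).

    Assumed setup (hypothetical): true line y = 1 + 2x, standard normal
    x and noise, n divisible by k.
    """
    rng = np.random.default_rng(seed)
    true_beta = np.array([1.0, 2.0])
    m = n // k  # size of each sub-set
    sq_err = np.zeros((3, 2))
    for _ in range(trials):
        x = rng.normal(size=n)
        y = true_beta[0] + true_beta[1] * x + rng.normal(size=n)

        # (1) one fit on all n samples
        b1 = fit_ols(x, y)

        # (2) partition into k disjoint sub-sets, average the k fits
        parts = rng.permutation(n).reshape(k, m)
        b2 = np.mean([fit_ols(x[idx], y[idx]) for idx in parts], axis=0)

        # (3) k independent draws of m samples (sub-sets may overlap)
        draws = [rng.choice(n, size=m, replace=False) for _ in range(k)]
        b3 = np.mean([fit_ols(x[idx], y[idx]) for idx in draws], axis=0)

        for i, b in enumerate((b1, b2, b3)):
            sq_err[i] += (b - true_beta) ** 2
    return sq_err / trials


if __name__ == "__main__":
    mse = compare_schemes()
    for label, row in zip(("(1) full", "(2) split", "(3) redraw"), mse):
        print(label, "intercept MSE:", row[0], "slope MSE:", row[1])
```

Under these assumptions the single full-sample fit should come out with
the smallest error. Scheme (3) is typically the worst of the three,
since each round of draws leaves some of the samples unused.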
However, if you don't know that you have the correct model structure
and it is open to selection as part of the model-fitting process,
then your outline goes a very small way in the direction of
"cross-validation", where a fit on sub-samples is tested on the rest
of the sample ... but that is really very different from what you
suggest.
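
For contrast with the averaging schemes in the question, a minimal
sketch of k-fold cross-validation in Python with NumPy follows. The
function name kfold_cv_mse, the polynomial model family and the choice
of 10 folds are all assumptions made for illustration. The point is
that each fit is scored on the samples it did not see, and the scores
are used to compare candidate model structures, not to average
parameters.

```python
import numpy as np


def kfold_cv_mse(x, y, degree, k=10, seed=0):
    """Average held-out squared error of a degree-`degree` polynomial."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds only
        coefs = np.polyfit(x[train], y[train], degree)
        # Score on the fold that was left out
        pred = np.polyval(coefs, x[test])
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))


if __name__ == "__main__":
    # Hypothetical data: a quadratic with modest Gaussian noise
    rng = np.random.default_rng(1)
    x = rng.uniform(-2.0, 2.0, size=200)
    y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=0.3, size=200)
    for degree in (1, 2, 3, 4):
        print(degree, kfold_cv_mse(x, y, degree))
```

With data like this, the held-out error for the straight line should be
far larger than for the quadratic, which is the kind of evidence
cross-validation provides when the model structure is in doubt.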
An alternative use of leave-some-out analyses is in looking for
outliers and leverage points, which again relate to not wanting to
rely on a given model.
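
As a rough illustration of that diagnostic use, the sketch below
computes leverages and leave-one-out residuals for a straight-line fit
in Python with NumPy. The function name leverage_and_loo and the
straight-line model are assumptions for the example. It relies on the
standard least-squares identity that the leave-one-out residual equals
the ordinary residual divided by (1 - h_i), where h_i is the i-th
diagonal entry of the hat matrix, so no refitting is needed.

```python
import numpy as np


def leverage_and_loo(x, y):
    """Leverages h_i and leave-one-out residuals for y = b0 + b1*x."""
    X = np.column_stack([np.ones_like(x), x])
    # Hat-matrix diagonal from the thin QR factorisation: h_i = ||Q[i]||^2
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Residual of point i against the fit that omits point i
    loo_resid = resid / (1.0 - h)
    return h, loo_resid


if __name__ == "__main__":
    # Hypothetical data: 20 points on [0, 1] plus one far-out x value
    rng = np.random.default_rng(0)
    x = np.append(np.linspace(0.0, 1.0, 20), 10.0)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)
    h, loo = leverage_and_loo(x, y)
    print("highest-leverage index:", int(np.argmax(h)))
    print("largest |leave-one-out residual|:", float(np.max(np.abs(loo))))
```

A point with leverage close to 1 largely determines its own fitted
value, and a leave-one-out residual that is large relative to the
others marks a candidate outlier.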
I have come across one early mention of making use of fits on
sub-samples of the overall data set: see the section "The Nair and
Shrivastava method" on page 54 of the book "Analysis of Straight-Line
Data", by Forman S. Acton, Dover Publications, NY, 1966 (SBN
486-61747-5) .... which is a corrected version of the title published
by Wiley in 1959.