• #### Q: same samples, but regression done in different manners

From Cosine@21:1/5 to All on Tue Jan 22 04:38:10 2019
Hi:

Suppose we have a set of IID samples of size N. Would the performance of
each of the following ways of doing the regression be different? Which one
would be more reliable (closer to the true situation) and why? Assume the
same model is used for regression, e.g., all the same linear model or the
same polynomial model.

(1) Use all the N samples for regression to determine the parameters of the model.

(2) Randomly partition the original samples into 10 sub-sets, each of N/10 samples. Do the regression for each of the 10 sub-sets. Take the average of the 10 regression results as the final result.

(3) Randomly draw from the original samples a sub-set of N/10 samples. Conduct the regression. Put those samples of the sub-set back, and draw another sub-set of N/10 samples. Conduct the regression. Repeat this process so that we again have 10 regression models. Take the average of these models as the final model.
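The three schemes can be made concrete with a short simulation sketch (my own illustration; the linear data-generating model, seed, and variable names are invented for the example, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated IID data from a known linear model: y = 2 + 3x + noise.
N = 1000
x = rng.uniform(0, 1, N)
y = 2 + 3 * x + rng.normal(0, 1, N)
X = np.column_stack([np.ones(N), x])

def ols(X, y):
    """Ordinary least-squares coefficients (intercept, slope)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# (1) One regression on all N samples.
beta_full = ols(X, y)

# (2) Partition into 10 disjoint sub-sets of N/10, fit each, average.
parts = np.array_split(rng.permutation(N), 10)
beta_part = np.mean([ols(X[i], y[i]) for i in parts], axis=0)

# (3) Draw 10 sub-sets of N/10 with replacement *between* draws
#     (each sub-set itself drawn without replacement), fit, average.
draws = [rng.choice(N, size=N // 10, replace=False) for _ in range(10)]
beta_resamp = np.mean([ols(X[i], y[i]) for i in draws], axis=0)

print(beta_full, beta_part, beta_resamp)
```

All three land near the true (2, 3); the differences show up in their sampling variability, which is what the replies below address.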

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
• From David Jones@21:1/5 to David Jones on Thu Jan 24 00:02:18 2019
David Jones wrote:

Cosine wrote:

Hi:

Suppose we have a set of IID samples of size N. Would the
performance of each of the following ways of doing the regression be
different? Which one would be more reliable (closer to the true
situation) and why? Assume the same model is used for regression,
e.g., all the same linear model or the same polynomial model.

(1) Use all the N samples for regression to determine the parameters
of the model.

(2) Randomly partition the original samples into 10 sub-sets, each of
N/10 samples. Do the regression for each of the 10 sub-sets. Take the
average of the 10 regression results as the final result.

(3) Randomly draw from the original samples a sub-set of
N/10 samples. Conduct the regression. Put those samples of the
sub-set back, and draw another sub-set of N/10 samples. Conduct the
regression. Repeat this process so that we again have 10
regression models. Take the average of these models as the final
model.

The answer to this is that, provided you have the correct model, you
cannot do better than the single overall "regression" ... that is
assuming that your "regression" methodology coincides with maximum
likelihood estimation for the given model. This follows from the
theory of sufficient statistics. That theory says that your result
should depend directly on the sufficient statistics only.

However, if you don't know that you have the correct structure of
model and it is open to selection as part of the model-fitting
process, then your outline goes a very small way in the direction of "cross-validation", where fitting on sub-samples is tested on the rest
of the sample ... but it is really very different from what you
suggest.

An alternative use of leave-some-out analyses is in looking for
outliers and leverage points, which again relate to not wanting to
rely on a given model.

I have come across one early mention of making use of fits on
sub-samples of the overall data set: see the section "The Nair and
Shrivastava method" on page 54 of the book "Analysis of Straight Line
Data", by Forman S. Acton, Dover Publications, NY, 1966 (SBN
486-61747-5) .... which is a corrected version of the title published
by Wiley in 1959.

An alternative viewpoint is that of least-squares regression, where the
estimates are specifically formulated to be the optimal linear
combination of the observed y-values. Your suggested procedures would
be a different linear combination (being a linear combination of a
linear combination) ... and so they can't be better than the optimal
linear combination.
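That point can be checked numerically without any Monte Carlo (my own sketch; the fixed design and sigma^2 = 1 are assumptions of the example): write each estimator as an explicit linear map of y, verify both are unbiased, and compare their exact slope variances.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.uniform(0, 1, N)
X = np.column_stack([np.ones(N), x])

# Full-sample OLS is the linear map  beta_hat = H_full @ y,
# with H_full = (X'X)^{-1} X' (the Moore-Penrose pseudoinverse here).
H_full = np.linalg.pinv(X)

# Averaging fits on 10 disjoint sub-sets is another linear map of y:
# each column block is that sub-set's pseudoinverse, scaled by 1/10.
parts = np.array_split(rng.permutation(N), 10)
H_avg = np.zeros_like(H_full)
for idx in parts:
    H_avg[:, idx] = np.linalg.pinv(X[idx]) / 10

# For IID noise with variance sigma^2, Cov(estimator) = sigma^2 * H @ H.T.
# Compare the slope variances (take sigma^2 = 1).
var_full = (H_full @ H_full.T)[1, 1]
var_avg = (H_avg @ H_avg.T)[1, 1]
print(var_full, var_avg)
```

Both maps satisfy H @ X = I (unbiasedness), so by Gauss-Markov the full-sample variance is the smaller of the two, which is exactly the "optimal linear combination" argument above.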

• From David Jones@21:1/5 to Cosine on Wed Jan 23 23:34:57 2019
Cosine wrote:

Hi:

Suppose we have a set of IID samples with size N. Would the
performance of each of the following way of regression be different?
Which one would be more reliable (closer to the true situation) and
why? Assume the same model is used for regression, e.g., all the same
linear model or the same polynomial model.

(1) Use all the N samples for regression to determine the parameters
of the model.

(2) Randomly partition the original samples into 10 sub-sets, each of
N/10 samples. Do the regression for each of the 10 sub-sets. Take the
average of the 10 regression results as the final result.

(3) Randomly draw from the original samples a sub-set of
N/10 samples. Conduct the regression. Put those samples of the
sub-set back, and draw another sub-set of N/10 samples. Conduct the
regression. Repeat this process so that we again have 10
regression models. Take the average of these models as the final
model.

The answer to this is that, provided you have the correct model, you
cannot do better than the single overall "regression" ... that is
assuming that your "regression" methodology coincides with maximum
likelihood estimation for the given model. This follows from the theory
of sufficient statistics. That theory says that your result should
depend directly on the sufficient statistics only.

However, if you don't know that you have the correct structure of model
and it is open to selection as part of the model-fitting process, then
your outline goes a very small way in the direction of
"cross-validation", where fitting on sub-samples is tested on the rest
of the sample ... but it is really very different from what you suggest.
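Cross-validation in the sense meant here fits on one part of the sample and scores predictions on the held-out rest, typically to choose among candidate model structures. A minimal k-fold sketch (my own illustration; the sine data-generating curve and polynomial-degree candidates are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
x = rng.uniform(-1, 1, N)
y = np.sin(2 * x) + rng.normal(0, 0.2, N)  # true curve is not a polynomial

def cv_mse(x, y, degree, k=10):
    """Mean squared prediction error of a degree-`degree` polynomial fit,
    estimated by k-fold cross-validation."""
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for hold in folds:
        train = np.setdiff1d(np.arange(len(x)), hold)
        coef = np.polyfit(x[train], y[train], degree)   # fit on k-1 folds
        pred = np.polyval(coef, x[hold])                # score on held-out fold
        errs.append(np.mean((y[hold] - pred) ** 2))
    return np.mean(errs)

scores = {d: cv_mse(x, y, d) for d in range(1, 8)}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Note the contrast with the question's schemes: here the sub-sample fits are used only to *score* candidate structures on unseen data, and the final model is then refitted once on all N samples.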

An alternative use of leave-some-out analyses is in looking for
outliers and leverage points, which again relate to not wanting to rely
on a given model.

I have come across one early mention of making use of fits on
sub-samples of the overall data set: see the section "The Nair and
Shrivastava method" on page 54 of the book "Analysis of Straight Line
Data", by Foreman S. Acton, Dover Publications, NY, 1966 (SBN
486-61747-5) .... which is a corrected version of the title published
by Wiley in 1959.
