I have a large database with almost 290,000 observations and I want to run an analysis with more than 50 variables. What problems can arise from this situation? Can the model be overfitted? If so, why?
If your 50 variables include ALL relevant variables then, asymptotically, any irrelevant variables should be revealed as such. Your N is so large that asymptotic arguments should be a good approximation unless your residual variation is very large. On the other hand, if you are missing some relevant variables, then your fit might include some irrelevant variables that correlate with them, and the number of included irrelevant variables might exceed the number of missing relevant ones.
Overfitting is not a well-defined concept in statistics. It is about being cautious and not trying to identify too many patterns relative to the amount of data. Generally, I would say that larger data sets allow for greater model flexibility. 50 variables on 290,000 observations does not seem like overfitting. However, if you include all 1,225 pairwise interaction terms, that is another story.
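For a rough sense of scale, here is a small Python sketch (array names and sizes are illustrative placeholders, not the questioner's data) showing how the column count explodes once pairwise interactions are added:

```
# Count and generate pairwise interaction terms for p = 50 regressors.
# The data here are simulated placeholders, not the questioner's data.
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

p = 50
print(comb(p, 2))  # 1225 pairwise interaction terms

X = np.random.default_rng(0).normal(size=(1_000, p))  # toy stand-in for the real rows
expander = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = expander.fit_transform(X)
print(X_int.shape)  # (1000, 1275): the original 50 columns plus the 1225 interactions
```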
Yes, your model can be overfitted. You can think of overfitting in several ways, but let us take two different avenues. First, the number of relevant variables. Imagine that the truly correct model contains only 30 of the 50 variables you happen to have. Whatever method you use to identify the correct variables in your model can lead you to "false discoveries". This is closely related to the type I error in statistical inference. You can be very strict with the type I error, but it will never be zero, so you are admitting the possibility of false discoveries. These false discoveries become more likely the more variables and transformations of variables you try on the same sample. You mention that you have 50 variables, but what about using their squares, their logs, or products of pairwise combinations of them? The more combinations you try on the same sample, the more likely you are to end up with false discoveries: variables that are not in the true model, but that your statistical method is unable to flag as such.
Second, imagine that the true share of variance that can be explained by your model is 50%, so this is the best R2 you could get if you could identify the correct model. Now, because you are searching for the correct variables using the same sample over and over again, you end up with a sample R2 of 75%. You again have an overfitting issue, induced by the data-mining process.
A large N helps, but much of the overfitting problem stems from the repeated use of the same sample to search for a correct model.
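To make the data-mining point concrete, here is a small simulation sketch in Python (sample size and variable counts are deliberately small and purely illustrative, not the questioner's): only 5 of 500 candidate variables matter and the truly explainable share of variance is about 50%, yet keeping the 50 variables most correlated with y in the same sample yields a noticeably higher in-sample R2 and many false discoveries.

```
# Illustrative simulation: data mining on one sample inflates in-sample R2
# and produces false discoveries. Sizes are chosen to make the effect visible.
import numpy as np

rng = np.random.default_rng(42)
n, p, k_true = 300, 500, 5

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:k_true] = 1.0                                   # only the first 5 variables matter
signal = X @ beta
y = signal + rng.normal(scale=signal.std(), size=n)   # roughly 50% explainable variance

# "Data mining": keep the 50 variables most correlated with y in this sample.
corr_with_y = np.abs(np.corrcoef(X.T, y)[-1, :-1])
selected = np.argsort(corr_with_y)[-50:]

def in_sample_r2(X_sub, y):
    """R2 of an OLS fit of y on X_sub (with intercept), evaluated on the same sample."""
    Z = np.column_stack([np.ones(len(y)), X_sub])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return 1 - np.var(y - Z @ coef) / np.var(y)

print("truly explainable share:", round(np.var(signal) / np.var(y), 2))
print("in-sample R2 after selection:", round(in_sample_r2(X[:, selected], y), 2))
print("false discoveries among the 50 kept:", int(np.sum(selected >= k_true)))
```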
Did you mean you have a sample size of n = 290,000 or that you have a finite population of size N = 290,000?
If you have a census, then it sounds like you have a very computer intensive job to do of checking possibilities.
If you are talking about having a very large sample from a large finite or infinite population, and want to know whether you might overfit your model to that sample and not model the population as well as you'd like, then yes, it is still possible. If your model uses so many variables that it covers all sample observations with an unrealistically small sum of squared estimated residuals, then observations from the rest of the population are not going to be predicted so well. The more the sample's relationship to the regressor data differs from the relationship(s) of the remainder of the population to the regressor data, the more obvious the overfitting will be.
That is, just because you have a large sample does not mean that the model which fits it best will fit the rest of the population best. Larger might be better, but not so much if there is a part of the population which behaves differently, requires different regressor data, and is being left out. That, I think, is a big problem with Big Data.
Note in general that more complex models tend to increase variance. In Brewer, K.R.W. (2002), Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold: London and Oxford University Press, pages 109-110, Ken Brewer tells us that "It is well known that regressor variables, when introduced for reasons other than that they may have appreciable explanatory power, tend to increase rather than decrease the estimates of variance." On the other hand, you could have omitted variable bias if you leave out an important variable. Having a large sample may help, but holding some data out for "cross-validation" can help to determine whether you may still be overfitting.
Some may argue, but I think subject matter expertise should be considered. If you just go by what the data indicate, you may find that a different sample will tell you something different.
I worked for many years with very many, very small populations, for the publication of official statistics, and in another sense, regarding sampling, I routinely traded a small amount of bias for a very large decrease in variance. But it all depends upon the situation.
I suggest that you use graphical residual analysis to check fit, and use some kind of cross-validation or similar check to guard against overfitting here, as you would do with any model using statistical learning techniques.
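As a rough sketch of both checks (the arrays X and y, and the linear model, are simulated stand-ins rather than the questioner's data):

```
# Residual plot plus cross-validation for a linear fit; X and y are simulated
# placeholders for a 290,000 x 50 data set.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(290_000, 50))
y = X[:, :10].sum(axis=1) + rng.normal(size=len(X))

model = LinearRegression().fit(X, y)

# Graphical residual analysis: residuals vs fitted values should show no pattern.
fitted = model.predict(X)
plt.scatter(fitted, y - fitted, s=2, alpha=0.2)
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

# Cross-validation: a large gap between these two numbers signals overfitting.
print("in-sample R2:", model.score(X, y))
print("5-fold CV R2:", cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean())
```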
With so many explanatory variables, overfitting is almost unavoidable, independent of how many observations you have. A thorough (theory-based) specification is important. Factor analysis might be helpful. Maybe you can break down the (one-equation) model into several models. In a first step, you could try to find out which are the most important explanatory variables and estimate their relation to your (main) endogenous variable. In a second step, you could take these variables as endogenous and estimate their relation to other explanatory variables.
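A minimal sketch of the factor-analysis step, assuming the 50 regressors sit in an array X (the choice of 10 factors and the simulated data are placeholders, not a recommendation):

```
# Reduce 50 correlated explanatory variables to a handful of factors, which can
# then enter the regression (or the two-step approach) in place of the originals.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(290_000, 50))        # stand-in for the real regressors

fa = FactorAnalysis(n_components=10, random_state=1)
factor_scores = fa.fit_transform(X)       # shape (290000, 10)
print(factor_scores.shape)
```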
The nice thing when you have such a large sample is that you can always spare a (not too) small fraction as a validation set, so as to quantify overfitting (defining overfitting here as the drop in performance between "in training set" and "out of training set").
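For instance (a rough sketch, with X and y as simulated placeholders for the real data and a 10% split chosen arbitrarily):

```
# Hold out a validation set and measure overfitting as the drop in R2
# from the training data to the held-out data. X and y are simulated stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(290_000, 50))
y = X[:, :10].sum(axis=1) + rng.normal(size=len(X))

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=7)
model = LinearRegression().fit(X_train, y_train)

print("training R2:  ", model.score(X_train, y_train))
print("validation R2:", model.score(X_val, y_val))
print("overfitting (drop):", model.score(X_train, y_train) - model.score(X_val, y_val))
```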
50 variables = overfit. This is a case where the investigator does not know the answer and therefore grabs everything under the sun as possible explanatory factors. A large sample size does not mitigate this defect.