I have a large database with almost 290,000 observations and I want to run an analysis with more than 50 variables. What problems can arise from this situation? Can the model be overfitted? If so, why?
If your 50 variables include ALL relevant variables then, asymptotically, any irrelevant variables should be revealed as such. Your N is so large that asymptotic arguments should be a good approximation unless your residual variation is very large. On the other hand, if you are missing some relevant variables, then your fit might include some irrelevant variables that correlate with them, and the number of included irrelevant variables might exceed the number of missing relevant ones.
Overfitting is not a well-defined concept in statistics. It is about being cautious and not trying to identify too many patterns relative to the amount of data. Generally, I would say that larger data sets allow for greater model flexibility. 50 variables on 290,000 observations does not seem like overfitting. However, if you include all 1,225 pairwise interaction terms, that is another story.
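For a rough sense of scale, here is a small Python sketch (array names and sizes are illustrative placeholders, not the questioner's data) showing how the column count explodes once pairwise interactions are added:

```
# Count and generate pairwise interaction terms for p = 50 regressors.
# The data here are simulated placeholders, not the questioner's data.
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

p = 50
print(comb(p, 2))  # 1225 pairwise interaction terms

X = np.random.default_rng(0).normal(size=(1_000, p))  # toy stand-in for the real rows
expander = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = expander.fit_transform(X)
print(X_int.shape)  # (1000, 1275): the original 50 columns plus the 1225 interactions
```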
Yes, your model can be overfitted. You can think of overfitting in several ways, but let us take two different avenues. First, the number of relevant variables. Imagine that the truly correct model contains only 30 of the 50 variables you happen to have. Whatever method you use to identify the correct variables in your model can lead you to "false discoveries". This is closely related to the type I error in statistical inference. You can be very strict with the type I error, but it will never be zero, so you are admitting the possibility of false discoveries. These false discoveries become more likely the more variables and transformations of variables you try on the same sample. You mention that you have 50 variables, but what about using their squares, their logs, or products of pairwise combinations of them? The more combinations you try on the same sample, the more likely you are to end up with false discoveries: variables that are not in the true model, but that your statistical method is unable to flag as such.
Second, imagine that the true share of variance that can be explained by your model is 50%, so this is the best R2 you could get if you could identify the correct model. Now, because you are searching for the correct variables using the same sample over and over again, you end up with a sample R2 of 75%. You again have an overfitting issue, induced by the data-mining process.
A large N helps, but much of the overfitting problem stems from the repeated use of the same sample to search for a correct model.
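To make the data-mining point concrete, here is a small simulation sketch in Python (sample size and variable counts are deliberately small and purely illustrative, not the questioner's): only 5 of 500 candidate variables matter and the truly explainable share of variance is about 50%, yet keeping the 50 variables most correlated with y in the same sample yields a noticeably higher in-sample R2 and many false discoveries.

```
# Illustrative simulation: data mining on one sample inflates in-sample R2
# and produces false discoveries. Sizes are chosen to make the effect visible.
import numpy as np

rng = np.random.default_rng(42)
n, p, k_true = 300, 500, 5

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:k_true] = 1.0                                   # only the first 5 variables matter
signal = X @ beta
y = signal + rng.normal(scale=signal.std(), size=n)   # roughly 50% explainable variance

# "Data mining": keep the 50 variables most correlated with y in this sample.
corr_with_y = np.abs(np.corrcoef(X.T, y)[-1, :-1])
selected = np.argsort(corr_with_y)[-50:]

def in_sample_r2(X_sub, y):
    """R2 of an OLS fit of y on X_sub (with intercept), evaluated on the same sample."""
    Z = np.column_stack([np.ones(len(y)), X_sub])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return 1 - np.var(y - Z @ coef) / np.var(y)

print("truly explainable share:", round(np.var(signal) / np.var(y), 2))
print("in-sample R2 after selection:", round(in_sample_r2(X[:, selected], y), 2))
print("false discoveries among the 50 kept:", int(np.sum(selected >= k_true)))
```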
Did you mean you have a sample size of n = 290,000 or that you have a finite population of size N = 290,000?
If you have a census, then it sounds like you have a very computer intensive job to do of checking possibilities.
If you are talking about having a very large sample from a large finite or infinite population, and want to know whether you might overfit your model to that sample and not model the population as well as you'd like, then yes, it is still possible. If your model uses so many variables that it covers all sample observations with an unrealistically small sum of squared estimated residuals, then observations from the rest of the population are not going to be predicted so well. The more the sample's relationship to the regressor data differs from the relationship(s) of the remainder of the population to the regressor data, the more obvious the overfitting will be.
That is, just because you have a large sample does not mean that the model which fits it best will fit the rest of the population best. Larger might be better, but not so much if there is a part of the population which behaves differently, requires different regressor data, and is being left out. That, I think, is a big problem with Big Data.
Note in general that more complex models tend to increase variance. In Brewer, K.R.W. (2002), Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold: London and Oxford University Press, pages 109-110, Ken Brewer tells us that "It is well known that regressor variables, when introduced for reasons other than that they may have appreciable explanatory power, tend to increase rather than decrease the estimates of variance." On the other hand, you could have omitted variable bias if you leave out an important variable. Having a large sample may help, but holding some data out for "cross-validation" can help to determine whether you may still be overfitting.
Some may argue, but I think subject matter expertise should be considered. If you just go by what the data indicate, you may find that a different sample will tell you something different.
I worked for many years with very many, very small populations, for the publication of official statistics, and in another sense, regarding sampling, I routinely traded a small amount of bias for a very large decrease in variance. But it all depends upon the situation.
I suggest that you use graphical residual analysis to check fit, and use some kind of cross-validation or similar check to guard against overfitting here, as you would do with any model using statistical learning techniques.
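As a rough sketch of both checks (the arrays X and y, and the linear model, are simulated stand-ins rather than the questioner's data):

```
# Residual plot plus cross-validation for a linear fit; X and y are simulated
# placeholders for a 290,000 x 50 data set.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(290_000, 50))
y = X[:, :10].sum(axis=1) + rng.normal(size=len(X))

model = LinearRegression().fit(X, y)

# Graphical residual analysis: residuals vs fitted values should show no pattern.
fitted = model.predict(X)
plt.scatter(fitted, y - fitted, s=2, alpha=0.2)
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

# Cross-validation: a large gap between these two numbers signals overfitting.
print("in-sample R2:", model.score(X, y))
print("5-fold CV R2:", cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean())
```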
With so many explanatory variables, overfitting is almost unavoidable, independent of how many observations you have. A thorough (theory-based) specification is important. Factor analysis might be helpful. Maybe you can break down the (one-equation) model into several models. In a first step, you could try to find out which are the most important explanatory variables and estimate their relation to your (main) endogenous variable. In a second step, you could take these variables as endogenous and estimate their relation to other explanatory variables.
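A minimal sketch of the factor-analysis step, assuming the 50 regressors sit in an array X (the choice of 10 factors and the simulated data are placeholders, not a recommendation):

```
# Reduce 50 correlated explanatory variables to a handful of factors, which can
# then enter the regression (or the two-step approach) in place of the originals.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(290_000, 50))        # stand-in for the real regressors

fa = FactorAnalysis(n_components=10, random_state=1)
factor_scores = fa.fit_transform(X)       # shape (290000, 10)
print(factor_scores.shape)
```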
The nice thing when you have such a large sample is that you can always spare a (not too) small fraction as a validation set, so as to quantify overfitting (defining overfitting here as the drop in performance between "in training set" and "out of training set").
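For instance (a rough sketch, with X and y as simulated placeholders for the real data and a 10% split chosen arbitrarily):

```
# Hold out a validation set and measure overfitting as the drop in R2
# from the training data to the held-out data. X and y are simulated stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(290_000, 50))
y = X[:, :10].sum(axis=1) + rng.normal(size=len(X))

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=7)
model = LinearRegression().fit(X_train, y_train)

print("training R2:  ", model.score(X_train, y_train))
print("validation R2:", model.score(X_val, y_val))
print("overfitting (drop):", model.score(X_train, y_train) - model.score(X_val, y_val))
```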
50 variables = overfit. This is a case where the investigator does not know the answer and therefore grabs everything under the sun as possible explanatory factors. A large sample size does not mitigate this defect.