I'm currently running GLMMs in R and want to know if there is a max number of explanatory variables you can include, based on the N of study. For example, you can have 10 IVs for every 10 subjects.
However, great care should be taken when reaching the "p>>n" regime (many more explanatory variables than cases) because overfitting is almost certain; sparsity-inducing regularization is highly recommended.
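To make "sparsity-inducing regularization" concrete, here is a minimal sketch of the lasso fitted by coordinate descent on simulated data with more predictors than cases. Everything in it is an assumption for illustration: the data are simulated, the function names are made up, there is no intercept, and in practice you would use an established package rather than hand-rolled code.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return exactly 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b, and b starts at zero
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue
            # Correlation of column j with the residual that ignores column j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_scale[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

def lambda_max(X, y):
    """Smallest penalty at which every coefficient is shrunk to zero."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * y[i] for i in range(n))) / n for j in range(p))

# Simulated (hypothetical) data: 20 cases, 60 predictors, only two of which matter.
rng = random.Random(0)
n_cases, n_predictors = 20, 60
X = [[rng.gauss(0, 1) for _ in range(n_predictors)] for _ in range(n_cases)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.1) for row in X]

lam_big = lambda_max(X, y)
beta_null = lasso_coordinate_descent(X, y, lam=lam_big)          # everything zeroed
beta_sparse = lasso_coordinate_descent(X, y, lam=0.5 * lam_big)  # some predictors kept
n_selected = sum(1 for b in beta_sparse if b != 0.0)
print(n_selected, "of", n_predictors, "coefficients are non-zero")
```

The penalty sets many coefficients to exactly zero, which is what makes a fit possible at all when p > n. The choice of penalty (here an arbitrary half of the maximum) would normally be made by cross-validation.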
That's fantastic, thank you. I'll go through those links in detail shortly. We have some research questions with N = 150 and about 8 IVs, so it sounds like that might not be too much of a problem.
In some areas of study, such as modern biostatistics, it makes sense to have a very large number of predictors, and it is good that p>>n methods are available. But it depends on what you want to do with the results: often the number of predictor variables is more usefully constrained by your theories and how you will use them than by the number of subjects and the statistical methods. So if you put all 8 predictors in (and let's assume no interactions, but ...), will it be useful to interpret each coefficient as conditioned on the other 7? That might be tricky.
If all you are doing is trying to predict values, then using all of them is more understandable.
@Katherine: I think 8 IVs are not too many for N=150. Nevertheless, after fitting the model, test whether the coefficients of the IVs are significantly different from zero.
It is tricky to say what significance means for an individual coefficient after using something like the lasso. Even if you just use all eight and do not shrink the model, make sure to interpret them in light of there being 8 tests in the family and of each coefficient being conditional on the rest.
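One common way to account for "8 tests in the family" is a Holm step-down adjustment of the p-values. The sketch below is only an illustration: the eight p-values are invented placeholders, not results from any study, and Holm is just one of several possible corrections.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical p-values for 8 coefficients (illustrative only).
raw_p = [0.001, 0.004, 0.012, 0.03, 0.04, 0.20, 0.45, 0.80]
adj_p = holm_adjust(raw_p)

alpha = 0.05
n_raw_significant = sum(p < alpha for p in raw_p)
n_adj_significant = sum(p < alpha for p in adj_p)
print(n_raw_significant, "significant before adjustment,", n_adj_significant, "after")
```

With these made-up inputs, five coefficients fall below 0.05 before adjustment and two after, which shows how much the family size can matter.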
Also, before this thread grows too much, we should ask whether interactions among the predictors will be included, and check whether only linear relations are to be included. The answers to these may complicate the model but, depending on the area of science, may be critical.
The questions relate to a questionnaire study. We are trying to see if 8 demographic factors (age, education level etc.) affect certain responses, such as whether respondents get a factual question right or wrong. We will look for interaction effects as well.
Even if you just restrict yourself to 2- and 3-way interactions (8 choose 2 plus 8 choose 3 is 84, so that many more effects to estimate), this means that there are a lot of predictors (even assuming just linear relations) and interpretation will be complex. So be careful.
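The count above is easy to check. A small sketch, assuming each predictor and each interaction contributes a single term (true for numeric or two-level predictors; a factor with more levels contributes more):

```python
from math import comb

n_predictors = 8

two_way = comb(n_predictors, 2)    # 28 pairwise interactions
three_way = comb(n_predictors, 3)  # 56 three-way interactions
extra_effects = two_way + three_way            # 84 interaction terms
total_terms = n_predictors + extra_effects     # 92 terms including main effects

print(two_way, three_way, extra_effects, total_terms)
```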
On the factual questions, do you have a set of right/wrong questions, so are you using something from item response theory (IRT) to analyze (or analyse, I lived in the UK for 21 years) them?
Possibly I did not interpret your question appropriately, but it will not work to try and fit Y=X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+e with only ten observations. I think that any model where the degrees of freedom from each variable plus error term equals or exceeds the number of replicates will fail. Interaction terms also use degrees of freedom, so as Daniel pointed out, a few variables can generate a large number of interactions. In Daniel's example, if you use all of the interactions you will have a total of 255 variables in your model. A sample size of 150 will not support a model with all of these. Also, missing cases effectively reduce N. Sometimes missing values are imputed.
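As a rough check of the arithmetic above, here is a sketch that counts the terms and the residual degrees of freedom left for an ordinary fixed-effects fit. It assumes one degree of freedom per term plus one for the intercept; multi-level factors and random effects in a GLMM would change the accounting, and the function name is made up for illustration.

```python
from math import comb

def residual_df(n_observations, n_terms):
    """Degrees of freedom left after fitting n_terms slopes plus an intercept."""
    return n_observations - (n_terms + 1)

n_predictors = 8
# Main effects plus every interaction from 2-way up to 8-way.
all_terms = sum(comb(n_predictors, k) for k in range(1, n_predictors + 1))

df_ten_by_ten = residual_df(10, 10)             # ten predictors, ten observations
df_full_interactions = residual_df(150, all_terms)

print(all_terms, df_ten_by_ten, df_full_interactions)
```

A negative value means there are more parameters than observations, so a standard regression cannot estimate them all. Missing cases lower the first argument further.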
Timothy, just to clarify. Fabrice's point is that if you use forward selection (or a bunch of other techniques which are now popular), you can find a solution to predict Y from K variables where K>>N, though of course not by entering them all into a standard regression. Emma's situation is more like the typical traditional problems, but with interactions, and if, say, you wanted to allow splines rather than straight lines, the number of dfs in the model goes way up, so that there are computational problems.
The point I was trying to make in my first comment is that there can be interpretational problems too. Suppose you have a million cases and 8 predictors, and everything solves in a straightforward manner (but the predictors are correlated). It is tricky to interpret what each coefficient means because it is conditional on the other 7. This is a problem that I often have when trying to explain results to others. Phrases like "holding 7 things constant but allowing this one to vary" just don't seem to help if you can't do that in reality.
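A small simulation may help show why the conditional coefficient is hard to explain. This sketch uses two correlated predictors instead of eight, with made-up true slopes of 1.0 each. It compares the slope for x1 on its own with the slope for x1 when x2 is also in the model. All numbers and names are assumptions for illustration.

```python
import random

def centered_cross_sum(a, b):
    """Sum of (a - mean(a)) * (b - mean(b)) over paired values."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))

rng = random.Random(1)
n = 5000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * u + rng.gauss(0, 0.6) for u in x1]  # x2 is correlated with x1
y = [1.0 * u + 1.0 * v + rng.gauss(0, 1) for u, v in zip(x1, x2)]

s11 = centered_cross_sum(x1, x1)
s22 = centered_cross_sum(x2, x2)
s12 = centered_cross_sum(x1, x2)
s1y = centered_cross_sum(x1, y)
s2y = centered_cross_sum(x2, y)

# Slope of y on x1 alone (x2 left out of the model).
marginal_slope = s1y / s11

# Slope of x1 with x2 also in the model (two-predictor least squares).
partial_slope = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

print(round(marginal_slope, 2), round(partial_slope, 2))
```

The slope for x1 alone comes out near 1.8, because x1 carries part of x2's effect, while the conditional slope is near 1.0. Neither is wrong, but they answer different questions, and the second one describes a comparison (x1 changing while x2 stays fixed) that may rarely occur in the data.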