I am running an OLS regression with 48 regressors but only 160 observations. Most of these regressors (36) are categorical dummies. Does this have implications for my regression?
Sure does. Drop, say, 10 observations and OLS probably won't even run. If this is a predictive model, do adaptive lasso variable selection (with all the categorical variables, you need adaptive group lasso). If you really need all those IVs, you have to get a bigger n. Good luck. D. Booth. See the attached two papers; they might give you some ideas.
Are these categorical dummies related to one categorical variable or to several categorical variables? In the former case, grouping (based on knowledge, not on the results) can be considered. In the latter case, suppose you have m categorical variables. You can try taking a subset of m - 1 of them (that is, omitting one), which gives m regressions, and see if the results are stable. Maybe you can omit one of them. I remember having seen good results with 48 observations and 20 variables in Vatter et al. (1978).
Thank you for your answers! Guy Mélard: I have a categorical variable for the first two digits of the NACE code of the firm's industry, which results in 26 dummies, and a categorical variable Province, which results in 10 dummies. Because they derive from only a few categorical variables, I thought this would not be problematic. So what you suggest is that I regroup my industry dummies into, for example, divisions, and my province dummies into, for example, regions?
No. Of course you can try to regroup several NACE sectors and/or several provinces. For example, for provinces you can consider North, West, South and East, and similarly for the sectors. But my second suggestion was to use only the NACE sectors, on the one hand, and only the provinces, on the other hand, and see if the results are stable. Anyway, in principle you should also consider interactions (i.e. all products of the NACE and Province dummy variables), but then it is hopeless because you will have 26*10 = 260 dummies in all. In my count I did not include the constant.
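The stability check described above could be sketched roughly as follows. This is only an illustration on simulated data: the regressor of interest `x`, the outcome `y`, the true slope of 0.5 and the helper names `dummies` and `slope_on_x` are all assumptions made for the example, not anything from the actual study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 160

# Simulated stand-ins (assumptions): industry and province codes,
# one regressor of interest x, and an outcome y with true slope 0.5.
industry = rng.integers(0, 27, size=n)  # up to 27 levels -> up to 26 dummies
province = rng.integers(0, 11, size=n)  # up to 11 levels -> up to 10 dummies
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

def dummies(codes):
    """0/1 dummy columns for every observed level except the first (reference)."""
    levels = np.unique(codes)[1:]
    return (codes[:, None] == levels[None, :]).astype(float)

def slope_on_x(*control_blocks):
    """OLS coefficient on x with an intercept and the given control blocks."""
    X = np.column_stack([np.ones(n), x, *control_blocks])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

results = {
    "industry only": slope_on_x(dummies(industry)),
    "province only": slope_on_x(dummies(province)),
    "both": slope_on_x(dummies(industry), dummies(province)),
}
for name, slope in results.items():
    print(f"{name:>14}: slope on x = {slope:.3f}")
```

If the coefficient of interest barely moves across the three fits, that is some reassurance that one of the control sets could be dropped or regrouped; large swings would suggest the opposite.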
David Eugene Booth, I am struggling to determine when the number of regressors is too big. What is the influence on my regression results? What is striking is that the F value points to joint insignificance. Is this a consequence of the small sample size, and does this mean I cannot interpret the results? However, I cannot increase n, since I am doing research on the whole sample.
If your model contains two predictors and the interaction term, you’ll need 30-45 observations. However, if the effect size is small or there is high multicollinearity, you may need more observations per term. Compare this with the 48 regressors in your study.
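To make the comparison concrete, here is a small arithmetic sketch. It assumes the 30-45 figure corresponds to roughly 10 to 15 observations per model term (three terms: two predictors plus the interaction); that reading, and the helper name `required_n`, are assumptions for illustration.

```python
def required_n(n_terms, per_term=(10, 15)):
    """Sample-size range implied by an 'observations per term' rule of thumb.

    The default of 10 to 15 observations per term is an assumed reading
    of the 30-45 figure quoted for a three-term model.
    """
    low, high = per_term
    return n_terms * low, n_terms * high

print(required_n(3))    # two predictors plus their interaction -> (30, 45)
print(required_n(48))   # the 48 regressors in the question -> (480, 720)
print(160 / 48)         # observations per regressor actually available, about 3.3
```

Under that reading, 160 observations fall well short of what the rule of thumb would suggest for 48 terms.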
Timea De Wispelaere, a couple of points. First, the rules of thumb are data dependent, so sometimes 5 per IV is OK and sometimes 10 is OK; there is no single magic value. More is better. If you are forming an explanatory model, that's about all you can say, except that the usual power and sample-size calculations can be done: do one for each term and then take a number greater than max(n_i). For predictive models we can do a little better. I currently like lasso for various reasons. 1. Adaptive lasso has an oracle property (i.e. the model gives you the best predictor set from among the candidate variables by using cross-validation with an information criterion like BIC or AIC). That's good for predictive models. Further lasso models run for n
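For readers who want to see the idea in code, below is a minimal sketch of a plain adaptive lasso with the penalty chosen by BIC. It is not the adaptive group lasso recommended for categorical variables, and it runs on simulated continuous predictors. The ridge initial estimate, the weight exponent `gamma`, the penalty grid and the function names `lasso_cd` and `adaptive_lasso_bic` are all assumptions made for this illustration.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso for (1/2n)*||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()              # current residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]             # remove column j's contribution
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]             # add the updated contribution back
    return b

def adaptive_lasso_bic(X, y, lams, gamma=1.0, ridge=1.0):
    """Adaptive lasso: ridge-based weights, then pick the penalty by BIC."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                 # centre, so no intercept is needed
    yc = y - y.mean()
    b_init = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(p), Xc.T @ yc)
    scale = np.abs(b_init) ** gamma         # inverse of the adaptive weights
    Xs = Xc * scale                         # rescaling applies the weights
    best = None
    for lam in lams:
        b = lasso_cd(Xs, yc, lam) * scale   # map back to the original scale
        rss = float(((yc - Xc @ b) ** 2).sum())
        bic = n * np.log(rss / n) + np.count_nonzero(b) * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, lam, b)
    return best[1], best[2]

# Simulated example (assumption): 160 observations, 20 candidate predictors,
# of which only columns 0, 5 and 10 truly matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(160, 20))
beta = np.zeros(20)
beta[[0, 5, 10]] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(size=160)

best_lam, coef = adaptive_lasso_bic(X, y, np.logspace(-3, 0, 15))
selected = np.flatnonzero(coef)
print("chosen penalty:", best_lam)
print("selected columns:", selected)
```

With dummy-coded categorical variables, a group version that keeps or drops all dummies of one variable together would be the closer match to the advice given earlier in the thread.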
Hello David. Thank you for your valuable answer. However, the main cause of the high number of regressors in my model is the control variables for industry (26 variables) and province (10 variables). Should I also use this lasso approach, knowing that the majority of my regressors are there only as controls?