'Can' a regression have more predictors than participants? If your goal is to fit the data as well as possible, no matter what, then you can have any number of predictors (e.g., "big data"). But if you're developing a scientific model, then I would worry about parsimony. Scientifically, we would like the simplest explanation to explain the most data. Here's a concrete, simple example:
R-square is often misleading, so I'd prefer a "graphical residual analysis."
But assuming your fit really is that good, even when you whittle down to 6 or 7 really good independent variables which work well together, you may have an overfitting problem, meaning you have a model that works very well for those 30 sample members, but it may be too custom-made for them and not work so well for the rest of the population. More data, so that you can see how well you would predict for new cases, would be best. There are different kinds of cross-validation, though. One suggestion I've seen is to pull one (or perhaps three) member(s) of the sample out at a time, rotating through, use the others to fit the model, and see how well you predict for the ones left out.
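That rotation scheme is what is usually called leave-one-out (or leave-p-out) cross-validation. A minimal sketch in Python, where X and y are placeholders standing in for the 30 cases and the 6 or 7 retained predictors, might look like this:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, LeavePOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 7))                               # stand-in for the 6-7 retained predictors
y = X[:, :2].sum(axis=1) + rng.normal(scale=0.5, size=30)  # stand-in outcome

model = LinearRegression()

# Leave one member out at a time, fit on the other 29, predict the one held out.
loo = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("leave-one-out mean squared prediction error:", round(-loo.mean(), 3))

# Leaving three out at a time is the same idea; note it means C(30, 3) = 4060 refits.
lpo = cross_val_score(model, X, y, cv=LeavePOut(p=3), scoring="neg_mean_squared_error")
print("leave-three-out mean squared prediction error:", round(-lpo.mean(), 3))

If the held-out prediction error is much worse than the in-sample fit, that is the overfitting worry described above.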
For your model, some people may like principal components analysis, but interpretation may be problematic.
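For what a principal components approach might look like in practice, here is a rough Python sketch (again with placeholder X and y); the interpretation problem mentioned above shows up in the loadings, since each component is a mixture of all the original variables:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 7))                            # placeholder predictors
y = X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=30)  # placeholder outcome

Z = StandardScaler().fit_transform(X)    # PCA is scale-sensitive, so standardise first
pca = PCA(n_components=3)
scores = pca.fit_transform(Z)            # 3 composite predictors replace the 7 originals

fit = LinearRegression().fit(scores, y)
print("variance explained by the components:", pca.explained_variance_ratio_.round(2))
print("loadings (rows = components, columns = original variables):")
print(pca.components_.round(2))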
There is a bias-variance tradeoff phenomenon which says that, usually, when you add independent variables/complexity you add variance*, and fewer variables may mean bias (like omitted-variable bias). But for your problem I wonder if you have bias, not in the sense of failing to model your sample well, but of modeling it too well, if the population may be substantially different. By different, I mean different in the model relationship used for prediction. Your data could look 'different' from the population but still have the same model relationship to the regressors used, though that may generally be unlikely.
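A toy illustration of the variance side of that tradeoff (and of the Brewer quote in the footnote below): adding regressors with no real explanatory power tends to inflate estimated variances even while R-squared creeps upward. Everything here is invented for the illustration:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 30
x_real = rng.normal(size=n)        # one genuinely relevant predictor
x_junk = rng.normal(size=(n, 4))   # four irrelevant "junk" predictors
y = 2.0 * x_real + rng.normal(size=n)

small = sm.OLS(y, sm.add_constant(x_real)).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x_real, x_junk]))).fit()

print("SE of the real coefficient, 1 regressor: ", round(small.bse[1], 3))
print("SE of the real coefficient, 5 regressors:", round(big.bse[1], 3))
print("R-squared:", round(small.rsquared, 3), "->", round(big.rsquared, 3))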
It seems odd to reduce your number of independent variables/predictors/regressors so far, and still have overfitting, but with a small sample, I think that is perhaps still a problem.
.........................
* On pages 109-110 of Brewer, K.R.W. (2002), Combined survey sampling inference: Weighing Basu's elephants, Arnold: London and Oxford University Press, Ken says that "It is well known that regressor variables, when introduced for reasons other than that they may have appreciable explanatory power, tend to increase rather than decrease the estimates of variance." I found a notable case of this in my work: electric power plants that switch fuels need an additional regressor or regressors (independent variables) to help predict use of a given fuel, when past use of each fuel is among the regressors. When there was little or no fuel switching, one or more additional variables slightly increased variance. When there was substantial fuel switching, which we did not know until after data collection and processing for a frequent official data publication, those additional variables greatly reduced the estimated variance of the prediction error.
...........
Perhaps you could try even fewer variables and see if your results are about as good. But you do not want to throw out any important regressors, and you have reduced the number of regressors a great deal already. Also, it may depend on the combination of regressors more than any one important one.
Perhaps you could try other sets of regressors and compare models on the same scatterplot using graphical residual analysis. If cross-validation and your knowledge of the subject matter indicate that one model is likely to be generally better for the population, you could choose that way. It may also depend partly on which regressors have the best data quality. Also, if you know something about the population, you might consider your model good for part of it, but you may need other data and another model for another part of it.
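Putting two candidate models on the same residual plot is straightforward; here is a minimal sketch (placeholder data, and the two regressor sets are arbitrary choices for illustration):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 7))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=30)

candidates = {"model A": [0, 1, 2], "model B": [0, 4, 5, 6]}  # two trial regressor sets

for label, cols in candidates.items():
    fit = LinearRegression().fit(X[:, cols], y)
    pred = fit.predict(X[:, cols])
    plt.scatter(pred, y - pred, label=label, alpha=0.7)  # residuals vs. predicted

plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("predicted")
plt.ylabel("residual")
plt.legend()
plt.show()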
It has no scientific credence. Your results may be optimum for the data you have but will not generalize.
There were a number of papers in the 1960s that showed you could derive perfect models (R-squared of 100%) from pure random noise. You have clearly used some method in going from 200 to 6/7 variables, and if that involves some badness-of-fit criterion it will undoubtedly capitalise on chance results. You need some form of cross-validation built into the process, so that you see how well the candidate model does with data that has been (randomly) held out.
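The 1960s point is easy to reproduce with a short simulation: regress pure noise on enough pure-noise predictors and the in-sample R-squared climbs toward 100%, while the cross-validated prediction error does not improve at all. A sketch (all data simulated):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(4)
n = 30
y = rng.normal(size=n)              # the outcome is pure noise

for p in (5, 15, 29):
    X = rng.normal(size=(n, p))     # the predictors are pure noise too
    r2_in = LinearRegression().fit(X, y).score(X, y)
    cv_mse = -cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                              scoring="neg_mean_squared_error").mean()
    print(f"p = {p:2d}: in-sample R^2 = {r2_in:.2f}, leave-one-out MSE = {cv_mse:.2f}")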
This is a useful piece - I would start all over again
Above I suggested that Indrajeet might "...have a model that works very well for those 30 sample members, but is ... too custom made for them and will ... not work so well for the rest of the population."
You mentioned composite predictors, but it seems odd to me that that would solve the problem of a sample size that is too small for all these predictors or the information/relationships they suggest, even when represented in composite. However, these notes from Northwestern Kellogg seem to agree with you about composite predictors (see pages 9 and 10), while also agreeing not to throw in "junk" predictors: https://www.kellogg.northwestern.edu/faculty/dranove/htm/dranove/coursepages/Mgmt%20469/choosing%20variables.pdf. The quote from Ken Brewer I had above is consistent with the Bias-Variance Tradeoff noted in the area of statistical learning, and would seem to warn against the enthusiasm for more information noted at the bottom of page 9, except if it is quite important.
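For concreteness, a composite predictor along the lines of those notes is often just several standardised variables averaged into one index, so the regression spends one degree of freedom instead of several. A hypothetical sketch (the grouping of the first three columns is invented):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 6))                 # six related raw predictors (placeholder)
y = X.mean(axis=1) + rng.normal(scale=0.5, size=30)

Z = StandardScaler().fit_transform(X)
composite = Z[:, :3].mean(axis=1)            # one index replaces the first three variables
X_reduced = np.column_stack([composite, Z[:, 3:]])

fit = LinearRegression().fit(X_reduced, y)
print("columns before:", X.shape[1], "-> after:", X_reduced.shape[1])
print("R^2 with the composite:", round(fit.score(X_reduced, y), 3))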
In that article for which you provided a link, "Stopping stepwise: Why stepwise selection is bad and what you should use instead," by Peter Flom, he notes that he believes that "...no method can be sensibly applied in a truly automatic manner. ... to be general, the evaluation should not rely on the particular issues related to a particular problem. However, in actually solving data analytic problems, these particularities are essential." I like that because to me it means both data and subject matter, and I have tried to emphasize subject matter considerations elsewhere, as well as data issues causing spurious results. Flom notes that "...no method can substitute for substantive and statistical expertise...."
Stepwise may be particularly bad, but there is no one-size-fits-all method.
I think that the question involves some degree of generalisation. Very high R-squared values might be viewed with scepticism in the social sciences. Can such very high values be viewed with the same scepticism in the physical/engineering sciences? In any case, it has been stated earlier that a significance test is essential.
Harold Chike My skepticism comes from the method - 200 to 6/7; knowing that, I would doubt that this is the Royal road to the Truth. To which I would add the substantive knowledge of the process. In the physical sciences, weather forecasting works well for a few days but then struggles. But on a normal weekday I - a social scientist - can forecast traffic into a city at different times pretty well, and I can adjust for weekends, the holiday period, and football matches. So sometimes the human world can be predictable. Closed systems can be predicted; open systems which are capable of self-change are much more difficult. One can get an excellent prediction without understanding, contra positivism. But when we try to extrapolate we can hit a wall. Just look at what has happened to the 'engineering' of the Boeing 737 Max.
True experiments are about 'controlling' for other influences and making the world more like a machine with predictable outcomes, but that does not on its own bring understanding of what is going on. Nor does it mean that it is always predictable in all circumstances. These are big questions!
"...significance test is essential." - Well, I question the usefulness of 'significance tests' when estimation, prediction, and variance can be more readily interpretable, and less likely to be misunderstood or misused. Our questions should be of the nature "About how much?" not questions we want answered with a "Yes," or a "No."
Galit Shmueli put a public version of her Statistical Science article right here on ResearchGate: https://www.researchgate.net/publication/48178170_To_Explain_or_to_Predict.
I recall an early version of that paper she did when at the U of Maryland. She changed it quite a lot. I actually liked the earlier version better. In this one on page 6 she talks about the "EPE," which includes bias and variance, though I've seen that when you estimate sigma and the model is biased, that sigma is already impacted in practice. (I don't know if Dr Shmueli said anything about that.)
The example in the appendix shows that the more "correct" model can sometimes give you a less accurate prediction. However, when looking at the explanation aspect of regression, with a number of "independent" variables, I suspect that the influence of variables on each other means that the best combination of independent variables may not be the same as the combination of the best independent variables. That is, you cannot just take the independent variables you separately consider of high explanatory value, and think that together they explain more. Certainly they may predict better, but they might also explain more in certain combinations than others if you know the subject matter. (I don't know if Dr. Shmueli said anything about that.)
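That distinction between 'the best combination of variables' and 'the combination of the best variables' can be shown with a contrived suppressor example: below, x3 has the strongest individual correlation with y, yet the pair (x1, x2) explains far more together than any pair involving x3. All data are simulated for the illustration:

import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)        # nearly collinear with x1
y = x1 - x2 + 0.05 * rng.normal(size=n)   # depends on the difference, so each alone looks weak
x3 = y + 0.2 * rng.normal(size=n)         # the strongest single predictor on its own
X = np.column_stack([x1, x2, x3])

print("individual correlations with y:",
      np.round([np.corrcoef(X[:, j], y)[0, 1] for j in range(3)], 2))

for cols in combinations(range(3), 2):
    Xs = X[:, list(cols)]
    print("pair", cols, "R^2 =", round(LinearRegression().fit(Xs, y).score(Xs, y), 2))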
Is using n < p your advice to Indrajeet, David? The concern here is overfitting to a particular sample. As Kelvyn put it, "Your results may be optimum for the data you have but will not generalize." That was for the 6 or 7 variables picked, but for a lot more, I expect that it is liable to be worse. Different circumstances may indicate different approaches. What about Indrajeet's question?
I have just watched the 2nd of the Royal Institution Xmas lectures which is about algorithmic learning - it showed some successes and some hilarious failures - just as you would expect
Mathematician Dr Hannah Fry presents the 2019 CHRISTMAS LECTURES – Secrets and lies: The hidden power of maths. Broadcast on BBC 4 at 8pm on 26, 27 and 28 December.