'Can' a regression have more predictors than participants? If your goal is to fit the data as well as possible, no matter what, then you can have any number of predictors (e.g., "big data"). But if you're developing a scientific model, then I would worry about parsimony. Scientifically, we would like the simplest explanation to explain the most data. Here's a concrete, simple example:
Sounds terrific and too good to be true. I would check the raw values of the variables to make sure that the values of some variables are not replicas of values from other variables.
R-square is often misleading, so I'd prefer a "graphical residual analysis."
But assuming your fit really is that good, even after whittling down to 6 or 7 strong independent variables that work well together, you may have an overfitting problem: a model that works very well for those 30 sample members but is so custom-made for them that it will not work nearly as well for the rest of the population. More data, so that you could see how well you predict for new cases, would be best. There are various forms of cross-validation, though. One suggestion I've seen is to pull one member (or perhaps three) of the sample out at a time, rotating through, fit the model on the remaining members, and see how well you predict for the ones left out; a sketch of that idea follows.
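A minimal leave-one-out sketch in Python, using simulated stand-in data rather than the actual 30-by-6 dataset (all names and values here are hypothetical):

```python
# Leave-one-out cross-validation sketch: with n = 30, each round fits on 29 rows
# and predicts the single held-out row. Data below are simulated placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))                     # stand-in for the 6 chosen predictors
y = X @ rng.normal(size=6) + rng.normal(size=30)

loo_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
press = np.sum((y - loo_pred) ** 2)              # prediction sum of squares
q2 = 1 - press / np.sum((y - y.mean()) ** 2)     # out-of-sample ("predictive") R^2

print("In-sample R^2:", LinearRegression().fit(X, y).score(X, y))
print("Leave-one-out Q^2:", q2)
```

A large gap between the in-sample R^2 and the leave-one-out value is the overfitting warning sign discussed above.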
For your model, some people may like principal components analysis, but interpretation may be problematic.
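For what it's worth, a hedged sketch of principal components regression on simulated placeholder data; the interpretation problem mentioned above is that the components are weighted mixtures of the original variables:

```python
# Principal components regression sketch: compress the predictor matrix into a few
# components, then regress the outcome on those components. Simulated data only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))                  # stand-in for many candidate predictors
y = X[:, :3].sum(axis=1) + rng.normal(size=30)

pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print("Explained variance ratios:", pcr.named_steps["pca"].explained_variance_ratio_)
print("R^2 on 3 components:", pcr.score(X, y))
```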
There is a bias-variance tradeoff at work here: usually, adding independent variables/complexity adds variance*, while too few variables can mean bias (as in omitted-variable bias). For your problem, though, I wonder whether you have bias not in the sense of failing to model your sample well, but of modeling it too well, when the population may be substantially different. By different, I mean the model relationship used for prediction: your data could look 'different' yet have the same model relationship to the regressors used, though that may generally be unlikely.
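For reference, the textbook decomposition of expected prediction error at a point $x_0$ (standard background, not anything specific to the dataset in question):

```latex
\mathrm{EPE}(x_0) \;=\; \underbrace{\sigma^2}_{\text{irreducible error}}
\;+\; \underbrace{\bigl[\operatorname{Bias}\hat f(x_0)\bigr]^2}_{\text{bias}^2}
\;+\; \underbrace{\operatorname{Var}\hat f(x_0)}_{\text{variance}}
```

Adding regressors typically shrinks the bias term and inflates the variance term; the overfitting worry above is the extreme case where a flattering in-sample fit hides a large variance (and possibly bias) with respect to the population.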
It seems odd to reduce the number of independent variables/predictors/regressors so far and still have overfitting, but with a sample this small I think that is perhaps still a problem.
.........................
* On pages 109-110 of Brewer, K.R.W. (2002), Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold: London and Oxford University Press, Ken says that "It is well known that regressor variables, when introduced for reasons other than that they may have appreciable explanatory power, tend to increase rather than decrease the estimates of variance." I found a notable case in my own work: electric power plants that switch fuels need one or more additional regressors (independent variables) to help predict a given fuel's use when past use of each fuel serves as a regressor. When there was little or no fuel switching, the additional variables slightly increased variance. When there was substantial fuel switching, which we did not know until after data collection and processing for a frequent official data publication, the estimated variance of the prediction error was greatly reduced.
...........
Perhaps you could try even fewer variables and see if your results are about as good. But you do not want to throw out any important regressors, and you have reduced the number of regressors a great deal already. Also, it may depend on the combination of regressors more than any one important one.
Perhaps you could try other sets of regressors and compare the models on the same scatterplot using graphical residual analysis (a sketch follows). If cross-validation and your knowledge of the subject matter indicate that one is likely to be generally better for the population, you could choose that way. It may also depend partly on which regressors have the best data quality. Also, if you know something about the population, you might consider your model good for part of it, but you may need other data and another model for another part of it.
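A minimal sketch of what "comparing models on the same scatterplot" could look like, with simulated placeholder data and two arbitrary candidate regressor sets:

```python
# Graphical residual analysis sketch: plot residuals vs. predictions for two
# candidate models on the same axes and look for curvature, funnels, or outliers.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_and_residuals(X, y):
    model = LinearRegression().fit(X, y)
    pred = model.predict(X)
    return pred, y - pred

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 7))
y = X[:, :3] @ np.array([1.0, 0.5, -0.8]) + rng.normal(scale=0.5, size=30)

pred_a, res_a = fit_and_residuals(X[:, :3], y)   # candidate regressor set A
pred_b, res_b = fit_and_residuals(X[:, 3:], y)   # candidate regressor set B

plt.scatter(pred_a, res_a, marker="o", label="model A")
plt.scatter(pred_b, res_b, marker="x", label="model B")
plt.axhline(0.0, linewidth=1)
plt.xlabel("predicted y")
plt.ylabel("residual")
plt.legend()
plt.show()
```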
Another concern is that the number of predictors/regressors may be too high for the relatively small sample size of 30. If possible, you could compute composite scores for clusters of variables (two or more predictors merged into one) based on factor analysis and run the multiple regression using a smaller number of composite predictors; see the sketch below.
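A hedged sketch of that composite-score idea, using scikit-learn's factor analysis on simulated placeholder data (a real analysis would inspect the loadings and choose composites on substantive grounds):

```python
# Factor-analysis composites sketch: extract a few factor scores from a block of
# related predictors and regress the outcome on those scores instead.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 12))                 # stand-in for a block of related predictors
y = X.mean(axis=1) + rng.normal(scale=0.5, size=30)

Z = StandardScaler().fit_transform(X)
scores = FactorAnalysis(n_components=2, random_state=0).fit_transform(Z)  # 2 composites
fit = LinearRegression().fit(scores, y)
print("R^2 using 2 composite predictors:", fit.score(scores, y))
```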
It has no scientific credence. Your results may be optimum for the data you have but will not generalize.
There were a number of papers in the 1960s that showed you could derive "perfect" models (R-squared of 100%) from pure random noise. You have clearly used some method to go from 200 to 6/7 variables, and if that involved some measure of fit it will undoubtedly capitalise on chance results. You need some form of cross-validation built into the process, so that you can see how well the candidate model does with data that have been (randomly) omitted.
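A small simulation of that 1960s point, with everything being pure noise (numbers are illustrative only):

```python
# With as many fitted parameters as observations, OLS reproduces pure noise exactly
# (R^2 = 1); even screening 200 noise columns down to the 6 most correlated with y
# yields a flattering in-sample R^2 that would not generalize.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n, p = 30, 200
X = rng.normal(size=(n, p))      # 200 pure-noise "predictors"
y = rng.normal(size=n)           # pure-noise "outcome"

# Saturated fit: 29 noise columns plus an intercept match the 30 observations exactly.
print("R^2 with 29 noise predictors:",
      LinearRegression().fit(X[:, :29], y).score(X[:, :29], y))

# Screen all 200 columns by |correlation with y| and keep the "best" 6.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
best6 = np.argsort(corr)[-6:]
print("R^2 with 6 screened noise predictors:",
      LinearRegression().fit(X[:, best6], y).score(X[:, best6], y))
```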
This is a useful piece - I would start all over again
Above I suggested that Indrajeet might "...have a model that works very well for those 30 sample members, but is ... too custom made for them and will ... not work so well for the rest of the population."
You mentioned composite predictors, but it seems odd to me that that would solve the problem of a sample size that is too small for all these predictors, or for the information/relationships they suggest, even when represented in composite form. However, these notes from Northwestern Kellogg seem to agree with you about composite predictors (see pages 9 and 10), while also agreeing not to throw in "junk" predictors: https://www.kellogg.northwestern.edu/faculty/dranove/htm/dranove/coursepages/Mgmt%20469/choosing%20variables.pdf. The quote from Ken Brewer above is consistent with the bias-variance tradeoff noted in statistical learning, and would seem to warn against the enthusiasm for more information noted at the bottom of page 9, except when that information is quite important.
In that article for which you provided a link, "Stopping stepwise: Why stepwise selection is bad and what you should use instead," by Peter Flom, he notes that he believes that "...no method can be sensibly applied in a truly automatic manner. ... to be general, the evaluation should not rely on the particular issues related to a particular problem. However, in actually solving data analytic problems, these particularities are essential." I like that because to me it means both data and subject matter, and I have tried to emphasize subject matter considerations elsewhere, as well as data issues causing spurious results. Flom notes that "...no method can substitute for substantive and statistical expertise...."
Stepwise may be particularly bad, but there is no one-size-fits-all method.
I think the question involves some degree of generalisation. Very high R-squared values might be regarded with scepticism in the social sciences; can such very high values also be regarded with scepticism in the physical/engineering sciences? In any case, it has been stated earlier that a significance test is essential.
Harold Chike My skepticism comes from the method (200 variables down to 6/7); knowing that, I would doubt that this is the royal road to the truth. To which I would add the importance of substantive knowledge of the process. In the physical sciences, weather forecasting works well for a few days but then struggles. But on a normal weekday I, a social scientist, can forecast traffic into a city at different times pretty well, and I can adjust for weekends, the holiday period, and football matches. So sometimes the human world can be predictable. Closed systems can be predicted; open systems that are capable of self-change are much more difficult. One can get an excellent prediction without understanding, contra positivism. But when we try to extrapolate we can hit a wall. Just look at what has happened to the 'engineering' of the Boeing 737 Max.
True experiments are about 'controlling' for other influences and making the world more like a machine with predictable outcomes, but that does not on its own bring understanding of what is going on. Nor does it mean that outcomes are always predictable in all circumstances. These are big questions!
"...significance test is essential." - Well, I question the usefulness of 'significance tests' when estimation, prediction, and variance can be more readily interpretable, and less likely to be misunderstood or misused. Our questions should be of the nature "About how much?" not questions we want answered with a "Yes," or a "No."
Galit Shmueli put a public version of her Statistical Science article right here on ResearchGate: https://www.researchgate.net/publication/48178170_To_Explain_or_to_Predict.
I recall an early version of that paper from when she was at the University of Maryland. She changed it quite a lot; I actually liked the earlier version better. In this one, on page 6, she discusses the "EPE," which includes bias and variance, though I've seen that when you estimate sigma and the model is biased, that sigma is already impacted in practice. (I don't know whether Dr. Shmueli said anything about that.)
The example in the appendix shows that the more "correct" model can sometimes give you a less accurate prediction. However, when looking at the explanation aspect of regression, with a number of "independent" variables, I suspect that the influence of variables on each other means that the best combination of independent variables may not be the same as the combination of the best independent variables. That is, you cannot just take the independent variables you separately consider of high explanatory value, and think that together they explain more. Certainly they may predict better, but they might also explain more in certain combinations than others if you know the subject matter. (I don't know if Dr. Shmueli said anything about that.)
Is using n < p your advice to Indrajeet, David? The concern here is overfitting to a particular sample. As Kelvyn put it, "Your results may be optimum for the data you have but will not generalize." That was for the 6 or 7 variables picked, but for a lot more, I expect that it is liable to be worse. Different circumstances may indicate different approaches. What about Indrajeet's question?
I have just watched the second of the Royal Institution Christmas Lectures, which is about algorithmic learning. It showed some successes and some hilarious failures, just as you would expect.
Mathematician Dr Hannah Fry presents the 2019 CHRISTMAS LECTURES – Secrets and lies: The hidden power of maths. Broadcast on BBC 4 at 8pm on 26, 27 and 28 December.
Thank you immensely, Prof. James R Knaub. The precise answer to the question asked by Indrajeet, the originator of our intellectual discussion, has not been provided.
We have rather exhibited our own particular experiences. My very good friend Prof. David Eugene Booth had previously misunderstood my approach of providing an answer to the question before going into the dialectics of the surrounding intellectual discussion. Such discussions need to provide conclusive answers where possible.
We have now left Indrajeet Indrajeet to sort out the needed answer from our discussions.
That is quite appropriate, but only for advanced researchers in the specialty. Thank you all.
This just popped up in my feed, but reading through the answers from some very bright commentators I was surprised no one asked two questions which I believe are necessary to address this (and the questions may be related, depending on the answers). There is also a point I'll add on why R^2 values near one sometimes occur.
1. How was the R^2 value adjusted? In particular, did it take into account the total number of candidate variables (~200), was it based on the R^2 from a sample separate from the one used to choose the predictors in the model, or something else? (The usual adjustment formula is reproduced after these two questions.)
2. How were the 6-7 variables selected, and in particular, were they selected on the basis of characteristics independent of the n = 30 study?
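For context on question 1, the standard adjusted R^2 (a textbook formula, not anything given in the original question) is:

```latex
\bar{R}^2 \;=\; 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

With n = 30 this adjustment becomes undefined or explosive as p approaches 29, and it does not correct for having screened roughly 200 candidate variables before settling on the 6-7 that were fit.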
The point is:
3. Sometimes R^2 values near 1 (or at 1; we don't know how the adjusted statistic was calculated, so it might be exactly 1, but close values can also occur depending on how variables are created) turn up in projects I read because the student calculates a mean or a sum of several variables, forgets what this variable is, and later uses it as a response variable with predictors that include the very variables used to create it.
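A toy illustration of point 3 (hypothetical data; the variable names are made up):

```python
# If the response is, perhaps unknowingly, the sum of some of the predictors,
# regressing it on those predictors gives R^2 = 1 by construction.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
items = rng.normal(size=(30, 4))     # four item scores
total = items.sum(axis=1)            # composite computed earlier and later forgotten

print("R^2:", LinearRegression().fit(items, total).score(items, total))  # exactly 1.0
```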
It is worth agreeing with the respondents here that in a lot of research contexts this design would be poor. You don't provide enough information in your question for me to confidently say the design is poor (I can confidently say the question is poor because it is incomplete).
David Eugene Booth, it is ironic that you bring up GWAS and SNP studies to justify black-box methods. You ask:
"James has the use of microarray data not revolutionized cancer research?"
My answer: NO, it has not! So what have decades of mindless data dredging over terabytes of what is functionally biological noise (except for the expected long tail of multicollinear, perfectly fit noise) given us? Mainly more utopian predictions of the coming genomics revolution that is always just around the corner. That, and the realization of how widespread magical thinking, cargo-cult science, and an uncritical and astonishing disregard for empiricism have become, alongside false promises of clinical outcomes made to the public.
What's funny is that the workers who churn out publications with yet another clinically irrelevant cancer-'gene' association also predict addiction, demographics, and just about anything else that can be 'analysed' via overfitting algorithms that always find something. There is no need to even take account of, or have any knowledge of, physiology or pharmacology, or to worry in the least about case ethics.
They are hard at work as we speak on COVID, churning out associations and hand-wavy discussions on drugs of interest (that they conveniently never have to empirically verify the effect of). Surely, something of the failure of the current prediction-without-verification approach to statistical modelling has finally dawned here in 2020.
I have seen papers with p-values of 10^-18 or less for some interactions. It doesn't matter how a statistic is meant to be interpreted or in what transform it was derived; when its reciprocal approaches the number of particles in the universe, it's time to step back.
But David Eugene Booth, you are welcome to point out where the application of mindless statistics to nucleotide data (gene expression is too strong a term, I think) has led to actual, meaningful clinical applications. That would presumably be the claim, given that it is necessarily a translational goal and not simply an endless fog of delusional clarity ("understanding"). By applications, I mean interventions and cures, not a best guess at which medicine to take when there is no meaningful difference in outcomes.
By the way, while it is possible that some rare genetic disease could be reproducibly categorized in this way, to what end, practically? Moreover, these investigators typically only guess what kinds of patterns to trace because of a priori empirical description - so the contribution is at best secondary.
Miky Timothy Why this has now appeared in my feed is quite strange, but in any case I was not talking about OLS regression, which Kelvyn Jones has characterized quite well, but rather adaptive lasso regression, which is well characterized in the literature that we cited. I think that if you have scientific concerns, then you should publish them in scientific venues where they may be judged on their merits in the appropriate manner. By the way, your commentary
http://atm.amegroups.com/article/view/19244/html
is such an article. Unfortunately, you do not seem to have read it. It criticizes stepwise methods just as our paper does, and it does not mention any of the techniques that were actually used in our paper. The paper you cited discusses GWAS methods, which are quite different from what we considered: two human genes studied by adaptive lasso, not the OLS methods used in a genome-wide study. Thus your citation is simply irrelevant to a discussion of our work, though it is well worth reading on its own merits. I also suggest you read our paper and its citations, which explain how the adaptive lasso prevents overfitting; that is one of the main reasons for using it.
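For readers unfamiliar with the method, here is a minimal sketch of the adaptive-lasso idea, not the actual pipeline from the paper, and on simulated data: each coefficient's L1 penalty is weighted by the inverse of an initial estimate, which can be implemented by rescaling the columns before an ordinary lasso fit.

```python
# Adaptive lasso sketch: penalty weights w_j = 1 / |initial beta_j|^gamma shrink
# likely-irrelevant coefficients harder, which reduces overfitting from selection.
import numpy as np
from sklearn.linear_model import LassoCV, Ridge

rng = np.random.default_rng(6)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.r_[np.array([3.0, -2.0, 1.5]), np.zeros(p - 3)]   # only 3 true signals
y = X @ beta + rng.normal(size=n)

gamma = 1.0
init = Ridge(alpha=1.0).fit(X, y).coef_          # first-stage estimate
w = 1.0 / (np.abs(init) ** gamma + 1e-8)         # adaptive penalty weights
X_tilde = X / w                                  # column rescaling implements the weights

lasso = LassoCV(cv=5).fit(X_tilde, y)            # ordinary cross-validated lasso
adaptive_coef = lasso.coef_ / w                  # transform back to the original scale
print("Selected predictors:", np.flatnonzero(adaptive_coef).tolist())
```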
David Eugene Booth, blame me for it popping up. It came up on mine, and I read the comments and replied, and then saw the date. By the way, I think this is another case of the questioner not providing enough information and commentators assuming a lot about what the person meant (see my comment above). It might be useful if RG had a separate way to ask the questioner to clarify things for people giving answers (and probably also a way to reply to a comment rather than to the question itself). Anyway, hope everybody enjoyed Nevada Day yesterday, and for those who celebrate Halloween, enjoy that today!
There is no need for blame, Daniel Wright. You replied to the original question and added another take on the problem, which will be useful for those who search the topic. I would add that R^2 is rather uninformative, and that RMSE, MAE, and residual diagnostics are of greater practical interest (a sketch follows). The question is so opaque and without context that it is impossible to answer definitively.
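A short sketch of those metrics, on simulated placeholder data, to make the point concrete:

```python
# Report error on the outcome's own scale (RMSE, MAE) and inspect residuals,
# rather than leaning on R^2 alone. Simulated data, illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -1.2]) + rng.normal(scale=0.7, size=30)

model = LinearRegression().fit(X, y)
pred = model.predict(X)
resid = y - pred

print("RMSE:", np.sqrt(mean_squared_error(y, pred)))
print("MAE :", mean_absolute_error(y, pred))
print("Residuals: min %.2f, median %.2f, max %.2f"
      % (resid.min(), np.median(resid), resid.max()))
```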
David Eugene Booth, my comment appeared in your timeline because I was responding to what you wrote - it was from April but I fail to see what is so strange about this. Nor do I understand how you completely misread the point of my comment, given the context and specifically the in-text quote from you. Below is your statement in full:
James, has the use of microarray data not revolutionized cancer research?
Well, Prof. Timothy, I certainly appreciate your helpful comments. Perhaps you would care to comment on the over 5,000 patients who provided data for the first and subsequent studies I was involved in. Full details of the study group are contained in the first portion of the study, as published in Cancer Research by the group prior to my joining it. Perhaps these patients had some reason for consenting to join the study. I have been involved in clinical trials myself as a subject (i.e., the VITAL study), and I had reasons for joining beyond the excellent monthly newsletter the staff produced.

If you have read our paper, you will have seen that the large SELECT trial was stopped early because the group treated with selenium showed an unexpectedly high rate of prostate cancer. Based on our work, we have been able to propose a mechanism for that stoppage which, if supported by current work, could make selenium again a possible treatment for certain types of prostate cancer. While we have not cured anyone of cancer, we have proposed a reasonable candidate for treatment of a prostate cancer subtype. While we would like to be able to say more, our current progress does not allow that. We hope that our work, combined with that of others, may revive the SELECT study, which may actually provide a new treatment for this cancer subtype. This is how research progresses.

If you consider the Cancer Research paper previously mentioned and cited in the SRP paper, this later work would not have been possible when the first paper was published, because the adaptive lasso methods used were not available at that point; there were no methods available to do that in 2008. By the way, would I be confident in signing a surgical release? If this work ended the way we hope, it would not lead to surgical intervention. However, I would be willing to sign a release for any treatment successfully developed by this approach, at least as confidently as I signed my release for open heart surgery two years ago.
Specifically, my earlier reply was to James' comment on the use of modern selection techniques. The reason I made the comment is that I believe that our work, and that of others, is valuable for the reasons mentioned above. I hope that this has settled your fears that we are irrelevant. Best wishes to you and your group in New York, David Booth
David Eugene Booth, my thanks for your thoughtful and detailed responses to my questions. It is heartening to hear that you and your collaborators have indeed put a lot of thought into the clinical ends of this work. This sadly is very often not the case, and has made me a cynic (justifiably I believe).
It is also instructive (and relevant in the context of this thread) to read your description of how a (seemingly abstract) statistical procedure is used in applied medicine.
Best wishes and the best of health to you and your colleagues in Ohio!