'Can' a regression have more predictors than participants? If your goal is to fit the data as well as possible, no matter what, then you can have any number of predictors (e.g., "big data"). But if you're developing a scientific model, then I would worry about parsimony. Scientifically, we would like the simplest explanation to explain the most data. Here's a concrete, simple example:
Sounds terrific and too good to be true. I would check the raw values of the variables to make sure that the values of some variables are not replicas of values from other variables.
R-square is often misleading, so I'd prefer a "graphical residual analysis."
But assuming your fit really is that good, even after whittling down to 6 or 7 strong independent variables that work well together, you may have an overfitting problem: a model that works very well for those 30 sample members but is so custom-made for them that it will not work nearly as well for the rest of the population. More data, so that you could see how well you predict for new cases, would be best. There are various forms of cross-validation, though. One suggestion I've seen is to pull one member (or perhaps three) of the sample out at a time, rotating through, fit the model on the remaining members, and see how well you predict for the ones left out; a sketch of that idea follows.
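A minimal leave-one-out sketch in Python, using simulated stand-in data rather than the actual 30-by-6 dataset (all names and values here are hypothetical):

```python
# Leave-one-out cross-validation sketch: with n = 30, each round fits on 29 rows
# and predicts the single held-out row. Data below are simulated placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))                     # stand-in for the 6 chosen predictors
y = X @ rng.normal(size=6) + rng.normal(size=30)

loo_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
press = np.sum((y - loo_pred) ** 2)              # prediction sum of squares
q2 = 1 - press / np.sum((y - y.mean()) ** 2)     # out-of-sample ("predictive") R^2

print("In-sample R^2:", LinearRegression().fit(X, y).score(X, y))
print("Leave-one-out Q^2:", q2)
```

A large gap between the in-sample R^2 and the leave-one-out value is the overfitting warning sign discussed above.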
For your model, some people may like principal components analysis, but interpretation may be problematic.
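For what it's worth, a hedged sketch of principal components regression on simulated placeholder data; the interpretation problem mentioned above is that the components are weighted mixtures of the original variables:

```python
# Principal components regression sketch: compress the predictor matrix into a few
# components, then regress the outcome on those components. Simulated data only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))                  # stand-in for many candidate predictors
y = X[:, :3].sum(axis=1) + rng.normal(size=30)

pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print("Explained variance ratios:", pcr.named_steps["pca"].explained_variance_ratio_)
print("R^2 on 3 components:", pcr.score(X, y))
```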
There is a bias-variance tradeoff at work here: usually, adding independent variables/complexity adds variance*, while too few variables can mean bias (as in omitted-variable bias). For your problem, though, I wonder whether you have bias not in the sense of failing to model your sample well, but of modeling it too well, when the population may be substantially different. By different, I mean the model relationship used for prediction: your data could look 'different' yet have the same model relationship to the regressors used, though that may generally be unlikely.
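For reference, the textbook decomposition of expected prediction error at a point $x_0$ (standard background, not anything specific to the dataset in question):

```latex
\mathrm{EPE}(x_0) \;=\; \underbrace{\sigma^2}_{\text{irreducible error}}
\;+\; \underbrace{\bigl[\operatorname{Bias}\hat f(x_0)\bigr]^2}_{\text{bias}^2}
\;+\; \underbrace{\operatorname{Var}\hat f(x_0)}_{\text{variance}}
```

Adding regressors typically shrinks the bias term and inflates the variance term; the overfitting worry above is the extreme case where a flattering in-sample fit hides a large variance (and possibly bias) with respect to the population.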
It seems odd to reduce the number of independent variables/predictors/regressors so far and still have overfitting, but with a sample this small I think that is perhaps still a problem.
.........................
* On pages 109-110 of Brewer, K.R.W. (2002), Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold: London and Oxford University Press, Ken says that "It is well known that regressor variables, when introduced for reasons other than that they may have appreciable explanatory power, tend to increase rather than decrease the estimates of variance." I found a notable case in my own work: electric power plants that switch fuels need one or more additional regressors (independent variables) to help predict a given fuel's use when past use of each fuel serves as a regressor. When there was little or no fuel switching, the additional variables slightly increased variance. When there was substantial fuel switching, which we did not know until after data collection and processing for a frequent official data publication, the estimated variance of the prediction error was greatly reduced.
...........
Perhaps you could try even fewer variables and see if your results are about as good. But you do not want to throw out any important regressors, and you have reduced the number of regressors a great deal already. Also, it may depend on the combination of regressors more than any one important one.
Perhaps you could try other sets of regressors and compare the models on the same scatterplot using graphical residual analysis (a sketch follows). If cross-validation and your knowledge of the subject matter indicate that one is likely to be generally better for the population, you could choose that way. It may also depend partly on which regressors have the best data quality. Also, if you know something about the population, you might consider your model good for part of it, but you may need other data and another model for another part of it.
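A minimal sketch of what "comparing models on the same scatterplot" could look like, with simulated placeholder data and two arbitrary candidate regressor sets:

```python
# Graphical residual analysis sketch: plot residuals vs. predictions for two
# candidate models on the same axes and look for curvature, funnels, or outliers.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_and_residuals(X, y):
    model = LinearRegression().fit(X, y)
    pred = model.predict(X)
    return pred, y - pred

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 7))
y = X[:, :3] @ np.array([1.0, 0.5, -0.8]) + rng.normal(scale=0.5, size=30)

pred_a, res_a = fit_and_residuals(X[:, :3], y)   # candidate regressor set A
pred_b, res_b = fit_and_residuals(X[:, 3:], y)   # candidate regressor set B

plt.scatter(pred_a, res_a, marker="o", label="model A")
plt.scatter(pred_b, res_b, marker="x", label="model B")
plt.axhline(0.0, linewidth=1)
plt.xlabel("predicted y")
plt.ylabel("residual")
plt.legend()
plt.show()
```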
Another concern is that the number of predictors/regressors may be too high for the relatively small sample size of 30. If possible, you could compute composite scores for clusters of variables (two or more predictors merged into one) based on factor analysis and run the multiple regression using a smaller number of composite predictors; see the sketch below.
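A hedged sketch of that composite-score idea, using scikit-learn's factor analysis on simulated placeholder data (a real analysis would inspect the loadings and choose composites on substantive grounds):

```python
# Factor-analysis composites sketch: extract a few factor scores from a block of
# related predictors and regress the outcome on those scores instead.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 12))                 # stand-in for a block of related predictors
y = X.mean(axis=1) + rng.normal(scale=0.5, size=30)

Z = StandardScaler().fit_transform(X)
scores = FactorAnalysis(n_components=2, random_state=0).fit_transform(Z)  # 2 composites
fit = LinearRegression().fit(scores, y)
print("R^2 using 2 composite predictors:", fit.score(scores, y))
```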
It has no scientific credence. Your results may be optimum for the data you have but will not generalize.
There were a number of papers in the 1960s that showed you could derive "perfect" models (R-squared of 100%) from pure random noise. You have clearly used some method to go from 200 to 6/7 variables, and if that involved some measure of fit it will undoubtedly capitalise on chance results. You need some form of cross-validation built into the process, so that you can see how well the candidate model does with data that have been (randomly) omitted.
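A small simulation of that 1960s point, with everything being pure noise (numbers are illustrative only):

```python
# With as many fitted parameters as observations, OLS reproduces pure noise exactly
# (R^2 = 1); even screening 200 noise columns down to the 6 most correlated with y
# yields a flattering in-sample R^2 that would not generalize.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n, p = 30, 200
X = rng.normal(size=(n, p))      # 200 pure-noise "predictors"
y = rng.normal(size=n)           # pure-noise "outcome"

# Saturated fit: 29 noise columns plus an intercept match the 30 observations exactly.
print("R^2 with 29 noise predictors:",
      LinearRegression().fit(X[:, :29], y).score(X[:, :29], y))

# Screen all 200 columns by |correlation with y| and keep the "best" 6.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
best6 = np.argsort(corr)[-6:]
print("R^2 with 6 screened noise predictors:",
      LinearRegression().fit(X[:, best6], y).score(X[:, best6], y))
```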
This is a useful piece - I would start all over again
Above I suggested that Indrajeet might "...have a model that works very well for those 30 sample members, but is ... too custom made for them and will ... not work so well for the rest of the population."
You mentioned composite predictors, but it seems odd to me that that would solve the problem of a sample size that is too small for all these predictors, or for the information/relationships they suggest, even when represented in composite form. However, these notes from Northwestern Kellogg seem to agree with you about composite predictors (see pages 9 and 10), while also agreeing not to throw in "junk" predictors: https://www.kellogg.northwestern.edu/faculty/dranove/htm/dranove/coursepages/Mgmt%20469/choosing%20variables.pdf. The quote from Ken Brewer above is consistent with the bias-variance tradeoff noted in statistical learning, and would seem to warn against the enthusiasm for more information noted at the bottom of page 9, except when that information is quite important.
In that article for which you provided a link, "Stopping stepwise: Why stepwise selection is bad and what you should use instead," by Peter Flom, he notes that he believes that "...no method can be sensibly applied in a truly automatic manner. ... to be general, the evaluation should not rely on the particular issues related to a particular problem. However, in actually solving data analytic problems, these particularities are essential." I like that because to me it means both data and subject matter, and I have tried to emphasize subject matter considerations elsewhere, as well as data issues causing spurious results. Flom notes that "...no method can substitute for substantive and statistical expertise...."
Stepwise may be particularly bad, but there is no one-size-fits-all method.
I think the question involves some degree of generalisation. Very high R-squared values might be regarded with scepticism in the social sciences; can such very high values also be regarded with scepticism in the physical/engineering sciences? In any case, it has been stated earlier that a significance test is essential.
Harold Chike My skepticism comes from the method (200 variables down to 6/7); knowing that, I would doubt that this is the royal road to the truth. To which I would add the importance of substantive knowledge of the process. In the physical sciences, weather forecasting works well for a few days but then struggles. But on a normal weekday I, a social scientist, can forecast traffic into a city at different times pretty well, and I can adjust for weekends, the holiday period, and football matches. So sometimes the human world can be predictable. Closed systems can be predicted; open systems that are capable of self-change are much more difficult. One can get an excellent prediction without understanding, contra positivism. But when we try to extrapolate we can hit a wall. Just look at what has happened to the 'engineering' of the Boeing 737 Max.
True experiments are about 'controlling' for other influences and making the world more like a machine with predictable outcomes, but that does not on its own bring understanding of what is going on. Nor does it mean that outcomes are always predictable in all circumstances. These are big questions!
"...significance test is essential." - Well, I question the usefulness of 'significance tests' when estimation, prediction, and variance can be more readily interpretable, and less likely to be misunderstood or misused. Our questions should be of the nature "About how much?" not questions we want answered with a "Yes," or a "No."
Galit Shmueli put a public version of her Statistical Science article right here on ResearchGate: https://www.researchgate.net/publication/48178170_To_Explain_or_to_Predict.
I recall an early version of that paper from when she was at the University of Maryland. She changed it quite a lot; I actually liked the earlier version better. In this one, on page 6, she discusses the "EPE," which includes bias and variance, though I've seen that when you estimate sigma and the model is biased, that sigma is already impacted in practice. (I don't know whether Dr. Shmueli said anything about that.)
The example in the appendix shows that the more "correct" model can sometimes give you a less accurate prediction. However, when looking at the explanation aspect of regression, with a number of "independent" variables, I suspect that the influence of variables on each other means that the best combination of independent variables may not be the same as the combination of the best independent variables. That is, you cannot just take the independent variables you separately consider of high explanatory value, and think that together they explain more. Certainly they may predict better, but they might also explain more in certain combinations than others if you know the subject matter. (I don't know if Dr. Shmueli said anything about that.)
Is using n < p your advice to Indrajeet, David? The concern here is overfitting to a particular sample. As Kelvyn put it, "Your results may be optimum for the data you have but will not generalize." That was for the 6 or 7 variables picked, but for a lot more, I expect that it is liable to be worse. Different circumstances may indicate different approaches. What about Indrajeet's question?
I have just watched the second of the Royal Institution Christmas Lectures, which is about algorithmic learning. It showed some successes and some hilarious failures, just as you would expect.
Mathematician Dr Hannah Fry presents the 2019 CHRISTMAS LECTURES – Secrets and lies: The hidden power of maths. Broadcast on BBC 4 at 8pm on 26, 27 and 28 December.
Thank you immensely, Prof. James R Knaub. The precise answer to the question asked by Indrajeet, the originator of our intellectual discussion, has not been provided.
We have rather exhibited our own particular experiences. My very good friend Prof. David Eugene Booth had previously misunderstood my approach of providing an answer to the question before going into the dialectics of the surrounding intellectual discussion. Such discussions need to provide conclusive answers where possible.
We have now left Indrajeet Indrajeet to sort out the needed answer from our discussions.
That is quite appropriate, but only for advanced researchers in the specialty. Thank you all.
This just popped up in my feed, but reading through the answers from some very bright commentators I was surprised no one asked two questions which I believe are necessary to address this (and the questions may be related, depending on the answers). There is also a point I'll add on why R^2 values near one sometimes occur.
1. How was the R^2 value adjusted? In particular, did it take into account the total number of candidate variables (~200), was it based on the R^2 from a sample separate from the one used to choose the predictors in the model, or something else? (The usual adjustment formula is reproduced after these two questions.)
2. How were the 6-7 variables selected, and in particular, were they selected on the basis of characteristics independent of the n = 30 study?
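For context on question 1, the standard adjusted R^2 (a textbook formula, not anything given in the original question) is:

```latex
\bar{R}^2 \;=\; 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

With n = 30 this adjustment becomes undefined or explosive as p approaches 29, and it does not correct for having screened roughly 200 candidate variables before settling on the 6-7 that were fit.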
The point is:
3. Sometimes R^2 values near 1 (or at 1; we don't know how the adjusted statistic was calculated, so it might be exactly 1, but close values can also occur depending on how variables are created) turn up in projects I read because the student calculates a mean or a sum of several variables, forgets what this variable is, and later uses it as a response variable with predictors that include the very variables used to create it.
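A toy illustration of point 3 (hypothetical data; the variable names are made up):

```python
# If the response is, perhaps unknowingly, the sum of some of the predictors,
# regressing it on those predictors gives R^2 = 1 by construction.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
items = rng.normal(size=(30, 4))     # four item scores
total = items.sum(axis=1)            # composite computed earlier and later forgotten

print("R^2:", LinearRegression().fit(items, total).score(items, total))  # exactly 1.0
```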
It is worth agreeing with the respondents here that in a lot of research contexts this design would be poor. You don't provide enough information in your question for me to confidently say the design is poor (I can confidently say the question is poor because it is incomplete).
David Eugene Booth, it is ironic that you bring up GWAS and SNP studies to justify black-box methods. You ask:
"James has the use of microarray data not revolutionized cancer research?"
My answer: NO, it has not! So what have decades of mindless data dredging over terabytes of what is functionally biological noise (except for the expected long tail of multicollinear, perfectly fit noise) given us? Mainly more utopian predictions of the coming genomics revolution that is always just around the corner. That, and the realization of how widespread magical thinking, cargo-cult science, and an uncritical and astonishing disregard for empiricism have become, alongside false promises of clinical outcomes made to the public.
What's funny is that the workers who churn out publications with yet another clinically irrelevant cancer-'gene' association also predict addiction, demographics, and just about anything else that can be 'analysed' via overfitting algorithms that always find something. There is no need to even take account of, or have any knowledge of, physiology or pharmacology, or to worry in the least about case ethics.
They are hard at work as we speak on COVID, churning out associations and hand-wavy discussions on drugs of interest (that they conveniently never have to empirically verify the effect of). Surely, something of the failure of the current prediction-without-verification approach to statistical modelling has finally dawned here in 2020.
I have seen papers with p-values of 10^-18 or less for some interactions. It doesn't matter how a statistic is meant to be interpreted or in what transform it was derived; when its reciprocal approaches the number of particles in the universe, it's time to step back.
But David Eugene Booth, you are welcome to point out where the application of mindless statistics to nucleotide data (gene expression is too strong a term, I think) has led to actual, meaningful clinical applications. That would presumably be the claim, given that it is necessarily a translational goal and not simply an endless fog of delusional clarity ("understanding"). By applications, I mean interventions and cures, not a best guess at which medicine to take when there is no meaningful difference in outcomes.
By the way, while it is possible that some rare genetic disease could be reproducibly categorized in this way, to what end, practically? Moreover, these investigators typically only guess what kinds of patterns to trace because of a priori empirical description - so the contribution is at best secondary.
Miky Timothy Why this has now appeared in my feed is quite strange, but in any case I was not talking about OLS regression, which Kelvyn Jones has characterized quite well, but rather adaptive lasso regression, which is well characterized in the literature that we cited. I think that if you have scientific concerns, then you should publish them in scientific venues where they may be judged on their merits in the appropriate manner. By the way, your commentary
http://atm.amegroups.com/article/view/19244/html
is such an article. Unfortunately, you do not seem to have read it. It criticizes stepwise methods just as our paper does, and it does not mention any of the techniques that were actually used in our paper. The paper you cited discusses GWAS methods, which are quite different from what we considered: two human genes studied by adaptive lasso, not the OLS methods used in a genome-wide study. Thus your citation is simply irrelevant to a discussion of our work, though it is well worth reading on its own merits. I also suggest you read our paper and its citations, which explain how the adaptive lasso prevents overfitting; that is one of the main reasons for using it.
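For readers unfamiliar with the method, here is a minimal sketch of the adaptive-lasso idea, not the actual pipeline from the paper, and on simulated data: each coefficient's L1 penalty is weighted by the inverse of an initial estimate, which can be implemented by rescaling the columns before an ordinary lasso fit.

```python
# Adaptive lasso sketch: penalty weights w_j = 1 / |initial beta_j|^gamma shrink
# likely-irrelevant coefficients harder, which reduces overfitting from selection.
import numpy as np
from sklearn.linear_model import LassoCV, Ridge

rng = np.random.default_rng(6)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.r_[np.array([3.0, -2.0, 1.5]), np.zeros(p - 3)]   # only 3 true signals
y = X @ beta + rng.normal(size=n)

gamma = 1.0
init = Ridge(alpha=1.0).fit(X, y).coef_          # first-stage estimate
w = 1.0 / (np.abs(init) ** gamma + 1e-8)         # adaptive penalty weights
X_tilde = X / w                                  # column rescaling implements the weights

lasso = LassoCV(cv=5).fit(X_tilde, y)            # ordinary cross-validated lasso
adaptive_coef = lasso.coef_ / w                  # transform back to the original scale
print("Selected predictors:", np.flatnonzero(adaptive_coef).tolist())
```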
David Eugene Booth, blame me for it popping up. It came up on mine, and I read the comments and replied, and then saw the date. By the way, I think this is another case of the questioner not providing enough information and commentators assuming a lot about what the person meant (see my comment above). It might be useful if RG had a separate way to ask the questioner to clarify things for people giving answers (and probably also a way to reply to a comment rather than to the question itself). Anyway, hope everybody enjoyed Nevada Day yesterday, and for those who celebrate Halloween, enjoy that today!
There is no need for blame, Daniel Wright. You replied to the original question and added another take on the problem, which will be useful for those who search the topic. I would add that R^2 is rather uninformative, and that RMSE, MAE, and residual diagnostics are of greater practical interest (a sketch follows). The question is so opaque and without context that it is impossible to answer definitively.
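A short sketch of those metrics, on simulated placeholder data, to make the point concrete:

```python
# Report error on the outcome's own scale (RMSE, MAE) and inspect residuals,
# rather than leaning on R^2 alone. Simulated data, illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -1.2]) + rng.normal(scale=0.7, size=30)

model = LinearRegression().fit(X, y)
pred = model.predict(X)
resid = y - pred

print("RMSE:", np.sqrt(mean_squared_error(y, pred)))
print("MAE :", mean_absolute_error(y, pred))
print("Residuals: min %.2f, median %.2f, max %.2f"
      % (resid.min(), np.median(resid), resid.max()))
```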
David Eugene Booth, my comment appeared in your timeline because I was responding to what you wrote - it was from April but I fail to see what is so strange about this. Nor do I understand how you completely misread the point of my comment, given the context and specifically the in-text quote from you. Below is your statement in full:
James, has the use of microarray data not revolutionized cancer research?
Well, Prof. Timothy, I certainly appreciate your helpful comments. Perhaps you would care to comment on the over 5,000 patients who provided data for the first and subsequent studies I was involved in. Full details of the study group are contained in the first portion of the study, as published in Cancer Research by the group prior to my joining it. Perhaps these patients had some reason for consenting to join the study. I have been involved in clinical trials myself as a subject (i.e., the VITAL study), and I had reasons for joining beyond the excellent monthly newsletter the staff produced.

If you have read our paper, you will have seen that the large SELECT trial was stopped early because the group treated with selenium showed an unexpectedly high rate of prostate cancer. Based on our work, we have been able to propose a mechanism for that stoppage which, if supported by current work, could make selenium again a possible treatment for certain types of prostate cancer. While we have not cured anyone of cancer, we have proposed a reasonable candidate for treatment of a prostate cancer subtype. While we would like to be able to say more, our current progress does not allow that. We hope that our work, combined with that of others, may revive the SELECT study, which may actually provide a new treatment for this cancer subtype. This is how research progresses.

If you consider the Cancer Research paper previously mentioned and cited in the SRP paper, this later work would not have been possible when the first paper was published, because the adaptive lasso methods used were not available at that point; there were no methods available to do that in 2008. By the way, would I be confident in signing a surgical release? If this work ended the way we hope, it would not lead to surgical intervention. However, I would be willing to sign a release for any treatment successfully developed by this approach, at least as confidently as I signed my release for open heart surgery two years ago.
Specifically, my earlier reply was to James' comment on the use of modern selection techniques. The reason I made the comment is that I believe that our work, and that of others, is valuable for the reasons mentioned above. I hope that this has settled your fears that we are irrelevant. Best wishes to you and your group in New York, David Booth
David Eugene Booth, my thanks for your thoughtful and detailed responses to my questions. It is heartening to hear that you and your collaborators have indeed put a lot of thought into the clinical ends of this work. This sadly is very often not the case, and has made me a cynic (justifiably I believe).
It is also instructive (and relevant in the context of this thread) to read your description of how a (seemingly abstract) statistical procedure is used in applied medicine.
Best wishes and the best of health to you and your colleagues in Ohio!