When regression is done with multiple independent variables, say 10 or more, and with a large number of data points, for example more than 2000, can we still get a high value for R-squared or the adjusted R-squared? Does the value of R-squared depend on the number of tested parameters and data points in the study?
@Josh:
"With R2 = 10%, it means 90% of variation is residing in the residual meaning the fitted line or model is bad/wrong. R2 of 60% above is worthwhile."
I don't think that one can give such general advice. What R² is worthwhile depends very much on the subject. When there are many uncontrollable, undeterminable and unknown factors influencing the response, an R² of 10% can be pretty good. In a well-controlled physical or chemical lab experiment, an R² of 0.99 may be too low.
The multiple R² measures how much variance of the data is "explained"(*) by the model. The more complicated the model, the better it can "explain" any noise and fluctuation in the data. Thus, the more parameters, the higher the R² value will be (for the same data). This is "corrected" by the adjusted R², which includes a penalty for the number of predictors, so that the expected contribution to the "explanation" of variance by a completely uncorrelated predictor is removed.
The value of R² further depends on the distribution of the values of the response and of the predictors. So there can be no general answer to your question.
(*) "to explain" means to reduce the (residual) variance. The reference for the R² value is the model without predictors (y.hat(x1,x2,...) = mean(y)) where the residual varance is equal to the variance of the data: V0 = var(y) = E[(yi-mean(y))²]. The model with predictors (y.hat = f(x1,x2,...)) will have a lower (residual) variance of Vm = E[(yi-f(x1i,x2i,...))²]. The value of R² is (V0-Vm)/V0, (= the relative decrease in the variance) and this will be 0 when V0=Vm, and it will be 1 when Vm=0.
A value of R-squared from 0.4 to 0.6 is acceptable in all cases, whether it is simple linear regression or multiple linear regression.
If you want a good value, then by common standards the minimum R-squared should be 0.6; as it increases it gets better, up to about 0.9.
If the value of R-squared increases beyond 0.9, it may be due to autocorrelation.
Khalid -
Jochen gives you a good description of these statistics. I don't really like these measures though, because, as I feel Jochen also indicated, they don't give you a measure that is really as informative as they seem to purport. Also, I think that the variance of these two measures is higher than one would implicitly assume. Further, with a small number of observations, I think it is too easy for a higher-than-warranted R-square to occur.
What strikes me in your example, however, is that you have a great deal of data, and possibly a lot going on. There could be numerous approaches here. I did notice the paper (Thuiller, et al.) at the first link below that addresses some possibilities, likely in another context, and I suspect with your problem there are many options.
Rather than even an adjusted R-square, I think you are better served by considering the "variance of the prediction error." It is found in econometrics books, and I know that the square root is simply found in SAS PROC REG as SDSI, and I assume in other PROCs and in other software. This would be for a given 'predicted' y-value. For a sum of such predicted y-values, you do not simply sum the variances. This is discussed in the "Estimation of Variance" section, early in my paper at the second link, in the context of estimation for finite populations.
Cheers - Jim
Article Generalized models vs. classification tree analysis: Predict...
Article Using Prediction-Oriented Software for Survey Estimation
For the question as asked, I cannot improve on the answers given.
With ten or more variables there inevitably arise questions about identifying which ones are "important." One of the first approaches is to try forward/backward/stepwise regression. Common criteria are p-values or R-squared values. You could compare models using the Akaike information criterion (AIC) or one of several similar statistics. I have not seen a method that allows stepwise regression based on AIC in SAS, so I usually start with stepwise and then take models from the final few steps and look at AIC. You could also try multivariate methods and look at factors or eigenvectors. Depending on your data, you could use imputation to fill in occasional missing values (see http://www.missingdata.org.uk/).
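As a hedged illustration of comparing candidate models by AIC (the data, seed, and variable names below are invented, and AIC is computed in its Gaussian-likelihood form, n·ln(RSS/n) + 2k):

```python
import numpy as np

# Invented example: two true predictors plus one pure-noise predictor.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
noise_var = rng.normal(size=n)              # irrelevant predictor
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

def aic(X_cols, y):
    """Fit OLS with an intercept and return the Gaussian AIC."""
    A = np.column_stack([np.ones(len(y))] + X_cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    k = A.shape[1]
    return len(y) * np.log(rss / len(y)) + 2 * k

aic_small = aic([x1, x2], y)                # the two real predictors
aic_big = aic([x1, x2, noise_var], y)       # plus the irrelevant one
# A genuinely useful predictor lowers AIC sharply; a noise predictor
# barely reduces RSS, so the +2 penalty usually makes its AIC worse.
```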
In reducing the regression model keep it hierarchical. If a main effect is not significant and an interaction is significant, then both remain in the model. Also be aware of coefficients that change dramatically when a variable is removed. This will occur if you happen to find an example of Simpson's Paradox as mentioned in Paul's #3 reference. A graphical representation is in another of Paul's articles: http://optimalprediction.com/files/pdf/V1A13.pdf.
It sounds like you have enough data to have a great deal of fun. I tend to find these large data sets to be both a joy and a curse.
In my experience, there is no general answer to what is a good value of the coefficient of determination. It depends on the data you use, or on the characteristics of the object you study. In some cases, for example modeling a tree volume equation, the R^2 may be more than 0.95 or even 0.98, and in other cases the R^2 may be less than 0.5 or even 0.3, for example when modeling a quantitative site index equation. In special cases the R^2 may be less than 0.1, yet the regression model may still be good. We can say that, for a given data set, if you developed a regression model with the best fitting results, then even a very low R^2 would be a good value.
What is the goal of your model? Most of the time, I use R^2 Adj and R^2 Pred as a few of many methods for determining the quality of the model.
If you have 10+ predictors, it will be possible that some of them will look significant but really are random variables. (I had this happen before.) With that said, you should look over the different terms in your model that are significant and try to make sure they really belong in the model.
I would also be worried about the correlation among the different variables in the model. If the VIF is too high for a variable (or variables), that can lead to higher standard errors for the terms and to nonsensical coefficients. (Had this happen to me too.)
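To make the VIF point concrete, here is a minimal sketch (invented data) that computes VIF_j = 1/(1 - R²_j) directly from its definition, where R²_j comes from regressing predictor j on all the other predictors; values above roughly 5 to 10 are commonly taken as a warning sign.

```python
import numpy as np

# Invented data: x2 is nearly collinear with x1, x3 is independent.
rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Variance inflation factor of column j (intercept added internally)."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ beta
    r2_j = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2_j)
```

Here `vif(X, 0)` and `vif(X, 1)` come out very large because of the near-collinearity, while `vif(X, 2)` stays close to 1.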
Andrew,
Thank you for your feedback.
Indeed, I have 12 independent parameters, and some of them do have a significant impact on the dependent parameter, while others do not. This is happening while R2 is 10% and adjusted R2 is 9%. So on the one hand the analysis shows that there is a significant impact from some tested parameters, while on the other hand R2 and adjusted R2 indicate a very weak relationship. How can we resolve this?
That is not quite the way to think of this. What you have is a model with some IVs being significant; therefore the model is significant. However, it explains very little of the observed variability, so it may not be useful. There are many possible explanations for this. Two of them are: 1) you have missed some variable(s) that are important; 2) you are working with a system that is highly variable. I could measure 12 weather variables at my house, but I would not be able to predict the weather with much accuracy because there are a large number of external forces that influence the weather at my house. I may have the correct variables, but not the right scale.
There is also the possibility that the model is at fault. The problem could be as easy as transforming some of the variables. Analyzing the residuals might give clues. Possibly also change the model, but do so based on a theoretical relationship rather than statistical necessity. I might measure the length of fish and find that the relationship between length and weight is linear. However, I might still use an exponential model because most published models are exponential and I can see that my result could be an artifact of my sampling methods.
An adjusted R-squared of 9% is a bit disappointing. However, this may still be important. Maybe the system you are studying is exceptionally difficult, or just something no one has looked at. If nothing else, it may help others learn from your problems.
If you are interested in estimating the accuracy of the predictions, or the predictive potential of certain data sets or variable sets, you may use a cross-validation technique. The latter reveals the practical significance of the R square value, etc.
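A minimal k-fold cross-validation sketch (all data and names invented, plain NumPy): fit OLS on k-1 folds, predict the held-out fold, and report an out-of-sample R², which is what reveals the practical predictive value that in-sample R² can overstate.

```python
import numpy as np

# Invented data with a real linear signal plus noise.
rng = np.random.default_rng(3)
n, p, k = 300, 5, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Shuffle indices and split into k folds.
idx = rng.permutation(n)
folds = np.array_split(idx, k)
preds = np.empty(n)
for test_idx in folds:
    train_idx = np.setdiff1d(idx, test_idx)
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
    preds[test_idx] = A_test @ beta     # predictions from held-out fits only

# Out-of-sample R²: can be much lower than in-sample R², even negative.
cv_r2 = 1 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
```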
One tool in regression that has not been mentioned in the discussion before this is the SCATTER PLOT, which adds value to your hypothesized equation. This visual tool obviously helps you see the big picture better. The size of R2 and the significant IVs only confirm your picture statistically. The linear equation is the simplest but an important way to think of how the response variable is affected by the IVs. Therefore in your discipline, as in others, models have been proffered (listen to what Timothy above is saying) which EXPLAIN the behaviour of the response to the IVs (i.e., how close the fitted line is to the observed or measured data; in short, a low residual). The second stage, after using the scatter plot, is to try the known models: linear, exponential, etc. As said elsewhere, the more IVs, the higher the R2, because the IVs are simply pulling variation from the residual (see Jochen). With R2 = 10%, it means 90% of variation is residing in the residual meaning the fitted line or model is bad/wrong. R2 of 60% above is worthwhile.
R2 is an overused statistic for linear regression analysis. Additional metrics are needed to complete the picture. In order to measure the stability of a regression model (for prediction purposes), it is possible to define a different "R2" that is based on the PRESS statistic. Another issue is the presence of multicollinearity. Models with high R2 can have high levels of multicollinearity. There are many statistics available that could supplement the basic R2.
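For readers unfamiliar with PRESS: for OLS it can be obtained from a single fit, because the leave-one-out residual equals e_i/(1 - h_ii), where h_ii is the leverage (diagonal of the hat matrix). A hedged sketch with invented data, showing the PRESS-based predictive R² alongside the ordinary R²:

```python
import numpy as np

# Invented data; two of the four predictors carry no signal.
rng = np.random.default_rng(4)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
H = A @ np.linalg.solve(A.T @ A, A.T)      # hat matrix
h = np.diag(H)                             # leverages h_ii

press = np.sum((resid / (1 - h)) ** 2)     # sum of squared LOO residuals
sst = np.sum((y - y.mean()) ** 2)
r2_pred = 1 - press / sst                  # PRESS-based "predictive R²"
r2 = 1 - np.sum(resid ** 2) / sst          # ordinary R²
```

Since each leverage satisfies 0 < h_ii < 1, PRESS always exceeds the in-sample residual sum of squares, so r2_pred is always below r2; a large gap flags an unstable (overfitted) model.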
One fact I should mention that might help explain the low R2 value is that a large number of the data points of the dependent variable are zero. Out of the 2300 data points there are only 300 non-zeros, and the rest are zeros. I am wondering how that can affect the R2 of the regression.
Additionally, when the significance test is performed, 5 independent variables out of the 10 are significant, with P-values far below 0.05.
Khalid -
The p-value is a function of sample size, so using a level of 0.05 for all sample sizes is not really what you want to do. Actually, I do not like using either R-square or p-values. Confidence intervals are generally more practically interpretable. Scatterplots are often helpful, as Josh said.
Jim
Article Practical Interpretation of Hypothesis Tests - letter to the...
Unless your target variable can take negative, zero, and positive values, most points with 0 values should be removed in the regression analysis. If only 5 of 10 variables are significant, your model may be better with just those 5. I agree with Jim: scatterplots are very helpful. Also, as said above, we cannot give general advice on R²; what R² is worthwhile depends very much on the subject.
Khalid -
You said that a "...large number of the data points of the dependent variable are zero." But do you mean that x does not have to be zero when y is zero in many of those cases? I did put together some notes including that situation in the context of estimation/imputation/prediction for finite populations. I had a great deal of experience in testing results and using prediction to estimate for missing data - both for nonresponse and out-of-sample - and seeing what worked, how and when, and one of the topics I had to come to grips with was how best to model when x or y data were missing or zero. The short note at the link attached may carry over to your application. The note is for one regressor, but in your multiple linear regression case, for x you could substitute a linear combination of the regressors (say predicted y) and see how those notes may relate to your case.
It would seem odd, though, if you have a number of regressors and they would give you quite a few predicted values of y that are positive, with corresponding observed y values that are zero. It would then appear that perhaps yet another regressor, not present, is still needed?
Jim
Technical Report A Note on Regression Through the Origin: What to do with mis...
Khalid,
You need to provide more information. What are you measuring? Is this a survey, or a mechanical process? Do the zeros contain information? Are zeros actually zero or just below the limit of detection? Are the sizes of the non-zero outcomes important, or could you code the response variable as zero and non-zero?
According to Duke U (n.d.):
"It depends on the variable with respect to which you measure it, it depends on the units in which that variable is measured and whether any data transformations have been applied, and it depends on the decision-making context. If the dependent variable is a nonstationary (e.g., trending or random-walking) time series, an R-squared value very close to 1 (such as the 97% figure obtained in the first model above) may not be very impressive. In fact, if R-squared is very close to 1, and the data consists of time series, this is usually a bad sign rather than a good one: there will often be significant time patterns in the errors, as in the example above. On the other hand, if the dependent variable is a properly stationarized series (e.g., differences or percentage differences rather than levels), then an R-squared of 25% may be quite good. In fact, an R-squared of 10% or even less could have some information value when you are looking for a weak signal in the presence of a lot of noise in a setting where even a very weak one would be of general interest. Sometimes there is a lot of value in explaining only a very small fraction of the variance, and sometimes there isn't. Data transformations such as logging or deflating also change the interpretation and standards for R-squared, inasmuch as they change the variance you start out with.
However, be very careful when evaluating a model with a low value of R-squared. In such a situation: (i) it is better if the set of variables in the model is determined a priori (as in the case of a designed experiment or a test of a well-posed hypothesis) rather by searching among a lineup of randomly selected suspects; (ii) the data should be clean (not contaminated by outliers, inconsistent measurements, or ambiguities in what is being measured, as in the case of poorly worded surveys given to unmotivated subjects); (iii) the coefficient estimates should be individually or at least jointly significantly different from zero (as measured by their P-values and/or the P-value of the F statistic), which may require a large sample to achieve in the presence of low correlations; and (iv) it is a good idea to do cross-validation (out-of-sample testing) to see if the model performs about equally well on data that was not used to identify or estimate it, particularly when the structure of the model was not known a priori. It is easy to find spurious (accidental) correlations if you go on a fishing expedition in a large pool of candidate independent variables while using low standards for acceptance. I have often had students use this approach to try to predict stock returns using regression models--which I do not recommend--and it is not uncommon for them to find models that yield R-squared values in the range of 5% to 10%, but they virtually never survive out-of-sample testing. (You should buy index funds instead.)
There are a variety of ways in which to cross-validate a model. A discussion of some of them can be found here. If your software doesn’t offer such options, there are simple tests you can conduct on your own. One is to split the data set in half and fit the model separately to both halves to see if you get similar results in terms of coefficient estimates and adjusted R-squared.
When working with time series data, if you compare the standard deviation of the errors of a regression model which uses exogenous predictors against that of a simple time series model (say, an autoregressive or exponential smoothing or random walk model), you may be disappointed by what you find. If the variable to be predicted is a time series, it will often be the case that most of the predictive power is derived from its own history via lags, differences, and/or seasonal adjustment. This is the reason why we spent some time studying the properties of time series models before tackling regression models.
A rule of thumb for small values of R-squared: If R-squared is small (say 25% or less), then the fraction by which the standard deviation of the errors is less than the standard deviation of the dependent variable is approximately one-half of R-squared, as shown in the table above. So, for example, if your model has an R-squared of 10%, then its errors are only about 5% smaller on average than those of a constant-only model, which merely predicts that everything will equal the mean. Is that enough to be useful, or not? Another handy reference point: if the model has an R-squared of 75%, its errors are 50% smaller on average than those of a constant-only model. (This is not an approximation: it follows directly from the fact that reducing the error standard deviation to ½ of its former value is equivalent to reducing its variance to ¼ of its former value.)
In general you should look at adjusted R-squared rather than R-squared. Adjusted R-squared is an unbiased estimate of the fraction of variance explained, taking into account the sample size and number of variables. Usually adjusted R-squared is only slightly smaller than R-squared, but it is possible for adjusted R-squared to be zero or negative if a model with insufficiently informative variables is fitted to too small a sample of data.
What measure of your model's explanatory power should you report to your boss or client or instructor? If you used regression analysis, then to be perfectly candid you should of course include the adjusted R-squared for the regression model that was actually fitted (whether to the original data or some transformation thereof), along with other details of the output, somewhere in your report. You should more strongly emphasize the standard error of the regression, though, because that measures the predictive accuracy of the model in real terms, and it scales the width of all confidence intervals calculated from the model. You may also want to report other practical measures of error size such as the mean absolute error or mean absolute percentage error and/or mean absolute scaled error.
What should never happen to you: Don't ever let yourself fall into the trap of fitting (and then promoting!) a regression model that has a respectable-looking R-squared but is actually very much inferior to a simple time series model. If the dependent variable in your model is a nonstationary time series, be sure that you do a comparison of error measures against an appropriate time series model. Remember that what R-squared measures is the proportional reduction in error variance that the regression model achieves in comparison to a constant-only model (i.e., mean model) fitted to the same dependent variable, but the constant-only model may not be the most appropriate reference point, and the dependent variable you end up using may not be the one you started with if data transformations turn out to be important.
And finally: R-squared is not the bottom line. You don’t get paid in proportion to R-squared. The real bottom line in your analysis is measured by consequences of decisions that you and others will make on the basis of it. In general, the important criteria for a good regression model are (a) to make the smallest possible errors, in practical terms, when predicting what will happen in the future, and (b) to derive useful inferences from the structure of the model and the estimated values of its parameters."
The full article is below
https://people.duke.edu/~rnau/rsquared.htm
Dear Khalid
First of all, for experimental data, 0.7 is a very acceptable value for the coefficient of determination.
Secondly, the value of the coefficient of determination depends on how much of the variation in your data points can be explained by your model (your equation). Therefore, I think it has nothing to do with the size of the data.
The following reference could help you:
https://www.mheducation.co.uk/openup/chapters/0335208908.pdf
Regards
Coefficient of determination (R^2)
• The coefficient of determination is a measure of the amount of variance in the dependent variable explained by the independent variable(s). A value of one (1) means perfect explanation and is not encountered in reality due to ever present error. A value of .91 means that 91% of the variance in the dependent variable is explained by the independent variables.
• The amount of variation explained by the regression model should be more than the variation explained by the average. Thus, R2 should be greater than zero.
• R2 is impacted by two facets of the data:
o the number of independent variables relative to the sample size. For this reason, analysts should use the adjusted coefficient of determination, which adjusts for inflation in R2 from overfitting the data.
o the number of independent variables included in the analysis. As you increase the number of independent variables in the model, you increase the R2 automatically because the sum of squared errors by regression begins to approach the sum of squared errors about the average.
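The last bullet can be demonstrated directly: with invented data and purely random predictors, R² still never decreases as variables are added, which is exactly the inflation that the adjusted coefficient of determination corrects for.

```python
import numpy as np

# Invented example: a pure-noise response and pure-noise predictors.
rng = np.random.default_rng(5)
n = 60
y = rng.normal(size=n)

def fit_r2(X_cols):
    """Ordinary R² of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(n)] + X_cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

cols, r2s = [], []
for _ in range(10):
    cols.append(rng.normal(size=n))   # add one more noise predictor
    r2s.append(fit_r2(cols))
# r2s is non-decreasing even though every predictor is pure noise:
# each added column can only shrink the residual sum of squares.
```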
@jochen
Thank you for your response
Do you have a reference article to recommend regarding the relative value of R2?
@jochen I meant that a low value of R2 might be acceptable (for example, in the social sciences and humanities). Unfortunately, I have not found a scientific reference that supports this idea. Thanks a lot for your help.
You won't find a serious scientific reference, because this is highly context-specific and requires substantial (subject-matter) interpretation.
Many years ago, when I was a student, I participated in a consulting project on identifying patterns of language used between travel agents and potential customers. The resulting R2 was 0.30 (or so), and I was surprised how happy the client was. She told us that such an R2 value in her discipline was higher than any other published value for such an application.
At the other extreme, I recently published an article in the Journal of Clinical Pharmacology in which R2 = .97. There actually exists a theoretical framework (that was unknown to us) in the animal science literature, in which a similar model was used with data on the metabolic rate of some type of crab. How "significant" an R2 value is depends on the type of application. In the end, there are many additional statistics that should be used besides R2.
https://data.library.virginia.edu/is-r-squared-useless/
In this publication you'll find experts arguing, with examples, why R2 may be considered useless.
I recommend adopting other performance metrics to assess your results.
Thanks @Daniel_Althoff !
@Gaetan_Temperman, it's interesting! :-)
@Daniel_Althoff,
Thanks for sharing such a valuable source to clear doubts about R2 value.
Overall there isn't a "best metric"; an ensemble of metrics will give your scientific paper more reliable arguments.
Although it covers hydrological modeling, I highly recommend this paper for its insights on R2: https://www.adv-geosci.net/5/89/2005/adgeo-5-89-2005.pdf
I strongly suggest reading about the mean bias error, mean absolute error, and root mean squared error, and asking yourself: Are those used in similar scientific papers? What is the common ground? Should I use the normalized versions of the error criteria?
Don't forget to discuss the implications of those performance metrics.
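For concreteness, the error criteria mentioned above can be computed in a few lines (the observed and predicted values here are invented for illustration):

```python
import numpy as np

# Invented observed and predicted values.
obs = np.array([2.0, 3.5, 4.0, 5.5, 7.0])
pred = np.array([2.2, 3.0, 4.5, 5.0, 7.5])

mbe = np.mean(pred - obs)                    # signed bias: >0 means over-prediction
mae = np.mean(np.abs(pred - obs))            # average magnitude of error
rmse = np.sqrt(np.mean((pred - obs) ** 2))   # penalizes large errors more heavily
nrmse = rmse / (obs.max() - obs.min())       # one common normalized version
```

RMSE is always at least as large as MAE, and the gap between them widens when a few errors dominate, which is one reason to report both.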
Check this for an answer:
https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
Thanks, guys, this has shed more light on the matter. Thank you for the valuable arguments and clarity, as well as for sharing the references.