In my regression analysis I found R-squared values from 2% to 15%. Can I include such low R-squared values in my research paper? Or do R-squared values always have to be 70% or more? If anyone can refer me to any books or journal articles about the validity of low R-squared values, it would be highly appreciated.
You need to understand that R-square is a measure of explanatory power, not fit. You can generate lots of data with low R-square, because we don't expect models (especially in the social or behavioral sciences) to include all the relevant predictors needed to explain an outcome variable. You can cite works by Neter, Wasserman, or many other authors on R-square. You should note that R-square, even when small, can be significantly different from 0, indicating that your regression model has statistically significant explanatory power. However, you should always report the value of R-square as an effect size, because people might question the practical significance of the value. As I said, in some fields R-square is typically higher, because it is easier to specify complete, well-specified models. But in the social sciences, where it is hard to specify such models, low R-square values are often expected. You can read about the difference between statistical significance and effect sizes if you want to know more.
Low R-squared values are not always bad, and high R-squared values are not always good!
http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
Hi Abu, although I suppose I would be hard-pressed not to call 70% high, I don't think we should be caught in a trap of referring to r-square values as high or low without knowing much more. In pretty much any field I work in, I would never expect to see an r-square of 70%. It really depends on what you are trying to predict, what your predictors are, and how reasonable it is that you can predict it. It's important to realize that the outcome could be made up of a lot of things. For example, if it is a psychological scale, part of it is always going to be measurement error. By definition you cannot predict that. And if it is some kind of human behavior, thought, or feeling, it is extremely unlikely that you will be able to predict even half of it, unless your predictors are very closely related to the outcome (in other words, nearly the same thing). Even something seemingly well related, such as suicidal ideation, doesn't predict actual suicide attempts that well. Having a stressful job doesn't predict individual stress that well because individual responses to stress vary a great deal. I am not giving you any kind of a standard, but I wouldn't find an r-square of .25 in a psychological study unusual. And a small r-square could have important implications. Even if a combination of predictors representing modifiable behaviors predicted just a small amount of the variance in an important health outcome, that could be very important. I hope that helps. Bob
In my opinion, the question of "low" or "high" R-square is really subjective. You need to cross-validate the regression to measure the possible inflation of scores due to overfitting. Cross-validation will also give you a realistic estimate of prediction (if the predictors occur before the predictand) or specification (if they occur at the same time). If there is a physical sense in associating the predictors with the predictand, and if you can get a higher score than, for example, red noise of the same size, the result definitely warrants interpretation, independently of the value of R-square.
I couldn't agree more with Robert Brennan. It's not common to get an r-square value of .70. How much of the variability of the dependent variable can be explained often depends on how you measure it! In this link you may find some useful views on the subject:
http://people.duke.edu/~rnau/rsquared.htm
In the social sciences, values of 70% or above are not common. Social phenomena are complex and multidimensional, so it is very difficult to explain a large amount of the variation. I don't know what your research area is, but if your R-squared is small it means that your model explains little of the variation, and you have to be careful if you intend to use it for forecasting or to define strategies for action. However, you can certainly use it to better understand the phenomenon you are studying, and you should report the R-squared in your research paper.
If there are many omitted predictors that would help to explain the D.V., then R-sq can't be expected to be that high. In explaining human behavior, even small values of R-sq can be quite meaningful. R-sq is about explanatory power and not truly about "fit."
While I very broadly agree with Rob and John, the value of R^2 very much depends on the context of your data. There are areas of social science, psychology, and economics where an R^2 of 70% wouldn't be unusual. Equally, you can find important effects with an R^2 much less than 10%, or indeed 1% or 0.1%.
That said, high R^2 values sometimes indicate problems with the modelling (e.g., ecological correlation) or arise because the sample sizes are small.
There is more to R-sq than meets the eye. First, pay particular attention to what field of research you are working in (i.e., social sciences, biological sciences, clinical studies, etc.). E.g., in clinical studies an R-sq value of 70% might not be enough to explain the variation one intends to address, while in other fields of science this value is sufficiently high. When running a regression model with multiple explanatory variables, it is possible to obtain relatively high R-sq values, but this has to be in observance of the law of parsimony in model fitting. Long story short, surely an R-sq value of 10% is low and 90% is high, but pay attention to the context in which the R-sq is being interpreted. [I hope this helps.] Good luck.
The interpretation of the R-squared will depend upon whether the output is significant or not. You can see the significance of this in the ANOVA output. In most instances it is the ANOVA test that establishes whether R and R-squared are significant; even if they are low in value, you can still use them in your research. The interpretation of the coefficients should not be based solely on the values themselves. You have to consider the test statistic: for R-squared the test is the ANOVA F-test, while for the beta coefficients the test is the t-test.
There are lots of materials available on the internet: studies and instructions on how to interpret regression results. You just have to search for them online.
Jimmy: "Long story short, surely an R-sq value of 10% is low and 90% is high, but pay attention to what context the R-sq is being interpreted."
This may often be true but doesn't always hold. The point is that R^2 is determined by error variance that may not be explainable and is unique to a sample. Thus one cannot be certain that, in two otherwise identical models fit to different samples, an R^2 of 10% reflects a smaller effect than an R^2 of 90% (though often it will). Similarly, an R^2 of 99% may indicate catastrophically poor prediction in some domains, and an R^2 of 0.001% a dramatic effect in another domain.
I discuss some of this in the following paper:
https://www.researchgate.net/publication/23481657_Standardized_or_simple_effect_size_what_should_be_reported?ev=prf_pub
R-square reflects the predictive fit of the model. If you are using regression to explain relationships alone, and not for prediction, then a low value would not be an issue.
R2 is one of the most abused metrics in judging goodness of fit in multiple regression analysis. It can be grossly misleading in some situations. For example, you can get a very large R2 (e.g., 0.999) as the number of predictors in your model approaches the sample size (n). This is called Freedman's paradox. You can also increase the R2 by including a predictor even if it has nothing to do with your response variable. A small R2 also does not mean poor explanatory power. R2 also depends on sample size: with the same number of predictors, as you increase the sample size, R2 values gradually decrease. In addition, just one influential value in your data can distort the R2. Please first check the influence statistics (e.g., outliers, leverage points) and collinearity (e.g., variance inflation).
Anyway, judging an analysis on R2 alone is a serious error; R2 by itself is a poor measure of goodness of fit. Instead you could check measures of lack of fit (e.g., Mallows's Cp) or information criteria (e.g., Akaike's or the Bayesian) to select the best subset model for your data. Please see Lukacs's paper for more info. The problem with these statistics is that they assume the model is correctly specified. They won't tell you whether you have autocorrelation, influential points, or any other problem that distorts the analysis. If you are looking for predictive power, it is even better to first apply partial least squares regression or reduced rank regression and extract the most important factors.
Lukacs, P. M., Burnham, K. P., & Anderson, D. R. (2010). Model selection bias and Freedman's paradox. Annals of the Institute of Statistical Mathematics, 62, 117-125.
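As a quick, hedged illustration of Freedman's paradox described above (my own sketch, not taken from the Lukacs et al. paper; the seed, sample size, and function name are all illustrative), regressing pure noise on pure noise drives the in-sample R2 toward 1 as the predictor count approaches n:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30  # sample size

def noise_r_squared(n_predictors: int) -> float:
    """OLS fit of random noise on random noise; returns the in-sample R^2."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, n_predictors))])
    y = rng.normal(size=n)  # response unrelated to every predictor
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# R^2 climbs toward 1 purely by adding irrelevant predictors.
for k in (1, 5, 15, 25, 28):
    print(f"{k:2d} noise predictors: R^2 = {noise_r_squared(k):.3f}")
```

Nothing here predicts anything, yet the in-sample R2 becomes arbitrarily large.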
I suggest starting with a simple x-y plot of your data. If it looks like a scatter plot, your model isn't going to be a very good predictor. If you still want to use the regression model, you absolutely must include the R**2 values. While I'm willing to accept that a low R**2 model might have predictive value under certain specific circumstances in theory, in practice I would never use the model. Depending on the domain, I might accept that x has an impact on y.
There are four problems I've encountered with use of regression analysis.
• The x-effect may be non-linear. Depending on the type of non-linearity, segments of linear behavior may occur. In general, I try to have some basis for picking the mathematical form of a model.
• The x-effect might be real, but it's not the dominant effect in terms of the behavior of y. In these cases, the model has little practical value.
• There may be no effect at all and I'm guilty of selection bias.
• x and y may be correlated, but there is actually no causal link. The model itself of course never "says that." However, that's the way we humans tend to interpret such results.
BTW, as a physical scientist I generally find that models with R**2 less than 0.6 aren't very well accepted. In fact, I still bear the scars of a battle I had with a statistician over a model I developed that predicted behavior over seven orders of magnitude of x but "only" had an R**2 of 0.7. Given the different and difficult circumstances surrounding phenomena of interest to social scientists, a lower standard is justified. However, sometimes a scatter plot is just a scatter plot.
I think there is a bit of confusion here about the objective of the regression analysis. Are we performing the regression to explain the variability in the dependent variable? Or are we looking for a model with consistent parameters? Call me a purist, but if your model explains only 2% to 15% of the observed variability, then even the most consistent and most significant parameters will yield predicted values that have little to do with the modeled phenomenon. I know that the approach "something is better than nothing" is quite fashionable in data mining and the social sciences, but that does not change the fact that such a model has more to do with wishful thinking than with the modeled variable. The ability to calculate precise estimates of betas that are significant because of the large data sets involved does not provide high predictive power if the R^2 value is small.
I think it is worth going back and seeing precisely what people have written. There is more to regression than prediction (though sometimes that is the goal). Tiny proportions of variance explained can equate to important theoretical or practical effects. If you come from a discipline where this is rare, that may be surprising, but it is nevertheless the case.
The proportion of variance in Y explained by X depends on several factors, notably the proportion of measurement error. As the measurement error isn't part of the phenomenon of interest, R^2 often underestimates the proportion explained as a consequence. Other factors such as range restriction are also important. For instance, one can get a small R^2 by restricting the range of X sampled or manipulated. Sometimes this is deliberate (e.g., it may not be ethical or practical to give large doses of a treatment) and hence the R^2 may be lower than would be seen in real-world application.
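To make the measurement-error point concrete (my addition, using the standard correction-for-attenuation result rather than anything stated in this thread): with reliabilities $\rho_{xx}$ and $\rho_{yy}$ for the two measures,

$$ r_{\text{observed}} = r_{\text{true}} \sqrt{\rho_{xx}\,\rho_{yy}}, \qquad R^2_{\text{observed}} = R^2_{\text{true}}\,\rho_{xx}\,\rho_{yy} $$

in the bivariate case. So, for example, two measures each with reliability 0.8 shrink a true R^2 of 0.50 to an observed R^2 of about 0.32, before any range restriction is even considered.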
In addition to the many factors that bias R^2 up or down, there is good old-fashioned sampling error. In small samples, R^2 estimates (even if unbiased) are very imprecise.
I'm personally not a fan of R^2 type measures at all, but if one is going to use them then the first thing to do is treat them cautiously. Assuming a model is bad or good _just_ because of high or low R^2 is in my view extremely silly.
In place of all the extreme case problems that have been raised about R-sq, I would flip things around and say that small R-sq inherently means large errors in the estimated values. There are basically two possible reasons for this: the results are indeed largely random with regard to the independent variables or the model is seriously mis-specified (e.g., non-linearities or omitted variables).
Neither of those situations is likely to make you happy, but I would like to offer a different suggestion with regard to interpreting the results "in context." Do you have good reliability and validity on your measures? Do you have strong theory that makes your results important to your field? If you can say yes to both of those questions, then your next consideration is peer review. If your colleagues agree that your results are a contribution to knowledge in your field, then it is up to the next person to do better than you have.
David, I've always found it unfortunate that a model with omitted variables is labeled "misspecified". All models have omitted variables (except for some very simple physical systems).
There are two parts to the answer. First, you can, and probably should report whatever you have found, even if it refutes your (or somebody else's) hypothesis and wishes. Reporting is about findings and reality, not about what would be nice.
Second, it depends: if you calculated the regression in order to predict Y, then the low R-squared may reflect a low ability to predict using this set of variables. If what you want is to estimate the contribution of each independent variable (net of the effects of the other independent variables), then the R-squared does not really matter.
R-square is the proportion of variation in the dependent variable that the model explains through the independent variables. It is usually considered good if R-square is high enough, say more than 0.5, but this is not necessary, because a model can have a large R-squared value even if the overall model is insignificant, and the model may or may not fulfill the necessary assumptions of the linear regression model, such as normality of residuals, homoscedasticity, no multicollinearity, and no autocorrelation. I mean that if the model assumptions are not fulfilled, then a high R-square is misleading. Instead of focusing only on R-square, other model selection criteria such as AIC, BIC, etc. should be used, and the model assumptions should be checked with at least these tests: the Durbin-Watson statistic for autocorrelation, plus checks for multicollinearity and/or heteroscedasticity.
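A minimal sketch of those checks, assuming Python with statsmodels (the data here is simulated purely for illustration, with a deliberately weak signal):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # illustrative predictors
y = 0.5 * X[:, 0] + rng.normal(size=200)    # weak signal, so R^2 will be low

model = sm.OLS(y, sm.add_constant(X)).fit()
print("R^2:", model.rsquared, "adj. R^2:", model.rsquared_adj)
print("AIC:", model.aic, "BIC:", model.bic)          # model-selection criteria
print("Durbin-Watson:", durbin_watson(model.resid))  # near 2 suggests no autocorrelation
print("overall F p-value:", model.f_pvalue)          # overall model significance
```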
Many authors especially favor an R-square above 10%; below 10% may be problematic, but more than 70% or 80% may also be problematic because of multicollinearity.
Repeating the comment above, report whatever you have. No relationship, or a very weak one, is still a result.
If you have masses of data (so the fit is not by pure chance) and get a low R^2 of, say, 10%, that might still be very helpful in a decision-making process; e.g., if some factor explains 10% of the variation in the life expectancy of cancer patients, that would be a lot better than knowing nothing about that factor. Sometimes in decision-making we have to go for our 'best shot' without the luxury of being very sure.
That said, the R^2 is a reflection of the model assumption. Maybe a different relationship fits better (it probably still won't be the 'truth', just a closer approximation to the truth). And, repeating another comment: take a look at the scatter plot and the distribution of residuals, and ask where the best fit is needed (low values, high values).
By combining the responses from Muhammad, John, and Gudeta, it appears clearly that we should always be very careful in the way we interpret R^2 and R. Most often, we are too concerned with the strength of these explanatory estimates. In correlation analysis, for instance, two main values are used to explain the relationship: the correlation coefficient and the P-value. While the correlation coefficient describes the strength of the association, the P-value tells us how statistically reliable that association is. It is the same with R^2: while the value of R^2 gives an idea of the strength of the relationship, other parameters that help assess the validity of the model, as explained by Muhammad, are equally important to consider. So in my opinion, it depends on the study context to some extent, because there might be contexts where the strength of the relationship is more important, some where one might be more interested in the reliability that is reflected in small P-values, and some where one might be interested in both.
Do not worry at all. The R^2 value may be low for a cross-sectional regression when data are collected from, say, a wide range of industries. Just ensure that the signs of the coefficients are in line with expectations and that the t-values of some variables are significant in the case of a multiple regression model. A very low value might signify omitted variable bias.
There are 3 purposes to regression: prediction of future values, reporting of an association between an independent and dependent variable, and explaining observed variation in a dataset. The R^2 statistic is important to the third purpose.
R^2 is useful for future prediction (purpose 1), but there are better approaches to this (see Breiman, 2001, Statistical Modeling: The Two Cultures, Statistical Science, Volume 16, Issue 3, 199-231).
R^2 is largely irrelevant if your purpose in running regression is to test an association, such as whether there is a treatment effect on an outcome.
Thus, it is useful to think about your purpose in doing the regression. If the purpose is to explain lots of covariation, then in my view, it depends completely on the outcome, sample, and expectations in the field. For psychosocial outcomes, we rarely obtain R^2 better than 0.6 (if we do then something's awry). Cardiovascular epidemiologists expect higher.
R2 can even be negative. It is good to interpret it in relation to other summary statistics. In the end, all statistics will look "normal" under the classical linear regression assumptions.
Negative R-square is sometimes a symptom of multicollinearity. It is generally viewed as "pathological."
If the R-square value is high, that means the independent variables we have chosen are capable of explaining the variation in the dependent variable. But if the R-square value is low, it may mean the variables we have chosen are wrong. So a high R-square value is preferable. You can refer to the book "Introductory Econometrics for Finance" by Chris Brooks.
R^2 can't really be negative in a least squares model unless it is defined in an unusual way.
e.g., R^2 = SSregression/SStotal.
As SSregression is a sum of squares, it is always non-negative.
Adjusted R^2 and various pseudo-R^2 measures can be negative and I think you can also get negative values from certain versions of software (presumably where SSregression is calculated in an unusual way).
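A tiny sketch of the definitional point (my example, with deliberately bad predictions): with the other common definition, R^2 = 1 - SSresidual/SStotal, a model that predicts worse than the mean gives a negative value, whereas SSregression/SStotal cannot:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([4.0, 3.0, 2.0, 1.0])  # a deliberately terrible "model"

ss_res = np.sum((y - y_hat) ** 2)     # 20.0
ss_tot = np.sum((y - y.mean()) ** 2)  # 5.0
print(1 - ss_res / ss_tot)            # -3.0: worse than just predicting the mean
```

In an OLS fit with an intercept the two definitions coincide and R^2 stays in [0, 1]; negative values arise out of sample, in fits without an intercept, or in pseudo-R^2 variants.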
Have a look at this website:
http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
It is a really good explanation of what high and low R-square values can mean, and what they do not mean.
Very high R-square values are not always good, nor are low R-square values always bad. In simple terms, we can say the model of selected variables (chosen by the researcher) fits the data well (or explains more of reality) if the data points are close to the fitted regression line and the difference between the observed data points and the predicted values is small.
R square is a measure which tells about this difference. Selection of the variables is dependent on researcher's choice. R square does not tell us whether the selected regression model is adequate or not.
In the case of low R-square values, the coefficients of the variables can be checked. If the coefficient of a variable is highly significant, that means it is an important variable, yet the overall selection of all the variables is not good enough to explain the variability of the dependent variable. Some important independent variable may be missing from the model. In such a case the selection of the chosen variables should be justified (why only those variables were taken, and what other important variables might be required to better explain the research question at hand).
R squared results and model significance are in fact related, since both R squared and F significance test expressions can be derived from the ANOVA.
Therefore a low R-squared (for instance 0.02) would lead to a low value of F, thus failing to reject the null hypothesis of non-significant coefficients for the proposed explanatory variables (of course, the result of F would also depend on the sample size and the number of regressors).
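For reference, the standard relation behind this comment (my addition; k regressors, n observations):

$$ F = \frac{R^2/k}{(1-R^2)/(n-k-1)} $$

For example, with R^2 = 0.02 and k = 3, a sample of n = 100 gives F of about 0.65 (not significant), while n = 10000 gives F of about 68 (highly significant), which is exactly the sample-size dependence noted above.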
In the social sciences it is common to find authors justifying models with very low R-squared by saying that the phenomenon is determined by factors not necessarily expressible in numerical form. But that argument is almost always false and gives evidence of a very limited framework. Low R-square values are indicative both of an inadequate model and, what is most serious, of a weak theoretical framework.
Hi Voxi Heinrich Amavilah and others, R-square cannot be negative; adjusted R-square can be negative. If you obtain a very low R-square in OLS, it is better to find new variables to improve the R-square. Though R-square does not tell us whether the selected regression model is adequate or not, you have to examine the significance level of each individual coefficient and other statistical properties. A low R-square is often acceptable if you have a very high sample size, the coefficients are significant and plausible (signs and sizes are as expected), and the statistical properties are met.
Don't we find R2 especially useful for assessing the fit of different models? That is, the R2 change when removing or introducing one or more variables in a sequence of model-fitting steps? In this case, R2 teaches us about the size of the gain or loss in the model's fit, so we can make judgments about whether or not to add or remove one or more variables. But, though obviously related, we would generally prefer to use F to make statements about the size of the effect in the relationship between the predicted and predictor variables. Wouldn't you agree?
It depends what you mean. R^2 can be useful for comparing models on the same data but it is dangerous to use R^2 change alone to decide on what models to include.
F is not a good measure of effect size as it depends on all sorts of factors that have no relation to the size of the effects (e.g., the reliability of the measures).
I'm surprised Thom at your comment about F. F is a pretty straightforward index of effect size. I'd refer you to Cohen's seminal work on effect size. If you mean that models have to make sense in more ways than merely the statistical output, I agree completely. It goes without saying that you have to know what the heck you're doing and why.
On your comment about R2, I'm puzzled. Naturally, there are no simple switches in multiple regression analysis, so unthinkingly doing anything can lead to errors. In fact, no statistical procedure should be selected and used and interpreted without thinking about all kinds of other issues surrounding the analysis: study purpose, research question design, variable selection, data collection and quality, the nature of the distributions of the variables in the data, on and on. Let's say that none of this should be done unthinkingly.
But about R2, I mentioned variables within a modeling sequence, not comparing completely different modeling efforts. It may not be the only use, but assessing model fit is the classic use of R2 in multiple regression modeling. I'd be interested in reading how you use R2 in multiple regression modeling.
OK. I think you are referring to Cohen's f rather than Fisher's F ratio (which I had mistakenly assumed).
Cohen's f is a standardised effect size metric, which a number of people (including me) think is generally a poor choice as a measure of effect size per se, as it is a measure of discriminability (i.e., a signal-to-noise ratio). Such metrics play a central role in power calculations (where discriminability is important) but are often problematic when trying to assess the practical significance of an effect. Even in power calculations it is easy to go wrong (e.g., by using an effect size statistic from a study with a different design).
e.g., see
https://www.researchgate.net/publication/23481657_Standardized_or_simple_effect_size_what_should_be_reported
for my own views.
In terms of R^2 - it is a useful descriptive stat but simply maximising R^2 is a poor way to pick a model (e.g., through stepwise selection). I wasn't sure whether you were referring to its use as descriptive tool or its use in approaches such as stepwise regression.
Nope, I was not talking about Cohen's f. I was referring to the F that indexes the degree of association between the independents and dependent variables for a given model.
As to variable selection from among a set of variables that are determined, from the outset, not by any statistical procedure but by study design, R2 CHANGE/DIFFERENCE would be ONE of the pieces of information I would use within a modeling sequence to remove variables. I would not let the software do this for me. I would run each step in the modeling sequence independently.
Without going into other things, I would value reading what the recommended procedures/steps would be, from your standpoint, for removing variables from a model so as to arrive at the best-fitting, most elegant solution.
Generally I would prefer a theory-driven approach to compare different models. How that works would depend on context. Mainly I work with experimental data and that simplifies things a bit. In more complex cases I'd try and set up models that capture the predictors of interest (e.g., structural or demographic factors and then adding in main effects of important theoretical predictors and then two-way interactions and so on).
The precise strategy would depend on factors such as sample size and collinearity. For instance, there is no point adding in all two-way or three-way interactions if the additional df exhaust or nearly exhaust the error df. Where possible I would leave predictors in the model (significant or not) if there is a theoretical reason for including them (e.g., demographic factors). For instance, if you have an outcome variable that differs between genders in the literature, it makes sense to keep a non-significant gender predictor in the model.
For me the regression coefficient is what matters, as it estimates the size of the effect of the underlying relationship; R-squared, as a squared correlation coefficient, assesses the scatter around that relationship. It is the correlation (squared) between the observed and predicted response values. You can get a close fit to a very shallow line.
As always, estimates and effects have to be put in context. I have seen the following benchmarks stated for odds ratios: small but not trivial, 1.5; medium, 3.5; large, 9. And for R-squared: aggregate time series, expect .9+; cross sections, .5 is good; large survey data sets, .2 is not bad. [See people.stern.nyu.edu/wgreene/.../Statistics-14-RegressionModeling.pptx]
And yet in a trial of the effect of taking aspirin on heart attack the odds ratio was so dramatic that the trial was stopped and placebo group advised to take aspirin. The odds ratio of a heart attack for placebo compared to taking aspirin was a lowly 1.83, while the R2 was a puny 0.0011; yet this was sufficient for action.
If regression analysis gives this kind of R**2 value, I suggest using Fuzzy Set Qualitative Comparative Analysis instead. Often more useful because it can identify causation rather than correlation, i.e., if there are just a few cases with a certain set of attributes, but they always [or almost always] lead to a result of interest, then fs/QCA can provide stronger justification for a relationship than any regression analysis might.
There is no clear cut-off at 70%, and even if it were, you still need to focus on the theory behind your model and the situation at hand. It is difficult to get a high R-square, for example, for a variable that is impacted by many factors, some of them not included in the model.
R^2 or adjusted R^2? Is the data time series or cross-sectional? In time series, R^2 is normally very high, but if you make the variables stationary then it becomes low again. In cross-sectional data, R^2 is always low. R^2 may also be low due to a multicollinearity problem. And what about the significance of the individual variables? We have to check all aspects.
Even if the R-square value is low, if the F statistic is significant I would say the model fits, as the F statistic speaks to the fitness of the model in the population while R-square is merely a sample result. In this regard, I would invite you to join the "Analysis" Facebook group to discuss econometric models further. The link is given below.
https://www.facebook.com/groups/449520105253087/
In my personal view, I would still report the low R-squared values in the multiple regression analysis, as this is part of the empirical evidence. However, this would prompt me to think further about the possible causes of this finding and how to address them, including: the conceptual framework developed, the theories adopted/adapted, the instruments used, the data collection approach, whether the findings tally with the literature reviewed, etc. A low R-square also gives me the opportunity to explain further in the discussion section of an article/thesis why it is low, e.g., there might be other unobserved factors/variables influencing this dependent variable that are not part of the study (based on further or more recent literature review), stating this as a limitation along with recommendations for future research, etc. Do our best to find avenues to contribute to knowledge in the midst of unfavorable empirical evidence.
The real question is what type of regression you are doing. If you have cross-sectional data, an R^2 value ranging from 2% to 15% will be good enough, provided the coefficients of the variables have the signs envisaged by the model specification, one or two variables have significant t-values, etc. On the contrary, a high R^2 value coupled with very low t-values for all the coefficients casts doubt on the model specification and/or data. If your model specification rests on a solid economic argument, then don't worry too much about the R^2 value, particularly in the case of cross-sectional and pooled regressions.
Yes, definitely we can incorporate low R-Squared values in research papers.
What matters is whether the factors considered are found to be statistically significant or not.
Low R-Squared value with statistically significant parameters is more valuable (useful) than high R-Squared accompanied with statistically insignificant parameters.
Regards
Ashu
This question goes to the multiple uses of regression. If one's purpose is to build very efficient predictive models, then maximizing R2 or adj. R2 is key. In the social sciences, where most often we're interested in testing hypotheses about certain variables while adjusting for the effects of others, the significance levels of key variables are much more important (although that's perhaps a discussion for another forum). If in your model you determine that you are or are not able to reject H0s of interest, the amount of variance explained by your total model is more or less irrelevant. Hope that helps, and good luck with your analysis.
Best wishes,
Bill
(i.e., the b values) for the predictors are significant. Please also check the influence statistics (outliers and leverage points). Given the large sample size you used, there will likely be many of those, which can reduce the R2. I recommend robust regression for that.
All the best
For some reason the first part of my answer was lost. What I was saying was that a small R2 is not a problem as long as the omnibus test is significant and the b values for the predictor(s) are significant. The small R2 could be partly due to the large sample you used.
I now see the problem: the ED scores (X-variable) range between 0 and 6 and will definitely not be normally distributed for each Y; hence the bivariate normality assumed in ordinary least squares (OLS) regression is already violated. I think OLS (which returns the R2) is inappropriate, and robust regression may not be useful because it is simply an extension of OLS that caters for influence statistics. I suppose treating the EDs as categorical values and running a multinomial regression will probably give a better picture. The problem, however, is that you have 7 categories, making the interpretation difficult. Alternatively, creating social or ecological strata within countries (e.g., urban/rural, poor/rich, young/old, any category ordered or otherwise) and using these as categorical variables may be more useful than the EDs. While this kind of analysis also gives an R2 (called the pseudo-R2, whose interpretation is slightly different), you will have more flexible options for interpreting the response if you apply binary, ordinal, or multinomial regression, because all of these use a maximum likelihood framework. If you want to use OLS, here is a short article I wrote on the problems with OLS regression and rules of thumb for achieving satisfactory results. http://www.researchgate.net/publication/281371440
A small R2 does not mean poor explanatory power, and a high R-squared does not necessarily indicate that the model has a good fit. If our R-squared value is low but we have statistically significant predictors, we can still draw important conclusions about how changes in the predictor values are associated with changes in the response value. Regardless of the R-squared, the significant coefficients still represent the mean change in the response for one unit of change in the predictor while holding the other predictors in the model constant. Obviously, this type of information can be extremely valuable.
R-squared cannot determine whether the coefficient estimates and predictions are biased. R-squared does not indicate whether a regression model is adequate: we can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data. High R^2 values sometimes indicate problems with the modelling (e.g., ecological correlation) or arise because the sample sizes are small.
Also, the main assumption in a multivariate regression analysis is the significance of the whole regression. The interpretation of the R-squared will depend upon whether the output is significant or not, that is, whether the model is highly significant (based on the model F-statistic). Thus, the goodness of fit of the model is indicated by a high value of the F-statistic.
In most instances it is the F-statistic that tests whether R and R-squared are significant; even if they are low in value, we can still interpret them in our research.
Finally, R-square reflects the predictive fit of the model. If we are using regression to explain relationships alone and not for prediction, then a low adjusted R-square would not be an issue.
The F statistic is a test of all the coefficients simultaneously to see if they all differ from 0. Simultaneously, it is a test of R-square to see if it is significantly different from 0. Those two tests are equivalent.
A low R-square value does indicate small explanatory power--as an effect size, it would be a "low" or "small" effect. What is important is to realize why that might happen. Some phenomena would require complex models with many predictors to explain them (particularly human behavior phenomena). Other phenomena can be predicted or explained with just a few predictors. In the former case, we expect to see low R-squared values (even if they are statistically significant), but in the latter case, we would expect to see higher R-squared values (even if they are not significant).
R-squared does not indicate "fit" or "adequacy" of the model--that is a common misconception. Those are determined by detailed analyses of the residuals to see if there are potential outliers which can violate the underlying assumptions of OLS (e.g., distance measures, leverage measures, DFFits, DFBetas), as well as tests for normality, equal variance of the error terms, etc.
Multicollinearity does not affect R-squared, but it does affect our ability to estimate the coefficients. If there is considerable multicollinearity, you cannot truly reduce it by centering variables (another common misconception), but you can consider whether you need predictors that are highly correlated, or whether you might be able to combine scales that are highly correlated because they may measure the same construct. The use of orthogonal polynomials can also be useful.
When OLS assumptions are violated, use robust regression techniques. There are several to choose from. Any good linear models text will discuss them.
--Ramona Paetzold
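As a hedged sketch of the robust-regression advice above (my own illustration, not Ramona's code; it uses statsmodels' RLM with a Huber M-estimator, just one of the several techniques a linear models text would cover):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
y[:5] += 15  # contaminate with a few gross outliers

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
rlm = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()  # Huber M-estimator
print("OLS slope:", ols.params[1], " robust slope:", rlm.params[1])
```

The robust fit downweights the contaminated points, so its slope stays much closer to the true value of 2 than the OLS slope does.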
Low R-squared could mean, generally speaking, one of two things:
a) There is a problem of misspecification in your model (omitted relevant variables, incorrect functional form, etc.). Therefore, the error term of the model contains components that explain a high fraction of the variance of the data you are trying to fit.
b) The data you are trying to fit has intrinsically a high random component. For instance, if you want to fit "y" to a linear model, and the data generating process of "y" is:
e = 3*u ;
y = 1 + 2*x + e.
where x is distributed Uniform(0,1) and u is distributed Normal(0,1). Then the variance of y will be:
Var(y) = Var(1+ 2*x + e) = Var(2*x + e) = 4*Var(x) + Var(e) = 4*Var(x) + 9*Var(u)
Var(y) = 4 * (1/12) + 9*1
Then, by construction:
Var(1+2x) / Var(y) = (4 * (1/12)) / (4 * (1/12) + 9*1) = 1/28 ≈ 0.036
That means that your model (E(y|x) = alpha + beta*x) should explain only about 3.6% of the variance of y, given that y "naturally" has a high random component.
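A small simulation sketch of this data generating process (my addition; the seed and sample size are arbitrary) confirms the derivation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.uniform(0, 1, n)
u = rng.normal(0, 1, n)
y = 1 + 2 * x + 3 * u  # the DGP from above

# Theoretical explained share: Var(2x)/Var(y) = (1/3)/(1/3 + 9) = 1/28 ~ 0.036
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
print(1 - resid.var() / y.var())  # ~0.036, even though the model is exactly right
```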
For instance, in some specialized literatures it is not rare to find low R2 values. I think it is justified to be concerned about a low R2 if you survey the papers that run the same regression as yours (or a very similar one) and they find substantially different R2 results.
I have a similar question, though mine is how to interpret R-squared. Does it mean that additional variables that produce a higher R-squared are better than those that produce a lower one?
Hello Muzaza Musangu
Interpreting R-squared values requires a proper understanding of the variables used and how the variables can influence each other. Variables having the same interpretation or real-life meaning tend to increase R-squared values (the confounding effect), giving the impression that statistical effects are significant when actually they are not. Therefore, adding variables to a regression should not have as its goal increasing the R-squared value, because the resulting confounding would probably falsify the results.
Dear Musangu, I hope you are satisfactorily clarified. What matters first in modelling is the quality of your model-fitting statistics. You have to check the selection of variables and collinearity, then the significance of the overall variability explained by the predictors over the outcome variable, and then the effects of the individual predictors. The R-square or pseudo-R-square helps more in comparing models and then ranking them according to how well they fit, i.e., how well they explain a given system.
Low R-squared values can be an issue. The R-squared value is simply an indicator of how much variance in one variable is explained by the other variable. It is possible to have a good p-value (less than .05) but still have a low R-squared value.
This is a helpful exposition of the issue.
http://statisticsbyjim.com/regression/low-r-squared-regression/
Good luck!
To increase R-sq, you need more influential predictors that affect the dependent variable, while also being careful about the multicollinearity problem in this context.
Best Wishes
In social science, when examining the effectiveness of a factor, the size of R2 does not matter. However, we need to explain why the R2 is low if that is the case. At a minimum, we should explain whether potentially important covariates (independent variables) are included or not. If those covariates are included in the model and the R2 is still low, we would question whether the measurement of the dependent variable and some independent variables is accurate. If those covariates are not included, we should mention why we could not include them and state that "further research is necessary".
A low R-sq reflects that a researcher needs independent variables with more influence on the dependent variable. Also, severe multicollinearity in multiple regression analysis adversely affects the estimates of the parameters. The VIF should be less than 5 or 10.
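A brief sketch of checking VIFs along those lines, assuming Python with statsmodels and pandas (the two correlated predictors are fabricated purely for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
X = pd.DataFrame({"x1": rng.normal(size=100)})
X["x2"] = 0.9 * X["x1"] + rng.normal(scale=0.3, size=100)  # nearly collinear with x1
X = sm.add_constant(X)

# Rule of thumb from the comment above: VIF below 5 (or 10) is acceptable.
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```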
More powerful approaches are recommendable, e.g., artificial neural networks (ANNs) or multivariate adaptive regression splines (MARS). For my part, the second is more informative. For more information, please see my book:
entitled "Application of Multivariate Adaptive Regression Splines in Agricultural Sciences through R Software"
Best Regards
Prof. Dr. Ecevit Eyduran
Chair of Business Administration, Quantitative Methods
Ridge Regression may solve this problem if your data suffer from multicollinearity.
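A minimal sketch of that suggestion, assuming Python with scikit-learn (the near-collinear data is fabricated; RidgeCV picks the penalty strength by cross-validation):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=200)])  # near-collinear pair
y = x1 + rng.normal(size=200)

model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print("alpha:", model.alpha_, "coefficients:", model.coef_)  # shrunken, stabler estimates
```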
In reality, many models do have a low R-sq. Such a model may still be very useful for model building, i.e., for precise estimates.
Professor Şenol, the R2 value means a great deal in linear regression. In particular, a low value (close to zero) shows that the variables included in the model are not sufficient. In nonlinear models, however, it does not mean very much; there, using the error term gives a more accurate assessment.
I think if R2 is low but the model is significant, that means go for it, no problem. Thanks, Ramona.