You should test the regression with different models (log, exponential, linear, ...). I assume you mean linear regression. If so, an R-squared of 0.14-0.25 indicates a very poor linear correlation. In general, an R-squared above 0.9 is considered suitable, but if your variables are linearly proportional you should get R-squared values above 0.99.
Carlos, are you an analytical chemist? Then you should have a look at biological, psychological, or economic data. In all subjects (even hard-core physics and astronomy) there are correlations that may be much, much lower than 0.9 and that are still very interesting, "significant", or relevant. It depends on the context. You should not give such general advice when it turns out to be generally wrong.
Dear Obras, I would like to discuss this further for clarification. Your suggestion to aim for 0.99 is well understood. However, I have read articles in which an R-squared value of 0.14 is called significant, though this depends on the type of model used in the study. As far as my model is concerned, is there an expert here who can help with this?
With my first post I just intended to push you towards thinking about this. Only experts in your field of study will be able to help you, and even then there may be differing opinions. And if nothing is known that would hint at the "significance" or relevance of a finding, why not simply report the finding as it is, without deciding whether or not it is "significant" under any interpretation? It will still be a valuable result, interesting for other researchers (as long as the method was sound).
In reply to Rafael: please, please, please be careful with omitting outliers from analyses. There may be biological reasons (speaking as a biologist) for the outliers to be there, and omitting them will bias your interpretation of the results so that it is no longer a true representation of what you want to model in the end (e.g. the biology).
Following on from Jochen's first answer, and somewhat in reply to Rafael's comments, here is a quotation from Pete Turchin's book on modeling population dynamics: "... we should not be in the business of rejecting theories, as ecological Popperians would have us do, but in the business of contrasting two or more theories using the data as an arbiter. The corollary of this approach is that our best theory may not explain or predict the data very well, but we should still use it until we have something better. Even the theory that explains only 10% of the variation in the data is useful, because it sets a standard to be bettered."
It seems to me that this is the type of situation you might be in. In the specific context you mentioned, if disease resistance has a low heritability - equivalent to saying that there is a large component of environmental variance in the expression of disease, then you will find a low r-squared in a regression of disease intensity against inbreeding. This does not mean that the variance captured by the regression is not significant (in either a statistical or biological sense) but only that the genetic component of variance is relatively low in your system.
If an outlier is beyond a physically or physiologically plausible/possible range, it must be a bad measurement and should in any case be removed from the analysis.
But all other "outliers" share a general problem: either you have little data, in which case it is simply not possible to judge reliably whether one of these few values is an outlier or the only value telling you the important part of the story; or you have really lots of data, so that an outlier could be identified as such with high enough confidence, but then its removal would not change the results (the change will usually be negligible). [Note that "outliers" have to be rare events! If you frequently face "outliers", then it is likely that such values simply represent a valid part of the distribution! I have seen groups removing lots of "outliers" simply because they treated the data as normally distributed although it was in fact gamma distributed. No one thought about this.]
So all one can do is remove values that are physically implausible or impossible. Any further study of "outliers" and fiddling with their removal is at best worthless and at worst misleading. If you see outliers in small data sets, or several outliers in large data sets, then your definition of "outlier" (i.e. your distributional model) is likely wrong and should be revised (instead of removing data that does not fit your assumptions).
There is no single adjusted r-squared estimate that always represents a "significant" effect because you need to take the size of the sample into account. Moreover, a terribly biased model might still have a high r-squared value.
R-squared statistics are relied on too much in regression modelling. A model with extreme bias can still have a high r-squared estimate, and two models can have the same precision but vastly different r-squared estimates if one is estimated with a greater range of the dependent variable than the other.
Create some models using the attached dataset, known as "Anscombe's quartet". Make models of y1, y2 and y3 versus x1, and y4 versus x2. Note that all four models have virtually the same r-squared value, slope, intercept and F value. Now plot the residual values versus the predicted values for each model. Do you think that an r-squared estimate is an adequate criterion for the quality of a model?
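The exercise can be sketched in a few lines of Python (standard library only; the values below are the published Anscombe figures, hard-coded here in place of the attached file):

```python
# Anscombe's quartet: four data sets with nearly identical summary statistics.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for sets 1-3
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def fit(xs, ys):
    """Ordinary least-squares straight line: returns (slope, intercept, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy ** 2 / (sxx * syy)

results = [fit(xs, ys) for xs, ys in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]]
for slope, intercept, r2 in results:
    print(f"slope={slope:.3f}  intercept={intercept:.2f}  r2={r2:.2f}")
# all four sets give approximately slope=0.500, intercept=3.00, r2=0.67
```

Only a residual plot (or a plot of the raw data) reveals how different the four relationships really are.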
In fact, we rarely do ourselves a favor by simplifying and summarizing the world in a single number. As soon as we have such a summary, combining very different aspects of reality, we tend to take it as everything there is to know or to consider. How stupid this can be is obvious in many fields, but often not so in math or statistics (probably because people do not want to question it or are not able to understand it).

For instance, the efficiency of a car might be defined largely by its fuel consumption and its weight. In Germany, the efficiency class of a car is in fact calculated as the ratio (consumption per weight) and has to be indicated; the aim is that more efficient cars are sold. But since this measure is so stupidly chosen, heavier cars are generally rated as more efficient. This is practical and ecological nonsense, and here it is quite obvious that combining consumption and weight in one "statistic" about the car is insufficient and misleading (at least for this purpose).

It becomes less obvious when we consider the signal-to-noise ratio (SNR) of instruments measuring the same thing but on different scales (with different methods). Although the SNR has some valuable properties, it simply is no help in judging the relevance of a measurement. And the whole story gets completely puzzling when we consider strangely normalized inverse SNRs, like p-values or R² values! We do not even understand what they actually tell us, but we think that this is the main result or information we have to take. We must neither ignore the signal nor the noise. Only when we are clear about both properties individually might we go one step further and consider an SNR (or, quite equivalently, p-values or R² values), but then (we would recognize that) this would not add much information.
Any statistic suits a purpose. We should start by thinking about the purpose and then find an appropriate statistic. Unfortunately, however, it is quite common practice that statistics are calculated (because others do it) and then these statistics are abused to suit a purpose for which they were never invented.
The value of R-squared in regression analysis provides information about the percentage of variation in the response (Y) explained by the independent variable(s). In a simple linear regression model, the square of the correlation between the response (Y) and the independent variable (X) gives the same result and can be interpreted like the R-squared. Generally, the greater the value of R-squared, the better the model fit (obtained by regressing Y on X), but sometimes it gives misleading results. For example, take the case of a simple linear regression model with only one independent variable, and suppose the value of R-squared is 0.18. How to read this depends on the relationship between X and Y. If a linear regression equation is applied but the actual relationship between X and Y is not linear (which can be checked with a scatter plot), then the value of R-squared will naturally be low. In such a situation, the low value of R-squared does not indicate a weak relationship between X and Y; it may be that the actual relationship is non-linear (for example quadratic) but you have fitted a straight line. If you change the equation from linear to quadratic, or to some other non-linear form of regression, the value of R-squared will increase. In such a case it is suggested to check the functional form of the relationship between X and Y using a scatter plot (with the independent variable on the X-axis and the response on the Y-axis) and then decide on the type of regression model, i.e. simple linear regression, quadratic equation or higher-degree polynomial, exponential form, etc.
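This effect is easy to reproduce with a small sketch (Python, standard library only; the data are invented for illustration): a perfectly quadratic relationship, symmetric about zero, gives an R-squared of exactly zero for a straight-line fit in x, but an R-squared of one once the quadratic form is used (here by regressing on x² instead of x).

```python
# Exactly quadratic data, symmetric around x = 0: y = x^2.
xs = list(range(-5, 6))
ys = [x ** 2 for x in xs]

def r_squared(xs, ys):
    """R^2 of an ordinary least-squares straight-line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

r2_linear = r_squared(xs, ys)                        # straight line in x:   R^2 = 0.0
r2_quadratic = r_squared([x ** 2 for x in xs], ys)   # straight line in x^2: R^2 = 1.0
print(r2_linear, r2_quadratic)
```

The relationship between X and Y is perfect in both cases; only the assumed functional form changes, and with it the R-squared.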
On the other hand, take the case of a multiple linear regression model containing more than one independent variable. The common problem here is that R-squared is an increasing function of the number of independent variables in a model. If you add independent variables to a model, the value of R-squared will increase irrespective of whether the added variable is relevant or irrelevant. In such a case, the most common suggestion in many books on regression analysis (see e.g. Gujarati, 2002; Draper and Smith, 1998) is to report the adjusted value of R-squared. Similarly, many other problems associated with a regression model can affect the value of R-squared, and ultimately the adjusted R-squared, in multiple linear regression. For example, in the presence of multicollinearity (when the independent variables are correlated with each other), the value of R-squared can be very high, indicating that the model fits the data well, while very few or no significant t-ratios are obtained for the regression coefficients; there are many alternative remedies to consider in that case. In addition, the value of R-squared can also be affected (it may be low) by outliers and influential observations in Y, in X, or in both. Sample size also matters: increasing or decreasing the sample size will affect the value of R-squared. Violation of many other assumptions of the regression model can likewise affect the value of R-squared and many of the tests involved.
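The adjusted R-squared mentioned here is 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the sample size and k the number of predictors. A small numerical sketch (the R-squared values 0.50 and 0.51 and the sample size are invented for illustration) shows how it penalizes an extra predictor that adds almost nothing:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30
adj_1 = adjusted_r2(0.50, n, 1)  # one predictor, R^2 = 0.50
adj_2 = adjusted_r2(0.51, n, 2)  # an extra, nearly irrelevant predictor nudged R^2 up
print(round(adj_1, 3), round(adj_2, 3))  # prints 0.482 0.474
# the plain R^2 went up (0.50 -> 0.51), but the adjusted R^2 went down
```

So the adjusted version can fall when a variable does not earn its keep, which is exactly why books recommend reporting it for multiple regression.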
Now, coming to your question that the R-squared lies between 0.14 and 0.25: how significant would that be? R-squared does not provide information about the significance of a model; it only tells us how much of the variation in Y is explained by the independent variable(s). For the significance of an X-variable you can check the significance of its regression coefficient (t-ratio and its p-value). Even in some cases with a very high value of R-squared, the regression coefficient is still non-significant while the intercept is significant, or vice versa. In most cross-sectional studies (survey-related studies) the value of R-squared is low (as you have mentioned), but you can still get significant t-ratios for the regression coefficient(s). You can consult the literature and then compare your own results with previous studies, keeping in view all the conditions under which the experiment was conducted.
Dear Rafael, thanks for correcting me. Yes, I agree with you, but the linearity condition (as defined in most standard books on regression analysis) is defined in three ways:
1. Linearity in parameters
2. Linearity in variables
3. Linearity in both parameters and variables
My point about the quadratic equation (which is considered non-linear in the above explanation) is that it is linear in the parameters but not linear in the variables, since it contains a term of degree 2 (a second-degree polynomial). If this relationship exists, you will always see a curvature in the scatter plot; on the other hand, if the relationship is linear, joining the points of the scatter plot (X vs. Y) will give a straight line.
More discussion on linear and non-linear models can be found in Gujarati, 2002.
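To make the "linear in parameters" point concrete, here is a minimal sketch (Python, standard library only; the coefficients 2, 3 and 0.5 are invented for illustration): the parabola y = 2 + 3x + 0.5x² is non-linear in the variable x, yet because it is linear in its parameters it can be fitted by ordinary least squares, treating x and x² as two separate predictor columns.

```python
# Exact data from y = 2 + 3x + 0.5x^2 (coefficients chosen for illustration).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [2 + 3 * x + 0.5 * x ** 2 for x in xs]

# Design matrix with columns 1, x, x^2: the model is linear in its parameters.
X = [[1.0, float(x), float(x ** 2)] for x in xs]

# Normal equations (X'X) b = X'y.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(3)]

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [v] for row, v in zip(a, b)]
    n = len(m)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

b0, b1, b2 = solve(xtx, xty)
print(round(b0, 6), round(b1, 6), round(b2, 6))  # recovers 2.0, 3.0, 0.5
```

The fitted curve is a parabola, but the estimation machinery is exactly that of a linear model.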
We should say "straight-line model" if we mean to fit a straight line, and "parabola model" or "polynomial curve model" if we mean to fit some polynomial function. The term "linear model" should not be confused with the shape of the relationship between the variables. It should always and only be used to indicate that the fitted parameters of the model enter untransformed (i.e. "linearly").
Unfortunately, usage is a little relaxed for the term "regression". "Linear regression" (I agree with Yousaf here) usually means a straight-line model, and the tool applied is a linear model. "Non-linear regression" typically means some curved relationship between predictor and response, and this, too, is mostly done employing some linear model. The problems with this terminology become obvious when it comes to things like "logistic regression" and "Poisson regression" and the like, which model clearly non-linear relationships but are still (usually) linear models.
It would be best to specify the model explicitly and possibly point out some characteristics of that model. Without a reference to such an explicit description, every term like "linear regression" or "non-linear model" will necessarily remain ambiguous or even misleading. The worst side-effect, in my opinion, is that the ambiguous use of such terms hinders students from understanding what actually is done and what actually is important.
Hi folks, may I know if there are any papers supporting that an R-squared of 0.1x to 0.3x is acceptable because the data come from a questionnaire? Or does that depend on other models? Many thanks!
In some cases an r-squared value as low as 0.2 or 0.3 might be "acceptable" in the sense that people report a statistically significant result, but r-squared values on their own, even high ones, are unacceptable as justifications for adopting a model. R-squared values do not indicate whether or not a model is biased, nor do they indicate what the likely error of a prediction will be. In my view they are over-used, and I'd prefer to make judgements about a model using at least its standard error and plots of residuals versus predicted values and residuals versus independent variables. Moreover, r-squared values vary with the range of the dependent variable: if you take a relation between y and x that has an r-squared of, say, 0.8, and then restrict your data to the middle two quartiles of the y variable, you'll get a vastly lower r-squared, but the standard error will be similar. R-squared values are very much over-used and over-rated.
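The range effect can be sketched as follows (Python, standard library only; the data are artificial, with a deterministic alternating +/-3 disturbance standing in for noise so the example is reproducible): restricting the same relation to the middle of its range leaves the residual standard error almost unchanged but cuts the r-squared substantially.

```python
# y = x plus an alternating +/-3 disturbance (deterministic stand-in for noise).
xs = list(range(40))
ys = [x + (3 if i % 2 == 0 else -3) for i, x in enumerate(xs)]

def fit_stats(xs, ys):
    """Straight-line OLS fit: returns (r_squared, residual standard error)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot, (ss_res / (n - 2)) ** 0.5

r2_full, se_full = fit_stats(xs, ys)                  # full range of x
r2_mid, se_mid = fit_stats(xs[10:30], ys[10:30])      # middle half of the range
print(f"full range:  r2={r2_full:.2f}  se={se_full:.2f}")
print(f"middle half: r2={r2_mid:.2f}  se={se_mid:.2f}")
# r2 drops noticeably on the restricted range while se stays about the same
```

The fit is equally precise in both cases; only the spread of the dependent variable, and hence the r-squared, has changed.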