The issue in this question is statistical analysis of the size of R-squared. R-squared, also called the coefficient of determination, measures how well a proposed model fits the observed data in the context of regression analysis. The uses of R-squared are either (i) forecasting or (ii) hypothesis testing. R-squared is the measurement of “goodness of fit.”
R-SQUARED
In a sample space called omega, we have a set of observations called the Y events. The baseline estimate of Y is the mean of Y. We are looking for a better way to predict Y; thus, a prediction function is introduced. Call that predictor function Y^ or Y-hat. Y-hat is given by:
(1) Y^ = b + b1X
… where b = intercept, b1 = slope, and X is the explanatory (independent) variable. The objective is to test whether Y^ is a better estimate than Y-bar, where Y-bar is simply the mean of the set Y: (y1, y2, …, yn) and is given by:
(2) Y* = 1/n(sum Yi)
… where Y* = Y-bar or mean of Y; n = number of observations; and Yi is the set of observations Y: (y1, y2, …,yn).
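The two competing predictors above, the least-squares line of equation (1) and the plain mean of equation (2), can be sketched in a few lines of Python. This is a minimal illustration with made-up data values; the slope and intercept are computed from the standard least-squares formulas b1 = Sxy/Sxx and b = Y-bar − b1·X-bar.

```python
# Sketch: least-squares fit of Y-hat = b + b1*X versus the mean-of-Y
# predictor of eq. (2). The data values below are made up for illustration.
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)          # Y-bar is the baseline predictor, eq. (2)

# Least-squares estimates for eq. (1): b1 = Sxy / Sxx, b = y_bar - b1 * x_bar
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)  # slope
b = y_bar - b1 * x_bar                   # intercept

y_hat = [b + b1 * xi for xi in x]        # predictions from eq. (1)
print(round(b1, 3), round(b, 3))
```

For this toy data the fitted line is approximately Y^ = 0.05 + 1.99X, and the question the rest of the answer addresses is how much better these y_hat values track y than the constant y_bar does.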
R-squared is the test that determines how good the predictor function is through “goodness of fit” analysis. R-squared is given by:
(3) R^2 = (SSyy – SSE) / SSyy
The equation is reduced to:
(4) R^2 = 1 – (SSE/SSyy)
The terms are defined as:
(5) SSE = Sum (y – y^)^2
… where y = individual observations of y, and y^ = the predicted y from the equation Y^ = b + b1X applied to the data set Y: (y1, y2, …, yn). SSE measures the deviation of the observations from the predicted values.
(6) SSyy = Sum (y – y*)^2
… where y = individual observations of y and y* = the mean of y in the data set Y: (y1, y2, …, yn). SSyy measures the total variability of y around its mean.
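Equations (4)–(6) translate directly into code. The sketch below uses made-up observations and fitted values; the point is only the mechanics of SSE, SSyy, and R².

```python
# Sketch: R^2 from SSE and SSyy as in eqs. (4)-(6). The fitted values
# y_hat are assumed to come from a line already estimated; data made up.
from statistics import mean

y     = [2.1, 3.9, 6.2, 7.8, 10.1]       # observations
y_hat = [2.04, 4.03, 6.02, 8.01, 10.00]  # predictions from Y^ = b + b1*X

y_bar = mean(y)
sse  = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # eq. (5)
ssyy = sum((yi - y_bar) ** 2 for yi in y)                # eq. (6)
r2 = 1 - sse / ssyy                                      # eq. (4)
print(round(r2, 4))
```

Because SSE is small relative to SSyy here, R² comes out near 1; an R² of 0.15, by contrast, means SSE is 85% of SSyy, i.e. the line removes only 15% of the variability around the mean.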
DOUBLE CHECK R-SQUARED
Is R^2 = 0.15 adequate? The answer to this question lies in the significance test of the correlation coefficient r.
(7) R^2 = r^2
… where r = b1(Sx / Sy); see the predictor equation Y^ = b + b1X. The test statistic for r is given by:
(8) t(r) = r(sqrt (n – 2)) / sqrt (1 – r^2)
… where the critical value t(infinite degrees of freedom, 0.95) = 1.64, the one-sided standard normal value.
With a known sample size and the observation series Y: (y1, y2, …, yn) and X: (x1, x2, …, xn), the standard deviation of X (Sx) and the standard deviation of Y (Sy) may be determined. If t(observed) < 1.64, the R^2 of 0.15 may be rejected as insignificant.
In this case, R^2 = 0.15. Is it significant? The table below illustrates the t-test for significance at the 0.95 level of confidence, with t(0.95) = 1.64:
………………………………………………………….
N      R^2    r      t(obs)   t(0.95)   Conclude
………………………………………………………….
30     0.15   0.39    2.22     1.64     Significant
100    0.15   0.39    4.16     1.64     Significant
500    0.15   0.39    9.37     1.64     Significant
1000   0.15   0.39   13.27     1.64     Significant
………………………………………………………….
Note that it is necessary to go from R^2 to r in order to test significance. R^2 alone cannot answer whether an R^2 of 0.15 is adequate to conclude that the model is good enough. In the tabulation above, for n = 30, 100, 500, 1000, the conclusion is that the model producing R^2 = 0.15 is statistically significant. Recall that the interval or range for r is between -1 and 1.
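The full tabulation can be regenerated from equations (7) and (8). This sketch loops over the four sample sizes used above; the column layout is illustrative.

```python
# Sketch: regenerate the t-test table for R^2 = 0.15 at several sample sizes,
# using r = sqrt(R^2) (eq. 7) and t = r*sqrt(n-2)/sqrt(1-r^2) (eq. 8).
import math

r2 = 0.15
r = math.sqrt(r2)
t_crit = 1.64                      # one-sided, large-sample critical value
for n in (30, 100, 500, 1000):
    t_obs = r * math.sqrt(n - 2) / math.sqrt(1 - r2)
    verdict = "Significant" if t_obs > t_crit else "Not significant"
    print(f"{n:5d}  {r2:.2f}  {r:.2f}  {t_obs:6.2f}  {verdict}")
```

Note how t(obs) grows with sqrt(n) even though r stays fixed at 0.39: larger samples make the same modest correlation ever more "significant" without making the fit any better.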
The interpretation may be counter-intuitive. We are looking for the best fit. When t(obs) exceeds t(0.95), the observed correlation lies outside the range expected under the null hypothesis of no relationship, so the model produces a “significant” result; significance, however, is not the same as best fit. In the present case, all four scenarios (n = 30, 100, 500, 1000) show that the predictor function produces a result that lies outside that range. Review your original null hypothesis and alternative hypothesis statements. (1) What was your decision rule for t(obs) < t(0.95) versus t(obs) > t(0.95)? (2) What was your intended use of R-squared: forecasting or hypothesis testing?
ATTACHED:
Excel file for the tabulation above is attached.
REFERENCES:
Draper, N. R.; Smith, H. (1998). Applied Regression Analysis. Wiley-Interscience. ISBN 0-471-17082-8.
Everitt, B. S. (2002). Cambridge Dictionary of Statistics (2nd ed.). Cambridge University Press. ISBN 0-521-81099-X.
Nagelkerke, N. J. D. (1992). Maximum Likelihood Estimation of Functional Relationships. Lecture Notes in Statistics 69. Springer. ISBN 0-387-97721-X.
Glantz, S. A.; Slinker, B. K. (1990). Primer of Applied Regression and Analysis of Variance. McGraw-Hill. ISBN 0-07-023407-8.