It can be seen that the multiple correlation coefficient is used in data analysis even in cases in which the underlying models are highly nonlinear. Should this kind of use of the multiple linear correlation coefficient R^2 continue?
Absolutely not. Correlation analysis assumes that the relationship being examined is linear in the parameters. If the relationship is known to be non-linear, then you must move beyond correlation techniques to non-linear methods, such as system dynamics.
I think not. Linear regression analysis is used so often that researchers tend to forget the limits of its applicability. It can be applied to a non-linear model only if the model can be linearised by suitable transformations, for example by elementary operations, logarithms, and so on.
I have observed that in their research, people often use the multiple correlation coefficient without checking whether linearity of the underlying model is statistically non-rejectable. Indeed, the same kind of use can be seen even for the simple correlation coefficient r^2.
For example, r^2 expressed as a percentage is taken to be the part of the variation that is mathematically explained, so that (1 - r^2), expressed as a percentage, is the coefficient of non-determination, attributed to random errors. I would like to request you to work out the following case.
Consider, for example, the equation Y = 1 + X + X^2. For X = 1, 2, 3, 4, 5, find the values of Y. Now use the usual formula for computing r^2. You will see that (1 - r^2) is not equal to zero here. In other words, some percentage of the variation in Y is attributed to random errors. But there is hardly any random error here! The equation is purely mathematical, leaving nothing to random errors. How, then, can the numerical value of the coefficient of non-determination be attributed to random errors here? Indeed, this is a non-linear model, and therefore the use of r^2 here is illogical.
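A minimal Python sketch of this calculation, using the X values 1 to 5 above:

```python
# Quick check of the worked example: Y = 1 + X + X^2 is purely deterministic,
# yet the ordinary (linear) r^2 comes out below 1.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = 1 + X + X**2                        # Y = 3, 7, 13, 21, 31 -- no random error at all

r = np.corrcoef(X, Y)[0, 1]             # ordinary Pearson correlation of X and Y
print(f"r^2     = {r**2:.4f}")          # approximately 0.97
print(f"1 - r^2 = {1 - r**2:.4f}")      # approximately 0.03, i.e. about 3% 'unexplained'
```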
Most of the current uses of the multiple correlation coefficient are of this type. In certain demographic applications, for instance, the underlying model is highly non-linear, yet people use this coefficient to report a coefficient of determination!
Dear Hemantha, try this step: set X1 = 1, X2 = X, X3 = X^2 (i.e., linearise the model) and then work with Y = b1*X1 + b2*X2 + b3*X3 as a linear model in your favourite program. Recompute r^2 using data with no error in the Y_i and X_ji, i = 1, 2, ..., n. If the problem remains, it is because 1, x, x^2 can be nearly collinear; look at their graphs for x in [0, 1]. Of course it depends on your arithmetic precision; I did it with 32-digit precision in Maple and obtained a residual sum of squares of about 10^(-61).
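A rough Python equivalent of this check (the Maple run itself is not reproduced here):

```python
# Sketch of the linearisation suggested above: regress Y on X1 = 1, X2 = X, X3 = X^2;
# because the linearised model reproduces Y exactly, the multiple R^2 is 1.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = 1 + X + X**2

D = np.column_stack([np.ones_like(X), X, X**2])   # design matrix [X1, X2, X3]
b, *_ = np.linalg.lstsq(D, Y, rcond=None)         # OLS estimates b1, b2, b3
Y_hat = D @ b

ss_res = np.sum((Y - Y_hat)**2)
ss_tot = np.sum((Y - Y.mean())**2)
print("coefficients:", b)                         # approximately [1, 1, 1]
print("R^2 =", 1 - ss_res / ss_tot)               # 1.0 up to floating-point error
```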
Dear Hemantha, the problem you have mentioned is general polynomial regression, and it is often used in applications. When the number of powers of X increases, for example if you have Y = 1 + X + X^2 + ... + X^10, then you run into multicollinearity problems, because the higher powers are correlated with one another, with correlations close to 1 (see the figure). The fact that Y = Y(X) does not prohibit us from building a linear model, but the OLS requirements are not satisfied 100%. The same procedure can be applied when we have exponential-like variables, after taking their logarithms appropriately.
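A quick numerical illustration of that multicollinearity (an assumed grid on [0, 1], not data from any particular study):

```python
# Neighbouring powers of x are almost perfectly correlated on [0, 1].
import numpy as np

x = np.linspace(0.01, 1.0, 100)
powers = np.column_stack([x**k for k in range(1, 11)])   # x^1 ... x^10
corr = np.corrcoef(powers, rowvar=False)

for k in (2, 5, 9):
    # correlation between x^k and x^(k+1); all of these come out close to 1
    print(f"corr(x^{k}, x^{k+1}) = {corr[k - 1, k]:.3f}")
```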
The concept of linearity applies to the parameter space (the coefficients being estimated), not to the data space. Therefore Y = beta_0 + beta_1*X + beta_2*X^2 + u is a linear model.
Yes, of course linearity is defined with respect to the parameters. However, in the example I cited there is no random error in the dependent variable, and yet, since the coefficient of non-determination is not zero, a certain part of the variation would have to be attributed to errors. That is what I wanted to point out.
Indeed, if we consider various values of (X, Y) where, for example, Y = 1 + X, it will be seen that r^2 is exactly equal to 1, which is why the coefficient of non-determination is zero in this case.
In other words, it is the presence of X^2 in the model that leads to this confusion.
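A one-line check of that point:

```python
# With Y = 1 + X the ordinary r^2 is exactly 1 (no 'non-determination' left over).
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = 1 + X
r = np.corrcoef(X, Y)[0, 1]
print(r**2)   # 1.0, up to floating-point rounding
```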
In the textbooks, the quartet is not mentioned. So those who use the concept of R^2, in demography for example, simply say that the coefficient of non-determination is attributed to 'errors'. That is the point I raised.
No, I am not talking about books. Researchers mostly write in their articles that R^2 is high enough and that their model is therefore statistically acceptable. But the models themselves are non-linear.
R^2 is SSR (the regression sum of squares) divided by SSY (the total sum of squares of the dependent variable), no matter what type of model is used.
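A small sketch of that definition (with made-up data): for OLS with an intercept, SSR/SSY and 1 - SSE/SSY coincide, which is the identity usually relied upon; for non-linear fits the decomposition SSY = SSR + SSE is no longer guaranteed.

```python
# R^2 computed both ways for an ordinary linear fit with intercept (simulated data).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2 + 3 * x + rng.normal(scale=0.2, size=x.size)   # made-up data

D = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(D, y, rcond=None)
y_hat = D @ b

ssy = np.sum((y - y.mean())**2)          # total sum of squares
ssr = np.sum((y_hat - y.mean())**2)      # regression sum of squares
sse = np.sum((y - y_hat)**2)             # error sum of squares

print("SSR/SSY     =", ssr / ssy)
print("1 - SSE/SSY =", 1 - sse / ssy)    # identical for linear OLS with an intercept
```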
Here is an example of an exponential model that I fitted using SAS:
Y=e+a*exp(b*X) with 3 parameters (a, b and e).
[SAS analysis-of-variance table: Source, DF, Sum of Squares, Mean Square, F Value, Approx Pr > F]
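A rough Python analogue of such a fit (with simulated data, not the SAS data set above), reporting an R^2-like quantity alongside the parameter estimates:

```python
# Fit Y = e + a*exp(b*X) by non-linear least squares and compute 1 - SSE/SSY.
import numpy as np
from scipy.optimize import curve_fit

def model(x, e, a, b):
    return e + a * np.exp(b * x)

rng = np.random.default_rng(1)
x = np.linspace(0, 2, 40)
y = model(x, 1.0, 0.5, 1.2) + rng.normal(scale=0.3, size=x.size)   # made-up data

params, _ = curve_fit(model, x, y, p0=[1.0, 1.0, 1.0])
y_hat = model(x, *params)

sse = np.sum((y - y_hat)**2)
ssy = np.sum((y - y.mean())**2)
print("estimates (e, a, b):", params)
print("'R^2' = 1 - SSE/SSY =", 1 - sse / ssy)   # an R^2-like statistic, not the OLS R^2
```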
Ehsan has specifically mentioned that 91% is the 'explained variation'. In your letter you said that R^2 reflects the percentage of linear relationship. According to Ehsan, it is simply the 'explained variation', which means that the remaining 9% is the 'unexplained variation'. In other words, that part is due to observational errors.
This is the point I raised. Can you see now how users use R^2?
When I was talking about a "linear relationship", I used a simple linear regression model y = a + b*x + u and could infer from R² < 1 that there is no perfect linear relationship between y and x (in this simple case R² = cor(x, y)²).
This interpretation does not hold if I extend the model by polynomial factors of x.
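A small sketch of that distinction (simulated data):

```python
# In simple linear regression R^2 equals cor(x, y)^2; once a polynomial term is
# added, R^2 is no longer the squared correlation between x and y.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 2, 60)
y = 1 + x + x**2 + rng.normal(scale=0.3, size=x.size)   # made-up data

def ols_r2(D, y):
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    y_hat = D @ b
    return 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)

r2_simple = ols_r2(np.column_stack([np.ones_like(x), x]), y)
r2_poly   = ols_r2(np.column_stack([np.ones_like(x), x, x**2]), y)

print("cor(x, y)^2      =", np.corrcoef(x, y)[0, 1]**2)   # equals r2_simple
print("R^2, y ~ x       =", r2_simple)
print("R^2, y ~ x + x^2 =", r2_poly)                      # larger, and no longer cor(x, y)^2
```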
And I think there is nothing wrong with Ehsan's argument, so what is your point?
The simple linear regression model is just a special case of the multiple linear regression model. There is no logic in saying that what is true for a multiple linear regression model will not hold for a simple linear regression model.
Absolutely not. Correlation analysis assumes that the relationship being examined is linear in the parameters. If the relationship is known to be non-linear, then you must move beyond correlation techniques to non-linear methods, such as system dynamics.
Indeed, that is what I was trying to say. Correlation analysis should be carried out only after it has been found statistically non-rejectable that the underlying relationship is linear in the parameters. In other words, regression analysis comes into the picture first, before we can move on to correlation analysis.
Incidentally, most users start by assuming highly non-linear models. Then they conclude that, because the coefficient of determination is high enough, their models are statistically 'acceptable'.
One more comment: linearity is always the first-degree approximation in science. For example, Einstein's theory 'added' more terms to the power series of the kinetic energy:
Let us see what the difference is between Newtonian Mechanics (NM) and Special Relativity (SR). In SR the kinetic energy of a particle with mass m and velocity v is E = m*c^2*(1/sqrt(1 - v^2/c^2) - 1) = (1/2)*m*v^2 + (3/8)*m*v^4/c^2 + ..., whose first term is the Newtonian kinetic energy.
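A quick symbolic check of that expansion (a sketch using sympy):

```python
# Expanding the relativistic kinetic energy in powers of v recovers (1/2)*m*v^2
# as the leading (Newtonian) term, plus the higher-order corrections.
import sympy as sp

m, v, c = sp.symbols("m v c", positive=True)
E_kin = m * c**2 * (1 / sp.sqrt(1 - v**2 / c**2) - 1)

series = sp.series(E_kin, v, 0, 7).removeO()
print(sp.simplify(series))   # m*v**2/2 + 3*m*v**4/(8*c**2) + 5*m*v**6/(16*c**4)
```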
R^2 is a measure of the linear relationship between the actual data and the model of the data. The higher R^2 is, the better the fit of model to data. That's true even if the model is non-linear in its arguments.
However, R^2 by itself is seldom the measure you need to decide whether a fit is good enough, even in the case of linear relationships. What you need are measures of the lack of fit, such as the mean square error, the largest squared error, the mean absolute error and so on, compared to the accuracy needed to make reasonable forecasts, between-condition inferences and the like.
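A minimal sketch of those lack-of-fit measures, computed from hypothetical observed values y and model predictions y_hat:

```python
# Mean square error, largest squared error and mean absolute error for a fit.
import numpy as np

y     = np.array([10.2, 11.5, 12.1, 13.8, 15.0])   # made-up observations
y_hat = np.array([10.0, 11.8, 12.0, 14.1, 14.6])   # made-up model predictions

err = y - y_hat
print("mean square error    :", np.mean(err**2))
print("largest squared error:", np.max(err**2))
print("mean absolute error  :", np.mean(np.abs(err)))
```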
For example, here in California the Independent System Operator (www.caiso.com, link on "supply and demand") has a model that is used to predict electricity demand on the grid; it makes day-ahead and hour-ahead forecasts. The important point is not that the model has a high R^2, but that the day-ahead and hour-ahead errors are consistently small compared to the variation across the day and to the margin (generating capacity minus actual load).
In some fields, a small R^2 may represent a substantial improvement over what was known before; in others, like the electricity example, the important point is that the error is always small compared to what is necessary to manage the grid properly.
In short, R^2 is not really worse in the non-linear case than in the linear case.
If the theory indicates that a relationship is non-linear it is inappropriate to use OLS regression or correlation. Variance explained is only one measure of fit and can be misleading.
I do think that, before going directly to modelling, there are simple graphical checks for linearity in both the dependent and the explanatory variables. If one wants a linear solution with parameters expressed in the original metrics, there are several ways to obtain it. There are also built-in computer algorithms to test collinearity and the other requirements of multiple regression, and to eliminate explanatory variables that do not contribute significantly to the explained variance in the dependent variable. Otherwise, there are, of course, non-linear options.
The R-squared in the case of nonlinear models, where we use maximum likelihood estimation rather than OLS, is called pseudo R-squared. A very good example for the case of logit regression can be found here:
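As a rough illustration (simulated data, not taken from that article), a McFadden-type pseudo-R-squared for a logit fit can be obtained like this:

```python
# McFadden's pseudo-R^2 = 1 - loglik(model)/loglik(null), via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=200)
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))     # true logistic probabilities (assumed)
y = rng.binomial(1, p)

X = sm.add_constant(x)
res = sm.Logit(y, X).fit(disp=0)
print("McFadden pseudo-R^2:", res.prsquared)
```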
The relation between a dependent variable and one or more explanatory variables can be analyzed with regression analysis without problems or limits. The explanatory variables are the (economic) variables that are thought to affect the value of the dependent variable. In the simple linear regression model the dependent variable is related to only one explanatory variable. Therefore, personally, I do not see any problem in using linear regression with a non-linear model.
R^2 captures only the correlation between (nearly) collinear vectors; it misses non-linear dependencies, as well as dependencies between perpendicular vectors.
A good measure in such cases would be a generalized correlation coefficient based on the information-theoretic measure of mutual information. Such estimates have been shown to be sensitive to all statistical dependencies, linear and non-linear, including any relative spatial orientation.
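A hedged sketch of that idea: a dependence that the ordinary correlation misses (y = x^2 with x symmetric about 0) is still detected by a mutual-information estimate (here scikit-learn's k-nearest-neighbour estimator):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 1000)
y = x**2 + rng.normal(scale=0.05, size=x.size)   # purely non-linear dependence

print("Pearson r          :", np.corrcoef(x, y)[0, 1])                         # close to 0
print("mutual information :", mutual_info_regression(x.reshape(-1, 1), y)[0])  # clearly > 0
```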
As David Gillespie pointed out, correlation analysis assumes a linear relationship between the explained variable and the explanatory variables. I recently had an industrial Ph.D. student who compared non-linear statistical fitting methods, neural networks and homotopies for modelling a non-linear physical phenomenon. Statistical techniques suffer from the curse of dimensionality. I would strongly recommend trying interval-valued homotopies (continuous deformations of functions). In my student's Ph.D. thesis, the power consumed by a cell phone was modelled with an uncertainty between 2.5% and 3% in a five-dimensional space (4 explanatory variables and one explained variable).
I appreciate your concern about R^2 in non-linear solutions. I also appreciate the responses by other scholars; however, I would refer you to the response of David Gillespie. Even if you are bent upon working out R^2, you can still obtain a robust value of it, because you have at hand both the predicted and the original dependent variable. Alternatively, you may calculate a chi-square between the original and predicted values; by itself it will only give you the level of significance of the non-linear fit. But chi-square has another property: it can be converted into Phi, which gives not only the significance level but also the strength of the relationship between the two variables in question. If you treat Phi as a correlation, its square will give you R^2. Test it. Best of luck. Regards,
@Mohammad, it is a matter of definition, I think. If we call it R^2 in the nonlinear case, we are wrong. If we compute something similar to R^2 (see the excellent linked article), call it pseudo-R^2, AND compare like with like, we are OK.
In many problems, the domain of interest (or greatest impact) is often not the entire function, but rather a portion. In such cases, a piecewise linear approximation can yield sufficiently accurate answers. One example in pattern recognition is to have an accurate model for the region of overlap between two classes. If the entire probability density functions are nonlinear, but we are only interested in the overlap regions (the other regions are trivial for pattern recognition purposes), then we can apply a linear or piecewise linear fit to the "tails" of these functions to any prescribed accuracy. Having these linear models can then allow us to have reliable estimates of the classification error rates, given the data or features of the problem. This concept can be applied to any number of dimensions, provided there are enough data in the overlap region.
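A small sketch of that idea, approximating the tail of a Gaussian class density over an assumed overlap region with a piecewise linear interpolant; adding knots drives the error down towards any prescribed level:

```python
import numpy as np
from scipy.stats import norm

x_fine = np.linspace(1.0, 2.0, 500)             # assumed overlap region
p = norm.pdf(x_fine, loc=0.0, scale=1.0)        # right tail of the class-1 density

for n_knots in (3, 5, 9):
    knots = np.linspace(1.0, 2.0, n_knots)
    p_pl = np.interp(x_fine, knots, norm.pdf(knots, loc=0.0, scale=1.0))
    print(f"{n_knots} knots: max abs error = {np.max(np.abs(p - p_pl)):.5f}")
```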
All statistical techniques including polynomial regression or polynomial interpolation suffer from the curse of dimensionality (as explained in the book by C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006, pp. 33–38). Thus, it is overoptimistic to consider that any statistical concept can be applied to any number of dimensions, since the size of the training set will explode exponentially in the number of dimensions. It is also overoptimistic to state that a statistical technique has no limitations. Homotopies constructed as cobordisms can gradually model a non-linear phenomenon in higher and higher dimensional spaces (one dimension at a time). Their only assumption is that the physical phenomena are continuous in the intervals of each input physical value. There is a substantial advantage in terms of modelling accuracy in using interval valued homotopies that control the uncertainty with respect to polynomial interpolation or regression or neural networks or any generalized least square technique. Even when a neural network is fed with the homotopies constructed as cobordisms, the accuracy of the neural network modeling does not improve (see attached Ph.D. thesis).
If the windowed correlation of x and y changes over a wide range as x and y vary, then there may be a non-linear relationship between x and y. So the answer is "not always"; to understand how, the variation in the windowed correlation should be taken into consideration. In some cases, for limited ranges of variation, the underlying non-linearity has no effective influence on the analysis.
Windowing of the data can be designed along any of the data dimensions; e.g., for a two-dimensional time-series data set {x(t), y(t)}, we may consider an 11-sample window and compute the correlation between {x(t-5), ..., x(t+5)} and {y(t-5), ..., y(t+5)}, leading to r(t) as the windowed correlation.
Similarly, the samples may be selected within a certain range of variation of x and y, such as {x | x0 - Dx < x < x0 + Dx} and {y | y0 - Dy < y < y0 + Dy}. The correlation between these two data sets is also a windowed correlation.
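A minimal sketch of such a windowed correlation (an 11-sample rolling window over a simulated pair of series, using pandas):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
t = np.arange(0, 20, 0.05)
x = np.sin(t)
y = x**2 + 0.05 * rng.normal(size=t.size)       # a non-linear link between x and y

df = pd.DataFrame({"x": x, "y": y})
r_t = df["x"].rolling(window=11, center=True).corr(df["y"])   # windowed correlation r(t)

# For a non-linear relationship r(t) swings over a wide range instead of staying flat.
print("min r(t):", np.nanmin(r_t), " max r(t):", np.nanmax(r_t))
```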
If the road to the top of the hill is serpentine, why would anyone use a straight laser path (the hypotenuse of a right-angled triangle) to measure the total distance travelled on the road? It may be the simpler thing to do, but it would force a linear structure onto an intrinsically non-linear lattice.
It would however be interesting to see whether there are any papers that quantify the modelling accuracy gained by switching to a non-linear estimation process. Maybe it is not enough to warrant switching away from a linear multiple R-squared regime, even though that regime is not theoretically justified. A great question ... I deal with this often but had not thought about it. Cheers!