Regression, whether simple or multiple linear, may exhibit multicollinearity problems. It is known that whenever multicollinearity arises, one has to find ways of resolving it. How do we link multicollinearity and standard error values in a model?
Hi, good question. I hope the following answer is comprehensive and helpful.
Multicollinearity occurs when two or more predictor variables in a multiple regression are highly correlated (some textbooks suggest r > .85), meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this case, the problem is that the highly correlated predictors are carrying essentially the same information, so one of them should usually be dropped from the model.
You can assess multicollinearity by examining tolerance and the Variance Inflation Factor (VIF), two collinearity diagnostics. Tolerance is a measure of collinearity reported by most statistical programs such as SPSS; a variable's tolerance is 1 − R², where R² comes from regressing that variable on the other predictors. A small tolerance value indicates that the variable under consideration is almost a perfect linear combination of the independent variables already in the equation and that it should not be added to the regression equation. All variables involved in the linear relationship will have a small tolerance. Some suggest that a tolerance value less than 0.1 should be investigated further. If a low tolerance value is accompanied by large standard errors and nonsignificance, multicollinearity may be an issue.
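As a minimal sketch of these diagnostics (the simulated data, seed, and variable names x1, x2, x3 below are my own illustrative assumptions, not part of the answer above), tolerance and VIF can be computed with statsmodels:

```python
# Sketch: computing VIF and tolerance for each predictor on illustrative data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # deliberately near-collinear with x1
x3 = rng.normal(size=n)                    # unrelated predictor
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each predictor (skip the constant in column 0); tolerance = 1 / VIF.
for j, name in enumerate(["x1", "x2", "x3"], start=1):
    vif = variance_inflation_factor(X, j)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1.0 / vif:.3f}")
```

Here x2 is built to be nearly a copy of x1, so its tolerance should come out well below 0.1, while x3, which is unrelated, should sit near 1.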
The standard error of the regression represents the average distance that the observed values fall from the regression line; the farther the points fall from the line, the larger the error in the regression model.
Standard errors of the coefficients are also indicators of multicollinearity: a collinear system will have large standard errors, which make the individual variables nonsignificant.
The link is through the VIF - the variance inflation factor.
The VIF tells you how much the variance of the coefficient estimate is being inflated by collinearity. If the VIF for a variable is 16, the associated standard error is four times as large as it would be if its VIF were 1. In such a case, the coefficient would have to be 4 times as large to be statistically significant at a given significance level.
The VIF can be conceived as related to the R-squared of a particular predictor variable regressed on all the other included predictor variables:
VIF of X1 = 1/(1 - R-squared of X1 on all other Xs).
If you only have one X, or that X is orthogonal to all the other Xs, then
VIF = 1/(1 - 0) = 1 - so there is no variance inflation.
If two Xs are perfectly correlated, then
VIF = 1/(1 - 1) = 1/0 = infinity - that is, the estimate is as imprecise as it can be.
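A minimal sketch of that formula in practice (the simulated data and variable names are assumptions for illustration only): regress one predictor on the others and plug its R-squared into 1/(1 - R-squared).

```python
# Sketch: VIF of x1 obtained from the R-squared of x1 regressed on the other Xs.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.8 * x2 + 0.6 * rng.normal(size=n)    # x1 is partly predictable from x2

aux = sm.OLS(x1, sm.add_constant(np.column_stack([x2, x3]))).fit()
vif_x1 = 1.0 / (1.0 - aux.rsquared)
print(f"R-squared of x1 on other Xs: {aux.rsquared:.3f}  ->  VIF = {vif_x1:.2f}")
```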
The VIF is efficiently calculated not by running a series of regressions but as the diagonal elements of the inverse of the correlation matrix of the predictors.
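A short sketch of that computation on simulated predictors (the data here are illustrative, not from any answer in this thread): the VIFs appear on the diagonal of the inverse of the predictor correlation matrix.

```python
# Sketch: VIFs read off the diagonal of the inverse correlation matrix of the predictors.
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)    # collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)    # correlation matrix of the predictors
vifs = np.diag(np.linalg.inv(R))    # diagonal of its inverse = the VIFs
print(np.round(vifs, 2))
```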
For some guidance on how big a VIF has to be before it becomes debilitating, see:
As a follow-up to Kelvyn's excellent summary, I'd just add that the quantity 1 - (R-squared of X1 on all other Xs) is the tolerance - or equivalently, tolerance = 1/VIF.
The linked blog post may also be useful.
Also, more generally, VIF > 1 indicates some collinearity or multicollinearity and always implies some loss of statistical power relative to a situation where VIF = 1. This loss may be negligible in some cases, or irrelevant if you have a sufficiently large data set (or ignorable if it relates to an inference you have no interest in), but it is still present. Most textbooks (wrongly, in my view) downplay the loss of power when collinearity is not severe, or occasionally don't mention power concerns at all.
EDIT: corrected my misspelling of Kelvyn's name ...
Multicollinearity is an indication of a "potential" problem of imprecise estimates, because the variance is inflated by a high degree of correlation among the regressors; it is not a "conclusion" that the problem exists. The ultimate judgment is how large the final standard error is after the variance inflation.

Suppose the VIF for a coefficient estimate is as large as 10,000, but the standard error "before inflation" would be 0.0001. Then the "post-inflation" standard error is 0.0001*sqrt(10,000) = 0.0001*100 = 0.01, still a very small number. And if the coefficient estimate is 10, the t-value is coefficient/SE = 10/0.01 = 1000, an extremely large value indicating an extremely precise estimate. In this case multicollinearity is NOT a problem whatsoever. Many are confused and misled by the presence of multicollinearity per se without looking into its true impact. If the impact is small, then it is not a problem regardless of how "severe" the multicollinearity is.
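For concreteness, a tiny sketch of the arithmetic in that example (the numbers are the ones quoted above, not from any real data set):

```python
# Sketch: the SE-inflation arithmetic from the example above.
import math

vif = 10_000        # variance inflation factor
se_base = 0.0001    # standard error "before inflation"
coef = 10           # coefficient estimate

se_inflated = se_base * math.sqrt(vif)   # the SE is inflated by sqrt(VIF)
t_value = coef / se_inflated
print(f"post-inflation SE = {se_inflated}, t = {t_value:.0f}")
# -> post-inflation SE = 0.01, t = 1000: still precise despite extreme multicollinearity
```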
In the case of a "wrong sign" brought about by multicollinearity, congratulations, because it is a blessing rather than a problem. The "wrong sign" is useful information which you can embed in the estimation to yield a good corrected estimate. See my paper "A Simple Way to Deal with Multicollinearity", Journal of Applied Statistics.
There is nothing called "single regression"; we may have simple or multiple regression. To detect multicollinearity in MLR, compare the F-test for the full model with the t-tests for the partial regression coefficients, because multicollinearity leads to contradictory results (the global F-test is significant but the partial t-tests are not, and vice versa).
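A hedged illustration of that symptom on simulated data (the data-generating setup below is my own assumption, chosen only so that x1 and x2 are nearly collinear): the global F-test is highly significant while the partial t-tests usually are not.

```python
# Sketch: significant overall F-test but individually nonsignificant coefficients,
# because x1 and x2 are almost copies of each other.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)            # almost a copy of x1
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(f"global F p-value: {fit.f_pvalue:.2e}")
print("partial t-test p-values:", np.round(fit.pvalues[1:], 3))
```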
If SE is the standard deviation of the sampling distribution of an estimator, then the SE has nothing to do with multicollinearity.
If SE refers instead to the mean squared error (MSE), the residual sum of squares (RSS), or even the standard error of the regression (SER), we all know that each of those measures is actually affected by multicollinearity. [English is not my first language, so I borrowed these definitions from https://en.wikipedia.org/wiki/Ordinary_least_squares#Estimation.]
A quick graphical way to assess multicollinearity in an MLR Y ~ int + X_1 + X_2 + ... + X_p is to plot the estimated marginal correlations between Y and X_i, corr(Y, X_i), against the estimated semi-partial correlations (SPCCs), corr(Y, X_i | X_1, ..., X_{i-1}). If there is no multicollinearity, the points will fall on a visual 45-degree line between the marginal correlations and the SPCCs. If multicollinearity is present, the line will be distorted in some way.
SPCCs are similar to partial correlations, except that they are not divided by an extra term.
See Huber (1981) for the tri-variate SPCC and Madar (2015) for the multivariate SPCC; I think the SPCC is discussed in other places as well.
*Huber (1981). Partial and Semipartial Correlation - A Vector Approach. The Two-Year College Mathematics Journal, Vol. 12, pp. 151-153.
*Madar (2015). Direct formulation to Cholesky decomposition of a general
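A rough sketch of that graphical check (the simulated data and the sequential adjustment order below are assumptions for illustration; this is not code from the cited papers):

```python
# Sketch: marginal correlations corr(Y, X_i) versus sequential semi-partial
# correlations corr(Y, X_i adjusted for X_1..X_{i-1}).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 3))
X[:, 1] += 0.8 * X[:, 0]                       # introduce collinearity
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=n)

marginal, semipartial = [], []
for i in range(X.shape[1]):
    xi = X[:, i]
    marginal.append(np.corrcoef(y, xi)[0, 1])
    if i == 0:
        resid = xi - xi.mean()                 # nothing to adjust for
    else:
        Z = np.column_stack([np.ones(n), X[:, :i]])
        beta, *_ = np.linalg.lstsq(Z, xi, rcond=None)
        resid = xi - Z @ beta                  # xi adjusted for X_1..X_{i-1}
    semipartial.append(np.corrcoef(y, resid)[0, 1])

plt.scatter(marginal, semipartial)
plt.plot([-1, 1], [-1, 1], linestyle="--")     # 45-degree reference line
plt.xlabel("marginal corr(Y, X_i)")
plt.ylabel("semi-partial corr")
plt.show()
```

With no collinearity the points should hug the dashed 45-degree line; the predictor made collinear here should fall visibly off it.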
I disagree. The standard error (that is, the standard deviation of the sampling distribution) of the estimate of a partial regression coefficient (that is, in a regression with more than one predictor) is highly affected by collinearity.
There are many versions of the relevant formula; here is a clear one:
https://www3.nd.edu/~rwilliam/stats1/x91.pdf
And the intuition is very clear too: the partial regression coefficient is an estimate net of the other variables; consequently, the more highly correlated the predictor variables are, the more difficult it is to determine how much variation in Y each separate predictor is responsible for. There is imprecision in estimating this partial effect, so the standard error becomes large.
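A small sketch that makes this concrete (the simulated data and variable names are my own assumptions): the standard error of a partial slope reconstructed from the MSE, the predictor's sum of squares, and the auxiliary R-squared matches the one reported by statsmodels, which is where the 1/(1 - R_j^2) inflation enters.

```python
# Sketch: SE(b1) = sqrt( MSE / (SS_x1 * (1 - R_1^2)) ), so it grows with the VIF.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + 0.7 * rng.normal(size=n)       # correlated with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Rebuild SE(b1) from the formula; the auxiliary R^2 of x1 on x2 supplies the VIF part.
aux = sm.OLS(x1, sm.add_constant(x2)).fit()
ss_x1 = np.sum((x1 - x1.mean()) ** 2)
se_b1 = np.sqrt(fit.mse_resid / (ss_x1 * (1.0 - aux.rsquared)))

print(f"statsmodels SE(b1): {fit.bse[1]:.4f}   formula SE(b1): {se_b1:.4f}")
```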
Thanks for clarifying what the SE is, since SE could mean several different things, such as the standard error of the regression estimates (betas), or even the standard error estimate for the standard deviation of each of the actual variables in the model.
What I meant was that the standard error of the dependent Y variable and the standard errors of each of the independent variables do not cause any multicollinearity.
However, if you are talking about SE(beta), the standard error of the regression estimate (beta), that is a different case. The regression estimate also reflects some of the collinearity, and its SE can certainly be affected by the collinearity as well. We all know that SE(beta) is a function of the MSE (the mean squared error).
The standard errors of the estimates are directly related to the level of multicollinearity. A large standard error is a very good indicator of the presence of multicollinearity, especially if none of the estimates is significant in explaining the variation in the dependent variable. The multicollinearity is between the independent variables.
Whatever SE causes a p-value > 0.05 is conventionally considered too large. In large samples, such as millions of observations, however, an SE causing a p-value > 0.001 can be viewed as too large. Multicollinearity is not measured by the SE (although they are correlated) but by the VIF or the Condition Index. However, keep in mind that a large VIF or Condition Index only indicates the presence of multicollinearity; multicollinearity (MC) does not necessarily indicate a problem. MC only means the standard error will be inflated; it does not mean the inflated SE is too large. The p-value tells you whether an SE is too large. So even in the presence of MC, the regression can still be precise if the "inflated" SE is still small enough.
Hi there, I just wanted to know whether there is any connection between the mean squared error and multicollinearity, or whether multicollinearity affects the mean squared error?