After multicollinearity is detected, remedial measures include removing the problem variable(s) or transforming the variables to reduce multicollinearity. What kind of transformation can we apply to correct the problem?
This sounds like a homework problem, but I'll bite.
The answer is that it depends on the situation. One of the canonical examples is adding a quadratic term to a linear regression model. Consider, for instance, predicting y from x and x^2, where x takes the integer values 1 to 10. In R code:
> x = 1:10
> cor(x,x^2)
[1] 0.9745586
The correlation between x and x^2 is extremely high. But we can "center" our x variable first, by subtracting off the mean. This most often doesn't affect any inferences we'd like to draw, but it does reduce the correlation substantially:
> x = 1:10 - mean(1:10)
> cor(x,x^2)
[1] 0
To 0, in fact :)
We can also take two variables that are highly correlated and reparametrize our model a bit. Consider predicting something from height and weight. These will be highly correlated. We might, however, decide to create two alternative predictors: the *sum* of weight and height, and the *difference* between weight and height. These two new variables will not be as highly correlated as height and weight were, but together they contain the same information as the original two (in fact, the two new variables are just linear transformations of the old ones). This will of course change the interpretation of the slopes, so be careful of that. Here's a demonstration in R:
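A minimal sketch of such a demonstration, using simulated height and weight values (the simulation below is an illustrative assumption, not data from the original post):

set.seed(1)
height <- rnorm(100, mean = 170, sd = 10)         # simulated heights in cm
weight <- 0.6 * height - 35 + rnorm(100, sd = 4)  # simulated weights in kg, correlated with height
cor(height, weight)                               # high correlation

# Put the predictors on a common scale, then reparametrize
h <- as.numeric(scale(height))
w <- as.numeric(scale(weight))
size      <- h + w    # overall "size"
stoutness <- w - h    # heavy-for-height
cor(size, stoutness)  # essentially zero

# Same fitted values, but differently interpreted slopes
y <- 50 + 0.3 * height + 0.5 * weight + rnorm(100, sd = 3)  # made-up response
summary(lm(y ~ h + w))
summary(lm(y ~ size + stoutness))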
"This will of course change the interpretation of the slopes, so be careful of that.". Sir if it is going to change the slopes what could be the use of transformations (may be layman question. But I have to find out standardized partial regressions based on the correlations I get after transformation).
"Sir if it is going to change the slopes what could be the use of transformations"
In the height and weight example, we've transformed the two highly correlated variables into two new variables that represent general size (the sum) and "stoutness" (the difference). If someone is very heavy for their height, the difference will be large; likewise, if someone is very light for their height, the difference will be small. Instead of having individual slopes for height and weight, which share most of the same information (so our regression is unstable, and the two slopes are not that interesting), we have "size" and "stoutness"; the transformation gives us a new perspective on the data.
I understand that multicollinearity is the problem in a multiple regression analysis when the independent/explanatory variables are correlated. If I recall correctly, you already know its solution, i.e., Principal Components Analysis of the independent/explanatory variables.
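For what it's worth, a minimal sketch of such a PCA in R, where X is a hypothetical matrix or data frame holding only the explanatory variables:

pca <- prcomp(X, center = TRUE, scale. = TRUE)  # centred and standardised PCA
summary(pca)      # proportion of variance explained by each component
pca$rotation      # loadings: which variables load on which component
scores <- pca$x   # uncorrelated component scores, usable as new predictors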
I did PCA, which gave me three principal components: 5 variables loading on the first, 3 on the second and 2 on the third principal component. Now I want to do independent path analysis of these characters based on the principal components. The 5 variables have multicollinearity.
I can't use factor scores or the new variables created through PCA for the path analysis I have to do; rather, the actual variables will have to be used. Dr Richard's advice seems suitable, but it also says that it may change the slopes. On the other hand, path analysis is based on standardized regression slopes.
I doubt that PCA is not workable in your case, since it is the most widely used remedy for multicollinearity. However, another solution is Ridge Regression.
In using ridge regression, a value of k is chosen such that the reduction in the variance term is greater than the increase in the squared bias. If this can be done, the mean square error of the ridge estimator will be less than the variance of the least-squares estimator.
Hoerl and Kennard (1976) have suggested that an appropriate value of k may be determined by inspection of the ridge trace. The ridge trace is a plot of the elements of ridge regression coefficients versus k for values of k usually in the interval 0 – 1. If the multicollinearity is severe, the instability in the regression coefficients will be obvious from the ridge trace. As k is increased, some of the ridge estimates will vary dramatically. At some value of k, the ridge estimates of regression coefficients will stabilize. The objective is to select a reasonably small value of k at which the ridge estimates of regression coefficients are stable. Generally this will produce a set of estimates with smaller MSE than the least-squares estimates.
Several authors have proposed procedures for choosing the value of k. Hoerl, Kennard, and Baldwin (1975) suggested that an appropriate choice is k = p * s^2 / (b'b), where p is the number of explanatory variables, s^2 is the residual mean square, and b is the vector of regression coefficients, all taken from the least-squares solution.
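As an illustration only, a ridge trace and the Hoerl-Kennard-Baldwin choice of k can be inspected in R with the MASS package; the built-in longley data set is used here purely because it is known for severe multicollinearity, not because it resembles the sugarcane data:

library(MASS)  # provides lm.ridge()

# Fit ridge regression over a grid of k (called lambda in lm.ridge)
fit <- lm.ridge(Employed ~ ., data = longley, lambda = seq(0, 0.1, by = 0.001))

plot(fit)    # ridge trace: coefficient paths versus k
select(fit)  # reports the HKB estimate of k, among other choices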
Thank you for your detailed answer. I tried the ridge trace with SAS, although I couldn't get a clue from the plot of the variables. What I understand from your advice is that we have to depend on k, which again refers to the number of variables to retain, while my aim is to retain the variables I have and reduce the multicollinearity among them. I tried mean-centering the data as well, but it didn't work. However, there must be some way of doing it.
Either I misunderstood or you have not made your problem very clear. To the best of my knowledge, the endogenous (dependent) variables may be correlated even if the independent variables are uncorrelated, since each endogenous variable is explained by one or more variables in the model (whether exogenous or endogenous). This is my understanding; please get it confirmed.
OK, Sir. Let me explain once again. I had 16 variables. Through PCA I reduced the number to 11. Now I have to do path analysis. However, some of the variables have high or near-perfect correlations, i.e. they have multicollinearity. The question is how to correct this multicollinearity. One way is to remove one variable of each pair having multicollinearity; the other way is to transform the variables. I want to know which transformation to apply to get the multicollinearity corrected.
It is still confusing. In the previous response you reported, "I did PCA, which gave me three principal components: 5 variables loading on the first, 3 on the second and 2 on the third principal component. Now I want to do independent path analysis of these characters based on the principal components. The 5 variables have multicollinearity."
Now you say that you reduced 16 variables to 11. Further, if one counts the 5, 3 and 2 which have loaded on the three PCs, and considers that 5 of them are dependent variables, even then the total becomes 10, not 11. However, it should be clear to you that dependent and independent variables are to go into PCA separately. If so, the set of PCs of the dependent variables will not be correlated with each other, and the same applies to the PCs of the independent variables, though there is the possibility that a PC of the dependent set and one of the independent set may be correlated highly, at a medium level, or weakly.
However, if you have included all dependent and independent variables in a single PCA, there is likewise no question of collinearity between the PCs, just as there is none within the separate PCAs of dependent and independent variables.
Your reporting and explanation of the problem is not satisfactory. Further, why are you bent upon path analysis, and what purpose will it serve? Why not use canonical correlation analysis, which is most effective when there are two sets of variables, one consisting of dependent variables and the other of independent variables?
I don't know where the confusion is in your communication of the problem.
Let me attach a picture of the variables so that I can make my problem clear. Here we see the variables in 3 separate groups. The direct effects which are circled in red have very high values (these should be in the range of 0 to 1, or a little higher than 1). These values are wrong. I did ridge regression for these variables. The variables in group 1 (from the top) and group 2 show high multicollinearity values (>10).
Eliminating one of the variables in a pair corrects the problem, but I don't want that. I applied mean-centering to the data as well, but it didn't work.
I have to obtain those characters which have a higher direct effect on Cyield. From the selected variables I will make selection indices.
I think the issue is that the experimental design generating these data does not provide a proper basis to separately identify the most important independent variables, because they are inter-correlated. No transformation will change this. It is necessary to draw conclusions based on groups of related variables. PCA scores, ratios, sums and differences are all ways to do this, but in the end there is no way to elicit causative relationships for single distinct variables without additional experimentation, professional judgement, physiological constraints, etc. Additional experiments need to be designed in such a way that predictor variables are set at levels which break the collinearity. These are called orthogonal designs; see, for example, Cochran and Cox.
It is not possible to use the same variables and remove or decrease the multicollinearity among them. However, if multicollinearity is a problem, e.g. when you build a regression model, it can be tackled. One solution is to use PCA and multiple linear regression on the resulting principal components (a.k.a. PCR, principal component regression). The regression model is built on the principal components, using the matrix of scores, T, instead of the original matrix of independent variables, X. But since every principal component is a linear combination of the original variables, you retain all of them. Another, even better solution is Partial Least Squares regression (PLS).
In both cases your data have to be mean centered and, if variables have different units, standardised.
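A minimal sketch of both approaches with the pls package in R, assuming a data frame df that contains Cyield and the collinear predictors (the data frame name and the choice of 3 components below are only placeholders):

library(pls)  # provides pcr() and plsr()

pcr_fit <- pcr(Cyield ~ ., data = df, scale = TRUE, validation = "CV")   # principal component regression
pls_fit <- plsr(Cyield ~ ., data = df, scale = TRUE, validation = "CV")  # partial least squares regression

summary(pcr_fit)          # variance explained per component
validationplot(pcr_fit)   # cross-validated RMSEP versus number of components
coef(pls_fit, ncomp = 3)  # coefficients expressed on the original variables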
Thank you for your valuable suggestions. The design I used for my experiment is a Randomized Complete Block with 3 replications. However, the data in sugarcane happen to lie in wide ranges, which creates problems sometimes. As far as collinearity is concerned, it may be rectified using suitable methods.
I used PCR with the new variables created after PCA. It regresses the dependent variable on the principal components and shows the proportion of variance accounted for by each component. That part is done, though.
What I infer from the discussion so far is that removal of one variable from each multicollinear pair is the only solution to tackle the problem.
Going through the diagram you provided, I am suggesting a strategy which I hope may work. I recommend giving it a chance.
The First Set of variables from below has two variables which are doing well on cane yield.
In the case of the Second Set, I find that, out of three variables, two, i.e. Gr1 and Gr2, are highly related, and in the case of the Third Set of five variables, all are highly related, with Rec and Brix on the one hand and POL and Pur on the other being the most highly related.
My submission is:
1. The variables in the Second Set and the Third Set which are highly related should be made scale-free in the following way:
X = (X - Xmin) / (Xmax - Xmin), i.e. divide by the range.
2. After making the variables scale-free, add those which are highly related, e.g. Gr1+Gr2 in the Second Set, and Rec+Brix and POL+Pur in the Third Set. The idea is that adding scale-free variables generates a new variable whose distribution pattern differs from those of the variables from which it is formed, and which is expected not to be as strongly related to the other variables as the originals are. If your theory suggests adding other variables to obtain a new one, that is highly recommended.
3. Standardise all the variables (the new ones as well as the originals that were not converted into new variables), including the dependent variable (Cyield), in the following way:
X = (X - Xmean) / Xstd
or
Y = (Y - Ymean) / Ystd,
where std denotes the standard deviation.
This has the advantage that you need not calculate standardised regression coefficients separately; a small R sketch of steps 1 to 3 is given below.
I hope in this way your problem is likely to be solved.
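A minimal sketch of those steps, assuming a hypothetical data frame df with the variable names used in this thread (Gr1, Gr2, Rec, Brix, POL, Pur, Cyield); the new variable names are placeholders:

# Formula 1: min-max scaling makes a variable scale-free
rng <- function(x) (x - min(x)) / (max(x) - min(x))

# Step 2: add the scale-free, highly related pairs to form new variables
df$Gr12    <- rng(df$Gr1) + rng(df$Gr2)
df$RecBrix <- rng(df$Rec) + rng(df$Brix)
df$PolPur  <- rng(df$POL) + rng(df$Pur)

# Step 3 / formula 2: z-standardise the new variables, the untouched originals
# and the dependent variable, so the slopes come out already standardised
vars   <- c("Cyield", "Gr12", "RecBrix", "PolPur")  # plus any untouched original variables
df_std <- as.data.frame(scale(df[, vars]))

summary(lm(Cyield ~ ., data = df_std))  # standardised partial regression coefficients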
Still, I am seeing 10 variables there. If you have understood my suggestion, there should be 7 variables, excluding the dependent variable.
I have suggested that, after making the variables scale-free using formula 1, you add Gr1 to Gr2 (Gr1+Gr2), Rec to Brix (Rec+Brix) and POL to Pur (POL+Pur). After this, only 7 variables will remain, which are then to be transformed using formula 2. It is not necessary that you add exactly the variables I suggested; as I recommended, "if your theory suggests addition of other variables to obtain a new one", that suggestion of the theory should be followed.
Yes I understand now. That means creating new variables from the existing variables.
My point of view was that of transforming the existing variables and including them in the analysis as such.
I have to select variables on the basis of their higher direct effects and then make selection indices so that I can further select sugarcane genotypes on multiple variables (traits). So here, inclusion of new variables will change the meaning of the original ones. Therefore, I am going with the last remedy, i.e. exclusion of the problem variables from their multicollinear pairs.
However, I am really grateful for your sincere help and guidance.
Dropping variables is putting your head in the sand. The results of your study indicate that groups of variables are important. Because they are inter-correlated, your experiment does not provide data to determine which member of the group, or which untested lurking variables, are "causative". By ignoring this fact and just picking variables, you run the risk of misidentifying the causative variable(s).
Or perhaps the improved yield truly only happens when groups of variables are present in the correct combinations, i.e. an interaction.
I think we understand that you "desire" to identify individual causative factors, but your situation is limited to identification of groups of associated variables that need to be tested in subsequent controlled experiments in order to ascertain causation. Mohammad has provided sound advice to work with what you have, but it is foolish to pretend that you have more.