Suppose I have three predictors A, B, and C. I introduce A and B in the first block, and both are significant. Then I introduce C in the second block. C is significant but A and B are no longer significant. How would you interpret this situation? Are A and B significant predictors after all? Is it enough to write that C is a better predictor than A and B? Does it pertain to some sort of mediation? The correlation between A and B is -.20*, between A and C is .47** and B and C is -.27**. Thank you!
With correlated predictors, the standard error of the estimate of the regression coefficient can become very large. You should ask if there is an important change in the regression coefficients rather than just whether the coefficients crossed some magical line of "significance".
Dear Lukasz,
Why start with A and B in the first place? What is the most important variable? Can we assume a linear relationship between the output and the inputs? Have you tried stepwise regression?
A and B are the "standard" predictors, and C is my new hypothesis.
It is obvious that C is a linear combination of A and B. This means that any variance that can be explained by A and B can be expressed by C alone. You can test my comment by adding all three to one model and checking the variance inflation factor (VIF) or tolerance (Tol).
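For example, a quick way to check this in R (a sketch only; the data frame d and the column names Y, A, B, C are hypothetical):
# fit the full model, then compute each VIF as 1 / (1 - R^2) from regressing
# one predictor on the other two
fit.full <- lm(Y ~ A + B + C, data = d)
vif.A <- 1 / (1 - summary(lm(A ~ B + C, data = d))$r.squared)
vif.B <- 1 / (1 - summary(lm(B ~ A + C, data = d))$r.squared)
vif.C <- 1 / (1 - summary(lm(C ~ A + B, data = d))$r.squared)
c(A = vif.A, B = vif.B, C = vif.C)   # values far above roughly 5-10 signal collinearity
# car::vif(fit.full) gives the same numbers if the car package is installed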
Here, the more important thing is to find the causal relationship between A, B, and C. Are A and B outcomes of C, or vice versa?
Lukasz, depending on what you are trying to say with the data, you might be interested in the idea of mediation. Of course most people refer back to the article by Baron and Kenny in 1986 in Journal of Personality and Social Psychology called "The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations." David A. Kenny is on ResearchGate and the paper is on page 7 of his profile although the date is wrong (1987). It defines all the issues, although the well-known testing procedure (the article is cited hundreds or thousands of times a year for use of the test by authors who have never read the article) is pretty out of date as it was based on 1986 computers and software. What you have already experienced is a major part of the test. If you find you are interested in mediation you can use "Sobel test" calculators on-line (easy to find) and ordinary regression results to test for mediation. If you are interested in the topic you can read more modern articles by David MacKinnon and his excellent book on the topic. I do not recommend stepwise regression methods as a means of figuring out which predictor is better or more important or anything like that; they are not very useful. I also would not use standardized regression coefficients in that way either. Both these things have been pretty much discredited. Bob
Ehsan, I am not sure I follow. Why is it obvious that C is a linear combination of A and B?
@Alex, I said that in mathematical terms, not logical ones. It is, because C can explain the variation that A and B explain, and this would show up as high VIFs for the three.
Logically they may not have the relations I described. As mentioned in other comments, they may be mediators, moderators or even confounders.
@Lukasz, you can find a more precise answer in this book:
Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models, by Eric Vittinghoff et al.
Lukasz, did you try using the SEM (structural equation model) approach?
What are the zero-order correlation coefficients between each predictor and the criterion variable?
Another question, Lukasz: if you were to enter all 3 predictors in the first block, how would the picture change? The reason I'm asking is that you may have to reconsider your blocking if C were insignificant when entered in the first block but became significant in the second.
Cheers!
While mediation/moderation may play a role as suggested, as Ehsan indicated this really says that whatever C is in the model, it supplants the predictive power of the other two. They add nothing to the prediction in the presence of C. While this may be because they are a linear combination, it may also be that C is the precursor of A and B (there are other possibilities). In the presence of C, A and B no longer add new information; they are in some way redundant.
I believe that this is due to some sort of mediation, where one variable mediates the relationship between a predictor and an outcome.
Dear Mahfuz, as long as the logical evidence does not lead to this idea, we can't judge. Assuming it could be very dangerous.
Lukasz, if, and only if, you can build a path between A, B, C and your dependent variable, as Alex mentioned, you'd better use SEM. I'd rather use path analysis among them.
As has been said above, this happens with correlated covariates. If the predictors are uncorrelated, a new regressor improves the fit without modifying the previously introduced coefficients and their standard errors.
If you are using correlated regressors, then spurious correlations with the response may arise and a new variable may push out previously introduced ones.
The correct approach would be a full subset regression search using a penalized criterion such as AIC or BIC.
I do agree with Dale Pietrzak.
When this happens, it is useful to try carrying out stratified analyses instead of multivariable analyses.
That way you would see whether you are in the presence of a confounding variable.
It is not clear whether you are talking about linear regression, although since you report correlations I assume you are not doing logistic regression...
And I suppose you have continuous variables, right?
Now, what you have is that the correlations of A and B with your outcome are actually "spurious" relations.
C is correlated with both the outcome and your other predictors, but as Dale wrote, C is a precursor; that means C is what influences both your outcome and A and B.
The classic example in logistic regression is the association between lung cancer and yellow fingers. The association exists and is strong, but the real associations are between lung cancer and smoking, and between smoking and yellow fingers. The association between lung cancer and yellow fingers is spurious.
Thank you all, RG invaluable community! You gave me lots of food for thought! I reconsidered the model statistics-wise and theory-wise, and it seems that the best idea is to have C as the predictor for A and B operating as mediators for the DV. Grateful to you! Lukasz
You need to include more sample observations. The degrees of freedom are probably too small to have more than one significant predictor variable. Remember: degrees of freedom are your effective sample.
A second possibility is that the new (third) predictor dominates the others. This often occurs when the new variable is a categorical (dummy) variable, for example predicting the weight of one's clothing when you employ a categorical variable for whether the weather includes snowfall or not.
To answer your question, there are three different situations that will influence your interpretation and what you would do.
1. You are confirming an existing model in which you assumed X would be independently influenced by A, B, and C.
2. You are developing a predictive model in which you do not care about causal links but only about the model's ability to predict an event. In other words, how can I best fit A, B, and C to predict X?
3. You are exploring data to test potential hypotheses about causal links between factors. You have no idea of latent constructs linking A, B, or C together, and you are not certain whether any of these factors is directly linked to X.
-- 1. Testing models ----
I would suggest using structural equation modeling (-sem- or -gsem- commands in STATA), in which you can also plan to model covariances among the error terms. You therefore need to think beforehand about how all these factors are most likely to be causally linked to one another. Then all you need is to test your model. By modeling these error covariances you would be able to account for shared error among correlated variables, and the problem you mention would not appear unless your sample is too small or your model is not correct. For details, see http://www.cpc.unc.edu/training/seminars/BollenBauldry%20SEM%20JAN13.pdf
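For readers working in R rather than STATA, a roughly equivalent sketch using the lavaan package (the model paths and the data frame d with Y, A, B, C are purely illustrative; your own theory must dictate the paths):
library(lavaan)
# hypothetical structure: C influences A and B, and all three predict Y
model <- '
  A ~ C
  B ~ C
  Y ~ A + B + C
'
fit <- sem(model, data = d)
summary(fit, fit.measures = TRUE, standardized = TRUE)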
-- 2. Developing predictive models ----
Your aim here is to optimise R2 while making sure you are not overfitting. To do this, the best way is to make sure that the motivations for placing or removing factors from the model are conceptually grounded. If collinearity appears between two factors, this will indeed cause some problems and you will need to remove one of the factors. Use likelihood ratio tests between your models to choose which one to remove. Confounding can also occur; if this is the case, the intervals for your coefficients will remain of similar width, but their magnitude will change. If this occurs, keep the variable that confounds the others and remove the others. Do not rely on p-values to make decisions! Rely on the changes in coefficients and 95% CIs, and once you have finished your model, then report p-values. As a rule of thumb, you can however decide to rule out factors with p-values above 0.2 or 0.1, as these are not likely to become significant at the end of the process.
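A minimal sketch of comparing nested models in R rather than relying on single p-values (d and its columns Y, A, B, C are hypothetical names):
fit.small <- lm(Y ~ C, data = d)
fit.big   <- lm(Y ~ A + B + C, data = d)
anova(fit.small, fit.big)   # F test for whether A and B add anything beyond C
# for a generalized linear model, use a likelihood ratio test instead:
# anova(glm.small, glm.big, test = "LRT")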
-- 3. Exploring causation ----
This is like building a house by piling cards one on another... it will look nice but won't hold for long. Unless you plan to derive a causal model that you then wish to verify (situation 2, then situation 1), this approach is statistically and scientifically inappropriate. So I have one single, simple piece of advice for this situation: "Don't do it!"
If you decide to go the SEM route, be aware there is a fair bit of controversy about model fit interpretations. See http://davidakenny.net/cm/fit.htm; the relevant section reads:
"Controversy about Fit Indices
Recently considerable controversy has flared up concerning fit indices. Some researchers do not believe that fit indices add anything to the analysis (e.g., Barrett, 2007) and only the chi square should be interpreted. The worry is that fit indices allow researchers to claim that a miss-specified model is not a bad model. Others (e.g., Hayduk, Cummings, Boadu, Pazderka-Robinson, & Boulianne, 2007) argue that cutoffs for a fit index can be misleading and subject to misuse. Most analysts believe in the value of fit indices, but caution against strict reliance on cutoffs.
Also problematic is the “cherry picking” a fit index. That is, you compute many fit indices and you pick the one index that allows you to make the point that you want to make. If you decide not to report a popular index (e.g., the TLI or the RMSEA), you need to give a good reason why you are not.
Finally, Kenny, Kaniskan, and McCoach (2013) have argued that fit indices should not even be computed for small degrees of freedom models. Rather for these models, the researcher should locate the source of specification error."
If A and B also significantly affect C, then C is a mediator variable.
It seems you have a mediation effect. You can examine this with the following steps:
Step 1: Show that the initial variable is correlated with the outcome. Use Y (DV) as the criterion variable in a regression equation and X (IV) as a predictor (estimate and test path c). This step establishes that there is an effect that may be mediated (Model Y = X).
Step 2: Show that the initial variable is correlated with the mediator. Use M as the criterion variable in the regression equation and X (IV) as a predictor (estimate and test path a). This step essentially involves treating the mediator as if it were an outcome variable (Model M = X).
Step 3: Show that the mediator affects the outcome variable. Use Y (DV) as the criterion variable in a regression equation and X (IV) and M as predictors (estimate and test path b). It is not sufficient just to correlate the mediator with the outcome; the mediator and the outcome may be correlated because they are both caused by the initial variable X (IV). Thus, the initial variable must be controlled in establishing the effect of the mediator on the outcome (Model Y = M X).
Step 4: To establish that M completely mediates the X-Y relationship, the effect of X (IV) on Y (DV) controlling for M should be zero (estimate and test path c'). The effects in both Steps 3 and 4 are estimated in the same regression equation.
If all four of these steps are met, then the data are consistent with the hypothesis that variable M completely mediates the X-Y relationship; if the first three steps are met but Step 4 is not, then partial mediation is indicated. Meeting these steps does not, however, conclusively establish that mediation has occurred, because there are other models that are consistent with the data. Some of these models are considered later.
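As a rough illustration, these steps amount to three ordinary regressions; in R (a sketch, where the data frame d and the names Y, X, M are hypothetical):
summary(lm(Y ~ X, data = d))      # Step 1: path c, total effect of X on Y
summary(lm(M ~ X, data = d))      # Step 2: path a, effect of X on the mediator
summary(lm(Y ~ X + M, data = d))  # Steps 3 and 4: path b for M, path c' for X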
Also, you may have a confounding effect; you may want to examine both. Abbas
If you have fully quantitative endpoints, then there is no reason to disregard a stepwise regression as a first approximation. I disagree with Brennan's claim that stepwise regression is "discredited". The question is whether the model works better with the variables entering in different orders. It is quite possible that it makes no significant difference. This also tells you something: the endpoints are of equal value.
You might be interested in my paper "Illusions in Regression Analysis".
J. Scott Armstrong
It sounds like factor C is related to factors A and B such that factor C includes, conceptually, elements of A and/or B. You don't say what those factors are, but if for example factor C is alcohol consumption and B is cigarette smoking, then factor C may be overriding factor B in significance because drinking and smoking commonly occur together. So try to disentangle the factors somehow so their concepts don't overlap. In logistic regression we call this collinearity---the factors are strongly related to each other, which can make one or more of them non-significant when all are included in the model.
Is this regression linear or is it logistic regression? If logistic, you must be sure to include all the first-order terms if you are including an interaction term (for example, include A and C if you are also including A*C as factors).
You should test for interaction; i.e., if you plot for example A vs. C on a graph, does one go up as the other goes down and do they cross? This is noteworthy. If there is effect modification, i.e., the effect of C (on the outcome) is modified by the level of B or by the level of A, then you should control for that.
If one of the factors is a continuous variable (such as number of cigarettes smoked), then perhaps you can dichotomize it into a yes/no variable (such as 20 or more cigarettes a day, or not) in logistic regression. I hope this helps.
Before starting to interpret the results, check whether there is a multicollinearity problem. Also check the other assumptions of regression analysis, like normality, constant variance, linear relations between the dependent and independent variables, no outliers, adequate sample size, etc. If everything seems to be OK after these checks, you can apply a variable selection method like stepwise regression, best-subsets regression, etc. Suppose you applied stepwise selection and you got Y = a + bX1; in this case we can conclude that there is a significant relation between Y and X1, in other words, changes in X1 significantly affect Y. So if you want to estimate values of Y, you have to consider X1.
You probably have confounds, i.e. the variables are naturally interdependent rather than independent. This can easily be checked in R by plotting each variable against the others in a scatterplot matrix.
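A minimal sketch of that check (d and its columns Y, A, B, C are hypothetical names):
pairs(d[, c("Y", "A", "B", "C")])        # scatterplot matrix
round(cor(d[, c("A", "B", "C")]), 2)     # pairwise correlations among the predictors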
The problem of multicollinearity is well covered in most books on multiple regression. This discussion should be ended now.
If goodness of fit is of main concern, you may simply perform a model selection between
model I: with predictors A & B;
model II: with A, B & C
using AIC say. But, for model interpretation, causality analysis may be needed.
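A sketch of that comparison in R (the data frame d and columns Y, A, B, C are hypothetical):
fit.I  <- lm(Y ~ A + B, data = d)       # model I
fit.II <- lm(Y ~ A + B + C, data = d)   # model II
AIC(fit.I, fit.II)                      # lower AIC is preferred
BIC(fit.I, fit.II)                      # BIC penalizes the extra term more heavily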
I agree with Jerry Miller about collinearity in logistic regression. If the factors are strongly related to each other and are entered together, as in your case, the overlap between them can make one or more of them non-significant.
The answer is very simple: the tests of significance of single predictors in different models do not test the same thing and, therefore, there is no contradiction.
If you test the significance of Ba in the model
Y = Bo + Ba*A
you test whether A is a significant predictor of Y by itself.
If you test the significance of Ba in the model
Y = Bo + Ba*A + Bb*B
you test whether A is a significant predictor of Y in the presence of (or in addition to) variable B.
If you test the significance of Ba in the model
Y = Bo + Ba*A + Bb*B + Bc*C
you test whether A is a significant predictor of Y in the presence of (or in addition to) variables B and C!
These "apparent contradictions" are discussed and nicely illustrated in a paper by Yoshio Takane and Elliot Cramer in MBR, 1975, vol 10, pp 373-384
I agree with the interpretation by David. Having established that a relationship exists between the variables under correlation analysis, and given that you are dealing with multiple variables, now adopt hierarchical regression analysis.
Using Baron and Kenny's (1986) approach and regression analysis, variable C is a partial mediator of the relationship between the independent variables (A and B) and the dependent variable.
It seems to me that the linear combination of A and B lies in the same two-dimensional subspace (plane) as the criterion variable (Y). Predictor C lies in this subspace too, and closer to Y than the linear combination of A and B.
So if you tune your model with A and B, you get a not-so-bad approximation of the criterion variable. But C gives you a better one, and you no longer need A and B.
P.S. I don't know if it is possible to add some pictures to answers. Anyway if you need you can ask me for a graphical representation.
I guess this is a multicollinearity problem. I agree with Ronán Conroy that standard errors were inflated due to the collinearity of predictors, and hence the test statistic (e.g. t or chi-square) becomes non-significant. If you check the variance inflation factors (VIF), at least two of the b parameters probably have VIF > 10. Please check. The pairwise correlations between A and B, A and C, etc. are not a good test.
As Robert says, stepwise regression is useless because it applies arbitrary cut-off points of P to retain or drop a variable. If you think all of A, B and C are necessary, you may wish to remove collinearity by centering or partial orthogonalization. However, if you want a model with the optimum number of variables, try information criteria (e.g. Akaike, Bayesian, Deviance), as these balance bias with variance. However, in a situation like yours (i.e. with only 3 variables), information criteria like Akaike's are likely to select the model with all predictors (A, B and C) even if one or two are redundant (collinear). They are more efficient when the number of predictors is larger.
A better approach is to use partial least squares (PLS) regression. When there are many collinear predictors and your sample size is small, PLS is very reliable for identifying relevant predictors and their magnitude of influence. It does this by cross-validation, i.e., fitting the model to part of the data and minimizing the prediction error for the unfitted part. There are different methods, but the MM is the best.
The following article may be helpful.
Carrascal, L.M., Galván, I., Gordo, O., 2009. Partial least squares regression as an alternative to current regression methods used in ecology. Oikos 118, 681–690.
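In R, one way to try PLS is via the pls package (a sketch; d and its columns Y, A, B, C are hypothetical names):
library(pls)
fit.pls <- plsr(Y ~ A + B + C, data = d, validation = "CV")  # cross-validated PLS
summary(fit.pls)          # RMSEP by number of components
validationplot(fit.pls)   # choose the number of components with the lowest CV error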
R-squared values may be helpful in determining how well the dependent variable is predicted by the independent variables. These are easily generated by the regression, depending on the variable types.
This problem could be due to multicollinearity.
It could be detected by pairwise correlations between A, B and C, if these variables are treated as continuous variables.
If levels of the three variables have been set under designed experimental conditions, then interaction effects should be tested.
If it's a mediation effect, sorry, but some authors in consumer research show that the Baron and Kenny method is not the best way to test mediation; it would be preferable to use a bootstrap method and check whether the confidence interval between the lower and upper bounds excludes zero, or to use SEM. If you want to know more about tests of mediation, see the paper "Reconsidering Baron and Kenny: Myths and Truths about Mediation Analysis", Zhao, Lynch and Chen, JCR 2010.
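For the bootstrap approach in R, a sketch using the mediation package (the variable names X, M, Y and the data frame d are hypothetical; X is the focal predictor, M the proposed mediator):
library(mediation)
fit.m <- lm(M ~ X, data = d)        # mediator model
fit.y <- lm(Y ~ X + M, data = d)    # outcome model
med <- mediate(fit.m, fit.y, treat = "X", mediator = "M", boot = TRUE, sims = 5000)
summary(med)  # mediation is supported if the bootstrap CI for the indirect effect excludes 0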
What about the mean square error? I mean for (1) Y vs. A and B and (2) Y vs. C. You should choose the one with the smaller MSE; the smaller MSE tends to indicate the better model.
In addition to what has been mentioned so far, I would clarify if your goal is a prediction oriented regression model or not. The partial t tests for the significance of individual regressors when added last to a model are not as useful as other statistics, such as the PRESS or Mallow's Cp, say. It could be that the two regressor model has a better fit, but the three regressor model is more stable and maybe it has a smaller prediction variance.
For the case of multicollinearity, use Ridge Regression which is designed for this situation.
I think that is due to the highly significant correlations of C with each of A and B, i.e., overlap between the effect of C and the effects of A and B.
In my opinion, it would be better to keep only C.
I think it is a mediation effect: the two factors affect the dependent variable through the third factor.
AMOS is user-friendly software with a graphical interface; it is quite easy to test there whether you have a mediation effect.
It seems like full mediation: C is responsible for the change in the outcome variable.
I agree with Huda and Naeem. C alone is determining the outcome. However, the influence of C on the outcome is not entirely a direct influence; it is at least partially accomplished INdirectly through C's effects on A and B and, in turn, their effects on the outcome. There is apparently NO additional influence of A or B beyond their role as intermediaries for C. Of course, all of this presumes that influences are linear, no important fourth factor D, etc., etc.
I agree with the last comment, and the amount of variance explained by the third variable should be sufficient to draw a conclusion about the contribution of each of the other variables.
Test a situation that will give a big difference between the two models and you will see where the truth is.
Are your 3 variables continuous variables?
If so, I think you have to go back and start by finding the "functional form" of each of your variables independently. It seems to me you have just assumed a linear effect for all three variables.
This is probably not true.
I suggest you split each variable into equidistant intervals, e.g. 5 intervals. The first thing is to plot the regression estimates for the intervals against their interval number to see if the effects look linear. If not, you might have to include a quadratic term, square root, log, exp, etc. Alternatively, you can get an indication of the functional form using fractional polynomial regression.
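A sketch of that check in R (d, Y and A are hypothetical names):
d$A.cat <- cut(d$A, breaks = 5)        # 5 equal-width intervals of A
fit.cat <- lm(Y ~ A.cat, data = d)
est <- c(0, coef(fit.cat)[-1])         # interval effects relative to the first interval
plot(1:5, est, type = "b", xlab = "interval of A", ylab = "estimated effect")
# roughly linear points support a linear term; curvature suggests log, square, etc.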
When you have found a satisfying expression for each of your 3 variables, then you can proceed to multi-variable regression, possibly with the inclusion of some interaction terms between the 3 variables.
Then you might still get similar results as those you presented in the first place, but now on more solid ground.
Regards Kim
Are you certain your predictors are exogenous? Should any of your predictors be endogenous, the resulting parameter estimates will be biased and potentially uninterpretable. This can be determined using specification tests (e.g., the Hausman test). Should you discover that you've added an endogenous predictor, regression methods like two-stage least-squares regression or judiciously-applied structural equation modeling should help you reduce the degree of bias in your parameter estimates.
As part of the original set of questions, it was asked, "Are A and B significant predictors after all?" Ronán Conroy noted that "significance" is not "magical." I prefer to just consider whether a regression coefficient is 'substantially' bigger than its standard error. If each predictor is used one at a time, you might just compare the variances of the prediction errors if you want to see which appears 'best', but the minute you use them in one multiple regression equation, things do get much messier. If you want to know whether the level of multicollinearity has become problematic, you might really think so if a regression coefficient changes sign relative to when it was used alone. If A and B are standard, and you found C to be better, it sounds like your only problem is whether to use C alone or not. Again you might look at the variance of the prediction errors, though as others have noted, there is a lot else to consider. Thank you for the stimulating question.
I had a similar issue while trying to design a score for quality assessment of protein structures (Bagaria et al. 2012, Protein Science). When using multiple linear regression (MLR), it is important that the variables are mutually INDEPENDENT, and therefore the mutually dependent ones need to be removed from the model. This can be done following the Akaike Information Criterion and/or the p-value of the contribution of the individual scores (parameters), for example. These functions are straightforward to run via freely available statistical tools like R. Techniques like Principal Component Analysis or clustering could also be used in such cases instead of MLR.
With respect, I differ from Anurag Bagaria's reply. Multiple linear regression (MLR) DOES take into account linear dependence among the explanatory variables, so it is NOT necessary for them to be independent. (In fact, if they are independent, then MLR and individual LR's will yield the same results, so MLR would be redundant and unnecessary.) It is true, however, that if the interdependence among explanatory variables is very strong, the uncertainties in their contributions to the response variable variations will be large. To understand the interactions, it is often helpful to look at an (N+1)-dimensional plot of explained variance vs all N explanatory variables, or sections of that space if N>3.
Eric
I will opt for the latter response (C), simply because when conducting MLR, simultaneous adjustment for the other variables occurs; any IV that does not remain significant thereafter does not explain any additional association.
It will be difficult to fully support or contradict the comments of Eric and Raid. I think the removal or non-removal of parameters will also depend on the quantity and quality of the data being used for the model. For a smaller data sample it would be advisable to use only the most suitable and apt parameters rather than all possibly available ones, especially to rule out any possibility of an overfit. Also, if in a particular model an MLR with 4 parameters gives predictions of the same or similar quality as one with 5, 6 or more parameters, for example, the larger model would be less stable. That is my understanding so far. However, here is a disclaimer: though I have been using statistics extensively for my scientific studies in consultation with statistical experts, I am not an expert myself but on the learning path.
One way to get at understanding your IVs here is to partition all of the joint and unique sources of variance in prediction: Take [A], [B], [C], [A,B], [A,C], [B,C], [A,B,C] as the R2 values from 7 different linear regressions with A, B, and C as the IVs, and then variance is partitioned as follows:
R2 unique to A = [A,B,C] - [B,C]
R2 unique to B = [A,B,C] - [A,C]
R2 unique to C = [A,B,C] - [A,B]
R2 shared by A and B = [A,C] + [B,C] - [C] - [A,B,C]
R2 shared by A and C = [A,B] + [B,C] - [B] - [A,B,C]
R2 shared by B and C = [A,B] + [A,C] - [A] - [A,B,C]
R2 shared by A, B, and C = [A,B,C] - [A,B] - [A,C] - [B,C] + [A] +[B] + [C]
Of course, these variance components are estimates specific to the sample (small values may be more fragile and disappear in a new sample). And they can be subjected to significance tests (see Nimon et al. 2008, Behavioral Research Methods: http://goo.gl/DcLJqA)
Of course, what predictor is 'important' then is a matter of judgment as informed by these estimates, your substantive knowledge of the variables, whether regression is an appropriate model, etc. Good luck!
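A sketch of this partitioning in R (the data frame d and columns Y, A, B, C are hypothetical):
r2 <- function(f) summary(lm(f, data = d))$r.squared
rA <- r2(Y ~ A); rB <- r2(Y ~ B); rC <- r2(Y ~ C)
rAB <- r2(Y ~ A + B); rAC <- r2(Y ~ A + C); rBC <- r2(Y ~ B + C)
rABC <- r2(Y ~ A + B + C)
c(unique.A   = rABC - rBC,
  unique.B   = rABC - rAC,
  unique.C   = rABC - rAB,
  shared.AB  = rAC + rBC - rC - rABC,
  shared.AC  = rAB + rBC - rB - rABC,
  shared.BC  = rAB + rAC - rA - rABC,
  shared.ABC = rABC - rAB - rAC - rBC + rA + rB + rC)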
Two notes to go with Frederick's comments: 1) R2 is redefined when you change models, and is not directly comparable. Willett and Singer had a couple of articles on that, I think in The American Statistician. 2) Any time you use a p-value, remember it is a function of sample size. This can be misleading: changing the sample size alone could change your conclusions. It is best to compare competing point-level hypotheses. I use CVs (the standard error of the estimate divided by the estimate) wherever possible in decision making.
I suggest adding predictors to the model one at time and look at the change in the regression coefficients. For example, if there was a substantial change (>=20%) in the estimate of predictor C effect after adding predictor A, this means that A is acting as a confounder to the association between the outcome and C, and in that case A has to be in the model even though it is non-significant. Similar thing for predictor B.
Most statistical software has stepwise procedures for this, and some check for multicollinearity first, e.g., by computing the variance inflation factor (VIF).
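A sketch of the change-in-estimate check in R (hypothetical data frame d with Y, A, C):
b.C.alone  <- coef(lm(Y ~ C, data = d))["C"]
b.C.with.A <- coef(lm(Y ~ C + A, data = d))["C"]
100 * abs(b.C.with.A - b.C.alone) / abs(b.C.alone)  # >= 20% suggests A confounds the C-Y association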
I appreciate seeing so many suggestions here. This is good.
On the other hand, I feel that people are going in circles.
I have already asked this question before: what are variables A, B, and C, and what is the response variable? What is the sample size? Give us information on the data. How can we figure out what is needed when such information is not given here? Maybe I am missing something.
I would like to better understand your research design. From what theory are you drawing your constructs, and therefore your independent and dependent variables? Did you perform confirmatory factor analysis on each construct to eliminate insignificant variables before you finalized your predictor ---> outcome model for overall significance? If so, why are you introducing an additional predictor into the model? If not, it may be necessary to redesign your research model in order to consider the variables you are using in a full graphical model using software such as AMOS (Analysis of Moment Structures) or other software that can model and support the data. I have attached a segment from my HIMSS 14 eSession presentation to demonstrate this notion. Further, if you use this methodology, you can test for Baron and Kenny's concept of mediation/moderation.
- Try B & C as independents, A is rendered insignificant possibly because it shares a lot of predictive correlation with C
- Then test the nested models (~ B+C vs. ~ A+B+C) to find whether it makes sense to introduce (keep) A as a predictor
- Multicollinearity is a major pain in MLR: the most important statistic for me would be Adjusted R square, followed by the significance of the variables. If the Adj R2 is good, then I would not bother too much with borderline "insignificance" @ 5% -- p-value > 0.01
Depends on the causal structure between all your variables.
i) Your third variable "C" may be a strong confounder in the relationship between your predictor/exposure/independent variable and the outcome/dependent variable. The "simplified" definition of a confounder is a variable that is associated with both the independent and dependent variable, but is not an intermediate. Controlling for C would block the backdoor paths from A and B to the dependent variable.
ii) If your sample size is small, it may be due to chance.
iii) Variable "C" may encompass both A and B in explaining the variability in the dependent variable. i.e. there may be multicollinearity
iv) Variable "C" may be an intermediate between A/B to the dependent variable. Controlling for C would cut-off the front-door path from A/B to the dependent variable.
Although I agree that there are alternative methods for mediation/moderation, each with its own set of limitations, I would not dismiss Baron and Kenny. Using the excerpt from the HIMSS 14 presentation that I added above, which is based on my 2011 dissertation, I have provided an unpublished manuscript that was submitted to Education Research. I think you will see how useful Baron and Kenny can be in the determination of moderation/mediation if you are using advanced statistical techniques like confirmatory factor analysis and structural equation modeling. Unfortunately, I did not keep a copy of the final submitted version, so forgive the notes. But it should suffice as an example.
Raid: yes. Briefly, factor analysis and SEM are rarely covered, much to the detriment of existing published work, in my opinion. Often, existing research does not have a basis in a theoretical premise and therefore cannot offer practical solutions or analysis, because there is no comparative foundation from which to interpret results for predictors or outcomes, since these were not used in the first place. Unless, of course, you are just using an analysis of variance (ANOVA) or data envelopment analysis (DEA), which may tell you which concept is better but cannot tell you why, or even whether the differences are statistically significant in relation to an X-->Y path. Respectfully, I will also add that CFA/SEM modeling, which requires a theoretical premise and uses defined constructs that can be measured and compared against a literature review, is not limited to psychology or other social sciences. I offer these suggestions as an alternative research design for this problem, since these are practical applications and methods for determining a solution to his added variable and the potential moderation/mediation impact.
I understand people might not agree with my assessment. No worries. I would state that this platform is an opportunity to think outside the limitations of our present resources! But please don't reject the option out of ignorance or because your institution does not currently have the statistical faculty to teach CFA/SEM. There are software programs such as SPSS, SAS and others whose analysis procedures provide guidance on data cleansing, sampling requirements, conditions for analysis, etc., which can be found in the resource books for that particular software version. Keep an eye out for workshops to get you started. Though it is not a cakewalk without an instructor, it is possible for those who understand the basic underlying statistics. Certainly, the benefits of the results are worth the additional labor.
I am very open minded when it comes to trying out some statistical methodologies that I have not used before in a certain manner. Lately, I have used CFA with disease surveillance (spatial epidemiology). Statistics is a great discipline to be working in as it has very few borders that cannot be crossed.
Hi colleagues
Is there any manual that you can advise me to use to help understand CFA?
I use the SPSS Survival Manual for EFA; I could not find a manual for CFA.
With respect to ridge regression, can you expand more on this, colleague Huda, please?
Saba: EFA versus CFA:
ref: http://www2.sas.com/proceedings/sugi31/200-31.pdf
====================================================
In the case of multicollinearity, some of the eigenvalues of X'X can be very small. Since the inverse of X'X is needed in many of the formulas for estimation in regression, the reciprocals of those eigenvalues can be huge, resulting in extremely large prediction variances. This means that the regression model becomes unstable and prediction is poor.
In ridge regression, a small amount (k) is added to the eigenvalues, which creates some estimation bias, but prediction is greatly improved. This is the methodology to use when you expect multicollinearity. There is no need to drop any of the predictor variables.
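A sketch of ridge regression in R using MASS::lm.ridge (d and its columns are hypothetical names):
library(MASS)
fit.ridge <- lm.ridge(Y ~ A + B + C, data = d, lambda = seq(0, 10, by = 0.1))
select(fit.ridge)   # suggested values of k (lambda), e.g. by generalized cross-validation
plot(fit.ridge)     # ridge trace: coefficients stabilize as lambda grows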
You can look up almost any statistics topic in Google.
Example: type in Google: example spss factor analysis
Pennsylvania State University has some excellent online course material that anyone can access for free. In particular, their multivariate statistics course is loaded with useful information, which includes factor analysis.
Whenever I teach a statistics graduate course, I tell my students " there has to exist someone out there who knows how to teach a certain topic better than me", and I then show them how to use Google to find such material online as an extra source of information.
Ridge regression models are used to solve the multicollinearity problem and avoid the resulting large sampling variances of the βs. This URL will be helpful:
http://www.m-hikari.com/ijcms-2011/9-12-2011/rashwanIJCMS9-12-2011.pdf
Montgomery, Peck, and Vining (Introduction to Linear Regression Analysis, John Wiley and Sons, New York) is an excellent text on regression with lots of detailed information on the effects of multicollinearity. So is the text by Raymond Myers (Classical and Modern Regression with Applications, PWS-KENT Publishing Co).
Pairwise correlations don't always reveal multicollinearity. I am guessing that when you introduced C, A is no longer significant because A can be predicted from B and C. Likewise, B is not significant because B can be predicted from A and C. I suggest that you regress A on B and C, and then B on A and C, to see if this could be the case.
Principal component analysis (PCA) might be considered for handling multicollinearity. In your case, one eigenvalue is probably near zero. The corresponding eigenvector will show the relationship among A, B and C. Then, by transformation, you can generate a new factor to replace the correlated factors.
Principal component regression can also be used.
A third choice is ridge regression.
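A sketch of principal component regression in base R (d and its columns are hypothetical names):
pc <- prcomp(d[, c("A", "B", "C")], scale. = TRUE)
summary(pc)                    # look for a component with variance near zero
round(pc$rotation, 3)          # loadings show how A, B and C combine
scores <- as.data.frame(pc$x)
fit.pcr <- lm(d$Y ~ PC1 + PC2, data = scores)  # drop the near-zero component
summary(fit.pcr)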
I agree with Yuanzhang Li. PCA is the proper solution for such a case: it not only reduces the number of independent variables but also loses no variation.
Saba Kassim, for EFA and CFA please check a multivariate statistics book by any author, especially the one co-authored by Dean W. Wichern.
Hi Lukasz,
besides correlations between predictors A,B,C, could you please give the correlations of them with dependent variable Y ?
thanks,
Stan Lipovetsky.
In fact, the slope b in the regression
Y = a + b*X
and the correlation r = corr(X, Y)
satisfy
b = r*Sy/Sx,
where Sy and Sx are the standard deviations of Y and X.
Hence the correlation and the regression slope describe the same thing in nature.
In your case, at least two of A, B and C are highly correlated; suppose they are A and B. Then if A and Y are marginally correlated, so are Y and B.
However, if you put both A and B in the regression model, the slopes describe the partial correlation of Y and A controlling for B, and the partial correlation of Y and B controlling for A.
It can be shown as follows. Run the regression
Y = a + ba*A + b0*B (1)
and you get the two slopes (regression coefficients) for A and B simultaneously.
Then run
Y = a2 + b2*A (2)
and let Y(A) be the residuals of (2).
Then run
B = a3 + b3*A (3)
and let B(A) be the residuals of (3).
Then run
Y(A) = a4 + b4*B(A) (4)
and you will find b4 = b0.
This means the regression coefficients in multiple regression describe partial correlations between the outcome and the independent factors; the model can't distinguish the effects among the correlated factors.
Hence you need to state clearly which correlation you mean: usually, one uses the marginal (unconditional) one.
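This identity is easy to verify numerically in R (a sketch; the data frame d and columns Y, A, B are hypothetical):
fit1 <- lm(Y ~ A + B, data = d)        # equation (1): coefficient b0 on B
fit2 <- lm(Y ~ A, data = d)            # equation (2)
fit3 <- lm(B ~ A, data = d)            # equation (3)
fit4 <- lm(resid(fit2) ~ resid(fit3))  # equation (4)
coef(fit1)["B"]   # b0
coef(fit4)[2]     # b4, numerically equal to b0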
The answer lies partly outside the field of statistics (although some of the answers above are quite relevant), in the field of causal analysis. In short, you should ask why you call A, B and C "predictors" and what, according to what is known on the subject, the causal relationship network between A, B, C and Y is. Should you really introduce C? Should you remove A or B?
You may wish to check Judea Pearl's works (http://bayes.cs.ucla.edu/jp_home.html). They're a great help in thinking about these difficult subjects.
Is there anyone who can advise me on how to download R for statistical analysis, and on any manual to help with using it, please?
I'm not sure this is the right forum ... but start with www.r-project.org/
There is a manual, but I would start with a tutorial. Google "online R tutorial" and you will probably get a large number of options.
I believe this is one of the symptoms of multicollinearity. Check what happens when you add the last variable while removing one of the two old variables. If the regression coefficients keep jumping up and down, then you have multicollinearity.
I know this is an old thread, but I wondered if anyone had a reference for the idea that, in the absence of multicollinearity, situations like this may be suggestive of mediation? Thank you in advance.
You might be interested in the Regression analysis checklist at ForPrin.com
One needs to distinguish between explanation and prediction. For prediction you are not concerned with significance but only with best fit. For explanation one no longer runs a stepwise regression but includes all explanatory factors in one regression equation and interprets them. And then you run an analysis that calculates the variance inflation factor in order to infer whether your results are plagued by multicollinearity.