I have five independent variables and one dependent variable. I have run a correlation matrix, and two of them correlate with the DV. If I run a multiple regression, should I include all the variables or just the correlated ones?
It will depend on the hypothesis you want to test. When you run a regression, you are essentially building a model to explain the dependent variable. Regression is a step up from correlation and has different aims.
You can include all your variables in your model.
A stepwise multiple regression will handle all five variables. Start with 99% confidence limits; this will probably (but not certainly) exclude the two independent variables. Then move to the 95% CL; this may let one of the variables be retained (or not). A rough sketch of the idea is below. Good luck, dear.
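To illustrate, here is a minimal sketch in Python of p-value-based backward elimination, on synthetic data with hypothetical column names; real stepwise procedures differ in detail, so treat this only as a way to see what tightening the confidence level does:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df, predictors, target, alpha=0.01):
    """Drop the least significant predictor until all p-values are below alpha."""
    kept = list(predictors)
    while kept:
        X = sm.add_constant(df[kept])
        fit = sm.OLS(df[target], X).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < alpha:
            return fit, kept
        kept.remove(worst)  # eliminate the weakest variable and refit
    return None, []

# Synthetic data: only x1 and x2 actually drive y.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)),
                  columns=["x1", "x2", "x3", "x4", "x5"])
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=100)

# alpha=0.01 corresponds to the 99% confidence level mentioned above;
# rerun with alpha=0.05 for the 95% level.
fit, kept = backward_eliminate(df, ["x1", "x2", "x3", "x4", "x5"], "y", alpha=0.01)
print("retained:", kept)
```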
That is very true. No correlation simply means that the r value was zero, which is a very rare occurrence in real data. Some of my friends once misinterpreted negative r values as meaning no correlation.
The correlation coefficient is a quantity measuring the extent of interdependence of variable quantities: the closer the coefficient is to an absolute value of 1, the higher the interdependence. Regression presumes some correlation. If you are doing a multiple linear regression, and the purpose of the correlation test was to screen variables by significance, then any insignificant variable should be left out of the multiple regression.
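For example, a minimal sketch of that screening step in Python, on synthetic data with hypothetical column names:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)),
                  columns=[f"x{i}" for i in range(1, 6)])
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=100)

# Keep only predictors whose correlation with y is significant at alpha = 0.05.
alpha = 0.05
significant = []
for col in df.columns.drop("y"):
    r, p = pearsonr(df[col], df["y"])
    print(f"{col}: r = {r:+.3f}, p = {p:.4f}")
    if p < alpha:
        significant.append(col)
print("candidates for the multiple regression:", significant)
```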
Vania Todorova, I have studied your correlation matrix and identified the following:
There is no correlation between certain variables. Statistically, there is no debate about whether to run a regression on such variables: if the correlation coefficient is zero, there is no correlation, and such variables should be expunged from the regression analysis. Remember, in simple regression the R in the model summary should be the same as the r in the correlation analysis. Therefore, when there is no correlation there is no need to run a regression, since one variable cannot predict the other.
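As a quick sanity check on that R-versus-r point, here is a hedged sketch on synthetic data (for simple regression, R equals the absolute value of Pearson's r):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]                       # Pearson's r
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(f"Pearson r = {r:.4f}")
print(f"model R   = {np.sqrt(fit.rsquared):.4f}")  # matches |r|
```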
Some correlation coefficients in your correlation matrix are very small, i.e., a very low degree of correlation. In a layman's view they seem useless and unimportant, but statistically they are important however small they are. It is worth running a regression on such variables to show whether there is significant prediction even at a very low degree of correlation.
Stephen Politzer-Ahles
I am in support of your point number three, as in my argument above.
Join this interesting debate, Richard M Kiai, John King'athia Karuitha, Paul Kiumbe.
Thank you! I thought that was the case but wanted to check with others too. I will let you know if it gives me better results when I clean up the data a little.
I am looking for guidelines on the installation and use of seaborn.heatmap; the output presented by Vania Todorova was so impressive that I wish to apply it to the data I have already corrected. Can it work on Windows 8.1?
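A minimal sketch of a correlation-matrix heatmap with seaborn is below (synthetic data, hypothetical column names). Seaborn installs with `pip install seaborn`; it is pure Python, so whether it works on Windows 8.1 depends mainly on which Python version you can install there.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(50, 5)), columns=list("ABCDE"))

corr = df.corr()  # Pearson correlation matrix
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, square=True)
plt.title("Correlation matrix")
plt.tight_layout()
plt.show()
```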
Hi, Stephen. My data is also in a similar situation: one variable has a small, non-significant, but positive correlation with the DV. However, when I entered it along with another variable into the regression model (these two IVs are moderately correlated), the coefficient of the first IV became negative and significant.
You've mentioned above that "a significant correlation is not a prerequisite for running regression". May I ask why this would happen, given the nature of regression? How should I explain this to my reviewers if they challenge it? Many thanks!
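What you describe sounds like a suppression effect. Here is a hedged simulation (entirely synthetic data, hypothetical coefficients) of that pattern: x1 correlates weakly and positively with y on its own, yet its coefficient turns negative and significant once the correlated predictor x2 enters the model.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
n = 100
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + 0.6 * rng.normal(size=n)    # x1 and x2 moderately correlated
y = -1.0 * x1 + 1.4 * x2 + rng.normal(size=n)

r, p = pearsonr(x1, y)
print(f"marginal correlation of x1 with y: r = {r:+.2f}, p = {p:.3f}")

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print(fit.params)   # coefficient on x1 is negative...
print(fit.pvalues)  # ...and significant once x2 is controlled for
```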
One important point to make clear is that linear regression is not necessarily linear!
Assume that your simple correlation does not show any association because the relation between the variables is approximately y = 3x + sin(x). You can still do a linear regression and find this relation, as long as you have a good guess about the behaviour of your data, by considering two independent variables:
x1 = x and x2 = sin(x)
In fact, one can do the same thing by transforming the variables before computing the correlation, but it is not as easy and "automatic" as the linear regression.
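A minimal sketch of that fit, on synthetic data: linear regression is linear in the parameters, so supplying x and sin(x) as the two regressors recovers the relation.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(-10, 10, size=200)
y = 3 * x + np.sin(x) + 0.1 * rng.normal(size=200)

X = sm.add_constant(np.column_stack([x, np.sin(x)]))  # x1 = x, x2 = sin(x)
fit = sm.OLS(y, X).fit()
print(fit.params)  # approximately [0, 3, 1]
```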
Kamila said: "I have five independent variables and one dependent variable. I have run a correlation matrix, and two of them correlate with the DV. If I run a multiple regression, should I include all the variables or just the correlated ones?"
First, if best predictive power is what you want, any model that is reasonable from a subject-matter perspective can be checked for fit. You could compare model performances with scatterplots for a "graphical residual analysis", perhaps putting more than one model's results on a given scatterplot. To guard against overfitting to a particular sample, use cross-validation: any model selection technique can be thwarted by a given sample if you overfit your model to it. Perhaps you could also do your graphical residual analyses on more than one sample.
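A hedged sketch of the cross-validation comparison, using scikit-learn on synthetic data (variable names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 5))                      # five candidate predictors
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=150)   # only two really matter

full = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
reduced = cross_val_score(LinearRegression(), X[:, :2], y, cv=5, scoring="r2")
print(f"full model    mean CV R^2: {full.mean():.3f}")
print(f"reduced model mean CV R^2: {reduced.mean():.3f}")
```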
A bias-variance tradeoff may occur: a more complex model tends to have higher variance unless the added variable is really needed (see Ken Brewer (2002), Combined Survey Sampling Inference, Arnold, pp. 109-110), while less complexity than needed may produce bias, especially omitted variable bias. It is best not to have too few variables, nor too many, but the right combination of variables.
In finite population sampling, one should expect heteroscedasticity: larger predicted values should be associated with a larger sigma for the estimated residuals (see https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity for a justification, as determined by Ken Brewer). Sometimes an absence of heteroscedasticity may mean you do not have the best combination of variables for predicted y (though a poor combination can sometimes cause more heteroscedasticity instead).
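A minimal sketch of the "graphical residual analysis" mentioned above: plot estimated residuals against predicted values and look for a funnel shape (sigma growing with the predictions). The data here is synthetic, with error scale proportional to the mean to mimic that pattern.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, size=200)
y = 5 * x + x * rng.normal(size=200)  # larger x -> larger error spread

fit = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("predicted y")
plt.ylabel("estimated residual")
plt.title("Residuals fan out as predictions grow")
plt.show()
```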
You do not need to establish correlations between the variables you want to include in your regression analysis, because variables that show no pairwise correlation with the DV can still show some kind of relationship when used as independent variables in a regression run.
Instead of answering your question, why don't you try an experiment? Run a regression with all the independent variables. Then run another regression eliminating the variables that have no correlation with your supposed dependent variable. See what happens; that would be something of an answer to your question. Cheers!
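A hedged sketch of that experiment, with hypothetical column names and synthetic data: fit one model with all five predictors, one with only the two that correlate with the DV, and compare the summaries.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(120, 5)),
                  columns=["x1", "x2", "x3", "x4", "x5"])
df["y"] = 1.5 * df["x1"] + 0.8 * df["x2"] + rng.normal(size=120)

full = sm.OLS(df["y"],
              sm.add_constant(df[["x1", "x2", "x3", "x4", "x5"]])).fit()
reduced = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

print(full.summary())
print(reduced.summary())
print(f"adjusted R^2: full = {full.rsquared_adj:.3f}, "
      f"reduced = {reduced.rsquared_adj:.3f}")
```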