There are so many assumptions to fulfil before running linear regression (linear relationship, multivariate normality, no multicollinearity, no autocorrelation, homoscedasticity, independence). How do we check all these assumptions using SPSS?
Graphs are generally useful and recommended when checking assumptions. There are many different kinds of graphs proposed for multiple linear regression, and SPSS has only partial coverage of them; you need to make some of them manually. I recommend a book such as "Regression Analysis by Example" by Chatterjee and Hadi for further reference. Added a link with some example graphs.
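For illustration, here is a minimal syntax sketch of the manual route; the variable names y, x1 and x2 are placeholders, not from your data, so adjust them to your own model.

* Sketch only; y, x1 and x2 are placeholder variable names.
* Fit the model, request the built-in residual plots, and save the
* predicted values and residuals (PRE_1, RES_1, ZRE_1) for manual graphs.
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID)
  /SAVE PRED RESID ZRESID.
* One graph SPSS does not build automatically: residuals against a predictor.
GRAPH
  /SCATTERPLOT(BIVAR)=x1 WITH RES_1.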
Regarding your question (How do we check all the assumptions of linear regression using SPSS?), please check the following links; I hope they are useful and clarify your doubt.
We recommend inspecting a scatter plot to look for an underlying linear relationship and, for larger samples, the Koenker test for homoscedasticity. I have some instructions for the latter in SPSS but cannot access them at present. The normality assumption is less important for robust use of linear regression.
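To make the visual part concrete, a minimal syntax sketch (y and x1 are placeholder names). As far as I know the Koenker test itself is not in the standard REGRESSION output and is usually run through a downloadable macro, so it is not shown here.

* Sketch only; y and x1 are placeholder variable names.
* Scatter plot to judge whether the relationship looks linear.
GRAPH
  /SCATTERPLOT(BIVAR)=x1 WITH y.
* Standardised residuals against standardised predicted values;
* a fan or funnel shape suggests heteroscedasticity.
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1
  /SCATTERPLOT=(*ZRESID ,*ZPRED).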
Thanks for your replies. We have gone through the theoretical part. However, when we try to put it into practice in SPSS, it is not very clear how to proceed. Therefore, I would like to know how to check these assumptions using SPSS.
Yes, you are right; the assumptions you mention are all very much theoretical.
When you run the linear regression model and the p-value of the F test is > .05, it indicates that a linear relationship amongst the variables is ruled out.
In that case, look at the p-values of the individual predictors to eliminate the one with the least effect on the dependent variable. In other words, eliminate the independent variable with the highest p-value among those whose p-value is > .05. Re-run the linear regression until the p-value next to the F value in the ANOVA table becomes < 0.05, and then keep eliminating variables one at a time until the p-values of all the variables in the coefficients table are < 0.05.
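For what it is worth, the repeated removal described above roughly corresponds to backward elimination, which SPSS can automate. A minimal sketch, assuming placeholder variables y, x1, x2 and x3 (POUT controls the removal threshold):

* Sketch only; y, x1, x2 and x3 stand in for your own variables.
* METHOD=BACKWARD refits the model repeatedly, each time dropping the
* predictor whose p-value exceeds the POUT criterion, until none remain above it.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /DEPENDENT y
  /METHOD=BACKWARD x1 x2 x3.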
The above takes care of all your theoretical assumptions: linear relationship, multivariate normality, collinearity, homoscedasticity and independence.
Thanks, that was a great explanation. However, I would like to know what happens to the credibility of my study if I start removing predictor variables one by one. And what happens when one of them is the main input variable in the hypothesis which I want to prove?
Well, when you eliminate a variable, you are partially ruling out its impact on the dependent variable; in other words, it has either no impact or a negligible one. Your hypothesis itself is a statement, an assumption, and it can be proved right or wrong.
The outcome would then be that one of the variables you assumed to have an impact turns out not to.
Further, you can use PLS (partial least squares) to get a fuller picture: the size of the impact, the related coefficients, the linear equation, etc.
You now ask: "I would like to know what happens to the credibility of my study if I start removing predictor variables one by one."
The question has changed quite a lot!
I do not understand why you are doing this.
I presume that you have thought long and hard about what you think are theoretically relevant variables, and you have laboured to collect data on them, and now you want to drop them! Why? Why not simply report what you have found so that other researchers can see your full results?
But back to your original question. I prefer to see assumptions in terms of issues and problems with models and data, and it is these problems that need to be tackled (see the syntax sketch after this list):
1) Outliers in the response show up on a normal probability plot; these can have a big effect. Are they errors? Or do a with-and-without analysis and report both.
2) Collinearity between predictors: calculate a variance inflation factor to see how bad the problem is. Is the same underlying variable being included more than once?
3) Model mis-specification: partial residual plots will reveal non-linearities, and thinking hard and knowing the literature will help with omitted variables.
4) Distributional assumptions: I would not worry about non-normality (except outliers), but I would be concerned about heteroscedasticity and especially residual dependence; the latter can be handled by more advanced models designed explicitly for such situations.
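A rough SPSS syntax sketch touching points 1 to 4 above (y, x1 and x2 are placeholder names; note that /PARTIALPLOT produces partial regression, i.e. added-variable, plots rather than partial residual plots, though they serve a similar diagnostic purpose):

* Sketch only; y, x1 and x2 stand in for your own variables.
* COLLIN and TOL give tolerance and the variance inflation factor (point 2).
* PARTIALPLOT gives added-variable plots for mis-specification (point 3).
* NORMPROB and CASEWISE flag unusual residuals (point 1).
* DURBIN and the ZRESID-by-ZPRED plot bear on dependence and
* heteroscedasticity (point 4).
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /PARTIALPLOT ALL
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS DURBIN NORMPROB(ZRESID)
  /CASEWISE PLOT(ZRESID) OUTLIERS(3).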
In general, significance tests (t, F, etc.) are not useful here (they assume the model is correct), for they are designed to assess the extent to which a result found in a sample holds in the wider population, that is, the extent to which you could have got an estimate this big if the true underlying value were zero. This is greatly affected by sample size, the variation in the predictors and collinearity between predictors, not just the size of the true effect.
I fully agree with Professor Jones. There is a tendency to drop non-significant variables from the analysis, but testing the assumptions of regression analysis is a different matter, as Prof. Jones suggests.
Thanks for the reply, Prof. Jones. Actually, I am not a fan of removing non-significant variables either. I was just wondering what would happen to my results if I started removing non-significant variables one by one, as someone suggested.
I'll try to include your suggestions in the analysis. Thanks a lot.