Linear regression is an analysis that assesses whether one or more predictor variables explain the dependent (criterion) variable. The regression has five key assumptions:
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
A note about sample size: in linear regression, a common rule of thumb is that the analysis requires at least 20 cases per independent variable.
In the free software below, it's really easy to conduct a regression, and most of the assumptions are preloaded and interpreted for you.
First, linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots; the following two examples depict cases with little or no linearity.
Second, linear regression analysis requires all variables to be multivariate normal. This assumption can best be checked with a histogram or a Q-Q plot. Normality can also be checked with a goodness-of-fit test, e.g., the Kolmogorov-Smirnov test. When the data are not normally distributed, a non-linear transformation (e.g., a log transformation) might fix the issue.
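As a rough illustration of these first two checks, here is a minimal Python sketch (matplotlib and scipy, on made-up data; the variable names are purely illustrative and not tied to any dataset in this thread):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)  # made-up, roughly linear data

# Linearity check: scatter plot of the dependent variable against the predictor.
plt.scatter(x, y, s=10)
plt.xlabel("x")
plt.ylabel("y")

# Normality checks: histogram, Q-Q plot, and a Kolmogorov-Smirnov test
# on the standardized variable.
plt.figure()
plt.hist(y, bins=30)
stats.probplot(y, dist="norm", plot=plt.figure().gca())
z = (y - y.mean()) / y.std()
print(stats.kstest(z, "norm"))
plt.show()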
Third, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.
Multicollinearity may be tested with four central criteria:
1) Correlation matrix – when computing the matrix of Pearson's bivariate correlations among all independent variables, the correlation coefficients need to be smaller than 1.
2) Tolerance – the tolerance measures the influence of one independent variable on all other independent variables; the tolerance is calculated with an initial linear regression analysis. Tolerance is defined as T = 1 – R² for this first-step regression. With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there certainly is.
3) Variance Inflation Factor (VIF) – the variance inflation factor of the linear regression is defined as VIF = 1/T. With VIF > 10 there is an indication that multicollinearity may be present; with VIF > 100 there is certainly multicollinearity among the variables.
4) Condition index – the condition index is calculated using a factor analysis on the independent variables. Values of 10-30 indicate moderate multicollinearity among the linear regression variables; values > 30 indicate strong multicollinearity.
If multicollinearity is found in the data, centering the data (that is, deducting the mean of the variable from each score) might help to solve the problem. Simpler remedies are to remove independent variables with high VIF values, or to conduct a factor analysis and rotate the factors to ensure independence of the factors in the linear regression analysis.
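As a sketch of how the tolerance, VIF and condition-index checks can be computed (Python with statsmodels, on made-up, deliberately collinear data; note the condition indices below come straight from the singular values of the scaled design matrix rather than from a rotated factor analysis):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)  # deliberately collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# Tolerance and VIF for each predictor (VIF = 1 / T).
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1.0 / vif:.3f}")

# Condition indices from the singular values of the column-scaled design matrix.
Xs = X.values / np.linalg.norm(X.values, axis=0)
sv = np.linalg.svd(Xs, compute_uv=False)
print("condition indices:", np.round(sv.max() / sv, 1))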
Fourth, linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other; in other words, the value of y(x+1) is not independent from the value of y(x). This typically occurs in stock prices, for instance, where the current price is not independent from the previous price.
While a scatterplot of the residuals allows you to check for autocorrelation visually, you can test the linear regression model for autocorrelation with the Durbin-Watson test. Durbin-Watson's d tests the null hypothesis that the residuals are not linearly auto-correlated. While d can assume values between 0 and 4, values around 2 indicate no autocorrelation. As a rule of thumb, values of 1.5 < d < 2.5 suggest that there is no autocorrelation in the data. However, the Durbin-Watson test only analyses linear autocorrelation, and only between direct neighbors (first-order effects).
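For example, the Durbin-Watson statistic of a fitted model can be obtained like this (a small Python/statsmodels sketch on made-up data with independent errors):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)  # made-up data with independent errors

res = sm.OLS(y, sm.add_constant(x)).fit()
d = durbin_watson(res.resid)
print(f"Durbin-Watson d = {d:.2f}")  # values near 2 suggest no first-order autocorrelation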
The last assumption of the linear regression analysis is homoscedasticity. A scatter plot of the residuals against the predicted values is a good way to check whether the data are homoscedastic (meaning the residuals have roughly equal variance across the regression line). The following scatter plots show examples of data that are not homoscedastic (i.e., heteroscedastic):
The Goldfeld-Quandt test can also be used to test for heteroscedasticity. The test splits the data into two groups and tests whether the variances of the residuals are similar across the groups. If heteroscedasticity is present, a non-linear correction might fix the problem.
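A minimal sketch of the Goldfeld-Quandt test in Python/statsmodels, on made-up data whose error variance grows with the predictor:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, size=200))
y = 2.0 * x + rng.normal(scale=0.5 + 0.3 * x, size=200)  # error variance grows with x

X = sm.add_constant(x)
fstat, pval, _ = het_goldfeldquandt(y, X)
print(f"Goldfeld-Quandt F = {fstat:.2f}, p = {pval:.4f}")  # small p suggests heteroscedasticity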
Linear regression simply does what it says on the label, and makes no assumption that the relationship is really linear – that's not its job. It is the researcher who needs to be sure that it makes sense to model the relationship as linear.
I've attached a graph showing a relationship that looks linear, but the linear regression implies that the number of clients drops to zero at age 35. A lowess smoother shows that the relationship plateaus at the later ages.
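Roughly, that kind of lowess-over-the-fit check can be reproduced like this in Python (statsmodels lowess on made-up age/clients data that plateaus at higher ages, so not the actual data from the graph):

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(4)
age = np.sort(rng.uniform(20, 60, size=200))
clients = 40 - 0.8 * np.minimum(age, 40) + rng.normal(scale=3, size=200)  # plateaus after 40

fit = sm.OLS(clients, sm.add_constant(age)).fit()
smooth = lowess(clients, age, frac=0.4)

plt.scatter(age, clients, s=10)
plt.plot(age, fit.predict(sm.add_constant(age)), label="linear fit")
plt.plot(smooth[:, 0], smooth[:, 1], label="lowess")
plt.xlabel("age")
plt.ylabel("clients")
plt.legend()
plt.show()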
The F-test reported with the R² is a significance test of the R². This test indicates whether a significant amount of variance (significantly different from zero) was explained by the model.
Coefficient of determination (R²): The coefficient of determination is a measure of the amount of variance in the dependent variable explained by the independent variable(s). A value of one (1) means perfect explanation and is not encountered in reality due to ever-present error. A value of .91 means that 91% of the variance in the dependent variable is explained by the independent variables.
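For example, both quantities are reported by any standard OLS routine; a small Python/statsmodels sketch on made-up data (names purely illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = 1.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

res = sm.OLS(y, sm.add_constant(X)).fit()
print(f"R^2 = {res.rsquared:.3f}")  # share of variance explained
print(f"F = {res.fvalue:.1f}, p = {res.f_pvalue:.4g}")  # tests H0 that R^2 is zero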
You mention free software – what are you using? I'm at present looking for something for a short course whose participants will lose the will to live if I try to introduce R!
According to Hair in his book "Multivariate Data Analysis", the assumption of a linear relationship among the predictor variables gives the model the property of homogeneity, so the coefficients directly express the effect of changes in the predictor variables. When the assumption of linearity is violated, a variety of conditions can occur, such as multicollinearity, heteroscedasticity, or serial correlation (due to non-independence of error terms). All of these conditions require correction before statistical inferences of any validity can be made from a regression equation.
Basically, the linearity assumption should be examined because if the data are not linear, the regression results are not valid.
@Ronán, Blue Sky Statistics and JASP might be worth looking into. JASP has attractive output and is reasonably complete with output options (e.g., for ANOVA it can give you an interaction plot, partial eta-squared, a Q-Q plot of residuals, and post-hoc tests).
A decisive linear regression model assumption is the linearity of observations (Green & Salkind, 2014; M. Williams et al., 2013). The coefficient of determination (R²) measures how much variance in the criterion variable occurs through the linear combination of predictor variables (Fritz et al., 2012; Green & Salkind, 2014; Nathans et al., 2012). If a researcher violates the linearity assumption, then the calculated coefficients will lead to erroneous conclusions concerning the nature as well as the strength of the relationships between regression model variables (M. Williams et al., 2013). Moreover, a linearity violation breaches the assumption that the conditional mean of the errors is zero, which can result in biased regression coefficients (M. Williams et al., 2013).
References
Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141(1), 2-18. doi:10.1037/a0024338
Green, S. B., & Salkind, N. J. (2014). Using SPSS for Windows and Macintosh: Analyzing and understanding data. Upper Saddle River, NJ: Pearson Education.
Nathans, L. L., Oswald, F. L., & Nimon, K. (2012). Interpreting multiple linear regression: A guidebook of variable importance. Practical Assessment, Research & Evaluation, 17(9), 1-19. doi:10.3102/00346543074004525
Williams, M., Grajales, C. A. G., & Kurkiewicz, D. (2013). Assumptions of multiple regression: Correcting two misconceptions. Practical Assessment, Research & Evaluation, 18(11), 1-14. Retrieved from http://pareonline.net
Note that linearity of the regression implies that no other functions of the x variables (regressors) are relevant in explaining the expected value of the response variable y. If this assumption is violated, the linear regression model is misspecified. Generally, functional-form misspecification causes bias in the remaining parameter estimators.
Perform diagnostic tests for violations of the linear regression assumptions (for example, the RESET test). If violations are found, use appropriate corrections.
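For instance, recent versions of statsmodels provide a RESET test; here is a sketch on made-up data with a genuinely quadratic relationship, so the linear specification should be flagged:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

rng = np.random.default_rng(6)
x = rng.uniform(0, 5, size=200)
y = 1.0 + 0.5 * x**2 + rng.normal(size=200)  # truly quadratic relationship

res = sm.OLS(y, sm.add_constant(x)).fit()
print(linear_reset(res, power=2, use_f=True))  # a small p-value points to functional-form misspecification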
I'm sorry, but the question by Adhikari V V Subba Rao amounts to half of a course in basic statistics! I could answer it, but he would have to attend my classes for a few weeks…
I'm a bit late to this party, but I don't think anyone has mentioned yet that linear regression is linear in the coefficients, or linear in the parameters. Haitham, if you Google those phrases, you'll find examples of fitting curvilinear functional relationships via linear regression. Here's one example.
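For instance, a quadratic relationship can still be fitted by ordinary least squares, because the model remains linear in the parameters; a minimal Python/statsmodels sketch on made-up data (names illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.7 * x**2 + rng.normal(size=200)  # curvilinear relationship

# y = b0 + b1*x + b2*x^2 is linear in the parameters b0, b1, b2.
X = sm.add_constant(np.column_stack([x, x**2]))
res = sm.OLS(y, X).fit()
print(res.params)  # estimates of b0, b1, b2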
What are the properties of instrumental variable regression and when do we say that instrumental variables are weak?
Dear Respected colleague,
First, a dataset should always be explored to see whether it meets the assumptions of the statistical methods applied. The multivariate data analyses we intend to use assume normality, linearity, and absence of multicollinearity.
Normality refers to the shape of the data distribution for an individual variable and its correspondence to the normal distribution. The assumption of normality can be examined by looking at histograms of the data and by checking skewness and kurtosis. The distribution is considered approximately normal when it is bell-shaped and the values of skewness and kurtosis are close to zero.
The linearity of the relationship between the dependent and independent variables represents the way changes in the dependent variable are associated with the independent variables, namely, that there is a straight-line relationship between the independent variables and dependent variable. This assumption is essential as regression analysis only tests for a linear relationship between the independent variables and dependent variable. Pearson correlation can capture the linear association between variables.
If the assumptions of regression analysis are met, then the errors associated with one observation are not correlated with the errors of any other observation. Independence of residuals can be examined via the Durbin-Watson statistic, which tests for correlations between errors; specifically, it tests whether adjacent residuals are correlated. As a rule of thumb, researchers suggest that Durbin-Watson values less than 1 or greater than 3 are a definite cause for concern, whereas values closer to 2 indicate that the residuals are acceptable.
Multicollinearity is the existence of a strong linear relationship among variables, and it prevents the effect of each variable from being identified. Researchers recommend examining the variance inflation factor (VIF) and tolerance level (TOL) as tools for multicollinearity diagnostics. VIF represents the increase in variance that exists due to collinearities and interrelationships among the variables. VIFs larger than 10 indicate strong multicollinearity; as a rule of thumb, tolerance values (TOL = 1/VIF) should be greater than 0.1.
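A brief Python/scipy sketch of these checks (skewness/kurtosis, Pearson correlation, tolerance) on made-up data; the variable names are illustrative only:

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x1 = rng.normal(size=200)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=200)
y = x1 + x2 + rng.normal(size=200)

# Normality of an individual variable: skewness and excess kurtosis close to zero.
print("skew:", round(stats.skew(y), 2), "excess kurtosis:", round(stats.kurtosis(y), 2))

# Linearity: Pearson correlation between a predictor and the dependent variable.
r, p = stats.pearsonr(x1, y)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")

# Tolerance of x1 given x2: 1 - R^2 from regressing x1 on the other predictor; VIF = 1 / TOL.
slope, intercept, r_x, p_x, se = stats.linregress(x2, x1)
tol = 1 - r_x**2
print(f"tolerance = {tol:.2f}, VIF = {1 / tol:.2f}")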
The next step is to assess the overall model fit in supporting the research hypotheses. This is done, firstly, by examining the adjusted R squared (R²) to see the percentage of total variance of the dependent variable explained by the regression model. Whereas R² tells us how much variation in the dependent variable is accounted for by the regression model, the adjusted value tells us how much variance in the dependent variable would be accounted for if the model had been derived from the population from which the sample was taken. Specifically, it reflects the goodness of fit of the model to the population, taking into account the sample size and the number of predictors used.
Next, the statistical significance of the regression analysis must be examined through ANOVA and F ratios. Analysis of variance (ANOVA) consists of calculations that provide information about levels of variability within a regression model and form a basis for tests of significance. In the regression context, the F-test checks whether the model as a whole explains a statistically significant amount of variance in the dependent variable.
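These quantities are all part of the standard regression output; as a sketch, in Python/statsmodels they could be pulled out like this (made-up data, illustrative names):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 0.8 * df["x1"] - 0.4 * df["x2"] + rng.normal(size=200)

res = smf.ols("y ~ x1 + x2", data=df).fit()
print(f"R^2 = {res.rsquared:.3f}, adjusted R^2 = {res.rsquared_adj:.3f}")
print(f"overall F = {res.fvalue:.1f}, p = {res.f_pvalue:.3g}")
print(sm.stats.anova_lm(res, typ=2))  # ANOVA table with an F statistic per predictor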