Linear regression is an analysis that assesses whether one or more predictor variables explain the dependent (criterion) variable. The regression has five key assumptions:
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
A note about sample size: in linear regression, a common rule of thumb is that the analysis requires at least 20 cases per independent variable.
In the free software below, it's really easy to conduct a regression, and most of the assumptions are preloaded and interpreted for you.
First, linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots; the following two examples depict cases with little or no linearity.
Second, linear regression analysis requires all variables to be multivariate normal. This assumption can best be checked with a histogram or a Q-Q plot. Normality can also be checked with a goodness-of-fit test, e.g., the Kolmogorov-Smirnov test. When the data are not normally distributed, a non-linear transformation (e.g., a log transformation) might fix the issue.
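As a rough illustration of these first two checks, here is a minimal Python sketch (matplotlib and scipy, on made-up data; the variable names are purely illustrative and not tied to any dataset in this thread):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)  # made-up, roughly linear data

# Linearity check: scatter plot of the dependent variable against the predictor.
plt.scatter(x, y, s=10)
plt.xlabel("x")
plt.ylabel("y")

# Normality checks: histogram, Q-Q plot, and a Kolmogorov-Smirnov test
# on the standardized variable.
plt.figure()
plt.hist(y, bins=30)
stats.probplot(y, dist="norm", plot=plt.figure().gca())
z = (y - y.mean()) / y.std()
print(stats.kstest(z, "norm"))
plt.show()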
Third, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.
Multicollinearity may be tested with four central criteria:
1) Correlation matrix – when computing the matrix of Pearson's bivariate correlations among all independent variables, the correlation coefficients need to be smaller than 1.
2) Tolerance – the tolerance measures the influence of one independent variable on all other independent variables; the tolerance is calculated with an initial linear regression analysis. Tolerance is defined as T = 1 – R² for this first-step regression. With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there certainly is.
3) Variance Inflation Factor (VIF) – the variance inflation factor of the linear regression is defined as VIF = 1/T. With VIF > 10 there is an indication that multicollinearity may be present; with VIF > 100 there is certainly multicollinearity among the variables.
4) Condition index – the condition index is calculated using a factor analysis on the independent variables. Values of 10-30 indicate moderate multicollinearity among the linear regression variables; values > 30 indicate strong multicollinearity.
If multicollinearity is found in the data, centering the data (that is, deducting the mean of the variable from each score) might help to solve the problem. Simpler remedies are to remove independent variables with high VIF values, or to conduct a factor analysis and rotate the factors to ensure independence of the factors in the linear regression analysis.
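As a sketch of how the tolerance, VIF and condition-index checks can be computed (Python with statsmodels, on made-up, deliberately collinear data; note the condition indices below come straight from the singular values of the scaled design matrix rather than from a rotated factor analysis):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)  # deliberately collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# Tolerance and VIF for each predictor (VIF = 1 / T).
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1.0 / vif:.3f}")

# Condition indices from the singular values of the column-scaled design matrix.
Xs = X.values / np.linalg.norm(X.values, axis=0)
sv = np.linalg.svd(Xs, compute_uv=False)
print("condition indices:", np.round(sv.max() / sv, 1))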
Fourth, linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other; in other words, the value of y(x+1) is not independent from the value of y(x). This typically occurs in stock prices, for instance, where the current price is not independent from the previous price.
While a scatterplot of the residuals allows you to check for autocorrelation visually, you can test the linear regression model for autocorrelation with the Durbin-Watson test. Durbin-Watson's d tests the null hypothesis that the residuals are not linearly auto-correlated. While d can assume values between 0 and 4, values around 2 indicate no autocorrelation. As a rule of thumb, values of 1.5 < d < 2.5 suggest that there is no autocorrelation in the data. However, the Durbin-Watson test only analyses linear autocorrelation, and only between direct neighbors (first-order effects).
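For example, the Durbin-Watson statistic of a fitted model can be obtained like this (a small Python/statsmodels sketch on made-up data with independent errors):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)  # made-up data with independent errors

res = sm.OLS(y, sm.add_constant(x)).fit()
d = durbin_watson(res.resid)
print(f"Durbin-Watson d = {d:.2f}")  # values near 2 suggest no first-order autocorrelation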
The last assumption of the linear regression analysis is homoscedasticity. A scatter plot of the residuals against the predicted values is a good way to check whether the data are homoscedastic (meaning the residuals have roughly equal variance across the regression line). The following scatter plots show examples of data that are not homoscedastic (i.e., heteroscedastic):
The Goldfeld-Quandt test can also be used to test for heteroscedasticity. The test splits the data into two groups and tests whether the variances of the residuals are similar across the groups. If heteroscedasticity is present, a non-linear correction might fix the problem.
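A minimal sketch of the Goldfeld-Quandt test in Python/statsmodels, on made-up data whose error variance grows with the predictor:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, size=200))
y = 2.0 * x + rng.normal(scale=0.5 + 0.3 * x, size=200)  # error variance grows with x

X = sm.add_constant(x)
fstat, pval, _ = het_goldfeldquandt(y, X)
print(f"Goldfeld-Quandt F = {fstat:.2f}, p = {pval:.4f}")  # small p suggests heteroscedasticity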
Linear regression simply does what it says on the label, and makes no assumption that the relationship is really linear – that's not its job. It is the researcher who needs to be sure that it makes sense to model the relationship as linear.
I've attached a graph showing a relationship that looks linear, but the linear regression implies that the number of clients drops to zero at age 35. A lowess smoother shows that the relationship plateaus at the later ages.
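Roughly, that kind of lowess-over-the-fit check can be reproduced like this in Python (statsmodels lowess on made-up age/clients data that plateaus at higher ages, so not the actual data from the graph):

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(4)
age = np.sort(rng.uniform(20, 60, size=200))
clients = 40 - 0.8 * np.minimum(age, 40) + rng.normal(scale=3, size=200)  # plateaus after 40

fit = sm.OLS(clients, sm.add_constant(age)).fit()
smooth = lowess(clients, age, frac=0.4)

plt.scatter(age, clients, s=10)
plt.plot(age, fit.predict(sm.add_constant(age)), label="linear fit")
plt.plot(smooth[:, 0], smooth[:, 1], label="lowess")
plt.xlabel("age")
plt.ylabel("clients")
plt.legend()
plt.show()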
The F-test reported with the R² is a significance test of the R². This test indicates whether a significant amount of variance (significantly different from zero) was explained by the model.
Coefficient of determination (R²): The coefficient of determination is a measure of the amount of variance in the dependent variable explained by the independent variable(s). A value of one (1) means perfect explanation and is not encountered in reality due to ever-present error. A value of .91 means that 91% of the variance in the dependent variable is explained by the independent variables.
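For example, both quantities are reported by any standard OLS routine; a small Python/statsmodels sketch on made-up data (names purely illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = 1.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

res = sm.OLS(y, sm.add_constant(X)).fit()
print(f"R^2 = {res.rsquared:.3f}")  # share of variance explained
print(f"F = {res.fvalue:.1f}, p = {res.f_pvalue:.4g}")  # tests H0 that R^2 is zero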
You mention free software – what are you using? I'm at present looking for something for a short course whose participants will lose the will to live if I try to introduce R!
According to Hair in his book "Multivariate Data Analysis", the assumption of a linear relationship among the predictor variables gives the model the property of homogeneity, so the coefficients directly express the effect of changes in the predictor variables. When the assumption of linearity is violated, a variety of conditions can occur, such as multicollinearity, heteroscedasticity, or serial correlation (due to non-independence of error terms). All of these conditions require correction before statistical inferences of any validity can be made from a regression equation.
Basically, the linearity assumption should be examined because if the data are not linear, the regression results are not valid.
@Ronán, Blue Sky Statistics and JASP might be worth looking into. JASP has attractive output and is reasonably complete with output options (e.g., for ANOVA it can give you an interaction plot, partial eta-squared, a Q-Q plot of residuals, and post-hoc tests).
A decisive linear regression model assumption is the linearity of observations (Green & Salkind, 2014; M. Williams et al., 2013). The coefficient of determination (R²) measures how much variance in the criterion variable occurs through the linear combination of predictor variables (Fritz et al., 2012; Green & Salkind, 2014; Nathans et al., 2012). If a researcher violates the linearity assumption, then the calculated coefficients will lead to erroneous conclusions concerning the nature as well as the strength of the relationships between regression model variables (M. Williams et al., 2013). Moreover, a linearity violation breaches the assumption that the conditional mean of the errors is zero, which can result in biased regression coefficients (M. Williams et al., 2013).
References
Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141(1), 2-18. doi:10.1037/a0024338
Green, S. B., & Salkind, N. J. (2014). Using SPSS for Windows and Macintosh: Analyzing and understanding data. Upper Saddle River, NJ: Pearson Education.
Nathans, L. L., Oswald, F. L., & Nimon, K. (2012). Interpreting multiple linear regression: A guidebook of variable importance. Practical Assessment, Research & Evaluation, 17(9), 1-19. doi:10.3102/00346543074004525
Williams, M., Grajales, C. A. G., & Kurkiewicz, D. (2013). Assumptions of multiple regression: Correcting two misconceptions. Practical Assessment, Research & Evaluation, 18(11), 1-14. Retrieved from http://pareonline.net
Note that linearity of the regression implies that no other functions of the x variables (regressors) are relevant in explaining the expected value of the response variable y. If this assumption is violated, the linear regression model is misspecified. Generally, functional-form misspecification causes bias in the remaining parameter estimators.
Perform diagnostic tests for violations of the linear regression assumptions (for example, the RESET test). If violations are found, use appropriate corrections.
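For instance, recent versions of statsmodels provide a RESET test; here is a sketch on made-up data with a genuinely quadratic relationship, so the linear specification should be flagged:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

rng = np.random.default_rng(6)
x = rng.uniform(0, 5, size=200)
y = 1.0 + 0.5 * x**2 + rng.normal(size=200)  # truly quadratic relationship

res = sm.OLS(y, sm.add_constant(x)).fit()
print(linear_reset(res, power=2, use_f=True))  # a small p-value points to functional-form misspecification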
I'm sorry, but the question by Adhikari V V Subba Rao amounts to half of a course in basic statistics! I could answer it, but he would have to attend my classes for a few weeks…
I'm a bit late to this party, but I don't think anyone has mentioned yet that linear regression is linear in the coefficients, or linear in the parameters. Haitham, if you Google those phrases, you'll find examples of fitting curvilinear functional relationships via linear regression. Here's one example.
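For instance, a quadratic relationship can still be fitted by ordinary least squares, because the model remains linear in the parameters; a minimal Python/statsmodels sketch on made-up data (names illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.7 * x**2 + rng.normal(size=200)  # curvilinear relationship

# y = b0 + b1*x + b2*x^2 is linear in the parameters b0, b1, b2.
X = sm.add_constant(np.column_stack([x, x**2]))
res = sm.OLS(y, X).fit()
print(res.params)  # estimates of b0, b1, b2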
What are the properties of instrumental variable regression and when do we say that instrumental variables are weak?
Dear Respected colleague,
First, a dataset should always be explored to see whether it meets the assumptions of the statistical methods applied. The multivariate data analyses we intend to use assume normality, linearity, and absence of multicollinearity.
Normality refers to the shape of the data distribution for an individual variable and its correspondence to the normal distribution. The assumption of normality can be examined by looking at histograms of the data and by checking skewness and kurtosis. The distribution is considered approximately normal when it is bell-shaped and the values of skewness and kurtosis are close to zero.
The linearity of the relationship between the dependent and independent variables represents the way changes in the dependent variable are associated with the independent variables, namely, that there is a straight-line relationship between the independent variables and dependent variable. This assumption is essential as regression analysis only tests for a linear relationship between the independent variables and dependent variable. Pearson correlation can capture the linear association between variables.
If the assumptions of regression analysis are met, then the errors associated with one observation are not correlated with the errors of any other observation. Independence of residuals can be examined via the Durbin-Watson statistic, which tests for correlations between errors; specifically, it tests whether adjacent residuals are correlated. As a rule of thumb, researchers suggest that Durbin-Watson values less than 1 or greater than 3 are a definite cause for concern, whereas values closer to 2 indicate that the residuals are acceptable.
Multicollinearity is the existence of a strong linear relationship among variables, and it prevents the effect of each variable from being identified. Researchers recommend examining the variance inflation factor (VIF) and tolerance level (TOL) as tools for multicollinearity diagnostics. VIF represents the increase in variance that exists due to collinearities and interrelationships among the variables. VIFs larger than 10 indicate strong multicollinearity; as a rule of thumb, tolerance values (TOL = 1/VIF) should be greater than 0.1.
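A brief Python/scipy sketch of these checks (skewness/kurtosis, Pearson correlation, tolerance) on made-up data; the variable names are illustrative only:

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x1 = rng.normal(size=200)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=200)
y = x1 + x2 + rng.normal(size=200)

# Normality of an individual variable: skewness and excess kurtosis close to zero.
print("skew:", round(stats.skew(y), 2), "excess kurtosis:", round(stats.kurtosis(y), 2))

# Linearity: Pearson correlation between a predictor and the dependent variable.
r, p = stats.pearsonr(x1, y)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")

# Tolerance of x1 given x2: 1 - R^2 from regressing x1 on the other predictor; VIF = 1 / TOL.
slope, intercept, r_x, p_x, se = stats.linregress(x2, x1)
tol = 1 - r_x**2
print(f"tolerance = {tol:.2f}, VIF = {1 / tol:.2f}")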
The next step is to assess the overall model fit in supporting the research hypotheses. This is done, firstly, by examining the adjusted R squared (R²) to see the percentage of total variance of the dependent variable explained by the regression model. Whereas R² tells us how much variation in the dependent variable is accounted for by the regression model, the adjusted value tells us how much variance in the dependent variable would be accounted for if the model had been derived from the population from which the sample was taken. Specifically, it reflects the goodness of fit of the model to the population, taking into account the sample size and the number of predictors used.
Next, the statistical significance of the regression analysis must be examined through ANOVA and F ratios. Analysis of variance (ANOVA) consists of calculations that provide information about levels of variability within a regression model and form a basis for tests of significance. In the regression context, the F-test checks whether the model as a whole explains a statistically significant amount of variance in the dependent variable.
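These quantities are all part of the standard regression output; as a sketch, in Python/statsmodels they could be pulled out like this (made-up data, illustrative names):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 0.8 * df["x1"] - 0.4 * df["x2"] + rng.normal(size=200)

res = smf.ols("y ~ x1 + x2", data=df).fit()
print(f"R^2 = {res.rsquared:.3f}, adjusted R^2 = {res.rsquared_adj:.3f}")
print(f"overall F = {res.fvalue:.1f}, p = {res.f_pvalue:.3g}")
print(sm.stats.anova_lm(res, typ=2))  # ANOVA table with an F statistic per predictor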