I am performing a linear regression analysis in SPSS, and my dependent variable is not normally distributed. Could anyone tell me whether the results are valid in such a case? If not, what are the possible solutions?
Thank you in advance
Here are a few observations.
1. The normality assumption for linear regression applies to the errors, not the outcome variable per se (and most certainly not to the explanatory variables). The usual statement is that the errors are i.i.d. (i.e., independently and identically distributed) as Normal with a mean of 0 and some variance. Independence and homoscedasticity are more important assumptions than normality.
2. As George Box famously noted: “…the statistician knows…that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.” (JASA, 1976, Vol. 71, 791-799) Therefore, the normality assumption will never be exactly true when one is working with real data.
3. Non-normality of the errors will have some impact on the precise p-values of the tests on coefficients etc. But if the distribution is not too grossly non-normal, the tests will still provide good approximations.
4. As Michael suggested, it is useful to look at diagnostics, including residual plots. But note the distinction between residuals and errors (see link below). The former are observable, whereas the latter are not. (I would also look at measures of influence, such as Cook's distance.)
HTH.
p.s. - Of course, depending on the nature of your outcome variable, some other form of regression may be far more appropriate--e.g., Poisson or Negative Binomial regression for analysis of count variables.
http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics
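To illustrate point 1, here is a minimal Python sketch (not SPSS) on simulated data. The data-generating model, the helper names `skewness` and `fit_line`, and the sample size are all assumptions made for illustration. It shows an outcome that is clearly skewed while the residuals are not, because the skew comes from the predictor rather than from the errors.

```python
import random
import statistics

random.seed(1)

def skewness(values):
    """Sample skewness (third standardized moment); about 0 for symmetric data."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical data: a strongly skewed predictor, but Normal errors.
n = 2000
x = [random.expovariate(1.0) for _ in range(n)]
y = [1.0 + 3.0 * xi + random.gauss(0.0, 1.0) for xi in x]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print("skewness of outcome:  ", round(skewness(y), 2))
print("skewness of residuals:", round(skewness(residuals), 2))
```

In this setup a histogram of the outcome alone would look non-normal, yet the regression assumptions about the errors hold by construction.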
There are whole textbook chapters on this issue, so it's hard to cover fully, but here's a short answer whose points you may want to explore further.
Technically speaking, the results are not valid, though often, if your sample is large enough and the deviation from normality is not too big, they should be reasonably close to what you would obtain if you weren't violating the assumptions of the test. That said, you should examine the various diagnostics that SPSS and other software offer, such as residual plots, leverage plots/statistics, and box plots (for outliers), to see if there are other issues in your data.
Specific to violations of normality, you can also transform your dependent variable (log and square root transformations are common, though which one is appropriate depends on the distribution of your outcome variable) and compare your results. With transformed variables it's harder to interpret the results, since they are no longer in the units in which you measured the variable. So if the results are similar, you'll often present the untransformed results for ease of interpretation, with a note that you compared them to those with the appropriate transformation.
Hope this helps
A linear regression is valid whenever you can ask the question "What is the increase in the predicted variable for a one-unit increase in the predictor?"
This implies that the effect of the predictor will be the same throughout the range of the predicted variable. And this may be an untenable assumption.
But, of course, there are many linear models, and part of the solution is going to be choosing a linear model that matches the research question. If you are counting, say, episodes of illness, then a Poisson model or related model will be more informative than least-squares regression.
But first decide if the theory behind your research allows you to ask a linear question.
Just to emphasize Bruce's most important message: it is the error terms that should be normal, not the dependent variable.
Also, if you have a reasonably large amount of data, then homoskedasticity (equal variances of the residuals) is more important than normality.
If you worry about heteroskedasticity (unequal variances), then employ "robust errors" (this will not influence the point estimates of the coefficients, only standard errors and confidence intervals.)
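As a rough illustration of what "robust errors" do, here is a minimal Python sketch for a simple regression. It uses the White (HC0) estimator, which is one of several robust variants and not necessarily the one your software applies by default. The function name `ols_with_se` and the simulated data are assumptions for illustration only.

```python
import math
import random

random.seed(2)

def ols_with_se(x, y):
    """Simple regression y = a + b*x.

    Returns (b, classical_se, robust_se), where robust_se is the
    White/HC0 heteroskedasticity-consistent standard error of the slope.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Classical SE assumes one common error variance.
    s2 = sum(e * e for e in resid) / (n - 2)
    classical = math.sqrt(s2 / sxx)
    # HC0 lets each observation contribute its own squared residual.
    robust = math.sqrt(sum((xi - mx) ** 2 * e * e for xi, e in zip(x, resid))) / sxx
    return b, classical, robust

# Hypothetical heteroskedastic data: the error spread grows with x.
x = [random.uniform(0.0, 10.0) for _ in range(5000)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 0.1 + 0.1 * xi ** 2) for xi in x]

slope, se_classical, se_robust = ols_with_se(x, y)
print("slope:", round(slope, 3))
print("classical SE:", round(se_classical, 4), "robust SE:", round(se_robust, 4))
```

The slope is the same number in both cases; only the standard error attached to it changes.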
Cheers -- Harald
The basic answer is no. If you think/know the outcome is not normally distributed, then it's not okay to use OLS (without correcting for that). The choice of alternative depends on the distribution that you have.
Dear Simin,
It is a common misconception that the outcome variable in linear regression needs to be normally distributed. Only the residuals need to be normally distributed.
In SPSS, you can check the normality of the residuals using a histogram and a P-P plot of the standardized residuals (Analyze--Regression--Plots--Standardized Residual Plots--Histogram & Normal probability plot).
Cheers,
Zeljko
Thank you so much for your answers and help.
I think I'll first check the normality of my residuals...
Simin - You will find in perhaps most cases there is heteroscedasticity in your residuals, which is especially true in regression through the origin. People often transform this to OLS so that they can use hypothesis tests. In my opinion, this is unnecessary and not very useful. The two attachments here are notes on WLS regression (sorry, notation is not good on the first page), and a letter on hypothesis tests. I think it is often better to use confidence intervals. Note that normality does not always hold there either. But the standard errors are important and can be worked with.
Article Properties of Weighted Least Squares Regression for Cutoff S...
Article Practical Interpretation of Hypothesis Tests - letter to the...
Dear all,
I was just wondering, what should the P-P plot look like when the residuals are normally distributed?
Simin,
the p-p (or q-q) plot should be linear with all the points along the line. If the points start curving away from the line at one end for example, then your residuals don't follow the normal distribution.
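For anyone who wants to see what goes into such a plot, here is a minimal Python sketch that builds the Q-Q pairs by hand and summarizes how straight the line is with a correlation. The plotting positions (i - 0.5)/n are one common convention among several, and the names `qq_points` and `straightness` are made up for this example.

```python
import random
from statistics import NormalDist, fmean, pstdev

random.seed(3)

def qq_points(residuals):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    n = len(residuals)
    m, s = fmean(residuals), pstdev(residuals)
    observed = sorted((r - m) / s for r in residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

def straightness(points):
    """Pearson correlation of the Q-Q pairs; values near 1 mean a straight line."""
    t = [p[0] for p in points]
    o = [p[1] for p in points]
    mt, mo = fmean(t), fmean(o)
    num = sum((a - mt) * (b - mo) for a, b in points)
    den = (sum((a - mt) ** 2 for a in t) * sum((b - mo) ** 2 for b in o)) ** 0.5
    return num / den

# Hypothetical residuals: one Normal sample, one skewed sample.
normal_resid = [random.gauss(0.0, 1.0) for _ in range(500)]
skewed_resid = [random.expovariate(1.0) for _ in range(500)]

r_normal = straightness(qq_points(normal_resid))
r_skewed = straightness(qq_points(skewed_resid))
print(round(r_normal, 4), round(r_skewed, 4))
```

Plotting the pairs from `qq_points` gives the picture described above: the Normal sample hugs the diagonal, while the skewed sample curves away at one end.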
Simin - You can look at the distribution of residuals to study both nonlinearity and heteroscedasticity. Some econometrics books could be helpful - say by Maddala, for instance. Also, Carroll and Ruppert, Transformation and Weighting in Regression (I think), Chapman and Hall, 1988, later CRC Press - I think. - Anyway, that was a good, insightful question. But I think nonlinearity and heteroscedasticity are what you are after. - Jim
There are two major points here.
The first is that regression fits a line using a least squares criterion that minimizes the sum of squared residuals. This does not make any assumptions regarding the probability distribution of the residuals.
The second point is that the assumption of normality is used to compute confidence intervals. For this to be meaningful, you need to demonstrate characteristics such as normality, identical distribution, and independence.
I suggest finding, and then plotting, the regression line and these residuals.
Then you can decide what to do next.
Yes, there are new tools for non-linear relationships and for non-normally distributed data.
SEM-PLS (Structural Equation Modeling - Partial Least Squares) is considered a new tool for exploratory studies. PLS can be used under the following conditions:
1. Exploratory research purposes.
2. Non Normal distribution.
3. Small Sample Size.
4. Theory is not fully fitting the theoretical model
5. Non-linear relationships (Quadratic relationship and polynomial relationships)
For further information I suggest reading the following papers:
Kock, N., and Lynn, G. S. (2012). Lateral Collinearity and Misleading Results in Variance-Based SEM: An Illustration and Recommendations. Journal of the Association for Information Systems, 13(7), 546-580.
Hair Jr, J. F., Hult, G. T. M., Ringle, C., & Sarstedt, M. (2013). A primer on partial least squares structural equation modeling (PLS-SEM). SAGE Publications, Incorporated.
Hair, J. F., Ringle, C. M., and Sarstedt, M. (2011). PLS-SEM: Indeed a Silver Bullet. Journal of Marketing Theory and Practice, 19(2), 139-151.
Hair, J. F., Sarstedt, M., Ringle, C. M., and Mena, J. A. (2012). An Assessment of the Use of Partial Least Squares Structural Equation Modeling in Marketing Research. Journal of the Academy of Marketing Science, 40, 414-433.
All the best
There's a really good paper that explains this quite clearly: Xiang Li, Wanling Wong, Ecosse L. Lamoureux & Tien Y. Wong (2012), Invest. Ophthalmol. Vis. Sci. The title of the paper is the same as your question.
As ever in statistics there are no black and white answers to your question and you have to use your judgement based on advice from others and your own analysis following that. Basically, I have found that it is OK to undertake regression on non-normal DVs as long as the sample sizes are large enough - these should have been determined by sample power calculation.
Good luck with your analysis - I hope you find a sensible solution
I recommend using the normal quantile transformation, also called normal scores. This is a well documented approach in the literature, and one of the first versions of this approach was by Van der Waerden (can be Googled). It is available in SAS under proc rank. Basically, the raw data are arranged in order from smallest to largest, and then each observation is mapped onto a standard normal curve, so x(i) becomes z(j), where z~N(0,1). This is a robust procedure and not a true nonparametric procedure, but it does the job very well. In addition to overcoming the normality issue, you will also get added clarity in understanding the outcomes, since each z(j) is simply a deviation from the mean. Also, in multivariate analysis, if one of the variables is much larger in magnitude than the rest of the variables, it can dominate the analysis. Such scores will "make variables equal in weight".
Here is a link for the background:
http://en.wikipedia.org/wiki/Van_der_Waerden_test
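For those not using SAS, the rank-to-normal mapping can be sketched in a few lines of Python. This assumes the Van der Waerden form z = Phi^{-1}(rank / (n + 1)) with average ranks for ties. The function name `normal_scores` and the example values are hypothetical.

```python
from statistics import NormalDist

def normal_scores(values):
    """Van der Waerden-style normal scores: z = Phi^{-1}(rank / (n + 1)).

    Tied values receive the average of the ranks they span.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    inv = NormalDist().inv_cdf
    return [inv(r / (n + 1)) for r in ranks]

raw = [1.2, 150.0, 3.4, 2.2, 9.9]  # hypothetical skewed measurements
scores = normal_scores(raw)
print([round(z, 3) for z in scores])
```

Note that the scores keep only the ordering of the raw data, so the extreme value 150.0 ends up no further from the center than the smallest value does.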
Normality will only come into importance if you do the statistical tests (inferences). You can still do the simple linear least squares estimation without the inferences. However the suggestions by the other writers should allow you to do the further step of testing the significance.
When the dependent variable is a dichotomy, one is often advised to use logistic regression instead of OLS. However, the results are often practically the same anyway. For a non-statistical audience, OLS is easier to understand. See:
Hellevik, O. (2009). Linear versus logistic regression when the dependent variable is a dichotomy. Qual Quant 43:59–74. DOI 10.1007/s11135-007-9077-3
Following up on Erik's post, with logistic regression, you get the odds ratio as your summary measure, whereas OLS linear regression with a dichotomous outcome gives you the risk difference. There's nothing wrong with that. But note that the standard errors are probably not going to be correct. You can fix that by using a robust SE. See the paper by Cheung (2007), for example (link below).
HTH.
http://aje.oxfordjournals.org/content/166/11/1337.full
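To make the "risk difference" point concrete, here is a small Python sketch with made-up 2x2 data (40 exposed with 12 events, 60 unexposed with 9 events). It checks that the OLS slope for a 0/1 outcome on a 0/1 predictor equals the difference between the two observed proportions. The helper name `ols_line` is an assumption for illustration.

```python
def ols_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

# Hypothetical data: exposure (0/1) and a dichotomous outcome (0/1).
exposure = [1] * 40 + [0] * 60
outcome = [1] * 12 + [0] * 28 + [1] * 9 + [0] * 51

risk_exposed = 12 / 40     # 0.30
risk_unexposed = 9 / 60    # 0.15

intercept, slope = ols_line(exposure, outcome)
print("intercept (risk in unexposed):", round(intercept, 4))
print("slope (risk difference):      ", round(slope, 4))
```

The point estimates match the cross-table proportions exactly; it is the standard errors that need the robust correction mentioned above.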
Simin,
The model selection depends on the hypothesis being tested and on the data structure. As you mentioned, your outcome (dependent) variable should be continuous. You do not check the distribution of the dependent variable itself; you need to check it within the regression process. As many answers mentioned, it is the residuals that should be independent and identically normally distributed, not the outcome itself. You also need to check the collinearity of the independent factors, which should be independent of each other and measured without error, and the relationship between the outcome and the independent factors, which is linear by assumption. The whole topic is MODEL building, and the process is:
1. Check the type of each cofactor. If some have a lot of missing values or typos, correct them or leave them out. Then, for continuous factors, check the Pearson correlation coefficients: if the Pearson correlation between any two is near 1 or -1, one of them should be dropped from the multiple regression. For categorical factors, check the independence of each pair: if most counts appear on the diagonal of the contingency table, one of the two categorical variables should be dropped.
2. Draw scatter plots of the continuous outcome against each quantitative independent factor to see the association. If a linear trend is shown, the factor stays in; if a non-linear effect is shown, a transformation is needed; if there is no trend (the points look random), the factor could be left out.
3. The observations of the dependent variable should be independent. If the dependent variable is related to time, check the autocorrelation; if autocorrelation exists, time series modeling (say, Autoreg) might be used.
4. Do a PCA to see whether there is still multicollinearity among the independent factors. If some eigenvalue is near zero, you may drop one of the factors or define a new factor (transformation).
5. If the sample size is large enough, say at least 10 times the number of unknown parameters, you can do multiple regression. You may use an automatic selection option, such as forward, backward, or best subset, which will select independent factors for you.
6. Number of parameters: the intercept counts as 1, each continuous factor counts as 1, and a categorical factor with k levels counts as k-1.
7. Check for outliers using leverage, Cook’s D, or residuals. If any exist, you may delete them, or fit the model both with and without the outliers.
8. Check the normality of the residuals from the multivariable regression. If it is violated, apply a transformation: if variance homogeneity holds, transform some independent factor; if variance homogeneity does not hold, transform the dependent variable. This may improve your model fit.
9. You may use Akaike’s information criterion, the Bayesian information criterion, or Mallows’ Cp to decide how many factors should be in the model. Using them is better than comparing R2.
10. Check for interactions among the independent factors. An interaction between two quantitative predictors means there is a joint effect: the effect of one factor varies across the levels of the other. An interaction between a quantitative factor and a categorical factor, say gender, means that the effect of the quantitative factor (the slope) differs between males and females.
If the outcome is categorical, you may consider logistic or other model.
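As an illustration of the outlier check in step 7, here is a minimal Python sketch of leverage and Cook's distance for a simple (one-predictor) regression. The data are invented so that the last point is both far out in x and off the trend, and the function name `cooks_distance` is just for this example. In practice you would read these values from your software's output.

```python
def cooks_distance(x, y):
    """Cook's distance for each point in a simple regression y = a + b*x."""
    n = len(x)
    p = 2  # number of estimated coefficients: intercept and slope
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance estimate
    leverage = [1.0 / n + (xi - mx) ** 2 / sxx for xi in x]
    return [
        (e * e / (p * s2)) * (h / (1.0 - h) ** 2)
        for e, h in zip(resid, leverage)
    ]

# Hypothetical data: nine points near the line y = x, plus one unusual case.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 7.9, 9.0, 5.0]

d = cooks_distance(x, y)
most_influential = d.index(max(d))
print("most influential case:", most_influential, "Cook's D:", round(d[most_influential], 2))
```

A common rule of thumb flags cases with Cook's D above 1 for a closer look, which is why fitting the model with and without such cases is a sensible comparison.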
For sufficiently large samples, violations of normality in the outcome may not be an issue:
Diehr P, Lumley T. The importance of the normality assumption in large public health data sets. Annual Review of Public Health. 2002;23:151-169.
"The t-test and least-squares linear regression do not require any assumption of Normal distribution in sufficiently large samples. Previous simulations studies show that “sufficiently large” is often under 100, and even for our extremely non-Normal medical cost data it is less than 500. This means that in public health research, where samples are often substantially larger than this, the t-test and the linear model are useful default tools for analyzing differences and trends in many types of data, not just those with Normal distributions. Formal statistical tests for Normality are especially undesirable as they will have low power in the small samples where the distribution matters and high power only in large samples where the distribution is unimportant."
http://www.annualreviews.org/doi/full/10.1146/annurev.publhealth.23.100901.140546
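The large-sample argument can be checked with a small simulation. The Python sketch below is only an illustration under assumed settings (n = 200, centered exponential errors, 1000 replications, a normal critical value of 1.96), and it is not taken from the paper. It estimates how often the usual 95% interval for the slope covers the true value when the errors are strongly skewed.

```python
import math
import random

random.seed(4)

def slope_and_se(x, y):
    """OLS slope and its classical standard error for y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return b, math.sqrt(sse / (n - 2) / sxx)

def coverage(n, reps, true_slope=2.0):
    """Share of simulated data sets whose 95% interval contains the true slope."""
    hits = 0
    for _ in range(reps):
        x = [random.uniform(0.0, 1.0) for _ in range(n)]
        # Centered exponential errors: mean 0, but strongly right-skewed.
        y = [1.0 + true_slope * xi + (random.expovariate(1.0) - 1.0) for xi in x]
        b, se = slope_and_se(x, y)
        if abs(b - true_slope) <= 1.96 * se:
            hits += 1
    return hits / reps

cov = coverage(n=200, reps=1000)
print("estimated coverage of the nominal 95% interval:", cov)
```

Rerunning with a much smaller n, or with heavier-tailed errors, is a quick way to see where the approximation starts to break down for your kind of data.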
You could try transforming the DV until it is as close to normal as you can get it, re-run your model, then compare the results. If the transformed variable gives you results that lead to the same interpretation and conclusions about the data, then it's probably a pretty robust relationship.
Simin! Since the normal distribution describes the area under the famous bell-shaped curve, the dependent variable must be continuous raw data. Otherwise, if the raw data (dependent variable) are measured on another scale (such as the number of fruits counted), we have to consider another distribution to analyze a possible relationship, such as the Poisson distribution, among others. If the dependent variable that you measured has binomial behavior (e.g. dead or alive), you should consider a Binomial distribution (or a Negative Binomial distribution if your data are zero-inflated).
So, if your raw data (DV) are a continuous variable, you can START the analysis. First of all, you do not need to check the assumption of normality for the raw data (as my colleagues mentioned).
I usually follow these steps to build a model:
a) Make a scatter plot of the dependent variable against each independent variable. These plots will help you select potential independent variables to consider for the model. Be careful and analyze each plot in the context of biological expectation (e.g. we expect trees to increase in height as they increase in diameter; we do not expect a tree 1 meter in diameter to be only 5 meters tall);
b) A correlation analysis (Pearson correlation) will give you a mathematical measure of the relationship. However, a highly correlated explanatory variable will not necessarily enter the regression model significantly;
c) Check the assumptions of the regression for the errors.
d) other steps ....
If you use SAS System I can send you a Datajob to verify the regression assumptions.
Best regards.
Thiago discussed model selection and diagnosis with you. Model building and diagnosis can be done in many ways, but given your description I think the ten-step process I posted earlier in this thread will be the easiest to follow, assuming the dependent variable is continuous. As I mentioned earlier, it is the residuals that should be independent and identically normally distributed, not the outcome itself.
Yuanzhang li
Lots of good advice above.
Ordinary least squares (OLS) estimators are unbiased regardless of the error distribution (apart from the independence requirement), so the estimates are valid. Equality of variances is an issue for the t-test, but experience shows that the t-test (given enough observations!) performs reasonably well even when the outcome is dichotomous. Also, regression estimates are themselves weighted means, and means tend to be nearly normally distributed even with a modest number of observations.
See, for example, “Linear versus logistic regression when the dependent variable is a dichotomy”, Ottar Hellevik. Qual Quant (2009) 43:59–74 DOI 10.1007/s11135-007-9077-3
Thiago - [If you use SAS System I can send you a Datajob to verify the regression assumptions.] Please email me the Datajob - thanks.
Simin - you did not tell us the nature of the dependent variable, i.e. whether or not the observations are autocorrelated (one measurement depends on the next or previous one). This is the independence being referred to in the discussion; if it is not satisfied, ordinary regression will be invalid. Li says to use Autoreg.
To everyone: if the residuals, εi = yi – ŷi, are niid, why are the observations, yi, not also niid? ŷ is fixed by the experimenter.
If you have a large sample (n >= 30), your model should not suffer much from normality violations. Alternatively, you can transform the DV.
good luck
Normality has nothing to do with linear regression, except if one wants to stick to the maximum likelihood estimation principle to justify the use of a least squares solution (and regression is such a solution) with weight matrix being the inverse of the covariance matrix of the errors (except for a positive scalar factor).
Maximum likelihood may be attractive to some statisticians but most applied scientists since the time of Gauss prefer a different type of optimal estimation the so called Best Linear (Uniformly) Unbiased Estimation, BLUUE or more usually BLUE.
This means that among all linear functions of the observations we seek, as the estimate of any unknown parameter (or even of any linear function of the unknowns), the one which is unbiased (i.e. the mean = expected value of the estimate equals the true value), uniformly so (i.e. whatever the true values of the unknowns are), and best. Best here means that the mean square error of the estimation is minimized. The mean square error is the mean = expected value of the square of the difference between the estimate and the true value.
The famous Gauss-Markov theorem asserts that BLUUE (or simply BLUE) estimates are obtained by the least squares method when the used weight matrix is proportional (up to a positive scalar multiplicative factor) to the inverse of the covariance matrix of the errors (no matter what their probability distribution is).
The BLUE principle is much more attractive than the maximum likelihood one, because we minimize the mean square error so that different estimates under repetitions of the experiment (with different error outcomes) are more concentrated and are also concentrated around the true value. How can maximum likelihood beat that? Furthermore, BLUE estimation applies equally well to cases where the errors are not normally distributed, while maximum likelihood (if applicable) then leads to a solution different from the usual least squares (in our case linear regression) solution.
When the errors are normal, both principles give the same answer. So why bother about normality? Not to mention that the normal distribution is a myth. Every scientist that has left the convenience of his office to go out to the field and make repeated measurements under the same conditions knows that very well.
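For readers who prefer symbols, the argument above can be written out as follows. The notation (design matrix X, error covariance sigma^2 Q) is my own choice for this sketch and is not taken from a particular textbook.

```latex
% Linear model with zero-mean errors and covariance known up to a scalar
y = X\beta + e, \qquad \mathrm{E}[e] = 0, \qquad \mathrm{Cov}(e) = \sigma^2 Q .

% Least squares with weight matrix proportional to Q^{-1}
\hat{\beta} = \arg\min_{\beta}\,(y - X\beta)^{\top} Q^{-1} (y - X\beta)
            = (X^{\top} Q^{-1} X)^{-1} X^{\top} Q^{-1} y .

% Properties that hold for any error distribution with these two moments
\mathrm{E}[\hat{\beta}] = \beta, \qquad
\mathrm{Cov}(\hat{\beta}) = \sigma^2 (X^{\top} Q^{-1} X)^{-1} .

% If, in addition, e is normal, the likelihood is
L(\beta) \propto \exp\!\Big(-\tfrac{1}{2\sigma^2}\,(y - X\beta)^{\top} Q^{-1} (y - X\beta)\Big),
% so maximizing L(\beta) minimizes the same quadratic form and returns the same \hat{\beta}.
```

Ordinary regression is the special case Q = I. The Gauss-Markov theorem says that this estimator has the smallest variance among all linear unbiased estimators, without any reference to normality.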
1- Try data transformation - then check normality
2- If standardised residuals are normal- no worries...
Pieter -
The central limit theorem is invoked when you want to look at the distribution of estimated means. Here, in this application of regression, the salient point is the distribution of the residuals. The y-values can have any distribution, and actually, I worked for many years, modeling establishment survey continuous data, which are very highly skewed.
We may often want errors or residuals to be normally distributed, but Simin's dependent variable can have any distribution.
Cheers - Jim
PS - As indicated elsewhere, heteroscedasticity is a more important consideration. It is naturally occurring, and in my long experience with this, is handled very well with weighted least squares (WLS) regression - with the heteroscedasticity thus accounted for in the error structure. (See the paper found through the link attached.)
Article Properties of Weighted Least Squares Regression for Cutoff S...
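As a small illustration of WLS with heteroscedasticity built into the error structure, here is a Python sketch for regression through the origin. It assumes the error variance is proportional to x, so the weights are 1/x. That is just one common choice for illustration and is not necessarily the weighting used in the attached paper. The function name and data are made up.

```python
def wls_through_origin(x, y, weights):
    """Weighted least squares slope for the no-intercept model y = b*x."""
    num = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    den = sum(w * xi * xi for w, xi in zip(weights, x))
    return num / den

# Hypothetical skewed establishment-type data: a few large units dominate.
x = [2.0, 5.0, 10.0, 40.0, 100.0]
y = [2.5, 4.0, 12.0, 37.0, 110.0]

b_ols = wls_through_origin(x, y, [1.0] * len(x))         # equal weights = OLS
b_wls = wls_through_origin(x, y, [1.0 / xi for xi in x])  # variance assumed proportional to x

print("OLS slope:", round(b_ols, 4))
print("WLS slope:", round(b_wls, 4))
print("ratio sum(y)/sum(x):", round(sum(y) / sum(x), 4))
```

With these particular weights the WLS slope reduces to sum(y)/sum(x), the classical ratio estimator, which shows how a weighting assumption translates directly into a different estimator.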
Pieter - I worked with continuous data for that modeling. People may have different points-of-view, but I did a great deal of work with heteroscedasticity. If this is not an area in which you have much experience, you may want to start with the Sage Pub Encyclopedia entry attached here.
I have an Australian friend who might use this expression:
Hooroo! - Jim
Article HETEROSCEDASTICITY AND HOMOSCEDASTICITY
As most of my colleagues have rightly said, the assumption of normality applies to error term and not the dependent and independent variables
Firstly, the assumption of normality applies to error term/ residuals and not the variables (response & predictor). Let's review assumptions of OLS
Assumptions of OLS:
Here, I've not written 'normality of residuals' as an assumption of OLS. Even if the residuals are not normal, the best linear unbiased estimator (BLUE) of the coefficients is still given by the ordinary least squares (OLS) estimator, but non-normality will create problems in statistical inference. That's why it's sometimes included in the assumptions.
In case of violations of normality, statistical tests for determining whether model coefficients are significantly different from zero, and the calculation of confidence intervals for forecasts, may not be reliable.
Read more about it in my answer at
http://stats.stackexchange.com/questions/243705/estimate-generalized-linear-model/243718?noredirect=1#comment463514_243718
OLS regression also often works fine when the dependent variable is binomial (where the variance of proportions varies with the size of the proportion). OLS is unbiased regardless of the error distribution. OLS estimates are equal to cross-table proportions and can easily be applied in multi-table situations. Logistic regression might give more correct p-values, but the difference from OLS p-values is often small if the number of observations is not too small.
The assumptions for regression are that the residuals are independent and identically distributed with zero mean and equal variances. These assumptions are necessary for drawing inferences and computing p-values for F and t. If inference is not the goal and all that is needed is fitting an OLS line, then the above assumptions are not necessary. In addition, the CLT is enough to establish normality of the mean if N>30, and a P-P plot, a Q-Q plot, and a histogram with a normal curve, in addition to the K-S or Shapiro tests, can be used to assess normality.
You can check the normality and homoscedasticity assumptions for the error term. If the variance of the residuals changes, you can transform the dependent variable. Try Box-Cox!
Normality of errors is required for inference, and not for point estimation (thanks to the Gauss-Markov theorem). So the only concern is that if the errors are non-normal, the tests may be misleading, especially for small sample sizes. I suggest using bootstrap-based p-values to carry out tests.
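A minimal Python sketch of the bootstrap idea is below. It resamples (x, y) pairs and reports a percentile interval for the slope rather than a p-value. If zero falls outside the 95% interval, that corresponds roughly to rejecting a zero slope at the 5% level. The pairs bootstrap, the number of replications, and the simulated data are all assumptions for illustration, and other bootstrap schemes exist.

```python
import random

random.seed(5)

def ols_slope(x, y):
    """OLS slope for y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx

def bootstrap_slope_ci(x, y, reps=2000, level=0.95):
    """Percentile bootstrap interval for the slope, resampling cases with replacement."""
    n = len(x)
    estimates = []
    for _ in range(reps):
        idx = [random.randrange(n) for _ in range(n)]
        estimates.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[int(reps * (1.0 - level) / 2.0)]
    upper = estimates[int(reps * (1.0 + level) / 2.0) - 1]
    return lower, upper

# Hypothetical small sample with skewed (centered exponential) errors.
x = [random.uniform(0.0, 5.0) for _ in range(60)]
y = [1.0 + 2.0 * xi + (random.expovariate(1.0) - 1.0) for xi in x]

b_hat = ols_slope(x, y)
ci_low, ci_high = bootstrap_slope_ci(x, y)
print("slope:", round(b_hat, 3), "95% bootstrap CI:", (round(ci_low, 3), round(ci_high, 3)))
```

Nothing in this procedure uses the normal distribution, which is the attraction when the residuals look clearly non-normal and the sample is small.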
A remark on the answer of Javed Iqbal : I would fully agree with your answer, if the term "normality" is replaced with the term "knowledge of the distribution (probability density function) of the errors". If one knows the distribution he can still make statistical inference and employ statistical tests, even if the distribution is not the normal one. The only problem is that he cannot use the ready made recipes of statistical textbooks which follow the normality assumption.
Since I posted my earlier response 4 years ago, my thoughts on the normality requirement for OLS regression have changed a bit, and I think they are probably in line with the views expressed by Javed Iqbal and Athanasios Dermanis. I have been influenced by Jeffrey Wooldridge's book, Introductory Econometrics. What he says about the assumptions for OLS linear regression is summarized in the attached PDF.
Note especially this excerpt (from p. 168 of the book) on slide 9:
This suggests to me that normality of the errors is really a sufficient condition, but not a necessary condition. Sufficient for what? Sufficient to ensure (approximate) normality of the sampling distributions of the model parameters (i.e., the coefficients). And it is those sampling distributions, at the end of the day, that really need to be (approximately) normal.
Cheers,
Bruce
Old post, but nonetheless. No one seems to have commented on how to interpret the normal probability plot, as @ Simin Mahinrad asked about:
In the NORMAL P-P plot, you are hoping that your points will lie in a reasonably straight diagonal line from bottom left to top right. In the SCATTER plot (see enclosed), you are hoping that the residuals will be roughly rectangularly distributed, with most scores concentrated in the center (along the 0 point). You can also see outliers, i.e., cases that have a std residual >3 or
Log-transform data and apply linear regression afterwards, I guess.
We have to apply generalized power transform (x^lambda -1)/lambda to transform data into Normal form.
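The power transform quoted above is the Box-Cox family. A small Python sketch follows. The function name `box_cox` is my own, the transform requires strictly positive data, and the lambda = 0 case is defined as the natural log (the limit of the formula as lambda goes to 0). Choosing lambda is a separate step that statistical software normally handles for you.

```python
import math

def box_cox(values, lam):
    """Box-Cox power transform: (x**lam - 1) / lam, or log(x) when lam == 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

data = [1.0, 4.0, 9.0, 100.0]  # hypothetical right-skewed measurements

print(box_cox(data, 1.0))   # lam = 1 only shifts the data by -1
print(box_cox(data, 0.5))   # square-root-like transform
print(box_cox(data, 0.0))   # log transform
```

Note that a transform of the outcome is usually judged by what it does to the residuals of the refitted model, not by the histogram of the transformed outcome alone.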
The point estimates will be valid, but the usual hypothesis tests and confidence intervals cannot be relied on in the non-normal case. Regression diagnostics should be done before a proper analysis can be performed.
It is sufficient to check for normally distributed errors. Check if the residuals seem to be covered well by a normal curve.
If this is not the case, you could do one of the following:
1. Bootstrap
2. Use normal quantile transformation
3. Figure out an appropriate transformation
4. Use the raw data with caution.
5. (Anything else?)
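As an illustration of option 1, here is a rough sketch of a percentile bootstrap for the slope of a simple regression, resampling (x, y) cases with replacement. It is plain Python with made-up data (skewed errors) and hypothetical function names; the number of resamples and the seed are arbitrary assumptions.

```python
import random


def ols_fit(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bootstrap_slope_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the slope, resampling cases."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if min(xs) == max(xs):
            continue  # skip degenerate resamples with no spread in x
        ys = [y[i] for i in idx]
        slopes.append(ols_fit(xs, ys)[1])
    slopes.sort()
    lo_idx = max(0, round(alpha / 2 * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return slopes[lo_idx], slopes[hi_idx]


# Made-up example: true slope 2, errors are exponential (skewed), centred at 0.
demo_rng = random.Random(1)
x_demo = [i / 10.0 for i in range(100)]
y_demo = [1.0 + 2.0 * xi + (demo_rng.expovariate(1.0) - 1.0) for xi in x_demo]
print("OLS slope:", ols_fit(x_demo, y_demo)[1])
print("95% bootstrap interval:", bootstrap_slope_ci(x_demo, y_demo))
```

The interval here does not rely on normal errors, which is the point of the suggestion; it does still assume the cases are independent.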
Realize that not all published articles or texts on regression are 100% acceptable views and findings on regression. Take them as “input”.
Then make up your mind.
Find the R-squared value. If, for example, R-squared is 80%, it is interpreted as 80% of the variation in the dependent variable being explained by the independent variables, with only 20% unexplained.
Then there is the test for linearity of regression. Use this test to reach a conclusion.
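For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch in plain Python (the function name and the numbers are made up for illustration):

```python
def r_squared(y, y_hat):
    """Share of the variation in y accounted for by the fitted values y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Made-up observed values and fitted values from some regression.
observed = [2.0, 4.1, 5.9, 8.2, 9.8]
fitted = [2.0, 4.0, 6.0, 8.0, 10.0]
print("R-squared:", r_squared(observed, fitted))
```

Note that a high R-squared says nothing about whether the errors are normal, so it does not replace the residual checks discussed in this thread.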
Simin Mahinrad Bruce Weaver
The article linked below comes from the field of ophthalmology, where they seem to work with non-normal dependent variables often. It shows a relationship between sample size and the degree to which violating the normality assumption for the dependent variable affects the results. The view expressed is that the dependent variable may have a conditional normal distribution across the data set.
I hope this helps.
Article Are Linear Regression Techniques Appropriate for Analysis Wh...
It is good to use Minitab to find the best model, regardless of whether it is linear or not. Even if the distribution is not normal, a linear model can still be used.
I do agree with Bruce Weaver.
Regarding the validity of the results of the linear regression analysis, my suggestion is as follows:
Test the significance of the deviation from linearity of regression in the data. If the deviation is found to be insignificant, the results can be taken as valid. If, on the other hand, the deviation is found to be significant, the data set does not follow a linear regression model, which implies that the results are not valid. In that case, non-linear regression analysis will have to be performed to obtain valid results.
As I suggested earlier, regression diagnostics (analysis based on residuals) will give the real picture. They will tell whether linear regression is valid or not. They will also indicate whether a quadratic or higher-degree polynomial is required.
I do agree with Girdhar Agarwal.
His answer is, in essence, the same as the one I already gave.
If the PDF of the errors (residuals) differs significantly from the Gaussian law, then the variance of the OLS estimates of the regression parameters can be reduced. One possible approach is to use higher-order statistics. For example, Chapter Polynomial Estimation of Linear Regression Parameters for th...
I view it as important not to get cornered into one approach or methodology because some book "said so". Statistics is a wonderful field, and its major charm to me is the fact that it can open up the world to us statisticians. We can pick and choose what we want to analyze and in which way (as long as we stay with reasonably acceptable and defensible choices).
In this thread we are debating and discussing options on what to do when "the outcome (dependant variable) not normally distributed?". A healthy exchange of thoughts can never hurt. This is good to do.
Hello Dr. Simin,
I think you should apply a transformation to your dependent variable, such as:
- Log transformation
- Standardization using the mean-centered method
- Lag transformation
I hope my response may be useful to you.
The suggestion provided by Nader Mohamed, as I assess, may be fruitful.
You can try the Box-Cox family of transformations to transform the dependent variable you are dealing with into a response variable that is normally or nearly normally distributed.
You can decide after examining the regression residuals and the fitted regression line. If a log, square-root, mean-centered standardization, or Box-Cox transformation still does not bring the data toward normality, I suggest a more suitable approach, namely GLM methods: you can run a gamma regression, which gives more reliable estimates under more flexible assumptions. Linear regression in such a case (non-normality) will impose normality, which violates the OLS assumption and yields biased estimates. So, although linear regression is well recognised as the best model (I hereby confirm it), you need to explore the data at hand by looking at the outcome distribution and, at the same time, at predicted versus observed outcomes. After this process, you can decide on a suitable model.
Dear Simin Mahinrad,
Please read this article:
Good luck.
