I have a sample of 710 and obtained skewness z-scores between 3 and 7 and kurtosis z-scores between 6 and 8.8. This indicates that the data are not normal for a few variables. Can I still conduct regression analysis?
The way you've asked your question suggests that more information is needed. You mentioned that a few variables are not normal, which indicates that you are looking at the normality of the predictors, not just the outcome variable. Regression assumes normality only for the outcome variable (strictly, for the errors), not for the predictors. Non-normality in the predictors MAY be associated with a nonlinear relationship between them and the outcome, but that is a separate issue.
You have a lot of skew, which will likely produce heterogeneity of variance, and that is the bigger problem. Unless that skew is produced by y being a count variable (in which case Poisson regression would be recommended; see the sketch below), I'd suggest trying to transform y to normality.
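For the count-data case, a minimal sketch of Poisson regression in Python with statsmodels (the software choice and all data below are hypothetical illustrations, not from the question):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 2, size=710)           # hypothetical predictor
    y = rng.poisson(np.exp(0.5 + 0.8 * x))    # skewed count outcome

    # Poisson regression with a log link handles the skew directly,
    # with no need to transform y.
    fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
    print(fit.summary())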
A standard regression model assumes that the errors are normal and that all predictors are fixed, which means the response variable, conditional on the predictors, is also assumed to be normal for the inferential procedures in regression analysis. The fit itself does not require normality.
If y appears to be non-normal, I would try to transform it to be approximately normal; a sketch follows. A description of all variables would help here.
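As one sketch of the transformation idea, assuming a strictly positive y, the Box-Cox procedure in scipy chooses a power transform by maximum likelihood (the lognormal data below are hypothetical):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    y = rng.lognormal(mean=0.0, sigma=1.0, size=710)  # skewed, positive outcome

    y_bc, lam = stats.boxcox(y)   # lam is the estimated Box-Cox exponent
    print(lam)                    # near 0 here, i.e., roughly a log transform
    print(stats.skew(y), stats.skew(y_bc))  # skewness should drop sharply

One would then regress y_bc on the predictors and check the residuals, remembering that predictions are on the transformed scale.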
Misconceptions seem abundant when this and similar questions come up on ResearchGate.
Basic to your question: the distribution of your y-data is not restricted to normality or any other distribution, and neither are the x-values for any of the x-variables. (You seem concerned about the distributions of the x-variables.) If the distribution of your estimated residuals is not approximately normal (use the random factors of those estimated residuals when there is heteroscedasticity, which should often be expected), then you may still be helped by the Central Limit Theorem.
Often people want normality of estimated residuals for hypothesis tests, but hypothesis tests are often misused. Prediction intervals around your predicted y-values, as sketched below, are often more practically useful. Some say to use p-values for decision making, but without a type II error analysis that can be highly misleading. The estimated variance of the prediction error for each predicted y can be a good overall indicator of accuracy for predicted y-values, because the estimated sigma used there is impacted by bias. The estimated variance of the prediction error for the predicted total is useful in finite population sampling.
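To make the prediction-interval point concrete, a sketch in Python with statsmodels (the model and numbers are hypothetical):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, 710)
    y = 3.0 + 2.0 * x + rng.normal(0, 2, 710)

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    new_x = sm.add_constant(np.array([2.5, 5.0]))  # points at which to predict
    pred = fit.get_prediction(new_x)
    # obs_ci_lower / obs_ci_upper are the 95% prediction intervals
    print(pred.summary_frame(alpha=0.05))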
One key to your question is the difference between an unconditional variance and a conditional variance. You are apparently thinking about the unconditional variance of the "independent" x-variables, and maybe that of the dependent variable y. But the distribution of interest is the conditional variance of y given x, or given predicted y (that is, y*, for multiple regression), for each value of y*. You generally have only one value of y for any given y* (and only for those x-values corresponding to your sample), but you assume that the estimated random factor of the estimated residual is distributed the same way for each y* (or x). This has nothing to do with the unconditional distribution of the y or x values, nor with the linear or nonlinear relationship of y and x. If you have count data, as another responder noted, you can use Poisson regression. In general, though (I have worked with continuous data), if you can write y = y* + e, where y* is predicted y, and e is factored into a nonrandom factor (which in weighted least squares, WLS, regression is the inverse square root of the regression weight, a constant for OLS) and an estimated random factor, then you might like that estimated random factor of the estimated residuals to be fairly close to normally distributed.
Note that for the case of simple linear regression with a zero intercept, y = bx + e, we have y* = bx, so "y given x" and "y given y*" amount to the same thing.
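A sketch of that decomposition, assuming a hypothetical variance model in which the residual spread grows with x, so the regression weight is 1/x and the nonrandom factor of e is sqrt(x):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    x = rng.uniform(1, 10, 710)
    y = 2.0 * x + rng.normal(0, 1, 710) * np.sqrt(x)  # error spread grows with x

    w = 1.0 / x                                       # hypothetical regression weights
    fit = sm.WLS(y, sm.add_constant(x), weights=w).fit()

    resid = y - fit.fittedvalues
    random_factor = resid * np.sqrt(w)  # e divided by its nonrandom factor 1/sqrt(w)
    # It is this random factor, not y itself, whose distribution should look
    # the same at every predicted y, and ideally close to normal.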
The unconditional distributions of y and of each x disqualify nothing. They do not even determine linearity or nonlinearity between continuous variables y and x. You may have linearity between y and x, for example, if y is very oddly distributed but x is oddly distributed in the same way. Nonlinearity is OK too, though.
Non-normality for the y-data and for each of the x-data is fine.
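A small simulation of the point, with made-up numbers: skewed x, skewed y, a perfectly linear relationship, and well-behaved residuals:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.lognormal(0, 1, 710)               # strongly skewed predictor
    y = 5.0 + 2.0 * x + rng.normal(0, 1, 710)  # y inherits the skew

    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    print(stats.skew(x), stats.skew(y), stats.skew(resid))
    # x and y are both heavily skewed, yet the residuals are roughly symmetric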
I agree totally with Michael: you can conduct regression analysis with a transformation of the non-normal dependent variable. Another issue: why do you use skewness and kurtosis to assess the normality of the data? There are formal tests for normality, such as the Shapiro–Wilk test.
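For reference, the Shapiro–Wilk test is available in scipy (the sample below is hypothetical; with n = 710 the test is applicable, though at such sample sizes even trivial departures from normality can come out "significant"):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    sample = rng.normal(size=710)     # stand-in for one of your variables

    stat, p = stats.shapiro(sample)
    print(stat, p)                    # a small p suggests departure from normality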
The ONLY 'normality' consideration at all (other than what kind of regression to do) is with the estimated residuals. (With weighted least squares, which is more natural, we would instead mean the random factors of the estimated residuals.)
Consider the various examples, in the attached slides, of linear regression with skewed dependent and independent variable data.
When people say that it would be best if y were 'normally distributed,' they mean the CONDITIONAL y, i.e., the distribution of the (random factors of the) estimated residuals about each predicted y, along the vertical axis direction. The actual (unconditional, dependent variable) y-data can be highly skewed. All the data can be skewed. Not a problem, as shown in the numerous slides above.
Thus we should not phrase this as saying it is desirable for y to be normally distributed; we should talk about predicted y instead, or better, about the estimated residuals. (The estimated variance of the prediction error also involves variability from the model, by the way.)
The following is with regard to the nature of heteroscedasticity, and consideration of its magnitude, for various linear regressions, which may be further extended.
The fact that your data do not follow a normal distribution does not prevent you from doing a regression analysis. The problem is that the results of the parametric F and t tests generally used to assess, respectively, the significance of the equation and of its parameters will not be reliable. In such cases of violation of the statistical assumptions, the generalized least squares method can be considered for the estimates.
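As a sketch of two such routes in Python with statsmodels (hypothetical heteroscedastic data; the variance structure assumed for GLS is illustrative):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    x = rng.uniform(1, 10, 710)
    y = 1.0 + 2.0 * x + rng.normal(0, 1, 710) * x  # error sd proportional to x
    X = sm.add_constant(x)

    # Route 1: keep OLS estimates, use heteroskedasticity-robust standard errors
    robust_fit = sm.OLS(y, X).fit(cov_type="HC3")

    # Route 2: GLS, assuming the error variances are proportional to x**2
    gls_fit = sm.GLS(y, X, sigma=x**2).fit()

    print(robust_fit.bse, gls_fit.bse)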
I wrote above that "If the distribution of your estimated residuals is not approximately normal ... you may still be helped by the Central Limit Theorem."
The central limit theorem, as I see it now, will not help 'normalize' the distribution of the estimated residuals, but the prediction intervals will be made smaller with larger sample sizes.
I think I've heard some say the central limit theorem helps with residuals and some say it doesn't. But consider sigma, the standard deviation of the estimated residuals (or the constant standard deviation of the random factors of the estimated residuals, in weighted least squares regression). In statistical/machine learning writing I've read Scott Fortmann-Roe refer to sigma as the "irreducible error," and, realizing that is correct, I'd say that when that variability can't be reduced, the central limit theorem cannot help with the distribution of the estimated residuals. (Anyone else with thoughts on that? The central limit theorem says means approach a 'normal' distribution with larger sample sizes, and standard errors are reduced. But if we are dealing with this standard deviation, it cannot be reduced. Any analysis where you deal with the data themselves would be a different story, however.)
Other than sigma, the estimated variances of the prediction errors, because of the model coefficients, are reduced with increased sample size. To some extent, I think that may help to somewhat 'normalize' the prediction intervals for predicted totals in finite population sampling.
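A quick simulation of the "irreducible error" point, with hypothetical data: as n grows, the standard error of the estimated slope shrinks, but the estimated residual standard deviation stays near its true value:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    for n in (100, 1000, 10000):
        x = rng.uniform(0, 10, n)
        y = 1.0 + 2.0 * x + rng.normal(0, 3.0, n)  # true sigma fixed at 3
        fit = sm.OLS(y, sm.add_constant(x)).fit()
        print(n, fit.bse[1], np.sqrt(fit.scale))   # slope SE shrinks; sigma-hat does not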
James R Knaub is right in speaking of the possibility of reducing the confidence intervals of the regression coefficients. It is possible to construct a nonlinear rule for estimating the regression coefficients which takes into account the departure from the Gaussian model (for example, the skewness and excess kurtosis coefficients).
Chapter Polynomial Estimation of Linear Regression Parameters for th...
Its application reduces the variance of the estimates (and, accordingly, the confidence interval).
I noted above that I was not sure about the Central Limit Theorem with residuals. Because of the "irreducible error" (http://scott.fortmann-roe.com/docs/BiasVariance.html) from the sigma for the estimated residuals, it seems that prediction intervals could never be fully 'normalized,' no matter the sample size, but the distribution of the mean of the estimated residuals could be, and so could the estimated regression coefficients. (See https://lmyint.github.io/155_spring_2020/central-limit-theorem.html and https://www.econometrics-with-r.org/4-5-tsdotoe.html.) Anyway, I recently found the following on ANOVA: https://sites.ualberta.ca/~lkgray/uploads/7/3/6/2/7362679/slides_-_anova_assumptions.pdf. In there is the following:
"Assumption #1: Experimental errors are normally distributed
'If I was to repeat my sample repeatedly and calculate the means, those
means would be normally distributed.'”
I thought the assumption was on the distribution of the estimated residuals, not on the distribution of the mean of the estimated residuals; but in the statement quoted above, the Central Limit Theorem is about the distribution of the mean.
This is important here because regression with categorical independent variables is the same as ANOVA. (See https://www.theanalysisfactor.com/why-anova-and-linear-regression-are-the-same-analysis/.)
In ANOVA the y-values in a given category are actually the same as the y|x values in regression, where x is a category. In https://www.theanalysisfactor.com/checking-normality-anova-model/ it says "Residuals have the same distribution as Y|X. If residuals are normally distributed, it means that Y is normally distributed within a value of X...."
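The equivalence is easy to verify numerically; a sketch with three hypothetical groups, comparing the one-way ANOVA F statistic with the overall F of a regression on group dummies:

    import numpy as np
    from scipy import stats
    import statsmodels.api as sm

    rng = np.random.default_rng(8)
    g1 = rng.normal(0.0, 1, 50)
    g2 = rng.normal(0.5, 1, 50)
    g3 = rng.normal(1.0, 1, 50)

    f_anova, _ = stats.f_oneway(g1, g2, g3)

    y = np.concatenate([g1, g2, g3])
    d2 = np.repeat([0, 1, 0], 50)    # dummy for group 2
    d3 = np.repeat([0, 0, 1], 50)    # dummy for group 3
    fit = sm.OLS(y, sm.add_constant(np.column_stack([d2, d3]))).fit()

    print(f_anova, fit.fvalue)       # the two F statistics agree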
I suppose that is the origin of people thinking that the y-variable should be normally distributed as an assumption for regression, but the y-data distribution in regression is unconditional, and the claim is not true of it. It is desirable for the estimated residuals to be normally distributed, though even that is not a very strict requirement. The unconditional data can be of any distribution. The establishment survey energy data I worked with for many years were (and are) highly skewed for y-data and x-data, which is perfectly fine for regression.
If anyone has any comment on what was said above on "Assumption #1" in https://sites.ualberta.ca/~lkgray/uploads/7/3/6/2/7362679/slides_-_anova_assumptions.pdf, such as anything contradictory, I would like to know. Cheers.
I think you are probably correct in your post above (13 June 2020). However, could you expand a bit - even if only a small amount - on that post, please?
Regression used in model-based sampling and prediction with establishment surveys works well, and establishment survey data are generally quite skewed, as in this example from https://www.researchgate.net/publication/261947825_Projected_Variance_for_the_Model-based_Classical_Ratio_Estimator_Estimating_Sample_Size_Requirements, where a quasi-cutoff sample resulted in only one of the smallest cases being taken; this was multiple-attribute sampling, and the data shown there are for only one item/question.
1. When the residuals of a "dependent" variable (i.e., the outcome) are not distributed normally, linear regression remains a statistically sound technique in studies of large sample sizes.
By the law of large numbers and the central limit theorem, the ordinary least squares (OLS) estimators in linear regression will still be approximately normally distributed around the true parameter values, which implies the estimated parameters and their confidence interval estimates remain robust (see the simulation sketch after this list). Hence, in a large sample, the use of linear regression, even if the dependent variable violates the "normality assumption," remains valid.
2. Non-normality of the errors will have some impact on the precise p-values of the tests on coefficients, etc. But if the distribution is not too grossly non-normal, the tests will still provide good approximations.
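A simulation sketch of point 1, with hypothetical numbers: even with strongly skewed (exponential) errors, the sampling distribution of the OLS slope is close to normal at n = 710:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(9)
    slopes = []
    for _ in range(2000):
        x = rng.uniform(0, 10, 710)
        e = rng.exponential(2.0, 710) - 2.0   # skewed errors with mean zero
        y = 1.0 + 0.5 * x + e
        slope, _ = np.polyfit(x, y, 1)
        slopes.append(slope)

    # Near-zero skewness and typically no strong evidence against normality
    print(stats.skew(slopes), stats.normaltest(slopes).pvalue)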
Establishment surveys have skewed data distributions. The prediction approach (model-based approach), which uses regression, is a valuable tool for establishment surveys.
See Valliant, R., Dorfman, A.H., and Royall, R.M. (2000), Finite Population Sampling and Inference: A Prediction Approach, Wiley Series in Probability and Statistics,
and
Chambers, R., and Clark, R. (2012), An Introduction to Model-Based Survey Sampling with Applications, Oxford Statistical Science Series.