I am trying to perform a linear regression, but my response variable does not seem to follow a normal distribution (H0 was rejected in a Kolmogorov-Smirnov test) and is positively skewed (the distribution is attached, Fig1). A log transformation of the data did not make it normal either. I've read that non-normality of the residuals (and, as a result, of the response variable) should not be a problem, thanks to the CLT, as long as the sample size is sufficiently large, which seems to be the case in my example (N ~ 400,000). However, when I fit a linear model (with only one predictor), the normal probability plot of my residuals does not follow a straight line (Fig2).

I also tried removing the outliers by dropping values more than three scaled median absolute deviations (MAD) away from the median, and then fitted another linear regression model. (I had no reason to think the outliers should not be in the data; I only wanted to see how the output would change.) As expected, the response became normally distributed, and the p-value and test statistics changed!
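
A minimal sketch of this kind of filter in Python/NumPy, for illustration only. It assumes the usual scaling constant 1.4826, which makes the MAD comparable to a standard deviation for normally distributed data. The function name `mad_outlier_mask` and the simulated data are hypothetical, not taken from my dataset.

```python
import numpy as np

def mad_outlier_mask(y, n_mads=3.0):
    """Return True for values more than n_mads scaled MADs from the median."""
    y = np.asarray(y, dtype=float)
    med = np.median(y)
    # Assumed scaling constant: 1.4826 makes the MAD a consistent
    # estimate of the standard deviation under normality.
    scaled_mad = 1.4826 * np.median(np.abs(y - med))
    return np.abs(y - med) > n_mads * scaled_mad

# Simulated stand-in data (hypothetical): one predictor, right-skewed noise.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 2.0 + 0.5 * x + rng.lognormal(sigma=1.0, size=5000)

keep = ~mad_outlier_mask(y)

# Simple least-squares fits before and after filtering on the response.
slope_all, intercept_all = np.polyfit(x, y, 1)
slope_trim, intercept_trim = np.polyfit(x[keep], y[keep], 1)

print(f"kept {keep.sum()} of {len(y)} observations")
print(f"slope (all data):     {slope_all:.3f}")
print(f"slope (after filter): {slope_trim:.3f}")
```

Because the rule is applied to the response itself, it cuts off the long upper tail, so the fitted coefficients and test statistics can change just because of the filtering.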

That being said, I am wondering whether I can trust my linear regression output even though the sample is non-normal and skewed. Or should I use a non-parametric method instead?

Thanks for your time and response in advance.
