For examining the association between two variables, say X and Y, using the Pearson correlation coefficient, the assumption commonly stated in textbooks is that both variables need to be normally distributed, or at least reasonably close to normal.

On the other hand, the assumption for a parametric OLS regression model is that the residuals are normally distributed. In such an analysis, unless there is a very strong relationship between the independent and dependent variables (say X and Y, respectively), the distribution of the residuals is very close to that of the dependent variable, Y. (That is why the commonly stated assumption for regression is that the DV needs to be normally distributed.) So, in a situation where Y (and hence the residuals) is sufficiently normal, but the predictor X is very non-normal (say, skewness outside the range of plus or minus one), the parametric regression model would still satisfy the required criterion and so allow the standardised regression coefficient, beta, to be validly estimated.
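
To make this concrete, here is a minimal simulation sketch in Python (the distributions, sample size, and slope are arbitrary choices for illustration, not from any particular dataset). With a heavily skewed X, normal errors, and a weak X-Y relationship, the residuals, and Y itself, remain close to normal:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
n = 10_000
x = rng.exponential(scale=1.0, size=n)    # strongly skewed predictor
y = 0.2 * x + rng.normal(size=n)          # weak relationship, normal errors

# Fit a simple OLS line and compute the residuals
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

print(skew(x), skew(y), skew(resid))
# x is heavily skewed (skewness around 2), while y and the residuals
# both have skewness near zero
```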

However, the Pearson correlation coefficient is precisely the same as the standardised regression coefficient, beta, obtained from a simple (one-predictor) regression. So there seems to be a conflict between the commonly stated distributional assumptions of the two analyses, each of which estimates the same statistic (correlation or beta).
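
This equivalence is easy to verify numerically. The sketch below uses made-up data (the particular distributions are just an illustrative assumption); it computes Pearson's r and the standardised slope b * sd(X) / sd(Y) from the same sample, and the two agree to machine precision:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)   # skewed predictor
y = 0.3 * x + rng.normal(size=500)         # outcome with normal errors

# Pearson correlation
r = np.corrcoef(x, y)[0, 1]

# Simple OLS slope, then standardise it: beta = b * sd(x) / sd(y)
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta_std = b * np.std(x, ddof=1) / np.std(y, ddof=1)

print(r, beta_std)   # identical values
```

Algebraically, r = cov(X, Y) / (s_X * s_Y) and b = cov(X, Y) / s_X^2, so b * s_X / s_Y = r; the two quantities are the same number by construction, whatever the distributions of X and Y.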

Does the above imply that it should be valid to use the Pearson correlation coefficient when only one of the two variables is normally distributed?
