Your data is your data. Obviously you should try to ensure that it is as accurate as possible, but non-normal data is not necessarily a problem, even with parametric analysis. This is known as the robust use of parametric tests. You should certainly not manipulate your data so that it is normally distributed.
Also, be careful how you interpret these two normality tests: they tend to be under-sensitive for small sample sizes and oversensitive for medium to large sample sizes. For the latter you should also look at histograms and QQ plots, etc.
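For illustration, here is a minimal sketch in R (with a hypothetical skewed variable x standing in for your own data) of a Shapiro-Wilk test alongside the graphical checks mentioned above:

```r
set.seed(1)
x <- rexp(100)                # hypothetical, deliberately skewed sample

shapiro.test(x)               # Shapiro-Wilk test of normality (sensitivity depends on n)

par(mfrow = c(1, 2))
hist(x, main = "Histogram of x")   # visual check of the distribution's shape
qqnorm(x); qqline(x)               # Q-Q plot against a normal distribution
```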
I agree completely with Peter. These tests are not very accurate and are highly influenced by sample size. Moreover, parametric tests are often robust if the sample size is large enough and the group sizes (if there are categorical predictors) are as equal as possible. Nonetheless, most parametric tests assume a normal distribution.
There are some standard transformations, but bear in mind that these are non-linear and, strictly speaking, not permissible if you want to retain the metric scale of your data! Common methods are the log transformation (taking the logarithm of your data) and the square-root transformation (taking the square root).
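As a rough illustration in R (with a small hypothetical vector of positive values), the two transformations mentioned above are simply:

```r
x <- c(0.5, 2, 7, 30, 150, 900)   # hypothetical skewed, strictly positive values

x_log  <- log(x)      # log transformation (requires strictly positive values)
x_sqrt <- sqrt(x)     # square-root transformation (requires non-negative values)

# Compare the shape of the distribution before and after transforming
par(mfrow = c(1, 3))
hist(x); hist(x_log); hist(x_sqrt)
```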
I totally agree with Brian and Peter Samuels. Another suggestion: check for outliers in your data. Removing such responses can also improve normality, provided enough data remain for further analysis after the deletion.
As the answers above suggest, I think you should check for outliers first and then consider data transformation. An outlier may arise from a simple mistake yet significantly affect the outcome of a statistical analysis.
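One simple and commonly used screening rule is the boxplot/IQR rule; this is only a sketch (the 1.5 × IQR cut-off is a convention, not a law, and flagged points should be inspected rather than deleted automatically):

```r
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 48)   # hypothetical data with one extreme value

q     <- quantile(x, c(0.25, 0.75))
iqr   <- IQR(x)
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

x[x < lower | x > upper]    # values flagged as potential outliers
boxplot(x)                  # the boxplot marks the same points
```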
I would cautiously agree with the answers provided above. However, it would be useful to know what the purpose of your analysis is, i.e. is it to assess association between risk factors and an outcome, or is it to predict an outcome (the two are quite different).
For assessing associations, you are probably better off not transforming your data just to make it normal prior to model fitting. Regression techniques do not require your data to be 'normally' distributed. You should not worry too much about this as long as your residuals are well behaved.
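A minimal R sketch of what checking for "well behaved" residuals might look like after fitting an OLS model (the data here are simulated placeholders; substitute your own variables):

```r
set.seed(1)
mydata <- data.frame(x = runif(100, 0, 10))
mydata$y <- 2 + 0.5 * mydata$x + rnorm(100)   # simulated stand-in for real data

fit <- lm(y ~ x, data = mydata)

par(mfrow = c(2, 2))
plot(fit)                         # residuals vs fitted, normal Q-Q, scale-location, leverage

shapiro.test(residuals(fit))      # it is the residuals, not the raw y, that matter here
```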
However, if you're looking to predict the probability of an outcome, there might be some merit in looking at transformations. The caveat here would be that it has to be done wisely. Sometimes, some outlier values can drastically distort the relationship between risk factor and outcome. But these values are real and not due to experimental/observational error. If you remove them, there is a risk of biasing your analysis. A good example would be carcinoembryonic antigen (CEA) levels measured in colorectal cancer patients. CEA values can range anywhere between 0.5 and several thousands in some patients. Overall, the distribution will be skewed. In this case, it might be of use to apply a 'normalizing' transformation such as a log transformation. If you want greater flexibility in transforming your data, look to Box-Cox transformations that might help you achieve what you want to do. Box-Cox transformations can be quite easily performed in R or Stata.
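For example, a Box-Cox search in R can be done with MASS::boxcox applied to a fitted linear model. This is only a sketch with simulated, positively skewed data; your own model formula would go in its place:

```r
library(MASS)

set.seed(1)
d <- data.frame(x = runif(200, 0, 5))
d$y <- exp(1 + 0.6 * d$x + rnorm(200, sd = 0.4))   # positively skewed outcome

bc <- boxcox(lm(y ~ x, data = d), lambda = seq(-2, 2, by = 0.1))  # profile likelihood plot
lambda_hat <- bc$x[which.max(bc$y)]                # lambda with the highest likelihood
lambda_hat                                         # a value near 0 suggests a log transform
```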
While I partially agree with the above, and especially liked Peter's response, I think there are some misconceptions you will often encounter, and I particularly want to comment below on the use of p-values.
Is there some reason that you expect your data should be "normally" distributed? I think it is a common misconception that this should be true in many cases where there is no reason to expect it, or need it. One very useful place to expect to approach 'normality' is not for your data distribution, but for the distribution of the estimated mean, due to the Central Limit Theorem. (The closeness of such an approximation depends upon your population distribution and sample size.) That may be one source of the common assumption that a normal distribution is always desirable; I'd say it has instead served as a justification for placing generally too much attention on that distributional form.
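A quick simulation sketch of that point: even when the data themselves are far from normal, the distribution of the sample mean tends toward normality as n grows (an exponential population is assumed here purely for illustration):

```r
set.seed(1)
pop_draw <- function(n) rexp(n, rate = 1)     # heavily skewed 'population'

means_n5   <- replicate(5000, mean(pop_draw(5)))
means_n100 <- replicate(5000, mean(pop_draw(100)))

par(mfrow = c(1, 2))
hist(means_n5,   main = "Means, n = 5")       # still noticeably skewed
hist(means_n100, main = "Means, n = 100")     # much closer to normal
```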
I don't generally consider transformations to be 'improvements,' as they distort results, but it all depends upon why you think normality is an "improvement" for your purposes.
I have worked a great deal with regression using highly skewed data, which naturally contained considerable heteroscedasticity in the estimated variance of the predicted-y errors. Normality of the estimated random factors of the estimated residuals might be nice in such cases -- 'normal' errors have been studied and used a great deal -- but it is generally not very important there.
Further, many want to estimate a p-value, but p-values are often misleading, as they are not very meaningful by themselves. See the following:
Press release from the American Statistical Association:
It is often far more practically interpretable to consider a confidence interval (or a prediction interval for regression) than to use hypothesis tests, though both are impacted by sample size.
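In R, for instance, confidence and prediction intervals from a fitted regression are readily available (a sketch with simulated data; substitute your own model):

```r
set.seed(1)
d <- data.frame(x = runif(50, 0, 10))
d$y <- 1 + 0.8 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)

confint(fit)                                          # confidence intervals for coefficients
new <- data.frame(x = c(2, 5, 8))
predict(fit, newdata = new, interval = "confidence")  # interval for the mean response
predict(fit, newdata = new, interval = "prediction")  # wider interval for a new observation
```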
I understand that you were using tests to judge closeness to normality, so that is not something that lends itself well to confidence intervals, but using a type II error analysis or power analysis with these tests instead can be very nebulous to interpret. If such testing is used, however, you do need some such sensitivity analysis to judge the meaningfulness of your results. Simply using a graphical representation might sometimes be more useful.
Cheers - Jim
Article Practical Interpretation of Hypothesis Tests - letter to the...
I am very grateful for your contributions. Actually, I just wanted to test the association between two variables using multiple regression analysis, but first I want to explore a few OLS assumptions (normality, autocorrelation, heteroscedasticity and so on) before I run the final analysis.
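In that case, a compact R sketch of the usual OLS assumption checks might look like the following (using the lmtest package for the formal tests; the data here are simulated placeholders for your own variables):

```r
library(lmtest)

set.seed(1)
d <- data.frame(x1 = runif(100), x2 = runif(100))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(100)       # stand-in for your own data

fit <- lm(y ~ x1 + x2, data = d)

shapiro.test(residuals(fit))     # normality of residuals
bptest(fit)                      # Breusch-Pagan test for heteroscedasticity
dwtest(fit)                      # Durbin-Watson test for autocorrelation
par(mfrow = c(2, 2)); plot(fit)  # graphical checks alongside the formal tests
```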