How can I improve upon my data so as to get a normal distribution?

Peter Samuels

Dear Enoch,

Your data is your data. Obviously you should try to ensure that it is as accurate as possible, but non-normal data is not necessarily a problem, even with parametric analysis. This is known as the robust use of parametric tests. You should certainly not manipulate your data so that it is normally distributed.

Also, be careful how you interpret these two normality tests: they tend to be under-sensitive for small sample sizes and oversensitive for medium to large sample sizes. For the latter you should also look at histograms and QQ plots, etc.
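To illustrate the sample-size point, here is a minimal sketch. It does not use the two tests from the question: it swaps in the Jarque-Bera statistic with its asymptotic chi-square p-value, only because that fits in a few lines of standard-library Python. The "samples" are an assumption too: evenly spaced normal quantiles bent by a small quadratic term, so the same mild skew is present at every sample size.

```python
import math
from statistics import NormalDist, fmean

def jarque_bera_p(xs):
    """Asymptotic Jarque-Bera p-value (chi-square with 2 df, so p = exp(-JB/2))."""
    n = len(xs)
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m3 = fmean([(x - m) ** 3 for x in xs])
    m4 = fmean([(x - m) ** 4 for x in xs])
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurtosis ** 2 / 4.0)
    return math.exp(-jb / 2.0)

def mildly_skewed_sample(n, a=0.05):
    """Idealised sample: normal quantiles z at evenly spaced probabilities, bent to z + a*z**2."""
    z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return [v + a * v * v for v in z]

# Same shape of distribution, very different verdicts depending on n.
p_small = jarque_bera_p(mildly_skewed_sample(20))
p_large = jarque_bera_p(mildly_skewed_sample(5000))
print(f"n = 20:   p = {p_small:.3f}")
print(f"n = 5000: p = {p_large:.3g}")
```

The departure from normality is identical in both cases; only the sample size changes the p-value, which is why a histogram or QQ plot is worth looking at alongside any test. The asymptotic p-value is itself rough at n = 20, so treat the small-sample number as indicative only.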

Gregory L Wilson

Research "batch means". It may be useful.

Brian Schwartz

I agree completely with Peter. These tests are not very accurate and are highly influenced by the sample size. Moreover, parametric tests are often robust if the sample size is big enough and the group sizes (if there are categorical predictors) are as close to equal as possible. Nonetheless, most parametric tests assume a normal distribution.

There are some standard transformations, but bear in mind that they are non-linear and, strictly speaking, not allowed if you want to retain the metric scale of your data! Common methods are the log transformation (taking the logarithm of your data) and the square root transformation (taking the square root).
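A minimal sketch of both transformations, assuming strictly positive, right-skewed data. The values in `raw` are made up for illustration, and the moment-based `skewness` helper is just one of several common definitions.

```python
import math
from statistics import fmean

def skewness(xs):
    """Moment-based sample skewness: third central moment over variance**1.5."""
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m3 = fmean([(x - m) ** 3 for x in xs])
    return m3 / m2 ** 1.5

# Hypothetical right-skewed, strictly positive measurements (illustration only).
raw = [1.2, 1.5, 1.9, 2.3, 2.8, 3.6, 4.9, 7.4, 12.0, 25.0, 60.0]

sqrt_x = [math.sqrt(x) for x in raw]   # square root transformation
log_x = [math.log(x) for x in raw]     # log transformation (natural log)

print("skewness raw :", round(skewness(raw), 2))
print("skewness sqrt:", round(skewness(sqrt_x), 2))
print("skewness log :", round(skewness(log_x), 2))

# The scale changes: back-transforming the mean of the logs gives the
# geometric mean, not the arithmetic mean of the original data.
arithmetic_mean = fmean(raw)
geometric_mean = math.exp(fmean(log_x))
print("arithmetic mean:", round(arithmetic_mean, 2))
print("geometric mean :", round(geometric_mean, 2))
```

The last two lines show the cost of a non-linear transformation: a mean computed on the log scale no longer describes the original metric, so any result has to be interpreted on the transformed scale or back-transformed with care.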

Raheela maula-bakhsh

I totally agree with Brian and Peter Samuels. Another suggestion is to check for outliers in your data. Removing such responses can also affect normality, provided that enough data remain for further analysis after they are deleted.
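One way to check for outliers is sketched below, assuming the common 1.5 x IQR rule (the rule and the threshold `k` are assumptions, not something prescribed in this thread). The `responses` values are made up. The function only flags values for inspection; whether a flagged value is an error or a real observation still has to be decided case by case.

```python
from statistics import quantiles

def flag_outliers(xs, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(xs, n=4)          # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < low or x > high]

# Hypothetical responses; 58.0 could be a data-entry slip for 5.8, or it could be real.
responses = [4.1, 4.4, 4.6, 4.8, 5.0, 5.1, 5.3, 5.5, 5.8, 58.0]
flagged = flag_outliers(responses)
print("flagged for inspection:", flagged)
```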

Xu Wumei

As the answers above suggest, I think you should check for outliers first and then consider a data transformation. An outlier may be attributable to a mistake yet still significantly affect the outcome of the statistical analysis.

Best wishes!

Sreemanee Raaj Dorajoo

Dear Enoch,

I would cautiously agree with the answers provided above. However, it would be useful to know what the purpose of your analysis is, i.e. is it to assess the association between risk factors and an outcome, or is it to predict an outcome (the two are quite different)?

For assessing associations, you are probably better off not transforming your data just to make it normal prior to model fitting. Regression techniques do not require your data to be 'normally' distributed. You should not worry too much about this as long as your residuals are well behaved.

However, if you're looking to predict the probability of an outcome, there might be some merit in looking at transformations. The caveat here would be that it has to be done wisely. Sometimes, some outlier values can drastically distort the relationship between risk factor and outcome. But these values are real and not due to experimental/observational error. If you remove them, there is a risk of biasing your analysis. A good example would be carcinoembryonic antigen (CEA) levels measured in colorectal cancer patients. CEA values can range anywhere between 0.5 and several thousand in some patients. Overall, the distribution will be skewed. In this case, it might be of use to apply a 'normalizing' transformation such as a log transformation. If you want greater flexibility in transforming your data, look to Box-Cox transformations that might help you achieve what you want to do. Box-Cox transformations can be quite easily performed in R or Stata.
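For readers who want to see what a Box-Cox transformation does, here is a hand-rolled sketch in standard-library Python. It is not the R or Stata routine: it simply picks the power `lam` on a coarse grid by maximising the usual profile log-likelihood. The `cea` values are hypothetical numbers chosen to span the range described above, not patient data.

```python
import math
from statistics import fmean

def box_cox(xs, lam):
    """Box-Cox transform for positive data: (x**lam - 1)/lam, or log(x) when lam is 0."""
    if abs(lam) < 1e-12:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1.0) / lam for x in xs]

def profile_loglik(xs, lam):
    """Profile log-likelihood of lam under a normal model for the transformed data."""
    y = box_cox(xs, lam)
    m = fmean(y)
    var = fmean([(v - m) ** 2 for v in y])   # maximum-likelihood variance (divide by n)
    return -0.5 * len(xs) * math.log(var) + (lam - 1.0) * sum(math.log(x) for x in xs)

# Hypothetical CEA-like values (illustration only).
cea = [0.5, 0.9, 1.4, 2.0, 2.7, 3.5, 5.1, 8.0, 15.0, 42.0, 130.0, 900.0, 3200.0]

grid = [i / 100 for i in range(-200, 201)]   # candidate powers from -2 to 2
best_lam = max(grid, key=lambda lam: profile_loglik(cea, lam))
transformed = box_cox(cea, best_lam)
print("selected lambda:", best_lam)
```

A selected power near 0 means the log transformation is close to what Box-Cox would choose; a power near 1 means the data need little or no transformation.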

All the best.

Manee

James R Knaub

Enoch -

While I partially agree with the above, and especially liked Peter's response, I think that there are some misconceptions which you will often encounter, and I especially want to comment below on the use of a p-value.

Is there some reason that you expect your data should be "normally" distributed? I think it is a common misconception that this should be true in many cases where there is no reason to expect it, or need it. One very useful place to expect to approach 'normality' is not for your data distribution, but for distribution of the estimated mean due to the Central Limit Theorem. (The nearness of such an approximation depends upon your population distribution, and sample size.) That may be one source of the common assumption that a normal distribution is always desirable. Instead, I'd say it is a justification for having otherwise generally too much attention placed on that distributional form.
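The Central Limit Theorem point can be checked with a small simulation. This is a sketch under assumptions: the population is taken to be exponential (strongly right-skewed), the sample size is 30, and the seed and replication count are arbitrary.

```python
import random
from statistics import fmean

def skewness(xs):
    """Moment-based sample skewness: third central moment over variance**1.5."""
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m3 = fmean([(x - m) ** 3 for x in xs])
    return m3 / m2 ** 1.5

rng = random.Random(0)        # fixed seed so the run is repeatable
n, replications = 30, 2000    # assumed sample size and number of simulated samples

# Individual observations from a skewed population (exponential with mean 1).
raw = [rng.expovariate(1.0) for _ in range(replications)]

# Means of repeated samples of size n from the same population.
means = [fmean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(replications)]

print("skewness of raw data    :", round(skewness(raw), 2))
print("skewness of sample means:", round(skewness(means), 2))
```

The data stay skewed; it is the distribution of the estimated mean that moves toward normality, and how close it gets depends on the population distribution and the sample size.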

I don't generally consider transformations to be 'improvements,' as they distort results, but it all depends upon why you think normality is an "improvement" for your purposes.

I worked a great deal with regression using highly skewed data, which naturally contained considerable heteroscedasticity for the estimated variance of predicted y value errors. Normality of the estimated random factors of the estimated residuals, in such cases, might be nice -- 'normal' 'errors' have been studied and used a great deal -- but it is generally not very important there.

Further, many want to estimate a p-value, but those are often misleading, as they are not very meaningful by themselves. See the following:

Press release for the American Statistical Association:

http://www.amstat.org/newsroom/pressreleases/P-ValueStatement.pdf

and note my caution about this nearly three decades ago:

https://www.researchgate.net/publication/262971440_Practical_Interpretation_of_Hypothesis_Tests_-_letter_to_the_editor_-_TAS

It is often far more practically interpretable to consider a confidence interval (or a prediction interval for regression) than to use hypothesis tests, though both are impacted by sample size.

I understand that you were using tests to judge closeness to normality, so that is not something that lends itself well to looking at confidence intervals, but to use a type II error analysis or power analysis with these tests instead can be very nebulous to interpret. But if such testing were used, then you do need some such sensitivity analysis to judge the meaningfulness of your results. Just using the graphical representation might sometimes be more useful.

Cheers - Jim

Enoch Setor

I am very grateful for your contributions. Actually, I just wanted to test the association between two variables using multiple regression analysis, but I want to check a few OLS assumptions (normality, autocorrelation, heteroscedasticity and so on) before I run the final test.

James R Knaub

Enoch -

It sounds like you might find graphical residual analysis helpful. For example, please see the following: https://onlinecourses.science.psu.edu/stat501/node/36

You can study features that way which may give you good insight as to data relationships, including heteroscedasticity.

I carried that a bit further in one way to estimate heteroscedasticity for finite population modeling:

https://www.researchgate.net/publication/263809034_Alternative_to_the_Iterated_Reweighted_Least_Squares_Method_-_Apparent_Heteroscedasticity_and_Linear_Regression_Model_Sampling

There are other ways to estimate a coefficient of heteroscedasticity:

https://www.researchgate.net/publication/263032446_Weighting_in_Regression_for_Use_in_Survey_Methodology

Anyway, here is the basic definition, geared toward finite population statistics (though there can also be heteroscedasticity in a time series), which I wrote for a Sage encyclopedia, in case you are interested:

https://www.researchgate.net/publication/262972023_HETEROSCEDASTICITY_AND_HOMOSCEDASTICITY

and here is a discussion of the use it has for weighted least squares regression, for the simplest of models, but extension to more complex ones, including nonlinear and multiple regression models, often can be seen by substituting a preliminary prediction for y as a size variable, in place of x:

https://www.researchgate.net/publication/263036348_Properties_of_Weighted_Least_Squares_Regression_for_Cutoff_Sampling_in_Establishment_Surveys

OLS is a special case for WLS (weighted least squares) which is often used as a default, when it often should not be. If you know there is substantial heteroscedasticity, using a default which assumes it is not there is not a good idea. In particular, you will find substantial natural occurrence of heteroscedasticity for regression through (to) the origin.
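A minimal sketch of OLS as a special case of WLS, for the simplest model: a single regressor through the origin, y = b*x + e. It assumes the error variance is proportional to x**(2*gamma), so the regression weights are x**(-2*gamma); the name `gamma` and this parameterisation are assumptions made for the illustration, and the data are made up.

```python
def wls_slope_through_origin(x, y, gamma):
    """Slope of y = b*x + e, assuming Var(e) proportional to x**(2*gamma).

    The regression weights are w = x**(-2*gamma); gamma = 0 gives OLS.
    """
    w = [xi ** (-2.0 * gamma) for xi in x]
    numerator = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    denominator = sum(wi * xi * xi for wi, xi in zip(w, x))
    return numerator / denominator

# Hypothetical data in which the scatter grows with x.
x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
y = [1.3, 1.8, 4.9, 7.1, 18.5, 29.0]

b_ols = wls_slope_through_origin(x, y, 0.0)    # equal weights: sum(x*y) / sum(x*x)
b_ratio = wls_slope_through_origin(x, y, 0.5)  # weights 1/x: sum(y) / sum(x)
print("OLS slope          :", round(b_ols, 4))
print("WLS slope (gamma=.5):", round(b_ratio, 4))
```

With gamma = 0.5 the estimator reduces to the classical ratio estimator sum(y)/sum(x), which shows how strongly the choice of weights decides which observations drive the slope: OLS lets the largest x values dominate.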

Cheers - Jim

PS - Exploring relationships, as you noted that you are doing, can be very enlightening. For continuous data, I highly recommend that you study scatterplots as part of your research.
