You had better think about why you observe a non-normal distribution!
If your data is, for instance, bi-modal, then you should identify the factor that causes this. If your data is necessarily positive and right-skewed, you may be better off looking at proportional changes anyway (-> logarithms!). If your data is bounded (like a proportion between 0 and 1) or discrete (like counts), you should consider using an appropriate analysis (for proportions or counts).
If the distribution of your data (more precisely: of the residuals) is unimodal and more or less symmetric, n > 10 usually is large enough. If your data (residuals) are severely skewed, or if there are other obvious patterns, you had better think about whether you are using the right model and asking the right questions.
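To illustrate the logarithm hint with a minimal Python sketch (the data are simulated and purely hypothetical): a log transform turns necessarily positive, right-skewed values into roughly symmetric ones, so proportional changes become the natural scale of analysis.

```python
# Minimal sketch with simulated (hypothetical) data: a log transform makes
# necessarily positive, right-skewed values roughly symmetric.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.0, sigma=0.8, size=50)   # positive, right-skewed

print("skewness, raw scale:", stats.skew(x))          # clearly positive
print("skewness, log scale:", stats.skew(np.log(x)))  # close to zero
```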
For the Lilliefors test of normality, N = 4 is the minimum sample size. To my knowledge this is the smallest sample size required by any of these tests. Obviously, the larger the sample, the higher the probability that normality is rejected; it depends on what you want to demonstrate.
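A rough sketch of that last point, using simulated data and the Lilliefors test as implemented in statsmodels (assuming that implementation is available): the same mild deviation from normality goes unnoticed at small n but tends to be rejected at large n.

```python
# Rough sketch (simulated data): the same mild deviation from normality is
# typically not rejected at small n but is rejected at large n.
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(1)

for n in (20, 200, 2000):
    x = rng.gamma(shape=20.0, scale=1.0, size=n)  # mildly skewed, close to normal
    stat, p = lilliefors(x, dist="norm")
    print(f"n = {n:5d}   p = {p:.4f}")
```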
As Jochen hints already: "think why you observe a non-normal distribution"
I would like to add:
There are two possible reasons for observing a non-normal distribution:
1. The phenomenon you study has a non-normal distribution (counts, waiting times, proportions, perceived intensity of sound, etc.). Or: the variable in the study has a non-normal distribution in the population. Changing/increasing the sample size will not change that. You should choose the correct distribution in the analysis (see the sketch below); if you do not, the analysis model will not fit the data.
2. The normality in the population is not reflected in the sample; changing/increasing the sample size may change that. If it does not, case 1 applies.
For some types of analyses you may neglect non-normality in the sample when the assumption refers to a normal distribution in the population. Other analyses can be robust against violations of the normality assumption.
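To make point 1 concrete with a hypothetical case (the data below are simulated purely for illustration): if the outcome is a count, a model built for counts, such as a Poisson GLM, is the appropriate choice, whereas a normal-theory model would not fit the data.

```python
# Hypothetical example for point 1: a count outcome analyzed with a model
# that matches its distribution (Poisson GLM) rather than assuming normality.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 2, size=100)                   # some predictor
counts = rng.poisson(lam=np.exp(0.5 + 0.8 * x))   # count outcome, not normal

X = sm.add_constant(x)
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(fit.summary())
```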
Furthermore, I would add:
Statisticians count like this: 1, 2, 3, 4, 5, ..., 27, 28, 29, infinity; meaning that 30 looks like infinity, that is, 30 cases per cell, or per group.
Thank you all for your hints. I might have expressed myself incompletely. I somehow automatically expected my data to have a non-normal distribution.
Most of my data comes from questionnaires, especially Likert-like scales, so I should probably have asked first what statistical method to use, for example to determine correlations and significant differences between groups. Should I start by assessing normality (skewness)?
Likert data is ordinal. It is not even interval-scaled, so analyses based on interval-scaled data may not produce meaningful results when used with Likert data.
You blind yourself by using a numerical coding for the Likert categories. It makes little sense to say that "strongly disagree" has a value of -2 and "agree" has a value of +1 (for instance). It makes no sense to state that the average of "strongly disagree" (-2) and "agree" (+1) is -0.5 and would thus lie somewhere between "disagree" and "neutral". This is really beside the point and not at all purposeful. Try NOT to use numbers to represent the categories. Stay with a textual representation (like "disagree", "fully agree", etc.) and think about how such data could be presented, summarized, and analyzed.
Likert data is ordinal. So you should use methods that work with ordinal data. Ordered logistic regression is one of these methods.
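A hedged Python sketch of such ordinal-appropriate analyses, using made-up responses (the data, group labels, and items are purely hypothetical): Spearman rank correlation and a Mann-Whitney test for the correlation and group-difference questions raised above, plus an ordered (proportional-odds) logistic regression via statsmodels' OrderedModel.

```python
# Hedged sketch with made-up Likert responses (codes 0..4): rank-based
# correlation, a rank-based group comparison, and ordered logistic regression.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(3)
n = 120
group = rng.integers(0, 2, size=n)                          # two groups
item1 = np.clip(rng.poisson(2 + group), 0, 4)               # hypothetical item
item2 = np.clip(item1 + rng.integers(-1, 2, size=n), 0, 4)  # related item

# Association and group comparison without assuming an interval scale
print(stats.spearmanr(item1, item2))
print(stats.mannwhitneyu(item1[group == 0], item1[group == 1]))

# Ordered (proportional-odds) logistic regression of item1 on group
response = pd.Series(pd.Categorical(item1, categories=[0, 1, 2, 3, 4], ordered=True))
fit = OrderedModel(response, pd.DataFrame({"group": group}), distr="logit").fit(method="bfgs")
print(fit.summary())
```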
Another nice example. Although it is somewhat drastic and not very realistic, it highlights the important issue well:
Consider a patient health status measured on a Likert scale: dead - very ill - ill - feeling a bit bad - feeling ok. Now consider that the categories are coded by numbers from 0 (dead) to 4 (feeling ok).
You test a drug on 100 very ill patients. The average of your numbers is 1. After the treatment, the average is 1.5, indicating that the health status has improved.
Would you use this drug?
This result is possible for many different scenarios. For instance, it could be that after the treatment 50 patients were just "ill" (which is an improvement). In this case, taking the drug is expected to have a 50% chance of improving your health status from "very ill" to "ill". In this case I would use it. But the same result is obtained when 75 patients improved to "ill" but 25 patients died. I would not take the drug if the chance to die from taking it is 25%!
An even stranger scenario is that about 37 patients (call them "responders") are healed by the drug (-> "feeling ok"), but all others die. The average of your numbers is still about 1.5. Being killed with 63% probability is by no means a good thing. But if one could identify the reason why some people respond and others don't, one could give this drug to responders only (and heal them!) while making sure that nobody dies. But this would need further research!
All these relevant aspects and insights are neither addressed nor revealed by analyzing means of nonsensical numerical codings!
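For completeness, the arithmetic behind these scenarios can be checked directly; the small Python snippet below simply reproduces the codings described above (0 = dead ... 4 = feeling ok) and shows that the very different outcomes all yield an average of about 1.5.

```python
# The arithmetic behind the example: very different outcomes, same average.
import numpy as np

baseline   = np.full(100, 1)                   # 100 patients, all "very ill"
scenario_a = np.array([2] * 50 + [1] * 50)     # 50 improve to "ill"
scenario_b = np.array([2] * 75 + [0] * 25)     # 75 improve to "ill", 25 die
scenario_c = np.array([4] * 37 + [0] * 63)     # 37 healed, 63 die

for name, x in [("baseline", baseline), ("scenario A", scenario_a),
                ("scenario B", scenario_b), ("scenario C", scenario_c)]:
    print(name, x.mean())
```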