No. Check the central limit theorem in any mathematical statistics book. To begin with, it is about a sample mean. Are you dealing with a sample mean? Next, it says that under certain conditions the distribution of a standardized sample mean approaches a standard normal distribution as the sample size tends to infinity; it does not say what the rate of convergence is. The theorem also requires a simple random sample, and you have not mentioned anything about how you took the sample. I would suggest you begin with some sort of probability plot to see whether the sample looks at least approximately normal. If it does, there are statistical tests of normality; with a sample size of 400, I would recommend a Kolmogorov-Smirnov test, assuming that your sample is a simple random sample. All of these things and more go into using the central limit theorem. It does not say that if you have 400 of something, those 400 things follow a normal distribution. Please use theorems carefully or you will get wrong answers.
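For example, a minimal sketch of both checks in R (the vector x here is placeholder data; replace it with your own sample):

```r
# Placeholder data standing in for your 400 observations
set.seed(1)
x <- rnorm(400, mean = 50, sd = 10)

# Normal probability plot
qqnorm(x)
qqline(x)

# Kolmogorov-Smirnov test against a normal with parameters estimated from the data
# (strictly, estimating the parameters makes the standard K-S p-value approximate;
# the Lilliefors correction addresses this)
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
```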
Thank you very much, David, for your answer. I heard this piece of information from a friend, but I wasn't convinced by it. That's why I thought it better to ask experts in the field of statistics.
Going back to your recommendation of the Kolmogorov-Smirnov test: it is a very sensitive test, and even if the data look normally distributed by visual methods, the test might still indicate that they are not. Are there any other tests I can use to check for normality?
A very simple visual check of whether a dataset follows a given distribution is a QQ plot. In this technique you compare the quantiles of your data against the quantiles of a reference distribution (a normal in your case). If the points lie close to a straight line, then the data are consistent with the reference distribution. A QQ plot can easily be produced in R with the function qqplot() (or qqnorm() for the common normal case). An example of the QQ implementation is here: http://data.library.virginia.edu/understanding-q-q-plots/
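For instance, a small sketch of exactly that quantile-against-quantile comparison in R, assuming your data are in a vector x (the exponential draws here are placeholders):

```r
x <- rexp(400)  # placeholder data; replace with your own

# Theoretical normal quantiles at evenly spaced probability points,
# using the sample mean and SD as the reference parameters
theo <- qnorm(ppoints(length(x)), mean = mean(x), sd = sd(x))

qqplot(theo, x, xlab = "Theoretical normal quantiles", ylab = "Sample quantiles")
abline(0, 1)  # points close to this line suggest approximate normality
```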
Thank you Augusto for your reply. Yes that example is helpful.
Hello Raid, I applied an unpaired t-test, for which the nonparametric version is the Mann-Whitney U test. Actually, I applied both and got similar results. But I need to plot some graphs using mean values (not possible with the median, as most groups had the same median). That's why I want to make sure that my data are normally distributed.
What about the values of skewness and kurtosis for assessing normality? Will they help?
I still recommend the K-S test; it has been a standard for many years. A normal probability plot and a K-S test or a Shapiro-Wilk test should pin it down. Best, David
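For what it's worth, a quick R sketch covering both the Shapiro-Wilk test and the skewness/kurtosis question above (the skewness and kurtosis are the usual moment-based versions, computed directly so no extra package is needed; x is placeholder data):

```r
x <- rnorm(400)  # placeholder; replace with your sample

shapiro.test(x)  # Shapiro-Wilk; valid for n between 3 and 5000, so fine for n = 400

# Moment-based sample skewness and excess kurtosis
z <- (x - mean(x)) / sd(x)
c(skewness = mean(z^3),            # near 0 for symmetric data
  excess_kurtosis = mean(z^4) - 3)  # near 0 for normal-like tails
```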
I would use normal quantiles (z scores) in place of the raw data. Have you ever used such a ranks method? It is explained in William Conover's text Practical Nonparametric Statistics. It limits the effect of any existing outliers and also makes the results easier to interpret.
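If it helps, here is my understanding of the idea as a short R sketch (van der Waerden-type scores; the data are made up to show the effect on an outlier):

```r
x <- c(12, 15, 14, 300, 16, 13)  # made-up data with one extreme outlier

# Map each observation's rank to a standard normal quantile
n <- length(x)
z_scores <- qnorm(rank(x) / (n + 1))
z_scores
# The outlier at 300 still gets the largest score, but its influence is
# bounded by its rank rather than its raw magnitude
```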
If most groups had the same median value, did you actually start out with a continuous variable, or did you measure the outcomes with a Likert scale? If you have discrete values, you can show descriptive statistics on the means and the standard deviations.
Raid, I have never used such a ranks method before; I will read about it and try to apply it if appropriate.
Yes, I did measure outcomes with a Likert scale. I know some people consider it continuous while others consider it discrete. So is it acceptable to present descriptive statistics on the means and SDs (mainly to draw graphs) when applying non-parametric tests?
If you used a Likert scale, each value is discrete, while averages are viewed by many as continuous. You could also display a side-by-side bar chart of the individual Likert scores.
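Something like this in R, for instance (the response vectors are hypothetical 1-5 Likert scores):

```r
# Hypothetical Likert responses (1-5) for two groups
set.seed(1)
staff   <- sample(1:5, 200, replace = TRUE)
patient <- sample(1:5, 200, replace = TRUE)

counts <- rbind(Staff   = table(factor(staff,   levels = 1:5)),
                Patient = table(factor(patient, levels = 1:5)))
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "Likert score", ylab = "Count")
```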
I'm plotting a simple scattergram to correlate data from two surveys (staff and patient perceptions of quality). The relationship was clear when I used means, but when I considered a non-parametric test (for the patient data only, as I had doubts about its normality) and replaced the patients' means with medians, the graph made no sense (as most had the same median).
In this case, can I still run a non-parametric test but use means to draw the scattergram?
Yes, you can; this is often done. The nonparametric test keeps the inferential part intact, while using the actual scores shows what is really there. I have recently worked with such data on hospital patients' and nurses' perceptions of quality of service.
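As a sketch of what that could look like in R (the mean scores below are made up):

```r
# Hypothetical per-unit mean scores from the two surveys
staff_means   <- c(3.2, 3.8, 4.1, 2.9, 3.5, 4.0, 3.1, 3.7)
patient_means <- c(3.0, 3.9, 4.3, 2.7, 3.4, 4.2, 3.2, 3.6)

# Spearman's rank correlation keeps the inference nonparametric...
cor.test(staff_means, patient_means, method = "spearman")

# ...while the scattergram is drawn with the actual mean scores
plot(staff_means, patient_means,
     xlab = "Staff mean score", ylab = "Patient mean score")
```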
No, you can't assume normality for the items themselves, but you can compute the total score for each individual (observation); the total-score variable (n x 1) may then be normally distributed. It is necessary to check the normality of that variable, for example with the Statgraphics package, by drawing the histogram and checking the skewness and kurtosis, or by using the Kolmogorov-Smirnov test or a chi-square goodness-of-fit test.
If I suspect non-normality, I switch to a nonparametric method and move on with the statistical analysis. In the past 35 years of doing statistical consulting I have not looked into kurtosis or skewness beyond inspecting a plot of the data. Keeping things simple works very well.
I think that the question that you really want to ask is "If my sample is large, can I use parametric statistics with a non-normal distribution of the data (or more precisely, a non-normal distribution of the residuals)".
In brief, my understanding is that the answer to the revised question is yes. But I have not come across a definitive paper that gives sample-size guidelines for different types of parametric procedures and for different levels of deviation from normality (e.g., values of skewness greater than one). However, I do remember coming across a couple in the distant past that looked just at the t-test, and the required numbers were much lower than your N of 400.
However, I am not sure whether this allows you to avoid more advanced procedures, such as generalized linear (mixed) models, that are specifically designed to model non-normal data. These are commonly used in the analysis of large data sets by statisticians with much higher levels of expertise than mine, so I assume there must be good reasons. (E.g., in highly skewed data there is commonly a tendency for greater variance in scores among cases with higher scores, leading to biased estimates: the so-called mean-variance association problem.) Also, in certain contexts with highly skewed data, one might question the meaningfulness of modelling mean values (rather than, say, medians or percentiles), which is what parametric analyses would give you.
You may want to have a look at this paper (10.1177/1073191116669784).
For its use in the calculation of sample size for normative data, the authors analyzed the minimum sample size at which both mean and standard deviation estimates remained within the 90% confidence intervals surrounding the population estimates for different levels of skewness. They reported that "Sample sizes of greater than 85 were found to generate stable means and standard deviations regardless of the level of skewness, with smaller samples required in skewed distributions".
In general, you can't. Given a sample X1, X2, ..., Xn of i.i.d. random variables, the CLT requires finite mean and variance in order to work. So, for instance, if your Xj's are Cauchy-distributed (the mean and variance do not exist), the CLT (as well as the LLN, for that matter) does not apply. In other words, the sample mean of n Cauchys is still Cauchy.
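A quick simulation makes the point: the spread of 10,000 sample means (each from n = 400 standard Cauchy draws) is as wide as a single Cauchy observation.

```r
set.seed(1)

# 10,000 sample means, each computed from 400 standard Cauchy draws
means <- replicate(10000, mean(rcauchy(400)))

# For a distribution obeying the CLT these would be tightly clustered
# around the center; for the Cauchy they are not
quantile(means, c(0.025, 0.5, 0.975))
```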