ANOVA is based on the following assumptions: a) each sample has been randomly selected from the population it represents; b) the distribution of the data in the underlying population from which each sample is derived is normal; c) homogeneity of variance (the variances of the k underlying populations represented by the k samples are equal to one another). If any of these assumptions is saliently violated, the reliability of the computed test statistics may be compromised.

Like other parametric tests, the analysis of variance assumes that the data fit the normal distribution. If your measurement variable is not normally distributed, you may be increasing your chance of a false-positive result if you analyze the data with an ANOVA or another test that assumes normality. Fortunately, an ANOVA is not very sensitive to moderate deviations from normality; simulation studies, using a variety of non-normal distributions, have shown that the false-positive rate is not affected very much by this violation of the assumption (Glass et al. 1972, Harwell et al. 1992, Lix et al. 1996). This is because when you take a large number of random samples from a population, the means of those samples are approximately normally distributed even when the population is not.

It is possible to test the goodness of fit of a data set to the normal distribution, but because you have a large enough data set, I suggest you just look at the frequency histogram. If it looks more or less normal, go ahead and perform the ANOVA. If it looks like a normal distribution that has been pushed to one side, try different data transformations and see whether any of them make the histogram look more normal. Alternatively, you may want to analyze the data with a non-parametric test. Just about every parametric statistical test has a non-parametric substitute, such as the Kruskal–Wallis test instead of a one-way ANOVA, the Wilcoxon signed-rank test instead of a paired t-test, and Spearman rank correlation instead of linear regression. These non-parametric tests do not assume that the data fit the normal distribution. They do assume, however, that the data in the different groups have the same distribution as each other; if different groups have differently shaped distributions (for example, one skewed to the left and another skewed to the right), a non-parametric test may not be any better than a parametric one.
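To make the histogram check and the non-parametric fallback concrete, here is a minimal Python sketch (my own illustration, not part of the original answer) using hypothetical one-way data and SciPy's `f_oneway` and `kruskal`:

```python
# Minimal sketch of the eyeball check and the non-parametric fallback,
# using hypothetical (right-skewed) one-way data.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
groups = [rng.lognormal(mean=m, sigma=0.5, size=200) for m in (0.0, 0.1, 0.2)]

# Eyeball check: does the pooled measurement variable look roughly normal?
plt.hist(np.concatenate(groups), bins=40)
plt.title("Frequency histogram of the measurement variable")
plt.show()

# Parametric one-way ANOVA and its non-parametric substitute.
f_stat, p_anova = stats.f_oneway(*groups)
h_stat, p_kw = stats.kruskal(*groups)
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_kw:.4f}")
```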
@Nada, when group sizes are unequal we speak of an unbalanced ANOVA. I agree that a wider spectrum of methods is available for analysing a balanced design, but balance is not a must.
The question "to what extent it is necessary to verify the normality" is strongly related to asking "at what extent of non-normality is ANOVA an acceptable compromise between simplicity and bias". What is an acceptable compromise is always a question of the application of the results.
Various results of an ANOVA may be biased to various extents; e.g., the group-effect estimates may be acceptably accurate while the p-values of a post-hoc test are not.
In the case of your large sample, bootstrap methods should be reliable, so I would bootstrap the ANOVA model and see whether the bootstrap makes any qualitative difference in the results.
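As an illustration of what such a bootstrap might look like, here is a rough Python sketch (my own, with hypothetical data) that resamples under the null hypothesis and compares the observed F statistic with its bootstrap distribution; it is only one of several ways to bootstrap an ANOVA:

```python
# Null-resampling bootstrap of a one-way ANOVA F statistic (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
groups = [rng.gamma(shape=2.0, scale=1.0, size=100) + d for d in (0.0, 0.2, 0.4)]

f_obs, _ = stats.f_oneway(*groups)

# Pool all observations (this removes any group effect, i.e. enforces H0)
# and redraw samples of the original sizes with replacement.
pooled = np.concatenate(groups)
sizes = [len(g) for g in groups]
n_boot = 2000
f_boot = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(pooled, size=sum(sizes), replace=True)
    pieces = np.split(resample, np.cumsum(sizes)[:-1])
    f_boot[b], _ = stats.f_oneway(*pieces)

p_boot = np.mean(f_boot >= f_obs)
print(f"observed F = {f_obs:.2f}, bootstrap p = {p_boot:.4f}")
```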
What are your research questions? I'd recommend the chapters on ANOVA in Andy Field's books for a brief review of the consequences of violating the normality assumption. It may also be helpful to check some of the texts on robust statistical tests by Rand Wilcox.
First, remember that it's the *error* distribution that should be normally distributed. Given that, generally speaking it is critical to verify normality. The degree to which computed statistics such as p and F diverge from their true values will be a function of the magnitude of the deviation from normality; skewed data produce skewed estimates of the DV. However, I wouldn't immediately go non-parametric. Instead, I would first determine whether the data can be transformed to normality or analyzed with a generalized (not general) linear model with a different error distribution.
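For example, a generalized linear model with a different error family could be fitted along these lines (a hedged sketch with hypothetical variable names; the Gamma family with a log link is just one plausible choice for right-skewed responses, not a recommendation for this particular data set):

```python
# Ordinary linear model (normal errors) versus a GLM with Gamma errors,
# fitted to hypothetical right-skewed data with statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 100),
    "y": rng.gamma(shape=2.0, scale=1.0, size=300),
})

# ANOVA-style linear model with normal errors ...
ols_fit = smf.ols("y ~ C(group)", data=df).fit()
# ... versus a GLM with Gamma errors and a log link.
glm_fit = smf.glm("y ~ C(group)", data=df,
                  family=sm.families.Gamma(link=sm.families.links.Log())).fit()
print(glm_fit.summary())
```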
Yes, it is Discovering Statistics Using SPSS (Introducing Statistical Methods). You'll find a brief review of the consequences of violating the assumptions, and his references point to relevant authors for a more comprehensive review.
In 'Statistics at Square One', 11th ed. (Campbell MJ and Swinscow TDV, BMJ Books, 2009), on page 83 is my response to the question 'Should I test my data for normality before using a t-test?'. Basically, the answer is that a formal test is a waste of time, particularly if the sample sizes are similar in the groups, but always do an eye-ball test, particularly to check for outliers. If the outcome is from a randomized experiment, then RA Fisher wrote (in The Design of Experiments) '.. the physical act of randomization .. affords the means in respect of any particular body of data, of examining the wider hypothesis in which no normality of distribution is implied'. In other words, if the outcome is from a randomised study, then one is to some extent protected from requiring the normality assumption, and I don't recall an instance where he did check for normality (though I could be proven wrong).
ANOVA does not merely lose power (that is, fail to reject the null although there is a real effect) if the underlying assumptions* are not met; ANOVA can simply produce dead-wrong results, giving you significance where there is none. And I don't want to know how many results in science are based on this mistake...
*If only the normality assumption is not met, though, ANOVA only loses power.
Felix, every test at alpha = 5% guarantees you 5% false-positive results among tests performed on true H0s. The question is whether this rate is higher when the normality assumption is not met. Since the variance is often over-estimated when the assumption is violated, I think it is likely that this rate is even lower than 5%. But I would be happy to refine my prejudices.
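One can check this kind of prejudice by simulation. The sketch below (my own, purely illustrative) estimates the empirical type I error rate of a one-way ANOVA when H0 is true but the data come from a skewed rather than a normal distribution:

```python
# Estimate the false-positive rate of one-way ANOVA under a true H0
# with skewed (exponential) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sim, n_per_group, alpha = 5000, 20, 0.05
rejections = 0
for _ in range(n_sim):
    # Three groups drawn from the SAME skewed distribution, so H0 is true.
    groups = [rng.exponential(scale=1.0, size=n_per_group) for _ in range(3)]
    _, p = stats.f_oneway(*groups)
    rejections += p < alpha
print(f"empirical false-positive rate: {rejections / n_sim:.3f}")
```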
Btw: I find it much more critical that authors claim "no effect" when p>0.05, and that p
For the first part... Thinking about it, I guess I should've written "if the independence assumption is not met", which goes in the direction of Steven Moffitt's response. If you have some drift over time in your data, it might lead to wrong results.
If all assumptions except for the normality are fulfilled... I guess you are right.
OK, but how do you judge data as being outliers? If the values are physically/biologically nonsensical, then it is clear. But if the values themselves are possible/reasonable, then it might be that these "outliers" are the only data trying to tell you that your model is not correct...
So that's a graphical approach, which is fairly straightforward. Is there a p-value-based approach that would be helpful for people less familiar with these aspects?
I understand you tested normality on the residuals of the whole data set. Should normality be tested on the data as well? And in a factorial experiment, should normality be tested per treatment group rather than on the whole dataset?
A p-value-based approach to testing the assumptions of a subsequent test is not really sensible. The important thing is whether the pattern and the size of the violation are relevant, and that is not reflected by a p-value. Consider this: if you have little data, p-values may be large even if the violation is strong; if you have a lot of data, p-values are close to zero even if the violations are negligible.
The normality assumption refers to the ERRORS, not to the data (also see Michael Young's answer above). It is a common misconception that the DATA should be normally distributed. In fact, to model a response by predictors using the normal error model, it is not required that the residuals actually have a frequency distribution that looks like a normal probability distribution. The normal error model is a projection of our ignorance (the things we cannot know and do not know about the causes of variability in the data). A frequency distribution of the residuals that is strongly different from a normal probability distribution tells us that the model does not use all of the available information appropriately. It may still be the best model available, though!
However, for *testing* purposes (scientifically not so interesting, but usually *the* main thing done with data...) it *is* required that the (limiting!) frequency distribution is *identical* to the normal probability distribution (or at least nearly so). Here the central limit theorem may come into play, but then we have the same problem as above: with little data, violations matter and give wrong results (mostly false negatives, so the research is rendered a waste of time and money); with a lot of data, violations don't matter (much), but the p-values will be close to zero anyway.
Deepak, there is never a "best" in general. The KS test is one of a variety of possibilities, each with its own pros and cons. Besides, there are tests designed especially for testing normality (for instance, Shapiro-Wilk).
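For illustration, both tests are available in SciPy; the sketch below (on hypothetical residuals) runs the Shapiro-Wilk test and a KS test against a normal distribution with estimated parameters (note that the latter is not exact without a Lilliefors-type correction):

```python
# Shapiro-Wilk versus Kolmogorov-Smirnov on stand-in model residuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
residuals = rng.normal(size=500)          # stand-in for model residuals

w_stat, p_sw = stats.shapiro(residuals)
ks_stat, p_ks = stats.kstest(residuals, "norm",
                             args=(residuals.mean(), residuals.std(ddof=1)))
print(f"Shapiro-Wilk: W={w_stat:.3f}, p={p_sw:.3f}")
print(f"Kolmogorov-Smirnov: D={ks_stat:.3f}, p={p_ks:.3f}")
```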
Are error and residual the same thing? If normality is tested on the residuals, can the test be done on the whole dataset, ignoring the groups (treatments) at this stage? And must the groups be considered for the homoscedasticity test instead?
In my opinion, or in my use of language (talking about models), errors and residuals are the same. Other people may make a philosophical distinction here. I think "residual" is actually the better word here.
Yes, the probability model refers to the residuals, to all residuals of the model. If the model is about several groups, possibly with interactions, the residuals are (necessarily) taken from all data from all groups.
If there is heteroscedasticity, you would see it in the residual plots (QQ, predicted vs. residuals, predictor vs. residuals). In your example, the predicted-vs-residual plot shows heteroscedasticity. This could be tested, for instance, with a trend test on the absolute residuals (note: I would *not* recommend actually doing this, however). Between groups (predictor vs. residuals for a categorical predictor) -> Levene test (same note as above; to my knowledge only for very simple models with no interactions and no continuous covariates). If there is more than one group and also possible interactions, one has to test all level combinations of all predictors, including possible interactions. This results in a lot of "subgroups" with only very few values in each of them. Now there are two problems: little data -> nothing will be significant anyway, and lots of tests -> a multiplicity problem. I think you end up with a multitude of new philosophical problems about power, specificity, false-positive and false-negative rates, and the correctness of the assumptions behind the tests for normality and homoscedasticity. It is a bit like calculating a p-value for a p-value (for a p-value for a p-value ...).
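Just to show the mechanics of such a trend test (which, as said, I would not recommend relying on), here is a small sketch on hypothetical data, correlating the absolute residuals with the fitted values:

```python
# Spearman trend test of |residuals| against fitted values (hypothetical data
# in which the spread grows with the fitted value).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
fitted = rng.uniform(1, 10, size=200)
residuals = rng.normal(scale=0.3 * fitted)   # spread grows with the fit

rho, p = stats.spearmanr(fitted, np.abs(residuals))
print(f"trend in |residuals| vs fitted: rho={rho:.2f}, p={p:.4f}")
```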
Showing that your model is reasonable is not the same thing as showing that it is the best possible model.
I realised that for a hypothetical experiment including only one group (treatment), a normality test (e.g. Shapiro-Wilk, or a graphical approach) gives the same result whether it is applied to the data or to the residuals. In a real experiment with several groups, a normality test on the residuals of the whole dataset, to some extent, summarizes the separate normality tests that could be performed per group. Indeed, all groups to be compared should have a normal distribution. This concept can probably be understood more easily by people not familiar with statistics. Please correct me if I'm wrong.
Now, one problem: the residual calculation depends on the model used. So which model should be used to calculate the residuals: a saturated model (all main factors with all possible interactions) or the best-fitting model (e.g. only the significant main factors and interactions)?
You distinguish "one group" and "real experiment with several groups". This distinction is misleading here. If you have one group, then you calculate a mean value (average) and analyze the residuals to this average. This determines that the average of the residuals is zero anyway. There is no "location information" anymore in the residuals. This information was removed by using the average as the "reference" to which the residuals were calculated. So *if* the frequency distribution of the residuals resembles a normal distribution, then neccesarily one with µ=0. Only the variance is free. By using the average you already applied a (very simple) model to the data. The average is the result of the maximum-likelihood estimation based on the normal error model. This most simple model makes no use of any predictors. It gives a single summary value for the location ("typical or expected value") when there is no further information (like group membership, treatment, environmental factors...) known and when we have no reason to think that any partculat value is "special" in any way. If we further have no reason to believe that positive or negative deviations from the "expected value" are more likely, we end up in the normal error model, that is used to determine the likelihood of the data. The average is then just the "expected value" for which the likelihood of the data is maximal.
Short: Calculating an average is already applying a (most simple) model.
If there are predictors available, the model can become more complicated, allowing to explain more of the variance seen in the data.
Whatever cannot be "explained" by the model is left over in the residuals. If this left-over part does not contain any usable information (i.e., it has maximum entropy), then its (limiting) frequency distribution will be normal.
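A tiny numerical sketch (hypothetical data, my own illustration) may make this point clearer: the average is already a model, and the residuals always refer to whichever model was fitted.

```python
# Residuals relative to the simplest model (the average) versus a model
# that uses the group information.
import numpy as np

rng = np.random.default_rng(7)
y = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(1.5, 1.0, 60)])
group = np.repeat([0, 1], 60)

# Model 1: intercept only ("the average"). The residuals are centered at zero
# by construction, but any group structure is still hidden inside them.
resid_mean = y - y.mean()

# Model 2: one mean per group. The group information is now used, so the
# residuals should look closer to pure noise.
group_means = np.array([y[group == g].mean() for g in (0, 1)])
resid_group = y - group_means[group]

print(np.var(resid_mean), np.var(resid_group))
```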
Models are never correct. Models are necessarily simplified descriptions, reducing far too complex things to a few relevant aspects. Hence, real data will never have truly normally distributed residuals (which renders strict hypothesis tests about this nonsensical).
What is a good model? A good model should not leave very much information in the data unused (or use the information inappropriately). Leaving information unused can be recognized by large, clear, pronounced deviations of the residuals from normality. Apart from this, a good model should be justified by expertise.
If there is no expertise available: the simpler model is the better one. Predictors with no relevant impact should not be used. What counts as "relevant", though, again needs some expertise. However, removing a predictor with little or no impact does not change the conclusions (much). Model selection is a big topic and there is a lot of literature available. In my opinion much of it is overdone; one should keep things as simple as possible and decide "outside of the data", using expertise.
ANOVA is a parametric test based on the assumption that the data follow a normal distribution; hence it is necessary to test normality. If the data do not follow a normal distribution, we can opt for non-parametric tests such as the Kruskal-Wallis test.
Error = residual. ANOVA is also ROBUST to small departures from normality, and the more important assumption is that of equal variances. I find that almost always, if the residuals are not normally distributed, a simple transformation solves the problem; the log transform is the best one to start with in many (but not all) areas. You should NEVER just test the residuals and then give up on a parametric test. Remember, the mean and variance are also parametric: if you use a NON-parametric ANOVA (K-W), then you CANNOT use the mean and variance (standard deviation, standard error) in your presentation of results either!
Large topic. You might want to take a look at Chapter 18, "ANOVA Diagnostics and Remedial Measures", in Kutner, Nachtsheim, Neter, & Li's _Applied Linear Statistical Models_, 5th ed., McGraw-Hill.
Hello, I'm not sure if this is relevant, but the electronic statistical textbook on the favourite-links page of my website http://sites.google.com/site/deborahhilton/ has a section on Deviation from Normal Distribution.
Effects of violations. Overall, the F test (see also F Distribution) is remarkably robust to deviations from normality (see Lindman, 1974, for a summary). If the kurtosis (see Basic Statistics and Tables) is greater than 0, then the F tends to be too small and we cannot reject the null hypothesis even though it is incorrect. The opposite is the case when the kurtosis is less than 0. The skewness of the distribution usually does not have a sizable effect on the F statistic. If the n per cell is fairly large, then deviations from normality do not matter much at all because of the central limit theorem, according to which the sampling distribution of the mean approximates the normal distribution, regardless of the distribution of the variable in the population. A detailed discussion of the robustness of the F statistic can be found in Box and Anderson (1955), or Lindman (1974).
In the context of mixed models, some people seem to use the term "error" for all random effects, and residual for the (hopefully small) part of the data that is not explained by any predictor, either fixed effects or random effects --- so residuals are only a component of the error part of the model, that will also include for instance the random effects "patient" predictor. This approach is especially used, I think, when the fixed part of the model is of interest, and the random part (including patients) is just a kind of additional source of variability.
In addition, I would say that in the model Y = f(X) + epsilon, with epsilon following a given law, epsilon is the error, and it is (often) assumed that the epsilon_i are iid, or at least independent. The residual, on the contrary, is the (y[observed] - y[predicted]) value, which contains the realisation of the error epsilon but also the (hopefully small) difference induced by the fact that the estimated f(X) is not the "true" f(X); so it is not only the error in the previous sense but has additional "error" due to the imperfect estimation of the parameters. For this reason also, residuals are NOT independent even if the errors are assumed to be so. They are NOT homoscedastic, even if the errors are assumed to be so. And unless the model is linear, there is no reason for them to be Gaussian even if the errors are assumed to be so.
In practice, however, these last three points are assumed to be negligible for diagnostic purposes... Note, though, that Studentized residuals were introduced for exactly this reason: they really are independent, homoscedastic residuals, just as the error is assumed to be. But they are not Gaussian; instead they follow a Student distribution (hence their name) if the model is linear.
For this reason also, in practice, the interpretation of residual diagnostics is a whole: apparently non-normal residuals can arise just because the model f(X) is too wrong, so that the residuals have a strong component that is not the error, even though the error itself may still be Gaussian. So you cannot conclude on the normality of the error just from normality diagnostics on the residuals.
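For what it is worth, Studentized residuals are readily available in standard software; a minimal sketch with statsmodels on hypothetical data might look like this:

```python
# Ordinary versus (externally) Studentized residuals from an OLS fit.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 2 + 3 * df["x"] + rng.normal(size=100)

fit = smf.ols("y ~ x", data=df).fit()
infl = fit.get_influence()
raw = fit.resid                                  # ordinary residuals
studentized = infl.resid_studentized_external    # externally Studentized
print(studentized[:5])
```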
As for Levene's tests (Levene or Brown-Forsythe variants), the only constraint is that you have only categorical predictors and several measurements for each combination of these categorical predictors; the absence or presence of interaction does not matter (but obviously, the residuals will change if an existing interaction is omitted from the model, so Levene's test on a really wrong model can be misleading because of the "residuals"): just construct a single factor indicating in which cell of your categorical mix each observation lies and use Levene's test on this single factor. However, it does not work with continuous variables, but some tests are available [Breusch-Pagan, IIRC], though I do not use them enough to be clear about the conditions under which they work.
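A hedged sketch of both ideas (Levene's test on a single combined factor, and a Breusch-Pagan test when a continuous covariate is present), using hypothetical data and variable names:

```python
# Levene's test on a combined categorical factor, and Breusch-Pagan for a
# model that also contains a continuous covariate.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "a": np.tile(["a1", "a2"], 100),
    "b": np.repeat(["b1", "b2"], 100),
    "x": rng.normal(size=200),
})
df["y"] = rng.normal(size=200) * (1 + 0.5 * df["x"].abs())

# Single factor over all combinations of the categorical predictors.
cells = df["a"] + ":" + df["b"]
stat, p = stats.levene(*[df.loc[cells == c, "y"] for c in cells.unique()])
print(f"Levene on combined factor: W={stat:.2f}, p={p:.3f}")

# Breusch-Pagan on a model that includes the continuous covariate.
fit = smf.ols("y ~ a * b + x", data=df).fit()
bp_stat, bp_p, _, _ = het_breuschpagan(fit.resid, fit.model.exog)
print(f"Breusch-Pagan: LM={bp_stat:.2f}, p={bp_p:.3f}")
```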
Basically, in an ANOVA analysis the data or samples used should be suitable for a parametric test (e.g., normally distributed, or with a large n in each group so as to reduce variation). However, a test of the distribution is still useful to show you how the data are distributed even when the number is large (subgroups, bimodal, or more complex patterns). For non-parametric data or samples, the indicated analysis is, e.g., the Kruskal-Wallis test, which can be more reliable for non-normal data and for small numbers. However, this is a big topic in the analysis of data among groups, and a professional statistician should be consulted when there are many groups, subgroups, and factors influencing the groups.
The assumptions of the ANOVA have to be verified if you want to apply the test correctly. Moreover, I believe it is important to verify the normality of your data. It can give you a clear idea of the distribution. You can check if there is a sub-population and understand if you have a bias in your selection.
Very short practical answer. If the data is more or less symmetrical and you have enough subjects (which you seem to have), there is no big practical problem.
If you have doubts, then use a nonparametric approach or another approach such as logistic regression for binary data or a proportional-odds model for ordinal data, etc. But testing the normality assumption seems to be of little help in practical terms.
You might want to take a look at Maxwell & Delaney's Designing Experiments and Analyzing Data (2nd ed.) and read through the section on statistical assumptions (pp. 110-117), which discusses how expected values are affected and ANOVA's robustness. They also have a nice discussion of why the F test falters if you have unequal n's and unequal population variances (pp. 145-147), and a nice discussion of choosing between parametric and nonparametric tests (pp. 137-142). That way you can more accurately map this advice onto what we know about how F behaves under these various conditions. Hope this helps.
Thanks for the clarifications, Philip Wood. Unfortunately, however, I don't have access to this book. Do you know if there is a digital version of these chapters?
The usual effect of lack of normality is loss of power; the type I error rate is not inflated. However, note that it is not the original data that need to be normally distributed but the noise elements of the model (which can be approximately assessed using the residuals). A useful plot in general is that of the residuals versus the fitted values. Transformation can be a useful trick for restoring normality.
The two attached plots show residual checking for a one-way layout (six treatments with six replicates per treatment) using GenStat. The first one shows lack of normality of the residuals (the QQ plot on the lower left departs markedly from linearity) and also shows increasing variability as the values increase (the plot of residuals versus fitted values on the upper right). This suggests a log transformation. The second plot shows that the residuals on the log scale are much better behaved.
As it turns out, either analysis is 'significant', but the F statistic on 5 and 30 degrees of freedom is 6.3 in the first case and much higher, at 40.1, in the second.
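The plots themselves were produced in GenStat; for readers without it, a rough Python equivalent of the same diagnostic workflow (QQ plot, residuals versus fitted values, refit on the log scale) on hypothetical one-way data might look like this:

```python
# Residual diagnostics for a one-way layout, before and after a log transform,
# on hypothetical lognormal-like data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
df = pd.DataFrame({"treatment": np.repeat(list("ABCDEF"), 6)})
df["y"] = np.exp(rng.normal(loc=pd.factorize(df["treatment"])[0] * 0.3, scale=0.4))

for formula in ("y ~ C(treatment)", "np.log(y) ~ C(treatment)"):
    fit = smf.ols(formula, data=df).fit()
    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    sm.qqplot(fit.resid, line="q", ax=axes[0])          # QQ plot of residuals
    axes[1].scatter(fit.fittedvalues, fit.resid)         # residuals vs fitted
    axes[1].set(xlabel="fitted", ylabel="residual", title=formula)
    plt.show()
    print(formula, sm.stats.anova_lm(fit).iloc[0][["F", "PR(>F)"]].to_dict())
```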
The ebook version of Maxwell & Delaney is available for $80 from Google Books. Alternatively, you can get it as an ebook for the Nook from Barnes & Noble. http://www.barnesandnoble.com/w/designing-experiments-and-analyzing-data-scott-e-maxwell/1101798089?ean=9781410609243 Hope this helps.
I don't disagree with your general observations, Stephen, but notice that he has well over 10,000 observations in his posting of the question. So isn't the issue more appropriately whether he has a correct estimate of mean square error for contrasts he might be interested in or the inference difficulties from an unbalanced design?
Philip, if he is going beyond the overall global F test to individual contrasts he probably should not be using ANOVA type contrasts at all. The t-test is robust if the contribution of the variance estimate is "internal" (based on the same observations that are used to calculate the mean) and if each mean is based on the same number of observations. It is only worth pooling variances from means not involved in the given contrast if degrees of freedom are scarce and they are not here. See Senn, S. J. (2008). "The t-test tool." Significance: 40-41.
Yes, transformations can be useful. However, you may end up testing something different from what you initially wanted. Especially in factorial designs you may want to look at additive effects, whereas after transforming the data you would be asking whether the factors "multiply" each other. If you just want to look at whether two distributions are different, you may want to consider the Wilcoxon-Mann-Whitney test (this actually has many names), which basically asks whether the average rank is bigger in one group than in the other. The nice thing about this test is its invariance under monotone transformations (you would probably not consider others anyway), i.e. there is no need to find the "right" transformation.
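A minimal sketch of that invariance (hypothetical data, my own example): applying a monotone transformation such as the log leaves the Wilcoxon-Mann-Whitney p-value unchanged, because only the ranks enter the test.

```python
# The Mann-Whitney U test is invariant under monotone transformations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
a = rng.lognormal(0.0, 0.5, size=80)
b = rng.lognormal(0.3, 0.5, size=80)

u_raw, p_raw = stats.mannwhitneyu(a, b, alternative="two-sided")
u_log, p_log = stats.mannwhitneyu(np.log(a), np.log(b), alternative="two-sided")
print(p_raw == p_log)   # identical: the test only uses the ranks
```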
However, Stephen Senn has also criticized this test because it takes both variance and mean into account. But I would say it is debatable whether this is a feature or a bug :-)
Thanks, Alexander. My criticism of the WMW test is indeed 'that it takes mean and variance into account', but one has to be careful about what this means. Even if the variance in the two groups is identical, the overlap of ranks is a function of the variance. The degree to which the groups separate is a function of this variance. So, for example, if you replace single random blood-pressure measurements by regular monitoring over an extended period, you will get less overlap of the two distributions even though the mean shift is the same. You, of course, understand this very well, but most users of the WMW have no idea that this is the case.
As regards transformation I would say that the main task of the statistician is to find a scale on which the effect of any intervention is additive. This can frequently also lead to well-behaved residuals. However, the two do not have to be linked and then it may be necessary to have separate models for signal and noise.
Yes, statistics is all about understanding the variation, isn't it? Understanding the variance very well will help you understand the problem. And sometimes the question is more "Will a subject from group A have a bigger endpoint than one from group B?". Looking just at mean differences does not tell you very much about this, but the relative effect that the WMW is based on gives you exactly that. Some discussion of this can be found here (http://www.ncbi.nlm.nih.gov/pubmed/18266888 - just shameless promotion ;-))
Ah yes, and there is another point. It is important to know where the variance comes from: is it mainly "random error" or "measurement error", or does it really come from some kind of natural variation within a biological sample? If it is mainly the former, then of course there is less value in looking at the relative effect. (This is actually an argument by Sebastian Domhof: http://ediss.uni-goettingen.de/bitstream/handle/11858/00-1735-0000-000D-F284-4/domhof.pdf?sequence=1. Unfortunately his PhD thesis never got published in English and is only available in German.)
I am not a statistician, but I would ask: does the number of samples have any effect? I mean, a large number of samples will surely reduce the variation and bring the distribution closer to normality. Otherwise, use a non-parametric test and you will surely be safe in your conclusions.
Alexander, the problem with the question "Will a subject from group A have a bigger endpoint than one from group B?" is that the lives of others* are not available as an alternative. What a patient has is a choice between futures: that if the drug is taken and that if it is not. The extent to which the measured values of two groups of patients overlap does not address this question directly. Having said that, I am a great fan of Brunner, Domhof and Langer, so thanks for directing me to Sebastian's PhD thesis.
* However I can highly recommend the "Lives of others"
And yes, the lives of others are not an alternative. What I wanted to say is that the values of the patients in group A would be informative about future patients getting the same treatment A. That said, it would inform you about the alternative treatments, wouldn't it?
Although this may be an indirect answer to your question, I thought it was worth mentioning that ANOVA can be robust to violations of its assumptions, but not always. As for the alternatives: there is a group of tests (often called assumption-free, distribution-free and non-parametric tests, none of which are particularly accurate names).
The one-way independent ANOVA has a non-parametric counterpart called the Kruskal–Wallis test. If you have non-normally distributed data, or have violated some other assumption, then this test can be a useful way around the problem.
Azubuike, the KW test tests the null that all samples come from the *same* distribution. This means the test will give you low p-values when the data are unlikely under this hypothesis. Is this what you want? I think this is often not the question. Most often the main research question is about a "location shift" (and not about a change in the distribution). A location shift is tested then, and ONLY then, when all other distributional characteristics (all higher moments such as variance, skewness, kurtosis, ...) are identical. And exactly this is often clearly NOT the case, especially if the data show a "non-normal" distribution (I often see that mean and variance are correlated). If you have such data, the question is: what is the KW test actually testing, and is this really the question you are interested in? I think the answer is most often "no". The same, for sure, applies to the MW/Wilcoxon test.
Before doing anything to solve the problem of a "non-normal distribution of the residuals", it is useful to reconsider the model; sometimes inserting a factor into the model results in normally distributed residuals.
Yes, Rohullah! First one should ask oneself: is the functional form of the model correct (additive or multiplicative? linear or non-linear relationships?), are all relevant covariables considered, and are all relevant interactions between covariables considered? Thinking about all this can easily take days or weeks, and it is usually not done. But if it is done, one will eventually have a much better understanding of the data and of the results.
Typically, people then go for hypothesis tests of something. Such tests assume correct models. So one must know before collecting/seeing the data that the model is correct, and one must know all the relevant details (just to name one: what is the minimum relevant effect size), so there should not be any doubt about the distribution of the residuals anyway. This is also one reason why I find the common habit so foolish of first testing normality and then choosing another test based on the result of that test, and this in the absence of a meaningfully defined alternative hypothesis and of fixed significance levels for automated decision-making.
Also, if you have a problem with error normality at N > 10,000, you probably have other issues as well. Variances should be equal in all treatments too. And, finally, a huge sample size can make an unimportant difference between treatments look significant.