First of all, are the comparisons planned, or were they decided while looking at the data? Second, does the procedure use a decision-based (per-contrast) error rate, or does it use a familywise error rate?
Based on these two dimensions the comparisons can then be classified, for example, in the following way:
Planned + decision error rate = t-test
Planned + family error rate = Bonferroni & Dunnett
Post hoc + decision error rate = Duncan & Rodger's method
Post hoc + family error rate = Newman–Keuls, Tukey & Scheffé
The type and number of comparisons also often come into play.
BONFERRONI
The Bonferroni correction simply calculates a new per-comparison alpha, obtained by dividing the nominal alpha by the number of contrasts.
The Bonferroni is highly flexible, very simple to compute, and can be used with any type of statistical test. However, it tends to lack power for several reasons: (1) the familywise error calculation depends on the assumption that the null hypothesis is true for all tests, which seems unlikely, especially after a significant omnibus test; (2) the tests are assumed to be independent when calculating the familywise error rate, which is usually not the case when all pairwise comparisons are made; (3) it does not take into account whether the findings are consistent with theory and past research; if a result is consistent with previous findings and theory, it should be less likely to be a Type I error; and (4) Bonferroni overcorrects for Type I error. In addition, in certain situations where one wants to retain, not reject, the null hypothesis, the Bonferroni correction is non-conservative.
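For concreteness, here is a minimal sketch in Python (statsmodels), with made-up p-values; the per-comparison alpha is just the nominal alpha divided by the number of tests:

```python
# Minimal sketch of the Bonferroni correction (statsmodels).
# The p-values are made up for illustration.
from statsmodels.stats.multitest import multipletests

pvals = [0.012, 0.030, 0.001, 0.200]   # raw per-comparison p-values
reject, p_adj, _, alpha_bonf = multipletests(pvals, alpha=0.05,
                                             method="bonferroni")
print(alpha_bonf)   # per-comparison alpha: 0.05 / 4 = 0.0125
print(p_adj)        # raw p-values multiplied by the number of tests (capped at 1)
print(reject)       # which nulls are rejected at familywise alpha = 0.05
```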
DUNNETT
The Dunnett test is similar to the Tukey test, but is used only when a set of comparisons is being made against one particular group (typically a control). This is rarely of interest, and Tukey serves a much more general purpose. In contrast to the Bonferroni correction, it exploits the correlation between the test statistics for these comparisons.
Common tables of critical values for Dunnett's test assume that there are equal numbers of trials in each group, but more flexible options are nowadays readily available in many statistics packages such as R.
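For example, SciPy (version 1.11 or later) ships a Dunnett implementation that handles unequal group sizes; a hedged sketch with invented data:

```python
# Many-to-one comparisons against a control with Dunnett's test.
# Requires SciPy >= 1.11; the data below are invented.
import numpy as np
from scipy.stats import dunnett

rng = np.random.default_rng(0)
control = rng.normal(10.0, 2.0, size=15)
treat_a = rng.normal(11.5, 2.0, size=12)   # unequal n is fine here
treat_b = rng.normal(10.2, 2.0, size=18)

res = dunnett(treat_a, treat_b, control=control)
print(res.statistic)  # one statistic per treatment-vs-control contrast
print(res.pvalue)     # p-values already adjusted for the family of contrasts
```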
DUNCAN
Duncan's new multiple range test (MRT) is a variant of the Student–Newman–Keuls method that uses increasing alpha levels to calculate the critical values in each step of the Newman–Keuls procedure. Duncan's MRT is especially protective against false negative (Type II) errors, at the expense of a greater risk of making false positive (Type I) errors.
Duncan's test has been criticised as being too liberal. Duncan argued that a more liberal procedure was appropriate because in real-world practice the global null hypothesis H0 = "all means are equal" is often false, and thus traditional statisticians overprotect a probably false null hypothesis against Type I errors.
The other main criticisms are that Duncan's MRT does not control the familywise error rate at the nominal alpha level, a problem it inherits from the Student–Newman–Keuls method, and that the increased power of Duncan's MRT over Newman–Keuls comes from intentionally raising the alpha levels (Type I error rate) in each step of the Newman–Keuls procedure, not from any real improvement on the SNK method.
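To make those "intentionally raised alpha levels" concrete, here is a sketch (Python, SciPy's studentized range distribution; the error df and alpha are assumed values) comparing the per-step critical values of SNK against Duncan's protection levels, alpha_p = 1 - (1 - alpha)^(p-1):

```python
# How Duncan's protection levels relax the per-step alpha relative to
# Student-Newman-Keuls (SNK). The error df and alpha are assumed values.
from scipy.stats import studentized_range

alpha, df_error = 0.05, 20
for p in range(2, 6):                 # p = number of means spanned by a range
    a_duncan = 1 - (1 - alpha) ** (p - 1)        # Duncan's protection level
    q_snk    = studentized_range.ppf(1 - alpha,    p, df_error)
    q_duncan = studentized_range.ppf(1 - a_duncan, p, df_error)
    print(p, round(q_snk, 3), round(q_duncan, 3))  # Duncan's q is smaller for p > 2
```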
RODGER
Using the traditional F-criterion produces an inevitable loss of statistical power as the numerator degrees of freedom increase. In direct contrast, Rodger's approach ensures that statistical power does not decline with increasing numerator degrees of freedom. Rodger's method has more power than all other post hoc procedures to detect every conceivable sort of interaction effect.
An unlimited amount of post hoc data searching is permitted by Rodger's method, thanks to the guarantee that the long-run expectation of Type I errors can never exceed Eα. Both the increased power that Rodger's method possesses and the impossibility of Type I error rate inflation are obtained by using a decision-based error rate, the same as is used in planned t-tests. In the statistical context, an error occurs if and only if a decision is made that a specified relationship among population parameters either is, or is not, equal to some number, and the opposite is true. Rodger's position is that the statistical error rate should be based exclusively on those things in which errors may occur, and those can only be the statistical decisions that researchers make.
NEWMAN KEULS
The Newman–Keuls or Student–Newman–Keuls method is a stepwise multiple comparisons procedure. This procedure is often used as a post-hoc test whenever a significant difference between three or more sample means has been revealed by an analysis of variance. The Newman–Keuls method is similar to Tukey's range test as both procedures use Studentized range statistics. Compared to Tukey, the Newman–Keuls is more powerful but less conservative.
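A hedged sketch of the stepwise idea, with invented means and ANOVA quantities (a full SNK implementation also stops testing inside ranges already found non-significant; that bookkeeping is omitted here):

```python
# Core of the SNK logic: each pair of ordered means is tested against a
# studentized-range critical value whose "p" is the number of means spanned.
import numpy as np
from scipy.stats import studentized_range

means = np.sort(np.array([10.1, 11.8, 12.0, 14.3]))  # group means (invented)
n, mse, df_error, alpha = 12, 4.0, 44, 0.05          # per-group n, ANOVA MSE

for i in range(len(means)):
    for j in range(i + 1, len(means)):
        p = j - i + 1                                # means spanned by the range
        q_crit = studentized_range.ppf(1 - alpha, p, df_error)
        q_obs = (means[j] - means[i]) / np.sqrt(mse / n)
        print(i, j, q_obs > q_crit)                  # True = significant range
```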
TUKEY
Tukey's test calculates a new critical value that can be used to evaluate whether the difference between any two means is significant. The critical value is a little different in form because it is expressed directly as the minimum mean difference that must be exceeded to achieve significance.
It has greater power than the other tests under most circumstances, and it is readily available in computer packages. It is important to note that the power advantage of the Tukey test depends on the assumption that ALL possible pairwise comparisons are being made. Although this is usually what is desired when post hoc tests are conducted, in circumstances where not all possible comparisons are needed, other tests, such as Dunnett's or a modified Bonferroni method, should be considered because they may have power advantages.
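In Python, for instance, SciPy (1.8 or later) provides Tukey's HSD directly; a short sketch with invented samples:

```python
# All pairwise comparisons with Tukey's HSD (SciPy >= 1.8). Invented data.
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(1)
a = rng.normal(5.0, 1.0, size=10)
b = rng.normal(6.0, 1.0, size=10)
c = rng.normal(5.2, 1.0, size=10)

res = tukey_hsd(a, b, c)
print(res.pvalue)                       # matrix of FWER-adjusted p-values
print(res.confidence_interval(0.95))    # simultaneous CIs for mean differences
```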
SCHEFFE
Scheffé's method is a single-step multiple comparison procedure which applies to the set of estimates of all possible contrasts, not just the pairwise differences of the Tukey–Kramer method. When many or all contrasts are of interest, the Scheffé method tends to give narrower confidence limits and is therefore the preferred method. If only pairwise comparisons are to be made, the Tukey method will result in narrower confidence limits, which is preferable.
The Scheffé test computes a new critical value, chosen so that the familywise error rate is controlled for the set of all possible contrasts, i.e. for the maximum possible family of comparisons. However, this severe correction also results in a higher than desired Type II error rate.
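The Scheffé critical value itself is easy to compute: a contrast among k means is significant if its |t| exceeds sqrt((k - 1) * F_crit), where F_crit is the upper-alpha point of F with (k - 1, N - k) degrees of freedom. A sketch with assumed values:

```python
# Scheffe critical value for any contrast among k means. Values are assumed.
from math import sqrt
from scipy.stats import f

k, N, alpha = 4, 40, 0.05
f_crit = f.ppf(1 - alpha, k - 1, N - k)      # F(0.95; 3, 36)
scheffe_crit = sqrt((k - 1) * f_crit)
print(scheffe_crit)   # compare each contrast's t statistic against this value
```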
ALSO:
LSD RATIONALE
The Least Significant Difference (LSD) test rests simply on the rationale that if an omnibus test is conducted and is significant, the null hypothesis is incorrect. The reasoning is based on the assumption that if the null hypothesis is incorrect, as indicated by a significant omnibus F-test, Type I errors are not really possible, because they occur only when the null is true. So, by conducting an omnibus test first, one is screening out group differences that exist due to sampling error, and thus reducing the likelihood that a Type I error is present among the means. However, Fisher's LSD test has been criticized for not sufficiently controlling Type I error.
This first step only partially controls the false-positive rate for the entire family of comparisons. Note that other multiple comparison tests (Bonferroni, Tukey, etc.) do not require this first step; they do not need to be "protected". The results of these multiple comparison tests are valid even if the overall ANOVA has a P value greater than 0.05.
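A sketch of the "protected" workflow in Python (invented samples; note that a textbook LSD would use the pooled ANOVA MSE in each t-test, whereas scipy's two-sample pooling only approximates that):

```python
# Protected Fisher LSD: run the omnibus ANOVA first, then plain unadjusted
# pairwise t-tests only if the omnibus test is significant.
from itertools import combinations
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(2)
groups = [rng.normal(m, 1.0, size=10) for m in (5.0, 5.5, 7.0)]

F, p_omnibus = f_oneway(*groups)
if p_omnibus < 0.05:                         # the protective first step
    for i, j in combinations(range(len(groups)), 2):
        t, p = ttest_ind(groups[i], groups[j])   # no multiplicity correction
        print(i, j, round(p, 4))
```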
GAMES-HOWELL (UNEQUAL VARIANCES)
This test is used with unequal variances and also takes unequal group sizes into account. Severely unequal variances can lead to increased Type I error, and, with smaller sample sizes, even more moderate differences in group variance can do so. The Games-Howell test is based on Welch's correction to the degrees of freedom of the t-test and uses the studentized range statistic. It appears to do better than the Tukey HSD if variances are very unequal, and it can also be used if the sample size per cell is very small.
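A hedged sketch of a single Games-Howell comparison, combining the Welch degrees-of-freedom correction with the studentized range distribution (data invented; k is the total number of groups in the family; the pingouin package offers a ready-made pairwise_gameshowell if you prefer a packaged version):

```python
# One Games-Howell comparison: Welch-corrected df + studentized range.
import numpy as np
from scipy.stats import studentized_range

def games_howell_pair(x, y, k):
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    q = abs(x.mean() - y.mean()) / np.sqrt((vx + vy) / 2)    # range statistic
    df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    return studentized_range.sf(q, k, df)                    # adjusted p-value

rng = np.random.default_rng(3)
a = rng.normal(5.0, 1.0, size=8)    # small n, small variance
b = rng.normal(6.5, 3.0, size=20)   # larger n, much larger variance
print(games_howell_pair(a, b, k=3))
```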
First of all, in my case, the data were not planned for comparison. We usually carry out yearly field campaigns for environmental assessment, and fortunately we can now evaluate the effect of some improvements implemented in the area on water parameters. As a result, the number of water samples collected each year differs, and we just want to evaluate whether the changes implemented in the area have changed the water chemistry. Furthermore, in one of the areas evaluated the data do not follow a normal distribution, so we are going to use a non-parametric comparison method (I think it is an easier way than the normal one).
Iker, for non-parametric tests the same applies with respect to the control of error rates as for the parametric tests. Further, when you state that some samples have normally distributed data and some others do not, then the usual assumptions of the non-parametric tests are already violated. I find it better to find out why some of the samples deviate so much from a normal distribution. Is there something else going on? Are there outliers, perhaps attributable to the sampling/measurement or something else? And even if not (and most samples do have a normal distribution), it might still be more informative to analyse a parametric model, perhaps with a little more careful interpretation, since some of the data are not "beautifully normal".
Note: Bonferroni, Tukey and Dunnett control the familywise error rate (FWER). An ANOVA is not required. Often, pooled variances are taken from the ANOVA calculations, but, in contrast to a common opinion, a "significant" ANOVA is not a prerequisite for control of the FWER.
Note 2: Holm is a little less conservative than Bonferroni, and for a large number of tests controlling the false-discovery rate (FDR) might be better (-> Benjamini-Hochberg).
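For illustration, all three corrections are one call in Python's statsmodels (p-values made up):

```python
# Bonferroni vs Holm vs Benjamini-Hochberg FDR on made-up p-values.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.020, 0.041, 0.300]
for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, int(reject.sum()))  # Holm/FDR typically reject at least as many
```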
Thanks, Jochen. For the different sampling areas, the parametric and non-parametric data always belong to the same area. So the non-normal distribution is not due to outliers.
So you may well be missing something. Maybe there is an influential predictor, an interaction, or a non-linear relationship you should consider, or the error model is not additive, or any combination of these. It is strange, from a mathematical/philosophical perspective as well as from a biological one, that data from some samples are "normal" and from others are not (given that the data are from comparable samples and obtained under comparable conditions and so on).
ANOVA is the classic, but you can also use structural methods, for example statistical pattern recognition by discriminant functions and Bayes probability; the structure in that case is the covariance matrix.
Iker, with water quality data it is common to have multiple analytes. Different analytes often have different distributional forms. But for any given analyte I usually see the same shaped distribution across sample regions, unless there is some sort of mixing process going on.
Is the difference in distribution you are seeing between analytes (which is only to be expected) or between regions for the same analyte (which is much more concerning)?
Over many years and literally thousands of ANOVA tests on many projects, my personal preference has been Duncan. It has given me the most consistent results and has resolved differences most often (when other methods have not). No statistical rationale for this, just experience.