Check to ensure the assumptions of your ANOVA have been met; if not, run a Kruskal-Wallis H test in place of the ANOVA. I've seen the problem you've described happen with highly skewed datasets.
Yes, I attached such data: 10 groups with 50 values each. The assumptions are met. The ANOVA p is 0.0004 (ok, slightly above 0.0001, but, hey!) and all Tukey-adjusted pairwise p-values are > 0.05. The smallest is for the comparison of groups "3" and "7", with p=0.0657.
Samuel, you did not specify which alpha you used for your Tukey test.
The Tukey test is an a posteriori test and it is less powerful than the ANOVA. If you use the same alpha as in the ANOVA, you may obtain p > alpha in the Tukey test.
Samuel, if your aim is to get some "significant" p-value at any cost, then either change the level of significance or just make a convenient p-value up. This may seem like cheating, but it is no different from what you are doing right now.
Samuel: in SAS, if you want to use Kruskal-Wallis, you can use Proc Npar1way, or you can use Proc Rank to transform your data with normal=blom or normal=VW (van der Waerden scores) and then use the transformed values, instead of the original data, in Proc ANOVA or Proc GLM with Tukey.
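For example (a minimal sketch only - I am assuming a dataset called mydata with a grouping variable group and a response y, so adjust the names to your own data):

proc npar1way data=mydata wilcoxon;           /* Kruskal-Wallis test for more than 2 groups */
   class group;
   var y;
run;

proc rank data=mydata normal=blom out=ranked; /* Blom normal scores of y */
   var y;
   ranks y_blom;
run;

proc glm data=ranked;                         /* ANOVA + Tukey on the normal scores */
   class group;
   model y_blom = group;
   means group / tukey;
run;
quit;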
Samuel, try to use Fisher's LSD (least significant difference) test. It protects better against inflation of the alpha error rate in multiple-comparison procedures.
Another way is to combine some groups, based on the research assumptions, before the analysis.
The whole discussion here is faulty. If you *test* something, then you do it to have control over error rates. Such control is *only* possible when the analysis is INDEPENDENT of the data. Looking at the data and afterwards ("post hoc") deciding on a test that might "fit" the data makes you LOSE that control. So why test at all?
Once again: if your only aim is to get some p-values below 0.05, and if you follow advice like "try another adjustment procedure", "transform your data", "use non-parametric tests" or the like, then it would be much easier to simply create such p-values as random numbers between 0 and 0.05. You can use Excel for this:
=0.05*RAND()
This gives you the desired p-value: it is below 0.05 and it gives you no control over any error rate.
PS: @Olga: LSD only controls the type-I error rate when the number of tests is ≤ 3 and the ANOVA p is ≤ α (the latter being a condition that is *not* required for other adjustment procedures like Tukey, Bonferroni, Holm, Benjamini-Hochberg, ...).
I agree with Jochen that the method of analysis should be chosen for reasons unrelated to the desires/needs of the researcher, and that one should accept its results.
But when the assumptions behind the chosen analysis are not met, it is necessary, and certainly methodologically legitimate, to search for data transformations or other methods of analysis that are more consistent with the characteristics of the data, and to see what conclusions they lead to. If the assumptions of this new analysis are satisfied better than before, we should be content with the new results, even if we like them less than the previous ones.
In the present example, it may be that a lack of homogeneity of variance or of normality affects the results, as Samuel commented, so it may be useful to look for alternative analyses.
Guillermo, I am certainly with you! There is only one point I want(ed) to stress that is usually neglected or missed:
It is absolutely legitimate to search for and find a seemingly appropriate analysis ("fitting to your data"), be it by transformations, by using ranks, or anything else. I agree here. But: once you have done so, the p-value from this analysis on *THIS SAME DATA* is meaningless, since it can no longer control the error rate. So after having identified an analysis that works well on this "learning" dataset, one would have to apply it to *NEW DATA* (a "test set") to get an unbiased p-value.
To give an example (not a very good one but it clarifies the point):
Say you have some patients with cancer and you assume that there is a genetic marker that will allow you to predict this cancer. You try this and that, and finally you find that the absence of a Y-chromosomal gene (*) is strongly correlated with the cancer, p=0.001. This would mean that only 1 in 1000 such studies would find such a strong (or stronger) correlation if the gene were not associated with the cancer. So have we identified a potent anti-oncogene?
However, consider that the proportion of males in the healthy control group happened to be higher (this can happen by random sampling!). It is then obvious that the proportion of Y chromosomes is lower in the cancer group. So identifying a systematic difference in a Y-chromosomal gene reflects "bad" sampling rather than the existence of a cancer gene. This is a very obvious example, and surely no one would walk into such a trap (also, an adjustment for the sex effect could be made, and so on) - but it shows the problem:
If we were not aware of the relation between Y and sex, and if we were not even aware of the different X:Y ratios in the groups, how should we recognize that this result is a big mistake? Such things *can* happen (and they do), and our lifesaver is to control error rates, so that such things - which are not absolutely avoidable - do not happen too often.
Now imagine you had the data for all genes and you transformed and modified these data and the kind of analysis in order to look for a correlation between genes and cancer in your sample *until you find something*. Then you will - in this sample - SURELY come up with some X- or Y-chromosomal genes (that are in fact absolutely unrelated to the cancer!). So the actual error rate here is not below 0.001 - it is close to 1.0!
Unfortunately, in the real world we just do not know whether confounding effects exist or whether our data have some strange properties. If we select analysis methods according to the data, then we will likely make (unwanted and unintended) use of these confounders or properties to get low p-values. And if we do this in the long run, many results will be false positives, so we have not effectively controlled the error rate.
I hope this made it a little clearer now.
(*) for the non-biologists: humans have 23 pairs of chromosomes as genetic material. One of these pairs is different for males and females. These sex-specific chromosomes are called allosomes or sex chromosomes. In males, a major part of one of the two allosomes is missing. This shorter version is called the "Y chromosome", the full-sized version the "X chromosome". Thus, males have the allosomes XY, whereas females have XX. Not all genes on the X chromosome are present on the Y chromosome. Such genes are present in 2 copies/cell in females but in only 1 copy/cell in males.
Yes, it is possible to have a significant ANOVA F-test, yet no significant differences in the Tukey post hoc tests. Remember that the ANOVA F-test represents all of the possible contrasts among your groups, while the Tukey post hoc tests only examine the PAIRWISE tests. It is entirely possible that there is a significant difference among your groups, but it is more complicated than a simple pairwise test. For example, if you have 4 groups (A, B, C and D), then maybe the contrast (A + B + C) - D is statistically significant, even though the individual pairwise comparisons A - D, B - D and C - D are not significant. So, it isn't really true that the pairwise tests have less power than the overall ANOVA F-test ... it's just that the pairwise Tukey tests do not capture all of the possible contrasts that are covered by the overall ANOVA F-test.
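To illustrate in SAS (a sketch only - I am assuming a hypothetical dataset mydata with the four levels A-D in a variable group and a response y), such a complex contrast can be tested directly in Proc GLM:

proc glm data=mydata;
   class group;
   model y = group;
   /* H0: (muA + muB + muC)/3 = muD; coefficients follow the alphabetical order of the levels A, B, C, D */
   contrast '(A+B+C)/3 vs D' group 1 1 1 -3;
   estimate '(A+B+C)/3 vs D' group 1 1 1 -3 / divisor=3;
run;
quit;

(Of course, a contrast picked only after inspecting the data raises exactly the error-rate problem discussed above.)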
Jochen and Guillermo are certainly correct when they tell you that you should not waste time searching for this unknown significant contrast. You should choose your contrasts (e.g. Tukey, Dunnett, Hsu, or custom Bonferroni adjusted contrasts) based on the experimental design. I would not recommend that you use Fisher's LSD contrasts to see if that produces a significant result, because the Fisher's LSD method is more likely to give you false positives. Most statisticians reject any use of Fisher's LSD. Don't go searching for any statistical method that will give you the result you want. Choose the methods best suited for your original experimental design and accept the results you find.
That's true, Jeff, and it is even worse. It is no problem to get results the other way around: a non-significant ANOVA p but significant pairwise p-values (corrected for multiple testing, of course, e.g. by Tukey!). This should actually not happen if your explanation were 100% correct and complete. So there is more to that problem, and this is the fact that the ANOVA tests something entirely different. It has been stated often and by many people that the ANOVA F-test is equivalent to testing an omnibus null hypothesis, but this is more an allegory than a fact ("lies-for-kids", so to say). To my knowledge, only Fisher proposed the ANOVA as protective when *3* (not more!) groups had to be compared, and then the FWER is in fact kept (without further correction of the pairwise p-values, which is Fisher's LSD). This "protection" was slightly generalized by Scheffé and others to control the FWER even for more than 3 groups and arbitrary linear contrasts, but the protection was never meant as testing H0:µ1=µ2=µ3=...=µn or any combination of linear contrasts. I actually do not know where this (misconception?) entered the literature. Or I overlooked something (probably...). However, IMO it has caused a lot of confusion and should be overcome. In my opinion, the ANOVA F-test only tests whether the inclusion of a predictor in a linear model reduces the residual variance. Translating this into any other kind of hypothesis will eventually cause problems, confusion, and inconsistent results or interpretations.
And just to conclude: "The good news is that statistical analysis is becoming easier and cheaper. The bad news is that statistical analysis is becoming easier and cheaper." (Hofacker, 1983). I 100% agree with you, Jochen; the real problem is that software is becoming more and more user friendly (assisting users in doing their statistics with minimal or no knowledge of statistics). This is good and bad at the same time. There is a struggle for p-values < 0.05, but no one cares about power, the clinical significance behind the statistical probability value, the fairness and honesty of the statistical analysis, the study design, the way in which the data were collected, etc. Statistics is becoming like a shell game: data normalization, imputation, transformation, post-hoc test selection, etc. are all strategies to obtain a p < 0.05, forgetting that a p > 0.05 is also a result!!!
I've seen this occur when assumptions of sphericity are violated. ANOVA makes the assumption that the variance within each group is equal and that all groups are independent. If these assumptions are violated, you can get bizarre results like the one you're describing. You may need to adjust for violations of sphericity.
The omnibus F test in ANOVA only guarantees that one of the possible contrasts is statistically significant - not that the contrast is significant after correction for multiple comparisons or that the contrast that is significant is an interesting one (e.g., it need not be a pairwise comparison).
Thom, your conclusions are correct, but the reason is slightly wrong: that the ANOVA tests an "omnibus hypothesis" that all means are equal is a simplification (a useful lie, so to say). Assuming that it is correct leads in turn to problems in interpreting any subsequent findings. The ANOVA tests the hypothesis that a categorical predictor in the model reduces the residual sum of squares. The omnibus H0 is one possible scenario under which this "ANOVA H0" would not be rejectable. Testing a predictor as a whole is different from just testing whether all means are equal (with one exception: when there are only two categories in the predictor).
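To put that into a formula: for a one-way layout with k groups and N observations in total, the F-test compares the model with the predictor to the intercept-only model,

F = [ (SSresidual(intercept only) - SSresidual(with predictor)) / (k - 1) ] / [ SSresidual(with predictor) / (N - k) ]

i.e. it asks whether including the predictor reduces the residual sum of squares by more than would be expected by chance.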
The statement "The omnibus F test in ANOVA only guarantees that one of the possible contrasts is statistically significant" is wrong. One exception: when there are two groups, then F = t² and a significant F implies a significant t. But I think there is an argument that for more than 2 groups a significant F implies a significant contrasts, when contrast is seen as one of the infinitively many possible matehmatical linear contrasts that can be formulated. This way your statement was correct... but I doubt that this is the answer to the question ;) However, sorry if I confused you (or anybody else). In this case simply ignore my post, this won't hurt. It is not that important.
I meant that when F > Fcrit for the omnibus test it is possible to construct a linear contrast that is statistically significant at the equivalent alpha threshold.
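(That is essentially Scheffé's result: the contrast attaining the maximum has coefficients c_i = n_i*(group mean_i - grand mean), and by the Scheffé criterion it is significant exactly when the omnibus F exceeds its critical value - although such a data-driven contrast is rarely of substantive interest.)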
It is often assumed (falsely) that a significant omnibus test implies one or more differences among pairwise comparisons (when not infrequently there are none).
A more dangerous myth (arguably) is that common post hoc tests such as Tukey's require a significant omnibus F. As you pointed out, most of the tests don't. This makes Tukey's HSD etc. overly conservative when used in this way.
Of the common post hoc tests, only Fisher's LSD requires a significant omnibus F.