ANOVA and post-hoc tests have nothing to do with each other!
ANOVA tests the "explanatory value" of a predictor in a model (typically a factor with more than two levels). It answers the question of how likely the observed reduction in residual variance would be if the presumed predictor had no association with the response.
Post-hoc tests are a priori unspecified tests, so to speak tests performed *after seeing* the data and then deciding which comparisons might be interesting. They use pooled variance estimates and control the FWER or the FDR for the family of tests. There are a few special cases of post-hoc tests that do not efficiently control the FWER on their own; they need a kind of "protection" by a "significant" ANOVA. The most famous example is Fisher's LSD, but this is restricted to 3 groups. Tukey's HSD does not require any protection and controls the FWER by itself.
In general, ANOVA and post-hoc tests answer considerably different questions. Your observation is thus not puzzling at all.
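To make the separation concrete, here is a minimal Python sketch (assuming numpy, scipy and statsmodels are available; the group labels and effect sizes are invented for illustration). The F-test and Tukey's HSD are computed independently of each other; neither needs the other's result.

```python
# A minimal sketch showing that the overall F-test and Tukey's HSD are
# computed and interpreted separately (made-up data).
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
# Three hypothetical treatment groups, with a modest shift in the third one
a = rng.normal(10.0, 2.0, 20)
b = rng.normal(10.0, 2.0, 20)
c = rng.normal(11.5, 2.0, 20)

# Overall ANOVA: compares the full model (separate group means) with the
# restricted model (one common mean) via the F statistic
f_stat, p_anova = stats.f_oneway(a, b, c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# Tukey HSD: pairwise mean differences with family-wise error control,
# using the pooled variance estimate; it does not "need" the ANOVA result
values = np.concatenate([a, b, c])
groups = np.repeat(["A", "B", "C"], 20)
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```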
I agree with Jochen; these are different tests with different aims. I have even seen studies where the researchers used Duncan or LSD after the ANOVA showed no significant result, and yet the comparisons were significant for LSD and Duncan.
The F statistic in a (balanced) ANOVA is, in words, "the variance of the group means divided by the mean of the treatment variances, multiplied by the number of replications" (why the multiplication? answer that for yourself). But different comparison methods use different thresholds for declaring significance (I bet that if you use LSD or Duncan there will be significant differences). The logic behind these tests is error control (Type I and II), and the choice depends on your aim and partly on your data.
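As a quick numeric check of that verbal formula (a sketch with made-up numbers, valid for a balanced design with equal replications; scipy is used only to cross-check):

```python
# Check: F = n_replications * variance(group means) / mean(within-group variances)
import numpy as np
from scipy import stats

groups = [
    np.array([4.1, 5.0, 4.6, 5.3, 4.8]),
    np.array([5.9, 6.4, 5.5, 6.1, 6.6]),
    np.array([5.0, 4.4, 5.7, 5.2, 4.9]),
]
n = len(groups[0])                                           # replications per group (balanced)

ms_between = n * np.var([g.mean() for g in groups], ddof=1)  # variance of the means, times n
ms_within = np.mean([g.var(ddof=1) for g in groups])         # mean of the treatment variances
f_manual = ms_between / ms_within

f_scipy, p = stats.f_oneway(*groups)
print(f_manual, f_scipy, p)                                  # the two F values agree
```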
I also agree with Jochen. This is just the idea of a post-hoc test. In my opinion, the really strange situation appears when the ANOVA or Kruskal-Wallis result is non-significant but becomes significant in the post-hoc test.
Dear Malgorzata, you can try that: perform an ANOVA on a dataset that results in a non-significant ANOVA with a p-value of around 0.1 or 0.07, and you will see that there may be some significant differences in the two methods I mentioned above (LSD, Duncan). Keeping the p-value roughly constant and adding more groups, the number of significant post-hoc tests will increase.
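A rough simulation in that spirit (only a sketch: Duncan's test is not available in scipy/statsmodels, so plain unadjusted pairwise t-tests stand in for an LSD-style comparison, which strictly would use the pooled MSE and its degrees of freedom):

```python
# All five groups are drawn from the same population, yet it is easy to find a
# dataset where the ANOVA p-value is "non-significant" (between 0.05 and 0.12)
# while at least one unadjusted pairwise t-test comes out below 0.05.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
k, n = 5, 10

while True:
    data = [rng.normal(0.0, 1.0, n) for _ in range(k)]
    f_stat, p_anova = stats.f_oneway(*data)
    pairwise = [stats.ttest_ind(data[i], data[j]).pvalue
                for i, j in itertools.combinations(range(k), 2)]
    if 0.05 < p_anova < 0.12 and min(pairwise) < 0.05:
        print(f"ANOVA p = {p_anova:.3f}")
        print(f"smallest unadjusted pairwise p = {min(pairwise):.3f}")
        break
```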
Thanks! I had such observations and thought something was wrong with my script for computing the ANOVA. Good to know that you have seen this as well.
This is called "weak control" (of the FWER): once the ANOVA is significant, one collects too many false positives within the family. So either you get nothing (ANOVA = n.s.) or a "clustered" set of false positives (among true positives, hopefully). There is only one special case where this "clustering" won't destroy the control of the FWER, namely when you have exactly 3 groups (or 3 post-hoc tests); this is the case where Fisher's LSD really controls the FWER.
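A small Monte Carlo sketch of this "weak control" point (group sizes, effect size and number of replicates are arbitrary choices; unadjusted two-sample t-tests stand in for the LSD comparisons): one group is strongly shifted, the rest share the same mean, and we count how often the protected-LSD recipe flags at least one of the truly-null pairs.

```python
# With the protected-LSD recipe (unadjusted pairwise t-tests only after a
# significant ANOVA), the rate of false positives among the truly-null pairs
# stays near alpha for k = 3 but is far above alpha for k = 6.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def fwer_among_nulls(k, n=15, shift=3.0, reps=2000, alpha=0.05):
    errors = 0
    for _ in range(reps):
        data = [rng.normal(0.0, 1.0, n) for _ in range(k - 1)]
        data.append(rng.normal(shift, 1.0, n))           # the one truly different group
        if stats.f_oneway(*data).pvalue >= alpha:
            continue                                     # "protection": stop if the ANOVA is n.s.
        null_pairs = itertools.combinations(range(k - 1), 2)
        if any(stats.ttest_ind(data[i], data[j]).pvalue < alpha for i, j in null_pairs):
            errors += 1
    return errors / reps

print("k=3:", fwer_among_nulls(3))   # close to 0.05
print("k=6:", fwer_among_nulls(6))   # much larger than 0.05
```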
Umesh, justification of the results depends mostly on your aim and your constraints (especially financial ones). For example, if you find a significant ANOVA, you choose the post-hoc method based on how it controls type I and type II errors. A significant post-hoc result may lead, for instance, to changing some methods in your work. In health science, you may find better healing results using some antibiotics. If two new antibiotics are more expensive than the old ones but can heal your patients better, you should control the type I error as strictly as you can, because any result that is significant merely by chance has negative financial consequences while having no real effect on healing.
I hope you have a firm grasp of these two types of errors, because they are very important in the health and social sciences.
Dear Ivan, you've mentioned a good point, but in regression you cannot control the error types, as far as I know. For example, if we have 12 groups to compare, we can simply control the FWER using Tukey, Sidak, Bonferroni, etc. at different levels, and if the groups come from the same population, you have a very low type I error. But using regression and the criteria you mentioned, there may be a "chance" for any dummy variable to enter the model even if the groups come from the same population, unless you lower the p-value criterion for entry (for example, using 0.03 instead of 0.05). So this can be somewhat misleading.
Decades ago, when SPSS, Minitab, etc. were not yet around, ANOVA was a prerequisite for Duncan's, Tukey's and other multiple comparison tests. If the ANOVA shows no significant difference, that means that between the variables in any of the pairs there is no significant difference (an ANOVA involves at least 3 pairs to compare, unlike a t-test which has only one pair), and we stop there. If the ANOVA says there is a significant difference, we proceed to a multiple comparison test like Duncan's. The multiple comparison test finds out (like a t-test) whether there is a significant difference between the two values in each of the pairs.
These days, the computer can compute the F-test for the ANOVA and the multiple comparison tests at the same time, making the ANOVA irrelevant for the purpose of the multiple comparison test. Take note that the ANOVA does not tell you whether all pairs (at least 3 pairs) show a significant difference; it only tells you that at least one pair does. To find out, based on the ANOVA result, which of the pairs show a significant difference (pair number 1, or all of the pairs), we proceed to multiple comparison tests, and the t-test is one of them.
I agree, Ehsan. The purpose of that was not to waste time testing differences in the many pairs (at least 3) that would turn out not significant after all. But SPSS et al. compute the ANOVA and the multiple comparison tests together in just two seconds; even if the multiple comparison tests show no significance, we waste only two seconds.
As to the observation of Umesh, there is nothing surprising there; it is a normal occurrence. In an ANOVA involving ten pairs, a significant difference could mean a significant difference in one pair and no significant difference in the other nine. Many researchers (and even statisticians) overlook this point. Ed
I just would like to comment to those who down-voted my first answer. If this forum were an opinion section, I would not mind; it would be your own opinion. But it is not. This is a forum to clarify issues and to correct wrong ideas. Down-voting a correct, authority-based answer simply because it does not jibe with your long-held but incorrect concept will further strengthen and multiply mistakes. I hope you review your statistics very well so that mistakes will not be multiplied to researchers who seek help on ResearchGate.
@Eddi, time saving is not the aim of ANOVA. When the ANOVA shows no significance but you have significant comparisons, especially with a high number of treatments, your type I error is very high for LSD and Duncan. But be aware that all the numbers in the ANOVA table talk to you; if you don't understand their language, you lose some key points of the analysis.
Thank you, Fausto. As a matter of opinion, I am up-voting your answer. To be honest, throughout my stay on RG I have down-voted only once, as a matter of opinion. Even if I don't yet know whether a new answer is correct and clarifying, my tendency is to up-vote it, because it is an effort to help. Gestures of helping should be up-voted, not down-voted.
Late to the show here, but when I've actually seen this happen it's been in situations where the significant post-hoc test is not a simple pairwise test but rather requires the pooling of some groups, i.e., if there were 3 groups and no significant pairwise comparisons, it could be that A and B pooled together are different from C. A digression from the conversation that's gone on here, but a possibility in your data.
The main-effect test in ANOVA tests the null hypothesis that all the means are equal versus the alternative that at least one is different from at least one other. The type I error will be what you set it at, if all assumptions are met. A big assumption is that the variability in each level is the same.
Multiple comparisons (all pairwise comparisons in this discussion) will result in the overall type I error being no more than the level you indicate, again if all assumptions are met. Again, a big assumption is that the standard error is the same for each of the sample means. Many of the simpler multiple comparison procedures assume equal sample sizes. Thus if you have different variability in the populations or considerably different sample sizes, you can get the results you observed.
There are variations on the basic Tukey method that handle unequal variability and unequal sample sizes; the overall error-rate protection becomes approximate in these cases. There is a wide range of different methods for multiple comparisons, each with different overall error-rate approximations and each handling unequal variability and unequal sample sizes differently. The resampling method in the MULTTEST procedure of SAS has the best overall set of properties.
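This is not the SAS MULTTEST resampling procedure mentioned above, but here is a simple sketch of the same general idea in Python: pairwise Welch t-tests (which do not assume equal variances or equal sample sizes) with a multiplicity adjustment applied to the collected p-values. The group data and the choice of the Holm method are just for illustration.

```python
# Pairwise Welch t-tests plus a p-value adjustment; 'holm' could be replaced
# by 'bonferroni', 'fdr_bh', etc. Data are made up.
import itertools
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(7)
groups = {
    "A": rng.normal(10.0, 1.0, 12),
    "B": rng.normal(10.0, 3.0, 25),   # unequal variance and unequal n
    "C": rng.normal(12.0, 2.0, 8),
}

pairs = list(itertools.combinations(groups, 2))
raw_p = [stats.ttest_ind(groups[a], groups[b], equal_var=False).pvalue
         for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.3f}, reject = {r}")
```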
You can get misleading results from ANOVA in various situations. One case, for example, is the "slippage configuration", where one mean is much different from the others, which are all close together. The ANOVA may show a statistically significant difference because of the one different mean, but if that mean were removed the p-value would be large, i.e., the result would not be statistically significant.
Well, it could happen, because, as you can see, the tests answer slightly different questions and have different power. Since the post-hoc tests focus on differences between particular groups, they have more power to detect such differences, even though the overall ANOVA indicates that the differences among the means are not statistically significant.
I suggest you have a look at Huck, S. W. (2008). Statistical Misconceptions. Taylor & Francis.
I used SPSS and I used to have the same problem. I have tried different tests under Post Hoc. When equal variances are assumed, I suggest you use the Dunnett test, where you can get different results if you change the selection of the control category (First or Last) and sometimes the test direction (2-sided, < Control, > Control). When equal variances are not assumed, you should use Tamhane's T2. That's my personal experience.
Not achieving a statistically significant result does not mean you should not also report the group means ± standard deviations. However, running post-hoc tests is then not warranted and should not be carried out (the p-value is greater than 0.05).
Recall from earlier that the ANOVA test tells you whether you have an overall difference between your groups, but it does not tell you which specific groups differed; post-hoc tests do. Because post-hoc tests are run to confirm where the differences occurred between groups, they should only be run when you have shown an overall significant difference in group means (i.e., a significant one-way ANOVA result). Post-hoc tests attempt to control the experiment-wise error rate (usually alpha = 0.05) in the same way that the one-way ANOVA is used instead of multiple t-tests. Post-hoc tests are termed a posteriori tests; that is, performed after the event (the event in this case being a study).
If "post-hoc" means pairwise, it is a matter of the sample sizes, the number of groups, and the differences between the sample means. Consider the case where there are 4 groups (A, B, C, D), where the mean of D is larger than the others, which are close together (the slippage configuration). The overall ANOVA may be significant because of the slippage of D, yet the pairwise comparisons among A, B, and C may not be significant. On the other hand, if there are many more groups and one has slipped, the overall ANOVA may not be significant while the pairwise comparisons involving the extreme group are.
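A small sketch of the first (slippage) scenario, with invented data: groups A, B and C share one mean and D is shifted. The overall ANOVA will typically be significant because of D, while the A-B, A-C and B-C comparisons typically are not; the exact p-values depend on the random draw.

```python
# Slippage configuration: one shifted group drives the overall F-test.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(3)
n = 15
a, b, c = (rng.normal(10.0, 2.0, n) for _ in range(3))
d = rng.normal(13.0, 2.0, n)                     # the slipped group

print("ANOVA p =", stats.f_oneway(a, b, c, d).pvalue)
print(pairwise_tukeyhsd(np.concatenate([a, b, c, d]),
                        np.repeat(["A", "B", "C", "D"], n)))
```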
ANOVA just tells you that there is at least one significant difference somewhere among your samples (experimental conditions, treatments, etc.), but it cannot tell you precisely where the differences are. For that you need post-hoc tests: pairwise comparisons between all the experimental conditions. You can then find a significant difference between two groups that is nevertheless meaningless, so there is no real significant difference. For example, say you have three experimental groups (A, B and C), including a control group (A, untreated), whose differences you want to test. With ANOVA you get significance, but with the post-hoc test you only get a significant difference between B and C, which are two different treatments; there is no significant difference between A vs. B nor A vs. C. That means there is no real significant difference in your testing, although unfortunately you got one with ANOVA.
Imprecise formulations increase the confusion about that topic. That's not very helpful.
Specifically:
"ANOVA just tells you that there is at least one significant difference somewhere among your samples (experimental conditions, treatments, etc.), but it cannot tell you precisely where the differences are."
No. ANOVA compares entire (nested) models, not groups/samples. ANOVA doesn't care about differences between groups/samples. It is about the increase in residual variance that can be attributed to the restriction of some set of coefficients in the model. These coefficients usually comprise a complete explanatory factor, or even several such factors. It may be a source of confusion that a two-level factor is usually coded with a single coefficient in the model (estimating the expected difference in the response between the two levels), so that the restriction of this single coefficient actually represents the hypothesis of "no difference between the (two!) groups".
An ANOVA is useful for seeing how a set of restrictions in a model impacts the performance of the model (measured as the increase in the residual variance). The ANOVA does not analyze differences between group means at all.
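For readers who want to see this "comparison of nested models" view in code, here is a hedged sketch using statsmodels with made-up data: the restricted model (intercept only) is compared with the full model containing the factor, and the resulting F-test is exactly the one-way ANOVA F-test for that factor.

```python
# The ANOVA F-test as a comparison of nested OLS models.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 20),
    "y": np.concatenate([rng.normal(m, 1.0, 20) for m in (5.0, 5.5, 6.5)]),
})

restricted = smf.ols("y ~ 1", data=df).fit()        # all coefficients for 'group' restricted to zero
full = smf.ols("y ~ C(group)", data=df).fit()       # one coefficient per non-reference level

# F-test for the increase in residual variance caused by the restriction
print(sm.stats.anova_lm(restricted, full))
```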
Further:
"You can then find a significant difference between two groups that is nevertheless meaningless, so there is no real significant difference."
No. "A significant difference" is not something one finds. You can use the p-value as an empirical measure of "significance"; significance is then a "statistical signal-to-noise ratio" that can take lower or higher values. It is not an absent-or-present thing; it is a value estimated from the observed data, a number that varies in the interval (0,1). To make use of such a value you need to interpret it. If the p-value is quite small you may decide to reject the tested hypothesis and consider that you will have to work with the unrestricted model to account for all relevant structures in your data. But this is an interpretation and a personal decision. It is not in the data, it is not god-given, and it is not a natural property or phenomenon that is discovered. There exists nothing like a "real significant difference".
Finally:
"there is no significant difference between A vs. B nor A vs. C. That means there is no real significant difference in your testing"
No. Again, as you use it, the term "significant" refers to the interpretation (not to the numeric value of p). As you used the word "significant" here, the two "non-significant" findings can always be attributed to the failure of the data to give you a low-enough p-value. Since the p-value decreases with sample size for any non-zero effect size, the result only tells you* that you have not had enough data to convince you to better use a model that includes a coefficient for the A-B difference or for the A-C difference, given your experimental setup and sample size. There is nothing more a p-value can tell you.
--
* The effect size in the ANOVA model is a continuous variable. The probability of such a variable taking a particular constant value (like zero) is zero. Thus it is certain that the effect is non-zero, and therefore it is only a matter of the amount of data to get a "sufficiently small p-value". So, actually, getting small p-values is not the crucial point of the analysis! The crucial point is the likely size of the effect, which may be negligible. But this can only be addressed with subject knowledge, not with probability theory. Combining effect sizes and testing in a more formal way leads to Neymanian hypothesis tests, where a balance is set between the size and the power of a test according to some given relevant effect size (for which one actually must formulate a loss function, which is usually not possible in research).
I could not see why you agree or disagree. The only thing I concluded from your contribution is that you did not understand my meaning; otherwise I absolutely agree with what you said. I tried to explain it more simply for someone who may not be familiar with those statistical analyses, since most of the ambiguity comes from deciding what each test is for and which test is most suitable for such analyses.
I appreciate that you help by giving simple explanations. That's a good thing. The point I see critically is that some of these simple explanations are not only wrong but also misleading. Every day I am confronted with the downside of similar "simple but wrong" explanations, leading to bad publications, bad reviews, a waste of money, resources and animal lives, and to a large confusion and anxiety about statistics among students (and postdocs).
My points in short:
(i) You say that ANOVA tells you something about differences in means. That's wrong. In very simple cases this may be equivalent, but generally it is not. ANOVA should not be seen as a tool to analyse differences in means but rather as a tool to compare different (nested) models.
(ii) You say that there "is" something like significance and that it is our aim to find it. That's wrong. Significance is a matter of interpretation; it is something that we attribute to observations. It is not in the observations.
(iii) You say that a "non-significant result" means that there is no difference. That's wrong. It is as wrong as concluding a difference when the result is "significant". To make any inference about the difference you would need a Bayesian approach. Significance does not answer the questions you think it does.
Look, please don't try to hear from others only what you want to understand.
And again, I agree with what you're trying to explain (with some disagreement), and I don't know why you're trying to give me lessons in statistics.
Never mind what you're trying to prove, which might be acknowledged by somebody. Please just bear in mind what Einstein said: "If you cannot explain it to someone who is six years old, it means that you do not understand it yourself."
Interesting. When I ask students what they understand from your answer, they tell me just the things I claim you said. I will try to find out how differently the sentences can be understood. Maybe you can be a bit more specific and tell me where I am wrong?
Regarding your Einstein citation: we should then all stop teaching stats. The statisticians have failed, as they have not even managed to explain it to scientists from other fields, and those who give seemingly simple explanations that are understood by six-year-olds (unfortunately this also includes many statisticians) are wrong. The problem I see is that these "six-year-olds" simply have to invest time to understand the topic; the topic can be understood, but not on the fly.
PS: I am surely not giving you lessons in statistics.
The underlying trouble with the question is that you have entrusted your analysis to ANOVA, which gives a precise answer to a vague question. Was your study hypothesis really that there is some kind of difference between the means? Because after you get a significant ANOVA, you are faced with the problem that it tests a hypothesis that rarely has any scientific value. You might be concerned that there were differences between the means of academics marking exam papers, so a significant ANOVA would lead you to conduct some marker training, but most scientific hypotheses can be expressed as a one-df hypothesis.
Formulating your hypothesis after the fact based on post-hoc tests is completely reversing the logic of science. You can noodle around post-hoc to see if there's anything interesting looking, but it's not hypothesis testing, since you personally don't have a hypothesis. If you had, you would have written a model to test it.
I am not entirely clear whether your question arises from an apparent anomaly you have encountered with your own data or simply from curiosity. If the former, did you use a standard F test and if so, did you do some assumptions testing before using it?
I am in fact pleasantly surprised that GraphPad clearly writes that ANOVA and pairwise comparisons of group means are not logically connected.
However, there is still (in my opinion) a severe flaw in the text: the authors repeatedly compare a "significant ANOVA result" with "significant results of the 'post test'". These two significances refer to different concepts and are not comparable in principle; the significance of the data under the F-test is to be judged differently than the significance of the data under the t-test. These tests have a different "frame of interpretation" for their results.
Connected to this point, Fisher's LSD is reported as a case where the significance of the ANOVA really determines the "validity" of the 'post tests'. This is again not correct. Fisher's LSD controls the family-wise error rate (FWER) over 3 tests (mean comparisons). This is a different kind of error rate than the test-wise error rate (TWER), and it has nothing to do with the validity of the TWER; these are still different things. It is only the case that the FWER is controlled at alpha when both the ANOVA and the 'post tests' are conducted at alpha. Apart from this: if one wants to control the FWER, then Fisher's LSD works (with k=3). If the FWER is not a concern, the ANOVA is completely nonsensical, even for k=3 groups.
However, there is one case where ANOVA and 'post test' really do the very same thing: k=2. Only then is there a simple monotonic relation between F and t (F = t²), and the p-values are identical (thus, doing an ANOVA with k=2 is also an application of Fisher's LSD, which is therefore strictly valid for 2 ≤ k ≤ 3). The point here is surely that the entire "family of tests" is a single test, so controlling the FWER is identical to controlling the TWER.
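This k=2 equivalence is easy to verify numerically; a short sketch with made-up data (equal-variance t-test):

```python
# For two groups, the one-way ANOVA F statistic equals the square of the
# two-sample t statistic, and the p-values coincide.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.normal(0.0, 1.0, 30)
y = rng.normal(0.5, 1.0, 30)

f_res = stats.f_oneway(x, y)
t_res = stats.ttest_ind(x, y, equal_var=True)

print(f_res.statistic, t_res.statistic**2)   # F == t^2 (up to floating point)
print(f_res.pvalue, t_res.pvalue)            # identical p-values
```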
Thank you for your answer, Jochen. So, if I have 7 means to compare (I am interested in understanding if 7 treatments had different effects on the intensity of 10 wine sensory attributes) may I directly apply a post-hoc test (for example Duncan)? Should I also report the results of ANOVA even if "in contrast" with Duncan's test results? Is it correct to report in my paper only the results of the post-hoc test? And in materials and methods?
It finally depends on the reviewers. Technically, the ANOVA is not at all interesting in your case, so I would say it is not required to report it. What should be reported are the pooled variance estimates, the residual standard error, and the degrees of freedom used for the tests. But I know (unfortunately) that reviewers exist who will call your analysis wrong if you don't mention that you really did do an ANOVA.
Since a "significant test" is only the start of the interpretation (not the result or the end of an analysis!), I would also like to see the actual estimates (how big are the differences between the treatments?) together with some measure of the uncertainty associated with these estimates (most people like confidence intervals). Just to make sure: I am not talking about the 7 means of the 7 groups; I am talking about the mean differences between the groups.
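One possible way (a sketch, certainly not the only one) to obtain these quantities in Python: the pooled residual standard error and residual degrees of freedom from the fitted one-way model, and the pairwise mean differences with their confidence intervals from Tukey's HSD. The treatment labels and data are invented for illustration.

```python
# Reporting quantities: residual SE, residual df, and pairwise mean differences with CIs.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(9)
k, n = 7, 10
df = pd.DataFrame({
    "treatment": np.repeat([f"T{i}" for i in range(1, k + 1)], n),
    "intensity": rng.normal(5.0, 1.0, k * n),
})

fit = smf.ols("intensity ~ C(treatment)", data=df).fit()
print("residual SE =", np.sqrt(fit.mse_resid), " df =", int(fit.df_resid))

# The meandiff, lower and upper columns give the estimated differences and their CIs
print(pairwise_tukeyhsd(df["intensity"], df["treatment"]))
```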
Karl L. Wuensch, "Pairwise Comparisons". Excerpt: "Members of the STAT-L were recently asked: I am running a one-way ANOVA and testing significance between groups using the Tukey HSD test. The ANOVA shows a statistically significant between-group difference. However, the Tukey HSD shows no pair of groups that are different from each other." Available at: http://core.ecu.edu/psyc/wuenschk/StatHelp/Pairwise.htm [accessed November 23, 2009].
I have experienced a case where my p-value was 0.056. Traditionally this is not statistically significant because the value is greater than the 0.05 threshold. However, the post-hoc test showed significant groupings. Can I go ahead and report the significant groupings and the p-value (which is greater than 0.05)?