Is there a post hoc test for the Kruskal-Wallis test other than Bonferroni? And why is Bonferroni used as the reference post hoc test for non-parametric tests in SPSS?
The Bonferroni method is simply a way to constrain the aggregate risk of type I error at or below some specified level for a set of statistical tests.
There are other "control" methods available, several of which offer somewhat better power than the Bonferroni: Dunn's method (for applicable designs), Benjamini-Hochberg, Holm's method, and so on.
The R function, p.adjust, offers a simple way to apply a wide array of such mechanisms. Here's a link that walks through the process: https://rcompanion.org/rcompanion/f_01.html
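For illustration, here is a minimal sketch of p.adjust() applied to a hypothetical set of raw p-values (the numbers are made up):

# Hypothetical raw p-values from several pairwise tests (made-up numbers)
p.raw <- c(0.010, 0.020, 0.005, 0.150)

p.adjust(p.raw, method = "bonferroni")  # multiplies each p by the number of tests (capped at 1)
p.adjust(p.raw, method = "holm")        # step-down Holm procedure, at least as powerful as Bonferroni
p.adjust(p.raw, method = "BH")          # Benjamini-Hochberg, controls the FDR rather than the FWER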
Unfortunately, some software calls some post-hoc tests "Bonferroni", when, as David Morse points out, Bonferroni isn't a post-hoc test, just a way to adjust p-values (or in the old days, alpha values).
The best course is always to look up what test your software is actually using.
My guess is that for Kruskal-Wallis, your software is using pairwise Wilcoxon-Mann-Whitney tests, and then applying a Bonferroni correction.
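If that guess is right, the R equivalent would look roughly like this sketch (made-up data; pairwise.wilcox.test() is in base R's stats package):

# Hypothetical data: a numeric response and a grouping factor
y <- c(2.1, 3.4, 1.9, 5.0, 4.2, 6.1, 7.3, 6.8, 5.5)
g <- factor(rep(c("A", "B", "C"), each = 3))

# All pairwise Wilcoxon-Mann-Whitney tests with Bonferroni-adjusted p-values
pairwise.wilcox.test(y, g, p.adjust.method = "bonferroni")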
I think a better post-hoc test for Kruskal-Wallis is the Dunn (1964) test, which retains the ranking of values from all the data, whereas pairwise WMW tests ignore the data that aren't in the pairwise comparison under consideration in that iteration.
Other appropriate post-hoc tests are Conover and Nemenyi.
It's pretty easy to conduct the Kruskal-Wallis test and any of these post-hoc tests in R. I have some examples here: https://rcompanion.org/handbook/F_08.html
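As a rough sketch of what that workflow looks like in R (made-up data, and assuming the FSA package is installed for dunnTest()):

# install.packages("FSA")   # if not already installed
library(FSA)

y <- c(2.1, 3.4, 1.9, 5.0, 4.2, 6.1, 7.3, 6.8, 5.5)
g <- factor(rep(c("A", "B", "C"), each = 3))

kruskal.test(y, g)                      # omnibus Kruskal-Wallis test
dunnTest(y, g, method = "bonferroni")   # Dunn (1964) post-hoc test with Bonferroni adjustment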
I think you can run all these examples at the following site without installing software: https://rdrr.io/snippets/ .
After a significant Kruskal-Wallis test, you can use the Dwass-Steel-Critchlow-Fligner pairwise comparisons test. You can easily implement it in the Jamovi software (https://www.jamovi.org/). Alternatively, you can use a series of post-hoc Mann-Whitney U tests with Bonferroni correction after a significant Kruskal-Wallis test.
I believe the Dwass-Steel-Critchlow-Fligner test, like pairwise Wilcoxon-Mann-Whitney tests, re-ranks the observations for each comparison. I think this would make it a less appropriate post-hoc test than those that retain the original ranking (and therefore in essence the original variance) used in the Kruskal-Wallis test.
See the answer by Alexis, here: https://stats.stackexchange.com/questions/71996/differences-between-dwass-steel-critchlow-fligner-and-mann-whitney-u-test-for-a
Sal Mangiafico, I checked the URL (https://rcompanion.org/handbook/F_08.html) you provided, and I wonder if there is some alternative to the cldList function; it just missed one of my treatments. Thank you
By default, the cldList() function removes spaces, equal signs, and zeros from comparisons and group names. This is to make it applicable to the varied output of different functions.
If it's dropping a group, it's probably the case that a group name is the same as another except for a 0 or maybe a space. You can change this behavior with e.g. remove.zero=FALSE.
The cldList() function is really just a wrapper for multcompView::multcompLetters(), so that's an alternative.
You can also just construct the compact letter display by hand.
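For what it's worth, here is a hedged sketch of both routes, assuming Dunn-test output in the form produced by FSA::dunnTest() (a Comparison column and an adjusted-p column named P.adj); the data are made up:

library(FSA)
library(rcompanion)
library(multcompView)

y <- c(2.1, 3.4, 1.9, 5.0, 4.2, 6.1, 7.3, 6.8, 5.5)
g <- factor(rep(c("A", "B", "C"), each = 3))

PT <- dunnTest(y, g, method = "bonferroni")$res   # columns: Comparison, Z, P.unadj, P.adj

# Route 1: cldList(), here keeping zeros in the group names
cldList(P.adj ~ Comparison, data = PT, remove.zero = FALSE)

# Route 2: multcompLetters() directly, on a named vector of adjusted p-values
p.named <- setNames(PT$P.adj, gsub(" - ", "-", PT$Comparison))
multcompLetters(p.named, threshold = 0.05)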
Issam S. Ismail asked about the options in SPSS. If one is using the NPTESTS command, there are 3 options, as shown in this excerpt from the command syntax:
COMPARE= PAIRWISE | STEPWISE | NONE. Multiple comparisons. The COMPARE keyword controls how and whether multiple comparisons should be performed. PAIRWISE produces all pairwise multiple comparisons. STEPWISE produces stepwise stepdown comparisons. NONE turns off multiple comparisons. By default, all pairwise comparisons are produced.
Unfortunately, that part of the documentation does not say what method is used to make the pairwise comparisons. I imagine one would have to consult the "algorithms" documentation to find that info. You can find that document by searching for "algorithms" here:
The "algorithms" documentation I pointed to above says this on p. 813 (emphasis added):
"The Kruskal-Wallis, Friedman and Kendall, and Cochran tests use the procedure proposed by Dunn (1964) (originally designed for the Kruskal-Wallis test)."
See also: https://www.ibm.com/support/pages/can-spss-perform-dunns-nonparametric-comparison-post-hoc-testing-after-kruskal-wallis-test
I think there is some confusion in the nomenclature.
A post-hoc test is a test comparing two groups, but one that uses information from a larger set of groups. This information is typically some kind of pooled variance estimate that must be available before any two-group comparison can be performed. Hence "post-hoc".
Notably, a post-hoc test does not control the family-wise error rate (FWER). If the FWER should be controlled, the p-values obtained from the post-hoc tests need to be corrected or adjusted for multiple testing. There are many methods available, some more and some less specialized, and each with its individual pros and cons in particular scenarios.
There are some post-hoc procedures combining a series of post-hoc tests and the control of the FWER among these tests. Examples are Tukey's and Dunnett's procedures.
Bonferroni is a method to adjust p-values (or alpha) to control the FWER, not a post-hoc test. It cares only about the number of p-values, not about where they come from.
Bonferroni might be applied to the p-values obtained from a series of Wilcoxon tests. But this is still not a post-hoc test, as each test uses only the information from the two groups being compared. The compared ranks are also different from the ranks used in the KW test.
Dunn's procedure is a post-hoc test on ranks, as it calculates "z"-statistics based on the standard deviation of the ranks (from the KW test). The p-values are taken from the standard normal distribution and are not corrected to control the FWER. This may be achieved by subsequently applying a method to adjust the p-values, like Bonferroni (or Holm, or Sidak, or other methods). The procedure of using Dunn's (z-)tests with Bonferroni-correction of the p-values is called "Dunn-Bonferroni". This is a post-hoc procedure including a control of the FWER.
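To make that concrete, here is a minimal hand-rolled sketch of Dunn's z for a single pair, ignoring the tie correction for simplicity (made-up data; in practice one would use a dedicated function such as FSA::dunnTest() rather than this):

# Hypothetical data and groups
y <- c(2.1, 3.4, 1.9, 5.0, 4.2, 6.1, 7.3, 6.8, 5.5)
g <- factor(rep(c("A", "B", "C"), each = 3))

r    <- rank(y)              # ranks over ALL observations, as in the KW test
N    <- length(y)
Rbar <- tapply(r, g, mean)   # mean rank per group
n    <- tapply(r, g, length)

# Dunn's z for the comparison A vs C (tie correction omitted)
sigma <- sqrt((N * (N + 1) / 12) * (1 / n["A"] + 1 / n["C"]))
z     <- (Rbar["A"] - Rbar["C"]) / sigma
p     <- 2 * pnorm(-abs(z))                  # unadjusted p from the standard normal

# "Dunn-Bonferroni": adjust for the 3 pairwise comparisons
p.adjust(p, method = "bonferroni", n = 3)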
Jochen Wilhelm, I think that confusion also arises when people talk about "Dunn's test" without specifying whether they mean Dunn (1961) or Dunn (1964). ;-) FWIW, I believe that "Dunn-Bonferroni" is typically used to describe Dunn's (1961) method for making comparisons among means. See these pages, for example:
By the way, I also need to think a bit more about the way you characterized "post hoc" tests as using information from a larger set of groups. A couple of situations come to mind that make me question that characterization. E.g., suppose I conduct an experiment with k = 4 groups, and I plan to carry out three "planned orthogonal contrasts". The (modified) t-tests for those three contrasts all use MSerror from the ANOVA table in the denominator--i.e., they use information from the larger set of groups. Are you suggesting that they are post hoc tests, therefore? Thanks for clarifying.
Dear Bruce Weaver , thank you for pointing out the ambiguity of "Dunn's test".
To your question: yes, I'd say so :) I have learned this definition but I can't remember the source :( It made immediate sense to me, so I was never inclined to keep the source in mind to justify the obvious...
Thanks for clarifying, Jochen Wilhelm. The distinction I learned, which I believe is pretty typical in stats books for psychology, is the one David Howell made in his popular textbook. The attached image shows one version of it. The entire chapter can be viewed here:
Thank you. I know this, but this never made much sense other than w.r.t. multiplicity correction. I'd accept that there are competing definitions of the term "post hoc tests", but I still find the one you cited from Howell quite confusing and less helpful.
Fair enough, Jochen Wilhelm. I find your definition (very) confusing too. ;-)
Here is another resource that makes the same distinction Howell did, and AFAIK, GraphPad Prism is far more popular with folks in medical sciences than in psychology.
I have a couple of problems (particularly with Prism, but that's a different topic) here. First of all, "The terminology is not always used consistently." - that's what all the fuss is about. Life would be much simpler if scientists would at least use and stick to proper (unambiguous) definitions. But well, that's life.
"Multiple comparison test applies whenever you make several comparisons at once" just highlights one of the linguistic problems. One does not do several comparisons "at once". Each comparison is done on its own, and if not using parallel processing, they are done subsequently. And it does not even matter if minutes or years go by in-between.
"Post-hoc test is used for situations where you can decide which comparisons you want to make after looking at the data." - This is in-line with "my" definition. I am sure that "my" statement is simplified and rephrased into to this to make it sound less "technical". No applied scientists is interested in these technicalities of using pooled estimates and such nerdy stuff. Sadly, as so often in statistics, making a statement seemingly more simple, more "everyday language" takes away important information and opens doors for misconceptions and misinterpretations.
There is a wild conglomeration of doing multiple tests and controlling a FWER. The two issues are nowhere clearly separated, and the page linked via "When I do planned comparisons after one-way ANOVA, do I need to correct for multiple comparisons?" (https://www.graphpad.com/support/faqid/1092/) adds further confusion.
In my field of research, there is often no need to control the FWER although multiple tests ("planned comparisons") are done - but these do not constitute a "family of tests" for which controlling an FWER would make sense. For instance, there are a couple of comparisons done only to show that the particular treatments all give the expected results. Then there is one intervention that is to be compared against one of the controls, and then there are some counter-interventions, further controls so to say, that should all show an alleviation of the intervention effect. It would be stupid not to use the pooled variance for all these tests, simply because it is much more robust (given the typically tiny sample sizes per group). On rare other occasions, a series of possible alternatives should be tested to find which (if any) of them works. For instance, it is investigated which of several different receptor sub-types is required to transmit a treatment signal. This is a family of tests, and it requires controlling the FWER. This is actually also the case when collecting data on individual sub-types from different research groups over the past 10 years (in which case using a pooled variance estimate may not be possible).
The last sentence, "When comparisons are not orthogonal, you gain power by accounting for the overlap. That is why the Tukey, Dunnett, Newman-Keuls tests (etc) are different than a plain t test.", I did not understand at all :(
Bruce Weaver, honestly, I find what you say authoritative. And I do see that you are correct that this is the common definition of post-hoc tests. And I am still sure that this definition is doing more harm than good.
Most sources I know start with "after the ANOVA was significant...", which is already nonsense, because in most cases where group means really ought to be compared the result of the ANOVA F-test is not at all interesting, except in a very few exotic cases where an ANOVA-protected method for multiplicity correction is used. If I use Tukey's or Dunnett's procedure, I don't care about the significance of the ANOVA F.
I also find the distinction between planned and unplanned tests confusing at best.
If I have 5 treatments that should be compared against a control, then these are all planned comparisons, right? What about the multiplicity problem then? What if these treatments are 2 different kinds of negative control treatments, 2 different kinds of positive control treatments and one new treatment?
What if the 5 groups are different time-points during an ontogenic process, carefully selected (e.g. addressing developmental stages of an insect, from egg through larva to imago)? Here, essentially all possible pairwise comparisons are of interest; the experiment was designed to make all these comparisons. So these are planned comparisons. What about multiplicity correction?
Whether the comparisons had been planned or not is completely uninteresting. What is more important is whether the comparisons are done to screen across groups and to answer the question whether there is at least one difference of which the direction may be interpreted with the desired confidence. This is what defines the family of tests and gives rise to a FWER that may be controlled. Maybe only a subset of all pairwise comparisons is part of a family, there can even be several different families arising from a single experiment, and there can be additional individual tests that do not belong to any family. This is all much more relevant - in my opinion, of course - than talking about "planned" or "unplanned" comparisons. Btw, I don't know any researcher who would say that any comparison he/she made was unplanned. Never ever. Who would be so stupid (to do experiments, possibly kill animals, collect data, invest time, sweat and money without having planned to use all the data anyway)?
Hi Jochen Wilhelm. I completely agree with one of your statements in that last post, and completely disagree with another! ;-)
Here is the one I agree with completely. You said:
If I use Tukey's or Dunnett's procedure, I don't care about the significance of the ANOVA F.
Agreed! Only one of the multiple comparison procedures (MCPs) commonly used in psychology and related fields requires a significant omnibus F-test first, viz., Fisher's LSD. For most or all of the others, the only reason for estimating the ANOVA model is to get the MSerror. And before anyone objects, I would recommend Fisher's LSD only when one has k = 3 groups. ;-)
Levin, J. R., Serlin, R. C., & Seaman, M. A. (1994). A controlled, powerful multiple-comparison strategy for several situations. Psychological Bulletin, 115(1), 153–159. https://doi.org/10.1037/0033-2909.115.1.153
Later, you wrote: Whether the comparisons had been planned or not is completely uninteresting.
I do not agree with you here. Whether the comparisons are specified in advance (i.e., before seeing the data) or not affects the need to correct for multiplicity, IMO. I find David Howell's example about this issue quite compelling. It starts on page 5 of the supplementary material for his textbook under A priori vs post hoc comparisons:
Waiting until one has seen the results before choosing which comparison(s) to make seems to me to be similar to HARKing--i.e., hypothesizing after the results are known. But as I write this, I am mindful of the important distinction between SHARKing and THARKing (i.e., secret and transparent HARKing). So it may be that we are not quite so far apart as it might appear? I'm certain we agree, for example, that transparent reporting of what one has done is very important.
Hollenbeck, J. R., & Wright, P. M. (2017). Harking, Sharking, and Tharking: Making the Case for Post Hoc Analysis of Scientific Data. Journal of Management, 43(1), 5-18. https://doi.org/10.1177/0149206316679487
Finally, I hope I have managed to disagree without being (too) disagreeable. :-)
If I plan to compare k treatments against a control to find out if I can claim that at least one treatment is effective, then I would have to take care of multiplicity - although the comparisons had all been planned.
Let me give an example: a bacterial substance is sensed by immune cells via a pattern recognition receptor, resulting in interleukin production. The signal could be sensed either at the surface by one or more of 10 different TLRs or intracellularly by one or more of 23 different NLRs. Given that I have cells with single knock-outs of each of these receptors, I prepare 34 experiments (1 control + 10 TLR knock-outs + 23 NLR knock-outs) in which I treat the cells with the substance and measure the interleukin production. The data (the log concentrations) are compared by Dunnett's procedure; all comparisons to the control are planned. But I clearly need to take care of multiplicity here.
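For reference, a hedged sketch of how such many-to-one (Dunnett) comparisons might be run in R with the multcomp package; the data and the three knock-out labels here are made up, standing in for the 33 knock-outs and the log interleukin concentrations:

library(multcomp)

# Hypothetical data: only 3 knock-outs instead of 33, simulated response
set.seed(1)
d <- data.frame(ko    = gl(4, 6, labels = c("control", "TLR2", "TLR4", "NLRP3")),
                logIL = rnorm(24))

fit <- aov(logIL ~ ko, data = d)

# Dunnett's procedure: each knock-out vs. the control,
# with the FWER controlled jointly over all comparisons to the control
summary(glht(fit, linfct = mcp(ko = "Dunnett")))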
Now let's consider that we identified the receptor and want to shed light on the signaling pathway, which could either go via A -> B -> C or via X -> Y -> Z. There are possibilities to block each of these steps that will work more or less well or not at all. So we plan 7 experiments including a control and all 6 possibly effective treatments. Again, Dunnett's procedure is applied, but here multiplicity is relevant only between the two series, not within the sets of 3 comparisons, and not at all across all 6 comparisons.
I agree that doing a test on the two groups that happen to show the largest difference implicitly includes a screening across all possible pairwise comparisons (this is how the pair to be tested was identified!), and the question asked is essentially whether at least one difference may be claimed statistically significant. Of course, if there is at least one such case, then this must be the most extreme case. Doing this (looking at the data to find which tests to do) means considering all pairwise comparisons as a homogeneous family of tests, and any test out of this family may and should hold the corresponding FWER.
Disagreement is fine. It's important. Only disagreement gets us further, given it's backed by arguments ;)
Jochen Wilhelm, again, I agree completely with your first paragraph above:
If I plan to compare k treatments against a control to find out if I can claim that at least one treatment is effective, then I would have to take care of multiplicity - although the comparisons had all been planned.
I too would use Dunnett's test in that case.
For me, the 3 main factors that determine whether some kind of multiplicity adjustment is needed are: 1) Whether the comparisons were planned in advance or not; 2) Whether they are orthogonal or not; and 3) How many comparisons there are. (These are not completely independent of each other, as the maximum number of orthogonal contrasts in a set is k-1, where k = the number of groups.)
As I see it, there is no need to adjust for multiplicity if all comparisons are both planned in advance and orthogonal. If they are not orthogonal but still planned in advance, theoretically relevant, and small in number (e.g., k-1 or less), I could perhaps still live with no multiplicity adjustment. Fisher's LSD with k = 3 groups is another situation where I would make no adjustment for multiplicity. But other than those specific situations, I would likely want some kind of adjustment.
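To illustrate the first of those situations, here is a sketch of k-1 planned, mutually orthogonal contrasts tested against the pooled MSerror with no multiplicity adjustment (made-up data; the multcomp package is used here only for convenience):

library(multcomp)

# Hypothetical data: k = 4 groups
set.seed(2)
d <- data.frame(g = gl(4, 8, labels = c("ctrl", "t1", "t2", "t3")),
                y = rnorm(32))

# Three planned, mutually orthogonal contrasts (rows; columns correspond to the group means)
K <- rbind("ctrl vs treatments" = c( 3, -1, -1, -1),
           "t1 vs t2 + t3"      = c( 0,  2, -1, -1),
           "t2 vs t3"           = c( 0,  0,  1, -1))

fit <- aov(y ~ g, data = d)

# Each contrast tested against the pooled MSerror, with no multiplicity adjustment
summary(glht(fit, linfct = mcp(g = K)), test = adjusted("none"))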
But another factor, according to some authors, is where the research falls on the exploratory-to-confirmatory spectrum. See Bender & Lange (2001), for example, who say this on p. 344:
On the other hand, in exploratory studies, in which data are collected with an objective but not with a prespecified key hypothesis, multiple test adjustments are not strictly required. Other investigators hold an opposite position that multiplicity corrections should be performed in exploratory studies [7]. We agree that the multiplicity problem in exploratory studies is huge. However, the use of multiple test procedures does not solve the problem of making valid statistical inference for hypotheses that were generated by the data.
Bender, R., & Lange, S. (2001). Adjusting for multiple testing—when and how? Journal of Clinical Epidemiology, 54(4), 343-349.
This discussion reminds me of a statement that I used to see (years ago) in the signature file of a regular poster to the old sci.stat.* newsgroups. It went something like this:
"All too often, the analysis of data requires thought." ;-)
Yes, I also think that it is hard to have orthogonal contrasts making up a family over which one screens, so there should rarely/never be a need for any multiplicity adjustment.
Exploratory research opens a completely new world. I don't think it is relevant here, because I think that significance tests make no sense at all in exploratory research (which is meant to generate hypotheses, not to test them). And I know that this point of view is again not very common. Most exploratory studies show p-values, and reviewers and editors would request them. I think we should stay with experiments designed around prespecified hypotheses here.
A post-hoc test for Kruskal-Wallis is Dunn's test. You can then apply multiple comparison adjustments to the Dunn's test p-values, like Bonferroni, Sidak, Holm, and Benjamini-Hochberg. So the appropriate post-hoc test to follow a Kruskal-Wallis test is Dunn's test, not Mann-Whitney-Wilcoxon.