Hello,

From the messages I often receive, I realise that researchers are sometimes surprised that the Mann-Whitney (aka Wilcoxon rank-sum), Kruskal-Wallis, and a few other rank-based tests do not, in general, compare medians. They do so only when a strong IID-type condition is met, i.e. when the group distributions differ by nothing but a location shift.

This misconception is widespread: many textbooks, free and paid courses, and even academic lecturers repeat it. Analysts may then be surprised that, for exactly equal means or medians, the test can return a p-value < 0.0...01 (even for quite small samples), or that, for very different means or medians, it can return a p-value > 0.999, while methods such as the Brown-Mood median test or quantile regression (under a variety of methods for obtaining standard errors) lead to very different findings. Visual assessments (e.g. box plots) do not support the respective claims either.

Briefly: in general, if the group dispersions and shapes are not comparable, the empirical CDFs can differ in more ways than just location. In other words, if a difference is found, it cannot be attributed solely to a difference in, say, medians. This is because these tests are sensitive to stochastic superiority (dominance).
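To make this concrete, here is a minimal Python sketch (my own toy construction, assuming NumPy and SciPy are available; it is not taken from any of the references below). It builds two deterministic samples that are mirror images of each other: one right-skewed, one left-skewed, both with a sample median of exactly 0. The variable names and the sample size of 501 are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Toy data (an assumption for illustration, not real measurements):
# evenly spaced quantiles of an Exp(1) distribution, so no random seed is needed.
n = 501                               # odd, so the median is a single observation
probs = (np.arange(n) + 0.5) / n
q = -np.log1p(-probs)                 # Exp(1) quantiles, in ascending order

x = q - q[n // 2]                     # right-skewed sample, median exactly 0
y = 0.0 - x                           # mirror image: left-skewed, median exactly 0

result = mannwhitneyu(x, y, alternative="two-sided")

# Empirical probability of superiority, P(X > Y) + 0.5 * P(X = Y),
# which is the quantity the test actually responds to.
greater = np.mean(x[:, None] > y[None, :])
ties = np.mean(x[:, None] == y[None, :])
p_superiority = greater + 0.5 * ties

print("median x:", np.median(x), " median y:", np.median(y))
print("Mann-Whitney p-value:", result.pvalue)
print("P(X > Y) estimate:", p_superiority)
```

Both medians are identical by construction, yet the estimated P(X > Y) is clearly away from 0.5 and the test rejects. The means of the two samples do differ here, so the sketch only illustrates the "equal medians" half of the story.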

There are several articles confirming that these tests fail as tests of medians*, but only a few books explain it. So, in case you need them, let me cite a few, with the most important one opening the list:

  • Brunner, E., Bathke, A. C., & Konietschke, F. (2018). Rank and pseudo-rank procedures for independent observations in factorial designs: Using R and SAS. Springer.
  • Hettmansperger, T. P., & McKean, J. W. (2010). Robust nonparametric statistical methods (2nd ed.). CRC Press. (chapter 2.4: Inference based on the Mann-Whitney-Wilcoxon)
  • Thas, O. (2010). Comparing distributions. Springer. (chapter 9.3.3.1: Implied Null Hypothesis)
  • Wilcox, R. R. (2021). Introduction to robust estimation and hypothesis testing (5th ed.). Academic Press. (chapter 5.7: Methods Based on Ranks and the Typical Difference)

Plus the original paper of Mann and Whitney:

  • Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1), 50–60. https://doi.org/10.1214/aoms/1177730491

If you know of more such books, please kindly add them to this thread, so that others can use them.

Added resources:

  • Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon-Mann-Whitney test under scrutiny. Statistics in Medicine, 28(10), 1487–1497. https://doi.org/10.1002/sim.3561 (PMID: 19247980) [ https://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/paranp?action=AttachFile&do=get&target=MannW.pdf ] proposed by @Bruce Weaver

----

* the mentioned articles and discussions:

- Divine, G. W., Norton, H. J., Barón, A. E., & Juarez-Colunga, E. (2018). The Wilcoxon–Mann–Whitney Procedure Fails as a Test of Medians. The American Statistician, 72(3), 278–286. https://doi.org/10.1080/00031305.2017.1305291

- Conroy, R. M. (2012). What hypotheses do “nonparametric” two-group tests actually test? The Stata Journal, 12(2), 182–190. https://doi.org/10.1177/1536867X1201200202

- Hart, A. (2001). Mann-Whitney test is not just a test of medians: differences in spread can be important. BMJ (Clinical research ed.), 323(7309), 391–393. https://doi.org/10.1136/bmj.323.7309.391

- Kleinman K, Example 2014.6: Comparing medians and the Wilcoxon rank-sum test [ http://proc-x.com/2014/06/example-2014-6-comparing-medians-and-the-wilcoxon-rank-sum-test/ ]

- https://stats.stackexchange.com/questions/363335/wilcoxon-signed-rank-test-null-hypothesis-statement

Plus some toy figures from my various presentations.
