Hello,
From the messages I often receive, I realise that researchers are sometimes surprised that the Mann-Whitney (aka Wilcoxon), Kruskal-Wallis, and a few other rank-based tests do not compare medians in general. They do so only when the strong IID condition is met, i.e. in the location-shift case, where the distributions differ by nothing but a shift.
This misconception is widespread: many textbooks, free and paid courses, and even academic lecturers repeat it. Analysts may then be surprised that, for exactly equal means or medians, the test can return a p-value < 0.0...01 (even for quite small samples), or that, for very different means or medians, the p-value can be > 0.999. Meanwhile, methods like Brown-Mood or quantile regression (under a variety of methods for obtaining standard errors) give very different findings, and visual assessments (e.g. box-plots) do not support the respective claims either.
Briefly: in general, if the group dispersions and shapes are not comparable, the empirical CDFs can differ in more ways than just location. In other words, if a difference is found, it cannot be attributed solely to a difference in, say, medians. This is because these tests are sensitive to stochastic superiority (dominance), that is, to how often a value from one group exceeds a value from the other.
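To make this concrete, here is a minimal sketch in Python (my choice for illustration, assuming NumPy and SciPy are available). The data are artificial and chosen by me, not taken from any of the cited papers: one group is right-skewed, the other is its mirror image, and both are shifted so that their sample medians are exactly 0. The Mann-Whitney test is compared with SciPy's `median_test`, which is Mood's median test (the Brown-Mood test mentioned above).

```python
import numpy as np
from scipy.stats import mannwhitneyu, median_test

# Artificial, deterministic data (an assumption made for illustration only).
n = 501                                # odd, so the median is a single observation
p = (np.arange(1, n + 1) - 0.5) / n    # evenly spaced probabilities in (0, 1)
x = -np.log(1.0 - p)                   # exponential quantiles: right-skewed
x = x - np.median(x)                   # shift so the sample median is exactly 0
y = -x                                 # mirror image: left-skewed, median also 0

# Rank-based test: responds to stochastic superiority, not to medians as such.
res = mannwhitneyu(x, y, alternative="two-sided")
prob_superiority = res.statistic / (len(x) * len(y))   # estimate of P(X > Y)

# Mood's median test: compares the medians directly.
stat_med, p_med, grand_median, table = median_test(x, y)

print("median x:", np.median(x), " median y:", np.median(y))
print("Mann-Whitney p-value:", res.pvalue)
print("estimated P(X > Y):", prob_superiority)
print("median test p-value:", p_med)
```

The medians are identical by construction, yet the estimated P(X > Y) is clearly away from 0.5, so the Mann-Whitney test rejects while the median test finds nothing. The two groups have the same spread here; only their shapes differ.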
There are several articles confirming that these tests fail as tests of medians*, but only a few books explain it. So, in case you need them, let me cite a few, with the most important one opening the list:
Plus the original paper of Mann and Whitney:
If you know more such books, please kindly add them to this thread, so that others can use them.
Added resources:
----
* the mentioned articles and discussions:
- Divine, G. W., Norton, H. J., Barón, A. E., & Juarez-Colunga, E. (2018). The Wilcoxon–Mann–Whitney Procedure Fails as a Test of Medians. The American Statistician, 72(3), 278–286. https://doi.org/10.1080/00031305.2017.1305291
- Conroy, R. M. (2012). What hypotheses do “nonparametric” two-group tests actually test? The Stata Journal, 12(2), 182–190. https://doi.org/10.1177/1536867X1201200202
- Hart, A. (2001). Mann-Whitney test is not just a test of medians: differences in spread can be important. BMJ (Clinical research ed.), 323(7309), 391–393. https://doi.org/10.1136/bmj.323.7309.391
- Kleinman, K. (2014). Example 2014.6: Comparing medians and the Wilcoxon rank-sum test. http://proc-x.com/2014/06/example-2014-6-comparing-medians-and-the-wilcoxon-rank-sum-test/
- https://stats.stackexchange.com/questions/363335/wilcoxon-signed-rank-test-null-hypothesis-statement
Plus some toy figures from my various presentations.