Mann-Whitney (Wilcoxon) rank test: the null hypothesis (not about medians) - in case you needed the references

01 January 1970

Hello,

From the messages I often receive, I realise that researchers are sometimes surprised that the Mann-Whitney (aka Wilcoxon), Kruskal-Wallis, and a few other rank-based tests do not, in general, compare medians unless the strong IID condition is met, i.e. unless it is the location-shift case.

This misconception is widespread: many textbooks, free and paid courses, and even academic lecturers repeat it. Analysts may then be surprised that, for exactly equal means or medians, the test can return a p-value < 0.0...01 (even for quite small samples), or that, for very different means or medians, the p-value can be > 0.999, while methods like Brown-Mood or quantile regression (under a variety of methods for obtaining standard errors) give very different findings. Visual assessments (e.g. box-plots) do not support the respective claims either.

In brief: if the groups' dispersions and shapes are not comparable, the empirical CDFs can differ in more ways than by location alone. In other words, if a difference is found, it cannot be attributed solely to a difference in, say, medians. This is because these tests are sensitive to stochastic superiority (dominance), i.e. to whether a random observation from one group tends to exceed a random observation from the other.
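To make the point about stochastic superiority concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the data are synthetic (a right-skewed sample and its mirror image, both constructed to have a median of exactly 0), the helper name `mann_whitney_normal_approx` is mine, and the p-value uses the large-sample normal approximation without tie or continuity correction. For real work you would use a library routine such as `scipy.stats.mannwhitneyu`.

```python
import math
import statistics


def mann_whitney_normal_approx(x, y):
    """Return (U, p_hat, two-sided p-value).

    Assumption: large-sample normal approximation, with no tie or
    continuity correction (a simplification for illustration only).
    """
    n, m = len(x), len(y)
    # U counts the pairs in which x exceeds y; a tie counts as one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    p_hat = u / (n * m)  # estimates P(X > Y) + 0.5 * P(X = Y)
    z = (u - n * m / 2) / math.sqrt(n * m * (n + m + 1) / 12)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, p_hat, p_value


# Synthetic data: exponential quantiles shifted so the median is exactly 0.
n = 301
raw = [-math.log(1 - i / (n + 1)) for i in range(1, n + 1)]
centre = statistics.median(raw)
x = [v - centre for v in raw]  # right-skewed, median 0
y = [-v for v in x]            # mirror image: left-skewed, median 0

u, p_hat, p_value = mann_whitney_normal_approx(x, y)
print("medians:", statistics.median(x), statistics.median(y))
print("estimated P(X > Y):", round(p_hat, 3))
print("two-sided p-value:", p_value)
```

Both samples have the same median, yet the estimated P(X > Y) is clearly above 0.5 and the test rejects. The difference detected is one of shape, not of medians.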

There are several articles confirming that these tests fail as tests of medians*, but only a few books explain it. So, in case you need them, let me cite a few, with the most important one opening the list:

Brunner, E., Bathke, A. C., & Konietschke, F. (2018). Rank and pseudo-rank procedures for independent observations in factorial designs: Using R and SAS. Springer.
Hettmansperger, T. P., & McKean, J. W. (2010). Robust nonparametric statistical methods (2nd ed.). CRC Press. (chapter 2.4: Inference based on the Mann-Whitney-Wilcoxon)
Thas, O. (2010). Comparing distributions. Springer. (chapter 9.3.3.1: Implied Null Hypothesis)
Wilcox, R. R. (2021). Introduction to robust estimation and hypothesis testing (5th ed.). Academic Press. (chapter 5.7 Methods Based on Ranks and the Typical Difference)

Plus the original paper of Mann and Whitney:

Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1), 50–60. https://doi.org/10.1214/aoms/1177730491

If you know of more such books, please add them to this thread so that others can use them.

Added resources:

Fagerland MW, Sandvik L. The Wilcoxon-Mann-Whitney test under scrutiny. Stat Med. 2009 May 1;28(10):1487-97. doi: 10.1002/sim.3561. PMID: 19247980. [ https://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/paranp?action=AttachFile&do=get&target=MannW.pdf ] proposed by @Bruce Weaver

----

* the mentioned articles and discussions:

- Divine, G. W., Norton, H. J., Barón, A. E., & Juarez-Colunga, E. (2018). The Wilcoxon–Mann–Whitney Procedure Fails as a Test of Medians. The American Statistician, 72(3), 278–286. https://doi.org/10.1080/00031305.2017.1305291

- Conroy, R. M. (2012). What hypotheses do “nonparametric” two-group tests actually test? The Stata Journal, 12(2), 182–190. https://doi.org/10.1177/1536867X1201200202

- Hart, A. (2001). Mann-Whitney test is not just a test of medians: differences in spread can be important. BMJ (Clinical research ed.), 323(7309), 391–393. https://doi.org/10.1136/bmj.323.7309.391

- Kleinman K, Example 2014.6: Comparing medians and the Wilcoxon rank-sum test [ http://proc-x.com/2014/06/example-2014-6-comparing-medians-and-the-wilcoxon-rank-sum-test/ ]

- https://stats.stackexchange.com/questions/363335/wilcoxon-signed-rank-test-null-hypothesis-statement

Plus some toy figures from my various presentations.

Bruce Weaver

Hello Adrian Olszewski. Here is another article you may wish to add to your list:

Fagerland MW, Sandvik L. The Wilcoxon-Mann-Whitney test under scrutiny. Stat Med. 2009 May 1;28(10):1487-97. doi: 10.1002/sim.3561. PMID: 19247980.

From the top of p. 1488 (with emphasis added):

"Although not correct in general, there are situations where P(X

Adrian Olszewski

Thank you very much for this paper, Bruce Weaver!

Yes, I described it slightly differently: "[...]unless strong IID condition is met, i.e. unless it is the location-shift case". I have seen this described in a variety of ways: "pure shift", "location-shift" (there was one more name as well, but I cannot recall it right now)...

The paper you cited is just excellent!

The conclusions:

"In summary, the large sample approximate WMW test can be a poor method for comparing the means or medians of two populations, unless the two distributions have equal shapes and equal scales. This problem also seems to apply in various degrees to the exact WMW test, the FP test, the BM test, and the Welch U test on ranks. [...]"

agree with those in the Brunner, E. et al. "Rank and pseudo-rank procedures for independent observations in factorial designs: Using R and SAS" (attached as a screenshot).

I also found the PDF: https://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/paranp?action=AttachFile&do=get&target=MannW.pdf

Sal Mangiafico

The Quantitude podcast, which is usually pretty good, just had an episode on nonparametric tests, and repeatedly said that Mann-Whitney et al. is a test of medians. :(

A simple way to show this is just with an example. The Handbook of Biological Statistics has a three-group example with equal means and medians, but a significant Kruskal-Wallis test. https://www.biostathandbook.com/kruskalwallis.html
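For anyone who wants to reproduce the idea without the Handbook's data, here is a minimal sketch in plain Python. The assumptions: the three groups are synthetic (not the Handbook's example) and share a median of exactly 0 while differing in shape, and their means are not equal here. The helper names `midranks` and `kruskal_wallis` are mine. The p-value uses the chi-square approximation without tie correction, evaluated only for the three-group case (2 degrees of freedom). In practice a routine such as `scipy.stats.kruskal` would be the usual choice.

```python
import math
import statistics


def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H, p-value). Assumption: chi-square approximation, no tie correction."""
    if len(groups) != 3:
        raise ValueError("this sketch only evaluates the p-value for 3 groups")
    pooled = [v for g in groups for v in g]
    ranks = midranks(pooled)
    total = len(pooled)
    acc, offset = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[offset:offset + len(g)])
        acc += rank_sum ** 2 / len(g)
        offset += len(g)
    h = 12 / (total * (total + 1)) * acc - 3 * (total + 1)
    return h, math.exp(-h / 2)  # chi-square survival function for 2 df


# Synthetic groups, each with a median of exactly 0.
n = 301
raw = [-math.log(1 - i / (n + 1)) for i in range(1, n + 1)]
centre = statistics.median(raw)
right_skewed = [v - centre for v in raw]
left_skewed = [-v for v in right_skewed]
symmetric = [(i - (n + 1) // 2) / 100 for i in range(1, n + 1)]

h, p_value = kruskal_wallis([right_skewed, left_skewed, symmetric])
print("medians:", [statistics.median(g) for g in (right_skewed, left_skewed, symmetric)])
print("H:", round(h, 2), "p-value:", p_value)
```

All three medians are identical, yet the Kruskal-Wallis statistic is large, because the mean ranks of the groups differ.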

Sal Mangiafico, have you used the "Ask a Q!" tab to direct the Quantitude co-hosts to this thread?

https://quantitudepod.org/ask/

Bruce Weaver, no, I didn't find that. I made a comment on YouTube, but who knows if they see that... I should listen to the episode more carefully and send them an intelligent email... *sigh*...
