Can you recommend additional books addressing the "a-test-of-medians" fallacy for the Mann-Whitney (Wilcoxon) and Kruskal-Wallis?

17 March 2025

Hello,

I'm preparing a webinar on several common statistical misconceptions in clinical trials that I have encountered many times. I'm now collecting good resources that clear up these misconceptions (and running my own simulations to illustrate them).

One of the most common misconceptions I have seen in textbooks, presentations, discussions, etc. is that "both Mann-Whitney (-Wilcoxon) and Kruskal-Wallis compare medians", which is often stated:

a) without any additional conditions, which is wrong in general and easy to disprove just by example (or less easy, formally, as in the 1st book cited below)

b) as a "location-shift" problem, which doesn't translate to medians easily without additional conditions of IID samples or symmetry around the medians. This is a very strict condition and practically a "zombie" one, meaning that it is rarely (if ever) checked in practice, as far as I have observed over the years. As a result, researchers are sometimes surprised to learn that:

1) stochastic equality (even at a very high p-value) was claimed at very different means or medians

2) stochastic superiority (even at a very low p-value) was claimed at exactly the same means or medians. They simply forgot to check the variances and shapes of the distributions. They also learned that "happy juggling with tests", as I call it, when the normality assumption is violated may lead to testing a different hypothesis, consistent or inconsistent with the original question. In other words, it's not impossible to obtain a technically valid answer to a question that was never asked.
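Case 2) can be illustrated with a small simulation. The following is only a sketch under my own assumptions: the two shifted exponential distributions, the sample size, the seed, and the helper name `mann_whitney` are all illustrative choices, and the p-value uses the large-sample normal approximation without a tie correction. In practice one would more likely call a library routine such as `scipy.stats.mannwhitneyu`.

```python
import bisect
import math
import random
import statistics


def mann_whitney(x, y):
    """Return (p_hat, z, p_value) from the large-sample normal approximation.

    p_hat estimates P(X < Y) + 0.5 * P(X = Y). No tie correction is applied,
    which is acceptable for continuous data.
    """
    n, m = len(x), len(y)
    xs = sorted(x)
    u = 0.0
    for value in y:
        below = bisect.bisect_left(xs, value)      # x strictly below value
        at_or_below = bisect.bisect_right(xs, value)
        u += below + 0.5 * (at_or_below - below)   # ties count one half
    p_hat = u / (n * m)
    se = math.sqrt((n + m + 1) / (12.0 * n * m))   # SD of U/(nm) when F = G
    z = (p_hat - 0.5) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided
    return p_hat, z, p_value


# Illustrative assumption: mirror-image skewed samples, both with median 0.
rng = random.Random(2025)
n = 2000
shift = math.log(2.0)  # median of a unit-rate exponential
x = [rng.expovariate(1.0) - shift for _ in range(n)]  # right-skewed
y = [shift - rng.expovariate(1.0) for _ in range(n)]  # left-skewed

p_hat, z, p_value = mann_whitney(x, y)
median_diff = statistics.median(y) - statistics.median(x)
mean_diff = statistics.fmean(y) - statistics.fmean(x)

print(f"median(y) - median(x) = {median_diff:.3f}")
print(f"mean(y) - mean(x)     = {mean_diff:.3f}")
print(f"estimated P(X < Y)    = {p_hat:.3f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2e}")
```

For this particular pair of distributions the population medians are both 0, while P(X < Y) is about 0.40 in theory, so the test rejects at this sample size even though the medians coincide. Note that the means differ in this construction; it only demonstrates the equal-medians half of the claim.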

Funnily enough, the original papers by Wilcoxon ("Individual Comparisons by Ranking Methods") and by Mann and Whitney ("On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other") don't refer to medians.

So far I have found a few excellent books explaining this very issue, and a couple of articles, software manuals, and forum discussions:

Books:

1. Brunner, E., Bathke, A. C., & Konietschke, F. (2018). Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs: Using R and SAS. Springer International Publishing.

ISBN: 978-3-030-02912-8 (Print), 978-3-030-02914-2 (eBook)
DOI: 10.1007/978-3-030-02914-2

This book shows step by step why the MW(W), Fligner-Policello and Brunner-Munzel tests are not consistent for detecting differences in medians or means.

2. Nussbaum, E. Michael. (2024). Categorical and Nonparametric Data Analysis: Choosing the Best Statistical Technique (2nd ed.). Routledge.

ISBN: 978-0-367-69815-7 (Paperback)

3. Cook, Thomas D., & DeMets, David L. (2007). Introduction to Statistical Methods for Clinical Trials. Chapman and Hall/CRC.

ISBN: 978-1-58488-027-1

Links:

1. K. Barbe, Statistical Methods I: Nonparametric statistical inference, Master Mathematics Vrije Universiteit Brussel and Universiteit Antwerpen, [2.1 Walsh Average and Wilcoxon rank test] (a copy of the PDF from the Wayback Machine), https://web.archive.org/web/20210524064522/http:/homepages.vub.ac.be/~kbarbe/StatMet1.pdf

2. The Wilcoxon–Mann–Whitney Procedure Fails as a Test of Medians

3. What hypotheses do “nonparametric” two-group tests actually test?

- The Mann-Whitney test doesn't really compare medians

https://www.graphpad.com/guides/prism/latest/statistics/stat_nonparametric_tests_dont_compa.htm

- Example 2014.6: Comparing medians and the Wilcoxon rank-sum test

http://proc-x.com/2014/06/example-2014-6-comparing-medians-and-the-wilcoxon-rank-sum-test/

- Mann-Whitney test is not just a test of medians: differences in spread can be important

https://edisciplinas.usp.br/pluginfile.php/1065042/mod_resource/content/1/Mann%C2%ADWhitney%20test%20is%20not%20just%20a%20test.pdf

- FAQ: Why is the Mann-Whitney significant when the medians are equal?

https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-why-is-the-mann-whitney-significant-when-the-medians-are-equal/

- Wilcoxon signed-rank test null hypothesis statement

https://stats.stackexchange.com/questions/363335/wilcoxon-signed-rank-test-null-hypothesis-statement

- Yoon-Jae Whang, Econometric Analysis of Stochastic Dominance. Concepts, Methods, Tools, and Applications, ISBN: 9781108602204. [page 64: Test of stochastic dominance: Basic results]

Could you, please, recommend any other titles (books, papers) that cover this very topic in any way, less or more formal?

Rainer Duesing

I do not immediately have any other resources specifically on this topic, but some time ago I simulated data in R to show that you may get a significant U test while the median difference and the mean difference have opposite signs. Just contact me if you are interested.

P.S.: Ronán Michael Conroy, one of the authors of your cited article, is also here on RG; maybe he has some ideas.
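A scenario like the one described in this reply can be sketched in Python as well. This is not the R simulation mentioned above, only a hypothetical reconstruction: the distributions, the 0.15 shift, the sample size, and the seed are my own assumptions. It uses the Wilcoxon rank-sum form of the statistic with a normal approximation.

```python
import math
import random
import statistics

# Illustrative assumption: x is right-skewed, y is left-skewed and shifted up.
rng = random.Random(7)
n = 5000
ln2 = math.log(2.0)
x = [rng.expovariate(1.0) - ln2 for _ in range(n)]         # population median 0
y = [ln2 + 0.15 - rng.expovariate(1.0) for _ in range(n)]  # population median 0.15

# Rank the pooled sample and sum the ranks belonging to y.
# Exact ties are not expected with continuous data, so no midranks are used.
pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
rank_sum_y = sum(rank for rank, (_, group) in enumerate(pooled, start=1) if group == 1)

u_y = rank_sum_y - n * (n + 1) / 2   # number of (x, y) pairs with x < y
p_hat = u_y / (n * n)                # estimate of P(X < Y)
z = (p_hat - 0.5) / math.sqrt((2 * n + 1) / (12.0 * n * n))
p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided

median_diff = statistics.median(y) - statistics.median(x)
mean_diff = statistics.fmean(y) - statistics.fmean(x)

print(f"median(y) - median(x) = {median_diff:.3f}")
print(f"mean(y) - mean(x)     = {mean_diff:.3f}")
print(f"estimated P(X < Y)    = {p_hat:.3f}, two-sided p = {p_value:.2e}")
```

Under these assumptions y has the larger median but the smaller mean, and the estimated P(X < Y) falls below 0.5, so the significant rank test sides with neither summary by design; it simply reflects which sample tends to produce the larger observation.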

Marius Ole Johansen

Hollander et al.'s Nonparametric Statistical Methods (at least the 3rd edition) has a nice section explicitly explaining why Mann-Whitney and Kruskal-Wallis are not median tests but rather rank-based tests of stochastic ordering (or "stochastic dominance").
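For readers who want the stochastic-ordering statement spelled out, here is a short sketch in LaTeX. The notation (independent X ~ F and Y ~ G, sample sizes n and m, shift δ) is my own assumption and is not taken from Hollander et al. or from the thread.

```latex
% Hypotheses in the two-sample setting (X ~ F, Y ~ G, independent):
H_0:\ F = G
\qquad\text{vs.}\qquad
H_1:\ G(t) \le F(t)\ \text{for all } t,\ \text{with strict inequality for some } t .

% Quantity estimated by the U statistic:
p = P(X < Y) + \tfrac{1}{2}\,P(X = Y),
\qquad
U = \sum_{i=1}^{n}\sum_{j=1}^{m}
    \Bigl(\mathbf{1}\{X_i < Y_j\} + \tfrac{1}{2}\,\mathbf{1}\{X_i = Y_j\}\Bigr),
\qquad
\mathbb{E}\!\left[\frac{U}{nm}\right] = p .

% Only under a pure location shift does the effect map onto a median difference:
G(t) = F(t - \delta)
\;\Longrightarrow\;
\operatorname{med}(Y) - \operatorname{med}(X) = \delta .
```

Nothing in the first two lines mentions a median; the link to medians appears only once the shift model in the last line is assumed.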

Ronán Michael Conroy

I am puzzled by the title of "The Wilcoxon–Mann–Whitney Procedure Fails as a Test of Medians".

Of course it does, in the same way that your toaster fails as a hand dryer. It was never designed to do the job. And, like your toaster, you could maybe construct improbable scenarios in which it might be a test of two medians, but it just plain isn't. I don't think a lot of thought or maths needs to go into showing this. The Mann & Whitney paper is entitled "On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other". It does what it says on the label, nothing else.

David A. Jones

You might try: Maritz, J.S. (1981) Distribution-Free Statistical Methods, Chapman and Hall. ISBN 0-412-15940-6

This is wide-ranging across a variety of theoretical tests. While it may not look at the effects of departures from assumptions, it does cover tests for when those assumptions don't hold, including paired-sample situations. There is some attention paid to constructing confidence intervals, rather than just simple tests. However it may be too terse and theoretical for what you want.

I think what you are doing should hopefully lead to better terminology being used to name these tests, rather than just people's names. Maritz's book uses terms like "rank-sum test" to add some specificity to the description.
