No normality, no homoscedasticity. Mann-Whitney U: no significant differences; t-test: significant differences. Which test should I trust?
I am trying to compare a categorical variable (male-female) on a Likert variable (0-10). I checked the normality and homoscedasticity assumptions, which were not met. Thus, I chose the nonparametric Mann-Whitney U test, rejecting Ha (there are no significant differences). Nevertheless, the t-test is supposedly robust enough to carry on even when its assumptions are not met, so I ran it as a comparison and obtained significant differences (with the same result from a one-factor ANOVA). Which test should I trust?
Patrice Showers Corneli and Sundaram Ramaiyer Karimassery – The Mann-Whitney does not test the hypothesis that the medians are equal except under the unlikely assumption that the distributions are otherwise identical.
It's worth noting the actual title of Mann and Whitney's paper: On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other (1). That's exactly what it tests. In fact, if you divide U by the product of N1 and N2, this gives you the proportion of cases in which an observation from one sample is higher than an observation from the other sample.
t-tests are, in fact, pretty robust to non-normal variables (there's a big simulation literature on this). The real problem is that people who use the Wilcoxon Mann-Whitney don't understand what hypothesis they have just tested!
1. Mann HB, Whitney DR. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann Math Statist. 1947 Jan 1;18(1):50–60.
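A minimal sketch of this calculation in Python (simulated 0-10 scores, not the poster's data; note that on a scale this coarse ties are inevitable, and U counts each tie as ½, so U/(N1·N2) estimates Pr(X>Y) + ½·Pr(X=Y)):

```python
# Sketch: U divided by n1*n2 as a probability of superiority.
# Data are simulated Likert-style scores; sizes mirror the thread's n=500/450.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
males = rng.integers(0, 11, size=500)    # hypothetical 0-10 responses
females = rng.integers(0, 11, size=450)

u, p = mannwhitneyu(males, females, alternative="two-sided")
print(f"U = {u:.0f}, p = {p:.4f}")
print(f"Pr(male > female) ≈ {u / (males.size * females.size):.3f}")
```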
If in doubt, I would bootstrap the sampling distribution of the statistic I want to test. If this is not possible for you, I recommend staying with the conclusion that the data is not conclusive regarding the difference between males and females.
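A minimal sketch of the bootstrap idea, assuming the raw scores are available (simulated placeholders here): resample each group, recompute the mean difference, and read off a percentile interval.

```python
# Sketch: bootstrap the sampling distribution of the mean difference.
import numpy as np

rng = np.random.default_rng(0)
males = rng.integers(0, 11, size=500)    # placeholder data
females = rng.integers(0, 11, size=450)

boot = np.array([
    rng.choice(males, size=males.size, replace=True).mean()
    - rng.choice(females, size=females.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap CI for the mean difference: [{lo:.3f}, {hi:.3f}]")
```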
Please take care not to confuse "having non-significant results" with the "demonstration of the absence of a difference". The absence of significance does not imply the absence of an effect (difference)! And you never reject Ha. You either reject H0 or you don't. And if you don't, then you say that your data is not sufficiently conclusive regarding H0.
A Likert scale is an ordinal variable, and as such a non-parametric test should be utilized. The t-test is based on a theoretical distribution constructed from a set of continuous random variables. It is then normalized and a t-statistic constructed as a means to see how your test value deviates from what is expected. I wouldn't attempt to bootstrap "ordinal" data, as that information is inherently lost, and bootstrapping will NOT improve parametric estimates, as it assumes certain population parameters, which cannot be obtained with values that are inherently non-parametric.
In short, go with the results of the non-parametric test.
I would certainly go with the non-parametric option of the Mann-Whitney U test, since the assumptions needed to operate the parametric t-test were not met. However, I also recommend data quality checking for possible outliers or entry errors, which might compromise the normal distribution of the data.
They are both right and you should trust them both.
The trouble is that you have got two different answers but you haven't realised that you asked two different questions. The t-test is a regression that estimates the difference between the means of two groups. The Wilcoxon Mann-Whitney test, on the other hand, tests the hypothesis that an observation from one group will be larger than an observation from the other group.
If the means are different, but the probability of an observation from one group being higher than an observation from the other group is not significantly different from 0.5, then I would suggest that Najla Alsiri has spotted the likely cause: one or more outliers causing the mean in one group to be higher than the other.
Both the t-test and the multiple group variant, ANOVA, are based on assumptions of normally distributed random variables. Hence the non-parametric test is more reliable and more powerful than the t-test.
The t-test tests whether two means are equal to one another or not. The Mann-Whitney tests if the medians are equal or not. Both tests assume the data are iid (independent and identically distributed: the two groups have the same shape, though possibly a different measure of central tendency).
To suggest that the Mann-Whitney is different because it tests whether one group is larger than the other is really not quite right. Both check for equal central tendencies.
Plotting the data is always a good idea before proceeding with test choice.
It would be ideal and appropriate to apply a non-parametric test to compare the means (correctly, medians) of two groups due to the non-normal distribution and small sample size. However, if the sample size is large in both groups, it would be better to apply the t-test even if there are abnormal values and the distribution is not normal, since parametric tests are stronger and theoretically more powerful. However, before applying the t-test it would be ideal to convert the values to their log values and verify whether the distribution is normal or not. This will add additional safety and strength to the comparison.
Thank you all for your responses. It is very kind of you.
Ronán Michael Conroy actually one of the biggest problems I am facing is to understand the meaning of the Mann-Whitney results (because I'm getting equal medians and different average ranks).
What I want to know is whether there are differences between male-female (n=500 and n=450) regarding several Likert variables (risk perception, attitudes) that are not normally distributed and lack homoscedasticity.
At the beginning I chose the Mann-Whitney U due to the failed assumptions, but as I got a big enough sample I'm hesitating between it and the t-test (I even saw some studies using ANOVA for this).
I totally agree with you: "The real problem is that people who use the Wilcoxon Mann-Whitney don't understand what hypothesis they have just tested "
So, how would you state these hypotheses for the t-test and the Mann-Whitney as well?
Moreover, what do you think about the evaluation of Likert variables?
Sundaram Ramaiyer Karimassery
Thanks again!
Hypothesis of the t-test: "the expected difference is 0".
Hypothesis of the MW-test: "the variables are stochastically equivalent".
The hypothesis of the MW-test may be phrased differently: "Let A and B be two random values from the two groups, respectively. The tested hypothesis is that P(A>B) = 0.5, that is, that it is equally likely that the larger value comes from either group."
Thank you, Ronán. But please note that I already noted in the same post that both the Mann-Whitney and the t-test depend on iid random variables: independent and identically distributed.
The t-test is extremely powerful if the assumptions of normality are met, but the t-test is very bad otherwise.
What the title (On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other) leaves out (because it is implied) is that it is a non-parametric rank test (ranks have medians, not means) of whether one of two random variables is larger than the other. In other words, it is a generalization of a t-test, which tests whether one normally distributed random variable is stochastically larger than another normally distributed variable.
In both cases we are looking for stochastically equal central tendencies (median for Mann-Whitney U and mean for t-test).
Stochastic simply means probabilistic not deterministic (as in Physics).
Stochastically larger simply means that the one set of observed data are random variables with a central tendency larger than the other identically distributed set of random variables.
To be precise, the null hypothesis of the t-test is that the mean difference is 0 (paired t-test) or that the mean of group 1 = mean of group 2.
Really this whole thing has gotten off topic. This is not about t-tests and Mann-Whitney U tests. It is about Likert scale data analysis, which is not easy, because Likert data is always ordinal, and ordinal data is always hard.
Under certain circumstances a t-test may work because of the central limit theorem, and maybe 400-500 observations means that the t-test might be good even though ordinal data is generally not normally distributed. Lots of data and a scale with 10 intervals seems like it should approach the behavior of continuous data, and may be close enough to normal that I'd prefer it to the Mann-Whitney test, which is based on ranks and may therefore simplify the data too much.
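This intuition can be checked by simulation rather than taken on faith. A rough sketch (all numbers illustrative): draw both groups from the same skewed 0-10 distribution, so the null is true, and see whether the Welch t-test keeps its nominal 5% type I error at these sample sizes.

```python
# Sketch: empirical type I error of the Welch t-test on skewed ordinal data.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
probs = np.array([1, 1, 2, 3, 5, 8, 10, 12, 15, 20, 23], dtype=float)
probs /= probs.sum()                       # a skewed distribution on 0..10

n_sims, rejections = 2000, 0
for _ in range(n_sims):
    a = rng.choice(11, size=500, p=probs)  # both groups from the SAME
    b = rng.choice(11, size=450, p=probs)  # distribution: null is true
    if ttest_ind(a, b, equal_var=False).pvalue < 0.05:
        rejections += 1
print(f"Empirical type I error: {rejections / n_sims:.3f} (nominal 0.05)")
```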
Andrea, you have probably read a lot about the Likert scale and its analysis, and so know more than most of us. It sounds like, depending on the number of categories in the scale, a number of different tests may be appropriate.
I'd try to figure out a way to plot (or rank) the male data and the female data and look at it. Does it look the same? If not, how is it different?
This sounds like a challenge but exploratory analysis before testing is generally a good idea.
I add a couple of papers that I found helped me understand the structure of the data a bit. Thanks for helping me think through a nice topic.
Andrea Cecilia,
It seems to me that, in your case, comparing parameters such as the average or the median is a risky exercise and, in the end, it will leave you with very limited information.
Could we assume that the range of values of the Likert variable has interpretable segments, for example, high = greater than 7, medium = between 4 and 7, low = less than 4? You would define this segmentation. On the other hand, are the sample sizes sufficient to obtain tables of suitable frequencies for analysis?
In this case, I suggest you change your strategy by comparing the distributions of the scores in the two groups. This way you could know if men or women tend to take the highest or lowest scores in different proportions. You could work this very easily with chi-square tests.
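A minimal sketch of this strategy in Python; the column names and data are hypothetical placeholders, and the low/medium/high cut points follow the segmentation suggested above.

```python
# Sketch: collapse a 0-10 score into bands and test homogeneity by sex.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({                        # placeholder data
    "sex": ["M"] * 500 + ["F"] * 450,
    "score": list(range(11)) * 86 + [5, 6, 7, 8],
})
df["band"] = pd.cut(df["score"], bins=[-1, 3, 7, 10],
                    labels=["low", "medium", "high"])
table = pd.crosstab(df["sex"], df["band"])
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```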
Jorge's answer is so sensible. A contingency table has the advantage that visually you'll get the gist of the data in one 2 x 10 table. Then, sensibly collapsing cells where the data is sparse, the full-blown table would give you an easy way to compare the observed table to the expected table to check for homogeneity across genders.
Patrice Showers Corneli – the trouble with large tables is that they are, well, large. It's hard to take in or see patterns in the data because the limitations of working memory are about 7 numbers, and a 10x10 table has 100 cells.
An alternative is to plot the table. A spineplot is good for showing patterns in tables. Here is a spineplot of the famous hair and eye colour data. The original table has 20 cells, most of which are filled with 3-digit numbers. The plot allows you to see what the numbers just don't show: a correlation.
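For readers who can't see the attached figure, a plot in this spirit can be made with statsmodels' mosaic function; the hair/eye counts below are quoted from the classic dataset from memory, so treat them as approximate.

```python
# Sketch: mosaic (spineplot-style) display of a two-way table.
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

hair_eye = {                               # (hair, eye): count
    ("Black", "Brown"): 68, ("Black", "Blue"): 20,
    ("Black", "Hazel"): 15, ("Black", "Green"): 5,
    ("Brown", "Brown"): 119, ("Brown", "Blue"): 84,
    ("Brown", "Hazel"): 54, ("Brown", "Green"): 29,
    ("Red", "Brown"): 26, ("Red", "Blue"): 17,
    ("Red", "Hazel"): 14, ("Red", "Green"): 14,
    ("Blond", "Brown"): 7, ("Blond", "Blue"): 94,
    ("Blond", "Hazel"): 10, ("Blond", "Green"): 16,
}
mosaic(hair_eye, title="Hair colour vs eye colour")
plt.show()
```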
Ronán Michael Conroy, I don't worry about the table size because it is only 2x10. If Andrea decides to collapse some categories, it will be even smaller. As I have seen, sample sizes (about 300 and 250) are good for this analysis. Your suggestion about plotting is very interesting. In addition, 2x10 tables with ordered categories can also be analyzed as ranked data with the Mann-Whitney test, but I think a goodness-of-fit approach with the chi-square test is more appropriate.
I agree with Jorge. A 2x10 contingency table is not complicated and readily allows a visual comparison between males and females for each cell and immediately points to the cells (if any) that differ in magnitude. It has the advantage of presenting all the data at once in exactly the manner that it will be used in the 'chi-square' (log-linear) test.
Even simpler is to make a 2x10 table of proportions with the marginal totals (the sum of all cells for males and the sum of all for females) as the denominators and the cell count as the numerator. This standardizes the cell contents for easy comparison between these two genders with different numbers of samples.
As usual popular responses are not always the correct responses.
Also for Ali, 500 males and 450 females (the marginal totals for the 2x10 contingency table).
checking for normality before running a test is known to increase type I error
t-tests are valid if testing the means makes sense (if your distribution is unimodal)
the WMW test tests the difference in the distributions, and it is very sensitive to changes in shape; it will not test what you need (it is meant for unimodal distributions too!)
you have a Likert scale -> use an ordered logit with sex as a regressor (see the sketch below)
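A minimal sketch of that last suggestion, with hypothetical column names and simulated data, using statsmodels' OrderedModel:

```python
# Sketch: ordered logit of a 0-10 score on sex.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(2)
df = pd.DataFrame({                        # placeholder data
    "score": rng.integers(0, 11, size=950),
    "male": [1] * 500 + [0] * 450,
})
df["score"] = pd.Categorical(df["score"], categories=range(11), ordered=True)

res = OrderedModel(df["score"], df[["male"]], distr="logit").fit(
    method="bfgs", disp=False)
print(res.summary())    # exp(coef of 'male') is a cumulative odds ratio
```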
Stefano Nembrini – checking for normality before running a test is known to increase type I error – I've seen this somewhere before. Have you a reference? ("Is known to" usually means "damn, I read this somewhere and now I can't remember where"!)
Stefano's remark is quite intriguing. But the implication is that one does not bother to check that the model describing the data actually fits the data. Shall we just assume normality and then go ahead and get the wrong answer?
Andrea states that the data is heteroscedastic and not normal. So using the t-test means violating model assumptions and getting silly answers.
Do not misinterpret my words!
I said: use a more appropriate model, i.e. ordered logistic regression
In my humble opinion, people should start looking at their data.
Start thinking more: if your outcome is the number of children, could it possibly be unimodal and symmetric? Highly unlikely; use a GLM with a Poisson likelihood.
Let's say that you run a normality test and you come up with a p-value of 0.6; does that mean that your data is normally distributed? NO. There is no way for you to know if the null hypothesis holds. Does a p-value of 0.02 mean that you have highly non-normal data? Same answer: NO.
Normality tests are meant to be used to check for extreme deviations from normality - the case where you fit a linear model and your residuals show a clear trend, that kind of thing.
Using WMW totally blindfolded is in my opinion more dangerous than applying a t-test blindfolded.
WMW is thought to be a test on the medians, but it is not (the same applies to Kruskal-Wallis ANOVA).
WMW is no magical remedy; it is simply a t-test on the ranks, and it tests the distributions, not means or medians.
Normality is not necessary to perform a t-test (despite what you've been taught); just the normality of the sampling distribution is required, which you can check through a bootstrap (a quick sketch follows the linked articles below).
Article When t-tests or Wilcoxon-Mann-Whitney tests won't do
Article Wilcoxon–Mann–Whitney or t-test? On assumptions for hypothes...
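To make the "check the sampling distribution through a bootstrap" step concrete, a quick sketch (placeholder data again): resample the mean difference and inspect it with a normal Q-Q plot.

```python
# Sketch: is the bootstrapped sampling distribution approximately normal?
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
males = rng.integers(0, 11, size=500)      # placeholder data
females = rng.integers(0, 11, size=450)

boot = np.array([
    rng.choice(males, size=500, replace=True).mean()
    - rng.choice(females, size=450, replace=True).mean()
    for _ in range(5_000)
])
stats.probplot(boot, dist="norm", plot=plt)  # near-straight line => near-normal
plt.show()
```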
And please, stop playing the significant/insignificant game.
What p-values do you get? Are they small? Are they big? You just don't say.
Use appropriate modelling!
Use p-values as a summary of your data and be more analytical
Article Mindless Statistics
Stefano, what are you talking about? Why are you scolding someone who is asking an honest question to a group of colleagues hoping for some good advice?
Your answers are snarky and do not seem well informed.
For example, how can you possibly use "appropriate modeling" if you eschew testing for goodness of fit of the distribution?
And for another example, why do you insist that we stop playing the "significant/insignificant game" and at the same time ask how big or small the p-values are? Just exactly how do you judge "big or small"? Any answer would be as arbitrary as the "significant/insignificant game".
Furthermore, none of your opinions have been humble, nor do they explain why, for example, you think a logistic regression would be good. Now, it is categorical data, so logistic regression is one example of a good distribution, but if you so fear checking model assumptions because they might affect your type I error, then how do you know that it is the best distribution?
Your hubris is strongly displayed in your answers as well as your photo.
That was just for emphasis.
You say that I am not well informed, yet I linked all the necessary publications... did you read them? Patrice Showers Corneli
Are you aware of publication bias and misuses of p-values at all?
What does my picture have to do with my answers?
I was just talking about t-tests or WMW: they are of little use, especially because people won't even look at their data
Goodness-of-fit tests are also known to be of little use (see http://web.ccs.miami.edu/~hishwaran/papers/IL.JTCVS2018.pdf)
Do you agree that 1 in a billion is a small p-value? Or is that not small enough for you?
The question refers to a response measured on an ordered scale, so why not use an ordered logistic regression, which is meant to model exactly those outcomes? Would you rather use a linear regression model?
does it make sense to test whether that response is normally distributed in the first place? If you get a p-value of 0.1, would you rush to the conclusion that the distribution is normal? That's exactly how p-values are misused
Deriding the person who asks the question is not a way of putting emphasis on your point. It is that derision that I object to. And to the tone of your answers, as if everyone else is a fool. I have been a statistician for many years, and it is true that misuse of statistics is rampant, and one of the most common mistakes that researchers make is to fail to examine their data every which way before subjecting it to analysis. Other common mistakes stem from the failure of people to understand output (p-values, goodness of fit, confidence intervals, type I and II error, and power). But the haughty derision is always misplaced.
Also could you please give me a citation for your interesting idea that "checking for normality before running a test is known to increase type I error"
You are no doubt a bright young man, but have a little to learn about hubris.
I apologize if in my messages I came off as arrogant, that was not my intention. Patrice Showers Corneli
I was not mocking anyone, I was just stating the fact that there are more appropriate models around to be used.
The fact that pre-testing inflates type I error has been around for a while, see for instance
Article Wilcoxon–Mann–Whitney or t-test? On assumptions for hypothes...
Article To test or not to test: Preliminary assessment of normality ...
Article The two-sample t test: Pre-testing its assumptions does not pay off
Article Preliminary Goodness-of-Fit Tests for Normality do not Valid...
Article A closer look at the effect of preliminary goodness-of-fit t...
Stefano Nembrini Many thanks for the useful reference list. Like you, I had read this somewhere but, unlike you, could not remember where!
The t-test is very powerful, and in the case of a large sample size normality can be assumed, so the t-test can be applied even if the distribution is not normal.
I understand that the t-test is used for parametric data with a normal distribution. Is it possible to use it for non-parametric data as well? Please advise and explain. Thanks.
Abdelazeem El-Dawlatly there is no such thing as parametric data. Statistical procedures are parametric if they estimate some property of the population. So a t-test estimates the difference between two means, and the Wilcoxon Mann-Whitney test estimates the probability that a score from one group will be higher than a score from the other (yes, the W M-W test is parametric!).
There is a large simulation literature to say that the t-test performs well even when data are not normally distributed. And, just to repeat, the W M-W test tests a different hypothesis, so the two tests are not exchangeable.
Andrea Cecilia Serge, this discussion reminds me of the "look-elsewhere effect".
We were advised by our teacher in Finland (when I was young) to put both results from t-test and Mann-Whitney side by side in our papers.
Presumably because he erroneously thought that these were "equivalent" tests, testing the same hypothesis... ?
If the sample size is large, we can accept the result obtained from applying the t-test. If the sample size is large, we can assume normality.
Now Jochen and Ronan are being snarky. Again we see tremendous hubris exhibited by contributors.
Andrea has said the data do not follow a normal distribution and are not homoscedastic.
Then augmenting with some other method is wise. Should it be the Mann-Whitney? I cannot know without seeing the data. However, the test generally has good power to detect a real difference between populations even when the data follow a normal distribution. The opposite is not true if the data are not normal: then the Mann-Whitney test outperforms the t-test.
Jochen, of course there is paramedic data. It is this sort of data that follows a known distribution with known parameters and conveniently can be analyzed by one of several well-known distributions (normal, gamma, log-linear, logistic, etc.). Mann-Whitney is a nonparametric method. And we often refer to data as normal or non-normal. We all know that means the data follow a distribution that has been described and for which we can estimate the specific parameters that characterize the data set.
Some data tend to violate normality to a small extent. So using both tests, even though they do not test the same hypothesis, is not only advisable but can do no harm. More information about a data set and its distribution, from various measures, is always a good idea.
Jochen,
It was because our teacher went through a large amount of data in his PhD thesis and compared the results from nonparametric tests with other ones. We had to know Siegel's "Nonparametric statistics for the behavioral sciences" by heart.
But how is this comparison done? The p-values of both tests refer to different hypotheses under different assumptions. It's like comparing the answers to different questions. I don't see how this could make sense.
Jochen,
I do not remember how the teacher's PhD thesis was done; it is so long ago. He was one of those teachers I understood (I came from Stockholm, from the Latin line with 7 hours of Latin per week; I could not even draw a square root at that time).
Kind regards from Finland; it is still quite cold.
Ultimately you are asking whether the two samples come from the same population of possible outcomes. Rejecting the null in either case - regardless of somewhat different hypotheses - means that the data are not consistent with the null. Either way the evidence suggests that the samples are from different populations and should be treated as such.
The t-test compares means (and variances); the Mann-Whitney tests for one sample being larger than another. The aim is the same; the hypotheses answer the same question: do the samples differ significantly?
Patrice Showers Corneli Please keep the tone light and helpful. "Snarky" and "hubris" sound as if they were intended to be demeaning and insulting. I hope this wasn't the case.
Whatever about paramedic data (and I should declare an interest – my brother-in-law is a paramedic), there is no parametric data. A parameter is not a datum or a collection of data. A parameter is a property of the data generation process. That is why we characterise distributions by the number and nature of the parameters needed to specify them.
And I should point out that "do the samples differ significantly" is not a question, for two reasons: the first is that you need to specify what property you are testing, and the second is that in statistical testing we are interested not in the samples but in the populations they were drawn from.
Patrice Showers Corneli "Either way the evidence suggests that the samples are from different populations and should be treated as such." - that's a strange argument. You know beforehand that the samples are from differnt populations. This is how the sampling was done. The test is to check if the amount of data is sufficient to see a particular statistical difference (that are given by the hypotheses tested) clear enough. To stay with the simple example "t-test vs. U-test": it is possible that both tests turn out to be "significant" (give p 0, but Pr(A[i]>B[i]) < 0.5. Now what?
If the assumptions are ok for the t-test, the U-test in fact tests the same hypothesis, but using less information (lower power). This is also well known, so one does not need to perform both to get any new insights.
Here's a real example. It's a small datafile of a case-control study of emergency cesarean section delivery. Mode of delivery is 1 for cases and 0 for normal delivery controls. The other variable is mother's body mass index (BMI).
Looking at a t-test, you find no difference in means between the two groups. And this makes sense – the means of the two groups are pretty close. But now do a Wilcoxon Mann-Whitney. It's significant.
Understanding why this is so is key to realising that if you don't know what hypothesis is being tested, you can't interpret the test!
I'm indebted to one of my MSc students for the example, which emerged during a classroom session.
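The datafile is attached to the thread rather than reproduced here, but the pattern is easy to mimic with simulated data: two groups with (nearly) equal means but very different shapes, where the t-test sees nothing and the rank test does. Exact p-values vary with the seed.

```python
# Sketch: equal means, different shapes (simulated analogue, not the real data).
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(7)
controls = rng.exponential(scale=25.0, size=300)    # heavily skewed, mean 25
cases = rng.normal(loc=25.0, scale=4.0, size=120)   # symmetric, mean 25

print("means:", round(controls.mean(), 2), round(cases.mean(), 2))
print("Welch t-test p:",
      round(float(ttest_ind(controls, cases, equal_var=False).pvalue), 3))
print("Mann-Whitney p:",
      round(float(mannwhitneyu(controls, cases).pvalue), 6))
```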
Thanks for the example, Ronán. Adding to the story, analysis of that data set via logistic regression (which would be typical for a case-control study) yields an odds ratio of 1.078 (95% CI, 0.971 to 1.197).
Of course I meant populations in the statistical sense: a homogeneous large group of things from which the samples were taken and which are identically distributed and independent of one another. This is the basis of the null hypothesis that the parameters that describe the population also describe each sample.
In this case there are at least two reasons for her outcome.
One is that the t-test assumes normality, and so if one fails to reject it, we have no evidence that the two samples come from two different populations (sensu statistics: all possible outcomes).
If the samples depart significantly from normality, then the nonparametric alternative is appropriate, with a similar null: that the samples came from the same set of all possible outcomes (the population), but we use Mann-Whitney and ranks to determine whether one sample is larger than the other.
Either way, a significant outcome from either one should provide evidence that the samples were from two different populations. And indeed Andrea does find that they are different.
The t-test is very powerful if the data do not depart too much from normality. Even so, the Mann-Whitney is nearly as powerful. But as Conover says, the reverse is not true: the t-test is never as powerful as the Mann-Whitney if normality is violated.
Ronán Michael Conroy, thank you for this interesting example. It seems to me that the differences in the Wilcoxon Mann-Whitney test are more informative than those of means comparisons with the Welch test. If we compare the means without taking into account the most extreme value of sample 0 (controls), the mean differences will be significant, but the WMW test remains more informative, as can be seen with a comparative boxplot. This shows how important it is to guide researchers about the information given by the indicators, that is, the statistics for their work hypotheses.
I am sorry for the snarky remark. It is not really professional. Let me lighten the accusation.
What bothers me is the lack of respect for the person who is asking the question in the first place. Some members present themselves as experts with all the right answers and that others are foolish. And this is clearly not the case when the "expert" makes erroneous claims or when the "expert" derides a teacher who he does not know simply because the "expert" does not understand the problem.
Or, for example, "there is no parametric data". Really! A collection of data is only important when described by its properties which, if well chosen, describe the data. The data is of no interest if it cannot be described and related to the hypothesis. Such criticism and such declarative statements are not useful to anyone.
Really the choice of the two tests depends on which test's assumptions are satisfied. In cases such as Ronan's case-control study, it is clear that even though the means were very close, so that the t-test found no difference, the variances were probably unequal, which violates the simple t-test's assumption of equal variances. Perhaps the Welch test would have worked better. But that explains why a rank-based test like the Mann-Whitney would pick up the difference between the two samples. Properly accounting for different means as well as different variances should work better than a simple t-test.
Patrice Showers Corneli – I beg to disagree. The choice of tests depends on the question that you are asking. The alternative is that you allow the data to dictate the model. This abdication of responsibility leads to nonsense like stepwise methods.
The whole confusion between parametric and nonparametric is a legacy of a paper that was old even when I was young: Siegel S. Nonparametric Statistics. The American Statistician. 1957 Jun;11(3):13–9. The trouble is that these 'nonparametric' methods include a surprising number of tests that actually calculate parameters, some of which are very useful measures of effect size. See Newson R. Parameters behind "nonparametric" statistics: Kendall's tau, Somers' D and median differences. Stata Journal. 2002.
The distinction becomes even more useless when you consider that Cox regression is both parametric and nonparametric!
Not to knock Siegel – his book of "nonparametric" tests was a staple of researchers in the days before computers. These were tests that could be done with a pencil on a sheet of foolscap. And the accessibility of Siegel's writing was a breath of fresh air in an era of obtuse textbooks.
On the contrary. One must conduct tests (I use likelihood models) to determine which model best fits the data. One cannot just decide to use a particular model without determining the goodness of fit. Suppose you have data that are thought to be Poisson observations, and you fit a Poisson model to your data (sequence data, say) and find that the variance is larger than expected under a simple Poisson model (overdispersion)? Well, the next step might be to check the likelihood of the data under a compound Poisson. If the likelihood score is higher for the compound Poisson, you know that using a simple Poisson to describe the data is likely to bias your results, and the probability of converging to the wrong answer grows with increasing data (consistency).
Perhaps you are much younger than I am - 1957 does not seem so long ago. However, my statistics master's degree is a very good one that I obtained in the late '80s from a master theoretician in likelihood theory. I never followed any paper from the decade in which I was born, with the exception of RA Fisher - who wrote long before I was born - and Neyman and Cox.
Please send me a reference to this problem as I may not be understanding the way you describe the problem. Oh I see that you have one reference so I will read it. Thank you for that.
Very nice paper by Newson but it is about the advantages of using a confidence interval rather than a simple p-value to describe the significance of the analysis.
Of course the data dictates the model.
Really, all you have to work with is the sample data. From these you estimate parameters to test whether the two (or more) groups come from the same population.
From Newson: "Rank-based statistical methods are sometimes called 'nonparametric' statistical methods. However, they are usually in fact based on population parameters, which can be estimated using confidence intervals around the corresponding sample statistics."