Do you mean a parametric test, or a test that assumes the residuals are normally distributed? I'll assume the latter (since otherwise I don't understand the question), but you should clarify. How big a sample? A lot has been written on how procedures like the t-test have low power when there are outliers. A good review is:
Article: How Many Discoveries Have Been Lost by Ignoring Modern Statistical Methods? (Wilcox, 1998, American Psychologist)
Daniel Wright Thanks, Daniel, for your reply. I meant tests that assume normality, e.g. the t-test or ANOVA. I had a discussion with a colleague and he says a sample larger than 30 is sufficient. I will take a look at the article - thanks!
Jochen Wilhelm Thanks for your comment. So if you have a sample size of 200 and you want to compare the means of two unpaired groups, do you use a t-test or a Mann-Whitney test? Do you base your decision on the distribution, i.e., do you draw a histogram or Q-Q plot?
There is no finite n for which the central limit theorem (CLT) always applies, so the brief answer is that one can't simply assume that inference will be accurate with statistical models that assume a normal distribution for the errors.
If the CLT applies (e.g., for averages of well-behaved distributions such as the normal, binomial, etc.), then the rate of convergence towards a normal sampling distribution depends on the shape of the distribution - being faster for symmetrical distributions and slower for distributions with heavy tails (for instance). So if the shape depends on the value of the parameter being estimated - as it does for, say, the binomial or Poisson - then convergence can be very fast or very slow.
e.g.,
- a binomial proportion with mean = 0.5 requires only a fairly low n to be approximately normal
- a binomial proportion with mean = 0.0005 requires a very high n to be approximately normal (see the sketch below)
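A minimal sketch of this convergence difference (my own illustration, assuming NumPy/SciPy; the sample sizes are arbitrary choices):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims = 100_000

for p, n in [(0.5, 30), (0.0005, 30), (0.0005, 10_000)]:
    # simulate the sampling distribution of the sample proportion
    phat = rng.binomial(n, p, size=n_sims) / n
    print(f"p={p}, n={n}: skewness of the sampling distribution = {stats.skew(phat):.2f}")

For p = 0.5 the sampling distribution is essentially symmetric already at n = 30, while for p = 0.0005 it is still clearly skewed even at n = 10,000.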
The "is 30 right" question has been asked before (https://www.researchgate.net/post/What_is_the_rationale_behind_the_magic_number_30_in_statistics). One of the problems is that outliers increase the standard deviation (because the residual is squared before being summed) more than the mean.
Just to show Thom S Baguley's point, here is a simulation sampling 1000 people from an almost-normal distribution (99.8% normal with sd = 1, 0.2% normal with sd = 100): the observed percentage significant is less than half the nominal value, thus showing that the power is low. A point made by Fisher, Tukey, etc.
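Since the code behind this example is not shown, here is a hedged sketch of the kind of simulation described (the details -- two groups of n = 1000, 5000 replications, alpha = 0.05 -- are my assumptions):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def contaminated_normal(n):
    # mixture: 99.8% N(0, 1), 0.2% N(0, 100^2), both with mean 0
    sd = np.where(rng.random(n) < 0.998, 1.0, 100.0)
    return rng.normal(0.0, 1.0, n) * sd

n_reps, n = 5000, 1000
rejections = 0
for _ in range(n_reps):
    x, y = contaminated_normal(n), contaminated_normal(n)  # H0 is true: same distribution
    rejections += stats.ttest_ind(x, y).pvalue < 0.05

print(f"empirical rejection rate: {rejections / n_reps:.3f} (nominal 0.05)")

As described above, the rare huge values inflate the pooled SD, so the t-test rejects far less often than the nominal rate and is correspondingly weak against real mean differences.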
Yes, you can, but it depends on the nature of the testing problem. With regard to the central limit theorem, some test statistics (for instance, for a test of the randomness of a data set) converge to a normal distribution, while others converge to a chi-square distribution.
Osaid H. Alser, the point is that the MW-test does not "compare the means" (i.e., test hypotheses about mean differences). It does not even "compare medians", as many say. The only test I know that is about mean differences is the t-test. If you are interested in testing mean differences but the assumptions the t-test is based on really make no sense, you may bootstrap the null distribution of the mean difference. A sample size of 200 seems to be ok to go for a reasonably robust bootstrap approach.
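A minimal sketch of one way to bootstrap the null distribution of the mean difference (my own illustration of the idea, not necessarily the exact procedure meant above):

import numpy as np

def bootstrap_mean_diff_test(x, y, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    observed = x.mean() - y.mean()
    # impose H0 by shifting both groups onto a common (pooled) mean
    pooled_mean = np.concatenate([x, y]).mean()
    x0, y0 = x - x.mean() + pooled_mean, y - y.mean() + pooled_mean
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(x0, size=x0.size, replace=True)
        yb = rng.choice(y0, size=y0.size, replace=True)
        diffs[b] = xb.mean() - yb.mean()
    # two-sided p-value: how often the null differences are at least as extreme as observed
    return observed, np.mean(np.abs(diffs) >= abs(observed))

# e.g., two skewed groups of 100 each (total n = 200)
rng = np.random.default_rng(1)
x, y = rng.exponential(1.0, 100), rng.exponential(1.3, 100)
print(bootstrap_mean_diff_test(x, y))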
Regarding the MW-test, I'd like to cite Ronán Michael Conroy's answer:
>>>
It's worth noting the actual title of Mann and Whitney's paper: On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other (1). That's exactly what it tests. In fact, if you divide U by the product of N1 and N2, this gives you the proportion of cases in which an observation from one sample is higher than an observation from the other sample.
t-tests are, in fact, pretty robust to non-normal variables (there's a big simulation literature on this). The real problem is that people who use the Wilcoxon Mann-Whitney don't understand what hypothesis they have just tested!
1. Mann HB, Whitney DR. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann Math Statist. 1947 Jan 1;18(1):50–60.
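A quick numerical check of the U/(N1*N2) interpretation (my own sketch, using scipy.stats.mannwhitneyu; note that recent SciPy versions return the U statistic for the first sample):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x, y = rng.normal(0.5, 1, 40), rng.normal(0.0, 1, 60)   # continuous data, so no ties

u, _ = stats.mannwhitneyu(x, y, alternative="two-sided")
prop_from_u = u / (len(x) * len(y))

# direct count of pairs in which an x observation exceeds a y observation
prop_direct = np.mean(x[:, None] > y[None, :])

print(prop_from_u, prop_direct)   # identical here; ties would each contribute 1/2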
Also worth clarifying: it is the sampling distribution of the statistic (e.g., the mean) that converges to normal under the CLT. The raw data won't change distribution.
Daniel Wright I don't understand your example. Both "populations" have a mean of 0, so H0 is true under the simulation. I don't understand how you can discuss "power" in this context.
The distribution of the p-values under H0 is not uniform, which gives a more conservative test.
An alternative approach to the bootstrap mentioned briefly by Jochen Wilhelm is to do a permutation test based on the mean difference or on the statistic usually identified with the t-test. You just need the permutations to correspond to whether you have a paired or unpaired design. If you can do enough permutations, the probabilities obtained for the null distributions are exact for any distribution of the data and for any sample size. The interpretation of those probabilities may differ from the usual one, as they refer to different probability spaces, but they still provide a valid significance test.
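A sketch of the unpaired version (my own illustration; with a small number of observations one could enumerate all permutations instead of sampling them):

import numpy as np

def permutation_test_unpaired(x, y, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    nx = len(x)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(pooled)               # random relabelling of the observations
        diffs[i] = perm[:nx].mean() - perm[nx:].mean()
    # two-sided Monte Carlo approximation to the exact permutation p-value
    return np.mean(np.abs(diffs) >= abs(observed))

rng = np.random.default_rng(2)
x, y = rng.lognormal(0.0, 1.0, 30), rng.lognormal(0.3, 1.0, 30)
print(permutation_test_unpaired(x, y))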
"The good news is that if you have at least 15 samples, the test results are reliable even when the residuals depart substantially from the normal distribution".
" For simple regression, the study assessed both the overall F-test (for both linear and quadratic models) and the F-test specifically for the highest-order term.
For multiple regression, the study assessed the overall F-test for three models that involved five continuous predictors:
a linear model with all five X variables
all linear and square terms
all linear terms and seven of the 2-way interactions ".
I don't understand the experiment/simulation they have done.
5 predictor variables + square terms = 10 coefficients, leaving 5 d.f. And this with serious violations of the assumptions?
And what distributions did they use?
Apart from that, the problem is not that the nominal type I error rate would not be met. If you have just one residual d.f., any (proper) test still keeps the nominal type I error rate -- but the power is miserable! They don't mention this. They only mention that "there is a caveat if you are using regression analysis [of data clearly violating the assumptions] to generate predictions." Now what? Hooray, the type I error rate is kept, but there is no power and no useful prediction. Sounds like a bad deal.
(Kelvin, that's not against you. I am just pointing out that such recommendations as given in the link you posted are stupid, and I know that you know that this is stupid. I am afraid that other readers might not)
Mohamedraed Elshami None of the references you cite actually presents a reason for the magic number, and they look more like cookbooks than research. There's a lot of that about, to be fair – simple answers to complicated questions. The book you show extracts from has some odd notions, such as that the Mann-Whitney test is an alternative to the Wilcoxon. It is exactly the same as the Wilcoxon, which is why it gives identical results. It also doesn't actually tell you the hypotheses you are testing (kind of important if you are to run a test!). Not impressed.
There's a lot of literature on the performance of the t-test (which is a simple OLS regression with a binary predictor) and I've never come across 50 as being a threshold for anything. Can you point me to where this figure is actually calculated or to simulation studies?
And as for EFA, sample size is heavily dependent on the number of variables, so there's nothing magic about 50 here either.
Re that Minitab Blog post, I wonder if the author(s) meant 15 observations per variable in the model? I tried to look at the two white papers mentioned near the end of the post, but both links are broken. I've written to Tech Support at Minitab to see if they can provide links that work. Will post them here if I receive them!
An obvious concern when using parametric tests is whether there are any harmful effects of misspecification. This problem can be avoided by estimating the distribution with the nonparametric maximum likelihood technique, or by using a transformation to bring the non-normally distributed data to normal or close to normal.
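As one concrete illustration of the transformation route (Box-Cox is my choice here, not necessarily what the poster had in mind; it requires positive data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # strongly right-skewed, positive data

x_bc, lam = stats.boxcox(x)                        # lambda estimated by maximum likelihood
print("skewness before:", stats.skew(x))
print("skewness after: ", stats.skew(x_bc), "(lambda =", lam, ")")

For lognormal data the estimated lambda comes out near zero, i.e., roughly a log transform, and the transformed values are close to normal.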
I just received this from Tech Support at Minitab. Emphasis added.
"I’ll pass that error [i.e., broken links to the white papers] onto the web team. In the meantime, all the white papers including the two on Regression are located here:
And the 15 refers to the number of rows of data, regardless of the number of variables you have. Whenever you run regression, all columns must have the same number of rows, and that number should be at least 15."
I am astonished by that claim, and don't believe it for all the reasons Jochen Wilhelm listed in his Nov 29 post.
Details (such as they are) about the simulations for multiple regression are in Appendix C of this document:
The 30 comes from the fact that printed t-tables tend to stop at 30 df. This has nothing to do with the issue in the question, but that's where it comes from. For a little history of this, see this link: https://www.google.com/search?q=n%3E30+implies+the+central+limit+theorem+history&rlz=1C1CHBF_enUS874US874&oq=n%3E30+implies+the+central+limit+theorem++history&aqs=chrome..69i57.33394j0j1&sourceid=chrome&ie=UTF-8
The history of Statistics, Mathematics and the Sciences in general is quite interesting. I recommend it to you. Best wishes, David Booth
@Bruce Weaver Thanks for mentioning the nonparametric maximum likelihood approach. I had no idea that it existed. For any others like me here's a link:
Jochen Wilhelm and Bruce Weaver Yes, it is scary, but software vendors often say things like that; I have seen similar stuff from others. I have also heard colleagues say that since Microsoft is a big, famous company, Excel can't have errors. Well ….. Best, David
Parametric models for non-normal data are, in other words, non-linear statistical models for density estimation problems. In such scenarios, you may use statistics on Stiefel and Grassmann manifolds: Book: Statistics on Special Manifolds (Lecture Notes in Statistics).
Another good reference is directional statistics, i.e., statistics defined on non-linear data: Chapter: In Directional Statistics.
For aspects of multivariate statistical analysis, also see the book by Muirhead: https://www.booktopia.com.au/aspects-of-multivariate-statistical-theory-robb-j-muirhead/book/9780471769859.html?source=pla&gclid=EAIaIQobChMIu4-n8fqO5wIVSgwrCh2gOAEBEAQYASABEgJ_AfD_BwE
The central limit theorem tells us that the sampling distribution of the mean should be approximately normal for large samples. If your data are still strongly non-normal even with a large sample, I suggest you use the nonparametric equivalent of the required parametric test.
There are two trade-offs to consider when assessing any procedure from a frequentist point of view:
1. Robustness for validity: Will the t-test falsely reject the null at a higher rate than the pre-specified alpha? In general, the t-test *is* robust for validity, in that the type I error rate remains near the nominal alpha when the assumptions are false.
This is the "robustness" that secondary sources on research methods tout when they defend the general use of the t-test for non-normal data.
2. Robustness for efficiency: What these proponents fail to consider is that the WMW (Wilcoxon-Mann-Whitney) independent-samples test is nearly as efficient as the t-test under normality (approx. 95.5%), and *at worst* 86.5% as efficient (i.e., for distributions with thinner tails than the normal), but it can be *infinitely* more powerful in certain cases of heavy tails or skew.
I've read a number of papers simulating these results, and they all come to the same general conclusion that the asymptotic analyses conducted in the 1940s and 1950s also reached.
It is very hard to beat the simplicity of the Wilcoxon test without either making an assumption (i.e., Bayesian methods) or peeking at the data (robust or adaptive methods).
R. Clifford Blair discusses this debate in a historical context in this link.
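For readers who want to see this for themselves, here is a small simulation in the spirit of those papers (my own sketch; the lognormal distribution, n = 30 per group, and shift of 0.5 are arbitrary choices):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, shift, n_reps = 30, 0.5, 2000
reject_t = reject_w = 0

for _ in range(n_reps):
    x = rng.lognormal(0.0, 1.0, n)
    y = rng.lognormal(0.0, 1.0, n) + shift           # pure location shift of a skewed variable
    reject_t += stats.ttest_ind(x, y).pvalue < 0.05
    reject_w += stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < 0.05

print("t-test power:", reject_t / n_reps)
print("WMW power:   ", reject_w / n_reps)

Under this kind of skew the WMW typically rejects considerably more often, while under exact normality it would give up only a few percent of power.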
You say that, in the case of skewed distributions, the power of the MWM test is higher than that of the t-test. But the MWM test tests a different hypothesis than the t-test. How can you compare the power? Isn't this like comparing apples and oranges?
One might argue that both test the same hypothesis (zero expected difference) under the assumption that the distributions can differ only by a location shift. In this case the power of the MWM can be (much) higher for skewed distributions. But if the distributions are skewed, the effect to be tested is almost never a location shift. I would be thankful for a single practical example of a variable with a skewed distribution where the relevant effect is a pure location shift.
The MWW tests stochastic ordering -- to what degree are values in group X larger than those in group Y? The real difference is that the parametric t-test makes a scale assumption (data are interval or ratio), while the MWW only assumes that the data can be ordered.
We can always convert an effect size from one model to another by multiplying by the appropriate scale.
We can also see the decision result under both procedures conditioning on the same data set. That is how the simulation studies work. We specify a distribution, draw samples from that distribution, then see how power and alpha behave empirically.
I don't see the problem with comparing the procedures at all, and neither do the hundreds of papers that compare the two via simulation.
FWIW -- I think mean differences are used in circumstances when the MWM/proportional odds model would be a more appropriate choice.
Misconceptions Leading to Choosing the t Test Over the Wilcoxon Mann-Whitney Test for Shift in Location Parameter
https://digitalcommons.wayne.edu/coe_tbf/12/
Fermat, Schubert, Einstein, and Behrens-Fisher: The Probable Difference Between Two Means When σ_1^2 ≠ σ_2^2
Suppose P(X>Y) > 0.5 (MWM significant) and, at the same time for the same data, E(X-Y) > 0 (t-test significant). Now if you reject "some H0", what do you conclude?
PS: Just because something is said or written or done very often does not make it correct, and this is never a reason to give credit. How many stats books do you find where "probability" is defined as a limiting relative frequency? In how many is written that failing to reject H0 means to accept H0? These things don't become correct just because they are repeated so often.
In principle, shouldn't you pre-specify which test to use before seeing the data, or pre-specify a protocol for an adaptive/robust test?
In reality, there is never a reason to do both tests as far as I can tell. If I were presented with the results of conflicting tests, I'd favor the MWW over the t-test. I'd also prefer to see the actual p-value, rather than "reject/fail to reject." An actual estimate would be best.
If your argument is that these problems/questions are better placed in an estimation framework, you would be in excellent company.
I only objected to the idea that these procedures were not comparable because the hypotheses were different. Both procedures map scientific hypotheses to number systems (reals for the t-test, naturals for the MWW). Both systems share the ordering assumption, but differ on the idea that the scale is equally spaced.
Where you can do a t-test, you can also do a MWW. Do you disagree with this? I'm not exactly sure what your criticism entails.
The most you can conclude from your example (or any hypothesis test, for that matter) is that, from a frequentist perspective and based on the data, the researcher would reject the null model of no effect -- i.e., the data are compatible with the existence of a discernible effect.
What that "rejection" entails behaviorally is context sensitive, and outside the realm of the operating characteristics of the decision procedures.