I have a large sample (a single sample of 1,100 cases) and I want to test a hypothesis comparing the mean of my outcome variable between two groups (each group has 550 cases).
Some statisticians told me, "You can use the standard Student's t-test because the data can be treated as normal, based on the Central Limit Theorem."
I'm confused: the Central Limit Theorem is about the distribution of sample means. For example, if we have a dataset of 100,000 cases that is not normal, we can draw 100 samples from it; the distribution of those 100 sample means would be approximately normal, and then I could use the t-test.
If my sample is large, can I use parametric statistics (or parametric hypothesis tests) when the distribution of the data is non-normal?
Since you don't have normality, you can use a non-parametric test for your hypothesis. Use the Mann-Whitney U test; it is the non-parametric counterpart of the independent-samples t-test.
Stamatios Ntanos Thank you. I know about non-parametric methods, but my question is about the Central Limit Theorem. Those statisticians told me the data can be assumed normal when the sample size is large. However, I think the Central Limit Theorem says that the distribution of sample means is normal.
The answer is no. Read the theorem carefully in a good mathematical statistics book; you really don't have a situation where the CLT applies. Best, David Booth
The Central Limit Theorem is not the right solution here. You should go for non-parametric tests in the case of non-normal data.
This matter doesn't hinge on the Central Limit Theorem, but rather on the specific kind of non-normality and on the robustness of the statistical method in use.
I think parametric methods are used with non-normal data in many situations. For instance, a sample of moderate size (n = 50) from a logistic distribution could easily, and with high probability, be confused with a normal sample (the power of the Anderson-Darling test is < 16%). On the other hand, an evidently multimodal, non-normal sample can be judged inadequate for a parametric test even though its skewness and kurtosis are similar to those of a normal process. Thus, the problem is mainly about the robustness of the method used and the characterization of the non-normality.
Please have a look at my page, where I uploaded a preprint with some considerations about this issue.
I agree with the comments: if your data are non-normal, you should carry out non-parametric tests even with a large sample size.
No, the Central Limit Theorem doesn't apply here; that idea was discarded 50 years ago. Best, D. Booth
What if the variable is normally distributed but the sample size is small (say < 20)?
Can we use a parametric test?
All my stats texts say that the x-bars are normally distributed with mean equal to the underlying mu and standard deviation equal to sigma/sqrt(n) for sample sizes above 30, or t-distributed with standard deviation s/sqrt(n) and df = n - 1, where n is the sample size. This holds whether or not the original distribution is normal. What am I missing?
Ira, some books give only easy problems, but I have found that life is not like that. Go to z-lib.com and look at Devore and Berk, Modern Mathematical Statistics. Maybe even the files I attached. David Booth
David Eugene Booth - Thank you. The citations are interesting and useful, but they do not concisely address the question. Let me try a summary. You are not rejecting the central limit theorem in any straightforward sampling scenario - say, the x-bars of samples of size 100 from an exponential. Rather, are you saying that in some scenarios, when using a predictive model, the distribution of the coefficients may not be normal even when sample sizes are large?
David Eugene Booth I found the book you recommended (Devore and Berk, Modern Mathematical Statistics) and it says the following:
"However, at the other extreme, a distribution can have such fat tails that the mean fails to exist and the Central Limit Theorem does not apply, so no n is big enough. We will use the following rule of thumb, which is frequently somewhat conservative. If n > 30, the Central Limit Theorem can be used. Of course, there are exceptions, but this rule applies to most distributions of real data." So I was wondering if you or Mukaram Ali Khan can recommend another source. Thanks!
Jesica Formoso, short answer: yes, we can use the Central Limit Theorem for n > 30.
@Jesica, the Devore and Berk citation says that some distributions are such that the central limit theorem doesn't apply for any n at all. I personally suggest the Cauchy as an example where no n suffices (its mean does not exist), and the Laplace as a heavy-tailed case; those should suffice. If the sampling distribution is unknown, did God tell you that this unknown distribution is one for which a normal approximation to the sampling distribution will suffice? This is why Laplace developed L1 estimators; see Stigler's History of Statistics. Given that God didn't speak on the issue, do you want a clinical trial of your current cancer chemotherapy to ALWAYS use this approximation, or not? I apologize for the author of the book using 30.
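For anyone who would like to see this rather than take it on faith, here is a minimal R sketch (the sample size of 100 and the 2,000 replications are arbitrary illustrative choices) comparing sample means drawn from a uniform and from a Cauchy population:
# Means of 2,000 samples of size 100 from a uniform population settle down (the CLT applies)
set.seed(1)
m_unif <- replicate(2000, mean(runif(100)))
# Means of 2,000 samples of size 100 from a Cauchy population do not: they are Cauchy themselves
m_cauchy <- replicate(2000, mean(rcauchy(100)))
# Compare the spread: the uniform means are tightly concentrated, the Cauchy means are wild
quantile(m_unif, c(0.01, 0.5, 0.99))
quantile(m_cauchy, c(0.01, 0.5, 0.99))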
David Eugene Booth thanks for the info. I rely on the distribution, regardless of the sample size, but recently I was asked to modify some analyses based on the CLT and I wanted more information about it and a book or article that I could refer to.
@jesica. Nobody says never use the approximation. But it is risky to do so in certain situations. I personally don't use it if the work could have serious consequences. Best wishes for a good day, David Booth.
PS Many books on Robust Statistics will have a section devoted to heavy tailed distributions. Wish you the best, David Booth
If you are interested in testing a parameter, like the difference between two means (rather than some other feature of the distributions), then a parametric test is appropriate. There are methods to address non-normality if your test has that as an assumption. For the traditional t-test, the biggest issue is related to power; Tukey (1960) made this clear.
BUT the big issue for this (old) question is that, for most applications comparing two means with n_j = 550, any meaningful difference will pass the inter-ocular trauma (IOT) test, so a t-test (or whatever) may not be needed.
@Daniel Before making such a blanket statement, I suggest you run a simulation using a Cauchy population. As someone once suggested, intuition in a strange situation is not always a good guide. Best wishes, David Booth
David Eugene Booth , if you are addressing my first paragraph, that high kurtosis distributions affect the power, then yes Cauchy is an example of this. In my field I don't see much data this extreme.
If you are addressing my second paragraph (which I think you are), good point, and especially good to make towards one of my comments! The odd extreme value (or a few) can cause a t-test to be non-significant (e.g., Fisher, 1925, pp. 111-112), so the IOT test could show a meaningful difference where the t-test wouldn't. So, yes, my blanket statement was me guessing what the questioner's data are like (and I really shouldn't do that, since that is something I often bring up on RG ... so big thanks).
@Daniel you are correct in your assumptions about my post. However, the Cauchy simulation would show that, in general, if you know nothing about the population distribution, then using the "n bigger than 30" rule is especially unsafe. Best wishes, David
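A minimal sketch of the kind of simulation being suggested, in base R (the settings are purely illustrative): both groups are drawn from the same Cauchy population, so every rejection is a false positive.
# Two-sample t-test when both groups come from one Cauchy population (no true difference)
set.seed(1)
p_values <- replicate(5000, t.test(rcauchy(550), rcauchy(550))$p.value)
# Compare the observed rejection rate with the nominal 0.05; under Cauchy data the
# t statistic does not follow the t distribution, so the rate typically deviates from it
mean(p_values < 0.05)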
Compute the difference in means using the Fisher (F) distribution and apply ANOVA to test the equality of the means.
Every large sample, larger than 30, should use parametric Z statistics.
Mehdi, the short answer is no. Parametric statistical tests assume a normal distribution. If the distribution of your data is not normal, then non-parametric tests are needed. Having a large data set does not automatically make it normal. Normality is all about the shape of the distribution, not its size: it means the data look like a Gaussian bell curve, with density proportional to e^(-(x-mu)^2/(2*sigma^2)).
Here's an example to illustrate the dangers of using parametric tests on non-parametric distributions. Consider two data sets, one triangular with a sharp drop on the right, and the other with a mirror-image shape and a sharp drop on the left. Assume they have identical means. Because they are mirror images, they will also have the same standard deviation. The two sets, however, will have completely different modes and completely different medians, and they describe two very different population trends; but if you apply parametric tests to them, the tests will tell you that, because they have the same means and standard deviations, there is no significant difference between them. Never trust a parametric test unless you are testing normal or nearly normal distributions.
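A rough R sketch of this kind of example (the mirror-image triangular samples below are simulated from transformed uniforms purely to illustrate the point):
# Two mirror-image triangular samples with (approximately) the same mean and SD
set.seed(1)
n <- 550
x <- 1 - sqrt(runif(n))        # peak near 0, long tail to the right, mean 1/3
y <- sqrt(runif(n)) - 1/3      # mirror image, shifted so its mean is also 1/3
c(mean(x), mean(y)); c(sd(x), sd(y))   # nearly identical means and SDs
t.test(x, y)    # typically non-significant: it only compares the (equal) means
ks.test(x, y)   # detects that the two distributions differ in shape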
The central limit theorem cannot be used to transform a non-normal data distribution into a normal distribution. This theorem is about the relationship of means of sample sets to the population mean. It is not a tool for transforming the shape of data distributions.
Ian Dash Jaime Eduardo Gutiérrez Ascón Daniel Wright Jesica Formoso Ira Robbin
All of the above statements are about the distribution of sample means. If you have just one sample, then the sampling distribution of its mean is approximately normal when the sample size is larger than about 30.
So the CLT is not about the samples themselves; it is about the means of samples (all it is saying is that as you take more samples, especially large ones, the histogram of the sample means will look more and more like a normal distribution). If we have just one sample (and in practice we usually take only one sample from the population), the CLT speaks to the distribution of the mean of that sample.
I showed the above in R software:
# Create 100,000 random draws from a standard uniform distribution
Uniform <- runif(100000)
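A minimal sketch of how such a demonstration can be completed (the sample size of 50 and the 1,000 replications are arbitrary illustrative choices):
# Draw 1,000 samples of size 50 from the uniform data and keep each sample mean
sample_means <- replicate(1000, mean(sample(Uniform, size = 50)))
# The histogram of the raw data is flat, while the histogram of the means is bell-shaped
hist(Uniform)
hist(sample_means)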
Mehdi, I think your last message agrees with what I said earlier about the nature of the CLT, although you have gone into much more detail and provided a practical demonstration. But for the record, let me elaborate on my previous comment about the CLT in case it was misunderstood.
Nonparametric tests are in general much messier and more tedious than parametric tests, and fewer implementations of nonparametric tests are available in software. For these reasons, many (perhaps most) researchers try to avoid nonparametric testing if at all possible. The CLT is sometimes seen as a way to bypass nonparametric testing by aggregating the data into means of sub-samples (a form of downsampling): the distribution of those sub-sample means will be much closer to normal than the parent distribution when the parent distribution is non-normal.
The pros and cons of downsampling is a topic we could discuss at great length, but I will try to be brief. Downsampling has many uses. It is a good way to
a) make a large dataset more manageable
b) reduce noise and
c) filter data in other ways, with the aim of retaining the most useful information and discarding only the least useful information.
The tradeoffs for this are that
d) N becomes smaller
e) because N is smaller, precision is reduced (confidence intervals widen)
f) a lot of information is lost.
If you are dealing with digital audio or video, downsampling is enormously valuable. It makes it possible to transmit a recognisable version of the original content over the very restricted bandwidth of the internet. Audio and video compression algorithms typically reduce the information by around 99% to do this.
In the field of research however, data is much harder to obtain. Instead of generating 270 million bits of information per second, which is the bit rate of uncompressed standard definition television, we take weeks, months or even years to obtain a few dozen or a few hundred data points. As well as time, a lot of effort and resources often go into obtaining these precious gems of data.
When you downsize a data set of say 1100 data points, as in your data set, by a factor of 30, so that the reduced data set looks approximately normal, you are left with a very small data set, around 36 data points in this case. You have then lost 97% of the information you laboured so hard to obtain in the first place and your confidence intervals will be around six times larger. But this then may allow parametric tests to be used on the decimated data set because it will now have a fairly normal distribution. Is it worth losing 97% of the information to do this? I guess it depends on what you are testing and how big your effect size is. If the effect size is big enough to still be visible, maybe it is worthwhile.
If you are splitting your data set into two groups, as your original question said, and you downsample by a factor of 30, you will be left with 18 data points in each group. You are then back in a situation where N has become so small that nonparametric methods may be needed again, and you have come full circle, although a nonparametric test on two groups of 18 data points will be easier than the same test on two groups of 550 data points.
So to go back to my original comment, which was that the CLT is not a tool to change the shape of a distribution, downsampling can be used this way, but only at the cost of losing most of the original information.
An analogy is that a camera can be used purely as a light meter if you defocus the lens enough, but then you lose the picture, which was probably much more interesting and useful than just knowing the light level.
The answer will always be no, because the Central Limit Theorem does not guarantee fast convergence to normality.
Mehdi, although I earlier dismissed downsampling methods as ineffective, on further consideration it may be worth considering jackknife and bootstrap resampling methods for analysing your data. These are nominally non-parametric methods, but they are closer in form to parametric methods than rank-based nonparametric techniques are. I haven't used them myself, but they are well established and there are several texts available on them. For a brief introduction, see https://math.montana.edu/jobo/thainp/jack.pdf and https://math.montana.edu/jobo/thainp/boot.pdf .
For the size of data set you have, bootstrapping would probably be less tedious than jackknifing, but beware that the random nature of the bootstrap resampling technique will introduce some variability into the results.
@Ian Brad Efron asked me to convey his best wishes. Have a great day, D. Booth
Mehdi, in case David's joke passed you by, Brad Efron is the inventor of the bootstrap technique and has written several books on it, including Efron, B., and Tibshirani, R. J., An Introduction to the Bootstrap, Chapman and Hall, 1993. He has also written numerous articles on it. If you want a briefer look at the technique than an entire book, try Efron, B., and Gong, G., "A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation", The American Statistician, February 1983, Vol. 37, No. 1, pp. 36-48, which you may find on the web.
As a complement to this discussion and the valuable remarks made by Prof. David Eugene Booth, let me also add this quote from Wilcox (2012) [1]. It agrees entirely with my experience in a field well known for generating "difficult" datasets (clinical trials -> clinical biochemistry).
1.4 The Central Limit Theorem
When working with means or least squares regression, certainly the best-known method for dealing with non-normality is to appeal to the central limit theorem. Put simply, under random sampling, if the sample size is sufficiently large, the distribution of the sample mean is approximately normal under fairly weak assumptions. A practical concern is the description "sufficiently large". Just how large must n be to justify the assumption that X̄ has a normal distribution? Early studies suggested that n = 40 is more than sufficient, and there was a time when even n = 25 seemed to suffice. These claims were not based on wild speculations, but more recent studies have found that these early investigations overlooked two crucial aspects of the problem. The first is that early studies looking into how quickly the sampling distribution of X̄ approaches a normal distribution focused on very light-tailed distributions where the expected proportion of outliers is relatively low. In particular, a popular way of illustrating the central limit theorem was to consider the distribution of X̄ when sampling from a uniform or exponential distribution. These distributions look nothing like a normal curve, the distribution of X̄ based on n = 40 is approximately normal, so a natural speculation is that this will continue to be the case when sampling from other non-normal distributions. But more recently it has become clear that as we move toward more heavy-tailed distributions, a larger sample size is required. The second aspect being overlooked is that when making inferences based on Student's t, the distribution of T can be influenced more by non-normality than the distribution of X̄. In particular, even if the distribution of X̄ is approximately normal based on a sample of n observations, the actual distribution of T can differ substantially from a Student's t-distribution with n − 1 degrees of freedom. Even when sampling from a relatively light-tailed distribution, practical problems arise when using Student's t, as will be illustrated in Section 4.1. When sampling from heavy-tailed distributions, even n = 300 might not suffice when computing a 0.95 confidence interval via Student's t.
[1] Wilcox, Rand R. (2012). Introduction to Robust Estimation and Hypothesis Testing. Academic Press. doi:10.1016/C2010-0-67044-1
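To illustrate Wilcox's point that the whole T statistic, not just X̄, is affected, here is a small simulation sketch along those lines (the lognormal population and n = 30 are arbitrary illustrative choices):
# One-sample t-test of the true mean of a standard lognormal population, nominal level 0.05
set.seed(1)
true_mean <- exp(0.5)                 # the true mean of a standard lognormal
n <- 30
p_values <- replicate(10000, t.test(rlnorm(n), mu = true_mean)$p.value)
mean(p_values < 0.05)                 # observed type I error rate, often noticeably off 0.05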
Bearing in mind that there are 500+ statistical tests, with plenty of non-parametric and robust alternatives beyond the good old Wilcoxon (Yuen-Welch, permuted Brunner-Munzel, the ANOVA-Type Statistic (ATS), the permuted Wald-Type Statistic (WTS), Van der Waerden, quantile (mixed) regression, the proportional-odds model, i.e. ordinal logistic regression, which generalizes the Wilcoxon and allows adjustment for covariates), one should at least give them a try.
Of course, when comparing results, one should remember the underlying null hypotheses being tested, which are different in each case, and additional distributional "constraints" can modify the interpretation. Users often confuse them, e.g. equating the Mann-Whitney-Wilcoxon with a test of medians, which it is not in general (unless additional properties of the analysed data are confirmed), and are then surprised when, say, the t-test and the Mann-Whitney give opposite results.
One should always think first about the goal of the analysis, which measure is relevant for summarizing the given dataset (it is not necessarily the mean), and what kind of hypotheses actually reflect the objectives.
If we can assume the underlying mean and variance exist and are finite, why wouldn't the usual t-test work even if the underlying data were not normal?
Ira Robbin E.g. for the reasons I quoted above from Wilcox's work. With very large samples, around 1,000, it probably will work (assuming means are the measure of interest, which rarely happens with asymmetric data), but for smaller samples the behaviour of the t statistic isn't as nice as commonly pictured in textbooks (where authors focus separately on the numerator and show that the CLT "makes it work at N = 30", forgetting that it's not only about the numerator but about the entire test statistic). That's a common issue in my work with clinical datasets: below N = 200-300 (we typically work with smaller and much smaller data, from about 20 to 500), depending on the case, it fails quite often in terms of type 1 error or power. Actually, even methods as robust as GEE aren't as robust as one might expect at these sample sizes.
Ira Robbin because the t test assumes that the data are normally distributed. It is meaningless (no pun intended) on a non-normal distribution. For the same reason, the mean itself is generally of little value in such a data set. See the paper I posted earlier in this thread for illustrative examples.
Ian Dash - I am not following your example. You had two sets of data with 55 data points each. The two sets of data were supposedly the result of sampling two distributions. The resulting X-bars were identical. So that implies we cannot reject the null hypothesis that the underlying means were equal.
However, that does not imply we can conclude the means are equal. We could compute the probability the true difference is between +/- a margin of error using the t stat. I ran a few thousand iterations randomly drawing from the sample points and by eye it started to get awfully close to the t. I may not be interpreting this example as you intended.
I know how the bootstrap works, Ian Dash. I'd like to know which statistical tests the Central Limit Theorem is used for.
Adrian Olszewski
I learned from your explanation that the CLT is about the means of the data, and an independent t-test compares the means of a numeric variable between two groups, so with a very large sample size this test can be used for non-normal data.
Ira Robbin , if you read Student's original paper, you will find in the third paragraph an explicit statement that the t test assumes that the data are normally distributed. The t distributions are computed according to this assumption. If the data are not normally distributed, the t test will give you, at best, an inaccurate answer. In highly non-normal data sets like the ones in my examples, it will give you a meaningless answer.
If you resample the data sets in the examples I gave, you are jackknifing and creating new data sets that are more normal-looking. You can then use parametric tests on your new data sets, but the data sets you are now testing are not the original data sets; they are transformations of those data sets that look entirely different. Conclusions you draw from those transformed data sets do not necessarily apply to the original data sets, particularly conclusions about difference.
The point of the paper was to show pairs of example data sets that were obviously, strikingly different, but which showed no difference under parametric testing. Such data sets were the reason that non-parametric tests were developed.
As Adrian Olszewski said above, there are hundreds of statistical tests available. Each of them is devised for a particular set of applications. Clinging onto one or two of these tests and trying to apply them to every application regardless of their suitability is a desert island strategy. It is like trying to use vise-grips on nuts and bolts because you don't have any spanners. If you are on a desert island and that is the only tool you have, maybe you have to make do with it. But if you are not on a desert island and you can borrow or buy some spanners, they will do the job much more quickly, easily and elegantly without rounding off the heads and mangling the surfaces.
Mehdi Azizmohammad Looha
At some large N, you are correct, it can*. The question is how big an N is required, because the validity is asymptotic and we don't know it a priori. / * provided the mean is a reasonable measure for such data, which I cannot confirm without seeing the data. It's always a matter of domain knowledge, some compromises and agreements. /
The t-test is not only about the means. Every parametric test shares the same logic: the magnitude of some effect is compared against some measure of dispersion, to see how well the effect can be detected (or discerned). The t-test is no different. It has a numerator (the difference in means) with its own properties, and a denominator (the dispersion term) with other properties. We are interested in the properties of the whole: the test statistic.
The CLT (especially supported by other powerful theorems, like Slutsky's, plus the knowledge that the limiting distribution for a few key quantities is the normal distribution) applies to various estimators, not just means but also variances, even the beta coefficients of a regression model. But this doesn't mean that all of them behave the same way at a given N, let alone when combined into formulas (square roots, ratios).
In the text I quoted, let me highlight the part which says that the means are not everything; rather, we need to think about the entire test statistic.
For some data it may require N = 20, 50, 300, 500, maybe even more (I don't go that far at work, so I don't know). Non-normal yet symmetric and unimodal data will definitely be "nicer", more favourable for the CLT (but be careful of "fat tails"), than multimodal data with certain properties, like a mixture of skewed distributions "spoiled" by the presence of outliers (on one side; outliers on both sides may compensate for each other).
If you then believe that your sample size is enough (some hundreds to thousands), that means properly describe the data, and ideally that others have done it this way with good results, then sure, give it a try; asymptotically it will work. Or use a permutation t-test (assuming the variances are equal, because this technique requires "exchangeability" by definition), or the Yuen-Welch t-test (based on trimmed means).
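A minimal base-R sketch of such a permutation t-test (the vectors x and y and the number of permutations B are hypothetical placeholders):
# Two-sided permutation test using the difference in means as the test statistic
perm_test <- function(x, y, B = 10000) {
  observed <- mean(x) - mean(y)
  pooled   <- c(x, y)
  n_x      <- length(x)
  # Re-label the pooled observations at random B times (exchangeability under H0)
  perm_diffs <- replicate(B, {
    idx <- sample(length(pooled), n_x)
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  mean(abs(perm_diffs) >= abs(observed))   # permutation p-value
}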
PS: Let's not forget that in statistics everything is based on some idealised model, some theoretical construct. There is no "true normality" (and there cannot be, in real, physical life), no true equality of variance, no true this-and-that. Everything is approximate, asymptotic (OK, except the exact tests :-) ). You, as the researcher, decide whether the conditions are met well enough, whether the statistical assumptions are met sufficiently, whether the approximations are good enough for you. At some "big level", say several thousand observations, maybe there's even no need to test, if one has almost the entire population?
The test statistic is a function of sample size. Without the practical significance defined (the smallest meaningful magnitude; in medicine it's the MCID, the minimal clinically important difference), it may happen that even the smallest difference, completely unimportant from the domain's perspective, will be found statistically significant at some big N. Even worse, it may then be claimed to be important, which may be far from common sense and from scientific validity and meaningfulness.
PS to say that a parametric test can be meaningfully applied to non-parametric data as long as n is large enough is akin to saying that a square peg can be forced into a round hole as long as you use a big enough hammer. (This is not directed at you, Adrian Olszewski - I think from your comments about using non-parametric equivalents of the t test that we are in agreement.)
The only thing I would like to know is: what in the world is nonparametric data? The attached screenshot deals with nonparametric tests; whatever in the world can nonparametric data be? Best wishes, David Booth. PS: I did find the definition given in the second attachment, but my response to that is: who cares?
@Ian Dash - I randomly simulated samples of size 100 from the two data sets of size 55. I then computed t statistics based on those empirical x-bars and s statistics. Plotting histograms of these simulations, they seemed by eye to get reasonably close to a t curve drawn on the same graph. What am I missing?
David Eugene Booth for "non-parametric data" substitute "non-normal data".
But how does this detail contribute to the discussion?
Ira Robbin you seem to be trying to demonstrate the Central Limit Theorem. I am not surprised that you have succeeded in doing this as the theorem is well established and widely accepted. How is this related to the question of difference between the distributions in question though? Are you suggesting that all problems can be solved by resampling? And are you suggesting that the two distributions in my example are in fact identical, within some error limit, because a parametric test says they are?
@Ian Dash, I don't buy "non-normal data" either; please look at the definition of a nonparametric test. Best wishes, David Booth
I am saying that "non-parametric data" makes no sense because it is not well defined in the mathematical sense. David Booth
Ian Dash To some extent, yes. The tests of non-normality (we don't prove the H0 with them: the fact that normality is not rejected doesn't prove the data come from a normal distribution; examples attached) are quite limiting:
- At small samples, where the CLT works poorly, the non-normality tests lack power as well (Shapiro-Wilk and Anderson-Darling seem to perform best). And conversely: the more data we have, the more sensitive they are to every kind of deviation from normality, even if the data look almost perfectly normal.
Well, that's quite natural and agrees with our logic. The more data we have, the better we can see the "deviations". It's like having a good microscope or magnifying glass: the bigger the zoom, the more visible the details.
But *this time* it poses a problem. At small samples, where it could be safer to use a non-parametric method, the non-normality tests "don't see" anything "worrying", while at (very) large samples, where we could possibly go with a parametric (or at least robust parametric) method, they suggest switching to a non-parametric one. It's quite limiting.
That's where graphical methods, like the QQ plot, the eCDF or the (kernel) density estimator (histograms are too dangerous here), may help us much more. OK, maybe except the Cullen-Frey (Pearson) plot, which uses the 3rd and 4th central moments (like the Jarque-Bera test) and can be fooled easily (see the attached example with a bimodal distribution classified as "not non-normal").
- There are over 25 tests of non-normality (about 5 in common use, depending on the field). Why? Because there is not just one kind of deviation from normality, and each test is sensitive to something different. For example, Jarque-Bera only looks at the 3rd and 4th central moments. Kolmogorov-Smirnov looks at the maximum distance between two eCDFs (and cannot be used with parameters estimated from the sample; they must be specified in advance - the Lilliefors modification or resampling allows for that). Cramer-von Mises looks at the entire range (an integral), and Anderson-Darling is the Cramer-von Mises statistic weighted so as to be more sensitive to deviations in the tails. Shapiro-Wilk uses the correlation of the ordered sample values against their theoretical normal scores.
Those are only a few examples, but they already show that if four tests each look at something different, their "findings" may be different and mutually inconsistent, while we may not observe anything worrying when looking at the data graphically (e.g. on a QQ or eCDF plot).
- Sure, there are fields where the use of such tests is demanded by the authorities for conservativeness. The field I come from (clinical trials) is a perfect example. Non-normal distribution? Sorry - most of the time - no room for argument!
But if the QQ plot/eCDF says it's still OK, one can always run both the parametric and the non-parametric method and compare the results for consistency; if they agree well, the parametric method may be reported, which usually makes communication simpler (in terms of means). This is often acceptable; a short R sketch of this workflow follows the link below.
An interesting discussion: https://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless
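A minimal sketch of that "check graphically, then run both and compare" workflow in base R (the gamma-distributed data below are simulated purely for illustration):
# Simulated, moderately skewed data for two groups of 550 (illustration only)
set.seed(1)
x <- rgamma(550, shape = 4, rate = 1)
y <- rgamma(550, shape = 4, rate = 1.1)
# Graphical checks: normal QQ plot and overlaid empirical CDFs
qqnorm(x); qqline(x)
plot(ecdf(x)); lines(ecdf(y), col = "red")
# Run the parametric and the non-parametric test and compare for consistency
t.test(x, y)        # Welch t-test for a difference in means
wilcox.test(x, y)   # Mann-Whitney-Wilcoxon test of stochastic equality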
Ian Dash Let me answer both of your questions in a single response.
Regarding the first one, my previous comments said, briefly, that:
1) the t-test is not only about the distribution of means; the entire test statistic matters. That's the point of the quote I pasted from Wilcox's book.
2) at some really big sample size the CLT will "finally" (asymptotically) work* sufficiently well. But
a) we never know that size a priori, as we don't have a crystal ball, and the convergence may be slow;
b) we may get a correct answer to the wrong question: about the means, while other measures may be more appropriate. We may ask about pseudomedians, medians, dispersions, entire eCDFs, not just means (the means may be equal while dispersions or shapes are not; this will differentiate the two samples and may lead to different conclusions).
/* EDIT: The CLT is a proven theorem, a fundamental result that holds whenever its assumptions are met. Saying "works well" here is a mental shortcut, meaning that with increasing N the CLT "makes" the shape of the empirical test statistic close enough to the theoretical distribution. It expresses that "with the given data, at the given N, the properties of the calculated test statistic are - subjectively - sufficiently close ("convergent") to the expected ones", reflecting its asymptotic nature.
/
3) if one is 100% sure the research question is about means, but the distribution is non-normal, we *still* have methods that give an answer in terms of means, mentioned later (permutation tests, bootstrapping, the Yuen-Welch t, GEE).
Do we agree up to this point? :-)
Now, regarding your question about my view on non-parametric methods: it's difficult to summarize briefly, so let me organize it in points.
1) First and foremost, as I mentioned above, it's always a matter of the research question. If one decides (for any reason: "traditions" in the field, consistency with the literature, a request from a regulatory authority or a journal's reviewer, the study sponsor) that it must be about means, then non-parametric methods won't give that answer. But not all is lost!
a) the permutation test. It requires that both samples have the same dispersion (the data are "exchangeable" between them - this forms the H0).
b) the Yuen-Welch t-test (the Yuen part deals with outliers and partially with skewness via p% trimmed means, the Welch part with heterogeneity of variance).
c) [weighted or robust] Generalized Estimating Equations (GEE estimation) followed by an appropriate Wald test. It uses the sandwich estimator of variance and doesn't require normality of the residuals (equivalent to normality within groups). Of course it makes more sense for 3+ groups and 2+ effects (covariates), but it will work in the 2-sample case as well :-)
d) if only a significance decision is needed rather than a p-value: the bootstrapped confidence interval for the difference in means, ideally the BCa (bias-corrected and accelerated) interval. If the CI covers 0, the difference is not significant. If you do want a bootstrap p-value, remember to shift the means to mimic the H0 (the test must be done under the H0); see the links below and the short sketch that follows them.
* https://stats.stackexchange.com/questions/20701/computing-p-value-using-bootstrap-with-r
* https://stats.stackexchange.com/questions/386586/why-shift-the-mean-of-a-bootstrap-distribution-when-conducting-a-hypothesis-test
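A minimal sketch of the BCa confidence-interval approach using the boot package (the data frame d, with numeric outcome y and two-level group factor g, is hypothetical):
library(boot)
# d is assumed to be a data frame with a numeric outcome y and a two-level factor g
diff_means <- function(data, idx) {
  b <- data[idx, ]
  mean(b$y[b$g == "A"]) - mean(b$y[b$g == "B"])
}
# Resample within groups (strata) and compute the BCa interval for the difference in means
set.seed(1)
bt <- boot(d, statistic = diff_means, R = 5000, strata = d$g)
boot.ci(bt, type = "bca")   # if the interval excludes 0, the difference is significant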
2) We may ask another question: in what way exactly are the "big data" non-normal?
a) if the data are visibly skewed at a large sample size, then it's clear they don't come from an additive process. It's either a multiplicative one (e.g. log-normal), or one with a non-trivial mean-variance relationship, or it's a mixture of distributions (maybe some controlling variable, some confounder that could separate them, has been missed?), or there are large outliers (in my field it's not unusual to have data spanning 7-8 orders of magnitude in one direction, from 0.001 to 10,000, with totally valid observations lying 10 SD from the mean). In this case using the t-test is asking for trouble. Even if the CLT works, the result may be meaningless for these data.
b) if they are symmetric but multimodal, the obtained mean may not even lie among the realistic values, and it's clear that a single mean cannot sufficiently describe such a layout. It's rather about observing how Tukey's five numbers (the quartiles and extremes) change, or comparing the entire eCDFs (or QQ plots).
c) if they are symmetric, unimodal, but still non-normal, the CLT may work well. It "likes" symmetric distributions (I saw someone show this by expanding the t-distribution into an Edgeworth series, where the 3rd cumulant has a greater impact than the 4th). A note: if the tails are "fat", this may point to some underlying phenomenon (let's hope it doesn't come from something like a Cauchy).
And there we have the question of the necessary sample size. For symmetric data, in my experience, N = 100-200 sufficed, as long as there were no fat tails.
In the latter case, 200-300 sufficed. But several times it did NOT suffice with a mixture of distributions with serious skewness, multiple modes and lone outliers lying quite far in a single direction (e.g. LDL cholesterol in patients with familial hypercholesterolaemia under treatment, or PSA levels in cancer patients).
d) another problem is with discrete data, like drug doses, counts or scores. The means may refer to completely unobserved outcomes, and the distribution may be naturally skewed. But this is a separate, broad topic, and I don't want to make this one too monstrous.
3) When we have answered the previous question and it turns out that:
- mean isn't the best measure for our data
- the nature of the data may require a lot of observations for the CLT to work sufficiently
- and we don't have that much: 20-500 (depending on the case), or even 1,000, but the data have a "complex structure" (remember, it's not only about the means, it's the entire t-statistic),
then we may need to answer an additional question, namely: can our research question be reformulated (ideally a priori, in the Statistical Analysis Plan, or ad hoc after the data are collected) in terms of other hypotheses?
Sometimes this may be complicated! "Difference in distribution shapes" is not as straightforward as "difference in medians", which is not the same as "median difference", which is not the same as "stochastic equivalence" (that's what the Mann-Whitney (Wilcoxon) addresses), which is not the same as "dealing with transformed variables".
/ And I immediately advise against the Box-Cox applied to inference problems! Please check my answer here: https://www.quora.com/Why-is-the-Box-Cox-transformation-criticized-and-advised-against-by-so-many-statisticians-What-is-so-wrong-with-it/answer/Adrian-Olszewski-1?ch=10&share=b727f842&srid=MByz /
In this case, we may decide to switch to non-parametric methods:
a) quantile-based methods, like quantile regression (quantile mixed regression for clustered/longitudinal data) followed by appropriate joint tests for the main and interaction effects (for 2 samples this amounts to comparing quantiles, e.g. medians; a short sketch follows at the end of this list).
b) methods assessing stochastic equality, like the Mann-Whitney (Wilcoxon) or, better, the Brunner-Munzel (for the same reason we should always prefer the Welch t-test):
Article Psychologists Should Use Brunner-Munzel’s Instead of Mann-Wh...
The interpretation will depend on the distributions of the samples, as the MW(W) tests three kinds of hypotheses, depending on context:
- overall equality of distributions (stochastic equality), if we make no assumptions about the samples (e.g. Conover WJ (1999), Practical Nonparametric Statistics, 3rd edition. New York: John Wiley & Sons). This means we cannot say whether the differences come solely from the locations (e.g. medians) or also from the dispersions and the shapes themselves. It's all mixed together and tells us only whether the two distributions differ from each other (but in a totally different sense than, say, Kolmogorov-Smirnov).
It uses the so-called relative effect, defined as p = P(X < Y) + 0.5 P(X = Y).
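As an illustration of the quantile-regression route mentioned in point a), a minimal sketch assuming the quantreg package and a hypothetical data frame d with outcome y and a two-level group factor g:
library(quantreg)
# Median (tau = 0.5) regression of the outcome on the group indicator;
# the coefficient for g estimates the difference in medians between the groups
fit <- rq(y ~ g, tau = 0.5, data = d)
summary(fit)                   # inference for the estimated median difference
# For comparison, the rank-based test of stochastic equality
wilcox.test(y ~ g, data = d)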
A comment on the CLT: you are wrong that there has to exist a sufficiently large number. See the definition of convergence in the real numbers; no such finite value needs to exist. Best wishes, David Booth
Sure, it doesn't. The CLT itself is a theorem and always holds (given its assumptions). In practice it's all about the distribution of the t statistic being close enough to the theoretical CDF to ensure the desired properties: power and type 1 error rate. It's all approximate.
Adrian Olszewski BTW, your assertion (2) is spurious and I am not in agreement with you on it. You seem to want to put words in my mouth as well as in other people's mouths. It is not a good habit to cultivate.
As I have explained in my paper, the CLT is not a tool, it is a theorem. Resampling is a tool. Bootstrapping is a tool. Theorems are (generally) mathematically provable. When something is mathematically proven, it is pointless trying to debate whether it works or not - unless there is an error in the proof of the theorem.
The original question here was whether parametric methods can be used to determine difference between two non-parametric data sets of size n = 550. I have shown already in my paper why it is very unwise to rely on parametric methods when testing non-parametric data - you can arrive at a completely wrong answer with 100% confidence. No amount of prevarication over definitions of what is parametric or non-parametric, or normal or non-normal, can change this fact. No amount of resampling will change this result either. And the relative accuracy of parametric methods on non-normal data doesn't increase as n increases, it decreases, because the confidence level of the non-parametric method continues to increase while the confidence level of the parametric method remains at zero.
@Ian Dash
I'm not sure where I put words into anyone's mouth; could you please cite the sentence? In the other comment you said you were not sure whether we were in agreement, so I re-phrased my previous answer in slightly different words and asked whether we agreed up to that point. I did it to make it easier to find the source of a potential disagreement. That is, in my opinion, the basis of good communication, and I thought you would find it fair; apparently I was wrong. Now you have explained where you disagree with my assertions, so I can respond to that.
But I would like to ask you, if you don't mind, to avoid strong words about my habits in the future, please. I may be wrong in my opinions, I may fail to understand you, but that's not a reason to be harsh. We all spend our free time here to learn something and exchange experience. Tell me where I'm wrong, and it suffices.
On the CLT:
Yes, the CLT is a theorem, and given its assumptions it always holds; it is a fundamental mechanism that exists and is proven to hold. By saying it "works" in the context of sample size, I used a mental shortcut, meaning that with increasing N it "makes" the shape of the empirical test statistic close enough to the theoretical distribution. It expresses that "with the given data, at the given N, the properties of the calculated test statistic are - subjectively - sufficiently close ("convergent") to the expected ones", reflecting its asymptotic nature.
I mentioned it later in my response to prof. Booth, who caught this too.
On the "parametric data":
I'm not sure what "parametric data" actually means; I've never seen this term before. For me, data are data; methods can be (semi-)parametric or not.
EDIT: OK, I found it - you equate parametric data with normal data. Well, the data don't have to be normal; they can follow any theoretical distribution that makes sense, e.g. Poisson (there are parametric tests for such data).
Initially, before I found your statement, I guessed you meant data whose distribution resembles* a certain common distribution well, like the normal one. By "resembles" I mean that neither a test nor the QQ/eCDF plot shows visible deviations from it. In this case, the data and the differences between them can be sufficiently summarized by the distribution parameters: mean and variance. And when making inferences about the differences by employing a parametric test, the distribution of the input data is essential for the valid use of that test; namely, the distribution of the test statistic (and all related properties) depends on it.
/ * Actually, data have their own distribution, not any theoretical one (including the normal, which has infinite support, while most if not all real data are bounded or truncated somewhere). We are the ones who assign a theoretical distribution to them, claiming that the frequencies of the observed data are approximated by the theoretical distribution *well enough*. It's forcing the data to tell our story, based on our experience, observations, beliefs and common sense. All our further conclusions depend on this subjective choice and on the belief that the unseen rest of the data in the theoretical population shares this property as well. All models are wrong - some are useful. /
For example, if our data look "likely to come from the normal distribution", with the same variance (for simplicity) in both samples, it is sufficient and meaningful to compare just their means when speaking about the differences between them. And for the inference we rely on some properties of the assumed normal distribution; namely, the distribution of the t-test statistic depends on the normality of the raw data.
So it seems my guess was consistent with your understanding ("normal data").
I partially addressed the need to think about the way of meaningful summarizing the data in my previous post.
Another issue:
There still remains the issue of summarizing data and their differences with just a single property, regardless of the approach: a specific distribution parameter (like the mean) or a non-parametric summary, like the median. This is why I wrote that in certain cases a set of numbers, like the five-number summary (min, Q1, median, Q3, max), might describe the differences better.
Using a single property, regardless of the approach (parametric or not), is like summarizing impedance (a complex value) with just one real component, resistance. It's incomplete, doesn't give the full picture and may be misleading. Sometimes there is no single, interpretable measure of the difference between distributions other than "their eCDFs differ" or "they are not stochastically equal". That isn't very informative, but it's not the method's fault. It's simply impossible to characterize complex, multi-aspect (location, dispersion, shape) differences with a single, well-interpretable term.
I say this to highlight that even when choosing a non-parametric method (and thinking we're now safe), we may still be asking for meaningless outcomes.
/ By the way, this is why many textbooks put additional "constraints" on the sample distributions to make the interpretation of a test easier, meaningful. I showed this by the occasion of the Mann-Whitney (Wilcoxon). This approach limits the use of such tests anyway. /
Conclusion:
I don't advocate - on the contrary, I'm strongly against - using parametric methods thoughtlessly at any cost, by collecting huge amounts of data and believing that will suffice, as in the metaphor you used in one of your posts ("a square peg can be forced into a round hole as long as you use a big enough hammer"). I don't believe in that approach and I have always criticized it.
I wrote about what can be done if reporting a mean is requested or *still sensible*, and there are methods designed exactly for this purpose, like those I mentioned. They are not parametric in the sense that they don't require assuming a certain distribution and relying on its parameters for the inference, but they still return an answer either in terms of means (semi-parametric GEE, the permutation test) or some robust modification of them (Yuen). In every case it requires thinking about the research question and justifying the methods used.
Trying to apply a parametric test to complex data may be pointless. But choosing the wrong non-parametric test can also cause problems (for a different reason).
But there are no ideal situations. The data may look perfectly normal, forming a straight line on the QQ plot or making the theoretical and empirical CDFs overlap, yet a normality test may still find them "non-normal" and suggest using a non-parametric method where no real issue for a parametric method actually existed. And vice versa: for instance, the Jarque-Bera test can be fooled into finding data "normal enough" when they are not, as I showed in an example in one of my previous posts.
Statistics is full of approximations and asymptotic properties resulting in approximate results of acceptable properties.
Example:
If I have two samples describing age (which cannot be exactly normal anyway: age is bounded at both ends, say [0, 150] for humans, while the support of the normal distribution is infinite), and both are approximately symmetric and unimodal, consist of a few tens of observations, have no problematic outliers, and form an approximate "mountain" shape, then using the t-test won't be a "crime", even if a normality test rejects the null hypothesis of normality. The interpretation, the sense, will be preserved in this case.
The discussion has been heated. For me it doesn't matter whether we use parametric or non-parametric methods; any kind of data can be analyzed with either.
Mwoya, sure it can be, but we prefer smaller errors in our conclusions overall, especially in drug trials. D. Booth
When we have a large sample, the test statistic approximately follows a normal distribution, hence a Z test can be used. Even the t-test effectively becomes a Z test in this case, since the t distribution converges to the standard normal as the degrees of freedom grow.