Can we use parametric tests for data that are not normally distributed based on the central limit theorem, especially if we have a large sample size?
Yes.
However, it would be better to use a distribution model that better fits the data.
Do you mean a parametric test, or a test that assumes the residuals are normally distributed? I'll assume the latter (since otherwise I don't understand the question), but you should clarify. How big a sample? There has been a lot written on how procedures like the t-test have low power when there are outliers. A good review is:
Article: How Many Discoveries Have Been Lost by Ignoring Modern Statistical Methods?
Daniel Wright Thanks Daniel for your reply. I meant tests that assume normality, e.g. the t-test or ANOVA. I had a discussion with a colleague, and he says a sample larger than 30 is sufficient. I will take a look at the article - Thanks!
Both 31 and 17,354 are larger than 30...
And it makes a difference whether the distribution is strongly skewed or not, or whether we are talking about a binomial response with p close to 0 or 1.
Jochen Wilhelm Thanks for your comment. So if you have a sample size of 200 and you want to compare the means of two unpaired groups, do you use a t-test or a Mann-Whitney test? Do you base your decision on the distribution, i.e. by drawing a histogram or Q-Q plot?
There is no finite n for which the central limit theorem (CLT) always applies, so the brief answer is that one can't assume that inference will be accurate with statistical models that assume a normal distribution for the errors.
If the CLT applies (e.g., for averages of well-behaved distributions such as the normal, binomial, etc.), then the rate of convergence towards a normal sampling distribution depends on the shape of the distribution - being faster for symmetrical distributions and slower for distributions with heavy tails (for instance). So if the shape depends on the value of the parameter being estimated - as it does for, say, the binomial or Poisson - then convergence could be very fast or very slow.
e.g.,
- a binomial proportion with mean = 0.5 requires only a fairly low n to be approximately normal
- a binomial proportion with mean = 0.0005 requires a very high n to be approximately normal
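A minimal R sketch of this point (the sample sizes and replicate count here are arbitrary choices of mine, just to visualise the two cases):
set.seed(3)
phat <- function(p, n) replicate(20000, mean(rbinom(n, 1, p)))  # simulated sampling distribution of a proportion
hist(phat(0.5, 30))       # already looks roughly normal at n = 30
hist(phat(0.0005, 30))    # almost all mass at zero: nowhere near normal
hist(phat(0.0005, 30000)) # a very large n is needed before it looks normal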
The "is 30 right" question has been asked before (https://www.researchgate.net/post/What_is_the_rationale_behind_the_magic_number_30_in_statistics). One of the problems is that outliers increase the standard deviation (because the residual is squared before being summed) more than the mean.
Just to show Thom Baguley's point, here is a simulation sampling 1000 people from an almost-normal distribution (99.8% normal with sd = 1, 0.2% normal with sd = 100); the observed percentage significant is less than half the nominal value, thus showing the power is low - a point made by Fisher, Tukey, etc.
ps <- replicate(1000, {   # reconstruction of the truncated console snippet; the 500/500 split is assumed
  xs <- rnorm(1000, sd = ifelse(runif(1000) < .002, 100, 1))   # 99.8% sd = 1, 0.2% sd = 100
  t.test(xs[1:500], xs[501:1000])$p.value })                   # both groups have mean 0, so H0 is true
mean(ps < .05)            # the observed rejection rate comes out well below the nominal 5%
Yes, you can, but it depends on the nature of the testing problem. With regard to the central limit theorem, some test statistics (for instance, tests for randomness of a data set) converge to a normal distribution, while others converge to a chi-square distribution.
Regards
Osaid Alser, the point is that the MW test does not "compare the means" (i.e., test hypotheses about mean differences). It does not even "compare medians", as many say. The only test I know of that is about mean differences is the t-test. If you are interested in testing mean differences but the assumptions the t-test is based on really make no sense, you may bootstrap the null distribution of the mean difference. A sample size of 200 seems large enough for a reasonably robust bootstrap approach.
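A minimal sketch of that bootstrap idea (the data, the number of resamples, and the centring step are my illustrative choices, not a prescription):
set.seed(1)
x <- rlnorm(100); y <- rlnorm(100, meanlog = 0.2)      # two skewed samples (illustrative)
obs <- mean(x) - mean(y)                               # observed mean difference
xc <- x - mean(x); yc <- y - mean(y)                   # centre both groups so that H0 (equal means) holds
boot <- replicate(10000, mean(sample(xc, replace = TRUE)) - mean(sample(yc, replace = TRUE)))
mean(abs(boot) >= abs(obs))                            # two-sided bootstrap p-value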
Regarding the MW test, I'd like to quote Ronán Michael Conroy's answer:
>>>
It's worth noting the actual title of Mann and Whitney's paper: On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other (1). That's exactly what it tests. In fact, if you divide U by the product of N1 and N2, this gives you the proportion of cases in which an observation from one sample is higher than an observation from the other sample.
t-tests are, in fact, pretty robust to non-normal variables (there's a big simulation literature on this). The real problem is that people who use the Wilcoxon Mann-Whitney don't understand what hypothesis they have just tested!
1. Mann HB, Whitney DR. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann Math Statist. 1947 Jan 1;18(1):50–60.
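To illustrate that identity numerically, here is a small sketch with arbitrary example data (R's wilcox.test() reports the U statistic under the name W):
set.seed(7)
x <- rnorm(40, mean = 1); y <- rnorm(50)
W <- wilcox.test(x, y)$statistic      # Mann-Whitney U for x versus y
W / (length(x) * length(y))           # proportion of (x, y) pairs with x > y
mean(outer(x, y, ">"))                # the same proportion computed directly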
Also worth clarifying, it is the sampling distribution of the statistic (e.g., mean) that is converging on normal under the CLT. The raw data won't change distribution.
Daniel Wright I don't understand your example. Both "populations" have a mean of 0, so H0 is true under the simulation. I don't understand how you can discuss "power" in this context.
The distribution of the p-values under H0 is not uniform, giving a more conservative test.
If the samples are generated to include a true mean difference, the same contamination shows up as reduced power.
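A sketch of both cases (the group sizes, effect size, and replicate counts are my own arbitrary choices, not Daniel's):
set.seed(1)
pow <- function(contaminate) mean(replicate(2000, {
  s <- if (contaminate) ifelse(runif(1000) < .002, 100, 1) else 1   # 0.2% of observations get sd = 100
  x <- rnorm(1000, sd = s)
  x[1:500] <- x[1:500] + 0.3                                        # true mean difference of 0.3
  t.test(x[1:500], x[501:1000])$p.value < .05
}))
pow(FALSE)   # power of the t-test with clean normal data
pow(TRUE)    # power with 0.2% contamination: much lower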
If you are using a t-test and the sample size is at least 50, then yes.
An alternative approach to the bootstrap mentioned briefly by Jochen Wilhelm is to do a permutation test based on the mean difference or on the statistic usually identified with the t-test. You just need to do the permutations to correspond to whether you have a paired or unpaired situation. If you can do enough permutations, the probabilities obtained for the null distributions are exact for any distribution of the data and for any sample size. But the interpretation of those probabilities may differ from the usual as they refer to different probability spaces. However they still provide a valid significance test.
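Here is a minimal sketch of such an unpaired permutation test on the mean difference (the data and the number of permutations are illustrative choices):
set.seed(42)
x <- rexp(30); y <- rexp(30) + 0.5            # two independent samples (illustrative)
obs  <- mean(x) - mean(y)                     # observed mean difference
pool <- c(x, y)
perm <- replicate(10000, {
  idx <- sample(length(pool), length(x))      # random relabelling of group membership
  mean(pool[idx]) - mean(pool[-idx])
})
mean(abs(perm) >= abs(obs))                   # two-sided permutation p-value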
Many thanks for your comments, really appreciated!
Daniel Wright How do you do a bootstrap test using Stata? Do you mean this command: https://www.stata.com/manuals13/rbootstrap.pdf?
Here's another magic number, and mine is smaller than yours!
https://blog.minitab.com/blog/adventures-in-statistics-2/how-important-are-normal-residuals-in-regression-analysis
"The good news is that if you have at least 15 samples, the test results are reliable even when the residuals depart substantially from the normal distribution".
" For simple regression, the study assessed both the overall F-test (for both linear and quadratic models) and the F-test specifically for the highest-order term.
For multiple regression, the study assessed the overall F-test for three models that involved five continuous predictors:
I don't understand the experiment/simulation they have done.
5 predictor variables + squared terms = 10 coefficients, leaving 5 d.f. And this with serious violation of the assumptions?
And what distributions did they use?
Apart from that, the problem is not that the nominal type I error rate would not be met. If you have just one residual d.f., any (proper) test still keeps the nominal type I error rate -- but the power is miserable! This is something they don't mention. They only mention that "there is a caveat if you are using regression analysis [of data clearly violating the assumptions] to generate predictions." Now what? Hooray, the type I error rate is kept, but there is no power and there are no predictions. Sounds like a bad deal.
(Kelvyn, that's not against you. I am just pointing out that such recommendations as given in the link you posted are stupid, and I know that you know that this is stupid. I am afraid that other readers might not.)
Doctor Mohamedraed Elshami,
There are 2 points regarding your answer:
1 - Regarding the sample size of 50: you may be thinking of FA & PCA (check the PowerPoint attached).
Please check
What is the sample size for EFA?
https://www.researchgate.net/post/what_is_the_sample_size_for_EFA
2 - Regarding the t-test / Mann-Whitney test (please check the 2 screenshots)
Please check the Essential Medical Statistics textbook by Betty Kirkwood and Jonathan Sterne, page 56 (Chapter 6):
Table 6.1 Recommended procedures for constructing a confidence interval. (z' is the percentage point from the normal distribution, and t' the percentage point from the t distribution with (n - 1) degrees of freedom.)
, and page 345 (Chapter 30)
Table 30.1 Summary of the main rank order methods. Those described in more detail in this section are shown in italics.
Also check
https://en.m.wikibooks.org/wiki/Statistics/Testing_Data/t-tests
Mohamedraed Elshami None of the references you cite actually presents a reason for the magic number, and they look more like cookbooks than research. There's a lot of that about, to be fair – simple answers to complicated questions. The book you show extracts from has some odd notions, such as that the Mann-Whitney test is an alternative to the Wilcoxon. It is exactly the same as the Wilcoxon (rank-sum) test, which is why it gives identical results. It also doesn't actually tell you the hypotheses you are testing (kind of important if you are to run a test!). Not impressed.
There's a lot of literature on the performance of the t-test (which is a simple OLS regression with a binary predictor) and I've never come across 50 as being a threshold for anything. Can you point me to where this figure is actually calculated or to simulation studies?
And as for EFA, sample size is heavily dependent on the number of variables, so there's nothing magic about 50 here either.
Jochen Wilhelm & Kelvyn Jones :
Re that Minitab Blog post, I wonder if the author(s) meant 15 observations per variable in the model? I tried to look at the two white papers mentioned near the end of the post, but both links are broken. I've written to Tech Support at Minitab to see if they can provide links that work. Will post them here if I receive them!
Cheers,
Bruce
An obvious concern when using parametric tests is whether there are any harmful effects of misspecification. This problem can be avoided by estimating the distribution through the nonparametric maximum likelihood technique, or by using a transformation to bring the non-normally distributed data to normal or close to normal.
I just received this from Tech Support at Minitab. Emphasis added.
"I’ll pass that error [i.e., broken links to the white papers] onto the web team. In the meantime, all the white papers including the two on Regression are located here:
https://support.minitab.com/minitab/19/technical-papers/
And the 15 refers to the number of rows of data, regardless of the number of variables you have. Whenever you run regression, all columns must have the same number of rows, and that number should be at least 15."
I am astonished by that claim, and don't believe it for all the reasons Jochen Wilhelm listed in his Nov 29 post.
Details (such as they are) about the simulations for multiple regression are in Appendix C of this document:
HTH.
The quick answer is no.
The question is why your data are apparently not normally distributed even with a large sample size.
Does that come from a graphical check or from a test (and if so, which one)?
Before testing your data, you may "pre-treat" them with some standard transformation, such as log or square root, and see whether you obtain normality.
That would also give you information about your data's dispersion, and you could then test the pre-treated, approximately normally distributed data.
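A small sketch of that idea (the data and the choice of a log transform are purely illustrative):
set.seed(4)
raw <- rlnorm(200)                  # right-skewed positive data
shapiro.test(raw)$p.value           # normality clearly rejected on the raw scale
logged <- log(raw)
shapiro.test(logged)$p.value        # on the log scale the data look normal
qqnorm(logged); qqline(logged)      # visual check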
Regards
The 30 comes from the fact that the tabulated t-tables tend to stop at 30 df. This has nothing to do with the issue in the question but that's where it comes from. For a little history of this see this link: https://www.google.com/search?q=n%3E30+implies+the+central+limit+theorem+history&rlz=1C1CHBF_enUS874US874&oq=n%3E30+implies+the+central+limit+theorem++history&aqs=chrome..69i57.33394j0j1&sourceid=chrome&ie=UTF-8
The history of Statistics, Mathematics and the Sciences in general is quite interesting. I recommend it to you. Best wishes, David Booth
@Bruce Weaver Thanks for mentioning the nonparametric maximum likelihood approach. I had no idea that it existed. For any others like me here's a link:
https://www.google.com/search?rlz=1C1CHBF_enUS874US874&ei=890jXsbDJJXJtQbO0YvIDQ&q=nonparametric+maximum+likelihood&oq=nonparametric+maximum+lik&gs_l=psy-ab.1.1.0l3j0i22i30l7.867702.885655..894311...1.2..4.1015.10832.8j8j6j0j1j1j5j2......0....1..gws-wiz.....6..0i71j33i160j33i10j0i362i308i154i357j0i67j0i273j0i131.eXFLGOhxiyQ
Jochen Wilhelm and Bruce Weaver Yes, it is scary, but software vendors can often say things like that. I have seen similar stuff from others. I have also heard colleagues say that since Microsoft is a big, famous company, Excel can't have errors. Well ….. Best, David
For parametric models for non-normal data, or in other words non-linear statistical models for density estimation problems, you may use statistics on Stiefel & Grassmann manifolds: Book: Statistics on Special Manifolds (Lecture Notes in Statistics).
Another good book is on directional statistics for non-linear data: Chapter: In Directional Statistics.
For aspects of multivariate statistical analysis, also see the book by Muirhead: https://www.booktopia.com.au/aspects-of-multivariate-statistical-theory-robb-j-muirhead/book/9780471769859.html?source=pla&gclid=EAIaIQobChMIu4-n8fqO5wIVSgwrCh2gOAEBEAQYASABEgJ_AfD_BwE
The central limit theorem tells us the sampling distribution of the mean should be approximately normal for a large sample. If your data are still not normally distributed with a large sample, I suggest you use the nonparametric equivalent of the required parametric test.
There are 2 trade-offs to consider when assessing any procedure from a frequentist POV:
1. Robustness for validity. Will the t-test falsely reject the null at a higher rate than the pre-specified alpha? In general, the t-test *is* robust for validity in that the type I error rate remains near the nominal alpha when the assumptions are false.
This is the "robustness" that secondary sources on research methods tout when they defend the general use of the t-test for non-normal data.
2. Robustness for efficiency: What these proponents fail to consider is that the WMW (Wilcoxon-Mann-Whitney) independent-samples test is nearly as efficient as the t-test under normality (approx. 95.5%), and *at worst* 86.5% efficient (i.e. for distributions with thinner tails than the normal), but it can be *infinitely* more powerful in certain cases of heavy tails or skew.
I've read a number of papers simulating these results, and they all come to the same general conclusion that asymptotic analyses conducted in the 1940's and 1950's also discovered.
It is very hard to beat the simplicity of the Wilcoxon test without either making an assumption (i.e. Bayesian methods) or peeking at the data (robust or adaptive methods).
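A rough simulation of point 2 (the distribution, shift, sample sizes, and replicate count are all arbitrary choices of mine):
set.seed(11)
sim <- replicate(2000, {
  x <- rt(40, df = 2); y <- rt(40, df = 2) + 1          # heavy-tailed t(2) samples with a shift of 1
  c(t   = t.test(x, y)$p.value      < .05,
    wmw = wilcox.test(x, y)$p.value < .05)
})
rowMeans(sim)    # empirical rejection rates; the WMW test typically comes out well ahead in this heavy-tailed case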
R. Clifford Blair discusses this debate in a historical context in this link.
https://digitalcommons.wayne.edu/jmasm/vol3/iss2/22/
Also worth looking up are papers by Shlomo Sawilowsky.
Robert Ryley ,
you say that, in the case of skewed distributions, the power of the WMW test is higher than that of the t-test. But the WMW test tests a different hypothesis than the t-test. How can you compare the power? Isn't this like comparing apples and peaches?
One might argue that both test the same hypothesis (zero expected difference) under the assumption that the distributions can only differ by a location shift. In this case the power of the WMW test can be (much) higher for skewed distributions. But if the distributions are skewed, the effect to test is almost never a pure location shift. I would be thankful for a single practical example of a variable with a skewed distribution where the relevant effect is a pure location shift.
The WMW tests stochastic ordering -- to what degree are values in group X greater than those in group Y? The real difference is that the parametric t-test makes a scale assumption (the data are interval or ratio), while the WMW only assumes that the data can be ordered.
We can always convert an effect size from one model to another by multiplying by the appropriate scale.
We can also see the decision result under both procedures conditioning on the same data set. That is how the simulation studies work. We specify a distribution, draw samples from that distribution, then see how power and alpha behave empirically.
I don't see the problem with comparing the procedures at all, and neither do the hundreds of papers that compare the 2 via simulation.
FWIW -- I think mean differences are used in circumstances where the WMW/proportional odds model would be a more appropriate choice.
Misconceptions Leading to Choosing the t Test Over the Wilcoxon Mann-Whitney Test for Shift in Location Parameter
https://digitalcommons.wayne.edu/coe_tbf/12/
Fermat, Schubert, Einstein, and Behrens-Fisher: The Probable Difference Between Two Means When σ_1^2≠σ_2^2
https://digitalcommons.wayne.edu/jmasm/vol1/iss2/55/
Article: Wilcoxon–Mann–Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules
The problem I see is the following:
P(X>Y) > 0.5 (WMW significant) and at the same time, for the same data, E(X-Y) > 0 (t-test significant). Now if you reject "some H0", what do you conclude?
PS: Just because something is said or written or done very often does not make it correct, and this is never a reason to give credit. How many stats books do you find where "probability" is defined as a limiting relative frequency? In how many is it written that failing to reject H0 means to accept H0? These things don't become correct just because they are repeated so often.
Jochen Wilhelm You wrote:
Quote:
P(X>Y) > 0.5 (WMW significant) and at the same time, for the same data, E(X-Y) > 0 (t-test significant). Now if you reject "some H0", what do you conclude?
In principle, shouldn't you pre-specify what test to use before seeing the data, or pre-specify a protocol for an adaptive/robust test?
In reality, there is never a reason to do both tests, as far as I can tell. If I were presented with the results of conflicting tests, I'd favor the WMW over the t-test. I'd also prefer to see the actual p-value rather than "reject/fail to reject". An actual estimate would be best.
If your argument is that these problems/questions are better placed in an estimation framework, you would be in excellent company.
I only objected to the idea that these procedures were not comparable because the hypotheses were different. Both procedures map scientific hypotheses to number systems (the reals for the t-test, the naturals for the WMW). Both systems share the ordering assumption, but differ on the idea that the scale is equally spaced.
Where you can do a t-test, you can also do a MWW. Do you disagree with this? I'm not exactly sure what your criticism entails.
The most you can conclude from your example (or any hypothesis test, for that matter) is that, from a frequentist perspective and based on the data, the researcher would reject the null model of no effect -- i.e. the data are compatible with the existence of a discernible effect.
What that "rejection" entails behaviorally is context sensitive, and outside the realm of the operating characteristics of the decision procedures.
Thank you Robert for your response.
From my point of view, testing is a minimalistic estimation problem. You test to see whether the data at hand are sufficient to estimate an effect with enough precision to compare the estimated effect to the hypothesized (tested) value H0. In a t-test this is very simple. You get an estimate of µ, xbar, which happens to be either > H0 or < H0. You want to know if the data allow you to conclude whether µ < H0 or µ > H0. If xbar is your estimate of µ, you must judge whether it is precise enough to be "clearly different" from H0. If the data are too noisy or too sparse, or the difference between xbar and H0 is too small (or any combination of these), you would not dare to decide (i.e. you "fail to reject H0"). But if the data provide sufficient information, you can say: "since xbar > H0 and xbar is estimated with sufficient precision, I can conclude that µ > H0" (similarly if xbar < H0, and similarly for two-sample tests, where the effect would be a mean difference [µ1-µ2]). The difference from a full estimation problem is just that we don't care what the actual precision is. It is "semi-quantitative", allowing us only to exclude one of the sides of H0: if the data are incompatible with H0 and xbar < H0, then any hypothesized value > H0 is incompatible with the data (so whatever the effect is, it must be < H0). A better picture is given by a confidence interval, which inverts the test and includes all hypothesized values that the test would not be able to reject.
[a proper estimation is not possible in a frequentist framework and needs to pin down a posterior]
So the test is based on some kind of (incomplete) estimation. And the t-test is based on estimating an expectation, whereas the WMW test is based on estimating a probability. What concerns me is that we can have Pr(A>B) > 0.5 and at the same time E(A) < E(B). I agree that the interpretation is "outside the realm of the operating characteristics of the decision procedures", but a sensible decision must be based on an understanding of what is being decided about, and to my understanding these are different things in the case of the t and WMW tests. I also think that demonstrating "some difference" without saying precisely what kind of difference (e.g. a difference in the expectation, a stochastic inequivalence, or something else) is often quite useless. If I have a drug or substance possibly affecting mortality, I want to know whether mortality is affected negatively. I need to define precisely what I mean by mortality and what I mean by a negative effect. This is clear when I focus on the expected (change in) mortality. It would also be clear when I focus on some quantile (change) or on the mode (even modality might be an interesting aspect). Having data that lets me reject the hypothesis that the distributions of mortality in an exposed and an unexposed population are exactly the same is of little practical use. Even the claim that a randomly selected person from population A is more likely to die than a randomly selected person from population B is usually not very relevant in practice.
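A toy numerical illustration of that concern (the two distributions are chosen purely to make the point):
x <- rep(1, 1000)                                             # X is always 1
y <- sample(c(0, 10), 1000, replace = TRUE, prob = c(.6, .4)) # Y is 0 with prob .6 and 10 with prob .4
mean(outer(x, y, ">"))    # about 0.6, so Pr(X > Y) > 0.5
mean(x) - mean(y)         # about -3, so E(X) - E(Y) < 0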
I know my examples are not perfect, but I hope they were not too stupid. Please feel encouraged to tell me where I am wrong.
Jochen Wilhelm:
I wish your perspective about testing being incomplete estimation was explained to me years ago. It would have saved me much time and effort of going through original papers and numerous texts to figure out why I never found reports of hypothesis tests all that persuasive. (I read a lot of psychological, medical, and rehabilitation studies).
I wholeheartedly agree that any researcher should carefully think about how the question of interest relates to well-known statistical methods.
Unfortunately, it seems that papers I find simply assume the naive normal theory methods (whether testing or regression approaches) are always appropriate. For applied problems, I'd say they are rarely appropriate.
The argument that "normal theory methods are robust to type I error" is only half the story. What about power? What happened to thinking hard about the Type II vs Type I error trade-off (from a frequentist perspective)? If you compare these procedures on a minimax basis, the mean is not robust at all (neither is the standard deviation), and the distribution-free estimate (the Hodges-Lehmann estimator, aka the pseudo-median) would be preferred even if there were just a 1% chance of having outliers from a mixture of normal distributions.
If the researcher is so confident in the normality assumption, why not just use the Bayesian version of the t-test?
This is all well-known statistical theory, but I don't see it often in the papers that I have access to.
I'd like to make the case that the nonparametric estimate -- the Hodges-Lehmann estimator (the median of pairwise differences for the two-sample problem) -- is more useful.
1. It remains valid for data on an ordinal scale or above. A WMW estimate on groups assessed on an ordinal scale still makes sense; t-tests on ordinal data do not (from an inferential POV).
2. For a location parameter when outliers exist, or there is skew, the mean is a poor estimate of location. The HL estimate would be preferred, even for interval and ratio data.
3. If you don't want to use robust or adaptive parametric estimation methods (that require peeking at data), nor do you want to specify some prior distribution (Bayesian methods), it is hard to object to the distribution free methods based on ranks for test, and pairwise differences for actual values.
They can be specified before seeing any data. They handle outliers well. The inference will have the promised frequency properties, and you will avoid the objection of "forking paths" that can be brought up with informal model fitting after the data have been collected.
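A sketch of the two-sample Hodges-Lehmann estimate described above (the example data are arbitrary):
set.seed(5)
x <- rlnorm(25); y <- rlnorm(30, meanlog = 0.5)
median(outer(x, y, "-"))                       # median of all pairwise differences x_i - y_j
wilcox.test(x, y, conf.int = TRUE)$estimate    # the Hodges-Lehmann estimate reported by wilcox.test, with a CI available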
In comparing these methods, Gottfried Noether in his (sadly out of print) intro to statistics based on nonparametrics wrote:
"Returning to the question of the investigator who wants to choose between the Wilcoxon and t-test, we can conclude that unless the investigator has reliable evidence that the conditions of the classical two-sample problem -- a shift model of 2 normal or near-normal populations apply, the Wilcoxon test offers much greater assurance of reliability than the t-test. Even if the assumptions for the two-sample problem are fully satisfied, statistical theory shows that the t-test is only marginally superior..."
Also worth reading:
Article: Elementary Estimates: An Introduction to Nonparametrics
.
Robert Ryley , kudos for the reference to Noether's book !
for the record :
Introduction to Statistics: The Nonparametric Way
Springer Texts in Statistics
by Gottfried E. Noether
.
One of the very few introductory books in statistics where you learn that there are better questions to be asked about your data than "is it gaussian ?" !
And it does not seem to be out of print ... but the price is ... well ... it's Springer, you know, the yellow ink for the cover must be very expensive, indeed.
Used copies are easily available anyway and affordable ; highly recommended !
.
Fabrice Clerot : He has a section on p value reporting where he discusses the Neyman-Pearson pre-data perspective vs. the Fisher post-data perspective. This alone would reduce a lot of confusion over so-called "hypothesis tests" which really only make sense in a quality control context.
It is amazing how far you can go with these simple rank based procedures that rely on combinatorics.
An interesting advanced exercise would be to study the Hogg-Fisher-Randall adaptive rank test, understand why it maintains its size (alpha) and how it relates to the "Garden of Forking Paths" critique that Andrew Gelman justifiably criticizes.
https://statmodeling.stat.columbia.edu/2016/02/17/youll-never-guess-what-david-cox-wrote-about-the-garden-of-forking-paths/
While the above discussion is interesting and useful, if one goes back to the original question ... " Can we use parametric tests for data that are not normally distributed based on the central limit theorem, especially if we have a large sample size? " ... one sees that it is rather more general than the simple univariate case currently being discussed. There are many tests for complicated questions that start from an assumption of multivariate normality (for example tests for eigenvalue decomposition of sample covariance matrices). The summary of robustness given above is incomplete even for the fairly standard regression situation, since I believe that the simulation results were that, while the t-test was fairly robust, F-tests were not.
One should also recall that the usual approach to maximum-likelihood testing (a "parametric test"), that yields an asymptotic chi-squared distribution, relies on theory using a central-limit theorem for the sum of contributions from individual observations.
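As a small sketch of that last point (the data and model here are illustrative): the likelihood-ratio statistic 2*(logLik_full - logLik_reduced) is referred to a chi-squared distribution, and that reference distribution itself rests on large-sample arguments.
set.seed(9)
d <- data.frame(x = rnorm(200))
d$y <- rpois(200, lambda = exp(0.1 + 0.3 * d$x))     # Poisson counts depending on x
full    <- glm(y ~ x, family = poisson, data = d)
reduced <- glm(y ~ 1, family = poisson, data = d)
anova(reduced, full, test = "LRT")                   # likelihood-ratio (deviance) test against chi-squared(1)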
Invariably it is presumed that data from a large sample follow a normal distribution. If a little bit of skewness exists, we can still carry out a parametric test; or else you can transform the data to a log scale and then perform parametric tests.
Let H0 be "the process is in control". Consider the case (Example 7.6 in Montgomery's book Introduction to Statistical Quality Control, 7th edition, Wiley & Sons) where he writes: "A chemical engineer wants to set up a control chart for monitoring the occurrence of failures of an important valve. She has decided to use the number of hours between failures as the variable to monitor." Here are the data (exponentially distributed), named lifetime:
286, 948, 536, 124, 816, 729, 4, 143, 431, 8,
2837, 596, 81, 227, 603, 492, 1199, 1214, 2831, 96
Montgomery and others conclude that H0 cannot be rejected… and is “plausible”!
You cannot invoke the Central Limit Theorem.
What do you do?
Following Nelson (1994), Montgomery solved this problem by transforming the exponential random variable to a Weibull random variable such that the resulting Weibull distribution is well approximated by the normal distribution: transformed data = (exponential data)^(1/3.6).
THEN Montgomery draws the control chart for the transformed data, which are well approximated by the normal distribution.
AND finds that H0 = ["the process is In Control"] does NOT have to be rejected: high "p-value"!
ACTUALLY the process is Out Of Control and H0 has to be Rejected!
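For readers who want to reproduce the transformed chart, here is a sketch of an individuals (I) chart on the transformed values; this is my reconstruction using the standard moving-range constant, not Montgomery's exact figure:
lifetime <- c(286, 948, 536, 124, 816, 729, 4, 143, 431, 8,
              2837, 596, 81, 227, 603, 492, 1199, 1214, 2831, 96)   # hours between failures, from above
y <- lifetime^(1/3.6)            # Nelson (1994) transformation used by Montgomery
cl    <- mean(y)                 # centre line
mrbar <- mean(abs(diff(y)))      # average moving range
ucl <- cl + 3 * mrbar / 1.128    # I-chart control limits (d2 = 1.128 for a moving range of 2)
lcl <- cl - 3 * mrbar / 1.128
any(y > ucl | y < lcl)           # no point plots outside the limits: the "in control" reading disputed above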
Dear: Parametric tests cannot be used as long as the normal distribution of data is not achieved, and there are other alternatives to the tests such as non-parametric tests
@ Haider Raid Talib
I do not agree that "Parametric tests cannot be used as long as the normal distribution of data is not achieved".
Haider Raid Talib Please provide a reference for this remarkable assertion.
The quality of the data always guides the process of analysis. In statistical reasoning and inference, it is important to keep to the assumptions and assertions, so that better decisions can be drawn.
Well, the whole conversation was interesting. Recently I have faced a similar kind of problem. My sample size is more than 750, but the skewness/kurtosis (sk) test shows a non-normal distribution. The variables have several outliers, but I shouldn't remove those. In that case, what should I do? A t-test or a non-parametric test? I am afraid that with a non-parametric test the result won't give the actual picture. Jochen Wilhelm
If you have an idea about a better distributional assumption than "normal", then use it. This remains important because it might be that estimating mean differences is not very meaningful for such data. Otherwise, if your model is not overly complex and you are sufficiently ignorant about the adequacy of the model, the sample size seems large enough to rely on the central limit theorem.
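One way to act on that advice, as a sketch (the Gamma family, the log link, and the data here are illustrative assumptions, not a recommendation for any particular data set): model skewed positive outcomes directly with a GLM instead of forcing normality.
set.seed(2)
d <- data.frame(group = rep(c("A", "B"), each = 100))
d$y <- rgamma(200, shape = 2, rate = ifelse(d$group == "A", 2, 1.5))   # skewed positive outcomes
fit <- glm(y ~ group, family = Gamma(link = "log"), data = d)
summary(fit)$coefficients     # the group coefficient is the log of the ratio of group means
exp(coef(fit)["groupB"])      # estimated ratio of means, with a Wald test in the summary above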