Some studies indicate that normality is not required for large sample sizes. What can be considered a large sample size? Any relevant literature is appreciated. I want to run a one-sample t-test on a sample of around 400 participants.
What counts as "large enough" depends on how severely the assumptions are violated and on the kind of violation. For instance, skew is the most severe kind of violation. A strongly skewed variable (e.g., log-normal or gamma) needs a much larger sample size than a variable with a very non-normal but symmetric distribution (e.g., a uniform or even a U-shaped distribution).
400 is rather large for a single sample, unless the distribution of the variable is extremely skewed.
I say "distribution of the variable" not "distribution of the sample". It is difficult to estimate the distribution of a variable from samples, and usually samples also need to be quite large to get a resonable estimation of the distribution of the variable from which the data are sampled. Again, 400 is a good sample size to estimate a distribution. This may hel you to identify an appropriate transformation to effectifely remove skew (if this is a problem).
Hello Adithya Bandari. Given the way you worded your question, I think you already know that the necessary normality condition for a one-sample t-test is that the sampling distribution of the mean is approximately normal. IMO, the population from which you are sampling would have to be extremely non-normal for the sampling distribution of the mean (with n = 400) to fail to approximate a normal distribution reasonably well. But it might help if you told us what the variable is, and what is already known about its (population) distribution. Thanks for clarifying.
PS- You can tinker around with online simulators like the one listed below to get a sense of what the sampling distribution of the mean looks like (for various population distributions) with n = 400.
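If you prefer code to an online applet, a small Python sketch of the same idea is shown below (the exponential parent distribution and the number of replications are arbitrary assumptions). It draws many samples of n = 400, takes the mean of each, and shows how much less skewed the sampling distribution of the mean is than the parent distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 400, 20_000

# Strongly skewed parent population (exponential, skewness = 2);
# replace with a model of your own variable if you have one.
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print("skewness of the parent population: 2.0 (exponential)")
print(f"skewness of the sampling distribution of the mean (n = 400): {stats.skew(means):.3f}")
```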
If the sample size is 400 participants, you must apply the z-test and not the t-test; this is due to the central limit theorem or the law of large numbers.
No, the dependent variable does not need to be normally distributed for the one-sample t-test. In fact, neither the dependent nor the independent variable needs to be normally distributed. The normality assumption applies to the distribution of the errors (Y_i − Ŷ_i).
Pablo Navarro, a one-sample z-test for a mean requires one to know the population SD. I see nothing in the original question suggesting that the population SD is known. Therefore, I assume that what you meant to say is that with n = 400, you could (if you wished, and if it was more convenient) use the standard normal distribution (rather than the t-distribution with df=399) when computing the p-value. (Notice I said could, not must.)
If that is what you mean, I agree that one could do that. But with many stats packages, it is much easier to just compute the one-sample t-test.
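As a quick numerical illustration of that point (a sketch; the test statistic value is a made-up example), the t-distribution with df = 399 and the standard normal give practically identical p-values:

```python
from scipy import stats

t_obs = 2.0   # hypothetical observed test statistic (x̄ - μ0) / (s / √n)
df = 399      # n = 400

p_t = 2 * stats.t.sf(abs(t_obs), df)    # p-value from t(399)
p_z = 2 * stats.norm.sf(abs(t_obs))     # p-value from the standard normal
print(f"two-sided p using t(399): {p_t:.5f}")
print(f"two-sided p using z:      {p_z:.5f}")
```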
Pablo Navarro, you leave me confused... could you explain why the t-test is not allowed for large samples? And what does this all have to do with the CLT or the LLN? I don't see any reason or connection here.
AFAIK, the t-test accounts for the uncertainty in estimating the variance from a sample. This uncertainty will be smaller in larger samples. It may be that the remaining uncertainty is negligible for really large samples, in which case the result of the t-test will still be correct.

The z-test requires knowing the variance. If the variance needs to be estimated from the sample, a z-test cannot even be performed. One could use the sample estimate as the given, known, true value of the variance, neglecting any remaining uncertainty. For large samples, this (strange) procedure would give results (almost) identical to the t-test, because there is only negligible uncertainty remaining.

If the variance is known (not estimated), using the t-test would be considerably disadvantageous (yet correct) for small samples, not for large samples, because the estimated variance may be considerably different from the known, correct value (which would be ignored by the t-test, whereas the z-test would use it).
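A small numerical sketch of this convergence (the sample sizes are arbitrary choices): the two-sided 5% critical value of t(n − 1) approaches the standard-normal critical value as n grows, so plugging the sample SD into a "z-test" and running a proper t-test give practically the same result for large n, while they differ noticeably for small n.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)              # two-sided 5% critical value, z-test
for n in (5, 30, 400):
    t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 5% critical value, t-test
    print(f"n = {n:4d}: t critical = {t_crit:.3f}   vs   z critical = {z_crit:.3f}")
```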
I am completely lost as to what the CLT or the LLN might have to do with this. It would be great if you could enlighten me.