There's a lot that can be said about the normality assumption for t-tests... But for now I'll just add a clarifying comment that for the paired t-test, it's the differences of the paired values that the normality assumption applies to, not the individual "groups".
Regarding normality, the necessary condition is that the sampling distribution of the statistic in the numerator is approximately normal. Sampling from a normally distributed population of raw scores (or paired difference scores in the case of the paired t-test) is a sufficient condition, but not a necessary condition.
You should be aware of the subtle difference between the t-test and the Wilcoxon test. Instead of testing the mean of the differences, the Wilcoxon signed-rank test addresses, I would say, the difference between the "positions" of the two distributions (the two samples). The logic behind it is that if an effect/difference exists, the position of the second sample shifts, and the test will detect that shift in the whole sample, not just in the mean. Fortunately, most of the time the Wilcoxon test does the job as a good alternative to the t-test. Sometimes it is even preferable because it is not sensitive to outliers.
Salah Bouabdallah , I think that's fair... But just to clarify, with the paired-sample Wilcoxon signed-rank test, the first step is to subtract the values of the paired observations, and the test is then carried out on that single set of differences. So if you think of the test as one of "position" or "location", it is a comparison of the set of differences to the "null" or "default" value.
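To make the comparison concrete, here is a small sketch using scipy (the data are invented for illustration); note that both tests receive the same pairs, and both effectively work on the within-pair differences:

```python
# Sketch: paired t-test vs. Wilcoxon signed-rank test on the same (made-up) data.
from scipy import stats

before = [12.1, 14.3, 11.8, 15.2, 13.5, 12.9, 14.8, 13.1]
after  = [13.0, 15.1, 12.2, 16.2, 14.0, 13.5, 15.5, 13.4]

# Paired t-test: tests whether the mean of the differences is zero.
t_res = stats.ttest_rel(after, before)

# Wilcoxon signed-rank test: also works on the differences internally --
# it ranks |after - before| and compares the signed rank sums.
w_res = stats.wilcoxon(after, before)

print(f"paired t:  p = {t_res.pvalue:.4f}")
print(f"Wilcoxon:  p = {w_res.pvalue:.4f}")
```

With a small sample and no tied differences, scipy computes the Wilcoxon p-value exactly rather than via a normal approximation.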
I'll just offer this quote from Rand Wilcox about relying on the central limit theorem.
Rand Wilcox, 2017, Modern Statistics for the Social and Behavioral Sciences. Section 7.3.4.
QUOTE
Three Modern Insights Regarding Methods for Comparing Means
There have been three modern insights regarding methods for comparing means, each of which has already been described. But these insights are of such fundamental importance that it is worth summarizing them here.
• Resorting to the central limit theorem in order to justify the normality assumption can be highly unsatisfactory when working with means. Under general conditions, hundreds of observations might be needed to get reasonably accurate confidence intervals and good control over the probability of a Type I error. Or in the context of Tukey's three-decision rule, hundreds of observations might be needed to be reasonably certain which group has the largest mean. When using Student's T, rather than Welch's test, concerns arise regardless of how large the sample sizes might be.
• Practical concerns about heteroscedasticity (unequal variances) have been found to be much more serious than once thought. All indications are that it is generally better to use a method that allows unequal variances.
• When comparing means, power can be very low relative to other methods that might be used. Both differences in skewness and outliers can result in relatively low power. Even if no outliers are found, differences in skewness might create practical problems. Certainly there are exceptions. But all indications are that it is prudent not to assume that these concerns can be ignored.
Despite the negative features just listed, there is one positive feature of Student's T worth stressing. If the groups being compared do not differ in any manner, meaning that they have identical distributions, so in particular the groups have equal means, equal variances, and the same amount of skewness, Student's T appears to control the probability of a Type I error reasonably well under nonnormality. That is, when Student's T rejects, it is reasonable to conclude that the groups differ in some manner, but the nature of the difference, or the main reason Student's T rejected, is unclear. Also note that from the point of view of Tukey's three-decision rule, testing and rejecting the hypothesis of identical distributions is not very interesting.
Hello Ali Azeez Al-Jumaili. As I said in my earlier post, the necessary normality assumption for a t-test is that the statistic in the numerator has a sampling distribution that is (approximately) normal. The shape of the sampling distribution depends on both the sample size and the shape of the underlying population of raw scores. If the population is perfectly normal, the sampling distribution will also be normal, even for n=1 (because it will be an exact copy of the population distribution in that case). For population distributions that are reasonably symmetrical, n=30 will likely be enough. But for other population shapes that are further from normal, n=30 may not be enough. The further from normal the population is, the larger n must be. Without knowing more about the population you are sampling from, it is impossible to say whether n=30 is sufficient in your case.
Here are a couple of simulators you can use to visualize how population shape and sample size interact to determine the shape of the sampling distribution.
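For readers who prefer code to an interactive simulator, here is a minimal numpy sketch of the same idea; the exponential population is just an arbitrary example of a strongly skewed distribution, and all settings are illustrative:

```python
# Sketch: how population shape and sample size interact to shape the
# sampling distribution of the mean.
import numpy as np

rng = np.random.default_rng(42)

def mean_sampling_skewness(draw, n, reps=20_000):
    """Sample skewness of the simulated sampling distribution of the mean."""
    means = draw((reps, n)).mean(axis=1)
    z = (means - means.mean()) / means.std()
    return (z ** 3).mean()

results = {}
for n in (5, 30, 300):
    # Strongly right-skewed population (exponential) vs. a normal one.
    results[n] = mean_sampling_skewness(lambda size: rng.exponential(1.0, size), n)
    normal_ref = mean_sampling_skewness(lambda size: rng.normal(size=size), n)
    print(f"n={n:3d}  skewness of the mean: exponential {results[n]:+.3f}, normal {normal_ref:+.3f}")
```

For the skewed population the sampling distribution of the mean is still noticeably skewed at n=30 and only approaches symmetry for much larger n, while for the normal population it is symmetric at every n.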
The distribution of the original variables is not an assumption of the paired t-test; what matters is the distribution of the differences (in the population). If you do not want to change your hypothesis (i.e., by doing something like a rank-based test), you could use bootstrapping.
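For example, here is a minimal sketch of a percentile bootstrap for the mean of the paired differences (the data are invented; this is one of several bootstrap variants):

```python
# Sketch: percentile-bootstrap confidence interval for the mean of the
# paired differences, avoiding the normality assumption on the raw scores.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired measurements.
before = np.array([12.1, 14.3, 11.8, 15.2, 13.5, 12.9, 14.8, 13.1])
after  = np.array([13.0, 15.1, 12.2, 16.2, 14.0, 13.5, 15.5, 13.4])
diffs = after - before

# Resample the differences with replacement and recompute the mean each time.
boots = rng.choice(diffs, size=(10_000, diffs.size), replace=True).mean(axis=1)
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"mean difference = {diffs.mean():.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```

If the interval excludes zero, that plays the role of rejecting the null hypothesis of no mean difference, without assuming normality of the differences.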
I mean the distribution of the estimated means in repeated sampling (as you mentioned, in simulation studies).
I think routine normality tests on simulated means would be possible.
But I am looking for a way when we do not have a simulation.
Edit:
I'll add the note that normality tests on many simulated means may be over-powered. Therefore, we should use a stricter significance level for rejecting normality, e.g., 0.001 instead of 0.05.
Seyyed Amir Yasin Ahmadi 1) but when do you have enough repetitions of a study that it would make sense to test for normality and 2) I would not rely on statistical tests to test for normality in the first place. Nothing will be perfectly "normal", so you are right about the power. Therefore, use visual inspection tools for example (and not only for the sampling distribution).
Your bootstrapped statistics do have a distribution! And you'd look at the distribution just to see if it has odd peaks (e.g., with small-n discrete data) and things like that, but wouldn't test for normality. You might alter your process to take this into account (see the different types in Efron and Tibshirani [and elsewhere]).
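To illustrate the "odd peaks" point: with a small discrete sample, the bootstrap distribution of the mean can only take a handful of values, so its histogram is lumpy rather than smooth (a toy sketch with invented data):

```python
# Toy sketch: bootstrap distribution of the mean for a small discrete sample.
import numpy as np

rng = np.random.default_rng(1)

data = np.array([0, 0, 1, 1, 1, 2])           # small sample of discrete scores
boots = rng.choice(data, size=(10_000, data.size), replace=True).mean(axis=1)

# The bootstrap means can only be multiples of 1/6 between 0 and 2, so the
# distribution shows visible discrete peaks rather than a smooth bell shape.
print("distinct bootstrap means:", len(np.unique(boots)))
```

A quick look at such a histogram tells you far more about whether the bootstrap is behaving sensibly than any formal normality test would.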
You should recall that significance tests involve two steps: (i) choosing a test statistic; (ii) finding the probability distribution of that statistic under the null hypothesis.
For a paired-test with any sample-size, you can get an exact test without assuming normality (or any distribution), by doing a permutation test. For that, the null hypothesis is that the items in each pair could equally-well be swapped without affecting the chances of that outcome being observed. For a small sample-size, you can get the exact distribution by doing a full enumeration of all possibilities: otherwise a random subset of possible permutations can be used. Plotting a histogram of the null distribution will be revealing; particularly if the sample-size is small or if distributions of the underlying variables are unusual in some way.
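A minimal sketch of such a full enumeration for a small sample (the differences are invented; for larger n one would draw a random subset of sign patterns instead of enumerating all of them):

```python
# Sketch: exact paired permutation (sign-flip) test for the mean difference.
# Under H0, the sign of each within-pair difference could equally well be
# flipped, so we enumerate all 2^n sign patterns (feasible for small n).
from itertools import product

diffs = [0.9, 0.8, 0.4, 1.0, 0.5, 0.6, 0.7, 0.3]  # hypothetical paired differences
observed = sum(diffs)

count = 0
total = 0
for signs in product((1, -1), repeat=len(diffs)):
    total += 1
    stat = sum(s * d for s, d in zip(signs, diffs))
    # Two-sided: count permutations at least as extreme as what we observed
    # (small tolerance guards against floating-point rounding).
    if abs(stat) >= abs(observed) - 1e-12:
        count += 1

p_exact = count / total
print(f"exact two-sided permutation p-value = {p_exact:.5f}")
```

Collecting the permuted statistics and plotting their histogram gives exactly the null-distribution picture described above.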
Permutation-tests are subtly different from bootstrap-testing in terms of the assumptions being made and the algorithms being used. The former have the advantage of potentially providing an exact test, while the latter have the advantage of being more widely applicable.
However, modern thinking is that one should attempt to provide a confidence interval for some type of effect-size, rather than just doing a simple test. This must involve some thinking about modelling any potential effect. If you are worried about the assumptions being made for the null hypothesis, you should be even more worried about those required for confidence intervals, as they are rather more structural. A practical starting point is to look at the plots you will have done as part of examining the data: in particular, look at a scatter plot showing, for each pair, the first value plotted against the second. Thinking about such a plot would also help you decide whether the t-test-statistic you are starting with is appropriate for testing for whatever type of potential effect seems reasonable to consider testing-for.