Let’s say that I have one sample (of size n) of random variables. I want to perform a one-sample t-test for the following hypotheses:

  • H0: population mean mu = 0;
  • HA: population mean mu > 0 (one-tailed test);

According to the Central Limit Theorem, I know that my sample mean Xbar comes from an *approximately* normal distribution with mean mu (the population mean) and standard deviation equal to the standard error (SE). Since I don’t know the population standard deviation, I use a t-distribution instead, with SE = s/sqrt(n), where s is the sample standard deviation.
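To make that concrete, here is a quick simulation sketch (the exponential population, sample size, and seed are just illustrative choices, not part of my actual data): even for a clearly non-normal population, the sample means cluster around the population mean with spread close to sigma/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw many samples from a skewed (exponential) population and
# look at the distribution of the sample means.
n, n_sims = 50, 10_000
sample_means = rng.exponential(scale=1.0, size=(n_sims, n)).mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f}")  # ~ population mean (1.0)
print(f"std of sample means:  {sample_means.std():.3f}")   # ~ 1/sqrt(50) = 0.141
```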

Then I calculate the t statistic as (Xbar - mu_0) / SE, where mu_0 is the mean under the null hypothesis; this corresponds to a t score on that t-distribution. My p-value is then the probability of obtaining a t statistic >= the observed t score, given the null value mu_0 (0 in my example).
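A minimal sketch of that calculation (the data here are simulated purely for illustration; `scipy.stats.ttest_1samp` with `alternative="greater"` should give the same one-tailed result):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.0, size=40)  # hypothetical sample

n = len(x)
se = x.std(ddof=1) / np.sqrt(n)         # SE = s / sqrt(n)
t_stat = (x.mean() - 0.0) / se          # (Xbar - mu_0) / SE, with mu_0 = 0
p_value = stats.t.sf(t_stat, df=n - 1)  # P(T >= t) under H0, one-tailed

# Cross-check against SciPy's built-in one-sample t-test
t_sp, p_sp = stats.ttest_1samp(x, popmean=0.0, alternative="greater")
print(t_stat, p_value)  # should match (t_sp, p_sp)
```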

I have seen people take different approaches:

  • Run a normality test (like the Shapiro–Wilk test) on the sample. If the test rejects the hypothesis that the sample is normally distributed, run a non-parametric test instead (like the Wilcoxon signed-rank test; see the sketch after this list). This doesn’t make much sense to me, since I don’t think we care that much about the sample distribution (nor the population distribution, since the CLT applies to any(?) population distribution);
  • Check whether n > 30, and if so run a t-test, assuming that the normality assumption holds. This makes some sense, but 30 looks like an arbitrary number;
  • Check whether the sample distribution is symmetric (no extreme values): if it is, run a t-test; if not, run a non-parametric test. This also makes some sense. But the question here is: if I have outliers in the data, can I simply exclude them from the analysis?
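For reference, the first workflow above would look roughly like this (a sketch with simulated data; the skewed lognormal sample and the 0.05 threshold are arbitrary illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.5, size=25)  # hypothetical skewed sample

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution.
w, p_norm = stats.shapiro(x)

alpha = 0.05
if p_norm > alpha:
    # No evidence against normality -> one-sample t-test
    res = stats.ttest_1samp(x, popmean=0.0, alternative="greater")
else:
    # Fall back to the Wilcoxon signed-rank test (which tests a
    # symmetric location shift around 0, not the mean itself)
    res = stats.wilcoxon(x, alternative="greater")

print(res)
```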

So the question is: do I really need to check the distribution of the sample? And if so, why?

(I included an image just to illustrate my point.)
