Please answer "true," "false," or "it depends," and then explain. (No, this is not a test! I am just interested in people's responses.) Thanks for your time! Tom
No. It can be shown that effect size estimates vary dramatically in small samples (Schönbrodt & Perugini, 2013) and that small samples in particular tend to overestimate the true effect size, whereas larger samples are more conservative (Loken & Gelman, 2017). Therefore, this would not be very convincing to me.
Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584-585.
Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? Journal of Research in Personality, 47(5), 609-612.
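To make the Schönbrodt & Perugini point concrete, here is a minimal simulation sketch in Python; the true effect of d = 0.3, the group sizes, and the number of simulations are purely illustrative, not taken from either paper:

```python
# Minimal simulation sketch: variability of Cohen's d estimates in small vs. large samples.
# Assumes a true standardized mean difference of d = 0.3; all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(42)
true_d = 0.3
n_sims = 10_000

def simulate_d(n_per_group):
    """Estimate Cohen's d from two normal samples of size n_per_group."""
    estimates = []
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_d, 1.0, n_per_group)
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        estimates.append((b.mean() - a.mean()) / pooled_sd)
    return np.array(estimates)

for n in (10, 50, 250):
    d_hat = simulate_d(n)
    print(f"n per group = {n:4d}:  mean d = {d_hat.mean():.2f},  SD of d = {d_hat.std():.2f}")
# The spread of the estimates shrinks markedly as n grows, i.e., small-sample
# estimates scatter widely around the true value.
```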
No, large samples tend to better capture the probability distribution underlying the test. A t-test assumes certain parameters, and, as mentioned above, effect sizes and distributions might vary substantially in small samples (e.g., non-normal distributions).
In such a scenario, a better approach might be to use non-parametric tests, which are designed for testing mean differences when you can't assume anything about the distribution of a sample. The Mann-Whitney test comes to mind.
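For what it's worth, a minimal sketch of how both tests could be run with SciPy; the skewed toy data are made up purely for illustration:

```python
# Illustrative comparison of Welch's t-test and the Mann-Whitney U test on made-up data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.exponential(scale=1.0, size=12)   # small, skewed sample
group_b = rng.exponential(scale=1.6, size=12)

t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)          # Welch's t-test
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t-test:   t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {u_p:.3f}")
```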
No. The common assumptions made when doing a t-test include those regarding the scale of measurement, random sampling, normality of the data distribution, adequacy of sample size, and homogeneity of variance. Mr Lucas Monzani and Mr Rainer Duesing have already pointed out the issue regarding the small sample size and how the effect sizes and distributions are affected. In such a scenario, I agree with the solution (a non-parametric test like the Mann-Whitney test) proposed by Lucas Monzani.
Mosharop Hossian and Lucas Monzani, I really do not know why you both suggest using a non-parametric test??
1) If you explicitly want to know the mean(!) difference of two samples, the Mann-Whitney test cannot tell you this, because it measures the stochastic superiority of one group over the other. It gives no information about the means themselves. If you make the additional assumption that both distributions have the same shape (differing only by a shift) and are continuous, you can say that a significant result may be interpreted as evidence for a shift of medians.
2) Depending on the underlying distribution, the Mann-Whitney test can have lower power than a t-test.
Therefore, I view the suggestion critically, since the Mann-Whitney test does not test the hypothesis you are typically interested in, sometimes has lower power, and is also not assumption-free.
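Regarding point 2), a rough simulation sketch of the power comparison under normally distributed data; the true d = 0.5, n = 20 per group, and alpha = .05 are assumed values for illustration only:

```python
# Rough simulation sketch: power of the t-test vs. the Mann-Whitney U test under normal data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_d, n, alpha, n_sims = 0.5, 20, 0.05, 5000
t_hits = mw_hits = 0

for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        t_hits += 1
    if stats.mannwhitneyu(a, b, alternative="two-sided").pvalue < alpha:
        mw_hits += 1

print(f"t-test power:       {t_hits / n_sims:.2f}")
print(f"Mann-Whitney power: {mw_hits / n_sims:.2f}")
# Under normality the Mann-Whitney test typically loses a little power relative to
# the t-test; under other distributions the ranking can reverse.
```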
I would add a different approach. Absolute effect size, P value and sample size are the sides of a triangle.
Absolute effect size: the difference between H0 and the point estimate under H1 (for the t-test, this is simply the mean difference).
P value: as a "practical" definition for the t-test, it is the probability of the H1 estimate lying on the other side of H0 (×2 for a two-tailed P value), i.e., ×2 the area of the ~100% CI around the H1 estimate that crosses H0.
A higher sample size results in a narrower CI around the H1 point estimate, and vice versa. Considering the triangle, more samples are needed to detect a significant association when the absolute effect size is smaller. Although lower sample sizes mean lower power to detect the difference as significant, very large sample sizes are not necessarily better.
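As a quick illustration of the triangle (with made-up numbers: an observed difference of 0.4 and a common SD of 1), the 95% CI of the mean difference narrows roughly with the square root of the sample size:

```python
# Sketch: width of the 95% CI for a mean difference as a function of sample size,
# holding the observed difference and SDs fixed (illustrative numbers only).
import numpy as np
from scipy import stats

mean_diff, sd = 0.4, 1.0   # assumed observed difference and common SD

for n in (10, 40, 160, 640):          # n per group
    se = sd * np.sqrt(2 / n)          # standard error of the difference
    t_crit = stats.t.ppf(0.975, df=2 * n - 2)
    half_width = t_crit * se
    print(f"n per group = {n:4d}:  95% CI = "
          f"[{mean_diff - half_width:.2f}, {mean_diff + half_width:.2f}]")
# Quadrupling n roughly halves the CI width, so small effects need large samples
# before the interval excludes zero.
```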
Thank you for your responses! Now, how would you address this opposing argument: "Yes, if the t-test is significant for a smaller N this is evidence that the effect size in the population must be large. As sample size increases, so does the statistical power to detect a small effect size. The converse of this mathematical fact is that small sample sizes can only detect large effect sizes."
Positive: A significant result in a small sample is very good, because it shows that the effect was large enough for our hypothesis to be supported even with a small sample size.
Negative: Even a large absolute effect size (from a small sample) has a wide CI that crosses the smaller absolute effect sizes obtained from larger samples.
Thomas E. Becker I thought I already addressed this. The quote (and by whom is it, btw?) "Yes, if the t-test is significant for a smaller N this is evidence that the effect size in the population must be large." is simply wrong, as I demonstrated with my two references. The quote would be true if we could rely on the estimated(!) effect size in small samples reliably representing the true effect size in the population. But as demonstrated, this is not the case; more or less the opposite is true. Not only do estimated effect sizes vary very strongly in small samples (Schönbrodt & Perugini), but they also tend to OVERestimate the true effect size (Loken & Gelman), as opposed to large samples, which are more conservative and tend to underestimate it. Therefore, I would expect to find larger ES in small samples by chance and hence get a significant result. That is also why you do not test after each participant, continuing if it is not significant and stopping if it is! Also, the whole thing does not take into consideration the "Smallest Effect Size Of Interest" (SESOI) or the "Region Of Practical Equivalence" (ROPE), but that is another topic.
The quote would only be true if, ceteris paribus, the estimated effect size in small samples were reliable and stayed constant with increasing sample sizes, which is not the case. But as a thought experiment: if that were the case, i.e., if small samples estimated the true effect size as reliably as large samples, why not be maximally efficient and only sample 2 data points? Do you see the problem?
Edit: I have to slightly correct my answer. It would not be "no evidence" for a large effect size in the population; it would simply not be good or convincing evidence, imho.
Thomas E. Becker, I had to face a similar issue with a paper where we conducted a field experiment and detected a moderate to large effect size in a relatively small sample. We had to halt data collection due to COVID-19, so increasing the sample size was not an option.
Of course, reviewers raised the "small N" problem in the first round of reviews. Given that I did not have the level of psychometric understanding and proficiency that Rainer Duesing demonstrates in these posts, I tried an alternative (and hopefully innovative) approach to deal with this issue.
Thanks to a great suggestion by the editor, I conducted a second-order meta-analysis of existing first-order meta-analyses on the topic to determine the corrected effect size across the sample of studies. In that way, I could provide evidence that the large effect size we identified in my reference group was consistent with prior findings in the literature.
In my case, that was enough to convince the reviewers, with the added value that we summarized the prior empirical findings in that field's literature. While probably not the best possible solution from a psychometric perspective, it was good enough in the context of a global pandemic. I attached the article in case it is of anyone's interest (and it is open access too!).
Rainer and Lucas: Thanks for the great insights. Now, let me push the opposing argument a bit further, with no implication that I accept the argument:
"The idea that small sample sizes are less reliable is true but is accounted for by the larger confidence intervals (and the corresponding critical value for the t-test) that come with smaller sample sizes. So, if the t-test for a small sample is significant even with the wider confidence intervals (and larger critical value), then the effect size in the population must be quite large - at least large enough that a small sample can detect it."
(Rainer: The quotes represent the viewpoint of a hypothetical reviewer. I am providing them for rhetorical purposes and to emphasize that I am not necessarily advocating the argument.)
"The idea that small sample sizes are less reliable is true but is accounted for by the larger confidence intervals (and the corresponding critical value for the t-test) that come with smaller sample sizes. So, if the t-test for a small sample is significant even with the wider confidence intervals (and larger critical value), then the effect size in the population must be quite large - at least large enough that a small sample can detect it."
The key parts of this quote are exactly what I mentioned before:
"Negative: Even a large absolute effect size has a wide CI crossing the smaller absolute effect sizes obtained from larger sample sizes."
Thomas E. Becker As Rainer mentioned, effect size estimates from small samples are more likely to exaggerate the true effect size. For example,
I measure the heights of 3 WNBA players (M = 72 inches) and 3 men selected at random (M = 66 inches; lower because one of the three men happened to be particularly short, say, 5'4"). The effect size in my small sample is quite large (a 6-inch difference). However, the average male height in the U.S. population is 69 inches, and the average height of a WNBA player is approx. 71 inches.
If I had measured both populations, I would find that the true effect size is considerably smaller (2 inches). Because my sample is so small, my effect size estimate exaggerates the true effect size in the populations (referred to as a Type M error).
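If it helps, here is a rough simulation sketch of the same "Type M" idea; the true difference of 2 units, SD of 3, and n = 5 per group are made-up numbers. Conditional on obtaining a significant t-test, the estimated difference exaggerates the true one:

```python
# Sketch of a "Type M" (magnitude) error check: among *significant* small-sample results,
# how much does the estimated effect exaggerate the true one? Numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_diff, sd, n, n_sims = 2.0, 3.0, 5, 20_000   # assumed true difference of 2 units
signif_estimates = []

for _ in range(n_sims):
    a = rng.normal(0.0, sd, n)
    b = rng.normal(true_diff, sd, n)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        signif_estimates.append(b.mean() - a.mean())

exaggeration = np.mean(signif_estimates) / true_diff
print(f"Share significant: {len(signif_estimates) / n_sims:.2f}")
print(f"Mean estimate among significant results: {np.mean(signif_estimates):.2f} "
      f"(exaggeration factor ~{exaggeration:.1f}x)")
```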
I understand, Blaine, but the argument is not that one can accurately estimate population effect sizes from small samples. It is that a significant effect of a test (e.g., a t-test) based on a small sample must mean that the effect size in the population is large. Otherwise the test would not have been significant. To Rainer's earlier example, if N = 2, the effect size in the population would have to be enormous for a t-test based on the sample to be significant. (See above regarding the relation between sample size, confidence intervals, and critical values of a statistic such as t.)
Thomas E. Becker I'm well aware of the relationship among sample size, confidence intervals, and critical values. My example clearly illustrates why the argument that "a significant effect of a test (e.g., a t-test) based on a small sample must mean that the effect size in the population is large" is incorrect.
"...otherwise the test would not have been significant"
There seems to be so much wrong with the logical reasoning of your (imaginary) reviewer. His statements are all deterministic ("if A… then B MUST follow"), but this is simply not true, and that is a fact. It is about probabilities, and in frequentist statistics especially about the error rate in the long run, not about the sample at hand. And that is an important point: you have a sample, and the t-statistic is a sample statistic; it is NOT about the population. Therefore, the correct statement would be: "The effect size in the sample, with a given sample size, was large enough that the test statistic was considered 'significant' according to the conventions." And strictly speaking, we do not know anything about the population effect size at all. The t-test (like other NHST procedures) evaluates the probability pr(Data | H0), i.e., how probable the data at hand are, given that the H0 is correct. So, since we already assume that the H0 is correct (and in the simplest form we are talking about here, effect size = 0, i.e., no difference, no correlation), our test shows how often we can expect a result like ours or more extreme if the H0 is correct. This directly implies that the true effect size may be 0 while our sample effect size deviates from it! Your t-test result is compatible with the effect in fact BEING 0 AND your large sample effect size occurring nonetheless. Edit: in fact, every effect size may occur under the H0, only with different probabilities, but all of them are > 0, non-zero.
The next thing to consider, which has only been mentioned indirectly: even if we use the senseless dichotomy of "significant vs. not significant" as the only criterion for decision making, we can consider confidence intervals. They show which range of values around the sample effect size can be considered compatible with it and which values outside that range cannot. In small samples we will see that this range is typically very large, i.e., a lot of values are compatible with the sample effect. If the lower bound is not far away from your null value, this would not be very convincing either. Therefore, it has been proposed not only to test a point null hypothesis, but to declare a range of values which are considered equal to null. For this you should use the SESOI I mentioned above, and you may test it with the two one-sided tests (TOST) procedure, or with the mentioned ROPE in the Bayesian framework. This would also help to disentangle "statistical significance" and "practical relevance". Why is it not done? Maybe Cohen (1994) was right:
““Everyone knows” that confidence intervals contain all the information to be found in significance tests and much more. . . . Yet they are rarely found in the literature. I suspect that the main reason they are not reported is that they are so embarrassingly large!”
The last point to consider: why is there a small sample size at all? If it is the result of poor sampling, maybe because there were not enough participants available, I would cast even more doubt on the results. On the other hand, maybe the researchers ran an a priori power analysis and expected a large effect size, or declared that all smaller effect sizes are of no practical relevance (SESOI), and calculated the small sample size from that. This would be a little more convincing to me, BUT it still does not solve the unreliability of small samples. Coming back to the original point from above: a significant result does not take into account the type II error, i.e., 1 - power; therefore, if the large effect size in the sample was a fluke, you will not be able to reproduce the result in upcoming studies. To calculate power, you need to determine the effect size a priori, not post hoc from the sample effect size. Post hoc power analyses of significant results are as useless as mindless p-value reporting. If the result was significant, we already know that the power was large enough for the sample effect size; it is nothing more than a transformation of the same information. So if this would be considered an argument, it is not one.
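A small sketch of why post hoc ("observed") power adds nothing: under a normal approximation, it is a deterministic function of the observed p-value alone (the alpha of .05 and the example p-values are illustrative):

```python
# Sketch of why "post hoc power" is redundant: under a normal approximation, the
# observed power is a deterministic function of the observed p-value alone.
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

for p in (0.20, 0.05, 0.01, 0.001):
    z_obs = stats.norm.ppf(1 - p / 2)          # |z| implied by the two-sided p-value
    observed_power = (stats.norm.cdf(z_obs - z_crit)
                      + stats.norm.cdf(-z_obs - z_crit))
    print(f"p = {p:5.3f}  ->  'observed power' = {observed_power:.2f}")
# p = .05 maps to an observed power of about 0.50; reporting it is just a
# re-expression of the p-value, not new information.
```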
And to repeat it again, the t-test is a) a test of the sample not the population and b) a test of compatibility with a range of values (the CI) and not an estimate for the reliability of the effect size.
As said before, "significant" effects in small samples are not "no evidence", but bad or at least not convincing evidence.
I also found a nice conclusion in Greenland et al. (2016):
"Any opinion offered about the probability, likelihood, certainty, or similar property for a hypothesis cannot be derived from statistical methods alone. In particular, significance tests and confidence intervals do not by themselves provide a logically sound basis for concluding an effect is present or absent with certainty or a given probability. This point should be borne in mind whenever one sees a conclusion framed as a statement of probability, likelihood, or certainty about a hypothesis. Information about the hypothesis beyond that contained in the analyzed data and in conventional statistical models (which give only data probabilities) must be used to reach such a conclusion; that information should be explicitly acknowledged and described by those offering the conclusion."
It is not about small samples per se, but the point "Information about the hypothesis beyond that contained in the analyzed data..." may be interpreted in that direction. Information about the sample (small, biased, or whatever else) needs to be incorporated to draw conclusions, and small samples do not help much to draw strong conclusions.
Their points 13, 18, and 20 may also be of interest.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337-350.
P.S.: I acknowledge that it is not your own standpoint you are representing here, so please do not take my answer as an offense.
Upon further reflection, the revised argument I proposed at the end of my previous comment doesn't even work. Specifically, the term "implies" is inappropriate. Instead, it needs to be:
"Assuming an effect truly exists in the population and the sample is representative of the population, a significant effect of a t-test based on a small sample suggests the effect is large in the population."
Why "suggests"? Because no inferential statistical test contains 0 error. Even using a representative sample and a sufficiently large sample does not eliminate the chance of a decision error entirely. A Type I error is still a Type I error, even if I had only a 1 in a million chance of committing it in my study.
Thanks to all for your extremely thoughtful responses, references, and examples. I agree with many of the points and doubt that they are widely known or understood. Warm wishes and happy researching! Tom