Is there a minimum number of data points that needs to be reached in each cell to ensure the validity of an ANOVA's results in behavioral or educational research?
You want to have enough data to achieve a certain power for your test. Power should be above 80%.
The power of your test depends upon the variability of the data, the differences between groups, what you are measuring, and a lot of other things. For example, if you had 3 groups with averages of 5, 25, and 45 and your standard deviation was 1, you could get away with 2-3 samples per group and have good power. If your differences are 40% vs 46%, you need about 200 samples per group.
Statistical power is essentially the probability of getting the right answer. The alpha value, usually 0.05 or 5%, is the probability that a difference you see is true. Power is the probability of finding a difference if it exists.
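To make the kind of a priori power calculation Andrew describes concrete, here is a minimal sketch in Python using statsmodels. The hypothesized group means, common SD, alpha level, and 80% power target are all illustrative assumptions, not values reported anywhere in this thread:

```python
# Hedged sketch of an a priori power calculation for a one-way ANOVA.
# All inputs below are hypothetical placeholders.
import numpy as np
from statsmodels.stats.power import FTestAnovaPower

hypothesized_means = np.array([5.0, 5.5, 6.0])  # guessed group means
hypothesized_sd = 1.0                           # guessed common within-group SD
k = len(hypothesized_means)

# Cohen's f: SD of the group means around the grand mean, divided by the
# common within-group SD
effect_f = np.sqrt(np.mean((hypothesized_means - hypothesized_means.mean()) ** 2)) / hypothesized_sd

analysis = FTestAnovaPower()
n_total = analysis.solve_power(effect_size=effect_f, alpha=0.05,
                               power=0.80, k_groups=k)
print(f"Cohen's f = {effect_f:.2f}, total N ≈ {np.ceil(n_total):.0f}, "
      f"per group ≈ {np.ceil(n_total / k):.0f}")
```

Changing the guessed means or SD changes the answer substantially, which is exactly the point debated in the rest of this thread.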
I don't agree that it is as simple as Andrew explained.
Or it is even simpler.
"Power" is a concept that is rarely sensible in academic research. It is very typical that a researcher has no idea what a "minimum relevant effect size" may be, and how the investments for conducting the research and wins of correctly posivive findings can be balanced. There is no sensible way to value wins and losses in academic research, and hence there is no sensible way to set significance level and power. (the convention of using 5% as significance level is often more harmful than beneficial, though)
I further must note that "power" is not the probability of the right answer, not even "essentially". And neither is alpha (the level of significance) the probability that a difference is "true". Both power and alpha have only a frequentistic interpretation: they are probabilities of data (more precisely: of "more extreme test statistics calculated from data") what can not simply be translated to a "probability of an effect (or hypothesis)". A p-value of 0.01 tells you that -if H0 is true (what we do not know)- such or more extreme statistics would be expected in about 1% of similar studies.If I reject H0 always when p
Boy, these are deep waters we're treading in. I don't disagree with either Andrew or Jochen, but I will respond to this in a slightly different manner. You ask about cell size in ANOVA, but that does not tell us how many parameters you have. The more parameters, the bigger the cell sizes need to be (and I am not sure what you are talking about when you say "cell sizes", but I am assuming you mean a sample size that has adequate presence across the distribution of covariates)... In any case, without more information, we're all arguing for sufficient sample sizes with sufficient distribution across all covariates.
I agree that the number of parameters is a key consideration. In order to get an insight into the sample size you must have some information on the residual variation, i.e. after fitting the model. If you prefer not to rely on power calculations, then another option is to construct confidence intervals based on the intended model and an assumed variance. Perhaps the easiest way to do this is to simulate some data using the assumed variance, fit the model, and calculate the confidence intervals for the means of interest. However, this method (and power calculations) depends on having some idea of what the results will look like, and this information may not be available a priori.
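A minimal sketch of this simulation route, assuming a one-way ANOVA with three groups; the group means, residual SD, and candidate cell size below are purely hypothetical placeholders:

```python
# Simulate one data set under an assumed model, fit the one-way ANOVA,
# and compute confidence intervals for the group means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

assumed_means = [5.0, 5.5, 6.0]   # hypothetical group means
assumed_sd = 1.2                  # guessed residual standard deviation
n_per_group = 20                  # candidate cell size to evaluate

# Simulate one data set under the assumed model
data = [rng.normal(m, assumed_sd, n_per_group) for m in assumed_means]

# One-way ANOVA F-test on the simulated data
f_stat, p_value = stats.f_oneway(*data)

# 95% confidence interval for each group mean, using the pooled residual SD
n_total = n_per_group * len(data)
df_resid = n_total - len(data)
pooled_var = sum(((g - g.mean()) ** 2).sum() for g in data) / df_resid
se_mean = np.sqrt(pooled_var / n_per_group)
t_crit = stats.t.ppf(0.975, df_resid)

for i, g in enumerate(data):
    lo, hi = g.mean() - t_crit * se_mean, g.mean() + t_crit * se_mean
    print(f"group {i}: mean={g.mean():.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
print(f"F={f_stat:.2f}, p={p_value:.3f}")
```

Repeating the simulation with different candidate cell sizes shows how wide the intervals are likely to be for each choice of n.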
As you can read in the previous answers: specify your design.
Note: the required sample size is affected by:
- degrees of freedom: the number of cases compared to the number of parameters (see Ariel Linden too)
- adequate representation of the population; to me it seems impossible to represent the population well with, let's say, 10 cases (depending on the population size, of course)
- power.
If you focus on 'validity', power is the least important of these issues.
Jochen, what if he has to pay a lot of money for each participant? I think there are many instances where the question of the minimum sample size is a reasonable question to ask, even if the answer is complicated.
I also think that power analysis is absolutely the answer, as Andrew indicated, and that power analysis is extremely sensible. The issue of not knowing the effect size is easily side-stepped by performing the power analysis for a range of different hypothetical effect sizes (see the sketch after this post).
Jochen, I think there is a strong case to be made that null hypothesis significance testing is silly, and I think that's your point (as well as correcting Andrew's loose wording), but don't you think that's a confusing point to make in answer to this question?
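Alan's side-step (running the power analysis over a range of hypothetical effect sizes) might look like the following sketch; the Cohen's f grid, the 3-group design, and the 80% power target are assumptions for illustration only:

```python
# Required per-cell sample size across a range of hypothetical effect sizes
# for a one-way ANOVA with 3 groups (all numbers are illustrative).
import numpy as np
from statsmodels.stats.power import FTestAnovaPower

analysis = FTestAnovaPower()
k_groups = 3  # hypothetical one-way design with 3 cells

for f in (0.10, 0.25, 0.40):  # Cohen's conventional "small", "medium", "large" f
    n_total = analysis.solve_power(effect_size=f, alpha=0.05,
                                   power=0.80, k_groups=k_groups)
    print(f"f = {f:.2f}: about {int(np.ceil(n_total / k_groups))} per cell")
```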
These are all excellent posts. Let me just say that we are all talking about the same ultimate issue, but we're all coming at it from different angles. Ultimately, the question of "how big a sample do I need (in each cell)" is directly related to power (after all, they are both components of the same formula). What affects power is sample size (and vice versa), but power is also a direct function of how many parameters are in the model (which in turn is a function of sample size and distribution). So far we're all on the same page.
Alan brings up an equally important - and related - issue, which has to do with finding the smallest sample size needed to achieve the desired power to detect an effect. This is a matter of planning the study given the funding (and other) constraints.
Again, all these are valuable and complementary responses.
Alan, thank you for your critical response. I really appreciate it.
Note that I mentioned two alternative routes to answer the question:
route 1: arguments only within statistical considerations. Here the power is the only "solution". But I put it in quotes to emphasize that this is a 'procedural' solution (it defines the problem in a statistical sense and defines the procedure to solve this so-defined problem - but it is not the actual solution). The key problem is only shifted towards the question of what a minimum relevant effect size might be and what an adequate power might be to detect such an effect. Such questions can be answered in a sensible way only by having some measurable costs and benefits. At least one of those can usually not be quantified in academic research settings. Hence my point: there is a statistical solution, which turns out to be NOT a practical solution because it shifts the important questions to a place where they still cannot be answered.
route 2: arguments only within practical considerations. What is the aim, what is the fundamental reason for this question? I think it's the publication of research findings. Publishability is not provided by "scientific quality" per se but by the understanding and recognition of such quality (or the lack thereof) by reviewers. In the end, papers get published that convince the reviewers. There is no need to claim 90% power for some minimum relevant effect (again the question: who will give sensible arguments why this chosen effect is relevant and a slightly lower effect is not?) when the data are strikingly improbable under the null hypothesis.* A problem arises when the result is not significant. Without an a priori defined power there is no way to interpret such results at all ("accepting H0" is not an option). So it is questionable what the whole research is for when I cannot draw any conclusion. The practical solution is to make all possible efforts to get a "significant" result, and this means: increase the sample size as much as you can. You are right here: if you had an idea of a relevant effect size and a good estimate of the variability in your data, then a power analysis can tell you whether your resources can be sufficient to achieve this aim (with a given probability). But this way a power analysis is not used to specify the sample size but rather to decide whether the experiment should be performed with the available resources at all.
* One can run into a further problem: Given the power is 90% for H1: t>2 against H0: t=0 and the test gives p
I would venture to guess that nearly every researcher in every field manipulates the power calculation in order to get the desired power with the lowest sample size. The problem, as Alan alludes to above, is that funding is not easy to come by. So if you know you have limited funding, or if you know that you will never recruit the necessary number of subjects, you can "make the calculations work" in order to proceed.
Ariel
Jochen, physicists tell us that at a quantum level the things we observe to be solid are, in fact, mostly empty space populated by constantly moving particles of indeterminate location. But explaining Schrödinger's cat to a student footballer would not help her understand what "offsides" means.
I don't disagree with anything you've said; I just doubt that it's helpful to disparage power analysis when answering someone who is at the point of wondering what the minimum sample size is.
How will you determine the "necessary number" in the first place?
There is only one way to answer this: define a relevant effect size and define a desired power (and know the variability of the data). Then do a power calculation.
Now, how to define what effect is "relevant"? You typically have no clue (the answer I typically hear is: "well, any effect would be good to find"). And what power is appropriate? Will 80% be fine? Why not just 51%? Or shouldn't it better be 90%, or even 95% or 98%? Would be nice, but...* And finally you have to know the variability of the data. Well, if the research is new, information from the literature is only of little help, and estimates from small samples are very unreliable (if they are available at all), so guessing the expected variability is usually another major problem. So there is no sensible way to get the answer. The way is clear and objective, but the input is mere guesswork.
It is such guesswork that I wonder why we do not simply guess the required sample size directly ("my gut feeling [and previous experience with similar experiments] tells me that we may need n=20" or something; that seems more reasonable than saying "my gut feeling about the variance is s²=0.2, my guess about an interesting effect is 0.4, and my vague opinion about my desired power is 85%, so I get n=...").
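To illustrate how strongly the "required" n depends on those guesses, here is a small sketch (a two-group t-test case for simplicity, using statsmodels; the guessed SDs, guessed minimum relevant differences, and the 85% power target are made-up inputs in the spirit of the numbers Jochen mentions):

```python
# The power calculation only repackages guesses: vary the guessed SD and the
# guessed "relevant" difference and watch the required per-group n swing.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for guessed_sd in (0.3, 0.45, 0.6):          # guessed residual SD
    for relevant_diff in (0.2, 0.4):         # guessed minimum relevant effect
        d = relevant_diff / guessed_sd       # standardized effect (Cohen's d)
        n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.85)
        print(f"SD={guessed_sd}, diff={relevant_diff}: n per group ≈ {n:.0f}")
```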
So I totally agree that getting funded is the key issue, and when a power calculation is required, it is usually tweaked to just produce a feasible sample size (one that would be funded and that is manageable). And as far as publication is concerned, an a priori defined (sufficiently high) power is required ONLY when non-significant results have to be interpreted (a rare case in my experience). A significant result is usually never a problem (given that experiment and test are appropriate and the required assumptions are reasonable), because the test will keep the error rate, also for tiny samples. There won't be more false "significant" results in small samples, so this is ok. Low power is not a concern when the result is significant.
I am aware that some reviewers are not happy with tests on small samples; they claim that such tests are "not reliable", but nobody has yet been able to explain to me why this should be so. Tests control the type-I error rate, and - when done correctly - they do so independently of the sample size. Now reviewers say that it is impossible to check the assumptions in small data sets. True. But assumptions are assumptions and not facts about nature. They should be reasonable, not "true". If I have doubts about the reasonableness of the assumptions, then the whole test is inappropriate anyway. Checking the conformity of the data with the assumptions and then deciding which test to choose (on the very same data) undermines the control of the error rate - so there is no point in such pre-testing. And formal hypothesis tests on assumptions blow the whole story up, since the question of power arises again, because here we do interpret a non-significant result!
After all this I can only repeat my statement: if you get a significant result, then the sample size was (obviously) ok. If not, do some more experiments if you can and see if you get a "significant" result. You will only publish when it is significant, so everything should be fine. And if a reviewer then still says that the sample is too small, then do more experiments or submit your manuscript to another journal (hoping for a more benign reviewer).
Note that this refers to "common practice" and to "practicability". This is NOT "good scientific practice". Actually, in my opinion, good science is quite distorted by the focus on tests and p-values and "significances". The control instance in science cannot be a p-value; it must be the (successful) replication by other researchers. But this is not part of our scientific "culture" (again you will need funding, and you won't get it for re-doing and confirming old findings from other groups, so you must do new things, and the only way to judge their "relevance" will be the p-value again...).
*Should a society invest a lot of money and resources in fewer projects so that they can be performed with high(er) power, or wouldn't it be wiser to distribute the money among more different projects, some of which will likely show clear effects even with smaller samples?
Jochen, I think we are in agreement on most issues (in this query and others). One point of clarification: I do think that different fields of research (disciplines) abide by different standards (whether explicit or implicit). For example, I evaluate large-scale interventions, where, as the name implies, the sample sizes are large. That said, I get a fair amount of argument from reviewers who say that the sample size was not sufficiently large to show an effect. I argue back that the intervention had no effect, and that if you need a sample size of tens of thousands, you have bigger issues at hand.
As for determining the sample size: if you have no prior data to guide your calculations, you go to the literature. Naturally, it is fairly easy to find prior research that had large effects with small sample sizes. Armed with that information, you can posit that you expect to achieve similar results based on those previous findings. Obviously, if there is no literature but you have pilot data, you can perform the calculations without conjecture.
Ariel, that's one of the nice things about RG, that people from different fields come together and discuss. We can all learn from each other. I think it is good to have both isolation (to develop new ways to tackle problems) and communication (to exchange the solutions and identified difficulties) between different fields. It's like evolution in nature, an interplay between genetic isolation and the exchange and recombination of genetic material...
Regarding your argument that the intervention had no effect: this is too strong a statement. The key point in a power analysis is to define a minimum relevant effect size that is larger than zero (by some finite amount). This must be defined before the study. Then the study is performed using the required sample size. Given the power was 99%, one can say that the chance of getting a non-significant result when the "true" effect is AT LEAST RELEVANT is lower than 1%. Hence one does NOT expect to get non-significant results when there was a relevant effect. Now if the actual result is in fact "non-significant", then we can conclude with 99% confidence that the effect (whatever it is exactly) is SMALLER than relevant. This is the way you should argue. The reviewers are right when they say that the detection of a non-zero effect is always and only a matter of sample size. So you can never show that the effect is exactly zero, but you can show that the effect is below the limit of relevance.
It may be problematic to define what a "relevant" effect is. Often we have no clue about that, and this is one of the problems in academic research. There often is no objective criterion, and anything different from zero may be interesting... but then the whole argumentation via tests and significance and power actually breaks down. And if we choose the relevance limit arbitrarily, then the rest of the procedure (testing...) is just as arbitrary (and it would be wiser to describe the observed/estimated effect and its uncertainty).
I think it is quite atypical to find prior data in the literature (at least in my experience and in my field). At best there are vaguely related experimental setups and vaguely related responses, so one might make a rough guess of what to expect in one's own experiment. But even then there is a very high risk that the reported effects are biased. When, of 20 studies (I am talking about parts of PhD projects and other research work), 19 are "not successful" but one shows a "significant" result, then just this single "successful" experiment gets published. Even when it is one particular question/experiment: it might have been repeated several times until the results were "significant" - and this one is then published. Thus, the literature is likely a collection of "lucky draws", largely over-estimating the effects and underestimating the variability. Another drawback of using published data is that the results are usually not replicated by others, so there is no information available on how results will vary when the lab, the experimenter, or the time of the experiment changes.
So: yes, looking at the literature can give a hint. But not more than a hint, and even this must be taken with great care and may not be better than a "wild" guess based on one's own diffuse experience.
Thank you for the discussion (although we in fact agree on most issues; but still it helps to get things in order in my brain). I enjoy it!