The key in both cases is estimating the standard error (assuming bias and nonsampling error are negligible, which may not be true), for example the standard error of a mean. That depends on the inherent population standard deviation and the sample size. We may 'guess' the standard deviation from preliminary information; for yes/no cases we may take a worst case. For various designs beyond simple random sampling there are complications, but the basic ingredient for both power and confidence studies is the standard error, built from a standard deviation and a sample size. Confidence intervals, however, will often be more practically interpretable. You may just need an estimate of the relative standard error: the estimated standard error of, say, a mean (or total), divided by the estimated mean (or total).
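A minimal numeric sketch of those ingredients, with made-up values for the guessed standard deviation, candidate sample size, and anticipated mean (all assumptions for illustration, not taken from the discussion):

```python
# Standard error of a mean from a guessed SD and a candidate sample size,
# the relative standard error, and the worst-case SE for a yes/no (proportion) question.
import math

sd_guess = 12.0      # preliminary guess at the population standard deviation (assumption)
n = 50               # candidate sample size (assumption)
mean_guess = 80.0    # rough anticipated mean, for the relative standard error (assumption)

se_mean = sd_guess / math.sqrt(n)   # standard error of the sample mean
rse = se_mean / mean_guess          # relative standard error of the mean

# Worst case for a proportion: p = 0.5 maximizes p(1 - p), hence the standard error.
se_prop_worst = math.sqrt(0.5 * 0.5 / n)

print(f"SE(mean) = {se_mean:.3f}, relative SE = {rse:.3%}")
print(f"Worst-case SE(proportion) = {se_prop_worst:.3f}")
```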
If you are able to meaningfully compare probabilities for alternative hypotheses, that might sometimes be desirable, but I have not had need for a hypothesis test in many years (much less "significance," which is a misleading term), and I think this is generally overused and misused. I do remember sequential hypothesis testing being an interesting way to decide between two 'simple' alternatives; it employs a variable sample size.
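For readers unfamiliar with sequential testing, here is a hedged sketch of Wald's sequential probability ratio test (SPRT) between two simple hypotheses about a Bernoulli proportion; the hypothesized proportions and error rates are assumptions chosen purely for illustration:

```python
# SPRT: accumulate the log-likelihood ratio observation by observation and stop
# as soon as it crosses one of the two decision boundaries, so the sample size is variable.
import math, random

p0, p1 = 0.5, 0.7          # the two 'simple' alternatives (assumed)
alpha, beta = 0.05, 0.10   # tolerated error rates (assumed)
A = math.log((1 - beta) / alpha)   # upper boundary: accept H1
B = math.log(beta / (1 - alpha))   # lower boundary: accept H0

random.seed(1)
llr, n = 0.0, 0
while B < llr < A:
    x = 1 if random.random() < 0.7 else 0   # simulate one observation (true p = 0.7 here)
    llr += x * math.log(p1 / p0) + (1 - x) * math.log((1 - p1) / (1 - p0))
    n += 1

decision = "accept H1 (p = p1)" if llr >= A else "accept H0 (p = p0)"
print(f"Stopped after n = {n} observations: {decision}")
```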
Thanks for your replies. I have read in well-cited papers that confidence intervals are better than post hoc power calculations at informing readers about the possibility of an inadequate sample size. See:
A CI is more useful if the estimate is important. A p-value is sufficient if only a test is important. The test is just there to judge whether one is sufficiently confident about the direction of the effect, not about the size of the effect. Especially when the effect is relatively large or is estimated with high precision, a test (which would give a very tiny p-value) is not very instructive, but it might be interesting to see whether (and how well) the data would be compatible with a smaller or larger effect as well.
However, the CI is still about just the same thing as the p-value and simply shows the test result from a different angle. And because it is about the test, it is about the data, not about the estimate. A direct statement about the estimate would require Bayesian statistics. But given a flat prior and a sufficiently large sample size, the CI can serve as a sufficiently good approximation to the credible interval of the estimate.
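To illustrate that last point, here is a small sketch with assumed data: for a binomial proportion with a flat Beta(1,1) prior and a reasonably large sample, the Wald 95% confidence interval and the 95% Bayesian credible interval nearly coincide.

```python
from scipy import stats
import math

n, k = 400, 168          # assumed sample size and number of 'successes'
p_hat = k / n

# Frequentist 95% CI (Wald interval)
se = math.sqrt(p_hat * (1 - p_hat) / n)
z = stats.norm.ppf(0.975)
ci = (p_hat - z * se, p_hat + z * se)

# Bayesian 95% credible interval under a flat prior: posterior is Beta(k + 1, n - k + 1)
cred = stats.beta.ppf([0.025, 0.975], k + 1, n - k + 1)

print(f"95% CI:       ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% credible: ({cred[0]:.3f}, {cred[1]:.3f})")
```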
A p-value does not stand alone well. What is "big"? What is "small"? Note that convention on that does not fit with "big data." A p-value in a vacuum is rather meaningless.
P-values and confidence intervals are both sample size dependent because standard errors are. But a p-value takes the standard error and converts it to something which is not interpretable by itself. That has caused a great deal of misuse.
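A sketch with simulated data of that sample-size dependence: for a fixed, practically trivial mean difference (the size of the difference and the sample sizes below are assumptions), the p-value collapses toward zero as n grows, while the confidence interval keeps the smallness of the effect in view.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_diff = 0.05   # assumed tiny true difference, in units of one standard deviation

for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_diff, 1.0, n)
    t, p = stats.ttest_ind(a, b)
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    print(f"n = {n:>9}: p = {p:.3g}, 95% CI for the difference = ({lo:+.3f}, {hi:+.3f})")
```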
Sample size is contextual, both within a general framework of similar studies and within the study itself. I can run an experiment on perch with 10 replicates and be rightly criticized because perch are common fish; the same sample size on rare pelagic sharks might be astounding. A study that killed 60,000 black cutworms might be just fine, but I might have trouble if I used 60,000 baby rabbits.
My thought is this: if the sample size is in line with existing publications, or substandard but with extenuating circumstances [costs (time or money), risk (bodily harm), rarity], then the study should be published. It is unfair to punish one research group for a failing of all. It is then our responsibility to write editorials to journals and point out the problems associated with small sample sizes in terms that the readers of those journals can relate to.
At some point every graduate student in the biological sciences should have to answer the questions posed to the ASA board in James' linked article. However, the questions beg to be changed a bit, to:
Q1) What do you think is an appropriate p-value for deciding to reject the null hypothesis?
Q2) How did you make this determination, or more importantly how should others make a similar determination that fits their research project objectives?
In my opinion, it is not a matter of "punishing one research group for a failing of all" but of informing the reader that the results of the study fail on their own due to an insufficient sample (and therefore insufficient statistical power). It is our responsibility as researchers to report failings that may affect the conclusions of a study and, therefore, other researchers on the subject.
In this case, there are no extenuating circumstances, and the authors did not perform an a priori power calculation; their results are underpowered.
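For context, a hedged sketch of the kind of a priori calculation being referred to: solving for the per-group sample size of a two-sample t-test given an assumed effect size, alpha, and target power (the effect size of 0.5 here is purely a placeholder, not taken from the manuscript).

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the per-group n that achieves 80% power at alpha = 0.05 for an assumed effect size.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                    ratio=1.0, alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.1f}")
```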
Ah, but it is a punishment. If half of the equivalent studies in the field use a sample size of 20, then how do you recommend rejecting the manuscript based on a sample size of 20 in this case? You have chosen to reject a manuscript that is within accepted limits. Would you go to similar studies that have been published and estimate the power in those studies? You will probably find that they are under-powered as well (or that they were written in such a way that you cannot make the calculation).
That said, if the accepted number of replicates is 20 and this paper has 3, then please reject it -- but at that point you shouldn't need a power calculation to arrive at that outcome.
Try this: go to the literature, find five equivalent manuscripts, and run the power calculation for each. You can then write in the review: "The range in power of five equivalent manuscripts was 0.4 to 0.78, and due to the small sample size in this manuscript the power was 0.34. It is therefore likely that the results are not repeatable. I would be happy to review the manuscript again if the authors can increase their sample size by at least 4 replicates to bring it in line with established research." Realize that your sample of five manuscripts is subject to the same criticism that you are giving this manuscript.
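A sketch of that comparison using statsmodels; the effect sizes and per-group sample sizes below, both for the five "equivalent manuscripts" and for the manuscript under review, are made-up placeholders rather than real studies.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# (assumed effect size, assumed n per group) for five comparison manuscripts
comparison_studies = [(0.5, 20), (0.6, 15), (0.45, 25), (0.55, 18), (0.5, 30)]

for d, n in comparison_studies:
    power = analysis.power(effect_size=d, nobs1=n, alpha=0.05, ratio=1.0)
    print(f"effect size {d:.2f}, n = {n:>2} per group -> power = {power:.2f}")

# The manuscript under review, with its smaller (assumed) sample size:
print(f"manuscript: power = {analysis.power(effect_size=0.5, nobs1=12, alpha=0.05):.2f}")
```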
One other condition: is it reasonable to assume that the information needed for a power calculation was present before the experiment began?