If we conducted a control study without determining the sample size and power of study, is it possible to calculate the power of study at the end (after data collection is completed)?
No! If you want to look post-hoc, look at the confidence interval instead.
Why would you look at power for a study you have completed? Arguably you would do it because you wanted to know whether or not you could trust a negative result.
The argument would go something like this "I didn't get a statistically significant result, but then for an effect size of x my power was only 50% so this doesn't really tell me very much."
But if you look at the confidence interval you will see the range of values that are consistent with your data, and if this includes an important effect size, then you know that your study was uninformative. Confidence intervals are almost always more informative than significance tests.
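To make that concrete, here is a minimal sketch in Python of the confidence-interval check (the data, the pooled-variance t-interval, and the minimally important difference are all hypothetical, not taken from any study in this thread):

```python
# Minimal sketch: compare a 95% CI for a mean difference against a
# minimally important effect size. All numbers are hypothetical.
import numpy as np
from scipy import stats

group_a = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 4.4])
group_b = np.array([4.9, 5.6, 4.3, 5.1, 5.8, 5.0])
minimally_important_diff = 1.0   # chosen on subject-matter grounds

diff = group_b.mean() - group_a.mean()
n1, n2 = len(group_a), len(group_b)
pooled_var = ((n1 - 1) * group_a.var(ddof=1) +
              (n2 - 1) * group_b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
lower, upper = diff - t_crit * se, diff + t_crit * se

print(f"difference = {diff:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# If the interval still contains the minimally important difference,
# the study was uninformative about that effect.
print("important effect still plausible:", upper >= minimally_important_diff)
```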
Of course, for a non-significant result, if you calculate power using the effect size seen in your study you are bound to get low power. You then have a beautifully circular argument for resurrecting your hypothesis and concluding that your experiment just wasn't big enough. So never do that.
If you are doing a genuinely post-hoc analysis - that is, trying to use power analysis to make sense of the results of a study you have completed, not to plan the next study - then the basic rules are:
1. Don't do post-hoc power analysis;
2. If you really must do post-hoc power analysis, don't do it yet;
3. If you are forced to do it now and can no longer delay, make sure that you never use the effect size observed in your results.
Even a priori, power analyses are based on a whole load of assumptions about the nature of the response, the variances and the effect size. Always remember to look at power under a range of scenarios, and remember that we tend to be over-optimistic about both effect sizes and variances!
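To illustrate looking at power under a range of scenarios, here is a minimal sketch in Python using statsmodels (the sample size, candidate standard deviations, and candidate raw differences are all hypothetical):

```python
# Minimal sketch: a priori power for a two-sample t-test across a grid of
# assumed SDs and raw mean differences. All numbers are hypothetical.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = 30
for sd in (1.0, 1.5, 2.0):              # candidate standard deviations
    for raw_diff in (0.5, 1.0, 1.5):    # candidate raw mean differences
        d = raw_diff / sd               # Cohen's d
        power = analysis.power(effect_size=d, nobs1=n_per_group, alpha=0.05)
        print(f"SD={sd}, diff={raw_diff}: d={d:.2f}, power={power:.2f}")
```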
Yes, you can, but it is not very meaningful and not good practice. If the result is significant, power is not of interest. If the result is not significant, reviewers sometimes ask for power, but it can only tell you that, given that sample size, the power was not enough. It becomes an explanation of why the result is not significant.
I must disagree. Calculating the power of a study retrospectively is a useful tool when evaluating published findings. Further, it can be informative when planning a new study. The advantage here is that you have actual variance estimates from the study population of interest. Given that, you can, for example, set the significance level, power, and N to various values and then solve for the effect size. This is an invaluable tool when planning a new study.
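As a hedged illustration of that planning use, the sketch below (Python with statsmodels; the standard deviation and candidate sample sizes are hypothetical stand-ins for estimates taken from a completed study) fixes alpha and power, varies N, and solves for the detectable effect size:

```python
# Minimal sketch: fix alpha and power, vary N, solve for the detectable
# standardized effect size, then convert back to raw units using the SD
# estimated from the earlier study. All numbers are hypothetical.
from statsmodels.stats.power import TTestIndPower

observed_sd = 2.3                       # variance estimate from the completed study
analysis = TTestIndPower()
for n_per_group in (20, 50, 100):
    d = analysis.solve_power(nobs1=n_per_group, alpha=0.05, power=0.80)
    print(f"n={n_per_group}/group: detectable d={d:.2f}, "
          f"raw difference ~ {d * observed_sd:.2f}")
```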
Confusing retrospective power and prospective power.
Power as defined above for a hypothesis test is also called prospective or a priori power. It is a conditional probability, P(reject H0 | Ha), calculated without using the data to be analyzed. (In fact, it is best calculated before even gathering the data, and taken into account in the data-gathering plan.)
Retrospective power is calculated after the data have been collected, using the data.
Depending on how retrospective power is calculated, it might be legitimate to use it to estimate the power and sample size for a future study, but it cannot legitimately be used to describe the power of the study from which it is calculated.
However, some methods of calculating retrospective power calculate the power to detect the effect observed in the data -- which misses the whole point of considering practical significance. These methods typically yield simply a transformation of the p-value. See Lenth, Russell V. (2000), "Two Sample-Size Practices that I Don't Recommend," for more detail.
See J. M. Hoenig and D. M. Heisey (2001) "The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis," The American Statistician 55(1), 19-24 and the Stat Help Page "Retrospective (Observed) Power Analysis" for more discussion and further references.
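To see why that kind of observed power is just a re-expression of the p-value, here is a minimal sketch for a two-sided z-test in Python (the alpha level and the example p-values are illustrative):

```python
# Minimal sketch: for a two-sided z-test, "observed power" (power at the
# observed effect) is a monotone function of the p-value alone.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)

for p in (0.01, 0.05, 0.20, 0.50):
    z = norm.ppf(1 - p / 2)                      # |z| implied by the p-value
    observed_power = norm.sf(z_crit - z) + norm.cdf(-z_crit - z)
    print(f"p = {p:.2f} -> observed power = {observed_power:.2f}")
# p = 0.05 maps to observed power of about 0.50, and smaller p always gives
# higher observed power, so the calculation adds nothing beyond p itself.
```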
I always recommend calculating the effect size (ES) when reporting study results. The ES is directly related to the power of the study, and something can be statistically significant but practically unimportant. When the ES is based on the square of the standardized regression coefficient, it is extremely helpful. If you are using SAS, request the STB option.
Since we have only n=6 so far, here's my contribution to the sample size.
As Fisher observed, calling in the statistician after the study is over may be no better than asking for a post-mortem to understand what the study died of.
Post-hoc power analysis on an under-powered study can let you plan for a new, better-designed study.
To determine whether the study were underpowered, I would not do a post-hoc power analysis. I would look at the confidence interval of the non-significant result. A broad confidence interval which contains values that could be meaningful indicates an underpowered study. A narrow confidence interval where the extremes are so close to zero as to have little visible impact tells me that the study was not underpowered, it just looked for something that was not there.
If you're looking at power post hoc, you may want to look at the entire body of evidence for a given comparison of interventions. That would typically be in the context of a meta-analysis. You can estimate the optimal information size (computationally, or by nomogram; by number of events or by relative effect measure e.g. RR) when examining pooled effect estimates. This will tell you if your meta-analysis is adequately powered to exclude a significant treatment effect when none was observed. Otherwise, I wouldn't make too much of a single under-powered study.
Post hoc power analysis assumes the observed effect to be the true one, which is probably not the case. You might want to see (if you haven't, already): Hoenig & Heisey (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. The American Statistician, 55 (1).
"Like trying to convince someone that buying a lottery ticket was foolish (the before-experiment perspective) after they hit a lottery jackpot (the after-experiment perspective)."
Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994;121:200-206.
Hi, I was taught that you decide what effect size is reasonable (i.e. what effect size is deemed significant), estimate the variance, decide alpha and beta, then see what N is required. Then you know how to construct your study, and also not gather more samples than are necessary. Thus post-hoc power analysis is pointless for that study, but may assist in designing a follow-up study, or for conducting meta-analysis of related studies.
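A minimal sketch of that workflow in Python using statsmodels (the meaningful difference, assumed SD, alpha, and beta are all hypothetical) might look like this:

```python
# Minimal sketch: choose the effect size deemed meaningful, estimate the
# variance, fix alpha and beta, then solve for N. All numbers are hypothetical.
from statsmodels.stats.power import TTestIndPower

meaningful_diff = 1.0        # smallest difference worth detecting
assumed_sd = 2.0             # from prior literature or a pilot
alpha, power = 0.05, 0.80    # power = 1 - beta

d = meaningful_diff / assumed_sd
n_per_group = TTestIndPower().solve_power(effect_size=d, alpha=alpha, power=power)
print(f"required n per group ~ {n_per_group:.0f}")   # about 64 when d = 0.5
```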
There is a piece of software, G*Power, which I think is free to download. You can enter the information from your research (N, alpha, beta, type of test, etc.) to calculate your power (the actual beta) retrospectively.
You can also calculate a reasonable sample size to achieve your desired power before starting a study (prospectively).
Take a look at its graphs too; they are full of information.
It is always useful to know whether the power was adequate, especially if we can apply what we learn to future analyses. Yes, we would have preferred to know the power earlier to contain our Type II error (risk to the customer), but at least as a post-mortem analysis it can help. Good luck. If you have data, please send it over and I will help with how it can be done.
It is not just for researchers but also for critical decision makers in positions where risk to the producer and/or customer is involved. The more serious the product, the higher the level of knowledge needed across the organization.
I have to say, first, that calculating power is tricky even when you do it a priori. In that case you use an expected prevalence or effect size, but what happens if the prevalence turns out to be different in the end? For instance, you design a study assuming that 10% of people will die of X in the first year on a specific drug and 20% on placebo, but in the end only 3% died on the drug and 5% on placebo. Your power calculation was not useful, because it used "expected" values that may or may not occur. Or imagine you want to know whether diabetes prevalence is lower in one city than in another and you use an estimated prevalence from a small survey to calculate power; again, you may miss the real prevalence. Of course, if you knew the real prevalence you wouldn't need to do the study, but I just want to point out that a priori power calculation is often just a "theoretical design": you have to do it because you need a starting point, but it is not a very solid base and may fail. In this case, it could be useful to calculate "retrospective power", both as a way to explain why you found a non-significant result and as a useful tool for future studies.
One approach I've used with some success when attempting to delicately assert the null in the case of a continuous measure is to use the standard deviation(s) observed and the sample size used to calculate power as a function of a range of theoretical population mean differences. The smallest mean difference for which this function exceeds, say, 95% could be defended as having been effectively ruled out. In essence, the standard deviation from your data is an equivalent or superior substitute for what you might have used in an a priori power analysis... and you are appropriately ignoring the observed mean difference.
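A hedged sketch of that calculation in Python using statsmodels (the observed SD, sample size, and grid of candidate mean differences are all hypothetical) could look like this:

```python
# Minimal sketch: sweep a range of theoretical population mean differences,
# compute power at the observed SD and actual sample size, and report the
# smallest difference for which power exceeds 95%. The observed mean
# difference is deliberately not used. All numbers are hypothetical.
import numpy as np
from statsmodels.stats.power import TTestIndPower

observed_sd = 1.8
n_per_group = 40
analysis = TTestIndPower()

candidate_diffs = np.arange(0.1, 3.01, 0.05)
powers = np.array([analysis.power(effect_size=diff / observed_sd,
                                  nobs1=n_per_group, alpha=0.05)
                   for diff in candidate_diffs])

ruled_out = candidate_diffs[powers >= 0.95]
if ruled_out.size:
    print(f"smallest difference effectively ruled out: {ruled_out[0]:.2f}")
else:
    print("no difference in the range reaches 95% power")
```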
A point that's important to highlight in general: in the case of an a priori power analysis, one should anticipate that the observed data are unlikely to match the values used in the power analysis. This isn't pointing to an error or a mistake. The principle rests on the assumption that any given experiment/study is merely a sample from a population of studies whose density functions can be estimated, most of which are not expected to have the actual population means. These density functions are then used to estimate the proportion of samples drawn from, say, a noncentral t-distribution that would be anticipated to fall below the critical t value from a central t-distribution with the same degrees of freedom.
No single study/experiment can hope to know whether, by virtue of being different from the estimates used in the power analysis, an error in those estimates has been uncovered, since knowing that would remove the rationale for the study in the first place; that is, if we knew the population parameters we wouldn't be sampling with our one little experiment in the first place.
Power analysis is essential and certainly required in grant applications and other proposals. The problem is that it's very speculative. If a "non-significant" finding is the result, it's good to be able to say that a moderate effect size would have been detectable given the sample size, but it wasn't detected. One should not, however, as many here suggest, try to determine how many cases would have been required to find an effect of the size observed, since there will always be an answer. With enough cases, any effect can be significant.
Anyone looking to understand why this might be done should read Design Sensitivity by Lipsey. Also, many of us do not have the luxury of designing an experiment with the N needed for the best statistical analysis; does this mean the experiment shouldn't be done?
My answer also agrees with Mr. Fubing Tang. If you have calculated the sample size before data collection, the power of the study is 1 minus the Type II error. Sometimes we do not calculate the sample size before data collection at all. It is possible to calculate the power by substituting the n (sample size) and Type I error (5%) values into the formula and solving for the Type II error. In this way the power of the test may also be calculated retrospectively.
I believe that you SHOULD calculate power of statistics retrospectively (as stated correctly by Miguel Marcos) in the following cases:
1. There are no prevalence data to calculate the power a priori
2. You can't really estimate the sample size a priori due to a lack of prevalence data
3. You want to show that the statistical significance observed in your study is in fact valid
4. Since there is no way of calculating the effect size a priori (only keying in an assumed value), it would be better to calculate the power/effect size at the end of the study and document it as such.
Mervyn Thomas is providing the best advice in this thread: do not perform post-hoc power analyses. The kind of post-hoc power analysis that many of the other commenters suggest is easy to compute but highly unreliable.
For reasons nicely articulated by Gelman & Carlin (2014), not only are effect size estimates from small studies highly volatile, but statistically significant studies with small samples tend to dramatically inflate apparent effect size. If your study was underpowered to begin with, then the effect size estimate following a significant result will necessarily be inflated, sometimes dramatically so (a "Type M" error, or error of magnitude). Furthermore, if the true effect size is especially small (e.g. if it's effectively zero) then an alarming share of significant results will report the wrong direction of the effect, in addition to exaggerating its size (a "Type S" error, or error of sign). These phenomena are further exacerbated by the unavoidable incentive to conceal null results and elevate significant ones: The stronger the publishing bias, the more misleading a post-hoc power analysis will be.
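For anyone who wants to see the Type M and Type S phenomena directly, here is a minimal simulation sketch in Python (the true effect, standard error, and two-sided z-test rule are hypothetical choices, not taken from any particular study):

```python
# Minimal sketch: when the true effect is small relative to the noise,
# estimates that reach significance are exaggerated (Type M) and sometimes
# have the wrong sign (Type S). All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
true_effect, se = 0.1, 0.5               # small true effect, noisy estimate
estimates = rng.normal(true_effect, se, 100_000)
significant = np.abs(estimates / se) > 1.96   # two-sided z-test at alpha = 0.05

exaggeration = np.abs(estimates[significant]).mean() / true_effect
wrong_sign = (estimates[significant] < 0).mean()

print(f"share significant: {significant.mean():.1%}")
print(f"Type M (exaggeration ratio among significant results): {exaggeration:.1f}x")
print(f"Type S (wrong sign among significant results): {wrong_sign:.1%}")
```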
Do not confuse a statistical procedure that is easy with one that is useful or informative. Trust the statisticians on this one.
Although it is not ideal, you certainly can. Also, whether a priori or post-hoc, it's not always easy to do them right so I would suggest getting someone on board who knows what they are doing. All that said, a post-hoc power analysis can indicate whether you had the power to find your observed effect size(s), which is especially useful if you are running a pilot study.
Carl, Hoenig and Heisey (J. M. Hoenig and D. M. Heisey. The abuse of power. The American Statistician, 55(1):19–24, 2001.) term retrospective power analysis for data analysis "An Abuse of Power". The problem is that whenever a test is not significant, retrospective power at the observed effect size must always be low, and whenever a test is significant, retrospective power must always be high. The Hoenig and Heisey paper has over 450 citations: not because it is original, but because it provides a clear and well-written account of a problem which applied statisticians encounter very frequently. It very neatly expresses the dominant understanding of power analysis in the statistics community.
A much more rigorous account of the folly of retrospective power analysis can be found in Hacking's magisterial work on the logic of statistical inference [I. Hacking. Logic of Statistical Inference. Paperback re-issue. Cambridge University Press, 1965, pages 95-102], in which he demonstrates that Neyman-Pearson inference provides a "before trials" rather than an "after trials" decision rule. In that context, power is a property of the decision rule which is set up at design, before examining the data, and has no role in data interpretation. Indeed, from a strict Neyman-Pearson perspective, data interpretation is an entirely deterministic matter of applying a decision rule which is determined a priori. Neyman and Pearson themselves write [J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 231(694-706):289–337, Jan 1933.]:
We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.
The widely accepted alternative to retrospective power analysis is the consideration of confidence intervals. Unfortunately, confidence intervals must also be recognised as before-trials intervals (as Hacking [pages 159-160] points out). But abusing the concept of a confidence interval in this way has fewer consequences than the abuse of retrospective power analysis. From my somewhat partisan perspective, that is because a confidence interval is usually pretty similar to a Bayesian credible interval - which has exactly the sort of a posteriori meaning you are looking for.
By all means calculate power for your next study based on the effect sizes and variances seen in your current study (but do so with caution). But never attempt to use power to interpret the results of a study you have already undertaken.
No reputable statistician will tell you that you should calculate retrospective power.
As Greg points out, you can certainly complete the mechanics of the calculation; it is only when you try to use the computed power that you come to grief.
This is an excellent discussion on power analysis. In my previous studies, where prevalence information was not available, I divided the work into two parts and used the first part to inform the subsequent study, which I think is the best thing one could do. Thanks a lot, guys.
You may need to calculate how many you need in the sample. If the power calculation shows that you have the minimum number, then it is OK to go ahead with the retrospective data.
To do a power analysis to estimate your sample size, you have to write your hypothesis and, based on that, decide which statistical test you will use. It should be one of the inferential tests. You then need to determine the following: alpha (conventionally .05), power (conventionally .80), and effect size (small, moderate, or large; each test has its own values, which you can find online). Then download a free program to calculate the sample size, such as G*Power.
At the moment I have this problem: I submitted an article with no sample size calculation, because the data are taken from a larger survey that had a sample size calculation, but not for this topic.
Many of you advised looking at the confidence interval, but how do I apply this to linear regression results? For studies with association measures such as relative risk or OR, I can see whether a confidence interval is wide or narrow, but what about linear regression results?
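One way to make the same judgement for a regression slope is to inspect the coefficient's confidence interval; here is a minimal sketch in Python using statsmodels (the variable names and the simulated data are purely illustrative):

```python
# Minimal sketch: confidence intervals for linear regression coefficients.
# Judge the interval for the slope against the smallest slope that would
# matter in practice, just as one would judge an OR or RR interval.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
exposure = rng.normal(size=100)
outcome = 0.3 * exposure + rng.normal(size=100)   # illustrative data

model = sm.OLS(outcome, sm.add_constant(exposure)).fit()
print(model.conf_int(alpha=0.05))   # one row per coefficient: lower, upper
```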
Statistical Minimum Sample Size = fn(Alpha Risk, Beta Risk, Minimum Difference In Central Tendency I Wish to Detect, Existing or Expected Standard Deviation and Power). So we have SMSS = fn(5 Variables). When you input these parameters into the function (say using Minitab), the output gives you the SMSS needed as well as the power value. One can also input the power value with a known SMSS value.
Is this what you were looking for? Or is there something else?
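For readers without Minitab, here is a minimal sketch of the same relationship in Python using statsmodels, assuming a two-sample t-test (the alpha, power, minimum difference, expected SD, and the n of 25 are all hypothetical):

```python
# Minimal sketch: SMSS as a function of alpha, beta (via power = 1 - beta),
# the minimum difference to detect, and the expected SD; and the reverse,
# power for a known sample size. All numbers are hypothetical.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d = 2.0 / 3.5   # minimum difference to detect / expected standard deviation

n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.90)
print(f"minimum sample size per group ~ {n:.0f}")

achieved = analysis.power(effect_size=d, nobs1=25, alpha=0.05)
print(f"power with n = 25 per group ~ {achieved:.2f}")
```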
If post-hoc power analysis is based on the observed effect size, it will not be comparable to an a priori power analysis (done before conducting the experiment), in which the effect size is an expected one. In this case post-hoc power could, in fact, be calculated from the P value, and thus it would be not only useless but misleading.
After you have the results and you find non-significant differences between groups, it might be more relevant to ask which would be, at the same P-value threshold, the minimum detectable difference based on the available sample and observed variance. If the minimum detectable difference is way too high for the aims of your study (e.g., an environmental risk assessment), then the only conclusion is that more research is needed, and next time you should have a better study design (e.g., larger number of sampling or experimental units, and/or more precise measurements).
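A hedged sketch of that minimum-detectable-difference calculation in Python using statsmodels (the observed SD, sample size, alpha, and the conventional 80% power are hypothetical choices) might be:

```python
# Minimal sketch: with the available sample size and the observed SD (not the
# observed difference), solve for the smallest standardized effect detectable
# at the chosen alpha and a conventional power, then convert to raw units.
from statsmodels.stats.power import TTestIndPower

observed_sd = 0.9
n_per_group = 12

d = TTestIndPower().solve_power(nobs1=n_per_group, alpha=0.05, power=0.80)
print(f"minimum detectable difference ~ {d * observed_sd:.2f} "
      f"(d = {d:.2f}) with n = {n_per_group} per group")
```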
As others have stated, no you should not calculate power retrospectively. It is meaningless.
See: Gilbert, G. E., & Prion, S. K. (2016). Making sense of methods and measurement: The danger of the retrospective power analysis. Clinical Simulation in Nursing, 12(8), 303–304. https://doi.org/10.1016/j.ecns.2016.03.001
My feeling is that there is value in calculating power post hoc in studies where it was not calculated a priori; however, the key point is that you should use the criteria that the authors would [sensibly] have used had they conducted an a priori power analysis for sample size. Doing so can tell you whether the study was underpowered at the outset. Perhaps more interestingly, it can tell you whether a published body of research has been underpowered but nevertheless reported positive results. This approach might be useful in a systematic review, with or without meta-analysis.
Retrospective analyses are commonly used to assess clinical outcomes, treatment patterns, and healthcare resource use and costs for rarer health conditions, for very specific indications, or in cases where the required parameters are not captured in large data sets.
Therefore, there is no perfect method or specific formula for calculating the sample size.
The dos and don'ts for determining the sample size of a retrospective study are:
A rule of thumb for quickly determining sample size is 10 cases (charts) per variable, in order to obtain results that are likely to be both true and clinically useful. A minimum of seven, or even five, events per predictor is acceptable.
Three commonly used sampling methods in a retrospective chart review are convenience, quota, and systematic sampling. In convenience sampling, the most common method, suitable cases are selected over a specific time frame; in quota sampling, a predetermined number of cases is sought from each site or diagnostic determinant; in systematic sampling, every 'nth' case is selected from the target population.
It is recommended that researchers contact the institution’s research ethics board coordinator, as they can provide valuable and time-saving site-specific information and assistance. Any changes to the research protocols generally need to be submitted to the review board for an amended approval.
It is noteworthy that retrospective chart reviews are conducted for quality assurance, to evaluate medical professionals, to gather information to train new medical professionals, or to address third-party compliance issues. You should specify a systematic or quota sampling procedure (if needed) to attain the targeted sample for the study, or you can take all the cases in a specific cohort.
I potentially am in a similar situation and would like your opinion, if possible:
I collected data (i.e., I recorded sounds) and a master student is processing it (i.e., measuring each sound), but it takes a lot of time. For his master thesis, he used a sample of X sounds, and conducted preliminary stats.
Now, we want to continue the work to publish it, and wonder whether we need to process more data and if so, how many, so that i) he does not spend too much time coding "unnecessary" sounds and ii) we have reliable results.
I think that a power analysis could be the answer, but I keep reading that power analysis must only be conducted on pilot data. Can we consider that the X sounds he coded are pilot data? If so, can we include them in the final article if we run a power analysis on this sample size?
In other words: can we do a power analysis during the study to estimate when he can stop collecting data?
Importantly: the stats he conducted for his Msc thesis are not the ones we will keep in the article because these are simple and probably a little wrong (he was in a rush to finish before the deadline) and I want to run more elegant tests. So we are completely blind to the results, we have no p-value yet or anything. Just to clarify that we are not trying to p-hack the paper! :)
I would absolutely consider the first set of sounds as pilot data. You are interested in sampling the "population" of recorded sounds so that you can make inference to what you would find if you coded all the sounds in a reliable way.
The best model of a cat is a cat, and preferably the same cat. Coincidentally from Norbert Wiener, https://en.wikipedia.org/wiki/Norbert_Wiener, a leader in your field of acoustics!
So, if I understood well, I can consider the current dataset as pilot data, which is great news. But then, is it OK to keep it in the final study? (I ask since, most of the time, pilot data remain pilot data and are not included in the final analysis.) Won't including this "pilot" dataset in the final sample be an ethical/statistical problem?
There's a priori power analysis (before the study), post hoc retrospective power analysis to calculate achieved power, and there's also sensitivity analysis, which computes the required effect size. I'm not sure if this is relevant, but G*Power software could help with the calculations - it's the one that I've been using in social sciences.