I thought Bonferroni corrections were needed when multiple paired comparisons were conducted within the same experiments (for different conditions). Do I really need to apply Bonferroni corrections when the t-tests are conducted on different tasks that are never analyzed together?
Since you have given yourself multiple chances to find a difference between the two groups (i.e., multiple tasks), you have inflated the chances of getting at least one significant difference by chance. So yes, some correction for that is needed.
The Bonferroni correction assumes that all of the hypothesis tests are statistically independent, however, and that is almost surely false. If two of your tests have some aspects in common (e.g., they would be influenced by some of the same physical or mental abilities), then there would be some dependence. The probability of making at least one Type I error would then be less than Bonferroni assumes, and the Bonferroni would be an over-correction (reducing power).
It sounds to me like your best approach would be to start with a multivariate comparison between the groups, such as multivariate ANOVA or discriminant function analysis (the scores on the tasks are the different DVs). The multivariate approach controls for the multiple chances to find differences, and it does so without assuming independence of the DVs. This would be a good way to establish that there are some differences between groups beyond chance (p < .05).
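To put rough numbers on this inflation, here is a minimal arithmetic sketch (my own illustration, assuming k independent t-tests with all null hypotheses true). With four tasks the uncorrected familywise rate is already about 0.19, and under dependence, as noted above, the true rate sits somewhere between 0.05 and that value, which is why Bonferroni can over-correct.

```python
# A minimal sketch (mine, not from the thread): the chance of at least one false
# positive across k independent t-tests with all null hypotheses true, and the
# effect of the Bonferroni adjustment.
alpha = 0.05
for k in (1, 2, 4, 10):
    fwer = 1 - (1 - alpha) ** k            # P(at least one Type I error) under independence
    bonf = alpha / k                       # Bonferroni per-test level
    fwer_bonf = 1 - (1 - bonf) ** k        # familywise rate after correction (slightly below 0.05)
    print(f"k={k:2d}  uncorrected FWER={fwer:.3f}  per-test alpha={bonf:.4f}  corrected FWER={fwer_bonf:.3f}")
```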
As long as the data are independent across tasks, then you shouldn't need to correct for multiple tests. If you are running multiple analyses on the same data, then you do need to correct for the multiple comparisons.
Leonard is correct. I want to note that it also depends on whether the t-tests are tied to independent hypotheses.
The question is what kind of error-rate you wish to control.
If you reject H0 whenever p < alpha, then you control the rate of false rejections (among true null hypotheses) at alpha, i.e., alpha*100%.
If you consider many experiments (not necessarily related!), some of them may have p-values below alpha purely by chance.
What do you mean by multiple tasks that are never analyzed together? Could you explain your question in some more detail? What you want to do is determine the 'family' of comparisons. The fact that this will always be somewhat arbitrary points towards the inherent difficulty of drawing any strong conclusions from such exploratory analyses. If you are talking about a single study, you could test your main prediction using a .05 alpha level, and a Bonferroni correction to test 'whatever else happens'. This will be the most conservative approach. You can present these results as exploratory, and perhaps report effect sizes and confidence intervals, as long as you remind yourself that exploratory analyses can never be used to test a theory.
Daniel, what I mean is that I have tested two populations of participants on several different tasks (comparison of numbers, recognition of patterns, processing speed and speech articulation speed), that I ran four different t-tests, one for each task, and that a reviewer (the editor, actually) is asking me for a correction. But because my tasks are independent I thought I did not have to correct anything.
Then, I think I understand Jochen's point. If we run 100 t-tests, some of them will come out significant purely by chance.
Igor - As a matter of fact the hypotheses are completely independent. What is more, for some of the tests I expect a difference but for others I don't.
I think there is great disagreement on the philosophy surrounding the necessity and extent of adjustment for multiple comparisons (e.g., Perneger, 1998, 1999). Numerous methods exist to adjust the results of individual tests. Among these, the Bonferroni technique is a widely used option but a rather conservative one. A more powerful variant of the Bonferroni technique is the Holm sequential procedure (Aickin, 1999; Proschan & Waclawiw, 2000; Zhang, Quan, Ng, & Stepanavage, 1997).
Aickin, M. (1999). Other method for adjustment of multiple testing exists. British Medical Journal, 318, 127–128.
Perneger, T. V. (1998). What's wrong with Bonferroni adjustments. British Medical Journal, 316, 1236–1238.
Perneger, T. V. (1999). Adjusting for multiple testing in studies is less important than other concerns. British Medical Journal, 318, 1288.
Proschan, M. A., & Waclawiw, M. A. (2000). Practical guidelines for multiplicity adjustment in clinical trials. Controlled Clinical Trials, 21, 527–539.
Zhang, J., Quan, H., Ng, J., & Stepanavage, M. E. (1997). Some statistical methods for multiple endpoints in clinical trials. Controlled Clinical Trials, 18, 204–221.
Actually, there is no real "yes" or "no" answer to this kind of question. You can be very strict and say "yes", a correction is needed, because you have the same sample of participants and you run multiple tests with them.
But this is only half true. If you really have different and independent hypotheses, then it is meaningful to disregard such a correction. It just makes no sense, because you are testing different hypotheses! So a "no" would be the correct answer.
I would need a better understanding of your experimental design, so I will describe what I understood from your answers.
* You are working with two groups, experimental and control
* Each group completes the same tasks (comparison of numbers, recognition of patterns, processing speed and speech articulation speed)
* You want to compare the results of each test between the groups
From my understanding, there is no correction necessary. Consider the following scenario: a psychologist and an economist are using the same participants for their studies, e.g. for a survey, just to save money and time. If they analyse the data independently, no one would ever ask for a correction of their results.
But here is the important point: if you don't mix the results/tests, then you are fine. If, at the end, you draw a concluding summary across all four tests, one could argue that they are not really independent.
Like Jeff already said, a MANOVA might be a better way to start. If this MANOVA is statistically significant then, according to Field, follow-up t-tests for post-hoc analysis are fine.
However, my personal opinion is that no correction is necessary, provided I have described your experimental design accurately.
You may use Holm’s Sequentially Rejective Bonferroni Test, described in Keppel (1991) and Kirk (1995).
To give a simple answer: yes, you should use some kind of correction (it is, however, debatable whether you should use the Bonferroni correction or another one).
If your different tasks are linked to one principal question, you should use the correction.
I oppose using corrections for multiple tests, although I do it anyway to avoid journal rejection. As Perneger TV pointed out (BMJ 1998;316:1236-8), "Bonferroni adjustments imply that a given comparison will be interpreted differently according to how many other tests were performed....Most proponents of the Bonferroni method would count at least all the statistical tests in a given report as a basis for adjusting P values. But how about tests that were performed, but not published, or tests published in other papers based on the same study? If several papers are planned, should future ones be accounted for in the first publication?"
In other words, if I view an interaction as significant, and comparing just those two measures without others yields a significant p-value, then would I be dishonest by withholding the results of other comparisons? What if I said I didn't happen to do them? In short, my ability to gain notice for my analysis is punished if I happen to have measured several other variables.
See also Brandt J, Clin Neuropsychol (2007;21:553-68): "There is also a peculiar logic to the Bonferroni correction. Why should interpretation of a given statistical comparison differ depending on how many other comparisons are also performed? If you go to your physician for a series of blood tests, you want to know whether any of them differ from normal."
"So, what is the solution? Here, it is simple: Let each statistical test speak for itself."
In 2001, Delaina Walker-Batson and colleagues (Stroke 2001;32:2093-8) found that amphetamine paired with speech therapy was associated with improvement on language tests in aphasic patients relative to placebo. However, at 6 months follow-up the difference remained significant, unless one applied correction for multiple comparisons.
So was there an effect, or not? If we treat the results as insignificant because of correction for multiple comparisons, are we missing out on a treatment that is truly efficacious for aphasia? Or should we be subject to the tyranny of multiple comparisons?
Health care is in the balance. Shall we not choose wisely?
You could certainly run ANOVA or MANOVA, whichever is appropriate, and add Tukey's HSD for post hoc contrasts to look at direct comparisons. Tukey's is a reasonable test to control Type I error rate inflation as long as you have fairly equal sample sizes and you are comparing means. Bonferroni tends to be too strict, multiple t-tests too loose; Tukey's HSD is often just right and strikes a good balance.
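A hedged sketch of that ANOVA-plus-Tukey route, assuming statsmodels and SciPy are available; the condition labels and data below are made up purely for illustration.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
condition = np.repeat(["A", "B", "C"], 30)                 # three illustrative conditions
score = np.concatenate([rng.normal(0.0, 1, 30),            # A and B drawn from the same population
                        rng.normal(0.0, 1, 30),
                        rng.normal(0.6, 1, 30)])           # C carries a real effect

f_stat, p_omnibus = stats.f_oneway(score[:30], score[30:60], score[60:])
print(f"omnibus ANOVA: F = {f_stat:.2f}, p = {p_omnibus:.4f}")
print(pairwise_tukeyhsd(endog=score, groups=condition, alpha=0.05))  # Tukey HSD holds the familywise rate near alpha
```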
Victor raises a few interesting questions (though I do not agree with all the conclusions, especially Brandt, 2007). And I would like to try a more concrete approach to the question, because my impression is that a lot of papers about this topic hide behind vague and abstract statements instead of giving a clear answer to it.
Why do we need a correction of the Type I error (or sometimes also Type II)? Let us look at one- and two-sided t-tests. A two-sided t-test uses alpha/2 on either side of the probability distribution. This is in fact a Bonferroni correction. We need it because now we have two stochastic events that support the alternative hypothesis. From probability theory we know that a disjunction of two events has a higher probability of occurring: p(A OR B) = p(A) + p(B) - p(A AND B). And if A and B are mutually exclusive events (as the two tails are), p(A AND B) equals zero, so p(A OR B) reduces to p(A) + p(B). In a two-sided t-test, this means that the chance to support my theory is doubled (given that the theory is bound to the alternative hypothesis). That is why I have to reduce the alpha limit.
If, on the other hand, my hypothesis were different, say there were two theories (instead of one alternative hypothesis): Mean1 > Mean2 would suggest my first theory, Mean2 > Mean1 would suggest a second theory, and Mean1 = Mean2 would just leave the scientist baffled in his lab. In this case, no correction would be necessary. There is no disjunction of stochastic events.
The crux is that we should not only look at the numbers but at the theories we investigate. A scientist has a theory which leads to psychological hypotheses. The psychological hypothesis is translated into a statistical hypothesis. And if there is more than one stochastic event favouring one single psychological hypothesis, then we risk inflation and need to correct the Type I error.
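A quick numerical check of the point that a two-sided t-test behaves like a Bonferroni correction over two disjoint one-sided rejection regions; this is my own sketch and assumes a reasonably recent SciPy (the 'alternative' argument of ttest_ind).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, agree = 0.05, 0
for _ in range(2000):
    a, b = rng.normal(size=(2, 20))                              # both samples from the same null population
    reject_two_sided = stats.ttest_ind(a, b).pvalue < alpha
    reject_split = (stats.ttest_ind(a, b, alternative="greater").pvalue < alpha / 2 or
                    stats.ttest_ind(a, b, alternative="less").pvalue < alpha / 2)
    agree += (reject_two_sided == reject_split)                  # the two decision rules should coincide
print(agree / 2000)                                              # prints 1.0 for the symmetric t distribution
```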
I would be glad to receive any comments on these thoughts!
@Catherine: I hope this helps.
@Jan Seifert: Great comment. I absolutely agree. It's sad that the relationship between statistical and "psychological" hypotheses is so often neglected.
I assume that different tasks mean different dependent variables (DVs). If so, correction or no correction depends on how many t-tests you need to run on the same DV. If it is more than one test, I would adjust. If it is only one test on one DV, I would not.
There is also the situation where you have multiple DVs in a study and you want to run a t-test on each of them. Before you decide how many t-tests you want to run, you probably want to check the correlations among those DVs. If they are highly correlated, you can just pick one or two for your t-tests (if this makes sense for what you are doing). An example is a retention test and a transfer test. They might be strongly correlated. Rather than running two t-tests, you can just run one test using either the retention or the transfer test, if that is sufficient for whatever you are measuring.
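As a small illustration of that correlation check (my own sketch, with hypothetical DV names standing in for retention, transfer and a third unrelated measure):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
retention = rng.normal(size=50)
transfer = 0.8 * retention + 0.2 * rng.normal(size=50)    # deliberately made to track retention
speed = rng.normal(size=50)                               # an unrelated DV
dvs = pd.DataFrame({"retention": retention, "transfer": transfer, "speed": speed})
print(dvs.corr().round(2))   # strongly correlated DVs are candidates for dropping or combining before testing
```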
Some people have mentioned it already: the important question is whether the different tests are used to draw the same conclusion.
You can do many different and completely independent tests to show that the two populations can be distinguished, but your conclusion (the populations can be distinguished) is always the same. So you need to do a correction, since with an increasing number of tests you inflate the chance of getting a significant answer, as others have stated above.
Independence of tests: it depends on how they depend on one another. If different tests use the same method, any correction might overcorrect your results (see Jeff's answer). However, independence of the tests is not the ultimate criterion here. What matters is the independence of the final conclusion(s); that is what determines whether you need a Bonferroni correction.
@Victor Mark: A sentence such as
See also Brandt J, Clin Neuropsychol (2007;21:553-68): "There is also a peculiar logic to the Bonferroni correction. Why should interpretation of a given statistical comparison differ depending on how many other comparisons are also performed? If you go to your physician for a series of blood tests, you want to know whether any of them differ from normal."
just shows that Brandt (2007) did not understand statistics. If you do 10000 blood tests for a patient, you expect some of them to deviate significantly by chance. Certainly, the doctor should have a look at them individually, but this does not contradict the logic of the Bonferroni correction.
If one blood test shows that the patient has too many white blood cells, you might be alarmed even if 9999 other parameters are normal. But the Bonferroni correction does not tell you to ignore this one test or to weight it down. It all depends on the conclusion you want to draw.
Hypothesis 1: Your patient has too many white blood cells. You did just one test and it was significant. You do not need to correct this single result. You should do more tests to find out whether your patient has leukaemia.
Hypothesis 2: The overall picture of your patient is significantly different from that of an average person. In this case you did many independent tests for this same hypothesis and you need to do a correction. In the end, the amount by which the one significant test deviated will be important. A major deviation might still be significant after the correction.
(This is similar to what Jan said above.)
If you are not satisfied with the statistical properties of multiple t-tests you should look for other tests (ANOVA or MANOVA, see above).
Yes, it matters whether the DVs are, essentially, measuring the same construct (in which case the MANOVA approach is ideal), but I also think you need to consider the risk of making a Type 2 error (of finding no difference where there was one). We tend to emphasise the Type 1 error risk and adjust conservatively, accordingly. But what if this is novel or exploratory research and to do so stops us reporting a real, and important difference between groups or treatments which needs further exploration? Perneger is not the only author who argued that the Bonferroni correction was inappropriate at times. See also Rothman http://www.ncbi.nlm.nih.gov/pubmed/2081237 (Epidemiology. 1990 Jan;1(1):43-6. No adjustments are needed for multiple comparisons. Rothman KJ.). I have successfully argued in a number of papers now that no adjustment is required when the research is exploring novel or exploratory hypotheses. Moreover, if we emphasise effect sizes more than p values, this problem also diminishes in impact.
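In that spirit, a minimal sketch of reporting a standardised effect size (Cohen's d with a pooled SD) alongside the p-value; the data are simulated and the function is a textbook version, not taken from any of the papers cited here.

```python
import numpy as np

def cohens_d(x, y):
    """Standardised mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(3)
group1, group2 = rng.normal(0.4, 1, 40), rng.normal(0.0, 1, 40)
print(round(cohens_d(group1, group2), 2))   # report this (with a CI if possible), whatever the p-value does
```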
As a reminder, the t in t-test is written in lower case. Student's t it is, and if he were alive, he would probably invite you to have a Guinness ;-)
Given that many researchers will conduct, and report, a series of closely-related studies, should a 'career Bonferroni' be applied?!
I agree with those who suggest you read Perneger; I was able to use it to convince reviewers that Bonferroni was not needed for an analysis they questioned.
Perneger, T. V. (1998). What's wrong with Bonferroni adjustments. British Medical Journal, 316, 1236–1238.
That said, ANOVA/MANOVA should be used over t-tests if appropriate... it may not be appropriate, from what you have described.
A few points that previous contributors may have overlooked:
a) As Scheffé pointed out many years ago (in 1953, to be precise), an omnibus ANOVA does not satisfactorily control Type I errors in subsequent comparisons. Say you have several conditions in the design and there is only one real effect amongst the various possible comparisons of those conditions. In this case, the probability of a significant ANOVA F is not alpha, but the power of the test (which is greater than alpha). If you then perform many t-test comparisons (because the 'protective' ANOVA is significant), the probability of making a Type I error amongst all but one of them is what it would have been had you not performed the ANOVA first. That is why Scheffé recommended using the omnibus F distribution for all comparisons, not the t distribution (which is F with 1 df in the numerator). (A rough Monte Carlo sketch of this point follows point (c) below.)
b) The same consideration applies to MANOVA, but on a larger scale. Assume you have several conditions all measured on several variables, and that there is one true effect with one comparison on one variable. Again, the probability of rejecting the omnibus multivariate null hypothesis is the power of the test, which again is greater than alpha. All but one of the subsequent comparisons are truly null, but the probability of rejecting any one of them is what it would have been without the 'protection' of the MANOVA test. In short, ANOVA and MANOVA provide no protection for Type I errors.
c) MANOVA makes sense when, according to the substantive theory being researched, it makes sense to use a linear weighted combination of the DVs (i.e., when such a composite variable is theoretically interpretable). It is not appropriate otherwise. Moreover, in such circumstances it might be better to define the linear composite beforehand, as theory would mandate it, rather than tacitly adopt a linear composite that maximises the ANOVA effects (which is what MANOVA does).
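A rough Monte Carlo sketch of point (a), under illustrative assumptions of my own (four groups, n = 30 each, one large real effect): the omnibus ANOVA is almost always significant, yet the follow-up t-tests among the three truly equal groups still yield at least one false positive far more often than 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha, n_sims, n = 0.05, 5000, 30
experiments_with_false_alarm = 0
for _ in range(n_sims):
    g1, g2, g3 = rng.normal(0.0, 1, (3, n))          # three groups with identical population means
    g4 = rng.normal(1.0, 1, n)                       # one group carrying a genuine effect
    if stats.f_oneway(g1, g2, g3, g4).pvalue < alpha:            # the 'protective' omnibus test
        null_ps = [stats.ttest_ind(a, b).pvalue                  # comparisons that are null in the population
                   for a, b in ((g1, g2), (g1, g3), (g2, g3))]
        if min(null_ps) < alpha:
            experiments_with_false_alarm += 1
print(experiments_with_false_alarm / n_sims)         # well above 0.05, despite the 'protection'
```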
I feel the need to respond to the comment made by Jochen Wilhelm, in which he said: "For instance think of 20 experiments from which 4 showed a "significant" result (p < 0.05)."
Nice comment, Barry. This is a very important and largely unrecognized fact: "the overestimation of population effect sizes, because we tend to take note of only the sample effects that are large enough to yield statistical significance"
I agree with Jeff. But please avoid Bonferroni if you can, it kills your results. You could try using cluster based statistics.
I need to correct some of what I said in my earlier comment (to do with 'protected' comparisons).
To do so, it might be best if I begin with a true omnibus null: that is, when all population means are equal. In this case, making all the usual assumptions, the probability of a spuriously significant ANOVA F will be alpha (say 0.05). In such a case, even if the probability of rejecting any protected comparison were to be 1.00 (which is silly, I know), the probability of a Type I error occurring in the research would be 0.05 (1.00 x 0.05)*. It is in this case (only) that ANOVA F limits the Type I error rate to alpha. The same applies to MANOVA. It's only when there are no mean differences on any dependent variable that the upper limit of a Type I error with any comparison is alpha.
* NOTE 1: This calculation assumes that the overall ANOVA F and subsequent comparisons are statistically independent, which they could not be. However, it suffices for the point I'm making about alpha as the upper limit of a Type I error when the overall (omnibus) ANOVA null hypothesis is true.
* NOTE 2: This is why my earlier statement that "…the probability of making a Type I error amongst all but one of them [the comparisons] is what it would have been had you not performed the ANOVA first" was incorrect. That statement occurred because I tried to simplify earlier drafting, but mangled it instead.
Suppose, contrary to the fully null case, that there is one mean different from the others in the population and that the probability of detecting a difference between this mean and the others with ANOVA is 0.15 (i.e., the power of the ANOVA F test). In such a case, rejecting the ANOVA null hypothesis is a correct decision, and its probability is 0.15. Nevertheless, many of the comparisons that could (and probably would) be made after the significant ANOVA are null in the population, and rejecting any of them with a 'significant' test would be a Type I error. The upper limit for such errors is no longer alpha (0.05) but the power of the ANOVA (0.15). That is why Scheffé and Tukey proposed alternatives to ANOVA for post hoc comparisons (and for more complex contrasts). The same considerations apply to MANOVA, but with the complication that F in this context is computed on a weighted linear composite of the dependent variables. Even so, correctly rejecting a null MANOVA hypothesis will result in the upper limit of the comparison alpha being equal to the power of the test, not alpha.
So, to address Catherine's question, what would I recommend be done? My answer involves a two component strategy:
1. With those aspects of a research design for which clear predictions can be made on the basis of a well articulated theory and, ideally, supporting prior evidence, specify theory-contingent comparisons (or more complex contrasts) and use the t (or F with one df) as your test distribution (c.f. Rosenthal & Rosnow, 1985).
2. For those aspects of the research design where (1) does not apply but you want to snoop around to see what might be, then use the approach proposed by Rodger (http://en.wikiversity.org/wiki/Rodger%27s_Method), which is well summarised by Roberts (2011), who has also created an SPSS program to implement the procedures. Rodger's approach sets the average post hoc Type I error rate at alpha for a set of uncorrelated contrasts, and is much less conservative than Scheffé, Tukey or Bonferroni.
If, despite the recommendations of many applied statisticians, you're having trouble getting planned contrasts accepted (as in 1 above), then use Rodger's approach throughout. It is far more statistically coherent than ANOVA followed by t-tests, and is more sensitive to real effects than the Scheffé, Tukey and Bonferroni procedures while keeping the average relevant alpha at the value you (and your editor and reviewers) want it to be.
References
Roberts, M. (2011). Simple, powerful statistics: An instantiation of a better ‘mousetrap’. Journal of Methods and Measurement in the Social Sciences, 2, 63-79.
Rodger, R. S. Rodger's Method (http://en.wikiversity.org/wiki/Rodger%27s_Method, accessed on 12-05-13).
Rosenthal, R., & Rosnow, R.H. (1985). Contrast Analysis. Focussed Comparisons in the Analysis of Variance. Cambridge University Press, N.Y. (ISBN 0-521-31798-7)
But wait, there's more:
I feel I should relate my previous comments to a few of those above:
A. Contrary to Laurence Nolan and Barry Cohen, I don't think a MANOVA is often appropriate even when the dependent variables are conceptually related (as contemplated by Jeff Miller). Why? (i) Because MANOVA doesn't set the upper limit of comparison/contrast Type I errors to alpha when the overall null hypothesis is false; and (ii) because MANOVA employs a post-hoc weighted linear combination of the dependent variables, not one based on their theoretical relationships.
Furthermore, if we accept Barry's conjecture that many published studies include more true effects than null ones, then point (i) here is even more pertinent. In that case, everything else equal, the power of the overall F test will be quite high (perhaps 0.60 to 0.70), setting the upper limit of a Type I error (with the null effects) to be 0.60 to 0.70, not alpha.
B. I agree with Romola Bucks that we need to keep the probability of Type II errors firmly in mind (and Rodger's approach does just that), but even so, some researchers have made some disquieting observations about the rate of false positives in published medical and psychological research (Ioannidis, 2005; Simmons, Nelson & Simonsohn, 2011).
C. I couldn't agree more with Barry Cohen (and Jochen Wilhelm) about the need to report, but be careful not to overestimate, effect sizes. Hays (1973), for example, developed a measure he referred to as omega squared to address this problem, and several others have been proposed. (A small computational sketch of omega squared follows the references below.)
References
Hays, W. L. (1973). Statistics for the Social Sciences. Second edition. New York: Holt, Rinehart, and Winston.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), 0696-0701.
Simmons, J. P., Nelson, L D., & Simonsohn, U. (2011). False-positive Psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science Online, Oct 17, 1-8.
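For completeness, a small computational sketch of omega squared for a one-way design, built from the usual sums of squares (my own illustrative code, following the standard textbook formula rather than Hays' original presentation):

```python
import numpy as np

def omega_squared(*groups):
    """Omega squared for a one-way design: a less biased effect-size estimate than eta squared."""
    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    ms_within = ss_within / (len(all_scores) - len(groups))
    return (ss_between - df_between * ms_within) / (ss_between + ss_within + ms_within)

rng = np.random.default_rng(5)
print(round(omega_squared(rng.normal(0.0, 1, 40), rng.normal(0.5, 1, 40)), 3))
```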
I would just add that MANOVA should not be thought of just as a possible means for controlling Type I error when testing multiple DVs; it can actually be more powerful than any of the individual DVs depending on the relationships among DVs (especially when they are NOT highly positively correlated with each other). But it is indeed true that MANOVA can easily capitalize on chance relationships among the DVs, so its results need to be interpreted cautiously and replicated to be trusted.
Barry, I agree with both your points (that MANOVA can be more powerful than ANOVA, and that its outcomes need replication) and would add another variant.
MANOVA can be practically useful when, for example, you have a number of measures to hand and want a composite that will maximise your ability to distinguish between groups of individuals (or conditions applied to individuals). The point to note, however, is that the composite variable is being used in its own right here, not as a gateway to differences on the individual variables. And, to repeat your caution Barry, the weights used in the composite will have to be replicated and adjusted until they stabilise. In these kinds of applications, the composite variable may remain a 'black box' practical combination of the original variables, or it might lead to a theoretical interpretation of how & why the variables combine for the purpose they are being used for.
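For readers who want to try the multivariate route discussed here, a hedged sketch using statsmodels' MANOVA (assuming a recent statsmodels is installed); the task names echo Catherine's description, but the data and effect sizes are invented.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(6)
n = 40
df = pd.DataFrame({
    "group": np.repeat(["control", "experimental"], n),
    "number_comparison": np.concatenate([rng.normal(0.0, 1, n), rng.normal(0.4, 1, n)]),
    "pattern_recognition": np.concatenate([rng.normal(0.0, 1, n), rng.normal(0.2, 1, n)]),
    "processing_speed": rng.normal(0.0, 1, 2 * n),
})
manova = MANOVA.from_formula(
    "number_comparison + pattern_recognition + processing_speed ~ group", data=df)
print(manova.mv_test())   # Wilks' lambda and friends; follow-up tests still need the care discussed above
```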
Having looked back over the answers to Catherine's opening question I think I can add a bit more information. This is instalment 1.
I reminded myself (somehow) of a Monte Carlo study I included in my Masters thesis submitted many lifetimes ago (don’t ask, I’ll tell: IBM mainframe, Fortran coding, Hollerith cards, overnight runs, agonising minor errors).
I was simulating a one IV design with 10 conditions, examining 9 orthogonal planned contrasts amongst the sample means, each applied to 10 repeated measures on one DV. In the present context, the repeated measures could just as well be 10 standardised dependent variables (hence the relevance to Catherine’s question).
I ran 200 ‘experiments’ and varied the population correlation between the repeated measures (DVs here) from zero between all DVs to 0.90 between all DVs. Uniformity of dependence between DVs is not realistic in a real world context, but suffices for what I want to say here.
I examined the case where all IV means were equal in the population (with alpha = 0.05 per contrast), as well as the case where there was a pattern of population mean differences such that the power of each contrast test was 0.95 (Type II error rate per contrast, beta, equal to 0.05).
1. There were two points worth noting in the resulting data:
a) The alpha and beta rates were as intended (0.05) over all experiments and all DVs within experiments. That is, the overall alpha and beta rates were not affected by the dependency between the DVs.
b) However, what changed with increasing dependency between the DVs was the pattern of errors within each experiment. When the DV correlation was zero, the pattern of errors per experiment was random, but with increasing dependence the errors within experiments tended to clump together, strikingly so when the population correlation between the DVs was 0.90.
2. The upshot of this information is fairly obvious, I guess.
a) When there is no population dependency between the repeated measures or DVs in a study, then the conditional probability of making an error given that you have already made one is the relevant error rate (alpha or beta, as appropriate). That is, the errors are statistically independent.
b) When there is dependency between the repeated measures or DVs in the population, then the conditional probability of making an error given that you have already made one is greater than the relevant error rate, to a degree that grows with the dependency. (A small simulation sketch of this follows point 3 below.)
3. I’m not sure, but I think 2(b) may be in conflict with Jochen Wilhelm’s postscript.
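A small simulation in the spirit of John's Monte Carlo (my own sketch, with arbitrary choices of k, n and rho): two groups drawn from the same population on k correlated DVs, one t-test per DV. As rho grows, the chance of at least one spurious 'significant' result falls below the independence value 1 - (1 - alpha)^k, which is the sense in which Bonferroni over-corrects under dependence, and the errors that do occur clump within experiments.

```python
import numpy as np
from scipy import stats

def familywise_error_rate(rho, k=10, n=30, alpha=0.05, n_sims=3000, seed=7):
    """Proportion of simulated 'experiments' with at least one false positive across k DVs."""
    rng = np.random.default_rng(seed)
    cov = np.full((k, k), rho)
    np.fill_diagonal(cov, 1.0)
    hits = 0
    for _ in range(n_sims):
        g1 = rng.multivariate_normal(np.zeros(k), cov, size=n)   # group 1: n participants, k DVs
        g2 = rng.multivariate_normal(np.zeros(k), cov, size=n)   # group 2: same population (all nulls true)
        pvals = stats.ttest_ind(g1, g2, axis=0).pvalue           # one t-test per DV
        hits += (pvals < alpha).any()
    return hits / n_sims

for rho in (0.0, 0.5, 0.9):
    print(f"rho = {rho}: P(at least one Type I error) ~ {familywise_error_rate(rho):.3f}")
```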
John, thank you for your valuable contributions. It is a great pleasure to read!
To your point 2b: Dependencies will create "clumped" errors, but if one considers all possible dependencies, I suppose you will sometimes get more and sometimes fewer errors than expected, so that Bonferroni will be too conservative in controlling the long-run error rate over the (hypothetical) population of all possible dependencies. I lack the mathematical skills to prove this (nor would I understand a proof if I encountered one...).
So I must agree with your point 3: I am also not sure whether 2b is in conflict with my earlier postscript.
Jochen and others:
I'll attach a PDF file to this reply that gives you the graphs I made of the Monte Carlo Type 1 errors I mentioned in my last reply (I drew the graphs by hand, on squared graph paper, such was the technology available at the time!).
Before I say more about the graphs, I should sharpen my 2a and 2b statements above, as follows:
2a) When there is no population dependency between the repeated measures or DVs in a study, then the conditional probability of making an error in one measure/DV given that you have already made one with another measure/DV, is the relevant error rate (alpha or beta, as appropriate). That is, the errors are statistically independent.
2b) When there is dependency between the repeated measures or DVs in the population, then the conditional probability of making an error on one measure/DV given that you have already made one on another is greater than the relevant error rate to a degree that grows with the dependency.
And I should add important caveats to both those statements:
• About 2a: As previously mentioned, my Monte Carlo runs were structured according to a repeated measures ANOVA design. The use of a common error term in such a design introduces a slight dependency in the Type 1 (and Type 2) errors obtained in the data. It's not visible in the relevant graph, but it was there (faintly).
• About 2b: The planned comparisons I examined did not involve interactions across the repeated measures; the same set of (group) contrasts was applied to each of the 10 repeated measures. Had I been more realistic and included interaction contrasts between the group IV and the repeated measures, then the dependency between contrasts would have been more pronounced than I found, even with low population correlations between the repeated measures (and my original wording of 2b would have sufficed!).
More about the graphs:
1. There were 200 'experiments' each involving 10 repeated measures (which for the present purpose we are construing as DVs). The experiments (and the DVs nested within them) are represented on the horizontal axis, and the number of Type 1 errors on the vertical axis. Tick marks demarcate the experiments, with every second experiment having a thicker horizontal axis to try to make experiment boundaries more apparent. There are 20 experiments per line.
2. You will need to enlarge the graphs on your screen to see them clearly.
3. The term 'Rho' included in the graph title at the top of each graph is the population correlation between the repeated measures/DVs, and the word 'Occasions' in the title refers to the repeated measures/DVs.
4. What you will see, Jochen, is what you anticipated: as the dependency between the DVs increases, the Type I errors clump together and so, correspondingly, do the correct retentions of the null hypothesis (seen as increasing clumping of blank spaces within experiments as the population dependency increases).
Hi, yes you need to use the Bonferroni correction for a series of t-tests, as each time you test you are increasing your risk of a Type I error. So each t-test is expensive as far as statistical safety is concerned. Look up Bonferroni and then you can test more without the worry or cost!!
Be warned: This is a LONG posting
I’ve been vacillating about whether to make another contribution to Catherine Thevenot’s question about multiple test control (MTC), both because I’ve probably said enough already and because Catherine must have sorted out the issues with the journal editor by now.
Well, I can’t resist another go, prompted in part by the conflicting replies to Catherine’s question and by the more general issues those replies have raised.
My approach here is to consider what I would look for as an editor, depending on the character of the research in front of me.
A. My ideal quantitative hypothesis-based research:
a. provides a clearly expressed chain of argument from theory (and previous evidence) to prediction to method to hypothesis to statistical model to analysis and then to interpretation;
b. involves a pattern of predictions that follows closely from the author’s theory, but is less plausible from the perspective of another theory;
c. ensures that the links between the hypotheses (expressed in method-based variables) and the statistical models (tests) are very tight, ‘like hand in glove’ (refer to Rosenthal & Rosnow (1985) whom I have cited previously).
When such conditions are met then, as far as MTC is concerned, I would not require any adjustment for the multiple tests involved if three further provisos were to be met:
d. the statistical tests (amongst the IV conditions) are, as far as possible, uncorrelated
e. if there are multiple DVs involved, they measure different constructs, whether or not they are correlated.
How do I justify this approach, and how does it relate to Catherine’s question?
f. In the circumstances I have described, the theoretical lens is focussed at the decision level, not the whole experiment (which may include other considerations – more on that later). Put another way, each statistical test is tightly meshed into an aspect of the framing theory and has direct implications for that aspect of the theory. The rates of error (1 and 2) should be set at that level.
g. Even so, as an editor or reviewer, I’d also be looking at how well the total pattern of evidence is congruent with the predicted pattern before being convinced about the theoretical cogency of the research.
h. From what I have gleaned about Catherine’s research, it might fall under this heading. Much would depend on the theoretical distinctiveness of the DVs, how theoretically unique the pattern of her predictions is, and how well it is confirmed in her statistical evidence. Everything else equal (I did study economics in high school!), a two group design is probably not sufficiently ‘complex’ to satisfy my uniqueness criterion, but that would depend on the theory and previous evidence.
B. What about quantitative / hypothesis testing research that is less than my ideal? Well, obviously that depends on where the blemish lies.
a. If there is a disjunction between the hypotheses and the statistical tests (as is so often the case) I would require a reanalysis. A frequent case is the use of a multiple-IV ANOVA to address questions that are only part of the full design. In these cases, ANOVA F tests often precede tests that are theoretically relevant, and the latter are often tested in a profligate manner (all pairwise comparisons, for example). I dislike such research for three reasons:
i. ANOVA designs rarely get to the substance of theoretical predictions in the social and behavioural sciences. They might be appropriate for agricultural research (where Fisher first devised them), but they are usually too blunt to be theoretically cogent for our needs (hence the frequent resort to subsidiary tests).
ii. When used as an MTC, ANOVA does not control Type 1 errors at the nominal alpha for any subsidiary tests, as I have explained in earlier postings.
iii. Theoretically relevant errors are clumped on an experiment-wise basis: Type 2 if global F is not significant, Type 1 if it is significant. (Think of a Monte Carlo study like the one I described in earlier postings, in which there are no differences between the population means in a one-IV design, experiment-wise alpha is 0.05, and 200 'experiments' are run . If all assumptions are met, 10 of 200 ‘experiments’ would have a significant global F, and only in those 10 would further tests be performed, some being ‘significant’ also. In 190 experiments there would be no Type 1 errors, but in 10 others there would be smiles all around. That’s what I mean by experiment-wise clumping.)
What would I (as an almighty editor) require in the reanalysis? Well, much would depend on how solid the chain of argument was (see Aa above). My judgment would turn on whether the hypotheses are sufficiently enmeshed with the theory and are sufficiently frugal in number (as uncorrelated contrasts dictate, for example) to warrant a test-wise alpha. If so, I’d probably ask that Rodger’s post hoc method be used rather than a planned contrasts approach.
If the chain of inference is ‘modest’ in substance, I’d require that the effective alpha depend on the degrees of freedom needed to include all the theory relevant hypotheses. For example, if there are seven theory-relevant hypotheses embedded in a larger ANOVA design, I’d ask that those seven hypotheses be extracted and tested with suitable contrasts, using a conservative MTC like Scheffé (with numerator df=7), or Bonferroni (with decision alpha = ‘usual’ alpha divided by 7). I doubt that I’d be charitable enough to suggest Hochberg, particularly if the papers by Ioannidis (2005) and by Simmons, Nelson, & Simonsohn (2011) came to mind (refer to a previous posting of mine).
It’s likely that I would also point out how wasteful the original study was by using an off-the-shelf ANOVA design when a more parsimonious bespoke one would have been more cogent.
b. What if the research is unashamedly ‘exploratory’, with only a loose theoretical framework, and even looser connections between theory, method, hypothesis and stats? I’d probably reject it (I’m being stern, remember), but at the very least I’d expect a conservative post hoc MTC like Scheffé, Bonferroni or Tukey using an experiment-wise (not family-wise) alpha. There’s no doubt that I would have the concerns of Ioannidis (2005) and Simmons, Nelson, & Simonsohn (2011) uppermost in my thinking about such research. This clearly puts me at odds with Romola Bucks and Laurence Nolan in their earlier posts.
C. One requirement I would stipulate for all of the above cases is that a suitably adjusted index of effect size be reported for all tests. This often provides a salutary reminder of the modest contribution our procedures make amidst the fuzz of individual differences and measurement error.
This requirement might (or might not) be helpful in the exploratory case: even if a contrast is not 'significant' according to a conservative MTC, its effect size might be sufficient for it to be included in the discussion.
D. At some later time I’ll make shorter posts about two issues we’ve all ignored or taken for granted, namely that:
a. We are using inappropriate conditional probabilities for our tests
b. Most behavioural and social science research does not use random sampling (as distinct from random assignment)
John, awaiting your next postings :)
Thank you for taking the time and effort.
I still wonder how p-values would help in your scenario A. If effect sizes and estimation uncertainties are given (as you pointed out in C), what more would a p-value (or a formal "test") add? The same essentially applies to your scenario B-b (exploratory research).
We are still at two different levels here: the level of the study (including a theoretical framework, logical reasoning, expert knowledge, consolidated opinions...) and the level of "industrial quality control" (unsupervised, controlling error rates based on cost-benefit calculations, in particular not including any further "soft" information available).
It would be an interesting approach to ask the JOURNALS to control THEIR false-discovery rates... each new publication would require re-judging the old results.
Hi everyone,
This is a very interesting thread, although I'm a bit late in discovering it. I thought I could perhaps contribute to the discussion by pointing you towards discussions nearer to my own field in the medical sciences. A seeming advantage of biomedicine is that an underlying "metaphysical truth" is perhaps more easily identified than in psychology, as we have successive layers of evidence from biochemistry, molecular/cell biology and (animal) physiology to build upon in a line of evidence.
Now to the issue of testing multiple complexly related hypotheses. Say your department has a line of research and you of course investigate multiple research questions and perform different experiments. The very similar questions usually end up in one paper (and here you are obliged to correct for multiple testing); if questions are slightly less similar and you have a whole bunch of evidence for each one, the line of research gets split up into several papers and you don't have to correct for multiple hypotheses. Some lines of inquiry might lead nowhere, and you of course don't publish that; the negative data (from what only in hindsight were uninteresting hypotheses) don't feature anywhere in anything.
Now, where am I going with this? Have a look at the following article: http://www.nature.com/nature/journal/v483/n7391/full/483531a.html
It describes a survey that shows that of the 53 most promising lead compounds in cancer research only 6 were found to show reproducible effects (reproduced by other labs). It should be added that the evidence for each lead compound (paper) here consists of a full paper with successive experiments: target binding studies, studies on cellular function, animal experiments and even human pilot trials. Each is rock-solid when considered in isolation. Commonly there was even a stringent multiple testing correction (usually some form of FDR control) in the first line of evidence (i.e. the cell-based screening assay), but not further down the line. The hypotheses were also seemingly unrelated: one compound targets, say, cell division, another perhaps apoptosis. A world of difference to a cell biologist. Only when you take the field as a whole do you see the high false positive rate.
I guess multiple-testing correction does kill your data (and hey, my own thesis or even my entire lab also relies heavily on the absence of multiple hypotheses correction and on selective data publication). I just wonder how sustainable it is for science as a whole. Also, how do you correct for it? Should you include multiple levels of hypotheses generated per lab/department, per (sub-)discipline...?
So, what do you think?
- Hendrik
I think: We should forget about hypothesis testing in basic research. But we should build up a culture of (dis-)confirmation.
In my eyes it is a very bad (and unscientific!) system where authors claiming new discoveries ("for the first time we...") get all the merit, whereas thorough and repeated confirmation is not at all rewarded (what high-impact journal would publish a manuscript just stating that the results of study xy were (dis-)confirmed?). This is, in my opinion, one of the biggest shames of our scientific culture/system. Gaining knowledge is a long, hard, stony process consisting of a few discoveries and many thorough confirmatory experiments. Instead, our discoveries are too often based on chance (throwing away all data that are not in line with them before publication) and on a severe lack of public confirmation (confirmation sometimes is done, in many labs, before they start to work on a project; but failing to reproduce already published results is usually not published, and positive confirmations are not published either).
And, Hendrik, there is no point in controlling error-rates as long as the results are pre-selected...
Hendrik: I'll reply quickly because it's getting late Downunder. Some of the problem you have outlined is included in the paper by Ioannidis to which I made earlier reference (see below).
Perhaps some of the answer to the problem lies in ensuring that the patterns of outcomes we predict are appropriately complex (which underpins the uniqueness issue I raised above) AND appropriately tested and reported. In your context, complexity of pattern would surely include the various 'layers' of evidence to which your labs have recourse nowadays. So, to be pompous and make assertions about which I know nothing of the costs or complexities, I guess the whole pattern of findings should be published in one paper, or at least collated in one paper not too long after the individual components have been published.
Another part of the problem is that statistical inference is Clayton's replication (dare I rehearse it? The replication you have when you don't have any replication). Frankly, there really is no substitute for replication, no matter how clever we become with statistical devices. In your context, the question may become 'Replication of what, specifically?'. Here I get recursive: see para 2 immediately above.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), 0696-0701.
Jochen
It would appear that we were replying to Hendrik at about the same time with a similar punchline: we need a replication culture, and statistical inference is no substitute for it.
Also, your last point is one of several concerns addressed by Simmons, Nelson & Simonsohn (2011): the ease with which research can be 'optimised' for publication.
Simmons, J. P., Nelson, L D., & Simonsohn, U. (2011). False-positive Psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science Online, Oct 17, 1-8.
Thanks John! I've missed the Ioannidis paper (and the point you were making) in your earlier post. I can't say I understand all of it, I guess I'll re-read it a couple of times until it sinks in.
I wanted to make a small but important edit to my long posting, but it was too long for the edit window!
It applies to the last, brief paragraph under B(a):
"It’s likely that I would also point out how wasteful the original study was by using an off-the-shelf ANOVA design when a more parsimonious bespoke one would have been more cogent."
I feared that I'd forget to add an important additional requirement, so I did. The paragraph should end as follows:
"To reinforce this point I would also require that the remaining systematic variance be bundled into one global F test and the result reported without decomposing that variance with subsidiary tests."
If you want a justification for this requirement, read Simmons, Nelson & Simonsohn (2011).
As Rolf Reber said above, you may choose Holm’s sequentially rejective method to correct your p-values.
In fact, I have taken the time to develop a calculator using Excel. It is available for download from my publications page on ResearchGate, and I have attached a link to this post.
The Abstract is as follows:
"This simple Excel calculator allows the user to quickly calculate Holm-Bonferroni sequential corrected p-values. Holm's "step-down" procedure is intended to control the familywise Type I error rate in a less conservative manner as compared with standard Bonferroni correction, and for that reason is considered an attractive method by many researchers. Instructions for use are provided in the spreadsheet. I have tested the calculator myself, but please bring to my attention any flaws that you may notice."
I sincerely hope it works fine and is useful for all interested. Any criticisms or suggestions are welcome.
https://www.researchgate.net/publication/236965525_Holm-Bonferroni_Sequential_Correction_An_EXCEL_Calculator?ev=prf_pub
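For anyone who prefers code to a spreadsheet, here is a minimal Python sketch of the same Holm step-down logic (the textbook procedure, not the author's calculator):

```python
def holm_adjust(pvals):
    """Return Holm step-down adjusted p-values, in the original order of the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])      # indices from smallest to largest p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * pvals[i])             # multipliers m, m-1, ..., 1
        running_max = max(running_max, adj)               # enforce monotonicity of adjusted p-values
        adjusted[i] = running_max
    return adjusted

print(holm_adjust([0.01, 0.04, 0.03, 0.20]))   # compare each adjusted p with your alpha
```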
Jochen
I'm replying to your comment of a day or so ago beginning: "I still wonder what p-values would help in your scenario A." I won't respond to the details, just to the general sentiment.
I was reminded immediately of a conversation I had a couple of months ago with Robert Lockhart, an Emeritus Professor of Psychology at The University of Toronto. He published a stats book in 1996 (see below), a major feature of which was its emphasis on making confidence judgements, not statistical decisions. Although it found a good market, its uptake was more limited than Robert had hoped. We agreed immediately on the reason: my avatar, The Editor.
Journal editors are bombarded with manuscripts, often on a daily basis, and they need to make choices amongst them. It's already difficult enough making the kinds of judgements I outlined earlier without also having to balance the numerics of confidence judgements. So, even though the use of confidence intervals has had a history as long as statistical significance testing, the latter has long since won the duel, unfortunately.
Lockhart, R. S. (2006). An Introduction to Statistical Data Analysis in the Behavioral Sciences. University of Toronto Bookstores, Toronto.
In addition to my previous post - here is the NEW link to the Holm-Bonferroni calculator. The old link doesn't work anymore. Thanks everyone.
Holm-Bonferroni Sequential Correction: An EXCEL Calculator
John, thank you for your answer. I certainly see the point. Interestingly, this brings me back to the lack of an appropriate "scientific culture". The chain of action seems to me to be as follows:
1) scientists would like to acquire "knowledge", that is, unravel "interesting" observations. The main required properties (besides having the knowledge, brains, talent and creativity) are curiosity and a play instinct.
2) they need financial support for their work and to live.
3) financial support is given related to the performance of the scientist.
4) performance is measured as publication output/cumulated impact points
5) good performance promotes a scientist's career and increases her/his influence (power) and salary. High-ranked scientists can become quite rich. Bad performance will result in cutting the support and losing the job. This makes science a business.
6) growing up in our culture frequently turns the primary aims (curiosity) into the socially valued aims which are career, making money, being influential/powerful.
7) in order to reach these aims and to avoid getting kicked out, scientists are forced to display good performance. Since this is measured in the number of impact points, they have to produce impact points by publications.
I don't have a solution for how to make the system better. By some means we have to allocate resources, and we have to avoid major waste of funds. So somehow we need to separate "good" and "bad" scientists. But an appropriate evaluation of a scientist's contribution can't be based merely on publication output. Like in nature, the evolutionary "fitness" can only be assessed retrospectively, which is a major problem here.
It would probably have a positive effect on scientific quality/honesty/trustworthiness to uncouple personal career, financial and job perspectives from publication output. But I have no good idea how else to decide which person should get a chair, for instance. Every alternative I can think of has its own adverse effects. However, not getting more money for being a professor (or a "successful" professor) would keep away all the people just playing the game for personal reputation and big money. This might put the actual research more into focus. Further, I would better separate academic/basic science (really "playing around" in areas where it is not clear what to expect), applied science (identifying those aspects of the output of basic science that might be used for practical purposes and developing the principles of how this may be translated into practice), and "industrial science" (adopting and optimising those principles, developing the actual technology). We get more and more into the situation that basic research applications have to be justified by their (expected) value for society/the economy. There are many examples of (in my opinion) adverse effects: atomic energy (contrast: very little basic science on strange compounds which might help to solve today's problem of storing electric energy), genetically modified organisms in agriculture (contrast: lack of knowledge of available cultivars and their properties), automobility (contrast: what research was conducted on really alternative concepts of mobility? the combustion engine was optimised for 100 years, the electric motor remained essentially unchanged). To give just three more or less debatable examples. But now it's really off-topic.
Jochen, and others:
You may be interested in reading, and listening to, the following:
Yong, E. (2012). Bad copy. Nature, 485, 298-300.
Ed Yong talking to Russ Roberts, June 4, 2012: Yong on Science, Replication, and Journalism. http://www.econtalk.org/archives/2012/06/yong_on_science.html
Nosek, B. A., Spies, J., & Motyl, M. (2012). Scientific Utopia: II - Restructuring Incentives and Practices to Promote Truth Over Publishability. Social Science Research Network. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2062465
Brian Nosek talks to Russ Roberts, September 10, 2012: Nosek on Truth, Science, and Academic Incentives. http://www.econtalk.org/archives/2012/09/nosek_on_truth.html
Holm's correction is an improvement over the Bonferroni correction but is considered conservative. Hochberg's correction is considered by some to be preferable to Holm's as it is slightly less conservative and closer to the false discovery rate (FDR). Benjamini and Hochberg developed the FDR and it is probably the best approach to the multiple comparison problem (but may be harder to implement without ready made software).
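If ready-made software is the sticking point, the statsmodels package implements all of these corrections; here is a hedged sketch with made-up p-values ('simes-hochberg' is statsmodels' name for Hochberg's step-up procedure, and 'fdr_bh' is Benjamini-Hochberg).

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.004, 0.012, 0.030, 0.045, 0.200]
for method in ("bonferroni", "holm", "simes-hochberg", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:15s} adjusted p = {[round(p, 3) for p in p_adj]}  reject = {reject.tolist()}")
```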
If you want something simpler, you can correct for the correlation between your outcome variables:
Sankoh, A. J., Huque, M. F. & Dubey, S. D. (1997). Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Statistics in Medicine, 16(22), 2529-2542.
Bundling all your outcome variables into one multivariate analysis is a good solution, as was suggested earlier. But it still leaves you with the post-hoc analyses to answer specific questions. Planning your analyses in advance helps, where you plan to do certain comparisons (rather than all) to address specific research questions.
A while back in this thread I said I would make brief comments on two issues that we hadn’t considered thus far. They probably deserve a separate thread, but they are generally relevant to the issues that Catherine’s opening question has prompted.
Here I’m focussing on the difference between random sampling and random assignment and the corresponding statistical models they give rise to. Bear in mind too that I think of statistical inference as Clayton’s replication, the replication you have when you don’t have replication.
The distinction I’m addressing has had a long history in the literature, so much of what I say here is better and more extensively argued by others (e.g., Edgington & Onghena, 2007; Ernst, 2004; Manly, 1997). I’ll just make a number of bald statements.
RANDOM SAMPLING (from a population):
a) assumes that every ‘member’ of a ‘population’ has an equal and independent probability of being sampled in the study
b) is rarely achieved in the behavioural and social sciences because members of the hypothetical population(s) cannot be enumerated, and hence their equal and independent probability of being sampled cannot be assured
c) encourages inappropriate generalising of the findings beyond the data obtained in the study
d) tacitly contributes to the paucity of exact replication.
RANDOM ASSIGNMENT (to the conditions of the study):
e) can be achieved in some research (in so-called ‘true experiments’)
f) can be emulated by computer randomisation and partial randomisation methods whether or not random assignment has occurred
g) can be used without reference to populations or to population parameters
h) explicitly discourages generalisation beyond the study (when so used).
RANDOMISATION TESTS:
i) have been used as non-parametric benchmarks for random sampling tests
j) are better construed as study-specific statistical tests because they are based solely on the data obtained in the study (e.g., Ernst, 2004, p. 677).
SO, RANDOMISATION TESTS:
k) share a common assumption with much qualitative research, namely that the findings pertain only to the study concerned (and its implications for theory).
l) might encourage direct replication if researchers come to realise that they cannot generalise beyond the study being reported.
FOND HOPE:
m) that (l) happens, eventually.
REFERENCES
Edgington, E. S. & Onghena, P. (2007). Randomization tests. Boca Raton, FL: CRC Press.
Ernst, M. D. (2004). Permutation methods: A basis for exact inference. Statistical Science, 19(4), 676–685.
Manly, B. (1997). Randomization, Bootstrap, and Monte Carlo Methods in Biology (2nd edition). London: Chapman & Hall.
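To make point (j) concrete, here is a minimal sketch of a two-group randomisation test in Python (the data and group sizes are invented): the reference distribution comes from re-randomising the assignment of the obtained scores, not from an appeal to a sampled population.

# Study-specific randomisation (permutation) test for a difference in means.
import numpy as np

rng = np.random.default_rng(0)
group_a = np.array([12.1, 14.3, 11.8, 15.0, 13.2])
group_b = np.array([10.4, 11.1, 12.0, 9.8, 10.9])

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_perm = 10_000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)               # re-randomise the assignment
    diff = shuffled[:n_a].mean() - shuffled[n_a:].mean()
    if abs(diff) >= abs(observed):                   # two-sided comparison
        count += 1

print(observed, count / n_perm)   # the p-value refers only to the data in this study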
The central issue of this thread is the basis upon which an adjustment (if any) should be made for multiple statistical tests performed in one study. Several contributors have noted that this question is not a simple one, for example:
* Jochen Wilhelm on May 8, 2013
* Athanasios Mazarakis on May 8
* Marek Nieznanski on May 8
* Graham Edgar on May 9, 2013
* Victor Mark on May 9
Much of the applied stats literature has assumed for many years that the decision-based alpha (the alpha for each contrast or comparison made in a study) should be set in such a way that the probability of erroneously rejecting one or more null hypotheses is no more than one of the ‘conventional’ alphas, say 0.05. The latter has come to be referred to as the Familywise Error Rate (FWER), whether it is based on the entire study (e.g., Ryan, 1962) or subsets thereof (e.g., ANOVA main effects and interactions). Quite a number of methods have been proposed to achieve this objective, ranging from Scheffé’s (1953) method to the Holm and Hochberg variations of the Bonferroni correction. The Bonferroni correction is just an approximation to the calculation that applies when statistically independent decisions are made; refer to the following web site for a succinct summary:
http://www.fon.hum.uva.nl/praat/manual/Bonferroni_correction.html
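To see how small the difference is in practice, here is a quick sketch in Python (the number of tests is set to four, matching the four tasks in the opening question; this is purely illustrative):

# Per-test alpha under the Bonferroni approximation versus the exact (Sidak)
# calculation that holds for statistically independent tests.
alpha, k = 0.05, 4

bonferroni = alpha / k                 # 0.0125
sidak = 1 - (1 - alpha) ** (1 / k)     # about 0.01274

print(f"Bonferroni per-test alpha: {bonferroni:.5f}")
print(f"Sidak (exact) per-test alpha: {sidak:.5f}")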
A central point made by Victor Mark (when citing the Brandt article) is that, when a FWER procedure is used, the likelihood of revealing a true effect is inversely dependent on the size of the ‘family’ in which it is being tested. In other words, avoid using large experiments, or pretend that they are smaller than they are (as the Holm and Hochberg methods essentially do). Graham Edgar makes this point in inverse form:
“Given that many researchers will conduct, and report, a series of closely-related studies, should a 'career Bonferroni' be applied?!”
Although not well known (but thoroughly peer-reviewed), the approach taken by Rodger (1974, 1975a,b) ensures that the decision-based alpha does not depend on the size of the ‘family’ in which it is embedded, provided the contrasts involved are linearly independent (a slightly weaker form of orthogonality). The procedure is based on a reintegration of the family-wise F distribution (Rodger, 1975a) that reverses the logic of the FWER. The full scope of Rodger’s approach has been summarised and implemented (in SPS) by Roberts (2011).
So, if you are going to persevere with ‘random sampling’ methods, and if you are undertaking relatively small scale studies (as distinct from the huge ones that have motivated False Discovery Rate methods), I urge you to become familiar with Rodger’s approach and with Roberts’ software. The method is no less restrictive in its assumptions (about independent decisions) than Bonferroni and its variants (Holm, Hochberg), but potentially far more informative and powerful.
Roberts, M. (2011). Simple, powerful statistics: An instantiation of a better ‘Mousetrap’. Journal of Methods and Measurement in the Social Sciences, 2(2), 63-79.
Rodger, R. S. (1974). Multiple contrasts, factors, error rate and power. British Journal of Mathematical and Statistical Psychology, 27, 179-198.
Rodger, R. S. (1975a). The number of non-zero, post hoc contrasts from ANOVA and error-rate I. British Journal of Mathematical and Statistical Psychology, 28, 71-78.
Rodger, R. S. (1975b). Setting rejection rate for contrasts selected post hoc when some nulls are false. British Journal of Mathematical and Statistical Psychology, 28, 214-232.
Ryan, T. A. (1959). Multiple comparisons in psychological research. Psychological Bulletin, 56, 26-47.
Ryan, T. A. (1962). The experiment as the unit for computing rates of error. Psychological Bulletin, 59, 301-305.
There is a sense in which all of science should be subject to some sort of correction if we believe in p-values and significant hypothesis inference testing (an approach that the 'new statistics' rejects as worthy of its acronym, and that some journals have banned).
If I do 20 unrelated experiments of a dichotomous pairwise nature with randomly selected hypotheses/predictions, there is a strong likelihood that at least one of them will reach p < .05 by chance alone: with all nulls true and independent tests, the probability is 1 - 0.95^20, or about 0.64.
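To put a number on that, here is a small simulation sketch (Python with NumPy and SciPy; the group sizes are invented) of 20 unrelated two-group experiments in which every null hypothesis is true:

# How often does at least one of 20 independent null experiments reach p < .05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_experiments, n_per_group = 5_000, 20, 30

hits = 0
for _ in range(n_sims):
    for _ in range(n_experiments):
        a = rng.normal(size=n_per_group)
        b = rng.normal(size=n_per_group)          # same distribution: H0 is true
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
            break                                  # at least one 'significant' result

print(hits / n_sims)   # close to 1 - 0.95**20, i.e. about 0.64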
What you're ignoring is the fact that the probability of doing an experiment depends, in a sense, on the success of the previous experiment. Science is not like coin flipping, where you keep flipping until you get heads (or whatever). If I do a manipulation that does not produce a significant result, the chances that I will do another experiment like it are drastically reduced. If I do ten experiments that do not produce a significant result, the chances of my ever being allowed to conduct another experiment would be close to zero (although impossible to prove mathematically, I think most people would agree that is the case, sadly).
But, Brian, this is unfortunately not the case in all fields. There are many screenings done, for example, to find an effective anti-cancer drug. Such screenings usually give several possible candidates. Even from careful theoretical considerations one might "design" a particular drug; but one will usually also test chemical relatives, and further one will test different modes of application for the drug(s). Taken together, there will be a multitude of studies investigating one or more "candidates", and getting some significant results from worthless compounds is likely. Example: Roche initiated and sponsored 81 studies to "prove" the effect of Tamiflu (an antiviral used against H1N1 influenza). Nine were published, all showing very slight positive effects. I find it likely that these results are a collection of false positives. OK, I may be too harsh, since Roche says that it is going to publish all results. But read what they write (off-topic!):
http://www.rochetrials.com/
"Data reported on this Website may differ slightly from published or presented material and from data reported in the prescribing information for each product. These differences reflect standard differences in reporting data such as, but not limited to, use of arithmetic and geometric means and medians, reporting of interim and final analyses and differences in data handling rules specified in study protocols and country specific standards adopted by individual Health Authorities."
and further:
"While Roche is making great efforts to include accurate and up-to-date information through monthly updates, Roche makes no representations or warranties, express or implied (including, but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement), as to the accuracy or completeness of the information provided on this Website and disclaim any liability for the use of this Website or any website linked to it. Roche may change this Website at any time without notice but does not assume any responsibility to update it. All users agree that all access and use of this Website, and any website linked to from this Website and the content thereof, is at their own risk. [....]"
So:
- the original data is not shown
- the results may differ from published results
- no warranties about fitness for a particular purpose
- no warranties about the correctness of the data/results presented
- everything may be changed and edited
Wow.
Bonferroni is rather conservative; I use the FDR, which is more "realistic":
Benjamini, Y., Krieger, A. & Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 93, 491-507.
http://dx.doi.org/10.1093/biomet/93.3.491
I agree that standard Bonferroni is too conservative, especially in fields such as phylogenetics where you might be dealing with hundreds of dependent contrasts.
Fortunately, the Holm-Bonferroni method serves as a sequential and more liberal means of controlling the omnibus Type I error rate.
I have attached a link to an update of my previous Holm-Bonferroni p-value calculator (see also my ResearchGate publications). This version allows you to calculate up to 10,000 corrected p-values at one time.
Please cite this calculator as Justin Gaetano (2013) if you use it for data to be published. All criticism welcome: [email protected].
Reference:
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65-70.
Attachment: Holm-Bonferroni Sequential Correction: An EXCEL Calculator - Ver. 1.2
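For anyone without Excel to hand, the step-down logic of Holm's (1979) procedure can be sketched in a few lines of Python (the p-values are invented for illustration; this is not the calculator itself):

# Holm step-down adjustment: multiply the smallest p by m, the next by m-1, ...
# and enforce monotonicity so adjusted p-values never decrease.
def holm_adjust(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = (m - rank) * pvals[i]
        running_max = max(running_max, candidate)
        adjusted[i] = min(1.0, running_max)
    return adjusted

print(holm_adjust([0.01, 0.04, 0.03, 0.20]))   # approximately [0.04, 0.09, 0.09, 0.20]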
To be very subjective on this... I like the false discovery rate as an alternative to horrifically conservative approaches like the Bonferroni correction ;)
I have some questions about the Bonferroni correction and was wondering if any of you could kindly help me with them as experts in statistics. I really appreciate it!
Here is my situation: I am currently writing my dissertation, in which I used a survey to collect data from two groups of participants and used SPSS to analyze the data. In order to answer Research Question Two, “How do the competent and less competent university students differ in terms of their degree of metacognitive knowledge about English writing?”, two sets of independent-samples t-tests were conducted on the two groups of participants (group A and group B). One compared the two groups on the mean scores of the three sub-scales (person, task, and strategy variables), and the other compared the two groups item by item (there are 50 items in the survey). My questions are as follows: Question 1. Since there are only two groups of participants involved in this study, is it necessary to perform a Bonferroni correction to reduce Type I error? I thought the Bonferroni correction was meant for 3 groups or more, according to the video clip below: https://www.youtube.com/watch?v=BK7Ay49jM_8
Question 2. If an independent t-test is conducted to compare the means of the two groups item by item across the 50 items in the survey, is that considered one t-test or 50 t-tests? In SPSS, I only needed to run the procedure once to compare the two groups on all 50 items.
Thank you so much! Best regards, Mike
To Daniel:
Thank you so much for your explanations ! That was very helpful!
May I ask you some related questions?
Is the Bonferroni correction meant only for 3 or more groups of participants, or also for two groups?
It seemed to be for comparison among 3 groups, according to the video clip below: https://www.youtube.com/watch?v=BK7Ay49jM_8
Also, do you happen to know how to perform a Bonferroni correction in SPSS for the t-tests conducted between two groups?
I appreciate your help!
Best regards,
Mike
I have a related question: do I need to apply a Bonferroni correction when the DVs used in a series of ANCOVAs are not significantly correlated?
Joan: two points.
1. What correlation between DVs does is increase the conditional probability of a Type I error on one DV, given a Type I error on another DV. However, correlated DVs don't alter the overall Type I (or Type II) error rate. Put another way, as the correlation between DVs increases, the errors increasingly co-occur, but they are offset by co-occurring correct decisions in other replications, leaving the overall proportion of Type I errors unaffected (see an earlier posting of mine in which I briefly described some Monte Carlo simulations, and the small simulation sketch after this post).
2. If it appears to the editor and reviewers that your DVs are conceptually redundant (a significant finding on one being as useful as any other for your hypothesis/conclusion) then, as Daniel Krause says, you'll be asked to adjust in some way for that redundancy. However, if the DVs are such that one interpretation predicts one pattern of relationships across the DVs, but another interpretation predicts a different pattern, then you should consider a multivariate analysis in which those patterns are specified as designs on the DVs and tested like planned interaction comparisons.
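As a supplement to point 1, here is a small simulation sketch (Python with NumPy and SciPy; the correlation, sample size, and number of replications are invented) of two strongly correlated DVs with both null hypotheses true:

# Correlated DVs: Type I errors co-occur more often, but the per-DV error rate
# stays at about alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n_per_group, rho = 10_000, 30, 0.8
cov = [[1.0, rho], [rho, 1.0]]

errors = np.zeros(2)
both = 0
for _ in range(n_sims):
    a = rng.multivariate_normal([0.0, 0.0], cov, size=n_per_group)
    b = rng.multivariate_normal([0.0, 0.0], cov, size=n_per_group)  # H0 true on both DVs
    sig = [stats.ttest_ind(a[:, j], b[:, j]).pvalue < 0.05 for j in range(2)]
    errors += sig
    both += all(sig)

print(errors / n_sims)   # each entry stays close to 0.05
print(both / n_sims)     # well above 0.05**2 because the errors co-occur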
Thanks Daniel and John for your quick responses. I will try to correct for multiple testing using the formula suggested by Justin.
Here you can read an article about how to calculate the Bonferroni correction: https://reneshbedre.github.io/blog/mtest.html
I reject the tyranny of corrections for multiple tests. I withhold decision about the adequacy of a neurological intervention until I see replication of an experiment by other investigators. Correction for multiple tests effectively penalizes one for using multiple tests. Thus, one can change one's interpretation of data simply by dropping several assessments and thus using a less conservative p-value.
See Brandt J. 2005 INS Presidential Address: neuropsychological crimes and misdemeanors. Clin Neuropsychol 2007;21:553-568; Perneger TV. What's wrong with Bonferroni adjustments. BMJ 1998;316:1236-1238.
I find Bonferroni's approach too conservative and simplistic. In my opinion, the best practice is to clearly report all statistical tests conducted during the analysis. On average, 1 in 20 comparisons between random samples from the same population will be statistically significant due to pure chance, so significant tests reported in isolation mean absolutely nothing and generate noise in the field, wasting people's time and money. I would recommend the following: don't use Bonferroni to gain credibility, report your methods with high accuracy, publish your data on an open platform, and upload your analysis scripts for the reviewers and the whole scientific community to freely check/reuse/extend your work if they wish. Be 100% transparent, and I guess nobody will ask you to correct p-values.