
Here is one example of an analysis and reviewer response I've seen a few times, and I'm wondering what you all think about it. I want to see if I can find a good way to explain it to people, respond to reviewers, and avoid it in the future.

Example: Say we report results from a randomized experiment with a basic 2-condition design (control vs. experimental condition). We test differences between condition means for 20 variables. All variables are "independent" outcomes based on different questions in the survey. I say "independent" in quotes because, while they may be related by coming from the same respondents and may have an objective correlation, they are not mathematical combinations of each other (i.e., they are not re-codes of the same survey item). Some are categorical and some continuous, so we use a combination of independent-samples t-tests and chi-square tests. No multiple-testing adjustments are done. All variables are conceptually relevant, so we're not just fishing here, although the work is exploratory.
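
To make that concrete, here's a rough sketch in Python of the kind of unadjusted analysis I'm describing. The data frame, column names, and variable lists are all hypothetical, and scipy is just one of many tools that would run these tests:

```python
# Hypothetical sketch of the analysis described above: 20 unadjusted tests,
# t-tests for continuous outcomes and chi-square tests for categorical ones.
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency

def run_unadjusted_tests(df, continuous_vars, categorical_vars):
    # df is a made-up data frame with a 'condition' column (control/experimental)
    # plus the 20 outcome columns named in continuous_vars and categorical_vars.
    results = {}
    control = df[df["condition"] == "control"]
    treated = df[df["condition"] == "experimental"]

    for var in continuous_vars:
        # Independent-samples t-test comparing condition means
        stat, p = ttest_ind(control[var], treated[var], nan_policy="omit")
        results[var] = p

    for var in categorical_vars:
        # Chi-square test of independence on the condition-by-response table
        table = pd.crosstab(df["condition"], df[var])
        stat, p, dof, expected = chi2_contingency(table)
        results[var] = p

    return results  # 20 p-values, each compared to alpha = 0.05 on its own
```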

Results: We find one significant difference at the alpha = 0.05 level. 

The reviewer responds with something like, "You have tested 20 items and only found one to be significant, which is the number expected by chance at alpha = 0.05. I don't think these findings are very strong or reliable."

Here are my questions.

Q1: Check my knowledge/reasoning

To me this seems like a misunderstanding of p-values. This is always hard to say in words, but here goes... I understand p-values (and the alpha level) as reflecting the risk of rejecting the null when it's true (the probability of a Type I error). However, they apply only to individual tests, not to the cumulative number of tests you do in an analysis. Reviewer comments like the one above seem to reflect a notion of "study-wise error rate" (i.e., the probability of finding any statistically significant difference where there truly is none in the population), but that's not what p-values are about in my understanding. My statistical training is mostly applied, so I'll admit I may be missing something here.
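
If it helps to make the reviewer's arithmetic explicit, I think this is the calculation they have in mind, under the idealized assumptions that all 20 null hypotheses are true and the tests are independent (which won't hold exactly for items from the same respondents):

```python
# Per-test vs. study-wise error, assuming 20 independent tests and a true
# global null (no real differences) -- an idealization, not my actual data.
alpha = 0.05
m = 20

expected_false_positives = m * alpha      # 20 * 0.05 = 1.0
familywise_error = 1 - (1 - alpha) ** m   # 1 - 0.95**20 ≈ 0.64

print(expected_false_positives)  # 1.0 -- "the number expected by chance"
print(familywise_error)          # ~0.64 -- chance of at least one false positive
```

So the per-test error rate stays at 0.05, but the chance of at least one false positive across the whole set of tests is much higher, which seems to be what the reviewer is reacting to.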

Q2: Should multiple comparison adjustments be done in situations like this? 

I understand the need for multiple comparison adjustments in post-hoc ANOVA tests (and similar), where the various comparisons are known to be correlated because they come from the same items and share the same statistical information, but situations like the one I'm asking about here are less clear. It seems pretty arbitrary how you group all the various tests you might run on a single data set for different purposes (i.e., how would you decide whether 20 or some other number goes in the denominator of the adjustment?).

It seems to me that if the survey items are correlated, then adjustments should be made, but if they are not (i.e., are truly independent), then no adjustments are needed. What do you think? Are there more formal ways to think about this? Is this "study-wise error rate" something I should be thinking about and adjusting for?
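
For reference, if an adjustment were warranted, here is a minimal sketch of what the standard corrections would look like applied to the 20 p-values: Bonferroni (which controls the family-wise error rate) and Benjamini-Hochberg (which controls the false discovery rate and is less conservative). The p-values below are made up just to show the mechanics, and statsmodels is one of several packages that do this:

```python
# Sketch of standard adjustments applied to 20 p-values; the p-values
# here are hypothetical, with one "significant" result out of 20.
from statsmodels.stats.multitest import multipletests

pvals = [0.02] + [0.30] * 19

# Bonferroni: controls the family-wise error rate (any false positive)
reject_bonf, p_bonf, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate instead
reject_bh, p_bh, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print(reject_bonf[0], p_bonf[0])  # False, 0.40 -- no longer significant
print(reject_bh[0], p_bh[0])      # result depends on the full set of p-values
```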

Thanks in advance. 
