I've been researching this problem for several years from a statistical perspective. I haven't heard many rational, purely scientific reasons to always use p ≤ 0.05.
The choice of the cut-off point to accept the alternative hypothesis (the same as rejecting the null hypothesis) is totally arbitrary. The common use of p≤0.05 was chosen by Fisher (1). He considered one chance out of twenty a sufficiently low probability to take the risk of betting on, but argued that each scientist could set his own cut-off point.
I have read some strange interpretations of p-values and their distributions. This is mainly due to a confusion between Fisher's approach to hypothesis testing, where the null hypothesis (H0) can never be both true and untrue, and Neyman and Pearson's alternative approach, which assumes H0 can be true or untrue.
To make things clearer, here are a few points not to forget when using p-values in traditional hypothesis testing:
- Definition of p-values -
P-values quantify the probability of observing concomitant events under the null hypothesis. In other words, they give the proportion of studies that would reveal the observed (non-existing) association had we repeated the study an infinite number of times.
- What we mean by uniformly distributed -
Secondly, when we say that under the null hypothesis (H0) p-values are uniformly distributed, we are describing a probability distribution. It does not mean that in a given study we are guaranteed any particular value such as p=0.2, p=0.03 or p=0.000009; it means that the chances of observing such results follow that uniform distribution of chance.
- Interpretation of p-values -
So to summarize, p-values provide a precise estimate of the probability of the null hypothesis being true. They therefore provide an indication of the internal validity of our efforts to prove our alternative hypothesis wrong. Rejecting the null hypothesis seems easier to justify with p=0.000000004 than with p=0.0499999. We however have to keep in mind that p-values do not provide any indication of the probability of the assumed association being true, they do not provide any indication of the magnitude of differences, and they provide no indication of the probability of wrongly accepting the null hypothesis. Reporting 95% CIs and R^2 is therefore much more informative.
References
1. Fisher RA. The Arrangement of Field Experiments. Journal of the Ministry of Agriculture of Great Britain 1926; 33: 503-513.
Nothing. In fact, people use something else all the time in situations with multiple testing, e.g. GWAS, sequential designs and the like. Often, however, this is done to ensure an overall 5% level ;-)
There is always a price to be paid for higher, respectively lower, type I error levels, as well as benefits. A lower level means fewer type I errors, duh, but for good monotone tests it comes with lower power as well! A higher level is just the opposite: higher power and more type I errors.
In some experimental settings, for instance not-too-expensive simulations where you have access to all the observations you want (or are willing to wait for), you could go almost arbitrarily low in level and just compensate for the power loss with more observations.
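A quick sketch of that trade-off, using statsmodels' power calculator with an assumed medium effect size (Cohen's d = 0.5) and a two-sided t-test; all numbers are illustrative, not a recommendation:

```python
# How many observations per group are needed to keep 80% power while the
# significance level is pushed lower and lower (assumed effect size d = 0.5).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.05, 0.01, 0.001):
    n = analysis.solve_power(effect_size=0.5, alpha=alpha, power=0.8,
                             ratio=1.0, alternative='two-sided')
    print(f"alpha = {alpha}: about {n:.0f} observations per group for 80% power")
```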
I have been looking for situations myself where one might want to control the type II error instead of the type I error. In comparisons of medical departments of very different sizes (patients treated in a given period), I have argued that small departments were protected from being tagged as sub-standard by the power/sample-size relation induced by our type I error obsession.
Situations where I think type 2 errors are more important than type 1 errors: in the natural sciences, protecting endangered species, natural resources management such as fisheries, climate change. In the medical sciences: disease detection where cases left untreated become critical (e.g. cancer, HIV). That's just off the top of my head.
0.05 is a nice round number. People like nice round numbers, but it's not better than 0.0478 or 0.0537, or even 0.0157. The truth is that it sounds better. If you lose 1 finger (or toe), you lose 0.05 of them, and losing less than a finger is not a significant loss of fingers :) Honestly, it's a fashion really, and it can be traced back to the influence of one of the most influential statisticians of all time, Sir Ronald A. Fisher. See the really comprehensive explanation by Gerard E. Dallal here: http://www.jerrydallal.com/LHSP/p05.htm
There is one possible "scientific" reason coming to my mind: to effectively control a type-I error rate, one must stick to a rule. If every scientist sticks to the same rule, the type-I error rate of the community is controlled. If one of the scientists decided for some experiment to reject H0 when p ≤ 0.1 instead, that community-wide control would be lost.
I think using a p-value of 0.05 or 0.01 or 0.001 is up to the researcher and I don't think that scientific journals or the scientific community will question findings because of a certain p-value used.
In my opinion calculating the power of the statistical test at that specific p-value is more important and from my experience it is something that is often omitted even in journals with high impact factors.
Vahid, as I understand it, the type-I error is a "false positive", i.e. a false rejection of H0. This only applies to H0, and alpha is the rate *under* H0. The power, in contrast, is related to HA. You cannot have power under H0, and you cannot make a false rejection under HA. Thus, talking about power under H0 makes no sense, and talking about alpha under HA doesn't either.
I think here is the misconception: a p-value is not an individual-case type-I error probability! It is the probability of observing more extreme data *under* H0. A control of the type-I error is possible *only* in a statistical way, in the long run, to limit the rate of such errors. The only way to do this is to stick to a fixed decision rule. Since this rule must be given a priori and independently of the data, the rate cannot be related to the data and hence not to the power or the type-II error rate. However, the type-II error rate depends on the chosen alpha.
Oops, this is turning into a heated debate about P-values... If I had to choose, I prefer Jochen's definition to Vahid's. If it comes down to wording, I guess I'd say a P-value is "the probability of obtaining a test statistic at least as extreme as the observed one, conditional on H0 being true". There are many good papers discussing P-values and problems with their use, e.g. Anderson, D. R., Burnham, K. P. & Thompson, W. L. Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management, 2000, 64, 912-923, or Cherry, S. Statistical Tests in Publications of The Wildlife Society. Wildlife Society Bulletin, 1998, 26, 947-953.
.
an interesting historical note about the origin of this ubiquitous 0.05 value
http://www.jerrydallal.com/LHSP/p05.htm
(spoiler : that's all Sir Ronald's fault ... as always !)
.
.
in the above reference, Fisher's quote is quite interesting, again stressing the difference between Fisher's approach to statistical testing (as a tool for scientific inquiry: try to discard the worst hypotheses but keep working with the others, even the modest performers) and Neyman and Pearson's approach (a repeated fixed-decision setting which must lead to a given precalculated false positive rate, this rate having been optimized from a cost/benefit analysis of the consequences of wrongly rejecting the null):
"... it is convenient to draw the line at about the level at which we can say: "Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials."...
If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. "
.
In fact, when we are not even 5% sure that a hypothesis is true, it is better to reject it. That was the original idea behind fixing the level at 5%. However, this value was fixed arbitrarily, without any mathematical reason. Accordingly, some people may possibly fix the level as 10% or 1%, once again arbitrarily!
Now a lot seems to be confused here.
Tiago is correct: "the probability of obtaining a test statistic at least as extreme as the observed one, conditional on H0 being true" - this is also what I wanted to express.
Fabrice showed why 0.05 is used so often, as "convention". However, this 0.05 was not chosen to control an error rate. I believe that Fisher's intention was as follows: if you have no clue about what to expect, and you see some effect, then be a little careful: do not over-interpret the observed effect if such (or stronger) effects would already be likely under H0. Fisher never controlled error rates, and he did not reject H0 based only on comparing p to a fixed level. When H0 was more reasonable on a priori grounds he rejected it only at lower p-values, and when H0 was not reasonable he rejected it at higher p-values.
Hemanta confuses P(H0|data) with P(data|H0). They are not the same! (Replace "data" by "a sufficient test statistic at least as extreme as the one observed" if you want to be closer to the technical terms of test theory.)
Well, I have made my comment with reference to Pr(H0 is true, given the data) only!
@Vahid: note Jochen did say "*under* H0", which is the same as "conditional on H0" or "on the condition of the null being correct". I would say, wording apart, we might all be saying the same thing. The fact is that often, though, we might have the same notion but words distort that notion!
Surely nothing should prevent you from using p=0.05 as your threshold value; in fact, if it were not for keeping with conventions, you could just as well use p=0.03, p=0.07 or p=0.10. It is all a matter of choice, I suppose. However, it is always good practice to use the 5% level since most people use it in the literature, thereby keeping to conventions. Remember that this cut-off value of e.g. p=0.05 also very much depends on how you state your hypothesis. Is it a one-sided test or a two-sided one? What study are you performing? Is it a clinical study or a social and/or economic study? All these factors weigh in when deciding the stringency level (p=0.05, lower, or higher) to be used. Good luck, I hope this helps!
This debate is very interesting. I believe it does not matter whether the P value is 0.05 or less than that. The P value is a statistic that should be interpreted in light of the biology or nature of the parameter or trait in question. I would like to see results presented even where the P value is between 0.06 and 0.09, reported as not significant but with the P values indicated as such, so readers can interpret the results based on the nature of the parameter under study.
One of the things we do under hypothesis testing to prevent ourselves from rejecting a true null hypothesis is to minimise the likelihood of such a rejection; this probability is what statisticians call the p-value. The p-value therefore only tells you the possibility of rejecting a true claim, so that if this probability is 0.05 there is only a five percent likelihood of throwing away your true null hypothesis. This value is chosen because, in reality, much smaller values are very rare in real-life situations: choosing a value like 0.01, which means only a 1% likelihood of throwing away the true null hypothesis, renders the situation under consideration almost perfect and free of error, which is mostly not true. The p-value of 0.05 is therefore a better probability value: when it is violated by the test, the result is real and genuine. Thank you
Just an example:
In phase II designs we don't want to stop a promising drug too early, so we prefer having better power. As we can only have a small sample size, we set alpha at 10% (usually two-sided) and beta at 10% or 5%.
I would also say that if you set your threshold (according to circumstances) BEFORE your analysis, you'll be right! But justifying your alpha cut-off is not really obvious...
Nothing prevents you from using an alpha that is not equal to 0.05 (and hence a rejection rule other than p less than or equal to 0.05).
A hypothesis test is a type of proof by contradiction. We begin by assuming the null hypothesis is true. Based on this assumption, we generate a p-value. The p-value is the probability of observing your sample (or an even more extreme sample) IF the null hypothesis is true.
A small p-value tells you that either:
a) your sample data is not likely to appear IF you assume it came from the null hypothesis.
OR
b) that assuming the null hypothesis to be true doesn't look like a good assumption. In this case your sample data could be quite likely to appear, from some other distribution.
If we have to "guess" which occurred: something with a high probability (more likely to occur) OR something with a low probability (less likely to occur), of course we'll pick the former.
So, a small p-value tells us that probably the null hypothesis is false, but possibly it's not and you got a sample that has a small probability of occurring.
We only use the 0.05 cut-off since Fisher (he of the F-test) was asked what he considered a small enough probability to reject the null hypothesis. He said "1 in 20".
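To make that definition concrete, here is a tiny worked example with made-up numbers (60 heads in 100 tosses of a supposedly fair coin):

```python
# p-value = probability, under H0 (a fair coin), of a result at least as
# extreme as the one observed. The 60-out-of-100 figures are invented.
from scipy.stats import binom

n, k = 100, 60
p_one_sided = binom.sf(k - 1, n, 0.5)    # P(X >= 60 | fair coin), about 0.028
p_two_sided = min(1.0, 2 * p_one_sided)  # crude two-sided version, about 0.057
print(p_one_sided, p_two_sided)
```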
best
The weird thing is that there is no "unlikely" p-value under H0. The distribution of p under H0 is uniform, so it is equally likely to get any p in the interval [0;1]. Under H0, getting p=0.001 is as likely as getting p=0.734, or p=0.128, or p=0.992, or...
Therefore, having just a single p-value cannot give any indication about the (dis)credibility of the null hypothesis (any other p would be equally likely!).
Therefore, nothing particular can be concluded from a single p-value. The only work-around is to reject H0 by a fixed rule: whenever p ≤ alpha, for some alpha fixed in advance.
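That uniformity is easy to check by simulation; a small sketch (the sample sizes and number of simulated experiments are arbitrary choices):

```python
# When H0 is true, p-values from a two-sample t-test are ~uniform on [0, 1]:
# every bin of width 0.1 catches roughly 10% of them, and about 5% fall
# below 0.05 - no single p-value is "surprising" under H0.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
pvals = np.array([ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
                  for _ in range(20000)])

print(np.histogram(pvals, bins=10, range=(0, 1))[0] / len(pvals))
print("fraction below 0.05:", (pvals < 0.05).mean())
```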
More important than adhering to the consensual cut-off p-value of 0.05 is to interpret the p-value you actually get.
I don't want to start any controversy, but I am afraid to say that p-values are well known to be an incredibly misleading way of judging scientific evidence.
Why do I say this? One is usually interested in the probability P(H_0|data) which is exactly what the p-value does not provide. So say you get a p-value of 0.05, what do you think P(H_0|data) will be roughly? If you are rejecting the null hypothesis at this level you would want P(H_0|data) to be small here, say of similar value to 0.05. It is not.
The size of P(H_0|data) can only be found in a Bayesian framework and the answers are shocking, to quote from the abstract of the paper below:
"data that yield a p-value of 0.05, when testing a normal mean, result in a posterior probability of the null of at least 0.3 for any objective prior distribution"
This means that if your p-value is 0.05 then P(H_0|data) > 0.3. (Yes, that is correct: greater than 0.3).
This means you may have rejected the null hypothesis, but it is actually true with probability greater than 0.3.
Hence p-values are in fact incredibly misleading and I would rather we tried to stop using them completely.
If this is the first time you have heard this, it is shocking I know!
Please see the following paper for details.
Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence,
James O. Berger and Thomas Sellke, Journal of the American Statistical Association, Vol. 82, No. 397 (Mar., 1987), pp. 112- 122
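A crude simulation conveys the same message without any Bayesian machinery. The 50/50 prior on H0, the effect size under the alternative and the sample size below are arbitrary assumptions, and the exact number depends on them, but the proportion of true nulls among tests that land near p = 0.05 comes out far above 0.05:

```python
# Among simulated z-tests that happen to give p close to 0.05, how often was
# H0 actually true? All scenario parameters are made up for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sim, n, effect = 400_000, 25, 0.5

h0_true = rng.random(n_sim) < 0.5           # half of the nulls are true
mu = np.where(h0_true, 0.0, effect)         # true mean: 0 under H0, 0.5 under H1
xbar = rng.normal(mu, 1.0 / np.sqrt(n))     # sample mean, sigma = 1 known
p = 2 * norm.sf(np.abs(xbar * np.sqrt(n)))  # two-sided z-test p-value

near_005 = (p > 0.045) & (p < 0.055)
print("share of true nulls among tests with p near 0.05:",
      round(h0_true[near_005].mean(), 2))   # far above 0.05
```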
.
"Let me introduce Dr. Publish-Perish. He is the average researcher, a devoted consumer of statistical packages. His superego tells him that he ought to set the level of significance before an experiment is performed. A level of 1% would be impressive, wouldn’t it? Yes, but . . . He fears that the p-value calculated from the data could turn out slightly higher. What if it were 1.1%? Then he would have to report a nonsignificant result. He does not want to take that risk. How about setting the level at a less impressive 5%? But what if the p-value turned out to be smaller than 1% or even 0.1%? He would then regret his decision deeply, because he would have to report this result as p < 0.05. He does not like that either. So he concludes that the only choice left is to cheat a little and disobey his superego. He waits until he has seen the data, rounds the p-value up to the next conventional level, and reports that the result is significant at p < 0.001, 0.01, or 0.05, whatever is next. That smells of deception, and his superego leaves him with feelings of guilt. But what should he do when honesty does not pay, and nearly everyone else plays this little cheating game?
Dr. Publish-Perish does not know that his moral dilemma is caused by a mere confusion, introduced by textbook writers who failed to distinguish the three main interpretations of the level of significance."
taken from another good read :
Mindless statistics
Gerd Gigerenzer
http://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf
.
Years ago, when statistics was new to me, the literature in social science, education and engineering that I reviewed suggested 0.05. I used it because I knew I was not an authority and they were. And it was convenient for computational purposes because statistical tables contained alphas of 0.05, 0.1, 0.01, 0.025, etc.
With the advent of statistical packages like SPSS, I think we can go beyond it.
From a different viewpoint, the Type I error or alpha is important for interpreting your results. Scientists have decided that if alpha is large, e.g. a=0.05, you can't claim that your significant result is a discovery. You can only say that there is an indication. For smaller alpha, e.g. a=0.000001, you can claim that there is an observation. And for extremely small alpha, e.g. a=0.00000001, you can claim that you have a discovery (like the Higgs boson)!! So it depends on the research field. In particle physics, for example, alpha is predetermined to be very small by scientists in order to be sure that a significant result is a discovery.
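For reference, the "n-sigma" language translates into one-sided tail probabilities of the standard normal (5 sigma is roughly 3e-7); a short conversion with scipy:

```python
# Convert sigma levels into the corresponding one-sided tail probabilities.
from scipy.stats import norm

for k in (2, 3, 5):
    print(f"{k} sigma -> one-sided alpha of about {norm.sf(k):.1e}")
# 2 sigma ~ 2.3e-02, 3 sigma ~ 1.3e-03, 5 sigma ~ 2.9e-07
```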
Dear Vahid,
Firstly, I stated that alpha is important for interpreting your results, not that "the level of alpha depends on the interpretation of the finding (being an indication or a discovery)". I agree with you that power (beta) and effect size are very important as well.
Secondly, scientists in particle physics claim that 5σ provides conclusive evidence for the discovery of a new particle (article: "Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC"). In my opinion, in other fields you have to take other things into account as well. In medicine, for example, you can confirm your results by doing more experiments, but the most important thing is the patient's health, not statistics and p-values.
@Fabrice Clerot: aren't things slowly changing with respect to the example you cited? I feel like more and more journals report exact p-values, not just the next conventional level... which should make the worries you describe less of a concern, no?
.
@Ekatarina
not really; reporting the exact p-value is surely progress, but the unresolved question is the interpretation and the conclusions that can be drawn from such a value!
I started to doubt my own "understanding" of p-values (fairly wrong by the way, something like p(data|H0)) the day I came across their malicious description by Jeffreys:
"What the use of P implies ... is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure."
(quoted in Hubbard and Lindsay, "Why p values are not a useful measure of evidence in statistical significance testing", Theory Psychology (2008) http://tap.sagepub.com/content/18/1/69 )
and I have not made up my mind since ...
Ignorance was bliss ... Damn Sir Harold !
.
A big problem with p-value=0.05 is that it is often taught in courses without reference to sample size. I agree with Vahid's comments. And I stress that basic courses in statistical inference should clearly show the danger in using 0.05 and the difference between "statistical significance" and "practical significance" (e.g. physical, economic, health, etc.).
You, the researcher, define the level of significance against which the p-value is compared to decide about the hypothesis.
If your decision is risky, with large losses due to a false rejection of H0, reduce the alpha value. Otherwise, increase it.
I've seen, in research on rare diseases or events with a very small probability of occurrence, cases where alpha was set below 1%. It's not usual.
Although it is a rule of thumb to use 5% as a cut-off, one can use a different level of significance. The main idea behind this choice is that with alpha equal to 5% (or lower!) the probability of making a mistake by rejecting the null hypothesis when it is true is at most 5%, which is commonly read as accepting the alternative hypothesis with 95% confidence.
I would venture to say that people who have thought about the topic and know what they are doing are not prevented at all from choosing p-value cutoffs other than 0.05. As others have discussed here already, the choice of cutoff depends strongly on the costs that are associated with type-I and type-II errors, and 0.05 is a priori no worse a choice than any other cutoff. The problems start once an observation has been identified as statistically significant and is then called a Result (with a capital R) which can be published, instead of trying to find alternative explanations. By alternative explanation I don't mean the alternative that is contrasted with a null hypothesis, but a different null hypothesis. The null hypothesis needs to capture all the biases that could be alternative explanations of the Result, explanations that would reveal the Result to be an artifact of wrong assumptions. A good null hypothesis is often not obvious and needs to be found through scientific creativity and discussion. But even when this has all been taken care of and the observation remains statistically significant, the work is not done, because then one has to think about interpretation and meaning and whether additional experiments might have to be performed to distinguish between cause-and-effect relationships and mere correlations. If you want to change people's minds, I suggest educating them to understand that statistical analysis is not just done because it's done, but that statistics is a thinking tool. 0.05 is only a problem if there is nothing else.
But most people I know want to publish and avoid thinking if somehow possible ;)
I suggest not having a cut-off at all. Quote the value and give your interpretation. Readers can draw their own conclusions.
I provide my students with the following suggested Rules of Thumb for interpretation (insert = where you are comfortable):
If p > 0.2, forget about it.
If 0.2 > p > 0.1, put it aside but don't forget about it.
If 0.1 > p > 0.05, get back to it again later.
If 0.05 > p > 0.01, it's most probably real.
If 0.01 > p > 0.001, sit up and take notice of it.
If p < 0.001, hold everything!
@Richard
I suppose you also give your students some typical range of sample sizes for the validity of such rules of thumb ...
with a traditional "dumb null" (such as exactly zero effect) and with "big data" starting to (over)flow all around, p-values below 0.05 will be obtained for effects far too small to be of any practical interest ...
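A small illustration of that point, with made-up numbers (a million observations per group and a difference of only 0.01 standard deviations):

```python
# With huge samples, a practically negligible difference is still flagged as
# highly "significant" by a t-test. Group sizes and effect are invented.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
a = rng.normal(0.00, 1.0, size=1_000_000)
b = rng.normal(0.01, 1.0, size=1_000_000)
print(ttest_ind(a, b).pvalue)  # typically far below 0.05
```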
Hi Leanne, hi everyone.
I think this whole debate about the "p-value" is a very interesting one. Recently a paper just appeared in Nature (see link: http://www.nature.com/news/scientific-method-statistical-errors-1.14700 ), where other experts are weighing in on their opinion as to what they think of cut-off values e.g. p=0.05 and false discovery rates etc. I suggest those interested should read the article and also take a look at what the other experts think. Enjoy debating the p-value :).
Another fun paper for people thinking about p-values, published in 1994, is "The earth is round (p < .05)" by Jacob Cohen.
A cut-off point is important to separate one thing from the other. For example, in plant or animal population studies for selection, breeders use a point of truncation. This point is established by their experience with that particular trait and the size of the initial population. In statistics, the 1% or 5% level for hypothesis tests was established based on experience and on the accuracy of the biological tests carried out by many over the past century. However, it does not mean that everybody, for every scientific test, should follow the same principle. There have been enormous advancements in biological research from integrating mathematics and biology to understand the concepts in many areas. The application of probability theory in biological research helps to develop more confidence in conducting trials and interpreting results in many experiments.
Reporting the probability value that bears on the association we are testing, rather than just reporting that the association is not significant at the 5% level, would give readers more opportunity to think and to use that information for further investigation of the underlying concept.
The p = 0.05 cut-off is used by frequentists. Bayesians use likelihood ratio tests to balance the type 1 and type 2 errors and find the optimal p-value. If you go back and check, you will find that in most cases the optimal p-value comes out at around 0.05, so the arbitrary choice of 0.05 is not bad.
It is only a shared belief, due to the use of statistical analysis, which is a low-efficiency tool for describing and understanding the world.
The answer is very simple. The data itself dictates the level of significance of the test. Once the p-value, which is data driven, is in place, you will be able to predict the significance level.
I would say it is a matter of convenience and educational inertia: if you consult, for example, Spiegel's book on (introductory) statistics (Schaum's series), you will find at the end of the book tables which contain values for this p=0.05. If I remember correctly, there are sometimes tables for p=0.01. However, if you solve the integral involving the p-value corresponding to the limits of confidence, you can take a p-value of your choice and make your own table.
I think that the choice of the p-value threshold can be based on "economic" reasoning too.
I will try to explain this with a simplistic example from an industrial production process.
Suppose we use control charts to monitor the process behaviour. The limits in control charts "implicitly" assume a "p-value to reject" of 0.0027. This means that, if the process is actually in control, the control chart will say that it is out of control about once every 370 checks (1/0.0027); i.e., the process will be stopped, analyzed, and restarted about once every 370 checks.
Now, suppose you perform checks once a day. This means that, if the process goes well, it will be wrongly stopped for problems about once a year.
If you check the process once a minute, you'll wrongly stop the process about once every six hours.
If the stop requires 1 hour to analyze the process, in the first case you'll lose only a little production time, while in the last case you will lose a lot of production time.
Of course, all this reasoning should also include the power of the statistical test and the "costs" in case of a Type II error (for example, a broken pen implies a small cost, while a defective medical device can imply a VERY BIG cost). But that is a more complex argument.
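The back-of-the-envelope arithmetic above can be written out explicitly (0.0027 is the usual 3-sigma false-alarm probability; the checking frequencies are those of the example):

```python
# Average time between false alarms for an in-control process monitored with
# 3-sigma control chart limits (false-alarm probability 0.0027 per check).
alpha = 0.0027
checks_between_false_alarms = 1 / alpha  # about 370 checks on average

print(checks_between_false_alarms, "days between false alarms if checked once a day")
print(checks_between_false_alarms / 60, "hours between false alarms if checked once a minute")
```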
I hope this contribution could be useful..
Ciao
Enrico,
"Of course, all this reasoning should have to include [...] the "costs" [...] too. But this is a more complex argument.."
this is the main critical point in all hypothesis (or significance) testing. The testing is performed to make a decision. The bases of making decisions is the (estimation of) costs and benefits of right and wrong decisions. Without somehow specifying and quantifying the costs and benefits, the whole decision process is quite arbitrary, and NOT AT ALL even a little bit "objective" as so often claimed by peopple adhereing to such tests. The costs and benefits are the linchpin of the decisions, and without having specified these, justified decisions can simply not be made. Without having costs and benefits in mind, any "rule" for the decision strategy is as good as any other, so the commonly used 0.05 cut-off is as senseless, pointless, meaningless, ... as any other cut-off.
In research, where costs and benefist typically can not be estimated (not even by the order of magnitude), clear sensible decisions can never ever be made based on a p-value. We simply have no means to perform some kind of "objective" decision making here. We should start to acknowledge this. Instead of daring black-and-white pictures of the world we should focus more directly on the data, the effects, and the precision, given the data we have. This makes our life harder, because we have to think more and harder instead of leaning back and pointing to these pretty "significance stars" in the charts we present, as if they would explain anything.
(Just to note: in [industrial, economic] process control where costs and benefits can be quantified, such decision strategies are very sensible and helpful!)
Hi Leanne, everyone,
I would like to comment on the question from an applied perspective. There is nothing technically preventing the use of any value as a p-value cutoff. However, the driver is often the scientific field you are working in. The p-value cutoffs used vary depending on the field. Each field has its own communal approach to integrating statistical information into knowledge in that area. The p-value used in a field is an (informally) agreed-upon standard of scientific evidence.
Hope this helps,
Doug.
An alpha of .05 is, of course, partially arbitrary and conventional. But it is desirable that there is some convention on the matter. Were researchers able to "play" with the alpha level, then it would introduce extra experimenter degrees of freedom. For instance, effect I wanted: p = .09. "We set our alpha level at .10." Or conversely, effect I didn't want (e.g., because it confirms a competing account): p = .04. "We set our alpha level at .01." If we want to continue with NHST (which maybe we shouldn't, but I digress), then there does need to be a conventional cutoff that we all stick by (and that we also, of course, don't hack). An alpha of .05 is a reasonable number for this.
The choice is entirely arbitrary and whether 0.05 truly indicates significance seems to depend on the subject a bit. There is a discussion in this paper: http://www.pnas.org/content/110/48/19313.short
arguing that Bayes factors provide a less arbitrary cut-off, and that the standard should be tightened by at least a factor of five.
"I have read some strange interpretations of p-values and their distributions. This is mainly due to a confusion between Fischer approach of hypothesis testing where the null hypothesis (H0) can never be both true and untrue, and Neyman's and Pearson's alternative approach which assumes H0 can be true or untrue. "
We have all read strange interpretations of observed signifcance levels (p-values). However, this statement is blatantly false. The Neyman-Pearson approach does not allow Ho to be both true and untrue. Both the Fisher approach and the N-P approach require that the hypothesis pair partition the parameter space: the null is either true or untrue, and if it is untrue, the alternative must be true. What N-P show is that under this schema (in the case of simple null vs simple alternative), and with fixed alpha, the likelihood ratio provides the basis for a most powerful test of that size. That is, you cannot do better than the LR statistic. The question is, do you buy into selecting the test size in advance?
Under some circumstances (essentially when the costs of rejecting a true null and accepting a false null are quantified accurately) this might be reasonable. In most cases, and particularly in most scientific studies, this setup is not reasonable. The costs of each type of error are not well-known, and so it is impossible to do the balancing act that will minimize risk.
I wish to suggest two articles by the late professor S.G. Carmer:
"Optimal Significance Levels for Application of the Least Significant Difference in Crop Performance Trials". Crop Science Vol. 16 No. 1, p. 95-99 (1976)
"Significance from a Statistician's Viewpoint". Journal of Production Agriculture Vol. 1 No. 1, p. 27-33. (1987)
who advocates the use of significance levels of .10 and more.
A recent paper of Karamanos et al ("Real differences – A lesson from an agronomist's perspective", Canadian Journal of Plant Science, 2014, 94(2): 433-437) "highlight[s] the importance of modeling all sources of variance, designing more efficient experiments, scrutinizing the size of treatment differences, and choosing an appropriate level of significance to ensure that only real differences are detected.".
Some excellent contributions to the discussion, particularly by Paul Vaucher and Dennis Clason. The bottom line for p-values is what you are willing to bet that the relationship you think you have found is real. Obviously, an engineer building a bridge cannot afford to have 5% of his/her bridges collapse, so a more stringent standard for lack of failure is required. For the average experiment, Fisher's being willing to be wrong 5% of the time seems reasonable (significant). If a higher degree of confidence is required, one could take p=0.01 (highly significant) or p=0.001 (very highly significant) as one's standard. If one sees a trend, one might even take p=0.10 as "marginally significant" (meaning that this might bear further investigation). The alpha level is the probability of declaring an apparent correlation real when it is not (a Type I error). The point is that by agreeing on a common terminology, we can understand each other (if the right tests are used, otherwise it is nonsense).
Of course, one also needs to do a power analysis for the likelihood of not finding an effect when there is one (the Type II error, beta; the power, 1 - beta, is usually set at 0.80 in clinical studies). This ensures that a small study that fails to find an effect is understood to be underpowered to see it given the sample size, or that the size of the effect is small.
I hope that this helps.
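As a hedged sketch of the power analysis mentioned above (the medium effect size d = 0.5 and the 20-patients-per-arm study are assumptions for illustration, not clinical guidance):

```python
# What power does a small two-arm study actually have, and how many patients
# per arm would be needed to reach the conventional 80% power?
from statsmodels.stats.power import TTestIndPower

pw = TTestIndPower()
print("power with 20 per arm:",
      pw.solve_power(effect_size=0.5, nobs1=20, alpha=0.05))
print("n per arm for 80% power:",
      pw.solve_power(effect_size=0.5, power=0.8, alpha=0.05))
```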
It's also important to read all of Fisher. It was his view that a statistically significant result was a hunting license of sorts. That is, he was willing to do further research into mechanisms and explanations about one time in twenty. Fisher did not view a statistically significant finding as an end point -- it was the starting point.
Thus, in Fisher's view, rejecting a true(ish) null is an error to be corrected with further investigation. What Fisher glosses over is the error of failing to reject a false null. This is what we would call the problem of underpowered studies.
Contrast this with the N-P approach: make your decision and move on to the next problem.
A recent forum on the use of P-values just came out in Ecology. You can read about it here:
http://evol-eco.blogspot.no/2014/03/debating-p-value-in-ecology.html
or go straight to the relevant Ecology number here:
http://www.esajournals.org/toc/ecol/95/3
I am very pleased to read so many open-minded responses! Thank you everyone!
I completely agree with Christopher Lange. It's a matter of what the risk (or losses) would be from a bad decision based on a "higher" level of significance.
Some months ago I read about research on a rare disease where a very small alpha had to be set, because otherwise a finding would be considered as due to chance when it is not; it is significant.
When I was a student I recall being told by Maurice Quenouille, who had known Fisher, that 5% was motivated by the latter’s view that the distribution of many test statistics is approximately normal in the region of their 5% point.
I advise students that it depends on the frequency of the data; 5% is too high for high frequency data for which 1% or even 0.1% is more sensible. 10% is always bad practice in my opinion!
Nothing prevents you from using any preset p-value, nor from doing any hypothesis test at all. It is more important that the data you're analyzing justifies using hypothesis testing at all. Most experimental data qualifies, most observational data doesn't.
You may find some useful comments on P-values at:
http://home.ubalt.edu/ntsbarsh/Business-stat/opre504.htm#rmip
Including:
- The Meaning and Interpretation of P-values (what the data say?)
- Blending the Classical and the P-value Based Approaches in Test of Hypotheses
- Bonferroni Method for Multiple P-Values Procedure
Arsham, what do you mean (on the linked site) by "[...] if the chance of random variation is the only reason for sample differences [...]"? What is "chance", and what is "random [variation]"? Aren't chance and randomness (at least in this sentence) simply synonyms, so that this is a circular explanation? A random variation (also a random variable) may be (is) defined by a probability distribution, and probability = chance, both eventually expressing the very same thing: relative expectations or ignorance about observations. So, to my understanding, the sentence says something like: "If the null hypothesis is true, and if the variation about which we can only have some expectations is the only reason for sample differences, then the expectation of the observed data is a quantitative measure of..." (you may substitute all occurrences of "expectation" by "chance", "probability", "randomness", "uncertainty" or "ignorance"). I am truly puzzled by this sentence, which I read quite often in many books and resources and never understand. Perhaps you could clarify this; I would be grateful.
I further wonder why you present p-values as a "measure of evidence" (which is clearly and by definition at least an ordinal scale) and later say that p-values must not be compared. These cannot both be correct. I would be glad to get some explanation from you.
Thank you.
There is an argument that deals with the expected impact of decision. The impact weight matrix need not be symmetric and would shift the decision rule. A great deal more could be said in terms of decisions that are embedded in an operational context rather than as a simple experimental conclusion.
This has been hinted at in earlier messages, but for me choosing the critical p-value is not a statistical question. It is in the realm of the real-world effective cost of making the wrong decision. In research, it mainly relates to balancing "false positive" and "false negative" decisions. So, mostly informally, sometimes researchers set the critical value at 0.1 when replication is low. On the other hand when we have many replicates, we will find statistically significant differences that are biologically irrelevant. In my opinion in every scientific publication, whatever critical value we use for discussing and interpreting the results, the actual p-values should always be given. Not doing so, just discards valuable information. Of course, one historical reason for not reporting actual values was the laborious calculations involved in obtaining values by interpolation when using printed tables.
The situation has far reaching consequences when dealing with legal regulation compliance studies, or for environmental impact assessment, or safety. I would not want to take 1 in 20 risk of making the wrong decision concerning the possible lethal side-effect of a new medicine, while it might be acceptable to take that risk when comparing the new medicine to a currently used medicine known to be highly effective. In such cases we would want, rather than balance the risks of making false positive or false negative decisions, minimize one of them. In other words minimize the probability of the type of mistake that we need/want to avoid.
I have avoided statistical jargon, to make this understandable to more readers. Statisticians call these Type I and Type II errors, and there is plenty of literature on this. In any case I feel most comfortable with Tukey's view on hypothesis testing, and his idea that we can NEVER ACCEPT the null hypothesis. We can either get evidence that A > B or A < B, and the alternative being that we have not enough evidence to decide which one is bigger. Of course in practice using power analysis, we can decide that we could have detected or not a difference that would be in practice relevant. However, this is conceptually very different to accepting that there is no difference or no effect.
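One way to make the cost-balancing idea concrete is a toy calculation that picks the cutoff minimizing expected loss for a one-sided z-test; the prior probability of a real effect, the effect size, the sample size and the two costs below are all made-up inputs:

```python
# Choose alpha to minimize expected cost, given assumed costs of false
# positives and false negatives and an assumed prior, effect and sample size.
import numpy as np
from scipy.stats import norm

p_effect, effect, n = 0.3, 0.4, 50          # assumed scenario
cost_false_pos, cost_false_neg = 1.0, 5.0   # false negatives 5x worse here

alphas = np.linspace(0.001, 0.5, 500)
power = norm.sf(norm.isf(alphas) - effect * np.sqrt(n))
expected_cost = ((1 - p_effect) * alphas * cost_false_pos
                 + p_effect * (1 - power) * cost_false_neg)
print("cost-minimizing alpha is about", round(alphas[np.argmin(expected_cost)], 3))
```

With these particular inputs the optimum lands well above 0.05; with other costs it can land far below, which is exactly the point made above.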
"Dr. Arsham's perspectives and experiences have led him to combine the decision-making framework to connect the gaps between optimization theory, discrete event simulation, and probability and statistics." - I really had hoped for a qualified reponse (post from May 8th).
Working in applied statistics, I have a clear view on this issue: a statistical cut-off makes sense only in the context of controlled experiments. In the social sciences, controlled experiments are extremely rare. However, the p-value is an excellent means of guiding the decision maker among the risks of his or her decision; in addition, most situations involve multiple testing, and calculating even just approximate error probabilities is not easily possible. In applied social sciences, cut-offs do not make sense and do not play any role.
Hello Peter. It is good to find you here!
I said above "It depends on the type of application."
We have to take into account the type of experiment in which we obtain p-values. Just this morning, a student of mine sent me her results on predicting a certain award for movies. Her test had p=0.07. I will have to explain to her that she should not just conclude "inconclusive". There is more that can be said here. Another example is cluster analysis for cancer rates (Spatial Epidemiology). A certain cluster may have a p value>0.05 but it still is small and it has a large Relative Risk. In such a case, I would state that "we need to keep an eye on this geographical area" ... etc.
Do you agree?
Raid you made the point: " it has a large Relative Risk. In such a case, I would state that we need to keep an eye on this geographical area" - the effect (and possibly the precision of its estimate) must be considered. P-values are not so helpful.
The p-value is just one piece of the total information that we gather, Jochen. I am glad that you agree with me on this point. We need to educate the users of statistics around us, where we work, that statistics is a wonderful area, but we need to be open-minded and knowledgeable about what we are applying. Using a software package with statistics never replaces the statistician, the person who can think.
Raid: I entirely agree. "Keeping an eye" certainly means that we need to look for further evidence, perhaps more data, perhaps more sophisticated methods, perhaps enlightenment from the comments of experts, etc.; the usual way of doing empirical research. Students can profit a lot when they are guided along these lines.
This is also what I meant to say, Peter. More evidence is gathered and/or different methodologies are tried out. I try to give students a good overview of many types of approaches in statistics, allowing them to learn through large projects how they arrive at conclusions that make sense to them.
I am presenting at a journal club in a couple of weeks and have 'stumbled' across the issue of the 'p-value' versus clinical importance/significance. I recently completed a biostats course as part of my PhD program and was frustrated that clinical significance seemed to be a misnomer in the biostats course, and found it frustrating to say something was 'significant' when clinically it would not be! Can anyone refer me to a great article that discusses this debate?
Natasha, there are tons and tons of literature about the often unrecognized but important difference between (clinical) relevance and statistical significance. You can search google and PubMed on these keywords to get many good hits.
Particularly good in my eyes is a paper by Steven N. Goodman: Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy. Ann Intern Med. 1999;130:995-1004.
You will find further interesting papers cited therein.
You may also be delighted to read papers of Gigerenzer ("Mindless statistics"), Bland, Altman, Tukey, and Oakes, just to name a few.
Much of the confusion could be avoided if authors would really write "statistically significant" (instead of only "significant") when they are talking about statistical significance, so it won't be confused so easily with the word significance in its non-statistical (more common) meaning. (It will not solve the problem completely; I am sure that after some time people will again start confusing stat. sig. with relevance.) I don't know how many authors are too lazy to add the word "statistically" and how many do not know that there is a difference at all. Unfortunately, I even suppose that there are some authors who confuse it deliberately, to pretend relevance where statistical significance is all they have...
I often hear words similar to those stated by Jochen above when I discuss some clinical trials with physicians, and I always use the expression "statistically significant", and I let the physicians decide whether it is clinically significant or not. I let the experts of the applications explain to me which clinical significance the results may or may not have. I am the expert for the statistical modeling and the analysis.
The choice of the cut-off point to reject the null hypothesis is arbitrary.
This should be an interesting article for you to read: http://www.jerrydallal.com/lhsp/p05.htm
I think 0.05 p value is practically convenient in certain cases, which acts more like a sign or reminder rather than a cutoff point.
I also want to clarify that there are two different kinds of interpretation of p-values: one for parameter (coefficient estimate) significance in regression analysis and the other for a statistical null hypothesis test.
For the former one, there is room for flexibility depending on your field of study and research objective, but you also need to take into account multicollinearity, r-squared (the goodness of fit), and standard error (forecasting accuracy). In my opinion, p-value less than 0.1 is acceptable and practically meaningful.
For the latter, 0.05 is more like a rule of thumb for either rejecting the null or accepting it confidently. In my opinion, p less than 0.05 is more powerful. From my empirical research experience, I make an immediate decision for any value less than 0.05 (the smaller the p-value, the higher the significance); however, I prudently cross-validate the hypothesis test for values between 0.05 and 0.1 if possible.
For example, if the p-value is 0.01 for the Augmented Dickey-Fuller (ADF) test, I reject the null of a unit root. If the p-value is somewhere around 0.08, I will perform different unit-root tests (ADF, ADF-GLS, KPSS) and compare their results. If all tests ask me to reject the null, I will wholeheartedly reject it. If not, I will investigate further and might accept the result from the most powerful test among the three after checking the test criteria for errors, making adjustments, and retesting the null.
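A minimal sketch of such a cross-check using statsmodels only (ADF-GLS would require another package, so only ADF and KPSS are shown; the series y is a made-up stationary toy series):

```python
# Cross-check stationarity with ADF and KPSS. Note their nulls are opposite:
# ADF's null is a unit root, KPSS's null is stationarity, so the two p-values
# are read in opposite directions.
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(3)
y = rng.normal(size=300)  # toy stationary series; replace with real data

adf_p = adfuller(y)[1]
kpss_p = kpss(y, regression='c', nlags='auto')[1]

if adf_p < 0.05 and kpss_p > 0.05:
    print("both tests point to stationarity")
elif adf_p >= 0.05 and kpss_p <= 0.05:
    print("both tests point to a unit root")
else:
    print("the tests disagree - investigate further, as suggested above")
```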
From a statistical perspective, nothing can prevent you from using a p-value cut-off different from 0.05 or 5%. Remember: the lower the cut-off used in an investigation or study, the higher the probability of not rejecting the null hypothesis when it is correct.