In an upcoming issue of Nature Human Behaviour, Daniel Benjamin and colleagues propose a new p-value threshold of 0.005 instead of our current dogma of 0.05. They claim that it will reduce false positive results in the scientific community. On the other hand, it will increase false negative results and possibly diminish the meaning of effect size.
You may find the full text here: https://osf.io/preprints/psyarxiv/mky9j/
I would like to hear your opinion about the suggestion of a new p-value threshold... How can it change our research?
Hot air.
The problem is that p is mistaken for the probability that the null hypothesis is true, or that rejecting the null only when p < 0.05 is thought to guarantee that only 5% of such rejections are wrong.
Hi Michael,
I share the opinion of Andrew Gelman's group (professor at Columbia University) and Christopher Bishop (research leader at Microsoft). I believe that there is no magic threshold to determine that a model effect is “significant” (there is a great distance between an effect being significant in reality and being significant “asymptotically”). I believe that a model should naturally handle the multiple-hypothesis problems that inflate the false-discovery rate. This is why we Bayesians do not worry about whether, in a multiple-comparison setting, a test correction is or is not needed. We manipulate probability distributions as building blocks that are reasonable for the nature of the problem, which naturally circumvents the problems that arise in the closed-form formulations commonly used in frequentist hypothesis-testing settings. I believe that the model should be constructed to reflect the true nature of the problem, not the nature of the statistical procedure.
Best,
I agree with @Jochen Wilhelm. The entire p-value dogma can be boiled down to the question of clinical significance. As a clinician, I'm only concerned with whether the "intervention" is likely to be a positive one for my patient. End of story!
@Christopher,
maybe I got you wrong here, but don't you just postpone the problem to the clinical regime? The statement "the intervention is likely to have a positive effect" from a clinical study is the very same statement as "the intervention is likely to be a positive one for my patient". What I tried to stress is that "an intervention (effect) is positive" is not sufficient to make a rational decision about whether or not the intervention should be approved by a health agency or used by a doctor.
What counts for the doctor is the change in "total health" or "well-being" of the patient. The intervention effect is only a part of that. It can be that intervention A has a lower effect than B, but comes with a lower risk of a bad side-effect. The doc has to weigh these things against each other. Just knowing that the effect is positive is not enough.
What counts for a health system may be the expected long-term costs associated with the patient. This is linked to "total health", but even under an equal expectation for the increase in "total health" between two interventions, they may not result in the same costs. It may possibly be cheaper to let the patient die than to try any intervention at all. That leads to the next point:
What counts for society may be how the costs are weighed against the health (and the life) of a patient. Here we must find a social agreement (we must develop our ethics) - this is a process that is extremely difficult and notoriously slow.
It is impossible to reduce such questions to a "positive effect" if this is interpreted as "this intervention reduces the symptoms of the disease". If you see a "positive effect" as some complicated overall measure that is hardly well defined, then the statement is practically useless (for sure, we work to make the world a better place! - the tricky question is how we define "better", precisely!).
I agree with Jochen Wilhelm to a large extent, but, if "effect size" is of central concern to you there are two ways of doing things differently that may help:
(i) do a significance test of the null hypothesis that the effect size is less than some meaningful (non-zero) value;
(ii) form a confidence interval for the effect size (a small sketch of both follows).
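For what it's worth, here is a minimal sketch in Python (scipy, invented numbers) of both suggestions: a one-sided test against an assumed smallest meaningful effect, and a confidence interval for the effect size. The value of 5.0 for the meaningful effect and the simulated data are purely hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=12.0, scale=10.0, size=40)    # hypothetical measured treatment effects
delta = 5.0                                      # assumed smallest effect considered meaningful

# (i) one-sided test of H0: mean effect <= delta against H1: mean effect > delta
t, p = stats.ttest_1samp(x, popmean=delta, alternative="greater")
print(f"H0: mu <= {delta}:  t = {t:.2f}, p = {p:.4f}")

# (ii) 95% confidence interval for the mean effect
ci = stats.t.interval(0.95, len(x) - 1, loc=x.mean(), scale=stats.sem(x))
print(f"95% CI for mean effect: ({ci[0]:.2f}, {ci[1]:.2f})")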
Good morning all,
Very interesting thread, thanks for all the insightful comments.
Jochen Wilhelm said "It is impossible to reduce such questions to a "positive effect" if this is interpreted as "this intervention reduces the symptoms of the disease". If you see a "positive effect" as some complicated overall measure that is hardly well defined, then the statement is practically useless (for sure, we work to make the world a better place! - the tricky question is how we define "better", precisely!). "
In my mind, this is always the big question, and I echo the importance of effect size, and think it should be included in the researcher's initial deliberations. This follows Christopher Smith's reasoning above. For example, if I create an anti-hypertensive drug which "significantly" lowers BP by 10 mmHg, and it costs the patient $300 per dose, then even though my p-value and calculations were solid, there was insufficient "bang for the buck".
In the social sciences, this is especially prevalent in studies of correlation or association. I have seen too many studies where (using this faux example) "drinking coffee significantly lowers stress, p < 0.001", only to see that the raw Pearson's r value is 0.3 (described in most literature as a 'weak' association). The r-squared shows that coffee accounts for only 9% of the variance in the stress model, with the remaining 91% coming from other, undefined or random, sources, despite the significance.
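To make that faux example concrete, here is a small simulation (entirely made-up data, not a real study): with a couple of thousand observations, a correlation of about 0.3 is "highly significant" yet still explains only about 9% of the variance.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
coffee = rng.normal(size=n)
stress = -0.3 * coffee + rng.normal(scale=np.sqrt(1 - 0.3**2), size=n)   # weak true relation

r, p = stats.pearsonr(coffee, stress)
print(f"r = {r:.2f}, r^2 = {r**2:.2%}, p = {p:.2e}")
# typical output: r near -0.30, r^2 near 9%, p far below 0.001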
We, as researchers, need to better educate our students in the importance of this topic, and thanks to all of you for keeping it in the forefront of conversation. It makes me want to look at new articles that much more closely.
Cheers!
Rich
The biggest problem with the p-value is that it can easily be reduced by increasing the sample size. See the link:
https://en.wikipedia.org/wiki/Data_dredging
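A quick numerical sketch of that point (the 0.05-standard-deviation "effect" below is an arbitrary, practically negligible value chosen only for illustration): the estimated effect stays tiny, but the p-value can be driven below any threshold simply by collecting more data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect = 0.05                     # a practically negligible shift (assumed)
for n in (50, 500, 5000, 50000):
    x = rng.normal(loc=true_effect, scale=1.0, size=n)
    t, p = stats.ttest_1samp(x, popmean=0.0)
    print(f"n = {n:6d}   mean = {x.mean():+.3f}   p = {p:.4f}")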
@ Jochen Wilhelm
Sir, you misunderstand my clinical point. As a consumer of peer-reviewed medical journals, my intent when reading the published literature is to ascertain whether the information presented should be adopted into my daily practice. You are quite correct in your description of a population-health perspective, although my previous comment concerned the care of individual patients. When engaged in the care of a patient, my concern is that patient, not the population (although the population is quite important). Interpretation of the current literature informs my practice over time, and p-values catch my eye, but clinical significance and effect size are the metrics that influence my clinical decisions and approaches. (I am always looking for a better mousetrap.) My previous comment was an attempt to condense this perspective. Thank you.
Regards,
Christopher
@ Dear Christopher,
thank you for responding. I see that you are directly interested only in the individual patient, not so in economics or ethics (how else could you do your job?!).
I wonder why "p values [should] catch your eye". In your routine job you are (hopefully) not orienting your treatment on basic research papers but rather on larger clinical trials (phase II/III). If such trials do not show that they are at least confident about the sign of an effect (i.e. have a "significant" result), they would not be published. As the size of the p-value also depends on the sample size, a lower p-value may only indicate a larger sample. A study on a rare disease may simply not have a large sample available and thus will necessarily have a relatively high p-value even when the effect is "large".
It is good that "clinical significance and effect size are the metrics that influence [your] clinical decisions and approaches". My concern was that even this is not enough to consider in clinical decisions, even when we talk about the treatment of an individual patient. This patient will not only have a chance of a clinically significant treatment effect but also the risk of one or more clinically significant side effects (directly or indirectly associated with the treatment). It is important to weigh these factors against each other.
Thank you for that discussion and please correct me where I am wrong.
I'm just glad I don't have to explain the contribution of each of my 71 co-authors on a paper.
Coming from an agriculture and natural resources background, I think the 0.05 alpha level works out pretty well. Most field experiments don't have too many replicates, so an effect with p < 0.05 merits looking at. It's probably a good balance. I've always guessed that the alpha = 0.05 criterion became cemented in this context.
I've also wondered if research in agriculture has had fewer problems with reporting effect sizes and such, because the measurements are understandable to the reader. That is, a difference of 0.5 bushels per acre of corn isn't too impressive no matter what your p-value. And there is a kind of simple cost-benefit ratio running in the background, because the cost of a fertilizer treatment or the labor to apply it is relatively real and obvious. Just musing, but I'll also note that the main authorship of the paper comes from the fields of psychology, sociology, economics, and related fields.
So, from my perspective, for my field, I don't see how changing the standard alpha level would fix anything. I think the solution to some of these problems lies in better education of writers and readers about the importance of the size of effects, honestly plotting your data, and thinking about the practical implications of research.
Disclaimer: I'm an early-career researcher in biostatistics & epidemiology, and given that I'm "young" (especially in terms of career - I finished my master's 3.5 years ago) I might be too idealistic in the way I approach research.
I tend to believe that the problem is not really "a matter of size" of the pval. Personally, I think that a threshold of 0.005 would be too strict in "normal" inferential analyses (it is straightforward that this statement does not apply to the issues coming from multiple testing, which can be solved through various approaches designed to reduce the threshold).
The problem, IMHO, is that pvals (even a veeeeery small one) are often misinterpreted - due to either ignorance of basic statistical concepts or "I-want-that"-guided interpretation - especially by lab researchers, for example in the biomedical field.
I have lost count of how many times lab researchers and/or clinicians came to my desk expecting me to wave my magic wand and make pvals come out as they wanted/expected; often, research is driven by the hope of finding a certain pval rather than by proper scientific methodology, i.e. carrying out an "experiment" (of any kind) with an appropriate design and accepting its results "no matter what". Again taking the biomedical sciences as a touchstone, "appropriate design" would mean writing down a protocol including an estimation of sample size (declaring also the effect size, which is what makes the difference), a suitable design according to the purpose, and at least a draft of the plan for data analysis. Often, the protocol - if they have one - does not meet those criteria, and this should cast doubt on any results, whatever the pval is.
I guess that a lot of colleagues from biostatistical units could say the same: how many times has someone from the clinical side asked you to "adjust a bit" the (correct and appropriate) analyses you presented because the results did not meet expectations? Or to do what is actually "data fishing", i.e. run lots of tests until you find a significant signal, even though it does not make any sense when you look at it from a clinical standpoint (thus introducing a lack of coherence)?
Results are often unreplicable because of inconsistency, not because the threshold we use on pvals is too liberal.
I suggest that we should start from here. We, as applied statisticians, should firmly refuse to carry out these misguided practices. Clinicians - or, in general, "users" of our analyses - should be better educated in their comprehension of statistics and, even more importantly, in research methodology. If we don't, any reform of the significance threshold will be totally useless (if not dangerous).
In 2015 the editorial board of Basic and Applied Social Psychology declared that anyone wanting to publish in the journal should avoid any reference to p-values. The Benjamin et al. manifesto triggering this thread does not go that far, but advocates tenfold tougher standards before statistical significance can be claimed. To motivate their conclusions, the authors compare p-values with Bayes factors, saying that with prior odds at least as great as 1:5 in favour of H1, the conventional 0.05 threshold provides too weak evidence against H0. B. Efron, in his JRSS 2015 paper, shows that Bayesian credible intervals are basically equivalent to frequentist confidence intervals when using convenience or reference vague priors. With prior odds closer to 1, the curve describing the false positive rate-power relationship shown in Benjamin et al.'s Fig. 2 would sit lower for p = 0.05, with a minimum false positive rate not that far from the nominal 5%.
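For readers who want to play with that relationship, here is a rough sketch of the standard false-positive-rate formula used in such discussions, FPR = alpha*(1-pi) / (alpha*(1-pi) + power*pi), where pi is the prior probability that H1 is true. The numbers below are my own choices, not taken from the paper.

alpha = 0.05
for prior_odds in (1/10, 1/5, 1/1):                 # odds of H1 : H0
    pi = prior_odds / (1 + prior_odds)
    for power in (0.2, 0.5, 0.8):
        fpr = alpha * (1 - pi) / (alpha * (1 - pi) + power * pi)
        print(f"prior odds {prior_odds:>5.2f}, power {power:.1f}: FPR = {fpr:.2f}")
# with prior odds near 1 and decent power, the FPR is indeed not far above the nominal 5%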
I tend to view that one from a Bayesian perspective, too. But much more important than the exact value of the significance level is that tests are performed only in situations where the treatment under study has a high prior plausibility of being effective. Otherwise, significance even at a very stringent level is meaningless, just as a positive HIV test is very often falsely positive in a low-prevalence population of blood donors. So while I think it is worth reflecting on a "better" p-value, I would rather like to see that the plausibility/efficacy etc. of any treatment or intervention is sufficiently well supported by theory or sound prior knowledge, and that journals put more weight on prior plausibility than on p-values. And yes, of course, effect size and clinical significance must be considered, too!
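The HIV analogy can be made quantitative with a couple of assumed numbers (the sensitivity, specificity and prevalence below are illustrative only, not real screening figures):

sensitivity = 0.999        # P(test positive | infected), assumed
specificity = 0.999        # P(test negative | not infected), assumed
prevalence = 0.0001        # assumed low-prevalence population of blood donors

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_positive
print(f"P(infected | test positive) = {ppv:.1%}")   # only about 9% here: most positives are false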
You should not use a single p-value threshold for any set of applications. Each application should be individualized based on standard deviation and sample size.
PS - You might consider the following: https://www.researchgate.net/publication/262971440_Practical_Interpretation_of_Hypothesis_Tests_-_letter_to_the_editor_-_TAS
Michal -
You note that "In an upcoming issue of Nature Human Behaviour, Daniel Benjamin and colleagues propose a new p-value threshold of 0.005 instead of our current dogma of 0.05." However, this just trades one problem for another. If "big data" applications, made possible now by advanced computing power, had been the norm from the beginning, then many people would have used 0.005 all along. But the mistake is assuming "one size fits all." 0.05 should never have been considered a "standard," and 0.005 should not be considered a "standard" either. The very idea of a "standard" here is wrong, as I indicated in an earlier response. Isolated p-values are basically meaningless. It is much better to focus on relative standard errors or sensitivity analyses.
Cheers - Jim
I agree with the answers given. Several papers address this issue. A classical one is attached:
Think of the p-value as giving the probability that the results could arise by chance. Then even if we reject the null hypothesis that our results could have arisen by chance, we still know nothing about how the results arose.
One possibility is that our alternative hypothesis is correct.
But another possibility is that we ran a study that failed to take into account one or more other, systematic sources of variability.
Tinker as much as you like with the p-value and it's still not going to tell you whether the results are reproducible. Switching to an estimation framework is not going to tell you whether your results are reproducible.
To do that, we need to know how the study was conducted: the design details and the steps taken to minimize or eradicate potential systematic sources of variability. Without that detail, we can't reproduce a result. With that detail we may be able to reproduce a result and yet conclude that the result arises through some systematic bias.
Those who pay lip service to design issues - control, bias, randomization, blocking - are going to get burnt.
" Think of the p-value as giving the probability that the results could arise by chance."
Dennis, that wording troubles me. It sounds (to me) as if you are saying p = p(H0 is true). I wonder if you really mean something more like this: p = the probability of getting a result at least as extreme as the observed result given that H0 is true. And if H0 is true, any departure from what one would expect under H0 is purely due to chance.
HTH.
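P.S. If it helps, that definition can be checked directly by simulation. This is just a sketch with a hypothetical sample size and observed t-statistic, not anyone's real data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, observed_t = 30, 2.1                          # hypothetical sample size and observed t-statistic

# analytic two-sided p-value: P(|T| >= 2.1) under H0, with T ~ t(n-1)
p_analytic = 2 * stats.t.sf(observed_t, df=n - 1)

# simulate many samples with H0 true (mean 0) and count t-statistics at least as extreme
sims = rng.normal(loc=0.0, scale=1.0, size=(100_000, n))
t_stats = sims.mean(axis=1) / (sims.std(axis=1, ddof=1) / np.sqrt(n))
p_simulated = np.mean(np.abs(t_stats) >= observed_t)

print(f"analytic p = {p_analytic:.4f}, simulated p = {p_simulated:.4f}")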
Instead of "due to chance" I would prefer "not explained by the model".
We too often think of quantitative variables and a normal distribution. This is blurring a clear understanding of what we are talking about.
Consider a binomial response, which is represented by a random variable X that can take the values 0 or 1. With "random variable" we mean to say that we don't know the value X actually has (or will have, when we are going to observe it), and we instead give only probabilities for its possible values. This has nothing to do with believing that the value X takes is arbitrary or non-deterministic or in principle unpredictable or unknowable. It is just that we don't know it for sure; we may well have more or less good reasons to believe that it's more likely to take 0 than 1 (or the other way around), and this is what we quantify with a probability distribution.
The words "probability" and "chance" are synonyms (some say that "probability" is just a mathematically well-defined quantity, whereas "chance" is used in a less quantitative fashion, but eventually they both mean the same thing).
Here, providing p = Pr(X=1) is sufficient to fully specify the whole distribution, because Pr(X=0) = 1-p, and there are no more possible values X can take. In a hypothesis test, we can test a sample of observed data under a given value of p. Using p = p0 would allow us to see if the data are sufficient to conclude that p < p0 (or p > p0). The test is based on a comparison of how well our model with p fixed at p0 predicts the data relative to the best possible explanation of the data by our model (not fixing p).
Having p fixed or not doesn't make the data more or less "due to chance". The only thing we can see is whether the explanation of the observed data by the model is considerably impaired by restricting p to be fixed at p0. Note that we may express p as a function of some predictor values and test the coefficients of that model, e.g. to see if our data are sufficient to conclude that a predictor increases or decreases p (the coefficient in the model is usually a log odds score). Further note that the impact of restricting a coefficient depends on the entire model. Changing the model will change the impacts, because we "put our knowledge in a different frame", so to say. That doesn't make the data more or less "due to chance". Chance (probability) is a measure we assign to express what we know about a value in some given context.
We can now transfer this to a random variable with a normal probability distribution. This variable can take any real value. Say we know σ² but have no idea about µ. In a hypothesis test, we can test a sample of observed data under a given value of µ. Using µ = µ0 would allow us to see if the data are sufficient to conclude that µ < µ0 (or µ > µ0). The test is based on a comparison of how well our model with µ fixed at µ0 predicts the data relative to the best possible explanation of the data by our model (not fixing µ). We can express µ as a function of predictors and test coefficients. And still, the question is always whether the restriction impairs the ability of our model to explain the observed data. At no point does the model distinguish or measure whether or how much the data are "due to chance".
This post got longer than intended, and I am still not sure if someone will understand what I was trying to say. I give it a probability of 0.2 ;)
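P.S. To make the binomial part concrete, here is a small sketch (invented counts) of the comparison described above: the fit of the model with p fixed at p0 versus the best possible fit, summarized as a likelihood-ratio statistic; the exact binomial test is shown only for comparison.

import numpy as np
from scipy import stats

k, n, p0 = 38, 100, 0.5                  # invented data: 38 "successes" out of 100; H0: p = 0.5
p_hat = k / n                            # unrestricted maximum-likelihood value

def loglik(p):
    return k * np.log(p) + (n - k) * np.log(1 - p)

lr_stat = 2 * (loglik(p_hat) - loglik(p0))   # how much fixing p at p0 impairs the explanation
p_value = stats.chi2.sf(lr_stat, df=1)
print(f"likelihood-ratio statistic = {lr_stat:.2f}, p = {p_value:.4f}")

# the exact binomial test of the same restriction, for comparison
print(f"exact binomial test p = {stats.binomtest(k, n, p0).pvalue:.4f}")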
It is necessary to complete Jochen's nice response with a remark: the methods of statistics exist for mutually communicating the results of our models in uniformly understandable terms. Therefore, there is also no need to determine strictly which p-level is the best one for rejecting "proper" hypotheses; this, too, is simply a convention. One dependence of this parameter should, however, be stressed: when building a bridge, preparing a medicine that could be dangerous, or starting a big enterprise, one should take care about the consequences in case "something goes wrong". Thus, in risky situations, rejection should be given a greater chance than for problems with less negative consequences. Obviously, this can also be modelled in some way, e.g. by randomizing the probability of rejecting and/or incorporating this problem into the project's global research costs.
Regards, Joachim
Are p-values valid? Why look for a new value? Would they be the product of a deformation of the original form in which they were created? Do p-values have very precise indications for their use? Answers to these questions and many more can be found in:
https://www.researchgate.net/publication/260062295_Null_hypothesis_significance_tests_A_mix-up_of_two_different_theories_The_basis_for_widespread_confusion_and_numerous_misinterpretations
Well, as a late arrival at this symposium, here is a dialectic response to the query "How will it (Benjamin's revision of the p-value threshold to 0.005) change our research?"
Thesis: A new p-value threshold of 0.005 instead of our current dogma of 0.05 will reduce false positive results in the scientific community.
Antithesis 1. It will increase false negative results and possibly diminish the meaning of effect size.
Antithesis 2. It rests on two unexamined premises: (2a) Bayesian posterior probability is a measure of strength of evidence; (2b) sample size is cheap.
Synthesis 1. The inevitable risk of increased false negatives falls on the experimenter, but not on the experimental community, via published meta-analyses. The growing use of effect-size reporting is to be strongly encouraged, regardless of Benjamin et al.
Synthesis 2a. Evidence (once we have the data) is not the same as a probability calculated from the likelihood ratio after we have the data. Royall (1996) makes the case for likelihood as the measure of evidence.
Synthesis 2b. Incremental costs of an increase in sample size depend on the field. It may well be cheap in psychology and the social sciences (note the affiliations of many of the co-authors). In some fields, incremental costs preclude research at a 0.005 threshold.
Comment on change in practice: a shift to 0.005 is more useful to readers than to researchers. This is possible as long as authors continue the now-established practice of reporting exact p-values, not just thresholds as required by Neyman-Pearson dogma.
~David S
Shouldn't we just silence p-values to death and wait for the last proponents (who build their careers on the paradigm) to retire?
Young students and postdocs are still trained in frequentist statistics, mostly even without being told that this is "just" a philosophy and that there is (at least) a different philosophy around. And these researchers have to publish in journals where reviewers want to have their beloved p-values (also because data are undoubtedly much more difficult to interpret otherwise -- actually, it would become necessary to really interpret data rather than just looking at a p-value that seemingly tells us everything important). I thus doubt that the flow of proponents building their careers on p-values will ever dry up.
The problem is not the p-values. The problem is us. The p-value (and confidence intervals) are viable tools to help us judge if the data we have can give us sufficient confidence in claiming the direction or sign of an effect, which can be really helpful in complex situations/models. It's us forgetting that this is about confidence, not about probability, and that it is about data, not about estimates or parameters. We use it wrongly, and we interpret it wrongly.
Going to estimation is a rather huge step - not at first sight, but at second. A fact that is again easily overlooked by many. A "least squares estimate" or, more generally, a "maximum likelihood estimate" is actually by itself not an estimate of a parameter (an effect). It is just the value of the parameter under which the observed data are most likely, and the confidence interval is a range of values under which the observed data are not too unlikely. As these are calculated only from the data (based on a few assumptions, as always), they are sample statistics (just like a p-value is a sample statistic), and no magic enters the meaning of such a statistic. It's still data; not a parameter. If we really want to say something about the parameter, we have no other choice but to model parameters as random variables so we can assign probability distributions over the possible values of the parameters. And that is just where we flip the philosophy from frequentist to Bayesian. We must then admit that there exists no objectivity (which I think is fine -- or better: correct -- but which I think won't be acceptable to the majority). This is what I mean by a huge step.
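As a small illustration of that flip (my own toy example, with a deliberately uninformative Beta(1,1) prior and invented counts): once the parameter is given a distribution, interval statements become direct probability statements about the parameter.

from scipy import stats

k, n = 38, 100                                   # invented data: 38 successes in 100 trials
posterior = stats.beta(1 + k, 1 + (n - k))       # Beta(1,1) prior -> Beta(1+k, 1+n-k) posterior

lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval for the proportion: ({lo:.3f}, {hi:.3f})")
print(f"Pr(p < 0.5 | data) = {posterior.cdf(0.5):.3f}")   # a direct probability statement about the parameter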
Just so it's clear, here's the American Statistical Association statement on p-values.
It has more to do with how to use them rather than whether to use them. It does warn against specific thresholds (0.05, 0.005, whatever).
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
The problem is also with a lawyer or politician who might have had a science class as general education and must now deal with a complex social or biological issue. Just tell me the answer in 250 words or less, not some endless mumbo jumbo about uncertainty. Tell me the truth so I can move on. A p-value lets people easily fall into this trap.
If anyone is still looking at this, please read the paper at the following link:
https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.Wym3BCCQyCg
It is important to understand, not trust, the "MAGIC" of the numbers.
Best, David Booth
Another problem is that most people just don't understand the weird sampling logic and twisted asymptotic math of frequentist test statistics. In our master track in Psychology I give statistics workshops in which I introduce Bayesian parameter estimation. I typically start with: "Whatever you have learned in your statistics classes, don't let that come into your way!" The frequent response of students is: "Now, I get it!".
And here is my standard footnote when I use Bayesian credibility limits in a scientific report: "Credibility intervals are what many researchers falsely assume the confidence interval is [1]."
[1] Hoekstra R, Morey RD, Rouder JN, Wagenmakers E-J. Robust misinterpretation of confidence intervals. Psychon Bull Rev 2014:1–7. doi:10.3758/s13423-013-0572-3.
In conclusion, if we want do-it-yourself statistical analysis in the future, it has to be as intuitive to use, flexible and safe as a modern car. That's obviously not true for frequentist NHST and p-values.
SELECTIVE APPLICATION
An error level of 5% is practical, especially in social science research. An error level of 0.5% may be possible and useful in certain fields outside of social science, e.g. the natural sciences, such as physics or genetics, where high precision is needed.
This thread started with: do we need a new threshold for Type I error? R.A. Fisher proposed 4 "definite levels of significance"
0.1, 0.5, 0.2, and 0.1, leaving the reader to produce the estimate of Type I error and make their own judgement. As a researcher I like Fisher's approach for statistics in a research setting.
"We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of *other* free minds to utilize them in making *their own* decisions." J.Roy.Stat.Soc.17:69-78.
That being said, as a consultant I have to ask "What is the cost of Type I error?"
Were I a physician, I would need to ask the same question, for me and for the patient. If Type I error has a clear cost that I can articulate, let's limit it to acceptable levels, such as 5%. If we can't pin down the cost, let's be flexible, in the manner of Fisher.
Suum cuique,
David S
@ David,
You wrote "0.1, 0.5, 0.2, and 0.1" - did you mean to repeat the 0.1? Significance at 0.5 means that there is a 50-50 chance of finding an observation as great as or greater than the one observed. I suspect fingers typing too fast.
I might agree if we were all more statistically fluent. However, I become uncomfortable when p-values of 0.50 and 0.60 are discussed as showing a significant effect (as in a recent report that I was shown). Based on these values hundreds of thousands of dollars were spent continuing this research. Fortunately, this has since stopped.
Even if there is no clear cost, I think we should think very hard about what it means to relax the 0.05 significance level, arbitrary as it is. So as a scientist, if any p-value can be significant then I can get more papers out by reducing replication to 2 replicates per treatment. Everything is significant and I can finally tell the story in the data that I know is there.
@Timothy A Ebert
Fingers too fast indeed.
Fisher offered flexibility at four levels 0.1, 0.05, 0.02, 0.01. Neyman-Pearson gave us rigidity at a single preset criterion (usually 5%). As a scientist I concur, with you and Fisher, at drawing no conclusion, at p > 0.05.
As an educator in an academic setting, I find that students take well to likelihood ratios (e.g., the regression model is 600 times more likely than no relation). Students chafe at, and routinely misinterpret, the logic of rejecting or not rejecting a null hypothesis. Maybe someday we can get beyond the fixed criterion for Type I error, except where it matters, when there is a clear cost.
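In case it is useful for teaching, here is a rough sketch of how such a likelihood ratio can be computed for a simple regression versus a "no relation" (intercept-only) model. The data below are invented, so the resulting ratio depends entirely on them.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=25)
y = 2.0 + 0.8 * x + rng.normal(scale=2.0, size=25)       # invented data with a real relation

full = sm.OLS(y, sm.add_constant(x)).fit()               # intercept + slope
null = sm.OLS(y, np.ones_like(y)).fit()                  # intercept only: "no relation"

lr = np.exp(full.llf - null.llf)                         # how many times more likely the data are under the full model
print(f"likelihood ratio (regression vs. no relation) ~ {lr:.3g}")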
Thanks for the correction, David S
@David C Schneider,
It would be great if you'd give a reference for your statements:
"Fisher offered flexibility at four levels 0.1, 0.05, 0.02, 0.01. Neyman-Pearson gave us rigidity at a single preset criterion (usually 5%). As a scientist I concur, with you and Fisher, at drawing no conclusion, at p > 0.05."
I know that Fisher wrote in the 13th edition of his 1925 classic book "Statistical Methods for Research Workers", page 44: “The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not.” This limit was quite obviously chosen because he was allowed to show only an excerpt of the tables of critical t-values (the copyright for the full table was with Neyman). In his own research papers Fisher never adhered to any particular cut-off. Sometimes he considered p around 0.2 as significant, sometimes not, depending on the context of the experiment.
And to my knowledge, Neyman never promoted any particular cut-off (actually, Neyman did not consider rejecting H0 at any "level of significance", as his test is not about significance and rejecting H0; it is about the confidence of accepting one of two alternatives). You may be confusing Neyman's type-I error rate with the probability of a false rejection of H0 under H0 (despite what textbooks often teach wrongly, these are very different things!).
So a bit of clarification would be nice here, how you get to your statements.
Jochen, I believe you were directing your question to David C Schneider, rather than to me.
However, it may be helpful to point out a recent set of videos on YouTube, that relate to the discussion here:
https://youtu.be/VRF-UwrepAs
What do we learn about significance tests from the replication crisis? - Significance tests (part 1)
https://youtu.be/R7dQo9bW1DM
Should we redefine statistical significance? - Significance tests (part 2)
https://youtu.be/gtQlvR-RUPg
The use and abuse of Significance Testing ... - Significance tests (part 3)
https://youtu.be/txLj_P9UlCQ
Sir David Cox - In gentle praise of Significance Tests - Significance tests (part 4)
These are from the Royal Statistical Society and dated 5 September 2018 (published 23 October 2018), so reasonably up-to-date.
David A. Jones, yes, I selected the wrong "David" from he list. I corrected that. Sorry for the confusion!
I read a lot of quantitative education studies where the data are drawn from PISA scores (a worldwide multi-subject test of 15-year-olds). I know 15-year-olds and I know stats, and for authors to make bold statements about teachers, teaching, and learning based on their low level of significance is a stretch, especially with no on-site observations to back up their "significant" findings. I suggest changing the accepted level of significance and requiring mixed methods for studies used in policy decisions in education. Yikes, we are still fighting against the "modernity project" - except now it's against neo-liberal statisticians!
Jochen Wilhelm
It would be great if you'd give a reference for your statements:
"Fisher offered flexibility at four levels 0.1, 0.05, 0.02, 0.01.
The reference is page 80 of Statistical Methods for Research Workers.
"If p is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05, and consider that higher values of [chisquare] indicate a real discrepancy."
Higher values could be taken as 0.01, 0.001, etc.
See also p 25 of Design of Experiments. 8th ed 1966
"Convenient as it is to note that a hypothesis is contradicted at some familiar level of significance such as 5% or 2% or 1% we do not, in Inductive Inference, ever need to lose sight of the exact strength which the evidence has in fact reached."
The quote comes from a long paragraph that is a comprehensive argument against Neyman-Pearson decision theoretic hypothesis testing, where a fixed criterion is set beforehand. The paragraph ends with a rejection of NPDTHT: "A good deal of confusion has certainly been caused by the attempt to formalise the exposition of tests of significance in a logical framework different from that for which they were in fact first developed."
I read that as Fisher's denial of responsibility for using a fixed criterion, pointing instead at what was intended when first developed by Fisher-- a way of judging strength of evidence (from the p-value), which Fisher considered a stronger form of inference than reporting the likelihood ratio.
Ziliak and McCloskey (2008) blame Fisher for "imposing the Rule of Two" - significant if greater than 2 standard deviations (1.96 to be exact). Saying that Fisher imposed the Rule of Two strikes me as blaming the victim, given Fisher's bitter fight against the rigid criteria that came from Neyman and Pearson.
Once I finish writing my comps I will attend to this - part of the problem is that I have my feet in two worlds. I am currently a narrative researcher but still do research in science on the side. I am currently digging into some work on "cognitive load" by Sweller, and the results of their experiments relate to this. I apologize for not addressing your well-conceived question, but I will get to it in time - it's a great question. Cheers, Pat
Part of the problem is that we want the right statistical method for a given analysis. All other methods are then wrong and therefore my results and conclusions are true. Often the right statistical method is the one that can be done in the shortest time and results in a single value that proves the point. Its truth is further validated by getting two or three others to accept it and having it finally appear in print.
The alternative problem. I analyzed the data in three ways that differ in their assumptions. I present proof within the limits of my data that the assumptions of these models are reasonable. I now have a 45 page manuscript that no one will read even if it does get published. The purpose of the paper is "lost" in considering the possibility that my conclusions are an artifact of the experimental design and analysis.
Setting a global desire for a p-value of 0.05 or 0.005 is simply dogma.
Statisticians generally eschew dogma when it comes to making sensible decisions about study design and p-values. Instead, the goal is to think scientifically, with knowledge of the biology, the consequences of building a poor model (biased results from failure to account for confounders), and the respective costs of type 1 and type 2 error.
I agree that the p-value chosen should reflect the relative costs of type 1 error (rejecting the null hypothesis when no difference exists) and type 2 error (failing to reject the null hypothesis when a difference does exist). Minimizing type 1 error increases the probability of type 2 error. Type 2 error is common when the power to detect a difference is low (i.e. a sample too small to detect a small difference). The first is a false positive - finding a difference when none exists - and the second is a false negative - e.g. failing to find a real difference between drug A and drug B when drug B really is the more effective drug.
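A hedged sketch of that trade-off with statsmodels (the effect size of 0.3 and the per-group sample sizes are assumptions chosen only for illustration): lowering alpha from 0.05 to 0.005, or shrinking the sample, inflates the type 2 error of a two-sample t-test.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.3                                        # assumed standardized difference (Cohen's d)
for alpha in (0.05, 0.005):
    for n_per_group in (50, 200, 800):
        power = analysis.power(effect_size=effect_size, nobs1=n_per_group,
                               alpha=alpha, ratio=1.0)
        print(f"alpha = {alpha:.3f}, n/group = {n_per_group:4d}: "
              f"power = {power:.2f}, type 2 error = {1 - power:.2f}")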
Really, though, computer programs will generally give you the exact p-value, and you can decide how important it is to your circumstances; but report the value that is generated so that the reader can judge the results for herself. In medicine we definitely want to detect a rare debilitating disease. We do not wish to make a type 2 error, and so we use a large sample size to offset the probability of failing to find the rare disease.
In model building, on the other hand, we want to be sure to include confounding variables (gender, age) even if the p-value is greater than 0.05 (say 0.10). This is to be sure to control for factors that might interact with the outcome of interest (coronary artery disease) even if the effect is small. Failing to include these just because they are not ≤ 0.05 could result in a biased odds ratio (say) that is artificially high or low.