In an upcoming issue of Nature Human Behaviour, Daniel Benjamin and colleagues propose a new p-value threshold of 0.005 instead of our current dogma of 0.05. They claim that it will reduce false positive results in the scientific community. On the other hand, it will increase false negative results and possibly diminish the meaning of effect size.
You may find the full text here: https://osf.io/preprints/psyarxiv/mky9j/
I would like to hear your opinion about the suggestion of a new p-value threshold... How can it change our research?
Hot air.
The problem is that p is mistaken for the probability that the null hypothesis is true, or that rejecting the null only when p < 0.05 is thought to guarantee that only 5% of such rejections are wrong.
Hi Michael,
I share the opinion of Andrew Gelman's group (professor at Columbia University) and Christopher Bishop (research leader at Microsoft). I believe that there is no magic threshold to determine that a model effect is “significant” (there is a great distance between an effect being significant in reality and being significant “asymptotically”). I believe that a model should naturally handle the multiple-hypothesis problems that inflate the false-discovery rate. This is why we Bayesians do not worry about whether, in a multiple-comparison setting, a test correction is or is not needed. We manipulate probability distributions as building blocks that are reasonable for the nature of the problem, which naturally circumvents the problems that arise in the closed-form formulations commonly used in frequentist hypothesis-testing settings. I believe that the model should be constructed to reflect the true nature of the problem, not the nature of the statistical procedure.
Best,
I agree with @Jochen Wilhelm. The entire p-value dogma can be boiled down to the question of clinical significance. As a clinician, I'm only concerned with whether the "intervention" is likely to be a positive one for my patient. End of story!
@Christopher,
maybe I got you wrong here, but don't you just postpone the problem to the clinical regime? The statement "the intervention is likely to have a positive effect" from a clinical study is the very same statement as "the intervention is likely to be a positive one for my patient". What I tried to stress is that "an intervention (effect) is positive" is not sufficient to make a rational decision about whether or not the intervention should be approved by a health agency or used by a doctor.
What counts for the doctor is the change in "total health" or "well-being" of the patient. The intervention effect is only a part of that. It can be that intervention A has a lower effect than B, but comes with a lower risk of a bad side-effect. The doc has to weigh these things against each other. Just knowing that the effect is positive is not enough.
What counts for a health system may be the expected long-term costs associated with the patient. This is linked to "total health", but even under an equal expectation for the increase in "total health" between two interventions, they may not result in the same costs. It may possibly be cheaper to let the patient die than to try any intervention at all. That leads to the next point:
What counts for society may be how the costs are weighed against the health (and the life) of a patient. Here we must find a social agreement (we must develop our ethics) - this is a process that is extremely difficult and notoriously slow.
It is impossible to reduce such questions to a "positive effect" if this is interpreted as "this intervention reduces the symptoms of the disease". If you see a "positive effect" as some complicated overall measure that is hardly well defined, then the statement is practically useless (for sure, we work to make the world a better place! - the tricky question is how we define "better", precisely!).
I agree with Jochen Wilhelm to a large extent, but, if "effect size" is of central concern to you there are two ways of doing things differently that may help:
(i) do a significance test of the null hypothesis that the effect size is less than some meaningful (non-zero) value;
(ii) form a confidence interval for the effect size (a small sketch of both follows).
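For what it's worth, here is a minimal sketch in Python (scipy, invented numbers) of both suggestions: a one-sided test against an assumed smallest meaningful effect, and a confidence interval for the effect size. The value of 5.0 for the meaningful effect and the simulated data are purely hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=12.0, scale=10.0, size=40)    # hypothetical measured treatment effects
delta = 5.0                                      # assumed smallest effect considered meaningful

# (i) one-sided test of H0: mean effect <= delta against H1: mean effect > delta
t, p = stats.ttest_1samp(x, popmean=delta, alternative="greater")
print(f"H0: mu <= {delta}:  t = {t:.2f}, p = {p:.4f}")

# (ii) 95% confidence interval for the mean effect
ci = stats.t.interval(0.95, len(x) - 1, loc=x.mean(), scale=stats.sem(x))
print(f"95% CI for mean effect: ({ci[0]:.2f}, {ci[1]:.2f})")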
Good morning all,
Very interesting thread, thanks for all the insightful comments.
Jochen Wilhelm said "It is impossible to reduce such questions to a "positive effect" if this is interpreted as "this intervention reduces the symptoms of the disease". If you see a "positive effect" as some complicated overall measure that is hardly well defined, then the statement is practically useless (for sure, we work to make the world a better place! - the tricky question is how we define "better", precisely!). "
In my mind, this is always the big question, and I echo the importance of effect size, and think it should be included in the researcher's initial deliberations. This follows Christopher Smith's reasoning above. For example, if I create an anti-hypertensive drug which "significantly" lowers BP by 10 mmHg, and it costs the patient $300 per dose, then even though my p-value and calculations were solid, there was insufficient "bang for the buck".
In the social sciences, this is especially prevalent in studies of correlation or association. I have seen too many studies where (using this faux example) "drinking coffee significantly lowers stress, p < 0.001", only to see that the raw Pearson's r value is 0.3 (described in most literature as a 'weak' association). The r-squared shows that coffee accounts for only 9% of the variance in the stress model, with the remaining 91% coming from other, undefined or random, sources, despite the significance.
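To make that faux example concrete, here is a small simulation (entirely made-up data, not a real study): with a couple of thousand observations, a correlation of about 0.3 is "highly significant" yet still explains only about 9% of the variance.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
coffee = rng.normal(size=n)
stress = -0.3 * coffee + rng.normal(scale=np.sqrt(1 - 0.3**2), size=n)   # weak true relation

r, p = stats.pearsonr(coffee, stress)
print(f"r = {r:.2f}, r^2 = {r**2:.2%}, p = {p:.2e}")
# typical output: r near -0.30, r^2 near 9%, p far below 0.001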
We, as researchers, need to better educate our students in the importance of this topic, and thanks to all of you for keeping it in the forefront of conversation. It makes me want to look at new articles that much more closely.
Cheers!
Rich
The biggest problem with the p-value is that it can easily be reduced by increasing the sample size. See the link:
https://en.wikipedia.org/wiki/Data_dredging
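A quick numerical sketch of that point (the 0.05-standard-deviation "effect" below is an arbitrary, practically negligible value chosen only for illustration): the estimated effect stays tiny, but the p-value can be driven below any threshold simply by collecting more data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect = 0.05                     # a practically negligible shift (assumed)
for n in (50, 500, 5000, 50000):
    x = rng.normal(loc=true_effect, scale=1.0, size=n)
    t, p = stats.ttest_1samp(x, popmean=0.0)
    print(f"n = {n:6d}   mean = {x.mean():+.3f}   p = {p:.4f}")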
@ Jochen Wilhelm
Sir, you misunderstand my clinical point. As a consumer of peer-reviewed medical journals, my intent when reading the published literature is to ascertain whether the information presented should be adopted into my daily practice. You are quite correct in your description of a population-health perspective, although my previous comment concerned the care of individual patients. When engaged in the care of a patient, my concern is that patient, not the population (although the population is quite important). Interpretation of the current literature informs my practice over time, and p-values catch my eye, but clinical significance and effect size are the metrics that influence my clinical decisions and approaches. (I am always looking for a better mousetrap.) My previous comment was an attempt to condense this perspective. Thank you.
Regards,
Christopher
@ Dear Christopher,
thank you for responding. I see that you are directly interested only in the individual patient, not so in economics or ethics (how else could you do your job?!).
I wonder why "p values [should] catch your eye". In your routine job you are (hopefully) not orienting your treatment on basic research papers but rather on larger clinical trials (phase II/III). If such trials do not show that they are at least confident about the sign of an effect (i.e. have a "significant" result), they would not be published. As the size of the p-value also depends on the sample size, a lower p-value may only indicate a larger sample. A study on a rare disease may simply not have a large sample available and thus will necessarily have a relatively high p-value even when the effect is "large".
It is good that "clinical significance and effect size are the metrics that influence [your] clinical decisions and approaches". My concern was that even this is not enough to consider in clinical decisions, even when we talk about the treatment of an individual patient. This patient will not only have a chance of a clinically significant treatment effect but also the risk of one or more clinically significant side effects (directly or indirectly associated with the treatment). It is important to weigh these factors against each other.
Thank you for that discussion and please correct me where I am wrong.
I'm just glad I don't have to explain the contribution of each of my 71 co-authors on a paper.
Coming from an agriculture and natural resources background, I think the 0.05 alpha level works out pretty well. Most field experiments don't have too many replicates, so an effect with p < 0.05 merits looking at. It's probably a good balance. I've always guessed that the alpha = 0.05 criterion became cemented in this context.
I've also wondered if research in agriculture has had fewer problems with reporting effect sizes and such, because the measurements are understandable to the reader. That is, a difference of 0.5 bushels per acre of corn isn't too impressive no matter what your p-value. And there is a kind of simple cost-benefit ratio running in the background, because the cost of a fertilizer treatment or the labor to apply it is relatively real and obvious. Just musing, but I'll also note that the main authorship of the paper comes from the fields of psychology, sociology, economics, and related fields.
So, from my perspective, for my field, I don't see how changing the standard alpha level would fix anything. I think the solution to some of these problems lies in better education of writers and readers about the importance of the size of effects, honestly plotting your data, and thinking about the practical implications of research.
Disclaimer: I'm an early-career researcher in biostatistics & epidemiology, and given that I'm "young" (especially in terms of career - I finished my master's 3.5 years ago) I might be too idealistic in the way I approach research.
I tend to believe that the problem is not really "a matter of size" of the pval. Personally, I think that a threshold of 0.005 would be too strict in "normal" inferential analyses (it is straightforward that this statement does not apply to the issues coming from multiple testing, which can be solved through various approaches designed to reduce the threshold).
The problem, IMHO, is that pvals (even a veeeeery small one) are often misinterpreted - due to either ignorance of basic statistical concepts or "I-want-that"-guided interpretation - especially by lab researchers, for example in the biomedical field.
I have lost count of how many times lab researchers and/or clinicians came to my desk expecting me to wave my magic wand and make pvals come out as they wanted/expected; often, research is driven by the hope of finding a certain pval rather than by proper scientific methodology, i.e. carrying out an "experiment" (of any kind) with an appropriate design and accepting its results "no matter what". Again taking the biomedical sciences as a touchstone, "appropriate design" would mean writing down a protocol including an estimation of sample size (declaring also the effect size, which is what makes the difference), a suitable design according to the purpose, and at least a draft of the plan for data analysis. Often, the protocol - if they have one - does not meet those criteria, and this should cast doubt on any results, whatever the pval is.
I guess that a lot of colleagues from biostatistical units could say the same: how many times has someone from the clinical side asked you to "adjust a bit" the (correct and appropriate) analyses you presented because the results did not meet expectations? Or to do what is actually "data fishing", i.e. run lots of tests until you find a significant signal, even though it does not make any sense when you look at it from a clinical standpoint (thus introducing a lack of coherence)?
Results are often unreplicable because of inconsistency, not because the threshold we use on pvals is too liberal.
I suggest that we should start from here. We, as applied statisticians, should firmly refuse to carry out these misguided practices. Clinicians - or, in general, "users" of our analyses - should be better educated in their comprehension of statistics and, even more importantly, in research methodology. If we don't, any reform of the significance threshold will be totally useless (if not dangerous).
In 2015 the editorial board of Basic and Applied Social Psychology declared that anyone wanting to publish in the journal should avoid any reference to p-values. The Benjamin et al. manifesto triggering this thread does not go that far, but advocates tenfold tougher standards before statistical significance can be claimed. To motivate their conclusions, the authors compare p-values with Bayes factors, saying that with prior odds at least as great as 1:5 in favour of H1, the conventional 0.05 threshold provides too weak evidence against H0. B. Efron, in his JRSS 2015 paper, shows that Bayesian credible intervals are basically equivalent to frequentist confidence intervals when using convenience or reference vague priors. With prior odds closer to 1, the curve describing the false positive rate-power relationship shown in Benjamin et al.'s Fig. 2 would sit lower for p = 0.05, with a minimum false positive rate not that far from the nominal 5%.
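For readers who want to play with that relationship, here is a rough sketch of the standard false-positive-rate formula used in such discussions, FPR = alpha*(1-pi) / (alpha*(1-pi) + power*pi), where pi is the prior probability that H1 is true. The numbers below are my own choices, not taken from the paper.

alpha = 0.05
for prior_odds in (1/10, 1/5, 1/1):                 # odds of H1 : H0
    pi = prior_odds / (1 + prior_odds)
    for power in (0.2, 0.5, 0.8):
        fpr = alpha * (1 - pi) / (alpha * (1 - pi) + power * pi)
        print(f"prior odds {prior_odds:>5.2f}, power {power:.1f}: FPR = {fpr:.2f}")
# with prior odds near 1 and decent power, the FPR is indeed not far above the nominal 5%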
I tend to view that one from a Bayesian perspective, too. But much more important than the exact value of the significance level is that tests are performed only in situations where the treatment under study has a high prior plausibility of being effective. Otherwise, significance even at a very stringent level is meaningless, just as a positive HIV test is very often falsely positive in a low-prevalence population of blood donors. So while I think it is worth reflecting on a "better" p-value, I would rather like to see that the plausibility/efficacy etc. of any treatment or intervention is sufficiently well supported by theory or sound prior knowledge, and that journals put more weight on prior plausibility than on p-values. And yes, of course, effect size and clinical significance must be considered, too!
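The HIV analogy can be made quantitative with a couple of assumed numbers (the sensitivity, specificity and prevalence below are illustrative only, not real screening figures):

sensitivity = 0.999        # P(test positive | infected), assumed
specificity = 0.999        # P(test negative | not infected), assumed
prevalence = 0.0001        # assumed low-prevalence population of blood donors

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_positive
print(f"P(infected | test positive) = {ppv:.1%}")   # only about 9% here: most positives are false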
You should not use a single p-value threshold for any set of applications. Each application should be individualized based on standard deviation and sample size.
PS - You might consider the following: https://www.researchgate.net/publication/262971440_Practical_Interpretation_of_Hypothesis_Tests_-_letter_to_the_editor_-_TAS
Michal -
You note that "In an upcoming issue of Nature Human Behaviour, Daniel Benjamin and colleagues propose a new p-value threshold of 0.005 instead of our current dogma of 0.05." However, this just trades one problem for another. If "big data" applications, made possible now by advanced computing power, had been the norm from the beginning, then many people would have used 0.005 all along. But the mistake is assuming "one size fits all." 0.05 should never have been considered a "standard," and 0.005 should not be considered a "standard" either. The very idea of a "standard" here is wrong, as I indicated in an earlier response. Isolated p-values are basically meaningless. It is much better to focus on relative standard errors or sensitivity analyses.
Cheers - Jim
I agree with the answers given. Several papers address this issue. A classical one is attached:
Think of the p-value as giving the probability that the results could arise by chance. Then even if we reject the null hypothesis that our results could have arisen by chance, we still know nothing about how the results arose.
One possibility is that our alternative hypothesis is correct.
But another possibility is that we ran a study that failed to take into account one or more other, systematic sources of variability.
Tinker as much as you like with the p-value and it's still not going to tell you whether the results are reproducible. Switching to an estimation framework is not going to tell you whether your results are reproducible.
To do that, we need to know how the study was conducted: the design details and the steps taken to minimize or eradicate potential systematic sources of variability. Without that detail, we can't reproduce a result. With that detail we may be able to reproduce a result and yet conclude that the result arises through some systematic bias.
Those who pay lip service to design issues - control, bias, randomization, blocking - are going to get burnt.
" Think of the p-value as giving the probability that the results could arise by chance."
Dennis, that wording troubles me. It sounds (to me) as if you are saying p = p(H0 is true). I wonder if you really mean something more like this: p = the probability of getting a result at least as extreme as the observed result given that H0 is true. And if H0 is true, any departure from what one would expect under H0 is purely due to chance.
HTH.
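P.S. If it helps, that definition can be checked directly by simulation. This is just a sketch with a hypothetical sample size and observed t-statistic, not anyone's real data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, observed_t = 30, 2.1                          # hypothetical sample size and observed t-statistic

# analytic two-sided p-value: P(|T| >= 2.1) under H0, with T ~ t(n-1)
p_analytic = 2 * stats.t.sf(observed_t, df=n - 1)

# simulate many samples with H0 true (mean 0) and count t-statistics at least as extreme
sims = rng.normal(loc=0.0, scale=1.0, size=(100_000, n))
t_stats = sims.mean(axis=1) / (sims.std(axis=1, ddof=1) / np.sqrt(n))
p_simulated = np.mean(np.abs(t_stats) >= observed_t)

print(f"analytic p = {p_analytic:.4f}, simulated p = {p_simulated:.4f}")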
Instead of "due to chance" I would prefer "not explained by the model".
We too often think of quantitative variables and a normal distribution. This is blurring a clear understanding of what we are talking about.
Consider a binomial response, which is represented by a random variable X that can take the values 0 or 1. With "random variable" we mean to say that we don't know the value X actually has (or will have, when we are going to observe it), and we instead give only probabilities for its possible values. This has nothing to do with believing that the value X takes is arbitrary or non-deterministic or in principle unpredictable or unknowable. It is just that we don't know it for sure; we may well have more or less good reasons to believe that it's more likely to take 0 than 1 (or the other way around), and this is what we quantify with a probability distribution.
The words "probability" and "chance" are synonyms (some say that "probability" is just a mathematically well-defined quantity, whereas "chance" is used in a less quantitative fashion, but eventually they both mean the same thing).
Here, providing p = Pr(X=1) is sufficient to fully specify the whole distribution, because Pr(X=0) = 1-p, and there are no more possible values X can take. In a hypothesis test, we can test a sample of observed data under a given value of p. Using p = p0 would allow us to see if the data are sufficient to conclude that p < p0 (or p > p0). The test is based on a comparison of how well our model with p fixed at p0 predicts the data relative to the best possible explanation of the data by our model (not fixing p).
Having p fixed or not doesn't make the data more or less "due to chance". The only thing we can see is whether the explanation of the observed data by the model is considerably impaired by restricting p to be fixed at p0. Note that we may express p as a function of some predictor values and test the coefficients of that model, e.g. to see if our data are sufficient to conclude that a predictor increases or decreases p (the coefficient in the model is usually a log odds score). Further note that the impact of restricting a coefficient depends on the entire model. Changing the model will change the impacts, because we "put our knowledge in a different frame", so to say. That doesn't make the data more or less "due to chance". Chance (probability) is a measure we assign to express what we know about a value in some given context.
We can now transfer this to a random variable with a normal probability distribution. This variable can take any real value. Say we know σ² but have no idea about µ. In a hypothesis test, we can test a sample of observed data under a given value of µ. Using µ = µ0 would allow us to see if the data are sufficient to conclude that µ < µ0 (or µ > µ0). The test is based on a comparison of how well our model with µ fixed at µ0 predicts the data relative to the best possible explanation of the data by our model (not fixing µ). We can express µ as a function of predictors and test coefficients. And still, the question is always whether the restriction impairs the ability of our model to explain the observed data. At no point does the model distinguish or measure whether or how much the data are "due to chance".
This post got longer than intended, and I am still not sure if someone will understand what I was trying to say. I give it a probability of 0.2 ;)
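P.S. To make the binomial part concrete, here is a small sketch (invented counts) of the comparison described above: the fit of the model with p fixed at p0 versus the best possible fit, summarized as a likelihood-ratio statistic; the exact binomial test is shown only for comparison.

import numpy as np
from scipy import stats

k, n, p0 = 38, 100, 0.5                  # invented data: 38 "successes" out of 100; H0: p = 0.5
p_hat = k / n                            # unrestricted maximum-likelihood value

def loglik(p):
    return k * np.log(p) + (n - k) * np.log(1 - p)

lr_stat = 2 * (loglik(p_hat) - loglik(p0))   # how much fixing p at p0 impairs the explanation
p_value = stats.chi2.sf(lr_stat, df=1)
print(f"likelihood-ratio statistic = {lr_stat:.2f}, p = {p_value:.4f}")

# the exact binomial test of the same restriction, for comparison
print(f"exact binomial test p = {stats.binomtest(k, n, p0).pvalue:.4f}")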
It is necessary to complete Jochen's nice response with a remark: the methods of statistics exist for mutually communicating the results of our models in uniformly understandable terms. Therefore, there is also no need to determine strictly which p-level is the best one for rejecting "proper" hypotheses; this, too, is simply a convention. One dependence of this parameter should, however, be stressed: when building a bridge, preparing a medicine that could be dangerous, or starting a big enterprise, one should take care about the consequences in case "something goes wrong". Thus, in risky situations, rejection should be given a greater chance than for problems with less negative consequences. Obviously, this can also be modelled in some way, e.g. by randomizing the probability of rejecting and/or incorporating this problem into the project's global research costs.
Regards, Joachim
Are p-values valid? Why look for a new value? Would they be the product of a deformation of the original form in which they were created? Do p-values have very precise indications for their use? Answers to these questions and many more can be found in:
https://www.researchgate.net/publication/260062295_Null_hypothesis_significance_tests_A_mix-up_of_two_different_theories_The_basis_for_widespread_confusion_and_numerous_misinterpretations
Well, as a late arrival at this symposium, here is a dialectic response to the query "How will it (Benjamin's revision of the p-value threshold to 0.005) change our research?"
Thesis: A new p-value threshold of 0.005 instead of our current dogma of 0.05 will reduce false positive results in the scientific community.
Antithesis 1. It will increase false negative results and possibly diminish the meaning of effect size.
Antithesis 2. It rests on two unexamined premises: (2a) Bayesian posterior probability is a measure of strength of evidence; (2b) sample size is cheap.
Synthesis 1. The inevitable risk of increased false negatives falls on the experimenter, but not on the experimental community, via published meta-analyses. The growing use of effect-size reporting is to be strongly encouraged, regardless of Benjamin et al.
Synthesis 2a. Evidence (once we have the data) is not the same as a probability calculated from the likelihood ratio after we have the data. Royall (1996) makes the case for likelihood as the measure of evidence.
Synthesis 2b. Incremental costs of an increase in sample size depend on the field. It may well be cheap in psychology and the social sciences (note the affiliations of many of the co-authors). In some fields, incremental costs preclude research at a 0.005 threshold.
Comment on change in practice: a shift to 0.005 is more useful to readers than to researchers. This is possible as long as authors continue the now-established practice of reporting exact p-values, not just thresholds as required by Neyman-Pearson dogma.
~David S
Shouldn't we just silence p-values to death and wait for the last proponents (who build their careers on the paradigm) to retire?
Young students and postdocs are still trained in frequentist statistics, mostly even without being told that this is "just" a philosophy and that there is (at least) a different philosophy around. And these researchers have to publish in journals where reviewers want to have their beloved p-values (also because data are undoubtedly much more difficult to interpret otherwise -- actually, it would become necessary to really interpret data rather than just looking at a p-value that seemingly tells us everything important). I thus doubt that the flow of proponents building their careers on p-values will ever dry up.
The problem is not the p-values. The problem is us. The p-value (and confidence intervals) are viable tools to help us judge if the data we have can give us sufficient confidence in claiming the direction or sign of an effect, which can be really helpful in complex situations/models. It's us forgetting that this is about confidence, not about probability, and that it is about data, not about estimates or parameters. We use it wrongly, and we interpret it wrongly.
Going to estimation is a rather huge step - not at first sight, but at second. A fact that is again easily overlooked by many. A "least squares estimate" or, more generally, a "maximum likelihood estimate" is actually by itself not an estimate of a parameter (an effect). It is just the value of the parameter under which the observed data are most likely, and the confidence interval is a range of values under which the observed data are not too unlikely. As these are calculated only from the data (based on a few assumptions, as always), they are sample statistics (just like a p-value is a sample statistic), and no magic enters the meaning of such a statistic. It's still data; not a parameter. If we really want to say something about the parameter, we have no other choice but to model parameters as random variables so we can assign probability distributions over the possible values of the parameters. And that is just where we flip the philosophy from frequentist to Bayesian. We must then admit that there exists no objectivity (which I think is fine -- or better: correct -- but which I think won't be acceptable to the majority). This is what I mean by a huge step.
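As a small illustration of that flip (my own toy example, with a deliberately uninformative Beta(1,1) prior and invented counts): once the parameter is given a distribution, interval statements become direct probability statements about the parameter.

from scipy import stats

k, n = 38, 100                                   # invented data: 38 successes in 100 trials
posterior = stats.beta(1 + k, 1 + (n - k))       # Beta(1,1) prior -> Beta(1+k, 1+n-k) posterior

lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval for the proportion: ({lo:.3f}, {hi:.3f})")
print(f"Pr(p < 0.5 | data) = {posterior.cdf(0.5):.3f}")   # a direct probability statement about the parameter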
Just so it's clear, here's the American Statistical Association statement on p-values.
It has more to do with how to use them rather than whether to use them. It does warn against specific thresholds (0.05, 0.005, whatever).
1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
The problem is also with a lawyer or politician who might have had a science class as general education and must now deal with a complex social or biological issue. Just tell me the answer in 250 words or less, not some endless mumbo jumbo about uncertainty. Tell me the truth so I can move on. A p-value lets people easily fall into this trap.
If anyone is still looking at this, please read the paper at the following link:
https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.Wym3BCCQyCg
It is important to understand, not trust, the "MAGIC" of the numbers.
Best, David Booth
Another problem is that most people just don't understand the weird sampling logic and twisted asymptotic math of frequentist test statistics. In our master track in Psychology I give statistics workshops in which I introduce Bayesian parameter estimation. I typically start with: "Whatever you have learned in your statistics classes, don't let that come into your way!" The frequent response of students is: "Now, I get it!".
And here is my standard footnote when I use Bayesian credibility limits in a scientific report: "Credibility intervals are what many researchers falsely assume the confidence interval is [1]."
[1] Hoekstra R, Morey RD, Rouder JN, Wagenmakers E-J. Robust misinterpretation of confidence intervals. Psychon Bull Rev 2014:1–7. doi:10.3758/s13423-013-0572-3.
In conclusion, if we want do-it-yourself statistical analysis in the future, it has to be as intuitive to use, flexible and safe as a modern car. That's obviously not true for frequentist NHST and p-values.
SELECTIVE APPLICATION
An error level of 5% is practical, especially in social science research. An error level of 0.5% may be possible and useful in certain fields outside of social science, e.g. the natural sciences, such as physics or genetics, where high precision is needed.
This thread started with: do we need a new threshold for Type I error? R.A. Fisher proposed 4 "definite levels of significance"
0.1, 0.5, 0.2, and 0.1, leaving the reader to produce the estimate of Type I error and make their own judgement. As a researcher I like Fisher's approach for statistics in a research setting.
"We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of *other* free minds to utilize them in making *their own* decisions." J.Roy.Stat.Soc.17:69-78.
That being said, as a consultant I have to ask "What is the cost of Type I error?"
Were I a physician, I would need to ask the same question, for me and for the patient. If Type I error has a clear cost that I can articulate, let's limit it to acceptable levels, such as 5%. If we can't pin down the cost, let's be flexible, in the manner of Fisher.
Suum cuique,
David S
@ David,
You wrote "0.1, 0.5, 0.2, and 0.1" - did you mean to repeat the 0.1? Significance at 0.5 means that there is a 50-50 chance of finding an observation as great as or greater than the one observed. I suspect fingers typing too fast.
I might agree if we were all more statistically fluent. However, I become uncomfortable when p-values of 0.50 and 0.60 are discussed as showing a significant effect (as in a recent report that I was shown). Based on these values hundreds of thousands of dollars were spent continuing this research. Fortunately, this has since stopped.
Even if there is no clear cost, I think we should think very hard about what it means to relax the 0.05 significance level, arbitrary as it is. So as a scientist, if any p-value can be significant then I can get more papers out by reducing replication to 2 replicates per treatment. Everything is significant and I can finally tell the story in the data that I know is there.
@Timothy A Ebert
Fingers too fast indeed.
Fisher offered flexibility at four levels 0.1, 0.05, 0.02, 0.01. Neyman-Pearson gave us rigidity at a single preset criterion (usually 5%). As a scientist I concur, with you and Fisher, at drawing no conclusion, at p > 0.05.
As an educator in an academic setting, I find that students take well to likelihood ratios (e.g., the regression model is 600 times more likely than no relation). Students chafe at, and routinely misinterpret, the logic of rejecting or not rejecting a null hypothesis. Maybe someday we can get beyond the fixed criterion for Type I error, except where it matters, when there is a clear cost.
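In case it is useful for teaching, here is a rough sketch of how such a likelihood ratio can be computed for a simple regression versus a "no relation" (intercept-only) model. The data below are invented, so the resulting ratio depends entirely on them.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=25)
y = 2.0 + 0.8 * x + rng.normal(scale=2.0, size=25)       # invented data with a real relation

full = sm.OLS(y, sm.add_constant(x)).fit()               # intercept + slope
null = sm.OLS(y, np.ones_like(y)).fit()                  # intercept only: "no relation"

lr = np.exp(full.llf - null.llf)                         # how many times more likely the data are under the full model
print(f"likelihood ratio (regression vs. no relation) ~ {lr:.3g}")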
Thanks for the correction, David S
@David C Schneider,
It would be great if you'd give a reference for your statements:
"Fisher offered flexibility at four levels 0.1, 0.05, 0.02, 0.01. Neyman-Pearson gave us rigidity at a single preset criterion (usually 5%). As a scientist I concur, with you and Fisher, at drawing no conclusion, at p > 0.05."
I know that Fisher wrote in the 13th edition of his 1925 classic book "Statistical Methods for Research Workers", page 44: “The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not.” This limit was quite obviously chosen because he was allowed to show only an excerpt of the tables of critical t-values (the copyright for the full table was with Neyman). In his own research papers Fisher never adhered to any particular cut-off. Sometimes he considered p around 0.2 as significant, sometimes not, depending on the context of the experiment.
And to my knowledge, Neyman never promoted any particular cut-off (actually, Neyman did not consider rejecting H0 at any "level of significance", as his test is not about significance and rejecting H0; it is about the confidence of accepting one of two alternatives). You may be confusing Neyman's type-I error rate with the probability of a false rejection of H0 under H0 (despite what textbooks often teach wrongly, these are very different things!).
So a bit of clarification would be nice here, how you get to your statements.
Jochen, I believe you were directing your question to David C Schneider, rather than to me.
However, it may be helpful to point out a recent set of videos on YouTube, that relate to the discussion here:
https://youtu.be/VRF-UwrepAs
What do we learn about significance tests from the replication crisis? - Significance tests (part 1)
https://youtu.be/R7dQo9bW1DM
Should we redefine statistical significance? - Significance tests (part 2)
https://youtu.be/gtQlvR-RUPg
The use and abuse of Significance Testing ... - Significance tests (part 3)
https://youtu.be/txLj_P9UlCQ
Sir David Cox - In gentle praise of Significance Tests - Significance tests (part 4)
These are from the Royal Statistical Society and dated 5 September 2018 (published 23 October 2018), so reasonably up-to-date.
David A. Jones, yes, I selected the wrong "David" from he list. I corrected that. Sorry for the confusion!
I read a lot of quantitative education studies where the data are drawn from PISA scores (a worldwide multi-subject test of 15-year-olds). I know 15-year-olds and I know stats, and for authors to make bold statements about teachers, teaching, and learning based on their low level of significance is a stretch, especially with no on-site observations to back up their "significant" findings. I suggest changing the accepted level of significance and requiring mixed methods for studies used in policy decisions in education. Yikes, we are still fighting against the "modernity project" - except now it's against neo-liberal statisticians!
Jochen Wilhelm
It would be great if you'd give a reference for your statements:
"Fisher offered flexibility at four levels 0.1, 0.05, 0.02, 0.01.
The reference is page 80 of Statistical Methods for Research Workers.
"If p is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05, and consider that higher values of [chisquare] indicate a real discrepancy."
Higher values could be taken as 0.01, 0.001, etc.
See also p 25 of Design of Experiments. 8th ed 1966
"Convenient as it is to note that a hypothesis is contradicted at some familiar level of significance such as 5% or 2% or 1% we do not, in Inductive Inference, ever need to lose sight of the exact strength which the evidence has in fact reached."
The quote comes from a long paragraph that is a comprehensive argument against Neyman-Pearson decision theoretic hypothesis testing, where a fixed criterion is set beforehand. The paragraph ends with a rejection of NPDTHT: "A good deal of confusion has certainly been caused by the attempt to formalise the exposition of tests of significance in a logical framework different from that for which they were in fact first developed."
I read that as Fisher's denial of responsibility for using a fixed criterion, pointing instead at what was intended when first developed by Fisher-- a way of judging strength of evidence (from the p-value), which Fisher considered a stronger form of inference than reporting the likelihood ratio.
Ziliak and McCloskey (2008) blame Fisher for "imposing the Rule of Two" - significant if greater than 2 standard deviations (1.96 to be exact). Saying that Fisher imposed the Rule of Two strikes me as blaming the victim, given Fisher's bitter fight against the rigid criteria that came from Neyman and Pearson.
Once I finish writing my comps I will attend to this - part of the problem is that I have my feet in two worlds. I am currently a narrative researcher but still do research in science on the side. I am currently digging into some work on "cognitive load" by Sweller, and the results of their experiments relate to this. I apologize for not addressing your well-conceived question, but I will get to it in time - it's a great question. Cheers, Pat
Part of the problem is that we want the right statistical method for a given analysis. All other methods are then wrong and therefore my results and conclusions are true. Often the right statistical method is the one that can be done in the shortest time and results in a single value that proves the point. Its truth is further validated by getting two or three others to accept it and having it finally appear in print.
The alternative problem. I analyzed the data in three ways that differ in their assumptions. I present proof within the limits of my data that the assumptions of these models are reasonable. I now have a 45 page manuscript that no one will read even if it does get published. The purpose of the paper is "lost" in considering the possibility that my conclusions are an artifact of the experimental design and analysis.
Setting a global desire for a p-value of 0.05 or 0.005 is simply dogma.
Statisticians generally eschew dogma when it comes to making sensible decisions about study design and p-values. Instead, the goal is to think scientifically, with knowledge of the biology, the consequences of building a poor model (biased results from failure to account for confounders), and the respective costs of type 1 and type 2 error.
I agree that the p-value chosen should reflect the relative costs of type 1 error (rejecting the null hypothesis when no difference exists) and type 2 error (failing to reject the null hypothesis when a difference does exist). Minimizing type 1 error increases the probability of type 2 error. Type 2 error is common when the power to detect a difference is low (i.e. a sample too small to detect a small difference). The first is a false positive - finding a difference when none exists - and the second is a false negative - e.g. failing to find a real difference between drug A and drug B when drug B really is the more effective drug.
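A hedged sketch of that trade-off with statsmodels (the effect size of 0.3 and the per-group sample sizes are assumptions chosen only for illustration): lowering alpha from 0.05 to 0.005, or shrinking the sample, inflates the type 2 error of a two-sample t-test.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.3                                        # assumed standardized difference (Cohen's d)
for alpha in (0.05, 0.005):
    for n_per_group in (50, 200, 800):
        power = analysis.power(effect_size=effect_size, nobs1=n_per_group,
                               alpha=alpha, ratio=1.0)
        print(f"alpha = {alpha:.3f}, n/group = {n_per_group:4d}: "
              f"power = {power:.2f}, type 2 error = {1 - power:.2f}")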
Really, though, computer programs will generally give you the exact p-value, and you can decide how important it is to your circumstances; but report the value that is generated so that the reader can judge the results for herself. In medicine we definitely want to detect a rare debilitating disease. We do not wish to make a type 2 error, and so we use a large sample size to offset the probability of failing to find the rare disease.
In model building, on the other hand, we want to be sure to include confounding variables (gender, age) even if the p-value is greater than 0.05 (say 0.10). This is to be sure to control for factors that might interact with the outcome of interest (coronary artery disease) even if the effect is small. Failing to include these just because they are not ≤ 0.05 could result in a biased odds ratio (say) that is artificially high or low.