The use of p-values in statistics to define significant differences between groups is common in research. My question concerns the limitations of research relying on the p-value for such discrimination.
Science should be quantitative wherever possible. The questions asked should not be "is there an effect?" but rather "what is a likely/credible effect size, given our current knowledge?"
Significance testing (or worse: "null-hypothesis significance testing", NHST) usually ignores the quantitative information in the data and reduces the data interpretation to a simple classification (yes/don't know, significant/non-significant). P-values may be marginally useful as an estimate of a "normalized statistical signal-to-noise ratio", and a low p-value is only an indication that the data are worth a further, more careful and quantitative analysis (considering the entire context, background knowledge, experimental details, sample size). Stating a "significant result" is not the end of the analysis; it is the beginning. Unfortunately, almost all researchers think that it is the end.
Inductive inference, which is made when sample evidence is used to generalize to the population of interest, is at best uncertain or probabilistic. The p-value is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed (equivalently, the smallest significance level at which the sample would fall into the critical region). The traditional limit for p is 0.05, below which the null hypothesis may be rejected. If p < 0.05, the test is said to be significant. Another traditional limit for the p-value is 0.01, below which the test is said to be highly significant. This is an attempt to introduce a measure of objectivity into an otherwise subjective procedure.
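To make the mechanics concrete, here is a minimal sketch in Python/SciPy of computing a p-value for a one-sample t-test; the data, the hypothesized mean of 5, and the seed are made-up illustrative assumptions:

```python
# Minimal sketch: a p-value for a one-sample t-test against H0: mean = 5.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.3, scale=1.0, size=30)   # hypothetical data

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Conventional reading: p < 0.05 -> "significant", p < 0.01 -> "highly
# significant". The cutoffs are conventions, not properties of the data.
```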
@Ette: "This is an attempt to introduce a measure of objectivity to an otherwise subjective procedure" - the procedure is objective, but all decisions (including the selection of the model and the tested hypothesis, the design of the experiment, and eventually the interpretation of the p-value) remain subjective.
Just because all people use the same decision criterion doesn't make the criterion objective. "If p < 0.05, the test is said to be significant" is nothing objective. It is a subjective statement, no matter how many people repeat it. The problem is that this rule (to base one's subjective decision upon) is sometimes sensible and sometimes not (it depends on the context, on the model, and on the sample size).
In medicine, I have come across submitted papers where there is a significant difference mathematically, as per the p-value, but the calculation has no biological meaning and cannot be justified.
You may want to refer to some papers on the subject, such as Sedgwick's short vignette in the BMJ, "What is significance?" (basically your "significance 101"), PMID 26116134.
Substantially more scientifically rigorous and in-depth is the statement by the American Statistical Association, "The ASA's Statement on p-Values: Context, Process, and Purpose". A PDF is attached - warning: while not extremely complex, it may not be casual reading for the statistically faint of heart.
FiveThirtyEight, a rather statistically competent but more "popular" site, also has a quite appropriate take on this - link below.
A questionable feature of p-value-based hypothesis testing is that it assumes the null hypothesis has no "thickness". If H0 is: Mean = 5.00, then any sample mean value other than exactly 5 could potentially appear significant. (And this gets quite easy with large sample sizes, etc.) Yet when we estimate a mean, we don't expect to pin it down to a single point value. And when we estimate a good sample size and power for a hypothesis test, we allow for a range of apparent difference from exact equality that we won't consider meaningful. I've actually found in simulations (some reported in a paper listed on this site) that if one allows for a "thicker", more realistic null hypothesis, then the p-value comes a bit closer to performing as advertised.
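This is not the simulation from the paper mentioned above, but a small illustrative sketch of the same point: against a point null (mean exactly 5), a trivially small true deviation becomes "significant" once n is large, whereas a "thicker" null handled with a TOST-style equivalence margin (the ±0.2 margin here is an arbitrary illustrative choice) does not flag it as meaningful:

```python
# Point null vs. a "thicker" null: a tiny true deviation from 5 with growing n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, margin = 5.05, 0.2          # tiny real deviation; arbitrary "thickness"

for n in (50, 500, 50_000):
    x = rng.normal(true_mean, 1.0, size=n)
    # Point-null test: H0 is "mean = 5 exactly"
    _, p_point = stats.ttest_1samp(x, 5.0)
    # TOST-style check: is the mean credibly inside [5 - margin, 5 + margin]?
    _, p_low = stats.ttest_1samp(x, 5.0 - margin, alternative="greater")
    _, p_high = stats.ttest_1samp(x, 5.0 + margin, alternative="less")
    p_equiv = max(p_low, p_high)        # small -> difference is practically negligible
    print(f"n={n:>6}: point-null p={p_point:.4f}, equivalence p={p_equiv:.4f}")
```

With large n the point-null p-value becomes tiny even though the deviation (0.05) is far smaller than anything we would consider meaningful, while the equivalence test correctly reports the mean as being within the stated margin.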
For decades, the p-value has been misused, in my opinion, because theory developed around null-hypothesis concerns without the proper practical use of hypothesis tests being conveyed to nonstatisticians. It seemed to me, back in the late 1970s, and I expect long before that, that misuse was likely rampant, while at the same time one could go to a lecture and hear the latest formulations on null hypotheses. I thought it a dangerous situation then, and I eventually published a letter to the editor of The American Statistician:
When there is a nonspecific alternative hypothesis, one is bound to obtain nebulous results. A clear type II error probability analysis is important. A p-value is incomplete - something like providing only one end of a confidence or prediction interval.
In my opinion, the p-value has long been both misused and greatly overused. Often a standard error is more practically informative. Jochen noted the difference, basically, between saying "yes we have this" or "no we don't" versus asking "to what degree do we have this?" It's like asking to what degree a model differs from reality, not whether it is "correct" or not. This idea carries over, for instance, to regression modelling: Is there heteroscedasticity or not? The question is really: How much is there? After all, OLS is just a special case of WLS, where the coefficient of heteroscedasticity is zero.
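As a minimal sketch of that "how much, not whether" framing, the snippet below (Python/statsmodels) estimates the degree of heteroscedasticity as an exponent gamma in SD(error) ~ x**gamma and then uses it as WLS weights; the variance model, the simulated numbers, and the log-residual regression are all illustrative assumptions, not a general recipe:

```python
# How much heteroscedasticity is there? Estimate a degree, don't just test yes/no.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=400)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3 * x)     # error SD grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Regress log|residual| on log x; the slope estimates gamma in SD(error) ~ x**gamma.
aux = sm.OLS(np.log(np.abs(ols.resid)), sm.add_constant(np.log(x))).fit()
gamma_hat = aux.params[1]
print(f"estimated heteroscedasticity exponent gamma ~ {gamma_hat:.2f}")

# Use the estimated degree as WLS weights (1/variance). gamma = 0 would give
# equal weights, i.e. plain OLS as the special case.
wls = sm.WLS(y, X, weights=1.0 / x ** (2 * gamma_hat)).fit()
print("OLS slope:", round(ols.params[1], 3), " WLS slope:", round(wls.params[1], 3))
```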
So the biggest limitation of p-values is that a p-value at best is not very informative, and at worst, it is often misunderstood and misused, as your attachment indicates.
Cheers - Jim
Article Practical Interpretation of Hypothesis Tests - letter to the...
P-values have become ubiquitous, but epidemiologists have become increasingly aware of the limitations and abuses of p-values, and while evidence-based decision making is important in public health and in medicine, decisions are rarely made based on the finding of a single study.
Some might infer from statements like this that Fisher and his disciples advocated decision making based on a single statistically significant result. But here is what Fisher actually said (emphasis added):
If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point), or one in a hundred (the 1 per cent. point). Personally, the writer prefers to set a low standard of significance at the 5 per cent. point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.
I am not sure if you think that the bold sentence implies that Fisher thought of actual replications. I think this is merely a probabilistic statement. It is known that Fisher was quite a "frequentist", seeing a tight link between frequencies and probability. Therefore, I think, he meant that Pr(low p-value) should be high, which is the case for an experiment with a reasonably high "power". - Well, "power" is a bad word to use here, as this is a concept of Neymanian and not Fisherian test theory, but I am lacking a better alternative to express what I mean. Fisher does not have the concept of an alternative hypothesis. There is only the single point null hypothesis, and it is only a matter of sample size to reject this hypothesis (in any practical scenario). A "properly designed experiment" should therefore have a sample size large enough to expect the rejection with reasonably high probability. And if the null hypothesis was rejected, then one can start to interpret at least the direction of the effect (or the size, in the case of F- or Chi²-tests). What do you think?
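On that reading (which is closer to a Neyman-Pearson power calculation than to anything Fisher himself formalized), a small simulation sketch makes the point that "rarely fails to give this level of significance" is mainly a matter of sample size for a given real effect; the effect size, group sizes, and seed below are illustrative assumptions:

```python
# How often does a repetition of the experiment yield p < 0.05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
effect, n_sims = 0.4, 2000            # assumed true standardized mean difference

for n in (10, 30, 100):
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    print(f"n = {n:>3} per group: Pr(p < 0.05) ~ {hits / n_sims:.2f}")
```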
"...the writer [Fisher] prefers to set a low standard of significance at the 5 per cent point..." because they did not have "big data" then as an example of where that thinking, considering only one type of error, will take you. Taking that statement by Fisher too seriously has resulted in a great deal of modern misuse.
Hello Jochen. Yes, I did take Fisher's statement (which I bolded) to mean replications. I may be wrong to do so. Perhaps someone who has read a lot of his original works would be able to shed light on it.
I was also thinking about what Fisher meant by "well designed", and wondered if it included having an adequate sample size. But as you say, and for the reasons you give, he would not have used the term power.
I agree with Dr. Jochen Wilhelm. The statistical interpretation of a p-value depends on the type of study design and the area of research. There are fields, such as evolutionary genetics and natural selection, where p > 0.05 is taken to indicate that a population follows Hardy-Weinberg equilibrium, so a larger p-value is read as a more stable population.
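For concreteness, here is a minimal sketch of such a Hardy-Weinberg check as a chi-square goodness-of-fit test, where a high p-value is the "reassuring" outcome; the genotype counts are made up purely for illustration:

```python
# Chi-square goodness-of-fit test against Hardy-Weinberg expected genotype counts.
import numpy as np
from scipy import stats

obs = np.array([355, 490, 155])            # observed AA, Aa, aa counts (hypothetical)
n = obs.sum()
p = (2 * obs[0] + obs[1]) / (2 * n)        # estimated frequency of allele A
expected = n * np.array([p**2, 2 * p * (1 - p), (1 - p) ** 2])

# df = 3 genotype classes - 1 - 1 estimated allele frequency = 1
chi2 = ((obs - expected) ** 2 / expected).sum()
p_value = stats.chi2.sf(chi2, df=1)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}  (p > 0.05: no evidence of departure from HWE)")
```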