I think that a somewhat subjective measure of significance in testing theory is highly advisable; personally, I prefer the Bayes factor as an alternative.
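As a minimal sketch of what such an alternative could look like (illustrative numbers, not anyone's actual data), one can approximate a Bayes factor for a one-sample location test with the BIC approximation (Wagenmakers, 2007), which corresponds to a unit-information prior on the alternative:

```python
import numpy as np

def bic_gaussian(sse, n, k):
    """BIC of a Gaussian model with residual sum of squares sse and k estimated parameters."""
    sigma2 = sse / n                                      # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # Gaussian log-likelihood at the MLE
    return -2 * loglik + k * np.log(n)

rng = np.random.default_rng(1)
x, mu0 = rng.normal(0.3, 1.0, size=40), 0.0               # illustrative sample and null mean

bic_null = bic_gaussian(np.sum((x - mu0) ** 2), len(x), k=1)       # mean fixed at mu0, variance estimated
bic_alt  = bic_gaussian(np.sum((x - x.mean()) ** 2), len(x), k=2)  # mean and variance estimated

bf10 = np.exp((bic_null - bic_alt) / 2)                   # approximate evidence for the alternative
print(f"approximate BF10 = {bf10:.2f}")
```

A BF10 well above 1 favours the alternative, near 1 the data are roughly uninformative, and well below 1 they favour the null - a graded statement of support rather than a reject/accept verdict.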
I had long thought a practical approach was needed, back when many statisticians were spending a great deal of effort theorizing about the null hypothesis. (Please see https://www.researchgate.net/publication/262971440_Practical_Interpretation_of_Hypothesis_Tests_-_letter_to_the_editor_-_TAS from 1987.) "Significance" has always been a misnomer. A lone p-value is virtually meaningless, and using one threshold, 0.05 or 0.01, without proper regard to standard deviation, sample size, or effect size makes no sense. Even the original argument, I think by Fisher, that a p-value threshold of 0.05 tells you whether or not to experiment further is lopsided: the type II error needs consideration.
Many use a p-value to make a decision. A "yes" or "no" response is desired, but generally what really should be looked into in decision making is a matter of degree. Estimation can help inform a good decision.
Consider heteroscedasticity. There are "tests" for heteroscedasticity, but heteroscedasticity naturally exists, and even when modeling circumstances or data quality issues may favor a homoscedastic approach, why not just estimate the coefficient of heteroscedasticity to see if that is so?
In many cases, a hypothesis test can be done well, if both types of error or some kind of sensitivity analysis is considered, but in many other cases one might do better with a confidence interval, or an estimated relative standard error, or a prediction interval.
The problem has been too much fascination with null hypotheses by theorists, and too much use of unsupported p-values by researchers.
It used to be obvious that one threshold does not fit all situations when one had a very small sample. Now that more data and computing power have been available for some time, it is also obvious for very large samples: a very large random sample will often make it clear that you need a much lower threshold.
One might say that there is a lot of groupthink in statistics and its application, and the p-value/'significance'/0.05 threshold has been a very substantial example of that. (I have trained myself to often say "substantial" rather than "significant," because the latter word has been so misused.)
Now, practical significance - in health care it is called clinical significance. I want to share a real-life example. In our graduating class, one of our classmates conducted research on post-cardiac-surgery patients' pain perceptions. It was a replication of a previous student's study on pain perception, in which patients had to rate their pain on a 1-10 scale, with '1' being the least pain and '10' the most severe. The patients were all male, and all of them rated their pain as 1 or 2. In the data analysis the results were not significant, suggesting that patients have no pain after cardiac surgery, but that was not true (it was investigated further to find the cause). The patients were receiving intravenous painkillers three times daily, every six hours, or as needed, and they did have pain, but they did not report it on the pain scale for cultural reasons: men, as being manly, must not express pain. Practically, to keep them vitally and hemodynamically stable after cardiac surgery, especially for the first 24-48 hours, intravenous painkillers were given so that they were pain free. The study was statistically insignificant but practically significant, in that pain relief was an important intervention after surgery; context, culture, and gender do play a role in the results we get, and practical significance captures all of that.
" Practical significance refers to the magnitude of the difference. This is also known as the effect size. Results are practically significant when the difference is large enough to be meaningful in real life.
Example: SAT-Math Scores
Research question: Are SAT-Math scores at one college greater than the known population mean of 500?
H0: μ = 500
Ha: μ > 500
Data are collected from a random sample of 1,200 students at that college. In that sample, x̄ = 506. The population standard deviation is known to be 100. A one-sample mean test was performed and the resulting p-value was 0.0188. Because p ≤ α, the null hypothesis should be rejected and these results are statistically significant. There is evidence that the population mean is greater than 500. However, the difference is not practically significant because the difference between an SAT-Math score of 500 and an SAT-Math score of 506 is very small. With a standard deviation of 100, this difference is only (506 − 500)/100 = 0.06 standard deviations" ( https://newonlinecourses.science.psu.edu/stat200/lesson/6/6.4 )
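The quoted figures are easy to reproduce from the summary statistics alone; a short check (using only the numbers given above) is:

```python
from math import sqrt
from scipy.stats import norm

n, xbar, mu0, sigma = 1200, 506, 500, 100
z = (xbar - mu0) / (sigma / sqrt(n))    # one-sample z statistic (sigma known)
p = norm.sf(z)                          # one-sided p-value for Ha: mu > 500
d = (xbar - mu0) / sigma                # effect size in standard-deviation units
print(f"z = {z:.3f}, p = {p:.4f}, effect size = {d:.2f} SD")
# z ~ 2.08, p ~ 0.0188, effect size = 0.06 SD: "significant", yet practically tiny
```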
This is a long-running debate, and one often ignored by many researchers. For practical reasons, people prefer to stick with the p-value "business as usual" methods. The problem is that in most universities only frequentist statistics is taught. I think it is fair to say that in my field, Marine Ecology, students and researchers are not very familiar with the statistical literature. My take on this is to use a combination of methods, including Bayesian and likelihood approaches. So instead of rejecting or accepting H0 (a black-or-white type of decision), it would be wise to look for the degree of support for H0.
It's interesting to see how in the last few days this discussion gained new life, with several people recommending it. When I started it a few months back, I was still in the aftermath of the ASA editorial, trying to understand how the so-called "normal" - not Gaussian, geek joke ;) - people, i.e. practitioners, would react to the plea to drop statistical significance altogether. The way I see it, there has been no overall reaction so far. I feel the field is not prepared to do so, because that involves a mentality shift. People need to embrace the notion that any decision taken under uncertainty might be wrong. Yet people want to decide black or white constantly and do not want to make any wrong decisions! That is just not possible, and so that notion needs to become prevalent before a paradigm shift occurs. But even then, the truth is that most people will continue to prefer a hard rule (meaningful vs. not meaningful, to avoid the word significant) over "everything is complex and you need to think carefully each time", because most statistical users are not full-time researchers, who can spend days pondering the best way to make some statistical inference, but ordinary people wanting to make decisions on practical issues and move on. I think the ASA editorial was nonetheless timely and useful, even if only as a first decisive step to begin that process. Now it is time, calmly but steadily, to make the right moves to take us where we should want to be: a place where people turn on their brains before turning on their computers, where they think before turning the handles of machines they might not fully understand. That is the only thing that matters!
See also this Nature article by Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories... https://www.nature.com/articles/d41586-019-00857-9
We need to understand the data and subject matter better, and make more informed decisions.
As Tiago noted above, we do have to make "...decision(s) taken under uncertainty...."
I worked at a statistical agency producing official statistics on a frequent basis, for energy establishments, other private organizations, and government, to use in making decisions. In general, "decision makers" did not want to hear about standard errors. They just wanted numbers. Yet in some way they incorporated that information into other considerations. People have to consider "uncertainty." They do it in some ways, but hesitate in others. A more thoughtful way of considering uncertainty in statistics is needed. Misused p-values have provided a way to try to bypass actual understanding of statistics, but now that more people realize this, we can hope for better basic understanding. It has been tempting to show the p-value, like magic, as if it will cover all needs. But it doesn't. A number seems like something solid, but you need to know what any number does, and does not, mean before using it. Often graphics are more meaningful than a given statistic (number).
"And now... what?"
We try to understand better what the data may be telling us, knowing that some results may be spurious, and need to be considered in relation to the subject matter.
Hypothesis tests can be helpful at times. But a p-value threshold established in a vacuum of other information and analyses, such as consideration of effect size, is misleading and/or meaningless.
"Statistical significance" is only a "top of iceberg": generally statistics is (at least in "my" organismal biology) frequently used as an element of the ritual, a "gown" to increase the appearance of seriousness [ „The danger of statistical (and general of mathematical) methods in ecology is that their application gives a stamp of extreme exactitude and reliability to conclusions even if derived from faulty, though sufficiently numerous, data” - B.P. Uvarov ], or simply a disguise to conceal the poverty of reliable evidence or deficient elaboration [ „if your experiment needs statistics, you ought to have done a better experiment” - Ernest Rutherford ]. Authors routinely apply (and editors, reviewers directly demand!) sophisticated statistical tests where very simple ones would perfectly suffice or where none is needed at all, resulting in the opposite of what has been intended: false or at least unjustified conclusions... Such formalistic attitude leads, naturally, to misunderstanding of the very meaning of statistics, e.g. to treating "significant" as synonym of "true" and "insignificant" as "false"!
In 1990 I attended a symposium and contributed a comment for the proceedings: The Future of Statistical Software: Proceedings of a Forum https://g.co/kgs/U9y7vb, The National Academies Press (US). It seemed to me that statistical software should do better than to appear to encourage one to run a "test" for 'significance,' take your (meaningless-when-taken-alone) p-value, and move on. I was hoping "future" software would do better. Maybe that is not a fair assessment. We should all do better.
I wrote about the broader philosophical issues some time ago, relating it to Popperian corroboration...
The null model is one special case of Popper's background knowledge, where evidence for a hypothesis gains corroboration only if it is "improbable" in light of this background knowledge. The p-value is analogous to being "improbable in light of this background knowledge", but note that this makes cut-offs like .05 for "significance" largely meaningless.
See also Mayo's old book, "Error and the Growth of Experimental Knowledge".
I recently transitioned from empirical lab-based science into bioinformatics, and I can report that the mathematicians are well ahead of everyone else on the issue. I have been overwhelmed by the number of ways that different mathematical frameworks may be implemented in complex data (and are not necessarily reliant on p-value thresholds to extract meaningful insights--though many do use p-value inference for familiarity). Classic statistics are just one of many tools that may be applied to modern data problems--especially as computing becomes more powerful and easier. For example, I've learned that many questions may be addressed with alternative probability theory (Bayes), graph theory (networks), linear algebra (matrix math), model building for prediction/classification (machine learning), and/or information theory. Sometimes combinations of different analytical frameworks can help build or refute confidence in the trends/patterns. While the very advanced methods (deep machine learning) are necessary for some big data problems, the components/ideas that are built into these tools are often very powerful by themselves--and for smaller datasets! If anything, I'm hoping that these methods are adopted more broadly so that the language with which we ask questions and infer patterns from our data becomes more refined (beyond "different than expected from chance, not different").
An excellent topic for discussion. The statistical significance of the p-value has been a long-debated topic. However, academia, the research community, and the scientific publishers persist in giving more importance to the statistical significance than to the biological significance of the result of a statistical test. In my opinion the threshold of p less than or equal to 0.05 is somewhat arbitrary, and, as a great colleague once asked me, who can say why that value and not another? Currently there are several interesting articles on this subject that I offer for your consideration.
I think that statistics should not be disconnected from the natural history of the phenomena studied. Rather than stop using it, we should learn how to link it better to biological phenomena. That is, not only take the statistical view of whether something is significant or not, but also connect the statistics with the biological interpretation.
The purpose of statistics is to get a better understanding of the subject matter, whatever it may be. The statistics used must be appropriate to the subject matter, what is being studied, and what question or questions are being asked.
A lone p-value is incomplete for addressing practical issues, whatever the subject matter. A confidence interval, or a prediction interval for a random variable (the dependent variable in a regression), tells us that given these data (with this sample size), there is this much chance that a value is between these two limits. But a lone p-value does not tell you how much it is impacted by sample size. A primary key is standard error which depends on standard deviation and sample size. But the p-value took on a mystical meaning, as I noted above, as people used it inappropriately as a sign that a decision could be made on it alone.
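A small illustration of that point, with made-up numbers: the same observed mean difference gives wildly different p-values as the sample size grows, while the confidence interval at least shows how precise the estimate actually is.

```python
from math import sqrt
from scipy.stats import norm

diff, sigma = 2.0, 10.0                              # assumed mean difference and known SD
for n in (20, 200, 2000, 20000):
    se = sigma / sqrt(n)                             # standard error shrinks with n
    p = 2 * norm.sf(abs(diff) / se)                  # two-sided p-value
    lo, hi = diff - 1.96 * se, diff + 1.96 * se      # 95% confidence interval
    print(f"n = {n:6d}   p = {p:.4f}   95% CI = ({lo:.2f}, {hi:.2f})")
```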
Best to go back to the basics, and the best starting point for many cases where we use statistics to help understand the subject matter is to look at graphics, as I also noted earlier. "A picture is worth a thousand words." In graphics one might see several things worth quantifying. A graphic is often at least a good start on learning more about your subject matter. One should also remember that a given sample can be an anomaly, and spurious results can be reached. Thus, for example, there are times in regression where cross-validation may be important. As I noted above "...some results may be spurious, and need to be considered in relation to the subject matter."
Even a good hypothesis test where we look at effect size, not limiting ourselves to just considering the type I error, might be unhelpful if it does not tell us what to do next. A hypothesis test for heteroscedasticity in regression might convince us that the heteroscedasticity is substantial, but unless we estimate how much there is, say by estimating the coefficient of heteroscedasticity, and use that in regression weights, we cannot see how much it impacts our results. Thus, though hypothesis tests might be used for decision making, we may find that estimation and prediction, and the use of standard errors, are more practical for decision making.
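As a rough sketch of that idea (not the exact procedure referred to above; the data, variable names, and the simple estimator of gamma are all illustrative): fit OLS, estimate a coefficient of heteroscedasticity from the residuals, and refit with the corresponding weights to see how much the results of interest actually change.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 3 + 2 * x + rng.normal(0, 0.5 * x, 200)        # error SD grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Rough estimate of gamma: residual SD assumed roughly proportional to x**gamma,
# so regress log|residual| on log(x) and take the slope.
gamma = sm.OLS(np.log(np.abs(ols.resid) + 1e-12),
               sm.add_constant(np.log(x))).fit().params[1]

# Weights proportional to the inverse of the assumed error variance x**(2*gamma)
wls = sm.WLS(y, X, weights=1.0 / x ** (2 * gamma)).fit()
print(f"estimated gamma ~ {gamma:.2f}")
print(f"OLS slope: {ols.params[1]:.3f}   WLS slope: {wls.params[1]:.3f}")
```

Seen this way, the question is no longer whether a "test" flags heteroscedasticity, but how large gamma is and how much the weighted fit changes the estimates one cares about.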
[PS - Please note that I also considered the point about hypothesis testing for heteroscedasticity in regression in my first response, as well as the last one. - Pardon me for neglecting to note that.]
This is a serious problem that needs more discussion and explanation. Scientists who are just applying statistics to solve their problems (biologists, ecologists, paleontologists, etc.), without a strong statistical background, will grasp whatever thread lets them proceed, so the term "significant" seems quite sufficient from their point of view. We need a clear decision from experts!
For me (using resampling statistics), the p-value simply tells me how likely a difference at least as large as the one I observe would be if only random chance were at work. The real work of determining whether the observed difference will make a difference in real life belongs to the field of biology/ecology itself and/or to the various explanatory or predictive models that follow. In my experience, normally distributed data are almost non-existent in ecology.
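For what that looks like in practice, here is a small resampling sketch with simulated data (all numbers illustrative): the permutation p-value is simply the share of label shuffles that produce a difference at least as large as the one observed.

```python
import numpy as np

rng = np.random.default_rng(42)
group_a = rng.normal(10.0, 3.0, 30)                 # simulated samples
group_b = rng.normal(11.5, 3.0, 30)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

n_perm, extreme = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)                             # break any real group structure
    diff = pooled[30:].mean() - pooled[:30].mean()
    extreme += abs(diff) >= abs(observed)

print(f"observed difference = {observed:.2f}, permutation p = {extreme / n_perm:.4f}")
```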
Francisco Javier Urcádiz Cázares, now used where? Most papers still do the same things we used to see in the 60s.
Since the very beginning, p-values were supposed to be reported at face value, not compared to a threshold. So we are not saying anything new here.
What the ASA really means is that nowadays, unless you run a power study and a controlled experiment, p-values make little sense, as you can make them arbitrarily small.
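A quick simulation of that point (illustrative numbers only): with a tiny but nonzero true effect, the p-value can be driven as low as you like simply by increasing the sample size.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(7)
true_effect = 0.02                                  # a trivially small shift, in SD units
for n in (100, 10_000, 1_000_000):
    x = rng.normal(true_effect, 1.0, size=n)
    res = ttest_1samp(x, 0.0)                       # H0: mean = 0
    print(f"n = {n:>9}   p = {res.pvalue:.3g}")
```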
Along those lines, we are less interested in p-values than we might think, because they are not estimates of population parameters; they are just statements about the variance of the sampling distribution.
Also, most people automatically compute p-values for the nullified hypothesis regardless of their research question. As such, p-values should be replaced by p-value functions.
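A hedged sketch of what a p-value function looks like (assumed summary statistics, for illustration only): instead of a single p-value for the nil hypothesis, compute the p-value for every candidate value of the parameter and plot the curve, which doubles as a stack of confidence intervals at all levels.

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

xbar, sigma, n = 5.4, 2.0, 50                        # assumed sample summaries
se = sigma / np.sqrt(n)

mu_grid = np.linspace(xbar - 4 * se, xbar + 4 * se, 400)
pvals = 2 * norm.sf(np.abs(xbar - mu_grid) / se)     # two-sided p for each hypothesized mean

plt.plot(mu_grid, pvals)
plt.axhline(0.05, linestyle="--")                    # the conventional cut-off, for reference only
plt.xlabel("hypothesized mean")
plt.ylabel("p-value")
plt.title("p-value (confidence) function")
plt.show()
```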
Article How feasible is it to abandon statistical significance? A re...
The value of a result must be based on the interpretation of the spectrum of values compatible with the data, as Amrhein et al. suggested. However, removing a term such as "statistical significance" is far from a solution to problems such as publication bias.
Even among supporters of abandoning statistical significance, full application of this recommendation does not seem feasible (at least in the short term). The arguments against the inappropriate use of statistical tests should promote more education among researchers and users of scientific evidence. We consider that the main problem does not rely on choosing a cutoff for the p-value, but on our difficulty recognizing the limitations of both statistics and rules.
Above, Fredi wrote that "We consider that the main problem does not rely on choosing a cutoff for the p-value, but on our difficulty recognizing the limitations of both statistics and rules."
Statistics are limited to specific bits of information, and often graphics will carry more useful overall information. "Rules," as in "rules of thumb" are often worse than useless in statistics. They can greatly oversimplify a situation, such as in cases of determining sample size when one needs to consider standard deviations and the often complex issues that the employed methodology entails. But "... choosing a cutoff for the p-value ..." is a big problem. Not only is a lone p-value an ambiguous statistic for which people often apply an arbitrary value, but the whole concept is not conducive to practical decision making.
A good example is the confusion people have over which hypothesis test to use to "test" for heteroscedasticity in regression. (See my comments last September 17.) A different test may give you a rather different p-value, even with the same sample size, and this relates to how to define each null hypothesis. And the main problem with all hypothesis tests is how to compare to and define an alternative hypothesis, for which the lone p-value is useless. In the case of a 'test' for heteroscedasticity, there is the classic "pass/fail" or "yes/no" nature of such an effort, when the real problem is one of degree. It isn't a matter of "Is there heteroscedasticity or isn't there?" but "How much is there?" and then, crucially, "What do you do about it?" (See the update regarding hypothesis testing that goes with the following project and other updates: https://www.researchgate.net/project/OLS-Regression-Should-Not-Be-a-Default-for-WLS-Regression.)
In hypothesis testing, one wants to know the effect size, but that just means estimation is better for decision making. Yes/no results are bound to be arbitrary without accompanying impact information. In the case of measuring the coefficient of heteroscedasticity for use in regression weights, you can then see the impact on your results of interest. The result of a somewhat arbitrary hypothesis test is then of no consequence.
I also agree with Michael A. Cretacci. There's nothing wrong with that, it is one way but not the only way, to view the world and characterize findings for policy implementation.
As I discussed in an earlier paper, the issues around understanding and appropriately applying statistics go much deeper, and the criterion debate is just the tip of the iceberg. The most important point is that statistics does not generate knowledge - it helps label and structure information in our possession. The rest stems from that. I propose a serious consultation on the foundations of statistics and probability, which would help avoid some of these flare-ups in the future.
I've been hearing this "end of the p-value" stuff since I did my Masters 11 years ago, and only now are editors and reviewers beginning to ask for effect sizes and confidence intervals along with the p-value. I'm guessing the p-value will be alive and kicking 20 years from now.
The problem is not with the p-value (or statistics in general) but with naive - rigid, schematic, dogmatic - interpretation of it! A statistical value, the result of a statistical evaluation, is not the final solution of a scientific problem but only one of the more or less useful "signals" possibly helpful in finding the solution and evaluating the chances of its being correct. The problem arises only when (unfortunately, rather frequently...), forgetting the probabilistic nature of statistics, the p (or any other statistical) value is treated as decisive proof (or disproof) of a scientific hypothesis.
Many candid persons, when confronted with the results of Probability, feel a strong sense of the uncertainty of the logical basis upon which it seems to rest. It is difficult to find an intelligible account of the meaning of ‘probability,’ or of how we are ever to determine the probability of any particular proposition; and yet treatises on the subject profess to arrive at complicated results of the greatest precision and the most profound practical importance (Keynes, 1920, p. 56).