Please suggest whether statistical "divergence" is the best term to replace statistical "significance"; for more details please see here:
Preprint From Significance to Divergence: Guiding Statistical Interpr...
Your feedback and suggestions about the language to be used will be much appreciated, and if you have a better term please let us know. Also, if you do not like this term, please tell us why.
Many thanks in advance
Suhail Doi
From the preprint:
To determine if the study data were possible under sampling from a population with no true effect overall, a p value is computed. A smaller p value means more divergence of the data from this assumed population; exact p values are reported, but p ≤ 0.05 can be taken to imply that this assumed population devoid of an overall effect is 'statistically divergent' from the data. A statistically divergent result, however, only means that the data do not support the population assumption; it does not imply that the data are far enough from it to be clinically useful. To determine the latter, a 95% non-divergent interval (95% nDI) is presented that indicates a possible range of effects for populations which could be supported by the study data (i.e. fulfill the non-divergent p value criterion), and this range then helps decision making regarding the possible utility of the study findings (see Figure 3). This interval, however, only conveys the uncertainty due to random error and may underestimate the true range of population effects that could be supported by the study data.
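For readers who want to see the mechanics behind the quoted paragraph, here is a minimal sketch (the numbers are made up, not taken from the preprint) of how an exact p value and the interval the preprint would relabel a 95% nDI are computed for a simple two-group mean comparison:

```python
# Minimal sketch with illustrative data: exact p value against the
# "no true effect" model and the 95% interval for the mean difference
# (the interval the preprint would call a 95% non-divergent interval).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=12.0, scale=4.0, size=30)  # hypothetical treatment arm
group_b = rng.normal(loc=10.0, scale=4.0, size=30)  # hypothetical control arm

# Welch two-sample t-test: the p value measures divergence of the data
# from a population model with zero mean difference.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# 95% interval for the mean difference (random-error uncertainty only).
diff = group_a.mean() - group_b.mean()
va, vb = group_a.var(ddof=1) / len(group_a), group_b.var(ddof=1) / len(group_b)
se = np.sqrt(va + vb)
df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1))
lo, hi = diff - stats.t.ppf(0.975, df) * se, diff + stats.t.ppf(0.975, df) * se

print(f"difference = {diff:.2f}, p = {p_value:.3f}, 95% interval = ({lo:.2f}, {hi:.2f})")
```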
My 2 pennies:
The problem starts when people use "significant" rather than the correct full term "statistically significant" to indicate a sufficient incompatibility of the observed data with a hypothetical model (which is nested in a larger class of models).
If we just wrote that the difference between the sample estimate of this or that coefficient (e.g. a sample mean difference) and the hypothesized value is statistically significant, then things would be clear. Unfortunately, we often see sentences like "the mean was significantly higher in group A than in group B, p < 0.05, ...". This is shorter, yes, but not precise, and it causes confusion.
Science suffers from blurry language.
I don't see how exchanging one word for another would solve that problem.
Thanks Jochen, but that is what people do: when they mean statistically significant, with or without the term 'statistically', they consider this to mean important or useful or impactful and so on, because it is a 'given' that significance comes from statistics. This will never change unless we change the language used to something that cannot have a duality of meaning. When it comes to confidence intervals it gets even worse, and even well known and well respected methodologists slip up. See, for example, an erratum to a well known textbook that was clearly a slip of language:
https://statmodeling.stat.columbia.edu/2017/12/28/stupid-ass-statisticians-dont-know-goddam-confidence-interval/
The real problem is that divergence in biology and mathematical phylogenetics has an entirely different meaning. Lots of people would really be confused.
Really, there is nothing wrong with statistical significance. The term is restricted by the statistical modifier. Failure to include that modifier indicates a scientist who does not understand the scientific question that is being tested. With computers, anyone can produce a statistically significant result with a large enough sample size, even if they do not understand the scientific implications of the detected difference.
Patrice Showers Corneli, if there were nothing wrong with statistical significance then please explain the history below:
1900 - Pearson K [4]: Introduced the concept of the p value in his chi-squared test, utilizing the chi-squared distribution and notating it as capital P. Interpreted as the probability of observing a system of errors as extreme as or more extreme than what was observed, given that the null hypothesis is true.
1925 - Fisher RA [46]: Popularized p values, introduced the cutoff of 0.05 and introduced statistical significance. Highlighted that a p value of 0.05 corresponds to a 5% chance of observing the results if the null hypothesis were true. Emphasized deviations exceeding twice the standard deviation as formally significant.
1928 - Neyman J & Pearson [5,47]: Brought in the concepts of type I and type II errors, null and alternative hypotheses, and the process of hypothesis testing. Introduced the idea of rejecting the null hypothesis if the test statistic falls in the critical region and accepting the alternative hypothesis.
1935 - Jeffreys H [48]: Introduced the Bayes factor, which represents the likelihood ratio of data under the alternative hypothesis to data under the null hypothesis.
1937 - Neyman J [22]: First to describe confidence intervals as a long-run probability: if we were to repeat the experiment many times, 95 percent of the intervals thus created would contain the 'true' mean if the threshold for significance was 0.05.
1940 - Lindquist [49]: Introduced the concept of null hypothesis significance testing (NHST), which combines Fisher's and Neyman-Pearson's approaches.
1942 - Berkson J [27]: Challenged the reliance on NHST, stating that p values do not necessarily provide the answers researchers seek.
1943 - Fisher RA [50]: In response to Berkson's reliance on subjective judgment over rigorous statistical analysis in interpreting experimental data, emphasized the importance of objective tests and critiqued Berkson's dismissal of NHST.
1947, 1950 - Wald A [51,52]: Focused on decision theory and the formulation of decision functions.
1955 - Fisher RA [53]: Critiqued the reinterpretation of common tests of significance as decision functions, highlighting several fallacies associated with phrases like "repeated sampling from the same population", errors of the "second kind" and "inductive behavior".
1960 - Rozeboom WW [54]: Noted that the p value is not the exact probability of the outcome under the hypothesis, but rather a partial integral of the probability-density function of potential outcomes under that hypothesis. Highlighted why NHST should not be used as the cut-off point for "acceptance" or "rejection" of a hypothesis. Proposed realistic methods of data assessment.
1963 - Edwards W et al. [55]: Pointed out the tendency to overestimate evidence against the null hypothesis.
1966 - Bakan D [56]: Emphasized the importance of actively generating psychological hypotheses, conducting investigations, and making relevant inferences, rather than simply testing null hypotheses, especially in contexts where there are strong reasons to believe the null hypothesis is false from the outset.
1978 - Rothman KJ [28]: Argued for replacing statistical significance with confidence intervals, discouraging the use of p values.
1989 - Rosnow RL & Rosenthal R [57]: Discussed the issues related to dichotomizing p values and emphasized the need for nuanced interpretation.
1992 - Goodman [25], Senn [58]: Reported the low reproducibility rate of studies relying on p values.
Thank you for sharing that overview. Could you add the references? That would be super great!
The key problem is Lindquist [49]. If you remove that and all following direct and indirect references to it, many problems will disappear (or would not have appeared, so to speak).
Really, I am quite familiar with the history of significance testing among actual leaders in the statistical community. These are honest dialogues about the nuances of the theory, or slightly different perspectives from RA Fisher, Neyman, Pearson and Jeffreys. All of these were brilliant statisticians. The same cannot be said, for example, of Trafimow and Marks, who are psychologists, perhaps with some statistical knowledge, but with no indication that either is deeply trained in statistical inference theory - which is a hard, years-long slog, but very beautiful theory, which I am supposing they have not studied, as their arguments make no sense to me. Other references you kindly provide refer to the very common misunderstanding of a p-value as an arbitrary single point in probability space. It is arbitrary and a guideline that few people use anymore, because we can now calculate the probability and report that instead. Still others caution that statistical significance itself is hard to understand and that nuanced thinking is necessary - something rather lacking in some 'scientific' crank-it-out-from-the-laptop research.
It is indeed the "very common misunderstanding" you mention that this paper targets. Medical research commonly equates significant with presence of an effect, and decades of criticism have not changed that. What we need is a way out for clinicians; methodologists complaining that clinicians conduct crank-it-out-from-the-laptop research will not help. Two decades of criticism later, I think methodologists need to appreciate that applied research cannot happen without clinicians, and that they need clarity of terms to stop the misinterpretation.
Suhail A. Doi a rose by a different name is still a rose. The problem with statistical significance is that many accept it as a conclusion. Change its name and a rose still has thorns.
Statistics books start the problem by claiming that p-values are for hypothesis testing. A p-value cannot test (accept-reject) any scientific question. The p-value is a data quality evaluation. It is an initial indication that the collected data MIGHT indicate a tendency in a particular direction. An independent set of data collected under the same conditions MIGHT indicate a tendency in the opposite direction.
A p-value that is acceptable to the investigator's requirements is an opportunity to evaluate the data against the initial assumptions (stated and implied) and to determine what is needed to obtain a measurable result that will support a conclusion.
There are 2 very separate issues here. One is the meaning of a cut-point definition regarding the null hypothesis -- there are a great many problems with this, involving overpowered and underpowered studies, confidence (or lack of such) in the point estimate derived, effect size, reproducibility, etc (not to mention the even far larger and more important issues related to study bias that are routinely overlooked while focusing on statistical testing). This discussion begins to address some of these. The article in question, however, addresses a far easier issue, which is the fact that the word "significant" has an entirely different meaning when modified by the adjective "statistical" than it does in common English usage. (This is far from unique, as many words used in this context mean something different than they do in common English; "bias" is an example of this, as "study bias" -- the non-random introduction of error into the results of a study -- can be caused by "bias" in its usual sense, but can also have many other causes.) Since the vast majority of people who read medical journals (and even more of the people who read [and write] articles about "scientific breakthroughs" in places like the NY Times) do not understand the concept of "statistical significance," and more importantly do not understand that "significant" as so modified does not mean "important," insisting that adding the adjective "statistically" solves this problem completely misses the point. Even if one believes that NHST is useful, surely we should all recognize that the choice of the word "significant" as a descriptor in statistics has led to massive misunderstanding, and that of course it should be abandoned.
I was going to respond to Joseph L Alvarez, but your response, Jerome R Hoffman, is exactly on point and answers several questions posed in this thread. I agree fully with your thoughts, and indeed we are addressing the easier issue - language, which in our view has led to the state we are in. I just do not get why "significant" was chosen by Fisher to characterize a 'deviation' from an assumed model. Perhaps it was clear to him that significant meant 'significant deviation', not realizing that the term deviation was never to be used again (except in his book). However, we are also suggesting moving away from Neyman's choice of 'confidence' for the confidence interval and from the long-run probability based interpretation of the interval.
Suhail A. Doi the response to Jerome R Hoffman does not address the problem of teaching hypothesis testing or confidence intervals covering a range of compatibility. Any statistical result from a data set does not and cannot test a hypothesis. The hypothesis is supported or not supported by inductive logic with well-supported arguments. Re-naming, adding requirements, or precise definitions do not address the problem.
Joseph L Alvarez I think you are missing the point here. This is what Sander Greenland said in one of his papers: "But the core problems are of human psychology and social environment, one in which researchers apply traditional frameworks based on fallacious rationales and poor understanding [1, 3]. These problems have no mathematical or philosophical solution, and instead require attention to the unglamorous task of developing tools, interpretations and terminology more resistant to misstatement and abuse than what tradition has handed down." You are looking at it from the perspective of a methodologist, but clinical research is done to better people's health, and the clinician researcher will only change if he/she thinks that there is a reason to do so. As Sander Greenland aptly says in his paper: "Users tend to take extra leaps and shortcuts, hence we need to anticipate implications of terminology and interpretations to improve practice. In doing so, we find it remarkable that the P-value is once again at the center of the controversy [10], despite the fact that some journals strongly discouraged reporting P-values decades ago [11], and complaints about misinterpretation of statistical significance date back over a century [12,13,14]. Equally remarkable is the diversity of proposed solutions, ranging from modifications of conventional fixed-cutoff testing [15,16,17,18] to complete abandonment of traditional tests in favor of interval estimates [19,20,21] or testing based on Bayesian arguments [22,23,24,25,26]; no consensus appears in sight." Well, the time has come to reach a consensus on language.
Suhail A. Doi - that is my point. You are offering nothing that contributes to Greenland's statements. We cannot fix the problem by focusing only on terminology. You have to fix the process. Use p-values for data quality. If you teach statistics as scientific hypothesis testing instead of data evaluation and certification, you are asking for misuse. This is a bottom-up problem that cannot be fixed by bandaids on the top.
Joseph L Alvarez I agree with you that the process needs better explanation (and this paper attempts that too by switching to percentile interpretations) but that has been attempted ad libitum in the last 20 years. Greenland has suggested S values and terminology change to compatibility intervals because the process could not be fixed and my view is that it could not be fixed because the language came in the way. Others have suggested many other fixes as documented in the paper. This may be a bandaid as you say, but perhaps one that will finally heal the wound.
I would not object to the term "statistical divergence" in this context. The term "non-divergence interval" is a bit clunky but perhaps acceptable. I do question whether the term "confidence interval" really is in need of replacement.
I strongly agree with those who point out that the issues are much wider and deeper than terminology. In pessimistic moments I fear we just can't fix stupid and lazy and far too many people are not only both but also exercise much too great power and influence. But I suppose we should keep trying...
Fixing terminology and fixing the ways people reason from data are not mutually exclusive. The former is certainly not sufficient. I'm not even sure it's necessary. But it might help and seems unlikely to hurt.
Well stated, Scott. I only would not subscribe to your last statement. There is always the risk that changing terminology will add further confusion and fuzziness rather than clarify definitions. It also seems to me like a bypass to learning and understanding correct definitions; it opens the way to jump to yet another blurry understanding of concepts. I mean, look at recent developments in "machine learning" and "AI"... I cannot even count the papers being published about using such methods where it's obvious that the authors do not have any clue about the definition of these concepts (beyond thinking of them as some kind of black magic box spitting out truths - pretty much as many seem to think about hypothesis tests and p-values). I think it is natural that new fields burn in a large flash in the pan until someone connects loose ends and goes beyond. But I also see that fires may burn for centuries, lastingly hindering good research. There always is a downside, it seems.
Scott D Chasalow I personally think that these are good ideas from Fisher and Neyman, but what has gone wrong is the terminology. I can't speak for other fields, but in Medicine, significance and confidence are used in their literal English connotation, which is far from what the authors of these terms intended. Methodologists have indeed claimed in the last decade that there are deeper and wider issues at stake and, like some of the comments here, blame the end user for cognitive failures. I disagree completely with the latter because physicians are naturally decision makers and have a good grasp of inferential decision making. However, terminology has thrown this concept astray. There is a parallel in Medicine with dehydration and hypovolemia - these terms are used in their English language connotation that more or less equates them, with disastrous results. Finally, a decade of discussion about statistical education surrounding these concepts has produced no change. Perhaps the time has come to stop calling divergence its exact opposite - significance! Nobody can ever claim that a statistically divergent result means an important or useful result - the user will be forced to ask, "divergent from what?".
I agree, Scott D Chasalow, that the problem is laziness and disinterest. This does not mean everyone should be a statistician (I am certainly not: https://www.researchgate.net/post/Is_being_critical_in_science_and_the_use_of_p-values_useless). I no longer believe that p-values are useless, on the contrary - just not for the things we are taught, so I try to educate myself with various books and then summarize it (https://snwikaij.shinyapps.io/shiny/). I only believe this is possible if you want to, and I do not think more articles necessarily address the epistemic (education) and moral issues underlying and resulting from it.
That being said, the
Article I can see clearly now: Reinterpreting statistical significance
tried to address statistical "significance" as "statistical clarity". This, however, does not change the issue at all, because then the underlying philosophy and logic are still not grasped and we search instead for statistically "clear" results. I would subscribe to Fisher's (1925) looser terminology, that the p-value addresses "whether or not the data are in harmony with any suggested hypothesis". Of course this is still insufficient. Then the data would be significantly different from zero as in P(Data≥0). The problem with that definition is that it leaves out that the p-value is conditioned on the null, as P(Data≥0|H0), which is important. I was, for example, struggling with the notation P(|T|>|t|) in https://www.researchgate.net/post/Why_is_it_PTt
After reading some articles again (I think it was from D. Mayo), I found the notation P(|T|>|t| : H0), which is much clearer. Note that if you search Google Scholar for something like "brms ecology statistical significant" you can see that credibility intervals not overlapping 0 are equally used as "statistically significant". Thus what the Bayesian was trying to prevent in Article Mindless Statistics
is re-emerging as a new dogma. Hence, old wine in a new bottle (or something like that). If changing a complete framework does not address the issue, how would changing words change it? The "fact" is that we apply it everywhere and only publish the p...
To summarize: Wim Kaijser says that fixing terminology will not work without fixing the ways people reason from data, and Scott D Chasalow says that fixing terminology is not sufficient, but that this does not mean we cannot do both, i.e. they are not mutually exclusive, so we can change terminology and fix the way people reason from data. The point I am raising is that we should fix the way people reason from data, but to pragmatically achieve that we must fix the terminology; otherwise the second task is too onerous, which is probably why we are where we are. There have been no real attempts to change terminology in this clear and determined way, so this might work.
Well said. I would like to remind us all, however, that there is an even larger issue - it doesn't matter how well or poorly one reasons from data if the data itself is (as is so often the case) hopelessly corrupt, b/c of bias (in both its scientific and common English meanings).
Thank you for such an engaging discussion, Suhail A. Doi. I appreciate your insights and agree that the terminology and methodology could benefit from simultaneous revision. However, as I argue in this article (1), a key issue is that statistics tends to be viewed merely as a method, leading to an instrumental use of metrics like p-values. For example, one main issue with p-values alone is the general lack of understanding that their interpretation depends on how probability is defined in the first place.
Also, it would be wiser, but of course more complex, to recognize modern statistics as a discipline rooted in probability theory, incorporating concepts from measure theory, stochastic processes, and limit theorems. There are numerous published examples that illustrate how these concepts can be integrated into undergraduate education, such as teaching students about the various probability theories underlying key principles. About precise and accurate communication, Greenland aptly stated that many critiques of p-values arise not from inherent flaws but from misunderstandings caused by inadequate teaching or improper terminology. Thus, we must consider the types of knowledge claims we can make and aim to achieve when conveying "statistical significance."
When Francis Edgeworth introduced the concept in the late 19th century, it was meant to highlight results warranting closer examination, not necessarily to confirm their scientific importance. Many advocate for the broader use of terms like "clinical significance," which encompasses effect size measures. Interestingly, it has been suggested that "importance" might be yet another suitable term, in line with "clinical significance," when discussing effects. This perspective could be particularly valuable in research where it is crucial to consider the viewpoints of patients and their families, not just healthcare professionals, as their perspectives have real-life implications for outcomes such as medication adherence and treatment outcomes.
1) Holmberg C. Why we need to discuss statistical significance and p-values (again): Addressing the underlying issue of different probability interpretations and actionable recommendations. Nordic Journal of Nursing Research. 2024;44. doi:10.1177/20571585241253177
While proper terminology or language is important for understanding statistical concepts, I agree with those who point out that the issues with p-value-based hypothesis testing are much wider and deeper than terminology. I don't think that language is "a root cause of the misconceptions surrounding p values". The philosophy and methodology of p-value-based hypothesis testing are the root cause of the misconceptions surrounding p values. Siegfried (2010) wrote, "It's science's dirtiest secret: The 'scientific method' of testing hypotheses by statistical analysis stands on a flimsy foundation." Therefore, I agree with the many authors who suggest retiring or abandoning statistical significance and p-values and who call for statistics reform. Although some authors still defend NHST and p-values (e.g. Benjamini et al. 2021, Hand 2022, Lohse 2022), "A paradigm shift away from null hypothesis significance testing seems in progress (Berner and Amrhein 2022)." For my detailed views on statistics reform, please see Preprint Statistics reform: practitioner's perspective
Hening Huang thanks for sharing your thoughts and preprint. Let's take the example of the beverage flavor taste experiment in your paper. For A1 vs B, diff 3.81 (95%CI 0.19, 7.43), p=0.040. For A2 vs B, diff 8.7 (95%CI -0.34, 17.74), p=0.058. A few points:
1. The null effect model is weakly divergent from the data in both comparisons (p of 0.04 and 0.058 are practically the same), and therefore perhaps an alternative non-zero model generated the study data.
2. The sub-divergent intervals for the difference in both cases include estimates of "true effects" that could have generated the study data but that, at the lower limits (0.19 vs -0.34), represent very small or even negative differences in flavor score.
3. For A1 the point estimate is small but for A2 the point estimate is large, so the results of this experiment could be important.
Overall, there is potential for the new beverage, but we are uncertain about its flavor benefit, and a larger sample size is needed to see where the estimated mean difference will actually fall. The point estimates of the mean difference are too imprecise to make a judgment.
These seem like clear conclusions to me. Your thoughts on this would be appreciated
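For concreteness, here is a minimal sketch (my own back-calculation, not taken from either paper) that recovers the exact p values from the reported differences and 95% intervals, assuming a pooled two-sample t-test with n = 10 per group (df = 18); that assumption appears consistent with the reported numbers:

```python
# Back-calculate t statistics and exact p values from the reported mean
# differences and 95% CIs, assuming a pooled two-sample t-test with
# n = 10 per group (df = 18). The df value is an assumption on my part.
from scipy import stats

df = 18
t_crit = stats.t.ppf(0.975, df)          # ~2.101

for label, diff, lo, hi in [("A1 vs B", 3.81, 0.19, 7.43),
                            ("A2 vs B", 8.70, -0.34, 17.74)]:
    se = (hi - lo) / (2 * t_crit)        # standard error from the CI half-width
    t = diff / se
    p = 2 * stats.t.sf(abs(t), df)       # two-sided exact p value
    print(f"{label}: t = {t:.2f}, p = {p:.3f}")

# A1 vs B gives p close to 0.040 and A2 vs B gives p close to 0.058: the two
# p values are practically the same, and only the 0.05 threshold places them
# on opposite sides of the divergent / sub-divergent line.
```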
Update: We had a group meeting yesterday and the consensus was to replace non-divergent with sub-divergent, based on the discussion thus far. Any thoughts? The preprint file has been updated.
Preprint From Significance to Divergence: Guiding Statistical Interpr...
Suhail A. Doi I think this example of the beverage flavor taste experiment demonstrates the invalidity of p-value-based hypothesis testing. The two t-tests gave contradictory results. By looking at the data, our domain knowledge (common sense in this case) tells us immediately that the difference between the two flavorings is practically significant, and that, without any calculations, the new flavoring is better than the old flavoring. My descriptive statistics calculation and analysis support our common sense. Indeed, the sample sizes (n=10) are not large (but they are not practically small either). As a result, the relative standard uncertainty (RSU) of the effect size is large: RSU=45.84% for the comparison of Group A1 versus Group B and RSU=47.31% for the comparison of Group A2 versus Group B. However, due to the large effect size, the signal-to-noise ratio (SNR) is large: 4.76 and 4.47, respectively, indicating that the effect size estimate is reliable. In other words, the experimental data are credible. In addition, the exceedance probability (EP) is 0.741 and 0.725 respectively. Therefore, we should be in favor of the new flavoring over the old flavoring.
Thanks Hening Huang for your feedback - the contradiction based on p values only exists because of the significant / non-significant divide and the tendency to think of this as important / not important. As soon as we realize that these are just arbitrary thresholds for divergent / sub-divergent from an assumed model, we lose any meaning of 'importance', and we are forced to re-evaluate the exact p values and then realize they are more or less the same and that there are no contradictory results.
Dear Hening Huang !
I just read your paper "Statistics reform: practitioner's perspective", in which you propose that the t-test should be abandoned in favor of your other proposed indices (e.g. RES, SNR, SCI, ...). But I really cannot follow your conclusions. Please correct me, but my impression is that your whole argument is based on criticizing the dichotomous decision of p >= .05 is not significant vs. p < .05 is significant. In my opinion you conflate the method used to derive the p-value (i.e. the t-test in your example) and the p-value itself with the decision criterion significant vs. not significant. And you also seem to conflate statistical significance with practical relevance.
Your argument is that it is nonsensical for a smaller effect to be significant (p = .04), whereas the larger one is not significant (p = .057). But as already pointed out by others, if you interpret p-values continuously, they are practically equivalent. And all the indices you present tell the same story as the t-test, its p-value and the descriptive statistics of the raw data (including the simple effect size, i.e. the mean difference).
1. Both p-values are the same from a practical point of view, similar to your SNR and SCI values, which come to nearly the same results for both comparisons, A1 vs. B and A2 vs. B (interestingly enough, SNR and SCI are both SMALLER for A2 vs. B - there seems to be less information ALTHOUGH the ES is larger; why did you not discuss that? This is exactly what the p-value also shows: both contain nearly the same signal to noise, but a bit weaker for A2 vs. B, because of the uncertainty in A2). Similarly, your EP and NSP are also nearly identical and again descriptively smaller, just a tiny bit, just like the p-value, for A2 vs. B.
2. The standard deviation for A2 is MUCH larger than for A1 and B, i.e. we cannot be so sure about the effect, since the variability is larger, i.e. the estimate is much more uncertain (btw, there is a mistake in table 2, the sample standard deviation of A1 is 3.59, not 12.89, which is the variance), which we can also find in your SU index, there is much more uncertainty for the A2 vs. B comparison.
3. Besides the raw effect size ES, you also present the relative effect size RES, but again, this completely corresponds to the raw ES.
Taken together, all of the parameters and indices presented (t-value, p-value, ES, RES, SU, RSU, SNR, SCI, EP, NSP) tell the same story, so what do we need the extra ones for? The whole problem you want to solve, as it seems to me, is to get rid of the dichotomous decision, but I am pretty sure all of us would agree that we should not blindly use this criterion but should understand the data. I can get from the sample statistics and t-test alone the conclusions: a) the statistical decision is nearly the same for both comparisons, b) the effect of one is larger than the other, c) there is much more uncertainty for the larger effect. Nothing more, nothing less.
From a didactical point of view, I think it does not make sense to introduce 7 new indices, which tell the same story as the old ones, instead of teaching how to interpret the known ones correctly. It is a little bit like old wine in new skins.
BTW: just for fun, I used the presented beverage flavoring data and conducted a Bayesian parameter estimation approach. It told the same story without any p-values, but with plain and simple results: the difference for A1 vs B is smaller than for A2 vs B. Both credible intervals were close to zero (but contrary to the frequentist approach, neither included zero). There was more uncertainty for A2 vs B. So, "All Quiet on the Western Front" here as well.
Suhail A. Doi , above you wrote that "Update: We had a group meeting yesterday and the consensus was replace non-divergent with sub-divergent. based on discussion thus far."
Are you referring to this:
https://discourse.datamethods.org/t/time-for-a-change-in-language-statistical-divergence-and-the-95-non-divergent-interval-95-ndi/15415
Why don't you mention/crosslink this? Is it a secret?
Would you like to hide that because of the largely negative comments on your idea being posted there? For instance, Frank Harrell's comments (copied from the datamethods thread I linked above):
"This is extremely problematic in my opinion, You are getting into dichotomania, adding confusing new terminology (what’s wrong with compatibility?), not solving the root problem, and inventing new words (the proper word is tritomy; think of the Greek root of dichotomy = dicho (two) + tomy (cut).). I personally would think deeper about how I want to spend my time."
and
"Don’t borrow incorrect terms. I still think there are so many more valuable projects to work on. And when you transform a problematic measure the transformation process will not fix the underlying issue. Finally, I think it is a grand mistake to think that interpretations will be improved by this. Signing off for now, glad to have had a chance to express my opinions."
?
Jochen Wilhelm the discussion I refer to was face-to-face with my group, MCPHR (if you have read the preprint); the discussions on other forums are helpful but the final say is with the group. Datamethods is a public forum available through a Google search, so I am surprised you believe it is 'secret'. RG is also public and not 'secret'. There are no positive or negative comments in my view, as everyone has an opinion on this. Thank you for posting the link anyway, as I do hope it allows more posting of opinions on datamethods. If you have other blog suggestions as well, please do let us know here.
Hening Huang I think the points raised by Rainer Duesing are very pertinent to this discussion, and such a dichotomy needs to be pragmatic if it is to be utilized. p = 0.04 and p = 0.057 cannot be considered different even though they lie on either side of the 0.05 threshold suggested by Fisher, as these are not "importance" thresholds - and that is exactly my point. Use of divergent is clear - we have a threshold for beginning to consider that another model, and not the assumed model, was the source of the data, but sub-divergent now also does not mean "not important", and so language will guide use. I think the start of the problem was the choice of language - significance by Fisher (he actually meant whether a model deviates or diverges significantly from the data) and confidence by Neyman. If confusing language is used, confused researchers will ensue. This issue has also taken a toll on well recognized methodologists; for example, see this erratum to a textbook posted below, and I would also blame language here:
https://statmodeling.stat.columbia.edu/2017/12/28/stupid-ass-statisticians-dont-know-goddam-confidence-interval/
Rainer Duesing I think you seem to have misunderstood what I meant. As stated in the preprint,
“It is important to note that p-values are output of statistical methods, such as the two-sample t-test. Therefore, the problem with p-values is not just about p-values. The problem of p-values should be tracked back to the statistical method that generated the p-values.”
Also, "Clearly, the two-sample t-test does not answer our question about "whether treatment A is superior to treatment B, or vice versa?" Instead, it misleads us into considering whether treatment A is different from treatment B based on the "statistical significance" quantified by an arbitrary threshold p-value (e.g. 0.05). Therefore, the rationale behind the two-sample t-test is wrong. In practice, we know that treatment A is different from treatment B just by looking at the data or two group means. Therefore, there is no need to assume the null and alternative hypotheses. In other words, we do not need a "strawman" (the null hypothesis) and then try to disprove it; we can directly assess the practical significance of the difference between the two groups based on our domain knowledge. We can further perform a probabilistic analysis to determine the probability that treatment A is superior to treatment B (or vice versa)."
Therefore, I don't think that interpreting p-values continuously will solve the philosophical and methodological problems of t-tests. Moreover, the p-hacking problem (through N-chasing) cannot be solved unless t-tests are abandoned.
You stated “… all of the parameters and indices presented t-value, p-value, ES; RES; SU; RSU; SNR; SCI; EP; NSP tell the same story.” What do you mean “the same story?” These statistics are different and each has its own meaning. For example, t-value is a standardized effect size, while ES is the simple effect size (unstandardized effect size). There is a big difference between the standardized and unstandardized effect sizes. As stated in the preprint,
“Because the simple effect size has the original (physical) unit, it will nearly always be more meaningful than standardized effect size (Baguley 2009). Schäfer (2023) argued that in their unstandardized form, effect sizes are easy to calculate and to interpret. Standardized effect sizes, on the other hand, bear a high risk for misinterpretation. In real-world applications, our domain knowledge about a quantity of interest is related to the physical unit of that quantity. Therefore, it is easier for practitioners to assess the practical significance of effects using the original (physical) unit than the dimensionless unit of standardized effect sizes. Baguley (2009) discussed the advantages of using simple effect sizes over standardized effect sizes. He stated, “For most purposes simple (unstandardized) effect size is more robust and versatile than standardized effect size.””
Moreover, it is well known that p values are easily misinterpreted or misunderstood. Common misconceptions of p-values include "the p value measures the probability that the research hypothesis is true" and "the p value measures the probability that observed data are due to chance" (Chén et al. 2023). I think a p-value cannot be correctly interpreted without examining the specific process that produced it. For a two-tailed two-sample z-test, the resulting p-value is actually the (estimated) probability of compatibility between two group means (statistics). For a detailed discussion of the compatibility probability, please see Article Probability of Net Superiority for Comparing Two Groups or Group Means
I think if we really understood the meaning of p-values, we would not use t-tests in practice. But this will require a change in our mindset, which will be extremely difficult.
C'mon, Hening Huang, you write "Instead, it misleads us into considering whether treatment A is different from treatment B based on the "statistical significance" quantified by an arbitrary threshold p-value (e.g. 0.05). Therefore, the rationale behind the two-sample t-test is wrong."
You should know better.
That's not the rationale of the t-test, and it never has been. I agree that this is what many people think its rationale is, but that is another topic. If you also think that this is the rationale, then it does not seem worthwhile to discuss all that with you further.
These "philosophical and methodological problems of the t-tests" are really the problems of you, not of the t-tests. Again, you share these misconceptions with many people, but that does neither make the test invalid or useless nor does it make make your arguments and solutions correct.
Hening Huang I think you do not contradict my statement. You are still conflating the process of generating a p-value, p-values themselves, the decision criterion declaring something as "statistically significant", and seeing something as practically relevant.
I would totally agree with you that the mindless use of the arbitrary decision, as in your example, shows the problem with the application (if you do not consider the other available information), but not with the p-value or the t-test. I don't even know why you are so focused on t-tests, since p-values are used everywhere. Neither the t-test nor the p-value becomes wrong just because a decision criterion is blindly used.
Concerning my statement: “… all of the parameters and indices presented t-value, p-value, ES; RES; SU; RSU; SNR; SCI; EP; NSP tell the same story.”
Maybe my wording was a little bit unclear. I did not want to say that all parameters themselves contain the same information; that was misleading. What I meant, and what I thought I had shown in the paragraphs before, is that I can use a combination of different parameters to come to the same conclusion. For example, if I take the t-value, the p-value, the SDs and the simple ES, I can come to the same conclusion: that both comparisons contain nearly the same information/signal to noise (p-value), one comparison is much more uncertain (SDs), and the effect for one comparison is much larger than for the other (ES). You came to the same conclusion with your SNR, SCI, SU, EP and NSP parameters. Therefore, I just questioned whether they really tell another story and what we need them for, if we can get basically the same information from everything we already have. I did not question the correctness of each parameter itself!
By the way, I would also agree with you and Thom Baguley that in many applications raw, unstandardized effect sizes are more useful.
I think this discussion clearly highlights how quickly we accept statistically significant to mean statistically important, and that too without giving it much introspection. Had Fisher given this the right name - statistical divergence - none of this would have ensued. To give credit to Fisher, he actually put it in the context of a deviation from an assumed model being statistically significant or not. But what carried forward were just these two words, and we are where we are. The majority of physicians make this error of conflating significance with importance, and no amount of statistical education will redress this, because as soon as they embark on what is right, a dozen of their peers will bring them back to the mainstream view. As you have seen in a previous post, if Andrew Gelman has also slipped up then it cannot be just faulty reasoning from data... The language is grossly misleading. I think the language of divergence has a good chance of fixing this problem for good. I don't think Hening Huang should be blamed for a pervasive language problem in statistical inference.
I thought this discussion was about how words relate to p, not about how useful p, a test, or a summary statistic is, or how to interpret them correctly. The latter is a matter of teaching, reasoning and information - not about calling an apple a vegetable, but about "understanding" what a vegetable is. That being said, I do agree with Rainer.
Wim Kaijser well said. There is nothing added to solving the problems of misunderstanding statistics by those who think they have discovered a solution. What those discoverers want is to sell their discovery. It is not possible to explain any problems with the discovery or refute the discovery.
Is it even possible to have a rational discussion about teaching statistics? I would like to see a question about how to teach statistical understanding. What or how can statistics inform us?
Joseph L Alvarez you are right, therefore my two cents to the original topic.
I think it may be helpful to use a different language and terminology to make things clearer, but I doubt that this will solve the problem; it may help, though. I think in English it is similar to German: the term "significant"/"signifikant" already had and has a strong meaning in terms of content, i.e. it is often associated with "important"/"wichtig" (at least in German it is often used synonymously).
But when I think about your proposal, Suhail A. Doi, of "divergent", I do not feel very comfortable with it. You are right, it will trigger the question "divergent from what?", but it may give the impression that a p-value is a probability of divergence, i.e. the probability that the null hypothesis is not true. And it is clear that this is also not a correct interpretation. Additionally, the term "non-divergent interval" sounds a) very cumbersome and is b) a double negative; maybe this should be avoided.
Therefore, why not stick with "compatibility", or more specifically report a "compatibility interval", which has already been proposed by others in various publications? In my opinion this has two advantages: 1) it clearly describes which range of values is statistically compatible with your estimate, and hence 2) you do not need to check different hypotheses at all, because you can directly see whether the null value or any other one is compatible with your estimated value. Besides that, it fosters a positive terminology.
Wim Kaijser and Joseph L Alvarez I agree that the problem is not with p values but rather with making an idol of them (as Senn has said in his paper Article Two cheers for P-values?
). We need to bring the discussion back to using language to avoid misinterpretation of both p values and confidence intervals, and as we noted in this discussion, we have proposed changing significance to divergence so that threshold p values are termed statistically divergent / sub-divergent in lieu of significant / non-significant. In addition, the confidence interval will now be termed the sub-divergent interval, with an interpretation for the realized interval, not the long-run 'such intervals' interpretation.
Thanks Rainer Duesing for the input. I agree with the others that we need to get back on track on this topic of language. There is a problem with using 'compatibility' as suggested by Sander Greenland, because we wanted to unify the terms 'significance' and 'confidence' used by Fisher and Neyman for the test and interval respectively. Greenland's suggestion fixes the interval but cannot be used for the test, and I think it would be going a step too far to advocate dropping the test at this point. If we were to use the term 'compatible' for the test, what would we call this? A statistically compatible versus incompatible result? How about degree of compatibility - shall we say no, weak and strong compatibility with the model? There are several problems here. First, and most important, the p value does not measure 'compatibility'; rather, it measures divergence as p values get smaller. A large p value implies less divergence and therefore more compatibility, and therefore a statistically significant result will now be a statistically incompatible result, and we are back to the same misinterpretation as with significant, because incompatible can mean 'so different that both cannot happen, be correct, or be accepted at the same time'. This is not the case at all, as a statistically incompatible model can still be the data generating model, just as a statistically divergent model can still be the data generating model - which is more correct?
We feel that statistically divergent result is not misleading as it cannot be reinterpreted to make misuse of it. We felt that if we were to replace a misleading term with another, it should be the best possible (not another possibly problematic term).
I agree that non-divergent is misleading, since the opposite of not meeting the divergent threshold is 'less divergent', not 'non-divergent', and this was pointed out to us by Alessandro Rovetta on the other blog. We have since replaced the name non-divergent with sub-divergent in the preprint, meaning a p value that is 'under' the threshold for a statistically divergent result (see new version of preprint posted). The interval is also now the sub-divergent interval, meaning an interval with a range of model parameters that are statistically sub-divergent from the data - meaning they have not yet reached the divergent threshold, which is the threshold we have chosen to begin to think about alternatives to the assumed model as a source of the study data.
Suhail A. Doi as a non-native speaker, "sub-divergent" sounds as cumbersome as "non-divergent", but this is maybe only my problem.
But I cannot follow your argument about "compatible" and your critique of how to use the term ("...shall we say no, weak and strong compatibility with the model"). I could apply the same reasoning to "divergent": "shall we say no, weak or strong divergence?" So this is not a good argument in my opinion. The same is true for "statistically compatible versus incompatible result" as compared to "statistically sub-divergent versus divergent result". I can't see a clear advantage of either formulation over the other.
Admittedly, you are right that "compatibility" may imply evidence for the null hypothesis, what a p-value cannot give, but on the other hand "compatibility" and p-values already go in the same direction. The smaller the p-value, the smaller the compatibility of the estimate with the null hypothesis. This is not true for divergence. The smaller the p-value, the larger the divergence. For me (and when I think about students) it is much easier to evaluate, when term and parameter go in the same direction.
I think, both terms have some pros and cons, both do not seem to be optimal. Therefore, it reduces to personal flavor, which arguments are weighted as more important subjectively (that's my impression). I would like to hear arguments and attitudes by others.
Rainer Duesing I see that your reasoning may seem logical but we need to be clear about language as this is what we are attempting to fix. We felt that we need the best possible term that covers current use. To summarize what you just said:
a) No, weak or strong compatibility are similar to no, weak or strong divergence.
b) Statistically compatible is the same as statistically sub-divergent
c) Compatibility and p values go in the same direction while divergence and p values go in opposite directions
d) Compatibility interval and sub-divergent interval mean the same thing
My response to these four points is that language is what created this confusion to start with, and we need to be very clear about what we imply moving forward. While divergence and compatibility both address change and interaction, they do not directly oppose each other such that they can be used interchangeably as you have stated, and we feel that misinterpretation is also possible with compatibility, as it was with significant, for the following reasons.
a) Divergence: Focuses on the process of becoming different
b) Compatibility: Focuses on the ability to work together or coexist without conflict.
c) Models and data that have diverged can still be compatible and, conversely, models and data that have not diverged much can still be incompatible. Thus a model that is statistically divergent from the data may still have generated the study data, while a model that is statistically sub-divergent from the data may not have been the data generating model.
Now that these terms are defined, do you still think the language implications are the same as in your points a) to d) above? If so, why?
Can you elaborate on your point c) please and can you give an example? Do you mean compatibility here more in the sense of equivalence testing?
Thanks Rainer Duesing. I have attached a figure to demonstrate point c). In the figure, the curve is the assumed or target model with mean 7 kg. The sample is the blue dot and has mean 10 kg. Only 0.27% (p=0.0027) of test statistics under the model are as or more extreme than the test statistic from the data sample, and if 5% is our chosen threshold then this model is statistically divergent from the data. However, this only means that we should begin to consider alternatives to the assumed model as the source of the data. It does not mean that the model is incompatible with the data, as this model may still be the data generating model and in that sense is 'compatible' with the data; it is just that something unusual has happened (random error). Thus statistically divergent does not automatically imply incompatible, and that is a judgement we make after declaring divergence of model from data. The other point is that I personally believe that, like significant, incompatible will again be misused to imply importance and we are back to square one. Divergent or divergence cannot be construed in any way or form as importance of any type.
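As a minimal numerical sketch of the figure (assuming a standard error of 1 kg, which is what reproduces the quoted 0.27%; the actual standard error is not stated here):

```python
# Minimal sketch of the figure's numbers: how far a sample mean of 10 kg
# diverges from an assumed model centred at 7 kg, assuming SE = 1 kg
# (an assumption chosen because it reproduces the quoted p = 0.0027).
from scipy import stats

null_mean = 7.0      # assumed (target) model mean, kg
sample_mean = 10.0   # observed sample mean, kg
se = 1.0             # assumed standard error of the mean

z = (sample_mean - null_mean) / se
p = 2 * stats.norm.sf(abs(z))       # two-sided tail area under the assumed model
print(f"z = {z:.1f}, p = {p:.4f}")  # z = 3.0, p = 0.0027

# p <= 0.05, so the assumed model is declared statistically divergent from the
# data; this is a signal to consider alternative models, not proof that the
# 7 kg model could not have generated the data.
```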
Many senior people on the datamethods blog have asked me "What's wrong with compatibility?" I have kept saying there is nothing wrong from the conceptual point of view, but I did not want to contradict Greenland, as he is on that blog and is very passionate about this term. I just do not see it solving this problem in language terms, and we should call a discrepancy a discrepancy. Moving to incompatibility means that we have decided that this model cannot generate this data, and we lose the language meaning that Fisher proposed. It is interesting to note what Fisher actually said in 1925: "The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty."
Dear Suhail A. Doi !
I can follow your explanation, but isn't it a bit like playing word games? You say that declaring something as "incompatible" is a judgement. But really, the same applies to "divergent". Technically, every sample estimate (here your mean) which does not exactly hit the model null value (here your 7) diverges from it. So even if your sample mean were 7.0001, it diverges from 7. Therefore, it is a judgment when we consider something as "extreme enough" or not to declare it as sub-divergent, divergent, compatible or incompatible. Hence, your argument does not favor the one or the other terminology in my opinion. The other way around, it strengthens the argument to abandon such dichotomous terminology altogether and take the p-value as a continuous measure. Every dichotomous terminology mapped onto some critical values will be a judgement. This is also a reason I do not like the Bayes factor (although I do Bayesian parameter estimation), because typically some arbitrary but authoritative-appearing thresholds are cited to "prove" that your result is important or of any value (often even without looking at the rest of the information).
Therefore, I do not follow your argument that “divergent” could not be (mis-)used to declare importance. From my example above, what would hinder me to declare a sample mean of 7.0001 as “divergent”, ceteris paribus?
But your figure maybe exemplifies something else. Your presentation is completely right in my opinion, if you do NHST, but it typically does not correspond to the approach of parameter estimation (also true for Bayesian parameter estimation).
In NHST, you declare a null value (here 7), you estimate the sample variance and together with the sample size you construct a confidence interval around the null value to find the critical value(s), here 5.04 and 8.96 at the 5% level. You compare the critical values with your sample mean and conclude that the value of 10 is “divergent enough” to call it significant or divergent.
But in parameter estimation, you do not put the confidence or credible interval around the null value (maybe someone correct me, if any software package gives you the CI around the null value and you have to decide yourself if the sample estimate is included), but around the sample value (or in Bayesian estimation the posterior distribution described with some central tendency). Therefore, you would take your 10, construct the confidence interval [8.04; 11.96] and evaluate it, i.e. you compare if 7 would be included. If you approach it that way, the term “compatible” makes much more sense. Every value, i.e. every model with mean values between 8.04 and 11.96 are “statistically compatible” with your sample estimate.
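To make the contrast concrete, here is a minimal sketch using the same numbers as above (null value 7, sample mean 10, and an assumed standard error of 1, at the 5% level):

```python
# Minimal sketch of the two constructions described above: critical values
# around the null value (NHST view) versus an interval around the sample
# estimate (parameter-estimation view). SE = 1 is an assumption.
from scipy import stats

null_value, sample_mean, se = 7.0, 10.0, 1.0
z_crit = stats.norm.ppf(0.975)    # ~1.96

# NHST view: critical values placed around the null value.
crit_lo, crit_hi = null_value - z_crit * se, null_value + z_crit * se
print(f"critical values around the null: ({crit_lo:.2f}, {crit_hi:.2f})")   # (5.04, 8.96)
print("sample mean outside the critical bounds:", not (crit_lo <= sample_mean <= crit_hi))

# Parameter-estimation view: interval placed around the sample estimate.
ci_lo, ci_hi = sample_mean - z_crit * se, sample_mean + z_crit * se
print(f"interval around the estimate: ({ci_lo:.2f}, {ci_hi:.2f})")          # (8.04, 11.96)
print("null value inside this interval:", ci_lo <= null_value <= ci_hi)

# Every model mean between 8.04 and 11.96 is 'statistically compatible'
# with (or, in the preprint's terms, sub-divergent from) the observed data.
```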
Similarly, your statement "However this only means that we should begin to consider alternatives to the assumed model as the source of the data" does not make much sense to me. We already know that in the frequentist approach the sample mean describes the best possible model, since it would have truly no divergence, and the model with null value 10 would have a p-value of 1. In your example, a model with N(10,1) would describe the data in the best way. And again, the parameter estimation approach and the term compatibility already inform us about the best possible models according to the sample data (we should not forget this point: nothing guarantees that the data is not a bad representation).
Can you follow my arguments? As already mentioned, nowadays I would typically use a Bayesian parameter estimation approach, where it is natural to use the sample information to estimate the best possible population values. Here “compatibility” may come into play in another instance, e.g. if we use the Region Of Practical Equivalence (ROPE) approach, where we do not only declare a point value as null value, but a region of null values, for which we assume that these are of no practical relevance. Now, we can evaluate how compatible the posterior is with the ROPE. A similar approach can be found in the frequentist realm, known as equivalence testing or Two One Sided T-tests (TOST).
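For illustration, a minimal frequentist TOST sketch with made-up data and a made-up equivalence margin of +/- 2 units (the ROPE idea is the Bayesian analogue: does the posterior fall inside the region of practical equivalence?):

```python
# Minimal TOST (two one-sided tests) sketch with simulated data and an
# assumed equivalence margin of +/- 2 units; both the data and the margin
# are illustrative, not from any study discussed in this thread.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(10.0, 3.0, 25)     # hypothetical group A
b = rng.normal(10.5, 3.0, 25)     # hypothetical group B
margin = 2.0                      # region of no practical relevance (assumed)

n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
diff = a.mean() - b.mean()

# One-sided tests against each margin:
# H0a: diff <= -margin  and  H0b: diff >= +margin
p_lower = stats.t.sf((diff + margin) / se, df)   # evidence that diff > -margin
p_upper = stats.t.cdf((diff - margin) / se, df)  # evidence that diff < +margin
p_tost = max(p_lower, p_upper)

print(f"diff = {diff:.2f}, TOST p = {p_tost:.3f}")
# A small TOST p supports 'the difference lies inside the +/- 2 margin',
# i.e. the groups are practically equivalent under the assumed margin.
```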
I am really looking forward to hearing your thoughts about it.
Rainer Duesing , a bit off-topic and out of curiosity: do you really do Bayesian parameter estimation, or do you interpret likelihood or confidence intervals as a kind of credible interval (based on some flat prior)? And if you do Bayesian estimation, do you ever use strong, informative priors, and if so, where do you get them from and how do you justify them (e.g. in manuscripts)?
Dear Jochen Wilhelm ,
it really depends on the study. I typically would not use flat or very uninformative priors, since you can always rule out very implausible values. So I would say, when I do not have much a priori information, I estimate how large the maximal effect could possibly or plausibly be and build my prior for the coefficients around that, centered at zero for regression coefficients, for example. This gives a regularizing effect without committing to any effect direction.
Just to give an example of how one may proceed. At the moment, we are working on a paper where we directly incorporate prior information into the model. To break it down, it is a simple pre/post comparison of a newly implemented short-term program in clinical psychology. What I basically needed (besides some technical details) was a prior for the pre-intervention measure (the intercept) and one for the difference between pre and post, i.e. the regression coefficient.
We knew some things a priori: the dependent variable is measured on a T-scale, i.e. normally distributed with mean = 50 and sd = 10. We knew that we had a clinical sample, where patients are categorized as clinically relevant if they score above roughly 60. Additionally, we assessed the mean and sd of former patients who were not part of the study, to estimate how a typical patient group at that institute looks. We (the clinical psychologists, who were the experts in their realm, and myself) agreed on a prior for the mean of the patients at the pre measure that was a skewed normal distribution with a very large skewness parameter, so that plausible values would lie between 58 and a bit above 70 (see picture). I would say this is still not highly informative, but clearly not uninformative.
Similarly, we thought about the difference/slope. A typical long-term intervention shows decreases of about 10 points, so we thought larger decreases were implausible for a short-term intervention. We did not expect the short-term intervention to have detrimental effects, i.e. an increase, but we didn't want to rule that out completely. Hence, we also used a skewed normal with a large negative skew, so that values larger than about 4 are implausible (see the other picture).
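Since the exact prior settings are not given here, the following is only an illustrative sketch of how such skew-normal priors might be written down with scipy; the parameter values are my own guesses, not the ones from the paper in preparation:

```python
import numpy as np
from scipy import stats

# positively skewed prior for the pre-intervention mean (plausible mass roughly 58 to ~70)
pre_prior = stats.skewnorm(a=5, loc=58, scale=6)
# negatively skewed prior for the pre/post difference (mostly negative, values above ~4 implausible)
slope_prior = stats.skewnorm(a=-5, loc=2, scale=6)

for name, d in [("pre-intervention mean", pre_prior), ("pre/post difference", slope_prior)]:
    lo, hi = d.ppf([0.025, 0.975])
    print(f"{name}: central 95% prior mass roughly in [{lo:.1f}, {hi:.1f}]")
```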
The paper is still in preparation and I do not have any feedback, yet, but I think that the choice of priors is comprehensible.
Thanks very much Rainer Duesing for the detailed comments – these have been very helpful in understanding the debate as it unfolds. I agree that both compatibility and divergence are judgments based on where the test statistic sits under a model. I also agree that every sample estimate which does not exactly hit the null parameter value diverges from it. We just need to exercise judgment about what is “extreme enough” to declare it sub-divergent, divergent, compatible, incompatible, significant or not significant. The difference is in the language – i.e. do we call the disagreement between model and data incompatibility, significance or divergence? My argument is that language led us towards the misuse of the measure of divergence, so let's call a spade a spade – p values measure divergence and that is what we should call it. If we do not, then significant will continue to mean important, useful and so on. So too will incompatibility, because the p value is a measure of divergence and these are tests of divergence, not of significance, compatibility or anything else. These terms also carry English-language connotations that lend themselves to misuse.
Regarding the argument about dichotomizing or not – that is peripheral to the language argument, but in Medicine doctors make decisions and these cannot be made without a threshold to act. Blood pressure is continuous, but we have guidelines about what is high enough to act on after a measurement is taken, and this does not solely depend on the level of the BP – other factors are key, such as whether the patient just had a stroke or whether this is a post-infarction elevation. In the same way, continuous or exact p value reporting is good, but guidelines are needed around when p values should be considered small enough for an action to be taken, and that was Fisher's purpose. Dichotomania only exists when p values are misused and misunderstood; the definition of a threshold has little to do with it. My argument is: first make the language clear, then try to educate the doctors – one cannot remain in muddied waters and claim that education about reasoning from data is possible.
I agree that every dichotomous terminology mapped onto some critical values will be a judgment, and there is nothing wrong with judgments so long as they are not based on a misunderstanding – a judgment that model and data are more divergent than I am comfortable with is fine so long as one understands the context. But a judgment that there is no treatment effect because there is no significance or incompatibility attached to a deviation is clearly wrong. Can we say the same of divergence – that there is no divergence attached to a deviation? I think that is impossible, and therefore this is the correct first step – call a deviation a deviation and nothing more (or less). Now hopefully you will agree that “divergent” could not be (mis-)used to declare importance, because it is calling a deviation a deviation.
Regarding your example, nothing would hinder you from declaring a sample mean of 7.0001 as “divergent”, so long as you state the threshold you are using clearly (e.g. p < 1 is considered statistically divergent in this study) and it's up to you to justify this. But let's not get side-tracked – this discussion is not about whether p values should be dichotomized or not; it is about calling a deviation a deviation.
In NHST, we do not construct a confidence interval around the null value to find the critical value(s); that is again a very common misconception. Neither do we construct it around the model whose parameter value equals the sample mean. What we are actually doing (and what Jerzy Neyman wanted us to do) is to move population models up and down the test statistic scale till we hit the models (critical values) that meet our prespecified threshold (e.g. the 5th percentile location for the test statistic, or the 2.5th percentile on either side of the symmetrical distribution). That is how we got 8.04 and 11.96 – by defining whatever level fits our purpose. We do not compare the critical values (the limits of the interval – these are in the Fisherian sense, not in the NP decision-theoretic sense) with our sample mean – we compare our assumed hypothesis to the range of the critical values to see if the value of 10 deviates enough to satisfy the standard we set.
Therefore, I will have to disagree with you – the frequentist would always take the 10, construct the confidence interval [8.04; 11.96] and evaluate it, and indeed would compare it to 7 (the null or target hypothesis) to see if 7 lies outside it and is therefore far enough away to be divergent at the threshold we have set. Even if we approach it this way, the term “incompatible” makes less sense than saying this is a deviation that is big enough for me to act on. Every value, i.e. every model with a mean value between 8.04 and 11.96, has not deviated too far from my sample estimate and is “statistically sub-divergent”, sub meaning under the threshold that we set. Why do we need to call a deviation something other than a deviation (e.g. incompatibility or non-significance)? In the frequentist approach, the sample mean is just an estimate of an unknown model, and a single sample mean cannot be considered the best estimate of the true population mean – it is just one estimate of it. A model with a parameter equal to the sample mean is the most likely to have generated such data, but seeing the data does not mean it came from the most likely model just because that model would have truly no divergence from the data – that would be a circular argument. I agree that in my example a model with N(10,1) would describe the data in the best way, but only God knows if that is the model that generated my data. In summary, I think that because you have a Bayesian leaning you have inadvertently created a circular argument for frequentists, or I am misunderstanding what you mean and you will have to enlighten me. I therefore look forward keenly to your thoughts on these comments of mine.
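The "moving the hypothesised model up and down" description can be checked numerically. A minimal sketch (again assuming a z-setting with a standard error of 1, purely for illustration) shows that the set of null values not rejected at the 5% level is exactly the interval [8.04, 11.96]:

```python
import numpy as np
from scipy import stats

mean, se = 10.0, 1.0
nulls = np.arange(5.0, 15.0, 0.001)                 # candidate population models
p = 2 * stats.norm.sf(np.abs(mean - nulls) / se)    # two-sided p for each candidate null
kept = nulls[p >= 0.05]                             # nulls that do not reach the threshold

print(f"nulls not rejected at the 5% level: [{kept.min():.2f}, {kept.max():.2f}]")
print(f"p-value for the null of 7: {2 * stats.norm.sf(abs(mean - 7.0) / se):.4f}")
```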
Addendum - NHST is an amalgamation of Fisherian significance testing with Neyman-Pearson decision p values. Here I mean divergence (Fisherian) p values throughout.
But of course a deviation is a deviation. The sample test statistic deviates from some hypothetical value, and the statistical significance of that deviation can be derived. Nothing new here. The whole issue is about i) what statistic should be used to quantify the sample deviation from some hypothetical value (e.g. t or F or z or chi² or W or U... - something that captures all the information from the sample relevant to the problem in a consistent, unbiased and efficient way) and ii) how to judge whether this observed deviation is statistically significant, that is, whether larger deviations are reasonably unexpected under the hypothesized model with the parameter fixed at that hypothetical value. The amount of observed deviation must be rated or judged in comparison to the deviation one may expect under some (defined) circumstances, and this is done via p-values.
I have the impression, as already stated above, that this is more about a personal preference in language. Maybe this is based in the fact that we are both non-native speakers and associate different things with the English words, maybe because their direct translation into our native languages, and the thinking attached to them, is different (just a hypothesis...).
I can follow all your points, but they do not convince me, since I still think it is possible to turn the whole argument around in favour of compatibility. You say it yourself that the interval is constructed around the sample value. You do not even need to formulate an explicit null model to do this. Therefore, all we have are sample values and an interval around a sample estimate, for which we believe that the included values are not extreme enough. Doesn't it sound odd to you to construct an interval and give it a name that is already a negative (non-divergent or sub-divergent) instead of using positive language to describe it? For me, personally, this IS odd. But again, maybe that is only my way of thinking, and it is not and should not be prescriptive.
By the way, has Fisher ever said anything about confidence intervals? In my understanding the strict Fisherian application does not incorporate CIs, but only p-values. And then I need an explicit null value (but not an alternative value, as in NP). As I understand it, this is not in line with CIs around the sample estimate, but maybe I am wrong here, please correct me.
And we are going a bit in circles. You cite Fisher (argument from authority) to say that the p-value is a divergence statistic. But as I already mentioned, since more divergent cases give smaller p-values, I find it more appropriate to see it as compatibility: the larger, the more compatible; the smaller, the less compatible. Again, personal flavour.
But I would like to thank you for the constructive and kind debate here. That's not always the case on RG.
Thanks Rainer Duesing , I guess you are right that personal preference does play a role here, as this is aimed at medical practitioners who are not too interested in the mechanics of the process and just want to be able to state that something is meaningful. Though we are non-native English speakers, there is no way out, as English is the de facto language of medical research and thus all of these terms need to be in English. Perhaps you are right and clinicians will more easily see a small p as incompatibility and a large p as compatibility between model and data than see a large p as sub-divergence and a small p as (larger) divergence between model and data - time will tell. My gut feeling is that significance and compatibility are surrogates and that direct language is what will change years of debate - but that is my and the MCPHR opinion and we will only know if it gets published. Compatibility interval as a name was published in the BMJ in 2019 but there has been no uptake, and at that time Gelman disagreed and said a better name was 'uncertainty interval'. Dushoff (Article: I can see clearly now: Reinterpreting statistical significance) suggested we drop all this and use statistically clear and statistically unclear, arguing that 'the language of “statistical clarity” could help researchers escape various logical traps while interpreting the results of NHST, allowing for the continued use of NHST as a simple, robust method of evaluating whether a data signal is clear'.
You are right, it was Neyman and not Fisher who proposed the so-called confidence interval, and Neyman viewed this as a long-run probability for the process (such an interval will contain the true value in 95% of replications). The dilemma here is that this realized interval (8.04,11.96) cannot be interpreted as it has either 0% or 100% probability of containing the true parameter value. To bring this back to Fisherian inference, we can say that 8.04 - 11.96 is the range of target hypotheses that do not meet our set threshold for "a significant deviation". The reason Fisher did not think of this properly was that he was too busy arguing with NP about the fallacy of the decision p; had he calmly thought it through he would certainly have called this the sub-significant interval, meaning one that contains a range of model parameters that have not reached his threshold deviation for data from a single study (no replications needed here). This is the actual interpretation of the realized interval, and doctors have no use for Neyman's process interpretation of coverage (the long-run probability of such intervals), as that means nothing to decision makers and has falsely suggested to them that the limits of the interval are also limits on the population parameter.
Thanks again for this discussion - a lesson well learnt is that everyone sees this issue of language differently and perhaps methodologists should be free to use whatever terms they want ... it is the non-methodology end user we have to simplify things for - e.g. doctors. As you will have realized, I am also a medical practitioner and evidence-based medicine is at the core of good medical practice - we can only convince doctors to interpret research output correctly if we use uncomplicated, simple language that means what it says.
Suhail A. Doi – I have a lot of trouble with the oft-repeated statement that "The dilemma here is that this realized interval (8.04,11.96) cannot be interpreted as it has either 0% or 100% probability of containing the true parameter value."
If I were to estimate your probability of a fatal stroke as 55%, I am saying that over a large number of replications 55% of people with your risk factors will have a fatal stroke. However, if I go on to tell you that there is no way of applying this to your case, as the probability that you will have a fatal stroke is either 0% or 100%, you would look at me very strangely indeed.
Look here, you say to me, showing me a lottery ticket. They sell just 100 of these tickets, and the prize is won by drawing a ticket at random. What are my chances of winning? If I again retort that the ticket is either the winning ticket or it isn't, ignoring the information you have given me, then you will be right to walk off to find someone intelligent to talk to.
Why force a ridiculous interpretation on confidence intervals when we know we are just being pedantic?
Thanks Ronán Michael Conroy , I will skip the stroke example as that is a clinical prediction model and is a different ball game.
Coming to the lottery ticket, let me frame it this way: if only 100 tickets were sold and only 1 of the 100 was bought by the buyer, and on lottery day he was asked what his chances of winning were if a random ticket was drawn from the box, he would probably say 1%. The buyer attaches 1% probability to the single trial about to be performed. The buyer can attach a probability to a single trial because it is real to the buyer, reliably estimated from a probability model, and of course the buyer has sufficient experience with gambling. There is no need for a hypothetical future, given the probability model, where buyers are buying tickets repeatedly.
After the draw (drawing a single ticket at random from 100), before checking the name on it, what are the chances of the buyer winning? If I retort that the ticket is either the winning ticket (100% probability) or it isn't (0% probability), that is absolutely correct and if one chooses to walk off it has little to do with intelligence.
The same applies to confidence intervals and there is nothing pedantic about it. In a 95% CI the percentage reflects a long run probability over repeated CIs. Before I use the usual process to create the CI, the probability that I will "win" by creating one that includes the population value is 95%. After I create a single CI, the probability that it contains the population value is either 0% or 100% as above. That is why Fisher should have created the CI before Neyman and called it the sSI (sub-significant interval). Instead Neyman created it and gave it a less useful name and interpretation (for users) though it may work for statisticians.
Before a buyer draws a ticket, he assigns a 1% chance to the event {the ticket drawn is a winning ticket}. After the ticket has been drawn, but not yet read, you say that there is something miraculous changing the sample space and the information about the event. I don't see where this miracle comes from. As if it were forbidden, nonsensical or stupid to bet on a ticket that has been drawn but not yet read. There are still two possible outcomes; nothing changes the sample space or the event space. It sounds invalid to assign two different probabilities to a single event. I could understand if you'd say that probability is not applicable at all after drawing the ticket, but then it makes no sense to attribute "either 0% or 100% probability" to either event (which should then not even be an "event" as a member of the event space). That all makes no sense at all.
The probability space is to be re-defined upon incorporating new information (relevant to the outcome). There is no such information provided by just buying a ticket. It is provided by reading the ticket. After reading the ticket, the sample space is degenerate and there is no probability involved.
Jochen Wilhelm The ticket has the same magical quality before the event. The magic property of closed system gambling is that it has set numbers with set rules. A single play, simple or complicated, has the same magic; win or lose.
It is unfortunate that we have these never-ending discussions about p-values. There is no solution, save teaching and understanding what we teach.
Suhail A. Doi I believe your approach is misguided. Language and trying to explain the new language will not work. We have to start now to teach it right and hope the rest will catch up. The concept of in-the-long-term holds for closed systems, but is limited in the real world. A confidence interval that contains the 'long-term' value, at the time of the test, might not contain the new long-term value at a later time. An open system evolves. A conclusion about a statistical test has little value in the scientific question for which the data were collected. The real test is if a viable conclusion is possible from the data. A statistical test is necessary for data quality, but cannot ensure the quality of the data.
Jochen Wilhelm and Joseph L Alvarez , you have both confirmed that language is the problem here, because you have both misinterpreted the probability in the lottery example as well as the realized confidence interval - because the language is misleading. The misinterpretation was in defining "1% probability of what", and you both refer to the ticket (please correct me if I am wrong). That is simply because the terms used are very confusing. Saying "there is no solution, save teaching and understanding what we teach" seems to have bounced back, because we cannot teach anything if we have not understood what the language means.
Coming back to the lottery ticket, the 1% probability of winning for the buyer is the long run probability that he would win on repeated entry into different but similar lotteries - if the buyer assigns a 1% probability to this single trial it means that he is assigning a 1% probability to the lottery process drawing a number that produces a win for him. Only after the process completes (even before the results are released) does his realized ticket become important because the process has completed and the probability of him holding a winning ticket is now either yes (100%) or no (0%).
Same for a confidence interval. A 95% interval-generating process is such that, if repeatedly applied, it will create intervals of which 95% are winners (population parameter included). I can thus attribute 95% probability to the CI process creating an interval with a 'win' for me. Once the interval is created, this realized interval is either a win (100% probability of containing the true value) or a loss (0%...).
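A quick simulation makes this process-versus-realized-interval distinction concrete (the true mean, sigma and sample size below are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mu, sigma, n, reps = 7.0, 3.0, 30, 10_000
z = stats.norm.ppf(0.975)

samples = rng.normal(true_mu, sigma, size=(reps, n))
means = samples.mean(axis=1)
se = sigma / np.sqrt(n)                       # known-sigma case, for simplicity
covered = (means - z * se <= true_mu) & (true_mu <= means + z * se)

print(f"long-run coverage over {reps} intervals: {covered.mean():.3f}")   # ~0.95
print(f"does the first realized interval cover 7? {bool(covered[0])}")    # yes or no, nothing in between
```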
My point in creating a sub-divergent name for the same interval is that the "confidence" term gives us false confidence about applying this to the realized interval, as you both did, leading to never-ending misinterpretation, albeit with confidence.
No, you got me wrong. I said the sample space is { "the ticket wins", "the ticket loses" }, and the sigma algebra over this, as the event space, is { {}, { "the ticket wins" }, { "the ticket loses" }, { "the ticket wins", "the ticket loses" } }. The probability measure assigns a real value to each member of the event space, in our example 0, 0.01, 0.99, 1 (in the order of the listed events).
This probability space defines our state of knowledge about the possible outcomes up to the point where we actually know the outcome. This knowledge isn't altered by drawing the ticket; it is altered by reading/seeing it. Note that the "process" of drawing the ticket is irrelevant, as you said that the buyer assigns a probability of 1% to the event. The fact that the ticket is drawn does not add information. You may consider the sampling (drawing) of a ticket from a well-known population (a specified pool of tickets, with a known proportion of winning and losing tickets) under specified conditions (e.g. the order of the tickets in the pool is unknown) and use that to derive a probability assignment based on that fully disclosed process as a sampling probability (e.g. assuming a Laplace space). But nothing of this was said. It is just: there is a ticket being bought, and the event {"ticket wins"} is assigned a probability of 1%.
It's not about a realized ticket, it's about us realizing what the ticket says. Before revealing the ticket, we are as uncertain about the outcome as before. The relevant process is "revealing", not buying or drawing. In schoolbook examples, drawing is coupled with revealing, no distinction is made, and there the process may be called "drawing" as well (because there, drawing necessarily means revealing).
I don't understand at all what you mean with "long run probability".
Jochen Wilhelm
In another discussion on ResearchGate (https://www.researchgate.net/post/Should-Null-Hypothesis-Significance-Testing-NHST-be-eliminated), you said “NHST is a hybrid.* And I think that the logic is flawed.” If the logic of NHST is flawed, as you admit, then the problem lies with NHST and p-values, not with the many scientists who suggest abandoning NHST and p-values but are often blamed for misunderstanding NHST and p-values.
Hening Huang ,
yes, NHST is logically flawed, but p-values are not. It's not that "the problem lies with NHST and p-values" (emphasis mine) as you wrote, but just with NHST. It may be (and in my opinion is a fact) that the propagation of NHST has increased confusion or created misconceptions about p-values, but that's a different story. We today do have all that confusion and misconception, and the question is how we can get out of that trap. Stopping the teaching of NHST will be one step among many others.
I wonder where this mix-up of p-values themselves, the process to derive p-values (e.g. t-tests), and the method of how to use them (Fisher vs. NP [where we do not need the p-value itself] vs. NHST vs. whatever) comes from. Really, I mean there are several discussions on RG here alone, and this has also been stated in this specific discussion. The last of these has nothing to do with the former two. I can use a valid process and derive a valid parameter, but use it wrongly/badly. This does not make the process or the parameter itself wrong/bad.
Thanks Rainer Duesing , I agree with you on this - I also think it's a valid process which has been used badly, and I do not blame researchers - I blame Fisher and Neyman for calling this "significance" and "confidence" respectively. By doing so they created this artificial misunderstanding, although in their minds they were very clear. This is what Fisher said about confidence intervals in 1949 (bold emphasis is mine) in response to the interpretation Neyman gave to the confidence interval, because he felt, as we do, that Neyman's interpretation was of a long-run probability (i.e. probability derived through replication) and that this was unhelpful to "scientific workers" needing to make decisions:
"An alternative view of the matter is to consider that variation of the unknown parameter, μ, generates a continuum of hypotheses each of which might be regarded as a null hypothesis, which the experiment is capable of testing. In this case the data of the experiment, and the test of significance based upon them, have divided this continuum into two portions. One, a region in which μ lies between the limits 0.03 and 41.83, is accepted by the test of significance, in the sense that the values of μ within this region are not contradicted by the data [this is our sub-divergent interval], at the level of significance chosen. The remainder of the continuum, including all values of μ outside these limits, is rejected by the test of significance."
Also Senn says (Article: Two cheers for P-values?)
that "What is unreasonable is to regard the sort of ‘contradiction’ implicit in the Lindley paradox as a reason in itself for regarding P-values as illogical for, to adopt and adapt the rhetoric of Jeffreys (Ref. 19, p. 385), ‘it would require that a procedure is dismissed because, when combined with information which it doesn’t require and which may not exist, it disagrees with a procedure that disagrees with itself.’ Senn also quotes Fisher who says ". . tests of significance are based on hypothetical probabilities calculated from their null hypothesis. They do not generally lead to any probability statement about the world but to a rational and well-defined measure of reluctance to the acceptance of the hypothesis they test’."
In summary, there is nothing wrong with p values and what is wrong is making an idol of them (quote from Senn), but I would extend this to say that what is wrong is calling divergence 'significance' and sub-divergence 'confidence'.
Rainer Duesing I can only speak for myself, but somehow this comes back again to the reflection on a 'toolbox'. If you have a problem you search for a solution by posing a question. If the only thing you are taught is that hammers solve problems, you will use a hammer. If everybody believes smashed things are the solution, we will smash things. If we never actually use the smashed things, smashing will always look like the solution; only when we use them do we see the issue.
I mean P(Model|Data, Info), P(Data>0|Model0), P(Data|Model), P(Data|Model0)/P(Data|Model1) are all different answers. But clearly (sarcasm) P(Data>0|Model0) should always be used for every issue. Note that if I have some in vitro experiment on enzyme reduction in cancer cells (whatever) and want to know whether my data carries sufficient (arbitrary) information on a signal-to-noise ratio, independent of what I believe the signal would be (we don't really know for a new product), and perhaps a short run of studies shows only low p-values, why would I want to answer P(Model|Data, Info)? Then I could bias my result towards P(Model>0|Data, Info), because I know many believe that if credibility intervals do not overlap 0 there is an 'effect'. I would personally go full 'frequentist' mode: select my model and a 'meaningful' sample statistic, derive the sample size, use controlled and randomized conditions, and be done (in short).
Mostly this opportunity does not exist, because the majority believe this is all accounted for by the model (residual normality and independence tests), and once that iterates through a community nothing happens. While there is an appeal to Bayesian thought and models, somehow we do not really want to conclude on a posterior. We just want to state that, to the best of our knowledge, these are the most reasonable values for the parameter. If we are 'wrong', bad luck.
Yet, this is a completely different question from searching for an objective signal-to-noise ratio. Articles such as "Bayesian Estimation Supersedes the t Test" are (in my opinion) completely ridiculous, because the two are not even competing in the first place; we just have to get our questions aligned with the methods we use. The problem is indeed partially one of language, but also of how this language refers to the abstractions being used. That being said, I see no need for wordplay, because everything in 'statistical language' has already been defined before (https://www.statlect.com/fundamentals-of-statistics/point-estimation). That 'hypothesis test' is a misleading term, because the frequentist does not really address a hypothesis, is unfortunate. The same goes for the word 'significant', but when addressing these issues the underlying theory should be highlighted, and I do not think many people are interested in it. The Bayesian can address a hypothesis, but needs to make a sacrifice. Titles such as "Bayesian Estimation Supersedes the t Test" simply aggravate the issue, because they portray the standard t-test as inferior. While I like Kruschke's book, it capitalizes on a false sense of superiority rather than addressing how to align questions and answers. The latter is not an easy task, considering people like words such as 'likelihood', 'probability', 'cause', 'effect', 'importance', 'evidence', 'truth', 'facts', 'knowledge'. I have no clue what any of these words really 'mean'. I have no clue how my estimated model 'parameter' might relate to the world outside of us (there is no physical estimate).
Titles such as "Bayesian Estimation Supersedes the t Test" kapitalizes on nihlismn (I just use it because it sells and I do not really care) and dogmatismn (Bayesianism is the best). The opposing question should be in what sense does Bayesian estimation suppersedes the clasical t-Test. If I never want to make a false positive then set only point null priors (vague prior centered around 0). But we do not know there is an 'effect' or 'no-effect' in the first place. So how can any method supersed a t-Test if we do not know the 'true' answer in the first place? Either, indeed, you state the prior and everything cohering to the prior is 'true' or you state none and remain objective and fall to an infinite regress till its limits. I see no way how advocating another word (or argue against a t-test) provides more or information or a clearer explanation for people that might them be able to choose the tools they need or scrutinize positive kapitalisation. The later being so sever that I cannot write an article without the word 'effect' or refrain from a conclusion. If only have associations or correlations, because they only used the word 'effect'. I think these issues arrive somehow from the issue of how abstracts refer to the outside world. But in this regard you would immediately fall back to realismn and anti-realimsn and how such viewpoints reflect theories of justificaiton and the believes we hold (https://en.wikipedia.org/wiki/Justification_(epistemology)). But (again) these are not question of statistics directly.
Suhail A. Doi I didn't want to blame anyone and I am pretty sure that I believe wrong things, because I did not understand them correctly or I did not inform myself enough. I think this is true for all of us. And you do not even need to know the correct definitions of p-value, t-test, NHST and so on, BUT at least we should acknowledge that these are not the same things, shouldn't we? Maybe I am wrong in my definition of p-value, but I am quite sure that this is not exactly the same as the t-test or NHST in general. That's all I meant.
Wim Kaijser I am not sure, why you are addressing your answer to me. I am not a die hard Bayesian. I think frequentist approaches are valid and are justified and we should know and use both approaches. On the other hand, in my experience, Bayesian methods seem to have more flexibility to handle certain problems. Admittedly, Kruschke seems to be a Bayesian with a very strong attitude against frequentism. Therefore, I would take the title of his paper with a grain of salt. But nonetheless, I think his book "Doing Bayesian Data Analysis" is very useful and helpful, to dive into this realm (along with McElreath's and Gelman's publications, which all have slightly different views and standpoints).
To pick up the Kruschke article once more, I understood the title in the sense that his "Bayesian t-test version" supersedes the "standard t-test" not BECAUSE it is Bayesian, but because the Bayesian approach gives you more flexibility to account for the specifics of your data. Just to give some examples (a rough sketch of such a model follows the list):
1) By default, Kruschke's approach uses a Student t distribution as the likelihood, instead of a Gaussian. Since the nu parameter (degrees of freedom) of the Student t is not fixed but modeled itself, this gives flexibility to account for outliers. Small df indicate outliers, but if they are accommodated by the heavy tails of the Student t, these outliers do not influence the estimates for mu and sigma as much as they would in a Gaussian, where outliers would necessarily increase sigma and maybe bias mu when they appear more on one side. If the df is large, it means no outliers, and the Student t approximately transitions into a Gaussian --> more flexibility, behaves like a normal t-test if everything is fine, but adapts if outliers are present.
2) By default, the sigmas for both groups are estimated separately, instead of assuming homogeneity. So it already does what is suggested in the frequentist realm (the Welch test): it accounts for heterogeneity if needed, and otherwise returns the same results as the standard approach.
3) The typical Bayesian ingredient: the prior. You have the flexibility either to incorporate informed priors or not, i.e. to use a flat prior. Again, if you use flat priors, you mimic the default t-test or the frequentist approach in general and get the same results. But you have the flexibility to do otherwise. In my experience, you can rule out very implausible values very easily. If not, just keep flat or very uninformative priors. This may be a computational problem, but it isn't a general one.
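The promised rough sketch of such a robust, BEST-style comparison, written with PyMC/ArviZ (assumed to be installed); the priors and the simulated data are illustrative defaults, not Kruschke's exact settings:

```python
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(42)
y1 = rng.normal(10.0, 2.0, size=40)   # hypothetical group A
y2 = rng.normal(11.5, 3.0, size=40)   # hypothetical group B
pooled = np.concatenate([y1, y2])

with pm.Model():
    mu1 = pm.Normal("mu1", mu=pooled.mean(), sigma=10 * pooled.std())
    mu2 = pm.Normal("mu2", mu=pooled.mean(), sigma=10 * pooled.std())
    sigma1 = pm.HalfNormal("sigma1", sigma=10 * pooled.std())   # separate spread per group
    sigma2 = pm.HalfNormal("sigma2", sigma=10 * pooled.std())
    nu = pm.Exponential("nu_minus_one", 1 / 29) + 1             # heavy tails only if the data ask for them
    pm.StudentT("obs1", nu=nu, mu=mu1, sigma=sigma1, observed=y1)
    pm.StudentT("obs2", nu=nu, mu=mu2, sigma=sigma2, observed=y2)
    pm.Deterministic("diff_of_means", mu1 - mu2)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)

print(az.summary(idata, var_names=["diff_of_means"]))
```

With flat-ish priors like these, the posterior for the difference of means will typically be close to what a Welch test reports; the flexibility only shows when outliers, unequal spreads, or genuinely informative priors come into play.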
So my conclusion was that it supersedes the t-test not because it is Bayesian, but because it is more flexible, while giving you the same results if needed. Any other non-Bayesian approach that is able to do this could "supersede" the classical t-test (the Welch test already does it in some respects, imho...). The advantage is that all this flexibility is concentrated in one single test. Otherwise you need to know and use the Student t-test, Welch test, Yuen trimmed-mean t-test, bootstrapping....
Flexibility is often used as an argument against the Bayesian approach, because it gives more freedom to cheat. I acknowledge the problem, but I think that 1) a disingenuous researcher will exploit anything and could also simply invent data (there are numerous examples, and frequentism did not help either...), 2) it should be good practice to openly show your priors (as I did some posts before) and justify your decisions, and 3) we need flexibility to find the model which fits the data (not the results you would like to have, of course!!). In the frequentist realm, we already test different models to find the best one (e.g. Poisson vs. negative binomial vs. zero-inflated versions etc.).
I hope my point became a little bit more clear. I would not abandon frequentist approaches per se, they have their justification.
Thanks Rainer Duesing I am glad you are not dismissing the frequentist theory of inference in favor of the Bayesian theory of inference and in fact this is what Senn had to say about the latter:
"I am in perfect agreement with Gelman’s strictures against using the data to construct the prior distribution. There is only one word for this and it is 'cheating'. But I would go further and draw attention to another point that George Barnard stressed and that is that posterior distributions are of little use to anybody else when reporting analyses. Thus if you believe (as I do) that far more use of Bayesian decision analysis should be made in clinical research it does not follow that Bayesian analyses should be used in reporting the results. In fact the gloomy conclusion to which I am drawn on reading (de Finetti 1974) is that ultimately the Bayesian theory is destructive of any form of public statistics. To the degree that we agreed on the model (and the nuisance parameters) we could communicate likelihoods but in practice all that may be left is the communication of data. I am thus tempted to adapt Niels Bohr's famous quip about quantum theory to say that, anyone who is not shocked by the Bayesian theory of inference has not understood it’."
Just as a side note to Senn's quotation regarding clinical statistics: a full Bayesian treatment would have to make a lot of things more transparent. The frequentist results are contingent on a lot of things that are (usually) not reported. This starts with the selection of inclusion/exclusion criteria, the selection of covariables considered relevant, the selection of a study time-frame, the transformation (or omitted transformation) of variables, the selection of functional relationships between variables (linear, quadratic, ...), the inclusion/exclusion of interactions between variables, and much much more. This is often described as a "garden of forking paths" and "researcher degrees of freedom". In the end, the results depend on a very subjective selection and treatment of the data, and a Bayesian view would simply highlight that the interpretation is inherently subjective anyway.
Rainer Duesing No 'negative' reason per se (it sounded like you inferred this to be the case). I referenced you because your previous comment ("I wonder where this mix-up of p-values itself...") sparked some 'intrusive' thoughts about where my own mix-up came, and still comes, from.
Jochen Wilhelm
Thank you for confirming that “NHST is logically flawed”. However, I am confused: how can you separate p-values from NHST? A p-value is a result of NHST. If we admit that NHST is logically flawed, we should not use it, and our universities should “stop teaching NHST”. Then p-values are gone, and the debate is over.
The p-value is a tail probability of a statistical model under a fixed null. Nothing more, nothing less. As such, it's just a kind of normalized measure of the information, relative to the fixed null, provided by sample data. This was promoted by Fisher to be used in a "test of significance", to reach a judgement regarding the discernability between an unknown parameter value, of which we have a sample estimate, and a hypothetical value of that parameter. Neyman criticized that this procedure lacks the logical foundation to make a decision, as only the rejection of the null is useful to make any claim, whereas failing to reject the null means only that the data were not sufficiently conclusive. Neyman said that a proper decision-theoretic approach requires a specified alternative hypothetical value, and a test must provide a decision between these two distinct alternatives, A and B. That requires a specified loss function on the possibly wrong decisions, and based on that (and some more information) a sample size can be determined to minimize the expected loss from the decision. The problem with Neyman's approach, however, is that it is next to impossible to find sensible alternatives and to define sensible loss functions in almost any area of research (it may be different for industrial production processes like quality control etc., where production costs, stopping costs, and profits are all known and can be used to find a loss function). Neyman's approach did not require calculating a p-value. It was sufficient to calculate a rejection region for the hypothesis A and to check if the test statistic falls into that region (if not: act as if A was true, otherwise: act as if B was true; the expected loss of these actions is minimized by design).
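To illustrate the Neyman-Pearson mechanics just described, here is a minimal numerical sketch (all numbers invented): fix alpha, a specific alternative and a sample size, compute the rejection region, and simply check whether the observed statistic falls into it; no p-value needs to be reported:

```python
import numpy as np
from scipy import stats

mu_A, mu_B = 0.0, 0.5          # hypothesis A (act as if no effect) vs. alternative B
sigma, n, alpha = 1.0, 50, 0.05
se = sigma / np.sqrt(n)

# one-sided rejection region for A: sample means above this value lead to "act as if B"
crit = mu_A + stats.norm.ppf(1 - alpha) * se
power = stats.norm.sf((crit - mu_B) / se)   # probability of rejecting A when B is true

observed_mean = 0.31                        # hypothetical observed sample mean
print(f"rejection region: mean > {crit:.3f}; observed falls in it: {observed_mean > crit}")
print(f"power against B = {power:.2f}")
```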
The NHST was an attempt to combine these two approaches. You can read more details in the Wikipedia article ("The modern version of hypothesis testing is a hybrid of the two approaches that resulted from confusion by writers of statistical textbooks"...). It proposes a decision (a pseudo-decision between "reject the null" and "accept the null") but uses only a p-value of the (Fisherian) null hypothesis. The "alternatives" are defined as "the null is true" vs. "the null is not true" (one of the stupid definitions that lead to so much confusion). It does not require the specification of a loss function and does not require a sample size chosen to ensure the minimization of the expected loss.
There are many flavors of that approach, some of which are a bit more aware of the problems (e.g. they would not claim to "accept the null", but to "fail to reject the null"). These seem to be attempts of going back from NHST to Fisherian tests of significance.
By the 1940s NHST had been created as an amalgamation of the Fisherian significance test and the NP decision framework. While there are claims that no p value is involved in NP processes, the alpha threshold is a p value threshold, and therefore claiming that there are no decision p values seems false. Perhaps an actual p value is not important, but the decision is clearly p value driven.
No one can claim that the scientific thinking behind NHST is flawed; rather its interpretation is flawed, and what needs to be upgraded is how one reasons from the data. That will never happen till we drop both the terms significance and confidence, because without doing so there is no incentive to update our thought processes and we keep getting drawn to the wrong thing. I always give the example of dehydration and hypovolemia - they mean two completely different things in Medicine, but everyone with hypovolemia is termed dehydrated because the term dehydration implies (dictionary) "loss or deficiency of water in body tissues", and should that not be the same as hypovolemia (hypo = less; volemia = volume of liquid in the body)? Thirty years after it was pointed out that the two are different, textbooks and specialists still call everyone 'dehydrated' - why? No one bothered to change the terms to make them mean what was really at stake - tissue vs vascular volume.
The thinking behind NHST is not scientific. NHST is mathematical and applies to data. It does not address a scientific question. A p-value does not address a scientific question. It is data all the way down.
We cannot fix the literature of the past by a band-aid approach. We have to examine and identify the disease. What went wrong?
Somebody decided that NHST was part of the scientific method. This decision, that the numbers can decide independently of the context, is the flaw. The trickle-down from NHST to significant p-values as a decision point is hardly surprising. As Jochen Wilhelm said, "The p-value is a tail probability of a statistical model under a fixed null." It cannot decide, no matter how statistically significant it might be. The p-value indicates something to the investigator concerning the data collected.
The p-value is not the problem. The term 'significant' is not the problem. Statistical textbooks and their mindless examples are the problem. The problem will persist as long as the p-value is considered a prize.
The question we really need to ask is why the p value is considered a prize by researchers. If we do not know that we don't know, nothing will change. Let's see what the reviewers think ..... They might agree with you and then we maintain the status quo.
Because the p-value is relevant as the first line of defense against over-optimistic interpretation of an estimate. A sufficiently low p value allows the researcher to state that the data are minimally conclusive (wrt. the null*). That's important, as no one likes to publish or read a paper saying "we collected some data that turned out to be inconclusive with regard to anything possibly interesting".
The relevant aspects of the research are meaningful and helpful models that are embedded in a current environment of theories and known facts and that provide some form of improvement (typically some wider applicability, explaining a wider range of observations better or more precisely). It remains important to show that the observed data are not contradicting the proposed model, and excludes alternative explanations. For this it is required at some point to judge the information content of the observed data regarding the model (or some aspect of the model). For this, checking the p-value is a reasonably good tool we have. Admittedly, this tool for use on the simple case of a two-sample problem is shooting at sparrows with cannons (the "inter-ocular trauma test" (1) would be fully sufficient), but it becomes valuable for more complicated models (multivariable, non-linear, including interactions, etc).
A low p-value alone is not sufficient to qualify as good research. The sensibility of the tested hypotheses and the quality of the experiment and data are far more relevant criteria. I'd like to know where along the line this got missed, and why there is so much worthless research published just because meaningless p-values are low.
Indeed, as Jochen Wilhelm said, “The p-value is a tail probability of a statistical model under a fixed null. Nothing more, nothing less.” However, for a specific problem, what exactly does this tail probability mean? Chén et al. (2023) mentioned common misconceptions about the p value, including that “the p value measures the probability that the research hypothesis is true” and that “the p value measures the probability that observed data are due to chance”. In my opinion, the definition of the p-value, based on NHST, is the root cause of these misconceptions. In other words, there is no way to correctly interpret the p-value under the NHST paradigm (or using NHST language) because “NHST is logically flawed”. Therefore, unless we abandon NHST, the endless debate about p-values will be fruitless.
Furthermore, the p-value is at best an indirect measure of uncertainty in the estimated effect size, and as Alessandro Rovetta puts it: “The P-value is (finally) an inherently uncertain measure” (Preprint: The P-value is (finally) an inherently uncertain measure). But why bother using this indirect measure that causes so much confusion and misinterpretation when we can directly measure the uncertainty using the “standard error” or “standard uncertainty”, as we do in measurement science? Hirschauer (2022) stated, “What we can extract – at best – from a random sample is an unbiased point estimate (signal) of an unknown population effect size and an unbiased estimation of the uncertainty (noise), caused by random error, of that point estimation, i.e., the standard error, which is but another label for the standard deviation of the sampling distribution. We can, of course, go through the mathematical manipulations to transform these two pieces of information into a signal-to-noise ratio (z- or t-ratio), a p-value or even a dichotomous significance statement. But why should we do that and accept the loss of information and risk of inferential misinterpretations?”
Hirschauer N (2022). Some thoughts about statistical inference in the 21st century. SocArXiv, December 20. doi:10.31235/osf.io/exdfg
What many people seem to miss is that a p-value is a sample statistic. It naturally varies from sample to sample, just like any other sample statistic. My impression is that many think a p-value is some sort of miraculous truth that represents an actual, true uncertainty. The problem is more difficult than for the sample mean (also a sample statistic that varies from sample to sample), which has a particular distribution one has in mind (an approximately normal distribution), whereas the p-value comes from a distribution whose shape depends on the (unknown) discrepancy between the hypothesized model and the unknown true state of nature, and therefore has to be interpreted relative to a reference distribution (the null distribution, the standard uniform distribution). Of course I think that all this weird kind of thinking is rooted in teaching NHST.
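A small simulation (with invented numbers) illustrates this point: the p-value is itself a sample statistic, approximately uniform under a true null and piling up near zero under a false one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps, n = 5000, 25

def sim_pvalues(true_mu):
    # many repeated samples, one one-sample t-test (against 0) per sample
    x = rng.normal(true_mu, 1.0, size=(reps, n))
    return stats.ttest_1samp(x, popmean=0.0, axis=1).pvalue

for mu in (0.0, 0.5):
    p = sim_pvalues(mu)
    print(f"true mean {mu}: share of p < 0.05 = {(p < 0.05).mean():.3f}, median p = {np.median(p):.3f}")
```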
Exactly. The NHST is about making decisions. The data cannot decide. A p-value cannot decide.
The investigator decides and must defend the decision. The p-value is not sufficient as a defense. A small p-value indicates that the data are unusual if the null were true. Unusual indicates that it might be useful to pursue the line of inquiry.
The standard error of the effect size has no scientific meaning without the context of the measurement. If an effect has a large standard deviation the certainty of the effect and ability to apply it are not improved by a small standard error based on a large number of measurements.
I think the question you are all trying to ask but not articulating well is this: If a researcher calculates a p value based on a target hypothesis, what information value does this have for the researcher?
Before I answer this question, it should be pointed out that there are many misconceptions floating around so that is not a reason to blame p values just as a bad driver can not blame the car for a car crash. Both the p value and the car are a tool and they will only help if used wisely.
My answer to the question is this: a p value tells the researcher whether the trade-off between random error and effect size is of sufficient magnitude to believe that the assumed model, with its stated parameter value, is unlikely to have generated the data. Nothing more or less.
What does this mean? If the p value is say 0.01 and my assumed model has a parameter value of zero, then I conclude that this model was not the data generating model. This does not mean that the data generating model is clinically different from zero e.g. it could be 0.5. For that we need an interval of models that could have generated the study data and a range of clinical indifference. One can argue that this interval can be used in lieu of the p value to make the same assessment. Perhaps that is true but keep in mind that the limits of the interval stop at p=0.05 or whatever threshold you set while the actual p is more informative e.g. 0.0001 gives more information than just say 0.04.
I don't concur with your answer. That's not in the p-value. Only the practice of taking low p-values as a reason to consider the estimate as indicating a population value different from the hypothesized value provides a way to safeguard against over-optimistic interpretations.
Under an almost-true null, which is the theoretically worst scenario, you get (almost) no control; all your conclusions are "random", with an expected 50% probability of the estimate being on the correct side of the null and 50% probability that it is on the wrong side. Not better than tossing a coin. The data provide (almost) no information, so actually no data would be required at all (tossing a coin would be enough). Under this scenario, only few p-values will be low. So among all experiments carried out under that scenario, you will only rarely make a conclusion (which, then, would have a 50% probability of going in the right direction). The fact that most results will remain inconclusive does not change the weight or interpretation of the (few) low p-values.
Under a considerably false null, almost every estimate will be at the correct side of that null, and almost every p-value will be low.
Doing "stupid" research (testing only nulls that are almost true) will produce a series of conclusions of which, expectedly, half are in the wrong direction. By completely skipping the significance tests, this would be the very same outcome, only for many more experiments. The benefit lies in the fact a "smart" researcher (testing only nulls that are considerably wrong) will get more low p-values, and due to constraints on time, money and publication space, papers with correct conclusions will accumulate faster than papers with wrong conclusions. The benefit is at the side of the scientific community, not at the side of the researcher. For a researcher, a low p-values is as good or as bad, as informative or as uninformative as a high p-value. There is nothing to be interpreted w.r.t this particular instance of that p-value. The act of refraining from making any conclusion about an estimate in all cases where the corresponding p-value is not low is a broom that is travelling the road of the scientific community.
The p value is just a percentile score. If p = 0.05 then 5% of scores (test statistics) under the model are equal to or more extreme than the observed test statistic. If p = 0.01 then 1% of scores under the model are equal to or more extreme than the study test statistic, and so on. You are free to choose your percentile cutoff for divergence. Once you have decided on your threshold for making a decision (however high or low), the concepts you have outlined do not feature here.
Also there is nothing magical about the p value. It is like a board exam: we decide if a student scores low enough to fail by deciding on a percentile threshold. We could endlessly argue that we will make a mistake regardless of the threshold we choose and that therefore we should not make decisions about admission solely from student scores. I agree with that - the p value is just one tool and should not be used in isolation to make decisions about study results.
Based on this discussion a new and concise version of the preprint has been uploaded that removes ambiguity in interpretation as has been discussed above by many commentators.
A more fundamental point is context - Fisher worked in agriculture and was solely interested in improving yields - a group effect. He had no interest in which plant contributed to the effect. Clinical care is always concerned with the individual, so his approaches are unsuitable for us. However, public health questions, being concerned with the community at large, can be evaluated this way.
Sumpter, in his recently released book Four Ways of Thinking, argues that frequentist thinking is the most unsophisticated way of thinking and the one least likely to provide new understandings. Why is medical research so stuck on the least effective ways of gaining new understandings?
I agree that Fisher was not that interested in individual plants as we are in terms of individual patients but the methods we use are population based and like in agriculture, clinical trials also report group effects (the average treatment effect). As such, this study average effect size will differ from the effect in individual study participants, which may be smaller or larger than the average study effect. This variation in treatment effect observed among participants within a single RCT is described as heterogeneity of treatment effects (HTE). Medical practitioners require that such heterogeneity be addressed so that a study reports the best empiric individual treatment effect allowing medical treatments to target the right patient. Addressing HTE is therefore central to the practice of personalized (or precision) medicine. This has little to do with either the frequentist method used or p values but rather other epidemiological aspects such as how we select relative effect sizes (collapsible or not) and how we deal with prognostic and interacting covariates. The latter are not limited by the way in which we think about the analyses and would apply across the frequentist-Bayesian divide.
Coming from a systems and complexity sciences perspective, the question arises: which are the critical variables that contribute to each of the heterogeneous patterns?
Answering this question goes well beyond - in Sumpter's words - simple ways of thinking.
That said, I agree mostly with your explorations in the paper. I hope it helps to move the field of patient-outcome-focused research in a direction that can provide better understanding about individual decision-making.
I share your sentiments. I also hope that language helps, as all other modalities have been tried: neither banning nor calls for retiring p values have worked, and blaming end users (physicians) by saying that they collectively have cognitive failures has obviously failed to change anything in the last decade. Misleading language has not been addressed, and the use of precise terms that cannot be manipulated may be the straw that breaks the camel's back!
Indeed, “… blaming end users (physicians) by saying that they collectively have cognitive failures has failed to change anything in the last decade.” In fact, why should the end users of statistical methods (tools) be blamed if the problem lies with the tools?
Suhail A. Doi Sir, you may follow
https://medium.com/@shikhartyagi_93772/understanding-frailty-and-error-terms-in-regression-analysis-a79800a41159
The p value is a tool and, like every tool, it can be used wisely or misused. Many posts in this thread have blamed the user, so clearly many believe that users do not know how to use the tool. The tool has endured for 99 years, and calls to retire it have been based solely on its misuse. When the tool has been misused, the blame for the misleading conclusions cannot be attributed back to the tool; rather, we need to find out why it has been misused and propose solutions.
We should not mix up the contributions of each party here before we start blaming anyone. We have the inventors of tools, the tools themselves, the manufacturers/distributors of the tools (who write the manuals, so to speak), and the end users.
I would say it is hard to blame anyone directly, but the heavy lifting needs to be done by the manufacturers/distributors, who "sell" the tools to an audience. They are the link between the inventors and the users. It does not matter how good a tool is (and that's where, for example, Hening Huang, Jochen Wilhelm and I disagree in the first place) if the end user is unable to understand how the tool needs to be used correctly, because the manuals are too vague, imprecise, or simply wrong.
One may argue that the user needs to be better educated, but this is hardly manageable and cannot be accomplished if the "manuals" are sub-optimal. Most statistics books I know that are addressed to end users work like cookbooks: if you have A, do X; if you have B, do Y. They offer only mechanical applications of tools, without explaining WHY we are doing this.
Even "outsourcing" the statistical part completely to experts and trained statisticians will not help in general, if the end-user is not trained at all. Then (s)he will not understand at all what the results mean and just regurgitate what has been said by others.
Suhail A. Doi: Surely, yours is a well-presented suggestion, Dr. Doi. As a statistician who has practised in non-academic environments and been involved in statistical analyses over a long period, I would like to add a comment. I wonder whether the change in terminology would make a huge difference if the issues mainly arise from our "attitudes". In many situations, I noticed that using significance tests where proper experimental conditions were lacking (i.e. randomisation and other design issues) posed a larger problem than the p-value itself. As Prof. Fisher writes (1955), "the conclusions drawn by a scientific worker from a test of significance are provisional, and involve an intelligent attempt to understand the experimental situation". In some fertiliser experiments, I remember arguing that even a 0.10 significance level was useful for drawing conclusions. If we clearly set out, before any significance testing, our strongly argued criteria for drawing conclusions from the outcomes, wouldn't much of the discussion around the p-value be resolved, from a practitioner's point of view? Isn't the problem of reporting bias mainly about academic attitudes on what counts as "acceptable" or "publication-worthy" research, too?
My argument is that, if a tool is bad or flawed, no matter how good its manual is, the tool should not be used. At least we agree that “NHST” is a flawed tool. Then we should abandon it and not waste time educating the end-user.
Hening Huang I do not see how a tail probability (p value) can be flawed - it can only be misused. NHST is just a term for the use of tail probabilities to make an inference or a decision. Both of the latter can be flawed, but the 'tool' underpinning both (inference and decision) is again the p value. The Neyman-Pearson decision-theoretic approach is sometimes touted as not using p values, which is false: it uses an alpha level, which is simply a threshold for a p value. So, in summary, the tail probability under a Gaussian model is what it is and there is nothing flawed about it (as Senn has said - two cheers for p values). What one does with it could be flawed, and this discussion proposes one way to avoid flawed interpretation and reporting of p values.
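To make the point concrete, here is a minimal sketch (hypothetical z value) showing that the p value is nothing more than a tail probability under a Gaussian model, and that a Neyman-Pearson alpha level is simply a threshold applied to that same quantity:

```python
# Sketch only: a two-sided p value as a Gaussian tail probability, and
# alpha as a threshold on it. The test statistic is hypothetical.
from scipy.stats import norm

z = 2.1                            # hypothetical standardized test statistic
p = 2 * norm.sf(abs(z))            # two-sided tail probability under N(0, 1)

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)   # critical value implied by alpha

print(f"p = {p:.3f}")
# Deciding via |z| >= z_crit and deciding via p <= alpha are the same rule.
print(f"reject at alpha = {alpha}: {p <= alpha} (equivalently |z| >= {z_crit:.2f})")
```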
Suhail A. Doi
Yes, the p-value is a tail probability; it is a neutral number that cannot be good or bad or flawed. However, the p-value is a result of (or calculated based on) the logically flawed NHST. So, the problem is not with the "misuse" of NHST or p-values. Trafimow (2023) stated, “NHST is problematic anyway even without misuse.” And “There is practically no way to use them [p-values] properly in a way that furthers scientific practice.”
Trafimow D (2023). The story of my journey away from significance testing. In: A World Scientific Encyclopedia of Business Storytelling, pp. 95–127. DOI: 10.1142/9789811280948_0006
It is interesting that you mention David Trafimow, as he banned p values and confidence intervals in his journal Basic and Applied Social Psychology in 2015, but evaluations of research post-ban have shown that authors have simply ended up overstating results in the journal and evidence reporting in the journal has not improved.
As Lakens has said,
Article: The Practical Alternative to the p Value Is the Correctly Used p Value
"if we want to know which statistical approach will improve research practices, we need to know which questions researchers want to answer. Polarized discussions about which statistic we should use might have distracted scientists from asking ourselves what it is we actually want to know"
In my view, what we actually want to know is whether a newer treatment works to alleviate human suffering more than an older treatment. I could report an odds ratio of, say, 5, indicating a fivefold increase in the odds of a hypothetical "good" outcome. Say we have p = 0.7 associated with this odds ratio of 5 when assessed against a model whose parameter value is an odds ratio of 1. The question for readers of this discussion is: does knowledge of this p value give you any additional information about this effect size and the confidence you may or may not have in the efficacy reported for the new treatment? If your answer is no, please tell us why.
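For concreteness, here is a minimal sketch of how such a combination can arise; the 2x2 counts are hypothetical and chosen only for illustration, so the exact p will not match the 0.7 quoted above:

```python
# Sketch only: with tiny counts, a large sample odds ratio can still come
# with a large p value, meaning the data are also quite compatible with an
# odds ratio of 1. Counts are hypothetical.
from scipy.stats import fisher_exact

# rows: new treatment / old treatment; columns: good outcome / bad outcome
table = [[5, 1],
         [1, 1]]

odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(f"sample odds ratio = {odds_ratio:.1f}, Fisher exact p = {p:.2f}")
```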
Aside from the ubiquitous misinterpretation of "significant," and aside from the far more important question about bias in study design, and aside from the massive impact of power on the calculated p-value, and aside from the absurd notion of a yes/no cutoff (p = 0.051 is not different from p = 0.049) ... no, the p-value doesn't add anything. Because I don't care if the newer tx is not likely to be exactly the same as the control (and no two things are in fact exactly the same; use a large enough N and every comparison will yield a tiny p). I only care if the new tx lessens human suffering in a meaningful way (which is of course what effect size can begin to tell me).
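A minimal simulation sketch of that "large enough N" point (numbers hypothetical): a trivially small true difference yields a tiny p value once the sample is huge, even though the effect is clinically meaningless.

```python
# Sketch only: huge samples make even negligible differences "divergent".
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n = 500_000                       # hypothetical, very large groups
a = rng.normal(0.02, 1.0, n)      # true difference of 0.02 SD -- negligible
b = rng.normal(0.00, 1.0, n)

t, p = ttest_ind(a, b)
print(f"mean difference = {a.mean() - b.mean():.3f}, p = {p:.2g}")
```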
Indeed, a low p value in the context of a null hypothesis of no difference does tell us that "the newer tx is not likely to be exactly the same as the control". Therefore, unless one has a p value below a threshold one has chosen as indicating divergence, there is no point in doing further analyses to determine the clinical significance of the result; it just means we need more data or a meta-analysis.
To consider an odds ratio of 5 as indicating lesser human suffering, the first step has to be to address the uncertainty due to random error (p below whatever threshold you accept). Once we reach a threshold that satisfies us, we then need to proceed with other analyses to determine the clinical significance of the result. As you say, perhaps 5 is in our range of clinical indifference and means nothing, despite a low enough p value, because of a large sample size.
Not reporting the p value at all could be an option if we go directly to confidence intervals, but the p value is nevertheless inferred from them, so we are back to it (see the sketch of this correspondence after the quoted passage below). Avoiding both the p value and the confidence interval, as has been enforced in David Trafimow's journal, has done nothing useful except to open the playing field for authors who wish to overstate their results. As Lakens aptly said in The 20% Statistician about papers in David Trafimow's journal post-ban:
In their latest editorial, Trafimow and Marks hit down some arguments you could, after a decent bottle of liquor, interpret as straw men against their ban of p-values. They don’t, and have never, discussed the only thing p-values are meant to do: control error rates. Instead, they seem happy to publish articles where some (again, there are some very decent articles in BASP) authors get all the leeway they need to adamantly claim effects are observed, even though these effects look a lot like noise.
The absence of p-values has not prevented dichotomous conclusions, nor claims that data support theories (which is only possible using Bayesian statistics), nor anything else p-values were blamed for in science. After reading a year’s worth of BASP articles, you’d almost start to suspect p-values are not the real problem. Instead, it looks like researchers find making statistical inferences pretty difficult, and forcing them to ignore p-values didn’t magically make things better.
As far as I can see, all that banning p-values has done, is increase the Type 1 error rate in BASP articles. Restoring a correct use of p-values would substantially improve how well conclusions authors draw actually follow from the data they have collected. The only expense, I predict, is a much lower number of citations to articles written by Trafimow about how useless p-values are.
https://daniellakens.blogspot.com/2016/02/so-you-banned-p-values-hows-that.html
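A minimal sketch of the interval-to-p correspondence mentioned above (the estimate and standard error are hypothetical): under a normal approximation, the 95% interval excludes the null value exactly when the two-sided p is at or below 0.05, so the interval carries the same random-error information as the p value.

```python
# Sketch only: the confidence interval and the p value are two views of the
# same random-error calculation. Estimate and SE are hypothetical.
from scipy.stats import norm

estimate, se, null_value = 0.40, 0.22, 0.0   # e.g. a log odds ratio and its SE

z = (estimate - null_value) / se
p = 2 * norm.sf(abs(z))

z95 = norm.ppf(0.975)
ci_low, ci_high = estimate - z95 * se, estimate + z95 * se

print(f"p = {p:.3f}, 95% interval = ({ci_low:.2f}, {ci_high:.2f})")
print("interval excludes null:", not (ci_low <= null_value <= ci_high),
      "| p <= 0.05:", p <= 0.05)
```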
This preprint has now been reviewed by a prominent internal medicine journal and we had two reviewers - one positive and one negative. What is of interest to report here is that one of the reviewers felt we were advocating dichotomization or trichotomization of p values as a way to help alleviate misconceptions regarding p values and “statistical significance”, and the way this was stated finally made clear to us exactly what the problem was with the message we were delivering.
Yes, different intervals of p values have been named, but these are nothing more than qualitative labels for an interval. For example, p < alpha has the recommended qualitative label “statistically divergent” - this is just a name for a p value in this range and has nothing to do with advocating that we should categorize or bin p values. We are just saying that, since qualitative labels are the norm, please use a correct label that cannot be misinterpreted. We are not advocating categorization - we are advocating renaming the labels.
As an example in clinical medicine, serum hemoglobin below normal has qualitative labels: mild anemia (hemoglobin 10.0 g/dL to the lower limit of normal), moderate anemia (8.0 to 10.0 g/dL) and severe anemia (6.5 to 7.9 g/dL). Do we then say that clinicians are guilty of dichotomizing or trichotomizing hemoglobin? These are just convenient terms that immediately make the inference clear - shall I call the blood bank and say I need blood because the patient has a hemoglobin of 6, or because the patient has severe anemia? The 6 is already on the request - I do not need to repeat it, as the label is more efficient and conveys my meaning. Shall we stop labeling hemoglobin levels? That is a different question and not within the scope of our paper.
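To show how such labels are nothing more than names for intervals, here is a minimal sketch using the bands quoted above; the lower limit of normal and the label for values below 6.5 g/dL are assumptions for illustration only:

```python
# Sketch only: qualitative labels as names for intervals of a continuous
# measurement. Bands follow the text above; the lower limit of normal
# (12.0 g/dL) and the label below 6.5 g/dL are assumptions.
def anemia_grade(hb_g_dl: float, lower_limit_normal: float = 12.0) -> str:
    """Return a conventional qualitative label for a hemoglobin value (g/dL)."""
    if hb_g_dl >= lower_limit_normal:
        return "no anemia"
    if hb_g_dl >= 10.0:
        return "mild anemia"
    if hb_g_dl >= 8.0:
        return "moderate anemia"
    if hb_g_dl >= 6.5:
        return "severe anemia"
    return "below the quoted bands (graded more severely)"

print(anemia_grade(7.0))   # -> "severe anemia"
```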
Anyway, the reviews have led us to a major understanding of what was unclear, and we will be updating the preprint soon to clarify all of this.