Due to growing concerns about the replication crisis in the scientific community in recent years, many scientists and statisticians have proposed abandoning the concept of statistical significance and the null hypothesis significance testing procedure (NHSTP). For example, the international journal Basic and Applied Social Psychology (BASP) has officially banned the NHSTP (p-values, t-values, and F-values) and confidence intervals since 2015 [1]. Cumming [2] proposed the ‘New Statistics’, which mainly involves (1) abandoning the NHSTP and (2) estimating effect sizes (ES).
The t-test, especially the two-sample t-test, is the most commonly used NHSTP. Therefore, abandoning the NHSTP means abandoning the two-sample t-test. In my opinion, the two-sample t-test can be misleading; it may not provide a valid solution to practical problems. To understand this, consider a well-known example originally given in a textbook by Roberts [3]. Two manufacturers, denoted by A and B, are suppliers for a component. We are concerned with the lifetime of the component and want to choose the manufacturer that affords the longer lifetime. Manufacturer A supplies 9 units for a lifetime test; manufacturer B supplies 4 units. The test data give sample means of 42 and 50 hours, and sample standard deviations of 7.48 and 6.87 hours, for the units of manufacturers A and B respectively. Roberts [3] discussed this example with a two-tailed t-test and concluded that, at the 90% level, the samples afford no significant evidence in favor of either manufacturer over the other. Jaynes [4] discussed this example with a Bayesian analysis. He argued that our common sense tells us immediately, without any calculation, that the test data constitute fairly substantial (although not overwhelming) evidence in favor of manufacturer B.
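For readers who want to reproduce the test itself, here is a minimal R sketch of a two-sample t-test computed directly from the summary statistics above (I use the Welch form; Roberts’ original calculation may have used the pooled-variance form, so the numbers are only approximate):
mA = 42; sA = 7.48; nA = 9    # manufacturer A summary statistics
mB = 50; sB = 6.87; nB = 4    # manufacturer B summary statistics
se = sqrt(sA^2/nA + sB^2/nB)                              # standard error of the mean difference
t.stat = (mA - mB) / se
df = se^4 / ((sA^2/nA)^2/(nA-1) + (sB^2/nB)^2/(nB-1))     # Welch-Satterthwaite degrees of freedom
p.value = 2 * pt(-abs(t.stat), df)
round(c(t = t.stat, df = df, p = p.value), 2)
### about t = -1.88, df = 6.3, p = 0.11 (two-sided)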
For this example, in order to choose between the two manufacturers, what we really care about is (1) how likely is it that the lifetime of a manufacturer B component (an individual unit) is greater than the lifetime of a manufacturer A component? and (2) on average, by how much is the lifetime of manufacturer B’s components greater than the lifetime of manufacturer A’s components? However, according to Roberts’ two-sample t-test, the difference between the two manufacturers’ components is labeled as “insignificant”. This label does not answer these two questions. Moreover, the true meaning of the p-value associated with Roberts’ t-test is not clear.
I recently revisited this example [5]. I calculated the exceedance probability (EP), i.e. the probability that the lifetime of a manufacturer B component (an individual unit) is greater than the lifetime of a manufacturer A component. The result is EP(XB>XA)=77.8%. In other words, the lifetime of manufacturer B’s components is greater than the lifetime of manufacturer A’s components at odds of 3.5:1. I also calculated the relative mean effect size (RMES). The result is RMES=17.79%; that is, the mean lifetime of manufacturer B’s components is greater than the mean lifetime of manufacturer A’s components by 17.79%. Based on the values of the EP and RMES, we should prefer manufacturer B. In my opinion, the meaning of the exceedance probability (EP) is clear and free of confusion; even a person not trained in statistics can understand it. The exceedance probability (EP) analysis, in conjunction with the relative mean effect size (RMES), provides the valid solution to this example.
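For transparency, here is a rough plug-in sketch of the EP under a normality assumption, treating the summary statistics above as if they were the true parameters. This is not the exact estimator used in [5], so the numbers differ slightly from the 77.8% and 3.5:1 quoted above:
mA = 42; sA = 7.48    # manufacturer A
mB = 50; sB = 6.87    # manufacturer B
EP = pnorm((mB - mA) / sqrt(sA^2 + sB^2))   # P(XB > XA) for independent normal lifetimes
round(EP, 2)             ### about 0.78
round(EP / (1 - EP), 1)  ### odds of roughly 3.6:1 in favor of B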
[1] Trafimow D and Marks M 2015 Editorial. Basic and Applied Social Psychology 37(1) 1-2
[2] Cumming G 2014 The New Statistics. Psychological Science 25(1). DOI: 10.1177/0956797613504966
[3] Roberts N A 1964 Mathematical Methods in Reliability Engineering. McGraw-Hill Book Co. Inc., New York
[4] Jaynes E T 1976 Confidence intervals vs Bayesian intervals. In: Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science, eds Harper and Hooker, Vol. II, 175-257. D. Reidel Publishing Company, Dordrecht, Holland
[5] Huang H 2022 Exceedance probability analysis: a practical and effective alternative to t-tests. Journal of Probability and Statistical Science 20(1) 80-97. https://journals.uregina.ca/jpss/article/view/513
Why do you post this question after you have published a paper saying that you find the EP more useful? What's the point of this discussion here? Are you willing to retract your paper, or is it to promote your paper?
Most of what you write in the post above is a bit weird (I am NOT saying that you are weird! It's NOT meant in any insulting way!).
"In my opinion, the two-sample t-test can be misleading; it may not provide a valid solution to practical problems." -- Well, we don't need to recapitulate that the p-value very often misunderstood and misinterpreted. That's not the fault of the p-value. It's the fault of a miserable statistical education. And of course does it not provide a valid solution to any kind of practical problem a researcher might want to have one. But it does provide a valid solution to that particular practical problem it was invented for: to provide a standardized statistical measure of data sufficiency.
"We are concerned with the lifetime of the component..." -- this is a considerably strange example to illustrate the t-test. Lifetime is way better modelled with random variables having a Weibull, Gamma or logNormal distribution.
"He [Jaynes] argued that our common sense tell us..." -- this is ridiculus. I don't know the original context, and afaik is Jaynes a very smart person from which I would not expect statements like this. We all do statistics to calibrate our "common sense", to put our hopes and recognized patterns back to what the data can reasonable tell us.
"Moreover, the true meaning of the p-value associated with Roberts’ t-test is not clear." -- Why? The p-value has a very precise definition/meaning. Again, it's not the p-value's fault that you would like to use it for some different purpose. The data provided an estimate of -8 (the mean difference A-B of the lifetimes). It's negative, so the data favors B (longer lifetimes). If this was it to come to a conclusion, why using samples of 9 and 4 units? It would have been sufficient to just have a single unit per sample, so one can see if the difference is positive or negative. Hey, but n=1 is not sufficient, I hear you say. Of course. But how large a sample would be sufficient? I don't know, and you don't know either. So the best one can do ist to take samples as large as possible (feasable, within the limits of time and budget) and then check somehow if the available data can be considered sufficient. This somehow is what we do with the p-value.
Your exceedance probability is in fact something that would be interesting, but as you correctly write in your paper: we don't know it. Then, in a very nonchalant way, you substitute the required but unknown parameter vector by its estimate and act as if this were the truth. But it is not. It can be a terribly imprecise estimate, rendering the whole story about the percentages you calculate useless. But it may also be sufficiently precise to make it useful enough to at least say whether the EP is larger or smaller than 50%, say. But how to check the data sufficiency? I hope you know the answer.
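To illustrate how imprecise such a plug-in estimate can be with n = 9 and n = 4, here is a small simulation sketch that pretends (only for this illustration) that the estimated parameters are the true population values:
set.seed(1)
ep.hat = replicate(10000, {
  a = rnorm(9, 42, 7.48)    # a new sample of 9 units from "manufacturer A"
  b = rnorm(4, 50, 6.87)    # a new sample of 4 units from "manufacturer B"
  pnorm((mean(b) - mean(a)) / sqrt(var(a) + var(b)))   # plug-in EP from each simulated pair of samples
})
round(quantile(ep.hat, c(0.025, 0.5, 0.975)), 2)
### a rather wide spread of plug-in EP values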
Jochen Wilhelm First of all, thanks for your comments. The purpose of this discussion is to draw people's attention to the issues with t-tests and the associated p-values in practical problems. The example I give is from a textbook by Roberts (1964). Similar examples can be found in many other textbooks. These examples provide a way to teach students the t-test paradigm. Students are taught that t-tests can be used to help them make a decision when the sample size is small.
Regarding your comment: ‘"He [Jaynes] argued that our common sense tell us..." -- this is ridiculous. I don't know the original context, and as far as I know Jaynes was a very smart person from whom I would not expect statements like this.’ Please see the original context in Jaynes (1976) through the link: https://bayes.wustl.edu/etj/articles/confidence.pdf. Jaynes (1976) also stated, “… the merits of any statistical method are determined by the results it gives when applied to specific problems.”
Regarding your comment: “Well, we don't need to recapitulate that the p-value is very often misunderstood and misinterpreted. That's not the fault of the p-value. It's the fault of a miserable statistical education. And of course it does not provide a valid solution to every kind of practical problem a researcher might want one for.” My argument is: if a statistical method is so often and easily misunderstood and misinterpreted, and even our schools cannot teach it properly, then there must be something wrong with the method. I hope this discussion gets people thinking about what is really wrong with the t-test and the associated p-value.
Dear Hening Huang ,
I fully agree with you and am glad to see this topic being addressed. The automatic application of the t-test, as well as of other tests, without a theoretical foundation leads to errors. This is obviously due to a lack of understanding and therefore to insufficient statistical education of the researcher. Any scientist conducting experimental research on a sample basis should be able to interpret the results, while he or she does not have to be, and cannot be, an expert statistician. Cooperation and support in this area are required.
Erroneous t-test results occur not only with small samples: in the case of an extremely large sample, almost any null hypothesis will be rejected; the test is too sensitive.
Hanna Mielniczuk , regarding that tests on extremely large samples are too sensitive:
Technically, they are not "too sensitive". The distribution of p-values under the null remains uniform, independent of the sample size. That a huge sample almost necessarily provides more than sufficient information is also expected and something the test correctly indicates. Sometimes, testing data sufficiency in huge samples is simply not a sensible thing to do; estimation (including precision) and interpreting effect sizes (including precision) should be the way to go.
The actual problem with huge samples is that the test will reject the null for any reason that makes the data seem much less expected under the null than under the alternative. This includes any mis-specification of the statistical model (distribution model, functional relationship between predictors and response, missing predictors and interactions, etc.). The tests are typically not very sensitive to mild mis-specifications, so rejecting the null based on some "small" sample can be attributed to the low statistical compatibility of the data with the null, also in the correctly specified model. But for a huge sample, the data could be well compatible with the null in the correctly specified model but not at all with the mis-specified model.
But here one could also argue that the selected model sets the stage, and that all conclusions drawn based on the data relate to this stage and are interpretable only on that stage. The stage itself is not and cannot be under question in a statistical analysis. This question comes before and goes beyond statistics. It needs expert knowledge in the subject matter and involves creativity.
Dear Jochen Wilhelm ,
thank you very much indeed for your discussion.
You give a comprehensive explanation of the fact that for huge samples the test will reject the null for whatever reason makes the data seem much less expected under the null hypothesis than under the alternative.
Jochen Wilhelm Regarding your statement: “The distribution of p-values under the null remains uniform, independent of the sample size.” Are you saying that the p-value is not a function of the sample size in the t-test? Please make it clear. It is well known that p-values decrease with increasing sample size. It is also well known that the sensitivity of p-values to the sample size can lead to the so-called “p-hacking”.
Dear Hening Huang , note the part "under the null" in my statement.
I don't understand your statement that "the sensitivity of p-values to the sample size can lead to the so-called “p-hacking”.". -- It's just the desired property that p-values are sensitive to the sample size under the alternative. P-values are (statistical) measures of data sufficiency, and this requires a sensitivity to the sample size.
"P-hacking" is somehing else. It's to try getting a low p-value by repeating small experiments until one happens to give a small p, or to screen alternatives of which one may give a small p, or to try out different statistical models using different sets of predictors,transformations, functional relationships and interactions - and just presenting the "significant" trial(s). These actions pretend a different null, and under the null pretended the p-values are not uniformly distributed (the correct null includes all the trials, and a p value must be calculated for the whole set of trials, not for each trial independently - a remedy is to control the family-wise error rate rather than the test-wise error rate).
Jochen Wilhelm You listed several forms of p-hacking. However, a very common (and very effective) form of p-hacking is missing from your list; it is called “N chasing” (please see https://dustinstansbury.github.io/theclevermachine/p-hacking-n-chasing). A “scientific discovery” can be guaranteed by t-tests and “N chasing”, which is one of the main reasons for many false scientific discoveries. Therefore, many scientists have been calling for the abandonment of the NHSTP (p-values, t-values, and F-values). Again, if the t-test paradigm is often and easily misunderstood and misinterpreted, and easily hacked, there must be something fundamentally wrong with the t-test paradigm. While the debate about the validity of the NHSTP continues, one thing is certain: the NHSTP and t-tests have failed the test of time after about 100 years.
Hening Huang , it is correct that adding observations until a p-value happens to fall below some threshold is a form of p-hacking. But the argument you gave is wrong. The reason is not that p-values are "too sensitive" to the sample size. The reason is that, under the null, p-values are uniformly distributed: they jump around between 0 and 1. If you follow the p-value while adding more and more data - under the null - you see a random walk of the p-value. It will sometimes go below some lower cut-off value, and if you continue adding data it will rise again above that value. This kind of p-hacking is essentially the same as doing a lot of small experiments (trials), selecting the one that gave a sufficiently small p-value, and not mentioning or reporting the rest of the trials made.
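A quick simulation sketch of that random walk (one possible way to illustrate it, with both groups drawn from the same population so the null is true):
set.seed(1)
hits = replicate(2000, {
  x = rnorm(100)    # group 1; the null is true: both groups share the same population
  y = rnorm(100)    # group 2
  p = sapply(5:100, function(n) t.test(x[1:n], y[1:n])$p.value)   # re-test after every added pair
  any(p < 0.05)     # did the wandering p-value ever dip below 0.05?
})
mean(hits)
### well above the nominal 5% false-positive rate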
There are group sequential designs that allow to peek into the data before the study is finished. This is achieved by some alpha-spending function. See e.g. https://jamanetwork.com/journals/jama/fullarticle/2784821
Regarding your statement "Again, if the t-test paradigm is often and easily misunderstood and misinterpreted, and easily hacked, there must be something fundamentally wrong with the t-test paradigm." -- It's not very clear what you mean by "paradigm" here. We do agree that the procedures for using and interpreting t-tests in stats books and courses (especially those starting with "Statistics made easy", "Statistics for dummies", "Statistics for biologists", "Statistics without formulas", etc.) do a lot of harm. But your conclusion is like saying there must be something wrong with the wheel because so many people die in traffic every year.
None of this is actually surprising if you understand the relationship between p-values and reproducibility. A single p-value doesn't have the magical authority to reject or accept a null hypothesis and thereby settle the scientific hypothesis in question. A picture is attached from Jim Frost's ebook "Hypothesis Testing: An Intuitive Guide for Making Data Driven Decisions" (p. 99 of 394) that describes the relation between p-values that float just around the magical 0.05 number and the reproducibility rates associated with such studies. Also, to borrow a quote from a paper that describes this phenomenon: "Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than was variation in the characteristics of the teams conducting the research (such as experience and expertise)." (Estimating the reproducibility of psychological science: https://www.researchgate.net/publication/281286234_Estimating_the_reproducibility_of_psychological_science)
Only a poor artist blames his tools!
Hening Huang Your question "Does the two-sample t-test provide a valid solution to practical problems?" indicates a fundamental misunderstanding. This is not your fault; it is how statistics is usually taught. The t-test is introduced and then followed by examples of 'practical' problems. The t-test is not a practical test. It does not care about the height of children in the third-grade class. The t-test evaluates data adequacy under the given assumptions.
Your statements about the t-test are a reflection of what and how most have learned about the t-test. We learned about the t-test the way the 7 visually-impaired-persons learned about the elephant. None reported the elephant as grey (the color of the t-test.)
The t-test was doomed for misunderstanding when the first journal declared the p-value was the price for publishing.
The lowest level in Dante's inferno shall be populated by those who have declared the NHSTP the keystone to the scientific method.
"the difference between the two manufacturers’ components is labeled as “insignificant”"...
Doesn't the conclusion "insignificant" depend on the selected (arbitrary?) threshold? Also, I understand that the two-sample t-test is measuring whether a difference between two distributions is due to noise (or not), which means the validity of the conclusion "insignificant" also depends on how well the noise model used (the t-distribution in this case) aligns with reality.
A well-known limitation of this test is that it cannot determine whether a "significant" difference is due to an inaccurate noise model, a signal, or an unlucky noise draw. At the risk of stating the obvious, a good noise model will produce a uniform distribution of p-values when there is no signal, meaning some noise will appear significant. A bad noise model could be biased either way--as "significant" or "insignificant"--when there is no signal. When there is a signal, assuming the problem was set up correctly, the p-value is biased towards "significant"--i.e., its p-values are not uniformly distributed, given a good noise model--to a degree that depends on the SNR.
Now, if you have to pick one distribution as favorable over the other (and not both as you imply), then that's a different situation than just determining the significance of the difference, and you might not need a t-test. You might get by with using different criteria such as, which has the best mean, or the least extreme worst case, or other.
At some point the difference between two distributions might be considered so minor/insignificant that it makes more sense to use the one that is cheapest to procure/work with (assuming there is a difference in cost). This is a judgement call IMO.
To answer your question: does the two-sample t-test provide a valid solution to practical problems? Yes, given the validity of certain assumptions, and depending on the context.
The t-test is but one tool that might help achieve an objective, but the objective determines whether it makes sense to use it or abide by its conclusion.
Jochen Wilhelm Can Kiessling, and Joseph L Alvarez Sorry for the late response. I was out of town for 10 days. I would like to briefly describe my background and my work so that you can understand why I challenge t-based inference methods such as t-tests and t-interval methods for measurement uncertainty estimation. As you probably already know, I am not a statistician. My undergraduate study was civil engineering and my PhD study was hydraulics. In my work over the last 20 years, I have processed thousands of small samples collected in river flow measurements. Our customers, such as the US Geological Survey, Environment Canada, and the Yangtze River Conservation Committee, deal with small samples of river flow measurements on an almost daily basis. We usually make multiple observations (2, 4, or 6) and take the sample mean as the measured discharge (flow rate). We know that the measurement uncertainty (or random error or noise) decreases according to the -1/2 power law. On the one hand, we want to ensure that the measured discharge is precise (bias is considered separately), so we need to make a sufficient number of observations. On the other hand, we don’t want to overdo it, i.e. we don’t want to make too many observations (to minimize the field work). Therefore, we have established a measurement quality control criterion in terms of a ‘maximum permissible relative expanded uncertainty’ at the 95% level. However, we found that the t-interval method gave unrealistic estimates of the expanded uncertainties (at the 95% level) when the sample size was small, leading to a high false rejection rate. Therefore, the t-interval method for measurement uncertainty estimation is invalid for uncertainty-based measurement quality control. In contrast, the mean-unbiased estimator method based on the Central Limit Theorem can provide realistic expanded uncertainties and can be used for measurement quality control. The mean-unbiased estimator method was recently adopted in the ISO standard for streamflow measurements with acoustic Doppler current profilers (ISO 24578:2021(E), Hydrometry — Acoustic Doppler profiler — Method and application for measurement of flow in open channels from a moving boat, first edition, 2021-03).
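For context, here is a small sketch comparing only the 95% coverage factors involved; it is not a reproduction of the ISO 24578 procedure or of the mean-unbiased estimator method, just an illustration of why the t-interval looks so wide for the 2, 4, or 6 observations typical of these measurements:
n = c(2, 4, 6)                                          # typical numbers of discharge observations
data.frame(n = n,
           t.factor = round(qt(0.975, df = n - 1), 2),  # coverage factor used by the t-interval method
           z.factor = round(qnorm(0.975), 2))           # coverage factor suggested by the Central Limit Theorem
### n = 2: t.factor = 12.71 vs z.factor = 1.96
### n = 4: t.factor =  3.18 vs z.factor = 1.96
### n = 6: t.factor =  2.57 vs z.factor = 1.96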
Since I didn’t learn t-tests in school, I am not committed to the t-test paradigm. I accepted the mean-unbiased estimator method at work intuitively, or with common sense. When I first came across the t-interval method for measurement uncertainty estimation and t-tests in 2005, I thought they were strange, even “weird”. I admit I don’t understand why people use t-tests to compare two samples; the logic behind the two-sample test does not work for me. For me, comparing two samples is straightforward: it just requires answering two questions: (1) how much do the two samples differ on average (i.e. what is the difference between the two sample means)? and (2) what are the odds that sample A is larger (or smaller) than sample B? The two-sample t-test simply cannot answer these two questions. Instead, it answers the question: “is the difference between the two samples significant?” However, this question is meaningless to practitioners. So, to me, the two-sample t-test is misleading in the first place: the problem setting is wrong; it is not just misunderstood or misinterpreted by practitioners.
I think it’s worth mentioning that, prior to Student (William Sealy Gosset), the probable error (i.e. expanded uncertainty) was estimated with a method based on the maximum-likelihood estimator (MLE) of the population standard deviation. This MLE-based method significantly underestimates the probable error for small samples: the relative difference between the probable error estimated by the MLE-based method and the true probable error is -43.6%, -20.2%, and -7.7% at n=2, 4, and 10, respectively. I think this underestimation problem was what Student was trying to solve, but his solution, based on the t-distribution he invented, turned out to overestimate the probable error (i.e. expanded uncertainty). Also, it is interesting to note that, according to Ziliak and McCloskey (2004) [Significance redux. The Journal of Socio-Economics 33: 665-675], “Student used his t-tables a teensy bit…” They said, “We have learned recently, by the way, that “Student” himself—William Sealy Gosset—did not rely on Student’s t in his own work.” Ziliak and McCloskey (2008), in their book entitled "The Cult of Statistical Significance", addressed the logical flaws inherent in statistical significance tests such as t-tests.
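The percentages just quoted can be checked with a short sketch, using the fact that for a normal sample the expected value of the MLE of sigma is c4(n)*sqrt((n-1)/n)*sigma; since the probable error is a fixed multiple of sigma, its relative bias is the same:
c4 = function(n) sqrt(2/(n - 1)) * gamma(n/2) / gamma((n - 1)/2)   # E[s] = c4(n) * sigma
rel.bias = function(n) c4(n) * sqrt((n - 1)/n) - 1                 # relative bias of the MLE of sigma
round(100 * sapply(c(2, 4, 10), rel.bias), 1)
### -43.6 -20.2 -7.7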
I think it should be emphasized that statistical methods are man-made ‘tools’ for scientific research. Since humans can make mistakes, a statistical method may be flawed. Using flawed tools may be one of the reasons leading to false discoveries in science, as addressed in the popular paper by Ioannidis (2005), “Why Most Published Research Findings Are False” (11,748 citations as of January 5, 2023) (http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124). To avoid making false discoveries, flawed ‘tools’ should not be used. This is why Basic and Applied Social Psychology (BASP) officially banned the null hypothesis significance testing procedure (NHSTP) (p-values, t-values, and F-values). Norman Matloff, a professor at the University of California, Davis, deliberately excludes the t-distribution and t-interval from his statistics textbook. He said “…… I advocate skipping the t-distribution, and going directly to inference based on the Central Limit Theorem.” (https://matloff.wordpress.com/2014/09/15/why-are-we-still-teaching-about-t-tests/).
According to Jaynes E T (2003) [Probability Theory: The Logic of Science, Cambridge University Press, p. 758], a paradox is “something which is absurd or logically contradictory, but which appears at first glance to be the result of sound reasoning.” Also, “A paradox is simply an error out of control: i.e. one that has trapped so many unwary minds that it has gone public, become institutionalized in our literature, and taught as truth.” In this regard, the two-sample t-test is a paradox.
Jochen Wilhelm I use the term “the t-test paradigm” to emphasize that t-tests are the standard and the rule of current scientific research. According to Thomas Kuhn, “Men whose research is based on shared paradigms are committed to the same rules and standards for scientific practice.” However, I think the shift from “the t-test paradigm” to “the estimation paradigm” is underway in the scientific community, although this paradigm shift may take decades.
Can Kiessling I have a different perspective regarding your statement: “Only a poor artist blames his tools!” I think it is extremely important to have the right tools for the job in scientific research. Unlike an artist, who may be able to paint a picture using an inferior brush, scientists rely on the “right statistical tools” in their research. A flawed statistical tool can damage scientific research. According to Ziliak and McCloskey (2008), “Statistical significance is surely not the only error in modern science, although it has been, as we will show, an exceptionally damaging one.” However, it has been my impression over the years that, whenever something goes wrong with the application of statistical methods, practitioners are blamed for misunderstanding or misinterpreting these statistical methods (e.g. t-tests). This is not fair. We should not ignore the problems of the statistical methods themselves. Siegfried (2010) wrote, “It’s science’s dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation.” Wire (2013) cited “numerous deep flaws” in null hypothesis significance testing. Siegfried (2014) stated that statistical techniques for testing hypotheses “…have more flaws than Facebook’s privacy policies.” So, I share the opinion of some scientists and practitioners that we should admit that the two-sample t-test is a flawed tool and therefore should not be taught in schools or used in practice.
Hening Huang I think I detect a fundamental misunderstanding of what statistical tests are and what they are not. I'll put it simply: statistical tests are stupid. They know nothing about the real world or the nature of the variable you're trying to study. They will take whatever input you give them, apply an algorithm, and give an output. They cannot tell you whether your result is right or wrong, whether your finding is valid or invalid, or whether your decision is correct or incorrect.
Statistical tests operate exclusively in the world of the model, not the real world the model is intended to represent. And this is true of any statistical approach a researcher might use - not just the t-test. The wall separating statistical results and practical knowledge is an impasse that simply cannot be climbed over. In other words, all statistical approaches are probabilistic, entail assumptions, and contain some degree of error. The question "does the two-sample t-test provide a valid solution to practical problems" is an ill-formed question. It should be modified to "does the two-sample t-test provide a USEFUL solution to THIS practical problem"?
"All models are wrong, but some models are useful" - George Box
Blaine Tomkins although I agree with your post and appreciate Richard McElreath, I think the quote is originally by George Box ;-)
Rainer Duesing Thanks for the correction. There was a second error also: it should read "wrong", not "false".
@Hening Huang
To add on to what I wrote above, that a statistical test A produces less error for data D than statistical test B does not imply statistical test A is a 'better' or 'more valid' test in general.
Hening Huang Statistics is a tool box. There are no answers in the box. Statistical tests are tools. The tool you use should be appropriate to the task. (I once saw a window installer using a caulk gun as a hammer. Yes. He broke the window.) Do not use the t-test if it does not work for you. There is no need to forbid its use or teaching by others. Do encourage the use of knowing what the test does, how to use it, and when to use it.
The scientific method is not based on statistics, despite claims to the contrary.
When you set out to make measurements you must first know why. You have to know why so you can answer what. What in this sense is what do I need to adequately answer why. What is the practical answer. It is not a statistical answer. We do not care if A is statistically larger than B, if practically A must be 3 times larger than B.
The knowing of why and what means you know the (needed) answer before you start.
You can now work on how to sample and measure. How is often a major problem. How may be specified by regulation, standards, or precedent. It can be the only method at hand. It may be required by the client. How must be capable of getting back to why and what.
I think there are a lot of interesting things in this thread...
The first comment I have is on the streamflow measurements. This might merit its own post. ... I'm not clear on the situation described in the responses, but I imagine that a t-test isn't the best tool to assess the variance or the needed sample size for streamflow measurements, because they are likely skewed in distribution, and the variance is probably related to the magnitude of the measurement. That is, low-flow measurements would be common and have relatively low variance, but measurements during high-flow times would likely have high variability. ... But I'm glad there are better methods out there to address this situation.
The second comment I'd like to make is on the Roberts two manufacturers example. I couldn't find the original data, but I made up some data that fits the summary statistics very well.
My comment is that I absolutely disagree with the idea that it's obvious that there's a difference between the measurements from the two Manufacturers. Essentially, with only 4 observations for B, we don't have great evidence about the difference, or particularly, the performance of B.
This is really the point of the t-test.
In the data I invented, one could imagine that if the one high data point in B were lower --- which is very possible with random sampling, we would come to a different conclusion. If I were gambling with money or people's lives or anything meaningful, I wouldn't be willing to make any conclusion.
Below is the data I used, R code, results, and a plot.
Note that the t-test and Wilcoxon-Mann-Whitney test have similar results.
Also note that Vargha and Delaney's A (0.806) is similar to the EP result in the original post. I tried to get a confidence interval for the A statistic, and --- with the small sample size --- couldn't get anything particularly useful. But if I do force it, the confidence interval is very wide, and doesn't justify the conclusion that one manufacturer has higher observation than the other.
if(!require(FSA)){install.packages("FSA")}
if(!require(rcompanion)){install.packages("rcompanion")}
if(!require(ggplot2)){install.packages("ggplot2")}
library(FSA); library(rcompanion); library(ggplot2)
# Invented data matching the summary statistics of the Roberts example
A = c(45, 50, 42, 29, 38, 46, 53, 35, 40)
B = c(47, 43, 59, 51)
Data = data.frame(Y = c(A, B), Group = factor(c(rep("A", 9), rep("B", 4))))
Summarize(Y ~ Group, data = Data)
### Group n mean sd       min Q1 median Q3 max
### A     9 42   7.483315 29  38 42     46 53
### B     4 50   6.831301 43  46 49     53 59
t.test(A, B)
### Welch Two Sample t-test: t = -1.8915, df = 6.3735, p-value = 0.1046
ggplot(Data, aes(x = Group, y = Y)) +
  geom_dotplot(binaxis = 'y', stackdir = 'center')
wilcox.test(A, B)
### Wilcoxon rank sum exact test: W = 7, p-value = 0.1063
vda(x = B, y = A, ci = TRUE, reportIncomplete = TRUE)
### VDA = 0.806, lower.ci = 0.452, upper.ci = 1
Sal Mangiafico Thanks for your input. Your analysis of the Roberts two-manufacturers example follows exactly the standard procedure of the two-sample t-test as described in many statistics textbooks. If we stick to the t-test paradigm, the conclusion would be “no significant evidence in favor of either manufacturer over the other”. However, as Jaynes [4] wrote, “I think our common sense tell us immediately, without any calculation, this [dataset] constitutes fairly substantial (but not overwhelming) evidence in favor of manufacturer B.” And, “… any statistical procedure which fails to extract evidence that is already clear to our common sense, is certainly not for me!” (Laplace’s famous quote: “Probability theory is nothing but common sense reduced to calculation”). On the other hand, if we stick to the estimation paradigm, we can use the given datasets to estimate the location and scale parameters of the underlying distributions for each sample. So, I think we need to decide which paradigm is more suitable for this practical problem. By the way, your result for Vargha and Delaney's A (0.806) is consistent with the common language effect size (CL): 0.791.
In my experience, people refer to "common sense" when they do not have any good arguments/evidence to back up their claims. The hope is that everyone else (or at least the majority) will agree (a kind of rhetorical trick). But an individual's belief about what "common sense" is can vastly differ between individuals (look at politics, where every party believes that its point of view is correct and that this should be common sense). Therefore, we need some mechanics to back up our claims. CLES or VD_A are also nice tools, but they do not tell us anything about the uncertainty, for which we use other tools to estimate a) the amount of information in the data and b) with some wrangling, the uncertainty of the estimates.
I do not advocate relying solely on t-tests or other models, but trying to understand the data itself, for which all proposed methods are just tools. Where does the idea come from that we should use EITHER a t-test OR CLES OR visual inspection, instead of a t-test AND CLES AND visual inspection (and maybe other tools)?
Data out of context has no meaning. There is no common sense to apply.
Teaching statistics with word problems gives the appearance of context, but, in reality, the words just put names on columns of data churned through the statistical test. Graphing the data allows the application of more common sense than a statistical test. Nevertheless, lack of context prevents a common-sense approach beyond a gut feel.
You must know how the data were collected and under what conditions. If one data set was taken under field conditions and the other under laboratory control, Simpson's paradox looms. If one data set was taken for a specific reason, while the other to answer a different question, similar controls cannot stave-off Simpson.
Many fields of study accept p-values from parametric or nonparametric tests as proof, because constraints preclude conduction of a definitive study. Proof may be more realistically termed as 'indicates need for further study' in the publication and less realistically as 'major breakthrough' in the press release.
My point was that I think Jaynes is wrong. It's not just the judgement of the *t*-test. It's that if we stop and look at the data critically, we shouldn't have much confidence that the products from Manufacturer B are better than those from Manufacturer A. There are only four observations from B, and if we change the value for one point, we have a different picture. We could come to that conclusion based on intuition, but that's also why statistical tests are useful: because we don't always look critically at the data in front of us.
Likewise, I wouldn't have much confidence in the point estimate for CLES or VDA or a similar statistic. The point estimates are suggestive, but if we get a confidence interval for these statistics, it's obvious that we shouldn't have much confidence in the size of the effect.
Sal Mangiafico Everything you said, except "but that's also why statistical tests are useful: because we don't always look critically at the data in front of us." The test cannot come to a conclusion even if we look critically at the data in front of us. The test is for some aspect of data quality, depending upon the test. Meeting a data quality objective does not mean we have what is needed to make a practical decision.
Joseph L Alvarez , agreed. My comments are really just on the step of assessing the data itself. Not even thinking about any practical conclusions.
Sal Mangiafico The p-value from the two-sample t-test is also a point estimate and has high uncertainty when the sample sizes are small. A very big difference between the p-value and the exceedance probability (EP) P(B>A), or CLES, or VDA, is that the p-value is a function of the sample size; it decreases as the sample size increases. For this example, the sample size of the original data is n=9 for manufacturer A and n=4 for manufacturer B. The resulting two-sample t-test concluded “there is no significant evidence in favor of either manufacturer over the other”. If we assume that the sample sizes for both manufacturers A and B are 30, the p-value from the two-sample t-test would be smaller and the conclusion would be “there is significant evidence in favor of manufacturer B over manufacturer A”. However, the difference between the two sample means (effect size) will be about the same regardless of the sample size because the effect size is not a function of the sample size; it may only slightly fluctuate with the sample size. Therefore, the two-sample test is misleading or meaningless. That is why I think the two-sample t-test is methodologically invalid; it is not a valid inference tool. In contrast, EP or CLES or VDA is only a very weak function of the sample size; it does not change much (it may only fluctuate slightly) with different sample sizes; it is a valid tool for inference.
I refer to this part (I shortened it to make the point clear):
"If we assume that the sample sizes for both manufacturers A and B are 30,[...] the [...] (effect size) will be about the same [...] Therefore, the two-sample test is misleading or meaningless."
You are mixing concepts, and therefore you come to a wrong conclusion.
"Effect size" is used to indicate a (statistical) population parameter as well as a sample statistic. The value of the population parameter is unknown. Data should provide some information about its value. The sample statistic is calculated from the data as an estimate. I hope you aggree that this does not reveal the complete and perfect information about the population value. And that you further aggree that the amount of information provided increases with the amount of (independent) data in the sample. But how to quantify this information, since we still don't know the population value?
The only chance is to statistically compare the estimate to some hypothetical value and see how likely larger differences between the estimate and the hypothetical value would be if the population value were, hypothetically, that hypothetical value. Finding that this probability is really small is taken as an indication that we may believe the data contain enough information to statistically distinguish the estimate from the hypothetical value. We can also distinguish the estimate from all other hypothetical values that are more distant. Since the estimate is either smaller or larger than the hypothetical value, we know that the estimate is statistically distinguishable from all hypothetical values on one side of the hypothetical value we tested. So we still don't know the population value, but we can now say that we believe the estimate is on the correct side of the hypothetical value. If the hypothetical value is zero, this means that the estimate has the correct sign.
While increasing the sample size, the estimate will vary and, when the population value is zero, will jump between positive and negative values. There is per se no reason to believe that the estimate will still favour the same manufacturer when you take a larger sample. This is something we cannot say! The t-test is a method that tries to judge the amount of relevant information provided by the data. As the t-test also cannot know the population value, it gives only statistical advice, saying whether or not an interpretation of the estimate (e.g. regarding its sign) is too dangerous.
If the population value is exactly zero, then exactly half of the time in such an experiment the estimate will be positive, and in the other half it will be negative. In this case, interpreting the sign of the estimate would make no sense - but now imagine the population value is only very slightly positive (compared to the variance of the data). Still, about half of the estimates would be negative, so in every second experiment (on average) the estimate will have the wrong sign.
If the population value is very much different to zero (compared to the variance of the data), then most or even all estimates will have the same (correct) sign.
The t-test compares the estimate to the estimated standard error (a function of the variance of the data and the sample size) via the t statistic and provides Pr(|T|>|t|).
Hening Huang you are right that effect sizes are not directly a function of sample size, BUT they are influenced by measurement error and sample size. As Loken and Gelman (2017) demonstrated, small samples will tend to overestimate the true effect. Therefore, your statement "However, the difference between the two sample means (effect size) will be about the same regardless of the sample size because the effect size is not a function of the sample size" is not true without additional assumptions (measurement without error). In small samples you should rather expect that the effect is overestimated, not that it is constant with increasing sample size.
Otherwise we should recommend drawing rather small samples to save resources, shouldn't we?
Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584-585.
Rainer Duesing I agree with you that the statement "However, the difference between the two sample means (effect size) will be about the same regardless of the sample size because the effect size is not a function of the sample size" is not true without additional assumptions (measurement without error). Since we focus on the two-sample t-test in this discussion, the assumption is that the samples are drawn from normal distributions. Since the normal distribution is also called “the law of distribution of error”, it should already contain errors. I read the paper by Loken and Gelman (2017). They stated, “Measurement error can be defined as random variation, of some distributional form, that produces a difference between observed and true values (1).” However, it is unclear what kind of distributional form of errors is used in their simulation to generate their scatter plots. It is also unclear how they come up with the “estimate”. Do they take the sample mean for N=3000 in the high-powered study and N=50 in the low-powered study? I don’t understand why the estimates from the high-powered study and the estimates from the low-powered study have nearly the same degree of scattering. Regardless, I don’t think their simulation results and conclusions apply to our discussion of the two-sample t-test.
Let me restate my concern about the two-sample t-test. Suppose we have two samples, A and B, with the same size n. When n is small, say n=5, the two-sample t-test gives “there is no significant difference between the means of the two samples.” When n is large, say n=100, the two-sample t-test gives “there is a significant difference between the means of the two samples.” Thus, a two-sample t-test with “N chasing” (an effective form of p-hacking) can guarantee “statistical significance”. However, the sample mean is an unbiased estimate of the population mean; it is not a function of the sample size, although it may slightly fluctuate with sample size and may have a high standard error when the sample size is small. Therefore, the two-sample t-test is not “self-consistent”; it is contradictory. I think this is a fundamental flaw of the two-sample t-test.
Hening Huang I think Loken and Gelman just drew samples from a predefined population and calculated the estimates. In a next step they took these samples and added additional error in the form of random noise (normally distributed in the simplest form). At least this is what I did in a small simulation to reconstruct their findings (and I had basically the same results). They clearly state that they used the correlation as the estimate depicted, but it would look quite similar if you used CLES, VD_A or Cohen’s d/Hedges’ g (as I did in my simulation). The scatter is not about the mean values but about the estimates of interest (correlation or difference), which is a crucial point! Imagine, under the true NULL, your mean1 is a bit off in a positive direction and mean2 in a negative direction. If you are interested in the difference, these two add up and might indicate a substantial difference, although the mean values themselves are not so far off.
I do not get your point about the scatter: for the large sample the bulk of estimates lies between .10 and .20, whereas it ranges from -.10 to .40 for the small sample. This is what I would expect and also found myself. I think the results show that your argument about the validity of effect sizes in small samples does not hold, as you argued yourself in the last post. Therefore, for effect sizes, as well as for the t-test, we need larger samples to draw more valid conclusions.
And now comes the point that you ignored in my previous post: the t-test and CLES do not tell the same story, i.e. they give you different information about the sample. So why do you want to use only one of them and not both to interpret your results? As Jochen already pointed out, the t-test gives information about how certain you can be about the sign of the effect. If you can have some confidence that it is in the correct direction, you may also be more confident about your effect size. An effect size where I cannot be confident about the sign (or about whether it lies above/below 0.5 in the case of CLES/VD_A) does not help to interpret your results accordingly. Therefore, inspect both, and other parameters as well. If your concern is whether the t-test is appropriate for your error distribution, fine, look at the residuals and the data distribution. If they do not look normal, fine, think about your data-generating process and what kind of model may be appropriate (maybe some form of GLM or whatever). The t-test works fine, but only where it is appropriate, and nobody tells you to use it as the only source of information to evaluate your data.
Hening Huang said, "Therefore, the two-sample t-test is not “self-consistent”; it is contradictory. I think this is a fundamental flaw of the two-sample t-test."
Jochen Wilhelm said, in a previous answer, "The t-test compares the estimate to the estimated standard error (a function of the variance of the data and the sample size) via the t statistic and provides Pr(|T|>|t|)."
You cannot fault the t-test for what it does not do. You cannot fault the t-test because it is widely abused in publications and by publishers. You cannot fault the t-test because statistical textbooks teach how to perform a t-test instead of why to perform one and what you learn from t-testing (see the previous JW answer).
You can teach the t-test for what it can do and when to use it. You can remind co-workers when, what, and where to use the t-test. You can and should object to and correct misuse of the t-test (a p-value of 0.048 shows cows are not horses.)
{I rarely use the t-test and, then, in preliminary discussion to show direction of investigation.}
I would just like to make a comment on the fact that the p-value is affected by sample size. I consider this a feature of the hypothesis test, not a bug.
To use the classic example, if I flip a coin two times and both come up heads, that's not good evidence that the coin is biased. But if I flip it 100 times, and it comes up heads 65 times, that is good evidence that the coin is biased.
Note also the difference in effect sizes. In the first case, the effect size is very large (100 / 100, or 2 / 0, or Cohen's g of 0.5, depending on what you want to use). But relying on this effect size doesn't give us any confidence in a conclusion.
The effect size in the second case isn't huge (65 / 100, or 65 / 35, Cohen's g of 0.15, depending on what you want to use). But we have at least some confidence that these are reasonable estimates of the effect size for the population.
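The coin example is easy to check with exact binomial tests in R (a quick illustration, not part of the original post):
binom.test(2, 2, p = 0.5)      # 2 heads out of 2 flips
### two-sided p-value = 0.5: no real evidence of bias, despite the 100% heads
binom.test(65, 100, p = 0.5)   # 65 heads out of 100 flips
### two-sided p-value of about 0.0035: good evidence of bias, despite the modest effect size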
Sal Mangiafico Since this discussion focuses on the two-sample t-test, could you use an example of a two-sample t-test for your "comment on the fact that the p-value is affected by sample size. I consider this a feature of the hypothesis test, not a bug"? If this is not a bug, is it a desired feature?
Hening Huang ,
the t-statistic is the mean difference, divided by the standard error of the mean difference. Under the null hypothesis, this statistic is t-distributed.
The mean difference is independent of the sample size. The standard error of the mean difference is a function of the (pooled) variance (which is independent of n) divided by n. So the denominator of the t-statistic shrinks with increasing n, and the magnitude of the t-statistic grows. When the expectation of the t-statistic is non-zero (i.e., when the null hypothesis is false), then the expectation of t depends on n. This is what the t-test is about, and nothing else.
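A tiny numeric illustration of that dependence (a sketch with made-up but fixed numbers: the same mean difference of 8 and the same pooled SD of about 7.3, only the group size changes):
d = 8; s = 7.3                      # fixed mean difference and pooled SD
for (n in c(4, 9, 30, 100)) {
  se = s * sqrt(2/n)                # standard error of the mean difference, equal group sizes
  cat(sprintf("n per group = %3d   t = %5.2f\n", n, d/se))
}
### t grows in proportion to sqrt(n): roughly 1.6, 2.3, 4.2, and 7.8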
You can surely go through the formulas and see that this follows from the math. However, I made a simulation to show you what it means.
The plot shows 3 x 3 histograms, each with the distribution of 10000 p-values. The p-values were obtained from t-tests on simulated data with different sample sizes (n) from populations with different effect sizes (Δ). Each row uses a different sample size, each column a different effect size.
The left column shows the results under a true null hypothesis. Note that the distribution of the p-values remains uniform, independent of the sample size. For all other histograms, the null hypothesis is false. This leads to a right-skewed distribution of p-values. The upper middle diagram is still indistinguishable from a uniform distribution, because here both n and Δ are small. As Δ gets larger (top right), the distribution becomes more skewed. The important point for you is to recognize that the distributions in the middle and right columns become more skewed from top to bottom (with increasing sample size). And, notably, the distribution remains unaffected by n only when the null hypothesis is true.
In a real experiment we get one single p-value. What we actually want to know is whether this p-value is an instance from one of the diagrams in the left column or an instance from one of the other diagrams. Of course, there is no way to definitely answer this question. However, if the p we got from our sample is very close to zero, then it is at least more likely that it is an instance of the cases shown in the middle or right diagrams, because there small p-values are much more frequent than larger p-values.
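For readers without the attached figure, a simulation along these lines can be sketched as follows (my reconstruction; the particular n and Δ values are assumptions, not necessarily those used for the original plot):
set.seed(1)
pvals = function(n, delta, nsim = 10000)
  replicate(nsim, t.test(rnorm(n), rnorm(n, mean = delta))$p.value)
par(mfrow = c(3, 3))                # rows: sample size, columns: effect size
for (n in c(5, 15, 50)) {
  for (delta in c(0, 0.3, 0.5)) {
    hist(pvals(n, delta), breaks = 20, xlab = "p-value",
         main = paste0("n = ", n, ", Delta = ", delta))
  }
}
### left column (Delta = 0): uniform for every n; other columns: increasingly right-skewed as n grows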
We are now close to understanding the 2 sample t-test. It does what it does. We can now discuss, “What is the value in the difference of means?”
We could value this difference using measures as provided by Cohen or Hedges and call it an effect size. The effect size is the difference in means divided by the standard deviation of the population. The effect size is not a quantified value; it is an indicator to the investigator of the strength of the difference and of some degree of usefulness in interpreting the difference. Definitions of the effect size differ, as do effect size formulae. Some even claim to calculate an unbiased effect size.
The difference in the means of the 2 populations is quantified by the propagated uncertainty of the difference. The propagated uncertainty is the sum in quadrature of the uncertainties of both populations. The minimum uncertainty of the difference is 1.4 times the population standard deviation, if the populations have equal dispersion. The uncertainty in the samples is likely larger than the population uncertainties.
An effect size of 0.5 means the population standard deviation is 2 times the difference in means, so the propagated uncertainty is 2.8 times the difference. A value is not quantified if its uncertainty is over 200%. An effect size of 10 means the population standard deviation is 10% of the difference, and the propagated uncertainty is about 14% of the difference.
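A quick check of that arithmetic (assuming equal dispersions, so the propagated uncertainty of the difference is sqrt(2) times the common standard deviation):
ES = c(0.5, 10)                     # effect size = difference / SD
SD.over.diff = 1 / ES               # SD as a multiple of the difference: 2.0 and 0.1
prop.over.diff = sqrt(2) / ES       # propagated uncertainty as a multiple of the difference: about 2.8 and 0.14
rbind(ES, SD.over.diff, prop.over.diff)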
The t-test is useful in designing sampling plans and evaluating preliminary data. It may be of little use to answering why in the original question, but may be useful in establishing what difference is needed to claim a practical difference.
Jochen Wilhelm Your simulation examples are excellent! However, I have some different perspectives about your simulation results.
First, “…the distributions in the middle and right columns become more skewed from top to bottom (with increasing sample size).” Consider the right column. In the top diagram (Δ=0.5 and n=5), there are a large number of p-values >0.05, indicating “no statistical significance” at the 95% level. However, in the bottom diagram (Δ=0.5 and n=50), there are a large number of p-values <0.05, indicating “statistical significance”. So, for the same Δ, the conclusion of the t-test changes with the sample size.
Hening Huang can you then please propose a method to determine when you can be confident about the sign of your effect size estimate (or that it is not compatible with 0.5 in the case of CLES or VD_A)? Again, in your reasoning the delta of 0.5 with N = 5 in Jochen's example is as convincing as the same delta with N = 50. How would you solve the problem?
Hening Huang if you despise the t-test, why not go fully Bayesian? Here you can directly estimate the uncertainty of all parameters (e.g. mean differences, CLES) from the posterior. I tried to replicate Jochen's approach (but with only one sample each, otherwise it would have taken too long) with a hierarchical Bayesian approach (with very flat priors to let the data speak). As you can see, for example in the N=5 and Delta=0.3 condition, the CLES value is quite off and in the wrong direction. Without any estimate of the uncertainty, you would have taken this value (since it is quite large, but in the wrong direction!) as it is. But from the posterior draws you can see that it is still compatible with 0.5, and this is what the t-test should tell you in a similar manner. And as with the t-test, the uncertainty of the estimate decreases with increasing sample size, except that there are no p-values to chase after, only the estimates themselves.
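For completeness, here is a minimal sketch of the idea (not Rainer Duesing's hierarchical model, just conjugate posterior draws for two independent normal samples under Jeffreys priors), applied to Sal Mangiafico's invented data so the posterior of CLES and its uncertainty can be seen directly:
set.seed(1)
A = c(45, 50, 42, 29, 38, 46, 53, 35, 40)
B = c(47, 43, 59, 51)
post = function(x, ndraw = 10000) {
  n = length(x)
  sigma2 = (n - 1) * var(x) / rchisq(ndraw, df = n - 1)   # posterior draws of the variance
  mu = rnorm(ndraw, mean(x), sqrt(sigma2/n))              # posterior draws of the mean, given sigma2
  list(mu = mu, sigma2 = sigma2)
}
pA = post(A); pB = post(B)
cles = pnorm((pB$mu - pA$mu) / sqrt(pA$sigma2 + pB$sigma2))   # P(XB > XA) for each posterior draw
round(quantile(cles, c(0.025, 0.5, 0.975)), 2)
### a wide posterior interval with only 9 and 4 observations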
Rainer Duesing As I wrote before, "For me, comparing two samples is straightforward: it just requires answering two questions: (1) how much do the two samples differ on average (i.e. what is the difference between the two sample means)? and (2) what are the odds that sample A is larger (or smaller) than sample B?" Additionally, I will estimate the standard uncertainty (i.e. standard error) associated with the effect size (i.e. the difference between the two sample means). In measurement science, we don't have difficulty determining the sign of an effect size; we know the sign of the physical quantity we are measuring, and we have measurement quality control protocols to ensure that the data collected are valid, i.e. free of significant bias. In some other fields, the signs of effect sizes may be uncertain or may differ between independent studies. I think this can be addressed by a meta-analysis of multiple studies. I don't think the estimated Δ will be much different for n=5 or n=50 unless the samples are drawn from a population with a very large standard deviation. But the standard uncertainty (standard error) associated with the estimated Δ for n=50 will be much smaller than for n=5. That is, the estimated Δ from n=50 will be much more precise than the estimated Δ from n=5. However, for the same Δ, the p-value for n=5 will be much different from the p-value for n=50, because the true value of the p-value is not a constant; it is a function of n.
Regarding your Bayesian simulation, since only one sample was considered, I’m not surprised that the results of the CLES deviate from the true value or central tendency when the sample size is small (n=5), especially if the sample was drawn from a population with a very large standard deviation (I don’t know what value of the standard deviation was used in your simulation). I agree with you, “… as with the t-test, the uncertainty of the estimate decreases with increasing sample size, exept that there are no p-values to chase after, only the estimates itself.” For a comparison of two samples, I think the CLES or EP is more meaningful than the p-value; it has a clear physical meaning. In contrast, the physical meaning of the p-value is not clear and it can be chased.
Hening Huang I don't want to put words in your mouth, therefore I repeat how I understood your approach to analysis, according to your last post:
1) You estimate the delta, i.e. the difference of the sample means.
2) You calculate the standard error for the difference of the sample means and compare it to the delta, to get a grasp of the uncertainty.
3) You estimate the effect size in some form, e.g. CLES, to get an impression of the practical relevance.
Aren't the first two steps exactly what a t-test does?!?!? I mean, you may not call it a t-test, but you are doing one implicitly. A t-test is defined as the ratio of delta to the standard error. What you did not mention was how YOU decide when you can be certain enough about the estimate. Gut feeling? "Common sense"?? This is where the "test" in the t-test comes into play. Admittedly, the threshold is completely arbitrary (e.g. 5%), but at least it is one that a lot of people agree on (in contrast to common sense....). You just take your ratio and compare it to a critical t value, associated with the amount of certainty you want to have (89%, 95%, 99%, 5 sigma.... your choice). Sorry, but I think you are doing a t-test without openly clarifying your threshold. I would say this is also a form of "p-hacking" if you can decide at will whether the SE is "small enough". If it is otherwise, please clarify and tell us your decision criterion.
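For concreteness, that ratio is exactly what the t-test reports; a minimal sketch with toy data (nothing here is taken from the simulations above):
# Toy data: the Welch t statistic is just the mean difference divided by its standard error.
set.seed(42)
a <- rnorm(20, mean = 0)
b <- rnorm(20, mean = 0.5)
delta <- mean(b) - mean(a)
se    <- sqrt(var(a) / length(a) + var(b) / length(b))
c(ratio = delta / se, t = unname(t.test(b, a)$statistic))   # the two numbers coincide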
Interpreting the effect size (unstandardized, Hedges' g, CLES, VD_A) is what everyone should do, so what you describe is the standard approach to analyses in my opinion. I do not see where you disagree with anything the other commenters have posted so far.
To the Bayesian simulation: it was not my intention to make a simulation to show how "good" or "bad" the Bayesian approach is (for this I would have needed lots of samples, correct), but a) to show a method where you can also have an uncertainty measure for CLES (as far as I know there is nothing in the frequentist version, but please let me know if I am wrong) and b) to demonstrate how important the uncertainty measure is, with the one sample being off by chance. Nothing more, nothing less. BTW: all samples were drawn from a population with a sigma of 1; therefore, this cannot be the origin of any difference.
Rainer Duesing First, a small clarification on step 3). I estimate the exceedance probability (EP) P(A>B) (i.e. the probability that sample A is larger than sample B). EP is essentially the same as CLES, but it does not require the normality assumption (please refer to my paper on EP). In addition, I don't think CLES is an "effect size"; it is a probability.
Second, the first two steps are "estimation", not a "t-test", because we don't calculate the t-statistic and, most importantly, we don't use the t-table or t-distribution for inferences. In fact, we don't make the null hypothesis in the first place. We evaluate the estimate (or estimated effect size) and its associated uncertainty. Then, we make a scientific inference (not a statistical inference) based on professional judgment. We do use thresholds in practice. In our river flow measurements, we set a threshold of 4.09% as the 'maximum permissible relative expanded uncertainty' at the 95% level. If the relative expanded uncertainty of a measured flow rate is smaller than 4.09%, we accept it. Otherwise, we reject it. In a previous post I wrote, “We found that the t-interval method gave unrealistic estimation of the expanded uncertainties (at the 95% level) when the sample size was small, leading to a high false rejection rate. …. In contrast, the mean-unbiased estimator method based on the Central Limit Theorem can provide realistic expanded uncertainties and can be used for measurement quality control. The mean-unbiased estimator method is recently adopted in the ISO standard for streamflow measurements with acoustic Doppler current profiler (ISO:24578:2021(E), Hydrometry — Acoustic Doppler profiler — Method and application for measurement of flow in open channels from a moving boat, first edition, 2021-3).”
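As a rough illustration only (not the exact ISO procedure), the acceptance rule just described might be coded like this, taking hypothetical flow data and an assumed CLT-based coverage factor of about 1.96 for the 95% level:
# Sketch of the acceptance rule; the data and the coverage factor are assumptions here.
flow  <- c(102.1, 99.8, 101.5, 100.7, 98.9)    # hypothetical repeated flow measurements
k     <- 1.96                                   # assumed coverage factor for ~95% coverage
u_rel <- 100 * k * sd(flow) / sqrt(length(flow)) / mean(flow)   # relative expanded uncertainty, %
if (u_rel < 4.09) "accept the measured flow rate" else "reject it"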
Third, I am a practitioner using statistical methods (tools). I want to use valid tools for my work. In my experience, t-tests and t-intervals are invalid tools for practical problems. I advocate the estimation paradigm. I agree with statistics reform proposed by Cumming (2014), which mainly includes: (1) abandoning the null-hypothesis significance testing (NHST), and (2) using the estimation of effect size (ES) (The New Statistics Psychological Science 25(1) DOI: 10.1177/0956797613504966).
Fourth, I think there is a frequentist uncertainty measure for CLES. Please see the discussion: https://www.researchgate.net/post/Is_it_possible_to_calculate_confidence_intervals_for_CLES_via_Fishers_Z_transformation. In addition, for samples drawn from a population with sigma=1, the standard deviation of A-B will be 1.414, which is quite large compared to delta=0, 0.3, or 0.5. I think this explains the large dispersion in your simulations for small samples.
Hening Huang
1) I don't think it is important to distinguish between CLES and EP for this discussion per se. But can you elaborate on why a probability can't be an effect size? (At least it is by name: the common language effect size.)
2) I am not sure about your approach, could you please elaborate? You say "we set a threshold 4.09% as the ‘maximum permissible relative expanded uncertainty’ at the 95% level. If the relative expanded uncertainty of a measured flow rate is smaller than 4.09%, we accept it." How do you calculate the uncertainty and the 95% level?
3) Fair argument, but why do you generalize to all other fields of research and practical applications to conclude that the t-method is invalid per se? As others have pointed out, you have to check for an appropriate model for your data-generating process. You found that the t-model does not work for YOUR applications, which is perfectly fine, but you shouldn't conclude that this is the general case.
4) Thanks for the link!
5) You are right about the standard deviation of the differences in uncorrelated samples, irrespective of the sample size, but how does this matter here? For the t-test you would use the pooled standard deviation, as well as for CLES. Maybe I am missing something.
Hening raised a very interesting and important topic for discussion. There are a lot of insightful discussions as well. My basic stance is in agreement with what was elaborated in the articles "Statistical Inference Enables Bad Science; Statistical Thinking Enables Good Science" and "The Limited Role of Formal Statistical Inference in Scientific Inference". Therefore, although mathematically nothing is wrong with the t-test, I do not think it is very useful in the analysis of real-life data. For example, in analyzing the two independent sample data sets, what we really need to ask is 'what is the difference between the two data sets (e.g., the central tendency, the spread, the shape, etc.)?'. A question like 'is there a real difference in the means of the observed sample data sets?' is scientifically meaningless, which unfortunately is exactly what the NHST paradigm has taught us to do for decades. Essentially, with any single set of sample data, the best that statistical data analysis can do for us is exploratory data analysis / descriptive data analysis or what-if analysis.
But is it not a relevant question to ask, "Could the observed difference be the result of random chance"? Or, "What is the probability that the observed difference is because of chance alone"? Is this not an important question to answer? This is one reason why I really like resampling stats (the other is that it makes no assumptions of normality). It explicitly tests how often a random redistribution of the data between the two groups could result in a similar difference (and thus how likely the observed difference is to be the result of chance alone).
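A minimal sketch of such a resampling (permutation) test on toy data: how often does a random redistribution of the observations between the two groups produce a mean difference at least as large as the observed one?
set.seed(1)
group1 <- c(12, 15, 9, 14, 11, 13)   # toy observations for group 1
group2 <- c(7, 10, 8, 6, 11, 9)      # toy observations for group 2
obs    <- mean(group1) - mean(group2)
pooled <- c(group1, group2)
perm <- replicate(10000, {
  idx <- sample(length(pooled), length(group1))   # random reshuffling of the labels
  mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(perm) >= abs(obs))   # two-sided permutation p-value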
Chavoux Luyt , "and thus how likely the observed difference is to be the result of chance alone" -- nope, not at all. No matter whether you do resampling or infer the sampling distribution from a statistical model: the probability ("p-value") you get is the tail probability of more extreme results under the assumed null hypothesis. In your resampling approach you simulate one aspect of the true null hypothesis (you set this - not nature!). What you see is the distribution of your data (or statistics) in this simulation, and not the probability that the true state of nature equals your simulation.
Jochen Wilhelm, in other words, if I am understanding you correctly, my resampling test does not test if my sampling from nature (i.e. my observations) truly represent reality?
E.g. I regularly see more Bontebok in one area than the other. Resampling can tell me if this is likely to be by chance alone. It could be just by chance, but a random reshuffling of the observation data will show me that this is possibly simply by chance, or unlikely to be by chance alone. But it cannot tell me if Bontebok really prefer one area to the other, because it cannot test whether my observations were regular enough, or in the right season, or whatever, to show a true preference... it cannot test the validity of my experimental design? Or is it something other than experimental design that is involved?
Chavoux Luyt , you simply cannot assign any probability to a hypothesis based on the information from observed data.* But this is what you state: "based on these data, its this and that (un-)likely that the null is true". It has nothing to do with your experimental design.
There is another problem in your wording: "chance" is not a cause of something happening. Chance is not a "state of nature" but a "state of mind". It expresses how much we expect an observation or event (normalized to the set of all possible events). Chances depend on our knowledge and on the models we use to describe the world. In your case, you provide the (arbitrary) resampling model. And under this given model you can then calculate the chance of the data being this or that extreme. This does not tell you anything about the chance of your model being "good" or "correct". Statements like this make no sense at all.
You may also think of the dependency of the p-value on the sample size. For any given state of nature that is not identical to your model (as a model is a model and not the reality, this is always true!), increasing the sample size results in lower expected p-values. Why should the chance that nature has a particular state be a function of the sample size? This again makes absolutely no sense.
---
* it is possible to refine a given probability distribution over possible hypotheses using the information from observed data in a Bayesian analysis - but this is not what p-values are about!
Jochen Wilhelm "Why should the chance that nature has a particular state be a function of the sample size? This again makes absolutely no sense."
Because, using my example, if I see more Bontebok in one area than the other on one occasion, there is a good chance that it was only by chance (they might actually be more in the other area on most days). However, if I see them more in one area than the other every single day over a whole year, this should be a good indication that they are actually more in the one area than the other and that my observation(s) is not simply a result of chance (if we state in this case as our null hypothesis that Bontebok do not prefer any habitat to any other and are therefore equally likely to be observed in either). Would you disagree? Or I can show that the difference in numbers that I observed from 10 observations could happen by chance alone... if I shuffled the actual observations randomly between the 2 areas, a similar difference between the two sites can be seen more than some (arbitrary) percentage of the time (so my 10 observations could be insufficient to claim that there is actually a difference). For me this makes logical sense. What am I missing?
- I can see how experimental design might mislead me in my conclusion... maybe all my observations were made in the morning and Bontebok actually preferred the other habitat in the afternoon, so I missed it because of a problem with experimental design.
But what is that "chance" that should be responsible for evenly distributing Bontebok across the habitats you are observing?
There are causes for each Bontebok being where it is: because it was there some time before, because the herd is there, because there was a predator to the north, because there was wood to the west, because one of the herd members stepped onto a stone, because one of the neurons in its frontal cortex fired, etc. etc. Not all are good reasons, but taken together they would explain why a particular animal is where it is. This is so for every animal - so where is the chance that you see some here and some there when you take a look? The chance is only in your mind, because you don't know all the reasons that cause the animals to be where they are when you look at them.
What you do is to ask: if the animals should be evenly distributed (and there is nothing more we know about the animals and their reasons to be where they are), what is the chance to get more unevenly distributed data than it was observed?
Also note that this implies that all animals act independently. This is a bad model here because we should believe that these animals move as a herd.
Chavoux Luyt I think the problem is that you cannot answer the question "Could the observed difference be the result of random chance?". It does not matter how you formulate or test your null hypothesis; you assign a non-zero probability to your data under the given hypothesis. Therefore, your observed data can always be "a result of random chance" (under the null) [although I do not like this formulation]. What you can say is that under a given null, data such as yours or more extreme are very rare (small probability), nothing more, nothing less. If this probability is small enough, you may come to the conclusion that the data are quite incompatible with your hypothesis and therefore the null model does not describe the data very well (and therefore another model should be better). What you cannot derive from such a result is a probability statement about the likelihood of your hypothesis itself. Or more formally, pr(data|hypothesis) is not the same as pr(hypothesis|data). The former is what your test gives you (no matter if t-test or permutation test), whereas the latter is what you want (as I understood you).
Chavoux Luyt Statistics was developed to study chance in gambling. The language of chance and odds has persisted in the application of statistics to non-gambling situations. The study of gambling is based upon a known, closed system and an assumed fair game. A game is fair if all conditions are knowable and known.
The extension of statistics to an open system has additional uncertainty. The additional uncertainty comes from unknown and unknowable conditions. Experimental design and sampling of an open system strives to identify and constrain conditions and then to sample according to assumptions concerning the data distribution.
The null hypothesis reasoning applies to data sufficiency. Are the data sufficient to show an average difference from the null? A sufficient difference is defined by the two sample t-test and a chosen alpha. One can argue that all available data are limited to and known by the conditions and assumptions of the experimental design. The chances that sampling the null population twice will produce sample averages sufficiently different or more extreme are small. Small is defined by the chosen alpha.
Obtaining a p-value smaller than alpha indicates that chances are one of the samples did not come from the null. Practical?
Rainer Duesing
1) Indeed, there is no need to distinguish between CLES and EP for this discussion. In a two-sample comparison, the difference between the two samples (or two sample means) is the “effect size” that we are really concerned with. Defining another effect size such as CLES or EP would cause confusion. Importantly, estimation of CLES or EP is a “probabilistic analysis”, not an “effect-size analysis”. We should distinguish these two types of analysis. BTW, I think the term “CLES” is confusing; it does not reflect its real meaning: “probability”.
2) I use the mean-unbiased estimator method based on the Central Limit Theorem to calculate the expanded uncertainties at the 95% level. Please refer to a review paper: “Huang 2020 Comparison of three approaches for computing measurement uncertainties Measurement 163 https://doi.org/10.1016/j.measurement.2020.107923.”
3) I don't think it's me who is "generalizing to all other fields". Criticisms of NHST have come from scientists in many fields. For example, in March 2019, more than 800 scientists and statisticians around the world signed a manifesto calling "for the entire concept of statistical significance to be abandoned." Also in March 2019, The American Statistician published a special issue on statistical significance and p-values. The editorial in this special issue recommended eliminating the use of 'p < 0.05' and the term 'statistically significant'.
Another journal joins the “estimation” camp:
“Statistical inference through estimation: recommendations from the International Society of Physiotherapy Journal Editors” European Journal of Physiotherapy, (2022) 24:3, 129-133, DOI: 10.1080/21679169.2022.2073991, https://www.tandfonline.com/doi/epdf/10.1080/21679169.2022.2073991?needAccess=true&role=button
“This co-published editorial explains statistical inference using null hypothesis statistical tests and the problems inherent to this approach; examines an alternative approach for statistical inference (known as estimation); and encourages readers of physiotherapy research to become familiar with estimation methods and how the results are interpreted. It also advises researchers that some physiotherapy journals that are members of the International Society of Physiotherapy Journal Editors (ISPJE) will be expecting manuscripts to use estimation methods instead of null hypothesis statistical tests.”
Hening Huang , "examines an alternative approach for statistical inference (known as estimation)" -- that's a misleading statement!
With "estimation" they mean evaluating the maximum likelihood estimate (MLE) and placing a confidence interval (CI) around it. But this CI is nothing but the set of values around the MLE that cannot be rejected in a hypothesis test about the respective parameter at a certain level of significance. So this is not an alternative; it is simply a different way of using hypothesis testing. Instead of testing one point hypothesis, a continuous spectrum of hypotheses is tested.
Having a p-value of 0.1 for the test of d = H0 just means that the 90% CI touches H0 (and that any more-than-90% CI will include H0 and any less-than-90% CI will exclude H0). Giving the confidence interval provides somewhat more information about values of d that are not too incompatible with the observed data (at a given level of significance!). But there is nothing fundamentally different in the kind of information provided by p-value and confidence interval: they evaluate whether the observed data are (too in-)compatible with hypotheses about the parameter of a statistical model.
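A toy check of that duality: if the two-sided p-value for H0: µ = 0 is p, then the (1 - p)·100% confidence interval has one limit sitting exactly at 0 (illustrative data only).
set.seed(7)
x  <- rnorm(15, mean = 0.4)                 # toy sample
p  <- t.test(x, mu = 0)$p.value
ci <- t.test(x, mu = 0, conf.level = 1 - p)$conf.int
c(p = p, lower = ci[1], upper = ci[2])      # one CI limit equals the null value 0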
There are no two camps here.
@ ALL
A lot of answers by a lot of scholars and a super-lot of Recommendations!
Using your methods, would you decide whether the means of the components in the first row and in the second row are significantly different?
[attached image: a table with two rows of data]
Awaiting your decision…
It depends on (1) what distribution model you assume and (2) what level of significance you choose.
Assuming a
Weibull distribution: p = 0.079;
Extreme value distribution: p = 0.005;
log-normal distribution: p = 0.092;
log-logistic distribution: p = 0.18.
The Weibull model has the lowest AIC (304); the extreme value model has the largest AIC (333). My guess is therefore that the low p-value from the extreme-value model is more an indication that the distribution model does not fit than an indication that the means (or the extreme-value-distributed RVs) are not equal.
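One way to run this kind of comparison in R is sketched below (not necessarily how the p-values and AICs above were obtained); it uses survreg from the survival package, with the values of the two rows typed in as they appear later in the thread (first and last 10 observations of the series):
library(survival)
rowA <- c(286, 948, 536, 124, 816, 729, 4, 143, 431, 8)       # first 10 observations
rowB <- c(2837, 596, 81, 227, 603, 492, 1199, 1214, 2831, 96) # last 10 observations
dat  <- data.frame(time = c(rowA, rowB), group = rep(c("A", "B"), each = 10))
dists <- c("weibull", "exponential", "lognormal", "loglogistic")
fits  <- setNames(lapply(dists, function(d)
           survreg(Surv(time) ~ group, data = dat, dist = d)), dists)
sapply(fits, AIC)                                          # compare distributional fits
t(sapply(fits, function(f) summary(f)$table["groupB", ]))  # group effect and its p-value per model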
I forgot to say "alpha" = 5%.
Sorry.
If you go to the Montgomery book you find that he says: "Exponential distribution".
Exponential distribution: p = 0.055; p > 0.05 --> not statistically significant.
I am curious how you address this question and what your result is and why mine is wrong (I presume that this will be your conclusion).
PS: In fact, the AIC of the exponential model was even smaller (302).
Well said, Jochen Wilhelm . (RE: European Journal of Physiotherapy )
the point is in the question/discussion:
Does the two-sample t-test provide a valid solution to practical problems?
What can we decide about the case?
Is the first mean different from the second at the "banned" CL = 95%?
One could use a likelihood interval (they didn't ban that too, did they)?
https://www.jstor.org/stable/2985006
The following article may be relevant to the discussion.
https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1537892
Assessing the Statistical Analyses Used in Basic and Applied Social Psychology After Their p-Value Ban
In this article, we assess the 31 articles published in Basic and Applied Social Psychology (BASP) in 2016, which is one full year after the BASP editors banned the use of inferential statistics. We discuss how the authors collected their data, how they reported and summarized their data, and how they used their data to reach conclusions. We found multiple instances of authors overstating conclusions beyond what the data would support if statistical significance had been considered. Readers would be largely unable to recognize this because the necessary information to do so was not readily available.
@ Jochen and Salvatore
Then what can one decide about the Montgomery problem?
Something?
Or nothing?
One can decide that the available data is not sufficiently convincing to expect that the log ratio of the expected survival times has the same sign as the estimated log-ratio.
Massimo Sivo , at the article I linked above, this is what they list as recommended assessment measures for BASP.
Given, these, I'm not sure what conclusion you would come to for the Montgomery example.
The probability of an observation in Row2 being larger than an observation in Row1 is 0.64. The means are 403 (se=109) and 1018 (se=327). And a dot plot is attached. Cohen's d (if appropriate) comes out as 0.80.
These summary statistics and effect sizes are suggestive. But the sample size is small.
I think this is the problem with what BASP is advocating (if that article is fairly representing this).
None of these point estimates of the effect size captures the fact that the variation within each Row overwhelms the difference between the two Rows.
That is, if we use any appropriate hypothesis test, or put confidence intervals on any of these effect sizes, we will come to the conclusion that the values for the two Rows are not significantly different.
I think this is a good example to illustrate the problem with the BASP approach.
Sal Mangiafico ,
Cohen's d clearly makes no sense here, as far as I know, as it refers to (approximately) normally distributed variables. Here we know that the distribution is exponential. For this distribution, V(X) = E(X)², so the sd is known to equal the mean (quick check: the ratios of sample sd to mean are 0.86 and 0.98 for the two groups, resp.). So there is no information used in Cohen's d that is not already in the mean.
A sensible effect size measure would be the ratio of the expectations in the two groups. The point estimate is easy to calculate; it turns out to be simply the ratio of the means (about 2.5 in this example, B vs. A). So in this sample, the mean in group B is about 2.5 times as large as the mean in group A. This point estimate is > 1, so we might be inclined to hypothesize that E(B) > E(A). The test checks if the data are sufficient for this conclusion and it says: no, they are not. There are hypotheses with E(B) < E(A) under which the observed data would not be sufficiently surprising, so the data do not provide enough information to discern E(B) > E(A) from E(B) < E(A).
@ Jochen and Salvatore
The conclusion
is "appropriate according to the BASP approach"
but it is NOT according to the Theory...
Then we are left in ...
Massimo Sivo , in my opinion: If the article I linked to accurately reflects the expectations of BASP, those guidelines are silly and not helpful for evaluating observed data.
But what is your stance on this ?
@ Jochen Wilhelm
You wrote:
Actually, with the right computation
we have E(B)>E(A) at the desired level of significance
@ Salvatore S. Mangiafico
To give you my stance on ..., I must wait to find the paper you suggested.
Until then, unfortunately, I cannot answer.
Massimo Sivo , don't you see your silly, unfriendly, arrogant behavior?
I made a step towards you, asking for your solution. What do we get? A conglomerate of vague questions and unsupported statements. It - again! - is extremely unsatisfying and ineffective to communicate with you. It seems obvious to me that no one can learn anything from you as it seems you have nothing to say but to make some strange, provocative, and unsupported statements. It was another attempt to learn something from you, and I give up on this again.
@ Jochen Wilhelm (and recommender)
ARROGANCE?
where is it?
Truth is truth....
YOU said:
I wrote
THEORY is Arrogance?
Where is the ARROGANCE?
Massimo Sivo, to claim to have or know the "truth" is itself arrogance. Statistics are conventions, and models are more or less appropriate to describe the data; there is no truth in it. Is there more truth in the 95% interval than in the 89% interval or the 99%? What is truth? For example, I calculated the model with Bayesian methods (uninformative priors), where there is no need to claim any distributional form for the posterior of the A/B ratio, just to describe it.
If I use the HDI interval (which I would prefer in most cases) the interval is [0.74, 5.35]. If I use the equal tailed interval (ETI, which is similar to the frequentist approach) I get [1.01, 6.14]. So which one is true now??
Either way, 97.67% of posterior density mass is larger than 1.
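A toy illustration of why the two interval types can disagree for a skewed posterior (simulated draws, not the actual posterior from the model above):
hdi <- function(x, prob = 0.95) {          # shortest interval containing prob of the draws
  x <- sort(x); n <- length(x); k <- ceiling(prob * n)
  w <- x[k:n] - x[1:(n - k + 1)]
  i <- which.min(w)
  c(lower = x[i], upper = x[i + k - 1])
}
set.seed(3)
draws <- rlnorm(20000, meanlog = 0.6, sdlog = 0.6)   # right-skewed "ratio-like" draws
quantile(draws, c(0.025, 0.975))   # equal-tailed interval (ETI)
hdi(draws)                         # highest-density interval, shifted toward the mode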
My contribution to the debate...
First, the given example is, let's be clear, a perfect example of incorrect usage of the T-test (based on automatism and not reflection), because it is well known 1) that lifetimes are not Gaussian and 2) that the T-test is quite sensitive to the Gaussian hypothesis for small sample sizes. So, unless a careful data analysis suggests that a Gaussian approximation may be acceptable (which is at best very difficult with such small sample sizes), trying to interpret the T-test is just a waste of time: its p-value can be anything and there is no way to know its distribution, so there is no possibility to use it in any useful way. Just having the mean and standard deviation does not allow any useful diagnostic or analysis, by the way; the raw data are needed. Note also that most often, the "automatism" given for practical usage is to use non-parametric tests for such small sample sizes (which has its limitations, as does any automatism instead of thinking, but at least it is more reliable than a stupid usage of a T-test).
By the way, "common sense" definitely does NOT tell me there is a real difference between the two sets of lamps, because the standard deviation is very high, so the difference seems small compared to the between-lamps variability; that would be my first concern for both manufacturers (keeping in mind that the standard deviation is to be taken with care for skewed distributions, as lifetimes often are, but here again, without the raw data, it is impossible to make a correct interpretation).
Second, saying that the T-test is useless, or should be banned, just because some people use it for the wrong purpose, like in this example, would be the same as saying that you cannot use antigenic tests, cars, electron microscopes or any other tool just because some people do not know how to use it. Note that for cars, a licence is required; basic knowledge of statistics should be required to use statistical tools as well!
Third, if H0 is completely true (that is, the data are iid from a Gaussian distribution, and come from a single population), then the p-value does *not* depend on the sample size. In any other case, as well explained in other answers, it does. The problem is that a true H0 does not exist in reality: data are never Gaussian, never completely identically distributed, and there is always a small but irrelevant difference between groups (I cannot believe the two lamps' lifetimes are exactly the same at the microsecond scale). So in practice, any high enough sample size will help to produce small p-values, but for practically irrelevant departures from the true H0. Here again, the problem is not with the tool, but with its blind usage, with the same comparison to any other experimental tool.
Alternative propositions have the same drawback: to be correctly interpreted, one should NOT use automatism and predefined absolute rules, but one should THINK and understand what they are and how they relate to the practical question.
In brief: do not ban anything just because some people do not use it correctly, or you will ban everything. EDUCATE people, educate researchers in statistics, not as a set of tricks, but as a real understanding of models, of interpretation, of their relation to the scientific approach.
Massimo Sivo , I'm pretty sure the American Statistician article reviewing BASP guidelines is open access: https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1537892 . It's possible that it's not open in all countries.
@ Emmanuel Curis, Rainer Duesing, Jochen Wilhelm, Joseph L Alvarez, Salvatore S. Mangiafico
That's why I presented the case from the Montgomery book. Not Gaussian BUT Exponentially distributed data… Then the t-test is "nonsense".
THEN The Random Variable R, Ratio of the RVs A (related to the first 10 data) and B (related to the second 10 data), R=A/B is F distributed.
This is TRUE!
Truth is not arrogance!
It is WRONG
Remember Deming statements:
[1.] "Beware of common sense"
[2.] "experience alone, without theory, teaches nothing what to do to make Quality"
[3.] "The result is that hundreds of people are learning what is wrong. I make this statement on the basis of experience, seeing every day the devastating effects of incompetent teaching and faulty applications."
Statistics is a Theory able to provide “statistics” computed from the data and used to take decisions with stated Confidence Levels and connected with Probability Theory …
The practical value of “statistics” depends on the Theory used and on its AXIOMS.
The data from Montgomery’s book is very interesting. According to a post of Emmanuel Curis, the data are about the lifetime of two sets of lamps. So, for the sake of discussion, I name the data in row 1 “Lamp A” and the data in row 2 “Lamp B”. And according to a post of Massimo Sivo, Montgomery said the data follow an "Exponential distribution". Therefore, the two-sample t-test is out of the picture for this example.
I would like to borrow the statement from Jaynes [4] regarding the data of two manufacturers’ components (i.e. the example in the original post), “I think our common sense tell us immediately, without any calculation, this [dataset] constitutes fairly substantial (but not overwhelming) evidence in favor of manufacturer B.” I think his statement also applies to Lamp B. But of course we need some statistical analysis to prove our common sense correct. To quote Laplace: “Probability theory is nothing but common sense reduced to calculation”.
Here are the summary results of my analysis. Lamp A: sample mean = 403, sample std. = 344, estimated rate λ = 0.002484; Lamp B: sample mean = 1018, sample std. = 1035, estimated rate λ = 0.000983. The difference between two sample means = 615. The relative mean effect size (RMES) = 133%. The signal content index (SCI) (or heterogeneity index) = 68.53%. The estimated exceedance probability (EP): Pr(B>A) = 71.7%.
The rate parameter λ of the exponential distributions is estimated based on the maximum likelihood principle. The exceedance probability (EP) is calculated using the estimated distributions for Lamp A and Lamp B. Note that the estimated EP is 71.7%, i.e. the odds that Lamp B’s lifetime is greater than Lamp A’s lifetime are about 2.5:1. The odds estimated directly from the data are roughly 7:3 (about 2.1:1), which is consistent with the model-based (EP analysis) odds of 2.5:1.
I think the results of (1) the estimation of the effect size (the difference between two sample means and the RMES), (2) the SCI analysis, and (3) the EP analysis provide sufficient evidence in favor of Lamp B.
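As a quick check of the EP figure above: for independent exponential lifetimes X_A ~ Exp(lambda_A) and X_B ~ Exp(lambda_B), Pr(X_B > X_A) = lambda_A / (lambda_A + lambda_B), so with the fitted rates quoted above:
lambda_A <- 0.002484   # fitted rate for Lamp A (from the summary above)
lambda_B <- 0.000983   # fitted rate for Lamp B
lambda_A / (lambda_A + lambda_B)   # roughly 0.72, matching the 71.7% quoted above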
@ Emmanuel Curis, Hening Huang
We have TWO “Common Sense” statements
They are in CONTRADICTION!
[1.] "Beware of common sense"
[2.] "experience alone, without theory, teaches nothing what to do to make Quality"
[3.] "The result is that hundreds of people are learning what is wrong. I make this statement on the basis of experience, seeing every day the devastating effects of incompetent teaching and faulty applications."
WITH HOW MUCH PROBABILITY Pr(B>A)=71.7%?
Hening Huang : Indeed, I interpreted "component" as "lamp" as it is a classical example of lifetime, but this was over-interpretation of the post, sorry. That does not change anything to the rest of the discussion, thankfully!
The main problem with your approach is that you trust your estimates of lambda, the mean and the standard deviations to have infinite precision and to be the "true" values. But they are not; they have an estimation uncertainty, which can be expressed by a confidence interval (which may be difficult for such a small sample size, but that's another problem).
Because of this uncertainty, your EP is also uncertain. If you have no idea of this uncertainty, you simply cannot decide which evidence it gives for one or the other. You should have a confidence interval for it, which may well be very wide and include values below 50%. The same remark applies to the effect size and all the other criteria you give.
Note that, despite a comment saying otherwise, this does not apply to p-values, because the risk of an incorrect conclusion is controlled by the decision rule "reject H0 if p < alpha".
I'll just describe the actual Montgomery example, although at this point the actual example in the text is relatively irrelevant to this discussion.
It's from Introduction to Statistical Quality Control, D.C. Montgomery. It's Example 7.6, Table 7.14 in the 6th edition.
It's not divided into two groups at all, but is a single time series representing the number of hours between failures for a valve.
It's exponentially distributed, and the text recommends using the transformation x = y ^ 0.2777 to transform this distribution to a Weibull distribution that is close to normal.
The point of the example is to construct a control chart, in this case to show that the time between failures isn't increasing or decreasing over time.
Emmanuel Curis
I actually like your interpretation of "component" as "lamp" because it gives a physical meaning, lifetime, to the component, so we are dealing with a "real" physical quantity, not just numbers.
Like any point estimate, the estimated lambda, mean and standard deviations from my analysis have uncertainties, and I am not claiming that these estimates have infinite precision. I think your p-value is also uncertain. If you have a confidence interval for your p-value, it is likely to be very wide. However, no one seems to report a confidence interval for a p-value in practice; it seems to be considered “exact”.
I know that the uncertainty associated with any point estimate must be considered in data analysis. In fact, we do uncertainty analysis all the time in measurement science. However, I think “central tendency” is our main concern in data analysis, especially in decision-making. As for this lamp example, my analysis suggests that people should “buy” Lamp B, not Lamp A. In fact, the data speak for themselves and constitute fairly strong evidence in favor of Lamp B, since (1) Lamp B’s mean lifetime is 1018 hours and Lamp A’s mean lifetime is 403 hours, and (2) the odds that Lamp B’s lifetime is greater than Lamp A’s lifetime are 7:3.
What is your buying decision based on your null hypothesis test and your p-value or your confidence interval?
Nope, Hening, the p-value is a sample statistic. It has no uncertainty. It is an observation, as fixed as the observed sample data. This is, by its very definition, known or given. What is not known is the distribution of the random variable of which the p-value is a realization. We only know that this distribution is uniform under the null model and right-skewed under models that are specified correctly except for the value of the respective parameter. Of course we don't know either whether the model is specified correctly - this is something we may only assume, and the assumption may be more or less reasonable.
You must not confuse a random variable with its realization.
There is nothing like a confidence interval for a p-value, just like there is no confidence interval for any other observation. If this were so, then the limits of confidence intervals would themselves be uncertain and should have confidence intervals, with uncertain limits... ad infinitum. This idea gets us nowhere.
Regarding your last paragraph: If it is just to select which lamp to buy, it doesn't matter, because all lamps investigated are dead :) I think what you meant is not the lamps but the manufacturer, say. If you have to decide between the two, and all you have is the data at hand, then simply choose the manufacturer whose lamps in the analyzed sample worked longer, on average. As simple as that. The problem starts when you want to know the level of confidence such a decision might have...
You write about "evidence". This is an inherently difficult concept. I can call "evidence" whatever I like, and I can weight "evidence" as I want - unless I give a clear and unambiguous definition of "evidence". Many statisticians have failed to give such a definition. Some claim the "p-value" is a measure of evidence, which is clearly wrong and has led to tons of unnecessary papers published around this topic and the accompanying confusion, particularly among non-statisticians. In my opinion we should stick to the working mode of "avoid talking about evidence" as long as no useful definition of "evidence" exists.
@ Hening Huang : please reread my answer, where I also explained why « confidence intervals » on p-values are not needed (if they would exist, because I agree with Jochen Wilhelm that they do not make sense for p-values).
In more detail: p-values are (realisations of) random variables, by definition, since they are basically p(C > Cobs), where C is the test criterion and Cobs its observed value (note that there are two random variables in this definition, C and Cobs, but observed at different times and playing very different roles; this is why the result is also a random variable and not a single number). As such, they are not estimations of a theoretical parameter (there is no single theoretical value of what the p-value is), hence confidence intervals are meaningless. On the contrary, you can build a "prediction" interval, that is, an interval that has a known probability of containing the result of the random variable. In fact, this is exactly what is done when building the "decision" rule « reject H0 if p < alpha ». On the contrary, if I understand your exceedance probability correctly, it is a theoretical, well defined, but unknown single value (p( XA > XB ) can be computed from the law of XA and XB; there are two random variables here also, but they are observed simultaneously and play the same role, hence the result is really a number, not a random variable here), and you are estimating this single value with your samples. Hence confidence intervals are defined and should be used to interpret the estimation.
Concerning central tendency: this notion is already very much linked to Gaussian or Gaussian-like (unimodal, symmetric) distributions. But how do you define it for highly skewed distributions, typically like exponential distributions? The expectation is one choice, the median would be another... Not so clear. And for this practical problem, I may prefer a manufacturer that makes lamps with a smaller expected lifetime but much higher reproducibility of the lifetime, over one with a greater mean but very low reproducibility, just for easier product follow-up afterwards (better knowledge of the real failure time of the lamp!). But it strongly depends on the context, of course.
Concerning my decisions: assuming having a greater mean is my first decision criterion, if the test (correctly performed) is significant at my decision level, I would go for the manufacturer with the highest mean lifetime. If it is not, I will either ask for more data or fall back to secondary decision rules, like "cheapest" or things like that.
Last, if indeed the context is the one found by Sal Mangiafico , then I definitely think the non-significant T-test is a more correct result than seeing a difference by alternative methods, and I'm quite sure a bootstrap sampling would give a very wide confidence interval for your own criteria (and an interesting view of the real distribution of p-values of a T-test assuming normality under H0, by the way). And I'm not sure splitting the sample is a good method to detect a trend in the data...
Salvatore S. Mangiafico, Jochen Wilhelm, Wim Kaijser, Rainer Duesing
I am really concerned about the answers to the Montgomery case.
The following statement makes no sense
Montgomery makes a wrong analysis and he concludes that the process is In Control.
Actually, with the right analysis, using the EXPONENTIAL distribution the process is Out Of Control, in several ways not seen by Montgomery…
The Montgomery case shows that we cannot accept the following statement about Confidence Intervals…
About p-values:
[1.] there is the definition
[2.] there is the computation (“estimation”) of the p-values
[3.] so we can compute the Confidence Interval from the “estimated” p-value
[4.] therefore it is doubtful that «confidence intervals» … do not make sense for p-values
The Montgomery case, as I provided it, was chosen to highlight the problems arising in analysing the data from designed experiments.…
Massimo Sivo : definition of p-values is "Probability of the observed test statistic, or more extreme, under the null hypothesis" (not discussing what means exactly "more extreme").
If you consider a single experiment (Fisher approach), this probability cannot be computed before any experiment, since the observed value of the test statistic is unknown. So p-value is a raw number, not a theoretical parameter of the model, hence there is no confidence interval for it (since, by definition, a confidence interval is an interval with a given probability to include the true value of a theoretical parameter). Or, seen differently, a p-value in this approach is always equal to its true value (except rounding error problems), and its confidence interval is reduced to itself, if you prefer. But of course, you cannot infer much from this alone.
If you consider repeated experiments (frequentist approach, let's say to be simple), each experiment will give a different test statistic value, the test statistic being a random variable, so the p-value is also a random variable. Since it is a random variable (whose one realization is observed), here again, it is not a theoretical parameter, hence no confidence interval can be built for it.
This is exactly the same situation as for any single observation, or for the sample mean (you don't have confidence intervals for the sample mean, you use sample mean to build confidence interval for the theoretical mean, aka the expectation ; you may have prediction intervals for the sample mean, but that is very different) or any observed quantity.
The only way to imagine a confidence interval on the p-value would be a kind of two-stage process where you distinguish 1) the random sampling of the units (« patients ») that gives you measures x1 to xn and 2) the observed measure with an experimental uncertainty, giving you the observed value y1,... yn. The « theoretical » p-value is the one you would have obtained with the true x1 and xn values; the « observed » one is the one computed with the y1... yn values. But that's a quite complicated model, and if experimental errors are indeed so high that you should consider this model (like strong rounding errors for instance), and not just sum up the two variances assuming independent processes (and knowing that the observed result in the (y1, yn) sample is already this total variance, so the model would be somehow unidentifiable), I guess you cannot used classical tests anyway that do not assume this model, so p-values are not interpretable, so the question of estimating them vanishes. And if you build such a complex model, then the final test and its p-value are falling back to the previous cases, so...
PS: the presentation somehow implicitly assumes that H0 is a point hypothesis, but it generalizes to composite hypotheses by the usual definition of a p-value in such cases...
Since some of this discussion is about the specific example in Montgomery, below I have the original data (in R code). And I attached a plot of the data, similar to how Montgomery plots it.
Personally, I don't see any good evidence that the time between failures is increasing.
There's an obvious problem with selecting the first 10 observations as a group and the other 10 observations as another group. Although if I use Pettitt's test for change-point, it does suggest that this is where the change point is.
And looking at e.g. moving averages is suggestive.
Massimo Sivo , can you explain what analysis you used ? Like, in a way that I could duplicate ? I really can't follow what you are suggesting for an analysis.
_________________
Data = read.table(head=TRUE, text="
Failure Hours TransformedHours
1 286 4.80986
2 948 6.70903
3 536 5.72650
4 124 3.81367
5 816 6.43541
6 729 6.23705
7 4 1.46958
8 143 3.96768
9 431 5.39007
10 8 1.78151
11 2837 9.09619
12 596 5.89774
13 81 3.38833
14 227 4.51095
15 603 5.91690
16 492 5.59189
17 1199 7.16124
18 1214 7.18601
19 2831 9.09083
20 96 3.55203
")
Sal Mangiafico I bet 5€ that Massimo Sivo's answer will contain some of the following buzzwords:
- Deming said...
- F. Galetto said...
- ...people are learning what is wrong...
- true scholars would be able to calculate it themselves
or a mixture of the above mentioned or something similar.
Or better: I will donate 5€ to a charity organisation if Massimo Sivo provides an exact way and explanation of how to do his calculations, and why this is the correct way, so that you can repeat it in R. I will upload a proof of my payment.
Sal Mangiafico
Thanks for showing the data in the specific example in Montgomery. Your scatterplot is helpful. However, since the data are transformed according to x = y ^ 0.2777, the transformed data are “distorted”, which is called “transformation distortion”. Harrell (2014) pointed out: “Playing with transformations distorts every part of statistical inference…” Therefore, the inference based on the transformed data may be incorrect: “the time between failures isn't increasing or decreasing over time.” I think we should examine the data in their original physical unit (lifetime), not the transformed data. If we do this, we can see that, on average, the lifetime of the components increases over time, although the data are scattered.
By the way, the t-statistic is a transformed quantity, so inferences based on the t-statistic suffer from the so-called “t-transformation distortion” (please see https://iopscience.iop.org/article/10.1088/1361-6501/aa96c7).
Harrell F 2014 Comments on: ‘Pitfalls to avoid when transforming data?’ Cross Validated, http://stats.stackexchange.com/questions/90149/pitfalls-to-avoid-when-transforming-data
Jochen Wilhelm and Emmanuel Curis
Below I choose to respond to a few of key points from your posts.
(1) On Jochen Wilhelm’s statement: “… the p-value is a sample statistic. It has no uncertainty”.
If “the p-value is a sample statistic”, it must have uncertainty, just like any other sample statistic (e.g. the sample mean).
(2) On Jochen Wilhelm’s statement: “There is nothing like a confidence interval for a p-value… “ and Emmanuel Curis’ statement: “Since it [p-value] is a random variable (whose one realization is observed), here again, it is not a theoretical parameter, hence no confidence interval can be built for it.”
In my opinion, there is a theoretical or 'true' value for p-values. Let’s consider a one-tailed t-test for two samples, 1 and 2, with the same size n. The true p-value can be calculated by the corresponding one-tailed two-sample z-test, in which the parent population parameters μ1, σ1, μ2, and σ2 are known. For each pair of samples randomly drawn from their corresponding parent distributions, a p-value, denoted by p(sample), can be calculated by the one-tailed two-sample t-test. This p(sample) is a sample statistic (random variable) that can be described by a distribution. Since (according to Jochen Wilhelm) “We only know that this distribution is uniform under the null model and right-skewed under models that are specified correctly except for the value of the respective parameter”, we can use this distribution to construct a coverage interval at a specified coverage probability. We can perform simulations to generate a large number of the p(sample) values and the associated intervals. Then, we can verify the “coverage” just like we usually do for any other interval procedure. (A sketch of such a simulation is given at the end of this post.)
(3) On Jochen Wilhelm’s statements: “There is nothing like a confidence interval for a p-value, just like there is no confidence interval for any other observation. If this were so, then the limits of confidence intervals would themselves be uncertain and should have confidence intervals, with uncertain limits... ad infinitum. This idea gets us nowhere.”
This is actually one of the shortcomings (or flaws) of the concept of confidence intervals. In fact, “the limits of confidence intervals are themselves uncertain” because each limit involves a random quantity (e.g. the sample standard deviation); this is called the ‘uncertainty’ of the uncertainty in measurement science. But people seem to ignore this ‘uncertainty’, thinking of a “confidence interval” as “exact”, similar to thinking of a p-value as “exact”.
(4) On Emmanuel Curis’ statement: “On the contrary, if I understand your exceedance probability correctly, it is a theoretical, well defined, but unknown single value (p( XA > XB ) can be computed from the law of XA and XB; there are two random variables here also, but they are observed simultaneously and play the same role, hence the result is really a number, not a random variable here), and you are estimating this single value with your samples. Hence confidence intervals are defined and should be used to interpret the estimation.”
Your understanding of the exceedance probability (EP) is correct. However, I think your statements apply to p-values as well. Please refer to Section 5.2 (Comparison with the z-test and t-test) of my paper on the exceedance probability (EP) analysis (https://journals.uregina.ca/jpss/article/view/513). In that paper, I discussed two types of problems: (a) assessing the difference between the two samples XA and XB, and (b) assessing the difference between the two sample means. In fact, for problem (b), there is a relationship between EP = Pr(sample A’s mean > sample B’s mean) and the one-tailed p-value produced by a two-sample z-test or t-test. Because the t-test approaches the z-test when the sample size is large (say, n > 30), the p-value produced by a one-tailed t-test will be approximately equal to 1 - Pr(sample A’s mean > sample B’s mean). That is, for the one-tailed two-sample t-test, p-value ≈ 1 - EP. Therefore, the uncertainty associated with an EP value will be about the same as the uncertainty associated with the corresponding p-value. So if you think you can build a confidence interval for the EP, you can build one for the p-value as well.
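A sketch of the repeated-sampling simulation described in point (2), with illustrative normal populations and sample size; it simply shows how widely p(sample) varies across replications:
set.seed(123)
mu1 <- 0; mu2 <- 0.5; sigma <- 1; n <- 20          # illustrative population parameters
p_sample <- replicate(10000, {
  x1 <- rnorm(n, mu1, sigma)
  x2 <- rnorm(n, mu2, sigma)
  t.test(x2, x1, alternative = "greater")$p.value  # one-tailed two-sample t-test
})
quantile(p_sample, c(0.025, 0.5, 0.975))           # the realized p-value spreads widely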
Hening, you still seem to confuse the concepts of "random variable" (RV) and "realization (of a random variable)". The former is modelled using a statistical model, the latter represents observations. The RV does have a (probability) distribution; a realization does not have a distribution. Anything for which you already have values is NOT a RV and does NOT have a distribution or an uncertainty.
Hening Huang : beware not to confuse uncertainty (and its different origins), random variability and variance/standard deviation - and random values and their realizations.
For the sample mean for instance : you have a set of n values x1 to xn; you compute the arithmetic mean m = (x1 + ... + xn) / n. There are two kinds of uncertainties, roughly:
- uncertainty because, had you done another experiment, you would have obtained n other values, x1* to xn*. This is random variability and is what is modelled and taken into account by usual statistics, by saying that the xi are realisations of random variables Xi, and hence m is the realisation of the (estimator) random variable M. Random variables (usually) have a standard deviation, which is taken as a measure of random variability - let's say uncertainty of kind 1
- uncertainty because the value x1, xn you have are rounded, or the measurement apparatus is biased or... that makes that the observed values are in fact not the real values, but approximations of them (see my answer to Massimo Sivo ). In general, it is assumed that these errors are negligible compared to the random variability; to some extent, they are also included in the random variability model (but more complex models exist to handle this more explicitly; that's not the case of usual statistics), let's say it is uncertainty of kind 2
When we say that realizations of sample statistics, like m (or p), have no uncertainty, it is because they are numbers, hence they do not have a standard deviation (or have standard deviation = 0 if you want): no uncertainty of kind 1. And one assumes uncertainty of kind 2 is negligible. In particular, m and p do *not* have uncertainties; M and P (their associated random variables) do have a standard deviation and a distribution, allowing one to make intervals and so on. But not directly confidence intervals.
Note that uncertainty of kind 2 is taken into account by none of the discussed methods (confidence intervals, p-values, exceedance probability [which is not an odds ratio]...) and could be seen, roughly, as the question "how many significant digits are in the result?".
For the sample mean: the model says that Xi has an expectation value µ and the question is about µ. The law of M allows one to construct a confidence interval for µ. But note that m and µ are very different things, and that M is just an estimator of µ.
Now, for the p-value case: what is the equivalent of µ? What does p or P estimate? You say that you can compute it theoretically, before the experiment. But it's not possible, with the definition of the p-value, please see my answer to Massimo for details! So p does not estimate anything, hence the impossibility to define any confidence interval about what would p estimate.
Just try the computation, and if you manage to do it even in a very simple case, I will think about it and may change my opinion.
Salvatore S. Mangiafico, Jochen Wilhelm, Wim Kaijser, Rainer Duesing, Gang (John) Xie, Hening Huang, Emmanuel Curis
Rainer
You bet and win, for the time being, because I want you to know that Deming said:
[1.] "Beware of common sense"
[2.] "experience alone, without theory, teaches nothing what to do to make Quality"
[3.] "The result is that hundreds of people are learning what is wrong. I make this statement on the basis of experience, seeing every day the devastating effects of incompetent teaching and faulty applications."
Salvatore, the true control limits of the Individuals Chart, using the Exponential distribution, are
LCL = 103
UCL >> 1000
I will show the way to compute them by tonight.
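As a generic sketch only (not Massimo Sivo's calculation, which is not shown here), probability-based limits for an Individuals Chart under an exponential model can be taken as the 0.135% and 99.865% quantiles of the fitted exponential distribution. The mean theta below is an assumed illustrative value, not the Montgomery data, so the resulting numbers are not meant to reproduce the limits quoted above.

import numpy as np

theta = 200.0        # assumed mean time between events (illustrative, not the Montgomery data)
alpha = 0.00135      # same tail areas as the 3-sigma limits of a normal-theory chart

lcl = -theta * np.log(1 - alpha)   # 0.135% quantile of the exponential distribution
ucl = -theta * np.log(alpha)       # 99.865% quantile of the exponential distribution
print("LCL =", round(lcl, 2), " UCL =", round(ucl, 2))   # note how asymmetric the limits are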
Massimo Sivo, that is the whole point. I am pretty sure that I am wrong about a lot of stuff and have many things to learn. I would love to improve and correct myself, but this is not very fruitful if someone only makes insinuations, claims to have the truth, and does not give enough information to follow his/her position.
So, I am very curious, and I would be happy to give some money to a charity organisation. (Technically I won a bet and may lose a bet... but hey, it's for a good purpose; I'll spend it anyway, if sufficient information is provided ;-) )
Hening Huang : as far as I understand it, your formula 15 in the paper, which « shows » the link between EP and p-values, is wrong, because the definition of the p-value is wrong.
The p-value is the probability of observing the observed test criterion, or any value more extreme, under the null hypothesis. Hence, for the one-sided (on the low side) Z-test for the mean, with H0: µ = µ0, the p-value is Pr( (M − µ0)/(sigma/sqrt(n)) ≤ (m − µ0)/(sigma/sqrt(n)) ), computed under H0.
Salvatore S. Mangiafico, Jochen Wilhelm, Wim Kaijser, Rainer Duesing, Gang (John) Xie, Hening Huang, Emmanuel Curis, Daniel Wright,
Here I am, as I promised.
In my previous post, I misremembered a value…
Here you find the right numbers.
Document attached.
Remember that:
See the journals publishing wrong papers on TBE (Time Between Events) data.
Emmanuel Curis
I would like to respond to your comment: "as far as I understand it, your formula 15 in the paper, which « shows » the link between EP and p-values, is wrong, because the definition of the p-value is wrong. The p-value is the probability of observing the observed test criterion, or any value more extreme, under the null hypothesis. Hence, for the one-sided (on the low side) Z-test for the mean, with H0: µ = µ0, the p-value is Pr( (M − µ0)/(sigma/sqrt(n)) ≤ (m − µ0)/(sigma/sqrt(n)) )."
"More extreme" means "more extreme under the Null". The extremeness is measured as the difference between the estimated value m and the hypothesized value µ0: µ0 − m. This is an observed value. The inference is based on assuming that m is a realization of a RV M with an approximate normal distribution with E(M) = µ0 and V(M) = σ²/n (where σ² = V(X)). For this assumed distribution of M, we can calculate Pr(M ≤ m), which is the one-sided (low side) p-value.
Salvatore S. Mangiafico, Jochen Wilhelm, Wim Kaijser, Rainer Duesing, Gang (John) Xie, Hening Huang, Emmanuel Curis, Daniel Wright,
· I AM VERY SORRY.
· I POSTED the file in a SECOND post just after the first.
· BUT unfortunately the system did not send the file to you.
Please forgive me.
======================================
Here I am, as I promised.
In my previous post, I misremembered a value…
Here you find the right numbers.
Document attached.
Remember that:
· there is on ResearchGate a professor with 176 publications and 7133 citations,
· with MANY WRONG papers,
· whose methods find the Montgomery data In Control, WHILE they are Out Of Control.
See the journals publishing wrong papers on TBE (Time Between Events) data.
Solution of the Montgomery Control Chart.docx