Let's consider the (standard) 2-sample t-test for a difference in means. The t-value is calculated as the empirical difference in sample means divided by the SE of this difference (in turn derived from a pooled variance estimate). Therefore, by construction, the test is sensitive to differences in location, but not in dispersion. The p-value is calculated from a t-distribution (with n1+n2-2 d.f.).
As I understand it, this t-distribution is derived for one single condition: the data from both groups is sampled from *one* population with normal distribution. The normal distribution is often called an "assumption", and the fact that both samples are from the same population is called the "null hypothesis" (H0).
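A minimal sketch (in Python, with made-up numbers) of the computation just described, i.e. the pooled two-sample t-test:

    # Pooled two-sample t-test, computed by hand and checked against scipy (hypothetical data).
    import numpy as np
    from scipy import stats

    x1 = np.array([5.1, 4.8, 5.6, 5.0, 4.9])    # group 1 (made-up values)
    x2 = np.array([5.9, 6.2, 5.7, 6.0, 6.4])    # group 2 (made-up values)

    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)  # pooled variance
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))         # SE of the difference in sample means
    t = (x1.mean() - x2.mean()) / se
    p = 2 * stats.t.sf(abs(t), df=n1 + n2 - 2)    # two-sided p from the t-distribution with n1+n2-2 d.f.

    print(t, p)
    print(stats.ttest_ind(x1, x2, equal_var=True))  # same result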
Is there any reasonable argument to assume that under H0(!) the two samples are coming from two different populations, with possibly different variances but the same means?
My question has two aspects:
I wonder if such a test is justified when we believe/know that the samples must have been taken from different populations (that might have the same mean or not, but they surely have different variances). What is the rationale behind judging differences in mean values when I seem to compare apples and peaches anyway? (Just as a side note: often the difference in variances under H0 can be explained by inhomogeneities within one of the groups; this is for instance often observed in studies with diseased and control animals, where the diseased group suffers several side-effects increasing the variability of the response, but not necessarily the mean. Wouldn't it be more appropriate, if possible, to adjust for these indirect effects instead of simply "assuming different variances"?) (And I know that if we ignore all these logical things there is the Welch correction.)
The second aspect is: since the p-value relates specifically to H0, only the conditions under H0 are relevant. Right? Again a typical example from biomedical research: mean and variance of concentrations are usually correlated; the higher the observed mean, the higher the observed variability. I know that a log-normal model or a gamma GLM with log link is most appropriate here for analysis, but I am asking about a simple t-test again (considering the violation of the normal-distribution assumption negligible!). The observed differences in variances are related to different (sample) means. And under H0 (!) I think it is justified to assume equal variances (otherwise see the previous paragraph). Having said this, it follows that the t-test (without Welch adjustment) would be perfectly fine, although the sample data has apparently very different variances in the groups. The problem might be to get a good estimate of the SE. Using a pooled estimate might result in unnecessarily low power, but nothing could go so wrong as to accidentally inflate the type-I error rate. Right?
The same questions apply to the "non-parametric" alternative, the Wilcoxon test. The p-value here is again derived under the assumption that both samples are taken from the *same* population (which doesn't need to have a normal distribution). It is often stated that this is a test of location shift (equality of the medians) if and only if all other moments of the distributions are the same. Again I wonder if H0 does not automatically and necessarily imply that all moments must be identical, since there is only one distribution under H0.
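To check that last claim for myself, here is a rough simulation (a sketch only; the gamma model, sample sizes and shape value are my own choices). Because H0 is true, both groups share one distribution, so the pooled t-test's equal-variance assumption holds and the rejection rate should stay near the nominal 5%:

    # Type-I error of the pooled t-test when variance is tied to the mean but H0 is true.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n1, n2, alpha, reps = 10, 10, 0.05, 20000
    shape = 4.0                 # fixed shape: variance = mean^2/shape, so variance follows the mean
    mean = 10.0
    scale = mean / shape

    rejections = 0
    for _ in range(reps):
        x1 = rng.gamma(shape, scale, n1)
        x2 = rng.gamma(shape, scale, n2)     # H0 true: same distribution in both groups
        if stats.ttest_ind(x1, x2, equal_var=True).pvalue < alpha:
            rejections += 1

    print("empirical type-I error:", rejections / reps)   # close to 0.05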
Jochen,
I'm going to zero in one point for now:
"(considering the violation of the normal-distribution assumption is negligible!)"
I'm not sure that it is negligible, if the mean and variance are associated. I believe one of the characteristics of the normal distribution is that they are not.
Am I missing something?
Pat
Patrick, if you get a headache thinking about the normal distribution, then go to the last paragraph. Apart from this, you can imagine a gamma distribution with a fixed scale parameter. The variance is given by (shape)*(scale)². For larger values of the shape parameter, the distribution approximates the normal distribution. You may select a value for "shape" where the difference to a normal distribution is negligible for you.
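A small sketch of that point (the shape value below is my own choice):

    # A gamma distribution with a large shape parameter is close to a normal
    # distribution with the same mean and variance.
    import numpy as np
    from scipy import stats

    shape, scale = 100.0, 1.0                      # mean = shape*scale, variance = shape*scale^2
    mean, sd = shape * scale, np.sqrt(shape) * scale
    x = np.linspace(mean - 3 * sd, mean + 3 * sd, 200)
    gamma_pdf = stats.gamma.pdf(x, a=shape, scale=scale)
    normal_pdf = stats.norm.pdf(x, loc=mean, scale=sd)
    print("max pdf difference:", np.max(np.abs(gamma_pdf - normal_pdf)))
    print("gamma skewness    :", 2 / np.sqrt(shape))   # shrinks toward 0 as the shape grows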
"I wonder if such a test is justified when we believe/know that the samples must have been taken from different populations (that might have the same mean or not, but they surely have different variances). What is the rationale behind judgeing differences in mean values when I seem to compare apples and peaches anyway?"
It is justified, and chemical engineers use it all the time. Say you have two DIFFERENT dyes (so they do not come from the SAME distribution). You want to compare their drying time. And the factory that produces the second dye is known to use tighter process control, so the quality variation between their dyes (variance) is smaller. So to sum up, the scenario is dye 1 ~ N(mu1, var1) and dye 2 ~ N(mu2, var2). Now the question we are interested in here is, e.g., whether mu1 > mu2 (because money matters!). In this case, we are well justified to use a t-distribution (given that their populations are normally distributed). I guess your previous statement is the one which may be the cause of confusion:
"As I understand, this t-distribution is derived for one single condition: the data from both groups is sampled from *one* population with normal distribution."
As far as I know, this is not an assumption of the t-distribution. Actually, this has nothing to do with the t-distribution as such (once the sample size gets up to around 40, we may use the normal distribution instead), but with the sampling distributions.
When we use H0: mu1 = mu2 and z (or t) = (x1bar - x2bar) / SE(x1bar - x2bar) as the test statistic, with SE(x1bar - x2bar) = sqrt(s1²/n1 + s2²/n2), we do not need to make the assumption that both groups come from the SAME population. But E{z} = 0 and sd(z) = 1, as long as x1 and x2 are iid within their groups. Actually x1 may be Gaussian distributed while x2 is uniformly distributed (of course a large sample size is required for the normality of x2bar in this case).
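A rough check of that statement (my own illustration; the sample size and distributions are arbitrary choices):

    # Standardized difference of sample means with one Gaussian and one uniform group,
    # equal population means: approximately N(0, 1) for reasonably large samples.
    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 100, 20000
    z = np.empty(reps)
    for i in range(reps):
        x1 = rng.normal(0.5, 1.0, n)            # Gaussian group, mean 0.5
        x2 = rng.uniform(0.0, 1.0, n)           # uniform group, mean 0.5
        se = np.sqrt(x1.var(ddof=1) / n + x2.var(ddof=1) / n)
        z[i] = (x1.mean() - x2.mean()) / se
    print(z.mean(), z.std())                    # close to 0 and 1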
I hope I have understood your question correctly.
Jochen,
Gotcha. I think the rest of the question is above my level!
Pat
Jochen. I have researched econometric data (land distribution, incomes, ...) where the total is aggregated from several regions. Each one usually has a different mean, a different dimensionless structural distribution (usually non-normal), and a different fraction of the population. Give me two-sector data made of 1) the mean and 2) a deciles table for each one, and I can produce a paper with graphs for each sector, and for the total universe, expressed as means as fractions of the overall mean. My opinion is negative about your question "Is there any reasonable argument to assume that under H0(!) the two samples are coming from two different populations, with possibly different variances but the same means?". This is more a problem of set theory than one of Bayesian premises and hypothesis testing. Thanks, emilio
Maybe I have not understood the problem. But the data from both groups is not sampled from *one* population with a normal distribution. It is sampled from two populations with normal distributions. The variances can be known or not. This is a t-test for two independent samples. In addition, you might not know the variances but know that they are equal.
Hi Jochen,
As far as I understand your question, you are testing for two different things.
Standard statistics look for differences in the mean, assuming different populations have different averages. However, independent variables can affect the mean of the response variable or its variation. These variables are called fixed effects (acting on the mean) and random effects (acting on the variance). Student's t-test is designed to test for differences in the mean, hence its assumptions. In your case, if you want to test whether two different samples have different variations, you could test for significance of the random factor. There is an interesting paper on this issue: Bolker BM, Brooks ME, Clark CJ, Geange SW, Poulsen JR, et al. (2009) Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution 24: 127-135.
We have a paper in press testing for factors affecting variance, not the mean, but I can not supply it because it is under press embargo until publication. Hope this helps.
It would be helpful, for my understanding, to know what the research question and population-variables are. The description stretches my ability to reason abstractly.
Given that you wish to correct for (multiple?) possible covariates, one is forced away from the conventional t-test and into multiple linear regression analysis of some sort... which would allow both populations to be corrected for several possible confounding factors, and you first study these factors under H0. The interaction between a possible confounding factor (location/age/placebo) and your "mean" may be present in one population/location and absent in the other, which leaves you with much more valuable information than the blunt total difference in variance and/or mean.
I'm not sure quite what you mean.
In the case of the t-test, the probability model for the test statistic is wrong if the variances aren't equal. This can increase or decrease Type I error rates in some situations - notably when there is unequal n.
For the Wilcoxon test, the Type I error rate must also be able to rise or fall if the shapes of the distributions differ, because you can (albeit under extreme situations) get circularity where A > B and B > C but A < C.
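A rough simulation of the t-test point above (sample sizes and SDs are arbitrary choices; here the smaller group gets the larger variance, which is the liberal direction):

    # Type-I error of pooled vs. Welch t-test under H0 with unequal variances and unequal n.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    reps, alpha = 20000, 0.05
    n1, n2 = 5, 25
    sd1, sd2 = 3.0, 1.0          # small group has the large variance

    pooled_rej = welch_rej = 0
    for _ in range(reps):
        x1 = rng.normal(0.0, sd1, n1)
        x2 = rng.normal(0.0, sd2, n2)            # H0 true: both means are 0
        if stats.ttest_ind(x1, x2, equal_var=True).pvalue < alpha:
            pooled_rej += 1
        if stats.ttest_ind(x1, x2, equal_var=False).pvalue < alpha:
            welch_rej += 1

    print("pooled t:", pooled_rej / reps)   # typically well above 0.05 in this setting
    print("Welch t :", welch_rej / reps)    # close to 0.05

If the larger group carried the larger variance instead, the pooled test would turn conservative rather than liberal.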
I will hazard an answer.
First, if we are comparing samples from two definitely different distributions, the t-test is not warranted, although it is used almost always. Our null hypothesis is that the two samples come from two normal distributions which share the same mean and variance (Welch's test if they share the same mean but not the variance). If we know that the distributions are different in ways that are not reflected by these parameters, then we are certainly violating one of the various assumptions. In practice the t-statistic still represents the observed mean difference over the standard error of that difference, but the p-value will be incorrect in some way. So, the question here is what do you mean by a difference between the two relevant populations? If you "know" them to be different but consider that the null hypothesis is potentially correct because their distributions may be interchangeable, then it is not clear that what you "know" makes the t-test a poor choice. If you know that the distributions are definitively different, then the standard null hypothesis is pretty inappropriate.
Second, the p-value does indeed reflect the probability of your particular observed t statistic under the assumptions of the null. If you are working with distributions in which the mean and variance are correlated, you are not working with the actual null hypothesis of either flavour of the t-test. So the p-value will not represent the truth (although common practice is to assume that it is "robust"). Can one construct a scenario in which type I error would be inflated by making a "mistake" about the assumption of equal variances? This is easiest to do if the sample sizes are not equal, and it can also be constructed if the mean also correlates with skew. It is actually easier to verify using a randomization test than it is to work it out looking at the moments.
Third, I would say that the claim that all the moments of the relevant distributions need to be the same for Wilcoxon is too strong. If I remember correctly, W is derived on assumptions about the ways in which the signs of pairs should be distributed. The easiest case obviously has two identical underlying distributions. However, there would be a number of other cases in which our lack of information of those distributions would lead to the same decisions about the distribution of signs. So the null hypothesis is pretty broad in this case.
Hope this helps.
Hi Jochen,
you mentioned the Wilcoxon test; the signed-rank version is indeed a test on the location parameter of samples coming from the same population with a symmetric distribution around the true median value; moreover, it is assumed that both datasets are collected randomly and independently. The Wilcoxon test performs far better than the t-test whenever the distribution is not Gaussian, at the cost of slightly worse performance for Gaussian data.
However, I understand that you want to test whether two samples share the same distribution. If you can assume that the samples have continuous distributions, then the two-sample Kolmogorov-Smirnov test might be what you are looking for, since no assumption needs to be made about either the scale or the shape parameters.
I agree with Dr. Vagheggini
For comparing two independent population means we use:
a) for parametric samples, the t-test for two independent samples;
b) for nonparametric samples, either the Wilcoxon rank-sum test / Mann-Whitney U test or the Kolmogorov-Smirnov test.
For comparing two dependent population means we use:
a) for parametric samples, either the paired t-test or repeated-measures ANOVA;
b) for nonparametric samples, the sign test, Wilcoxon's matched-pairs test, or Friedman's two-way analysis of variance.
I love it when people defy usual and "solid" knowledge! The problem is always in the things "we know"...
But let me say that there are some minor problems in your question. First, and this is a very common misunderstanding, the normality assumption is not an assumption about the distribution of the data, but about the distribution of the mean. Although data from a normal distribution always produce normally distributed means, there are a lot of other data distributions that generate normally distributed means as well, depending on the sample size. At least, that is what the central limit theorem tells us...
The t-test is a test to compare means. We usually associate equal means with equal populations, because means are the expected values. But the test only addresses the means. On this point, you are completely right! If you have equal means and different variances, you don't have the same population.
Finally, the use of the standard t-test (assuming equal variances) is not a good choice. If you have equal variances and sample sizes, the result will be the same as if you apply Welch's t-test. But if you have different variances and/or different sample sizes, Welch will perform better. So why should anyone use the standard t-test?
In summary, I think the major problem here is with the interpretation of the t-test, attributing the differences to the populations as a whole and not just to the means.
(Classical) Statistics starts from a (statistical) model for the data. The model should be realistic in the sense that it should account for data variability.
For example, saying that data X1,...,Xn are IID N(mu, sigma²) is such a model.
In the frame of the model, we may be willing to test an hypothesis H0.
Using some principle, e.g. likelihood principle, we may get a test statistic and, eventually, a p-value to understand if the data and the hypothesis are compatible or conflicting.
The p-value is a random variable that, under H0, is uniformly distributed in (0,1). If the test is not biased and H0 is false the distribution of p-value is stochastically smaller than the uniform distribution, i.e. it is closer to zero.
Changing the kind of data to be analyzed often implies a change in the statistical model and using the same test statistic may be misleading for two reasons:
1) the distribution of the test statistic under H0 is different and old formula for p-value is now wrong
2) the behavior when H0 is false may be unsatisfactory (low power or bias)
Testing the hypothesis H0 stating the equality of the means of two independent populations with different distributions and finite variances is easily done in large samples using the central limit theorem. On the other hand, one has to ask whether this H0 has a practical meaning, since comparing the expectations of differently shaped populations is known to be a dangerous exercise.
Hypothesis testing involves the careful construction of two statements: the null hypothesis and the alternative hypothesis. These hypotheses can look very similar when written down, but actually occupy positions in our hypothesis test that are not on the same footing. How do we know which hypothesis is the null and which one is alternative? There are a few ways to tell the difference.
The alternative or experimental hypothesis reflects that there will be an observed effect for our experiment. In a mathematical formulation of the alternative hypothesis there will typically be an inequality, or not equal to symbol. This hypothesis is denoted by either Ha or by H1.
From: http://statistics.about.com/od/Inferential-Statistics/a/The-Difference-Between-The-Null-Hypothesis-And-Alternative-Hypothesis.htm
Although others have referred to the issues surrounding hypothesis testing, I think it's important to realize more than just the nuances and complexities associated with the various methods used. Rather, criticisms range from any test that relies upon functions of mean deviation to the use of hypothesis testing itself (in the "alpha level not reached, accept null / alpha level reached, reject null" form). To that end:
Taagepera, R. (2008). Making Social Sciences More Scientific: The Need for Predictive Models. Oxford University Press.
Hubbard, R., & Lindsay, R. M. (2008). Why P values are not a useful measure of evidence in statistical significance testing. Theory & Psychology, 18(1), 69-88.
(http://wiki.bio.dtu.dk/~agpe/papers/pval_notuseful.pdf)
The Cult of Statistical Significance
(http://www.deirdremccloskey.com/docs/jsm.pdf)
Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic bulletin & review, 1-8.
How many discoveries have been lost by ignoring modern statistical methods?
(http://www.unt.edu/rss/class/mike/5700/articles/How_many_discoveries_wilcox.pdf)
Mindless Statistics
(http://people.umass.edu/~bioep740/yr2009/topics/Gigerenzer-jSoc-Econ-1994.pdf)
Hail the impossible: p‐values, evidence, and likelihood
(http://www3.nd.edu/~sjones20/JonesUND/BioStats_files/HailTheImpossible.pdf)
How to confuse with statistics or The use and misuse of conditional probabilities
(http://projecteuclid.org/download/pdfview_1/euclid.ss/1124891288)
The null hypothesis: closing the gap between good intentions and good studies
(http://217.219.214.30/documents/10129/44727/6.pdf)
The "significance" crisis in psychology and education
(http://laits.utexas.edu/cormack/384m/homework/Journal%20of%20Socio-Economics%202004%20Thompson.pdf)
We Agree That Statistical Significance Proves Essentially Nothing
(http://sites.roosevelt.edu/sziliak/files/2013/01/Statistical-Significance-Ziliak-McCloskey-Rejoinder-to-Mayer-EJW-2013.pdf)
Dear Jochen,
Your questions are relevant and fundamental. I think you have understood them quite well, yourself, so you only need some support and confirmation.
As you say, H0 of the two-sample t-test specifies that the two samples come from the same population. This H0 is adequate for a first test more often than some textbooks suggest (when they demand testing the variances first and, if not equal, selecting Welch's t-related test). For example, if we try a new treatment versus control, this H0 can be formulated as "there is no difference between new treatment and control", i.e. "their responses are equivalent". If the new treatment is effective, on the other hand, it may of course affect both mean and variance, but we want a test sensitive to the mean. Thus when you see the two samples differ in both mean and variance, this observation alone is no argument against the H0 above. You have an example of diseased and control animals, and if you formulate H0: "no effect of disease on the response", then H0 is of the type above and the t-test is adequate.
In some practical situations the H0 above is considered too strong - the scientist believes the variances would be different even if the mean value were unaffected. What does a difference in mean values then mean? This is your apples and peaches problem. Should we advocate a treatment that is beneficial on average but carries some risk of a much worse outcome? The scientist should think carefully about H0.
On the other hand, after the t-test has rejected H0 and we go for confidence intervals, we may be faced with the dilemma of both means and variances being different. This brings us to your second aspect. It would improve the inference and increase power, if we could specify how the variance depends on the mean. It would both improve and simplify if we could find a transformation of the response variable that yields (more) constant variance. The log transform is typically worth serious consideration when variance seems to increase with the mean value (and all response values are >0). Such a transformation should then naturally be made already before the first t-test.
Best wishes,
Rolf Sundberg
I'm sorry I couldn't make this reply shorter, but I tried to touch all of Jochen's points.
Sometimes the name things go by is not completely indicative of what is actually going on.
For instance, "coming from the same population" actually means "the variable under study has the same distribution in both groups" (hence both groups from a single population in the sense that you could sample from the union of both groups using a single distribution).
---
Similarly, the test for "different variances" works perfectly fine for equal variances. It does not make the assumption that the variances must be different for the test to make sense. The moniker only suggests that you should use the test for equal variances if you are confident that the variances are equal. (Thus using the Welch test is not necessarily inferior to `making adjustments' if variances are similar, like in your example.)
And why is that? There are two reasonable arguments. Let n,m be the sample sizes.
1) If the variances are equal, the estimate of them used in the denominator of the t-test will presumably be superior to the one in the Welch test. That is because, if variances are equal, all your data come from the same distribution and you can put both samples together forming an n+m sample. Welch uses n data to estimate one variance, and m data to estimate the other. If they are truly equal, this is inefficient.
2) The sampling distribution of the Welch statistic is not really a Student t. It is a more complicated distribution which is approximated by a t (the number of degrees of freedom to be used is estimated using the method of moments). I am not sure how large the deviation from the true distribution is, but I seem to recall coming across samples with similar sample variances for which the number of degrees of freedom for Welch is larger than for the t statistic. In view of argument (1), the notion that a statistic using an inferior estimate would provide you with shorter confidence intervals is really dubious, so possibly there can be a relevant deviation at times.
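For reference, a sketch of the Welch-Satterthwaite approximation that is usually behind those estimated degrees of freedom (the sample variances and sizes below are hypothetical):

    import numpy as np

    def welch_df(s1sq, n, s2sq, m):
        # Approximate d.f. for the Welch statistic from sample variances and sizes.
        num = (s1sq / n + s2sq / m) ** 2
        den = (s1sq / n) ** 2 / (n - 1) + (s2sq / m) ** 2 / (m - 1)
        return num / den

    print(welch_df(4.0, 10, 4.1, 12))   # similar variances: d.f. close to n + m - 2 = 20
    print(welch_df(4.0, 10, 25.0, 12))  # very different variances: markedly fewer d.f.

With this formula the approximate d.f. can never exceed n + m - 2, which is in line with argument (1).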
---
If you test the difference of means while positively knowing that the variances are different, you are no longer able to conclude that both groups form "a single population". In that case, the motivation of the test can only be that you are interested in the mean difference by itself (examples were given above), even if that tells you little as to how the distributions compare as a whole.
In the t test, and in the Wilcoxon / Mann-Whitney test, if you reject H0 (say, a one-sided H0) you are able to conclude that e.g. P(X>Y) > 1/2 and that P(X>=a) > P(Y>=a) for every a. If you know variances to be different, you cannot get that. In some situations that weakened conclusion is unsatisfactory (e.g. you conclude that a new drug is better "on average" but it remains possible that, for most patients, the old drug outperforms the new!). Averaging the effect over different patients may then make very little sense.
How much easier is the Bayesian approach! Obtain the posterior of the mean mu1 from group 1 (could be a t-distribution) and of mu2 from group 2, and calculate Pr(mu1 > mu2), typically easily done by simulation. This calculation is similar to a one-sided p-value but without the assumption of equal variances (and in the case of two t-distributions it is equivalent to, but much easier than, the Behrens-Fisher approach).
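A minimal sketch of that simulation (my own illustration, with made-up data and flat priors, under which each mean's posterior is a location-scale t):

    import numpy as np

    rng = np.random.default_rng(3)
    x1 = np.array([10.2, 11.1, 9.8, 10.5, 10.9, 11.4])   # hypothetical group 1
    x2 = np.array([9.1, 9.9, 8.7, 9.5, 10.3, 9.0])       # hypothetical group 2

    def posterior_draws(x, size):
        # posterior of the mean under a flat prior: mean + (s/sqrt(n)) * t_{n-1}
        n = len(x)
        return x.mean() + (x.std(ddof=1) / np.sqrt(n)) * rng.standard_t(n - 1, size)

    mu1 = posterior_draws(x1, 100000)
    mu2 = posterior_draws(x2, 100000)
    print("Pr(mu1 > mu2) =", np.mean(mu1 > mu2))

The flat priors are only for illustration; informative priors would slot into the same calculation.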
Thank you all for your participation!
Rolf, thank you for your kind support. This is exactly what I meant to ask, just put in a better wording and more understandable form.
@Pedro and others: regarding the "coming from the same population" statement: Suppose I have blood pressure measurements of humans. In this case, mankind is the population. I could use a (random) sample and estimate the population mean. Now, following some research idea, I might suspect that the blood pressure depends on the sex. So I regard my data as coming from two (sub-)populations: females and males. When I use a t-test to compare these two samples, the p-value gives the probability of the actual or more "extreme" data under the assumption that sex is completely unrelated to blood pressure. The additional assumption of the (standard two-sample) t-test is that the data in both sub-populations are normally distributed with the same variance. Now take these things together: both distributions (for females and for males) are normal, and they have the same variance. Under H0 the means are equal. But the normal distribution is *completely* defined by the mean and the variance, so if you have two normal distributions with the same mean and variance, there is no way to distinguish these populations. They are mathematically identical. This is why I stressed this "single population" concept. This was probably misleading.
This concept runs even a little deeper: if I had to think of males and females as different populations anyway, what insight would a difference in mean blood pressures of these populations give then? Or consider a drug treatment instead of the sex: if I can, a priori, consider that the treated and the untreated patients are two different populations anyway, finding differences in some population parameters wouldn't be really interesting. The other way around: it would be silly to expect that the populations had identical means anyway, so finding a "significant" difference is simply a matter of power. (I know that a test of H0: |mu1 - mu2| <= delta could be performed, with delta being a minimum relevant difference, but this is absolutely untypical in my field and a reasonable value of delta typically cannot be provided ("any difference would be interesting", the researchers say).)
I know that the power may be suboptimal when the t-test is applied to data under HA where not only the means are different (but also other moments like variance and skewness). In the research practice I know, this is typically not addressed at all; no power is calculated anyway. You take the data you can afford to acquire and you hope to see something. (Ideally) only "significant" findings are interpreted (accidentally and wrongly, sometimes also non-significant findings are interpreted, I know) and non-significant findings are ignored (left uninterpreted). This way the type-I error rate (but not a type-II error rate) is controlled. Well, that is actually my question: is the type-I error rate controlled if the assumptions are violated, when these violations are practically negligible under H0 but not anymore under (the actually observed) HA? Tests reject H0; this is not related to HA, so only the conditions under H0 should be relevant for rejections (but not for "accepting H0", where the power and the HA play an important role - I am only talking about rejections and type-I errors).
Frank, yes, Bayesian testing is easier ... to say.
In practice, you need the posterior, which is easy to define but, in many cases hard/impossible to compute in closed form.
Of course you may use simulation ... but it is a step further from easiness.
Jochen, I like your rejoinder from 22 min ago.
Considering the blood pressure example, I can suspect that the overall distribution is in fact a mixture of Gaussians (for sex, age, alimentation style, etc.).
One wants to test if there is "a shift" of the distribution when moving from one subgroup to the other (sex or treatment).
If the shift is approximately well represented by the mean, the point is: what is the approximate (asymptotic) distribution of an appropriate test statistic (e.g. the difference of sample means) under H0? In large samples, variances may be estimated from the data without loss of information about the mean, so homoscedasticity is not really an issue.
Not too far from Frank's Bayesian posterior probability. Here no a priori statement is necessary.
Jochen,
I was just confused by your initial reference to “one population”. Your recent rejoinder allows me to better understand the situation you wanted to depict. This falls in the frame of some reasoning that I had published in some of my publications that one can find on ResearchGate.
From my (narrow) viewpoint of issues in statistics, the one-population is one of the red herrings in statistics, a myth (necessary oversimplification) that is basically a tautology: a population is single when it is single. It is similar to the definition of “repeatability condition” requiring conditions be the same “over a short period of time” where “short” actually means ‘short enough so that the repeatability condition applies’. Consequently, this condition should first be proved (tested).
Secondly, your “apples and peaches” concern seems to me ill-based, as your last statements demonstrate: male and female are the >same< population if you want to test some properties of the humankind, while, singularly taken, they would be two different populations, male and female, in general with different statistical parameters. I used exactly the same example, which I called the ‘camel distribution’, for the distribution of height of adult humankind.
It is a striking, self-evident example of a 'mixture distribution', the >sum< of distributions. In measurement, it is a situation that almost always happens when analysing data obtained by fusion of data series of different origins, which are therefore generally affected by fixed effects: actually, I always prefer to assume 'mixed effects', since the possibility that different sub-populations have exactly the same variance is an exception rather than the rule - apparently also in your case.
Well, I found that statisticians do not, in general, like mixture distributions, and that the latter are generally used in the reverse sense, as far as I understood from textbooks. These sub-populations arise most often in measurement, due to the so-called 'systematic effects' leading to 'systematic errors'. In other fields, their origin and naming may be different. However, it is a type of effect (and error) that statisticians historically have had difficulties treating consistently - certainly in measurement; in my opinion a basic shortcoming and the source of infinite misunderstandings.
Much of the confusion I think comes from mixing Fisher’s statistical test (it’s all about the data, there is no alternative hypothesis and hence no power problem) and the Neyman-Pearson hypothesis testing approach (which is all about the test aiming to make a Yes or No decision on the null-hypothesis balancing Type I and Type II errors). One can’t say that ‘the p-value gives the probability of the actual or more "extreme" data under the assumption that sex was completely unrelated with blood pressure’ (Fisher) and that ‘finding a "significant" difference is simply a matter of power’ (Neyman-Pearson terminology). You cannot talk only about rejections and type-I errors, because type I and II are two sides of the same coin. Nowadays, I believe, the Neyman-Pearson framework is not considered very helpful and the Bayesian approach is becoming more popular as the alternative to the p-value mania.
Good point, André. I simply wrote about the power to indicate that my whole problem is *not* related to power. But in fact I walked into the trap of mentioning the control of the type-I error rate, which comes from Neyman's philosophy and is not in line with Fisher's philosophy. However, Fisher's significance test is obviously meant as a safety belt against false rejections of H0. I got trapped because it is (very sadly!) quite a standard in my field that researchers do significance tests by comparing p to a fixed, constant (a priori set?) "alpha" level (typically 0.05) and rejecting H0 whenever p falls below it.
This is the Behrens-Fisher Problem, in frequentist statistics. A classic piece. There are many different interpretations, of which several have been noted above. You can find many descriptions of the Behrens-Fisher Problem around the web, via google.
No, Mary. The Behrens-Fisher problem is about unknown and unequal variances under H0 (and the best-known solution is the Welch/Satterthwaite approximation). My question aimed at something totally different.
Jochen and others
You and the answerers raise important points concerning hypothesis testing and P-values or confidence intervals. A few points should be addressed.
1) The t-statistic concerns the distribution of the variances. Contrast a test that has 25 samples to 25 tests of 25 samples, each. A single 25-sample test will have one variance, the 25 tests will have 25 variances.
2) To show a difference in 2 sets of data it is NECESSARY to have a difference in the means. Various tests (t-test is one) and criteria are used to establish the difference.
3) Data and data distribution are all-important. Simple examination will tell if there is a difference in 2 data sets, especially in answer to a clear question.
4) The statement, “ --- the normal distribution is *completely* defined by the mean and the variance, …” is part of the problem. Once we start talking about the parameters of the assumed distribution, we forget about the data.
5) Power, sample size, and type-I and –II errors generally are used for decision-making and often confuse experimentation. Nevertheless, sample size is important to show a NECESSARY difference.
Now to your philosophical question about ‘assumption’ and your practical dilemma of limited funds and time limiting sample size. Limited funds and time force some assumptions. Such assumptions and limitations should be clearly delineated referencing their adequacy and effect on the question asked (hypothesis?) Do the assumptions change the question?
For point 1, you should always expect different variances when taking samples from the same population. This can be important for point 2 because large differences in the variance indicate a difference in the mean. (A small variance results from samples having nearly the same size, while a large variance has more size differences. There is no expectation that the means will be the same.)
Point 2. For the alternate hypothesis to be different from the null hypothesis it is NECESSARY for the means to be different. Return to the question asked and all the assumptions made, including constraints to sample size. Is NECESSARY also SUFFICIENT? Certainly not, even with the tiniest of P-values. Replication is necessary. The P-value merely states that you have grounds for seeking replication.
Point 3. Look at the data again. Examine it against the question. Is the mean the parameter of interest or does the shape of the distribution (tails?) indicate another way to examine the data?
Point 4. Forget the parameters and go back to the data. If you can safely go with the parameters, still proceed with caution.
Point 5. What is necessary and what is sufficient?
You can find information in our article:
Rasch, D., Kubinger, K.D., and Moder, K. (2011). The two-sample t test: pre-testing its assumptions does not pay off. Statistical Papers, 52(1), 219-231.
Thank you, Dieter. Interesting paper, although I am not really aware how it relates to my question ;)
OK, hypotheses you can postulate always and everywhere. For a test, the hypotheses are made before planning the experiment. What you call H0 and HA is up to you, but often H0 means that nothing is happening (e.g., differences are zero). Then the analysis is often simpler. Once the hypotheses are formulated, you should not change them.
A good alternative is sequential testing.
Well… If you do not reject H0, then you just cannot tell anything…
Under H0 there are two main possibilities:
-They are from the same population
-They are from different populations
Never try to draw any conclusion from a situation where you do not reject H0…
May I tell you something I saw in a lab? There were 3 samples per group and a statistical test according to Mann-Whitney… and then of course there wasn't any significant difference (naturally; it is impossible to reach significance under such conditions), and the lab director drew conclusions from it… Well, I could have told her the ratio could have been one per billion and the statistics still wouldn't have been significant…
Yes, Benjamin, interpretation of non-significant results is a very common mistake. In my field this mistake is made most frequently in two-factorial experiments, for instance where the differential response to a drug under two conditions (A and B) should be evaluated. Typically, the authors show that the drug effect is non-significant under A but significant under B, and so they conclude that the treatment is influenced by/related to the condition. This conclusion is based on an implicit interpretation that the "non-significant" finding indicates the absence of the drug effect, which is nonsense. The correct analysis would be to test the interaction. But many people do not understand the concept of interaction, and the biggest problem finally is that they would not know where to put a star in the barplot. Brrrr.
(the linked paper is full of examples: see Fig 3-5)
http://www.nature.com/pr/journal/v71/n5/full/pr201215a.html
Jochen,
Absolutely agreed. The way I usually highlight it to people is asking:
"Say A is p=.04, and B is p=.06. Do you *really* want to argue that the effect of A is significantly stronger?"
Pat
And many people I know would answer: "Yes, for sure!" And after a while of thinking: "Well, not 'significantly', maybe... but there is a trend".
There's a good reason why I prefer to report actual p-values rather than so-called 'levels of significance' indicated by 1 to 3 asterisks...
But, on the other hand, what was discussed earlier... One shouldn't dismiss a non-significant result per se. There are areas of research, where you just cannot realise the necessary number of replicates for a proper 'significant' difference.
What is significance? Significance is completely arbitrary! 0.05, 0.01 ... 0.1... It's solely a matter of definition.
If the difference between two treatments has a p-value of 0.12 it may still be of economic importance to follow up this trail.
Could we stay a little more focused? The question is still: when assumptions for a test have to be made, do they explicitly refer to H0? If this were the case, then it is actually nonsense and impossible to test any set of data for deviations of the assumptions since we do not know a priori whether or not H0 is true and if the distributional characteristics of our observations would be similar under H0.
Good question! I would like to add that a statistical test must always go with the verification of the conditions of validity, the list of possible biases and the context.
And when you say: "and if the distributional characteristics of our observations would be similar under H0." Yes, H0 is an assumption, not only a mean but a distribution. Thus, the conclusion of such a test can only be, there is (or not) a big chance that H0 is not true. That's it. Maybe, as you said, H0 would not be true in your context! (If I understood your question)
What exactly would you want to test or respond to?
Jochen Wilhelm asked
"The question is still: when assumptions for a test have to be made, do they explicitly refer to H0? If this were the case, then it is actually nonsense and impossible to test any set of data for deviations of the assumptions since we do not know a priori whether or not H0 is true and if the distributional characteristics of our observations would be similar under H0."
Audrey Dugué answered
"Good question! I would like to add that a statistical test must always go with the verification of the conditions of validity, the list of possible biases and the context.
"And when you say: "and if the distributional characteristics of our observations would be similar under H0." Yes, H0 is an assumption, not only a mean but a distribution. Thus, the conclusion of such a test can only be, there is (or not) a big chance that H0 is not true. That's it. Maybe, as you said, H0 would not be true in your context! (If I understood your question)
"What you would you exactly want to test or respond to?"
To expand on Audrey's answer
The H0 must be clearly and completely defined. Several assumptions follow from the definition of the population; these assumptions apply to both H0 and HA. These assumptions are part of the premises of the proposed test. If H0 cannot meet the premises, H0 is not defined and further questions are meaningless.
A further premise must be made concerning HA (best held to only one) to allow a test. This premise defines what is different from H0 and what constitutes a difference. Such a premise is impossible without defining H0.
A difference from H0 because of some treatment on the H0 population should cause a change in variance if the treatment has an effect. The variance will be the sum of the variances of the population and the variance of the effect on the population. The exception is when each member of the population responds exactly the same to the effect.
I am sorry to say that I have read much superficial discussion of this topic.
The null hypothesis is actually the set of conditions (hypotheses) that allows one to associate a probability with a statistical event. When the probability is considered too small, the null hypothesis is rejected. Ideally, the conditions were set such that, say, only differences in means could cause the probability to be estimated low, so that the conclusion concerns means. But if other differences are present, they could also be responsible for the observed low probability. In structural equation modeling, the model with its fitted parameters constitutes H0.
If one suspects that the samples represent groups with different variances (the original topic), then an appropriate test should be selected based on the hypothesis that the variance is the same and see if this leads to a low probability of the observed data. The point here is that the test should be sensitive to the way the groups might differ.
H0 does not have to be one of no difference. It would be perfectly valid to test the hypothesis that the difference in height between males and females in a given population is 1.5 cm. This is implicitly what is done in estimating a confidence interval: one establishes what H0 would have to be for the data to have p=.05 or .01. Viewing the confidence interval this way always leads to sensible limits for low observed proportions. The simplification of taking the observed proportion +- a suitable multiple of the standard error, calculated as if the observed proportion were the population value, is logically incorrect and can indeed lead to a lower limit less than 0.
About interpreting the failure to reject H0, the positive approach to that is to identify the range of alternative values that one could formulate and that the present data would also not contradict (by assigning to them a low probability). This is the confidence limits approach. They can be interpreted that the present data are consistent with any H0 within those limits and are not consistent with any larger difference that one could take as H0. Thus, with only three cases, for instance, one would find non significant results, but the range of hypotheses not contradicted by these results is so large that the data are inconsistent only with extreme hypotheses about the effect size.
A word must be added about asymmetric distributions in which the variance is larger when the mean is larger. Most statistical tests, including the Student t test, assume the general linear model of additive effects. This includes the notion that the difference between, say, 2 and 4 is the same (has the same meaning) as that between 32 and 34. If these are counts of spelling errors in a dictation, a good student would acknowledge that making 4 errors while he used to make only two is poor performance. Everybody would agree, however, that 32 and 34 errors are essentially the same. The fundamental reason we want to transform such data before analysis is to respect their meaning. A suitable log or square root transform (including the addition or subtraction of an adequate constant) that would restore symmetry would also make it such that the original difference between 34 and 32 is much smaller than that between 4 and 2. It would also, typically, make the observed group variances compatible with the hypothesis that the variances are the same in the two populations.
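A small illustration of that last point (made-up skewed data whose spread grows with the mean):

    import numpy as np

    rng = np.random.default_rng(4)
    low = rng.lognormal(mean=1.0, sigma=0.5, size=200)    # smaller mean, smaller spread
    high = rng.lognormal(mean=3.0, sigma=0.5, size=200)   # larger mean, larger spread

    print("raw variances:", low.var(ddof=1), high.var(ddof=1))                    # very different
    print("log variances:", np.log(low).var(ddof=1), np.log(high).var(ddof=1))    # similar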
The problem is not so much in the definition of H0, which has validity in itself, but rather in the difficulty, in real life, of testing a hypothesis H1. For this reason, many researchers are just happy to test only H0 and automatically establish that H1 is valid.
There may be some events in which p = 0.01 (alpha), but knowing the distribution under H1, this value does not imply acceptance of H1, as the beta error could be even smaller than, or equal to, alpha.
No. You must completely define H0, otherwise you cannot define a difference from H0. That difference from H0 must be important and quantifiable.
Contingent on sufficient background tools, D.R. Cox's Principles of Statistical Inference, along with Cox and Donnelly's Principles of Applied Statistics, offer the comprehensive view of one of the most influential statisticians of the last 60 years. The 3rd chapter of the first reference aims at significance tests. Cox was editor of Biometrika for 25 years and is intimately familiar with the relationship between statistics and science. He does remain cautious with respect to Bayesian statistics as a universal paradigm.
I believe that the answer to all these questions could be the way the sample space is ordered. One can eliminate the nuisance parameters by taking the likelihood ratio to order the sample space. This ratio could be considered as the ratio of the likelihoods after taking the maximum of these likelihoods under the two hypotheses, H0 and H1.
This is the way the frequentists could eliminate the nuisance parameters and order the sample space. This is elimination by optimization.
As Bayesians, we can order the sample space by considering the Bayes ratio, that is, elimination by integration under prior distributions. One example can be seen in L.E. Montoya-Delgado, T.Z. Irony, C.A.B. Pereira, and M.R. Whittle (2001), An unconditional exact test for the Hardy-Weinberg Equilibrium Law: sample space ordering using the Bayes Factor, Genetics 158:875-83.
After you order the sample space, the p-value should be the tail probability under the null hypothesis, with the tail defined from the observed sample point.
Recently we have written a paper on arXiv that can be obtained at the following page:
http://arxiv.org/pdf/1310.0039v1.pdf
In addition to my first answer: if one looks at this in a nonparametric way, our suggestion can be framed in two paradigms, Bayesian and frequentist, which can be viewed at the following pages:
http://arxiv.org/pdf/1312.2291.pdf
and
http://arxiv.org/pdf/1212.5405.pdf
If you are asking "I wonder if such a test is justified when we believe/know that the samples must have been taken from different populations (that might have the same mean or not, but they surely have different variances)", the answer is NO! If you already believe/know that the samples must have come from two different populations, what is the reason to do any statistical test? You have to define H0 in a way that anything else is H1: they both combine to form the universe set of all possible outcomes. The pretext that, irrespective of the mean being the same or different, they are from two different populations is based on belief and not on evidence. It would be contrary to evidence-based research to hold on to some pretext although it is not supported by the evidence (data). The only exceptions to your case would be that the characteristic you are looking at is not the deterministic one as far as the difference between both samples is concerned (you believe a priori that both samples are from two different populations), or that you lack the power to detect the difference. In such a case you must look for alternative, more deterministic and definitive primary variables in lieu of the one under consideration, or a greater sample size (you can statistically test for a difference in variance by numerous methods, such as sdtest, to name one).
I hope this will help - Best
The extent of your description of the problems that you are having shows that you have already thought a great deal about the meaning of inference in "real data analysis" and the conceptual belly-aches that come with it. From my own experience, I would strongly recommend taking your thoughts one step further with the aid of Sivia's small book entitled "Data Analysis: A Bayesian Tutorial". Bayesian logic follows the logic of well-designed experiments: given fixed experimental conditions, what is the evidence in favor of, or against, a previously agreed-upon outcome (your hypothesis or proposition)? The quantitative rules for logical and consistent reasoning were developed by Richard Cox (see his book with the slightly scary title "The Algebra of Probable Inference"). To come back to your questions, using the concepts of probable inference leaves no doubt that context is an essential part of determining the likelihood of expected outcomes. Context and the Null Hypothesis enter through a logical AND: I & H0.
BTW: In the second paragraph you say that the data from both groups is sampled from *one* population with normal distribution. This is not correct. That you sample under the same conditions is part of the context information. The Null Hypothesis is that the means are equal in the given context (meaning a high likelihood of near-zero variability of the difference between the assumed and the observed expectation (mean)). The way in which you calculate the mean, of course, depends on your assumption about the distribution underlying the data that you sample. It is part of your experimental design to clarify - in advance - whether you will first assume "maximum ignorance" (known as a uniform prior) and determine the real distribution as experimental data come in (known as the posterior distribution), or make another, hopefully justifiable assumption in the context of your experiments.
Hans, thank you. Funnily, I just bought the book you recommended! I am about to read it. Also your second paragraph hits a good point. I understood that this assumption of "same population" was too strict.
And, yes, one should make justifiable assumptions in the context of the experiments. Using the normal probability error model follows from two such assumptions: symmetry and independence of the deviations from the center. For many (most) biological contexts this is a priori wrong, but often a good-enough and simplified view. For instance take the famous example "body size" (m) or also "body weight" (kg). Such variables are commonly seen to be normally distributed, but they are bounded. There are simple physical boundaries (negative values are not possible in reality, but are predicted by the normal model), and more fuzzy (and less well-known) biological boundaries (how small or big can a body get and still be viable?). Since the observations are typically very far from the boundaries, this is all of no practical relevance. However, if we come to the concentration of proteins, for instance, it happens (quite frequently in my experience) that the values are in fact close to one of the boundaries. Take the concentration of a protein that is almost not expressed under "healthy" conditions but gets induced under "diseased" conditions. The distribution of the concentration values in "healthy" is tight (low variance) and relatively symmetric around a center. A normal distribution might be a good-enough model to describe our expectations about this state. But in the diseased state, the induction introduces a strong skew to the distribution, so some samples show a quite large expression, whereas the mass of samples will show only a moderate increase. Here, the normal distribution would not fit the frequency distribution well. The disease does not only shift the center, it changes the shape of the distribution, too.
Now, despite the fact that the distribution (in both conditions) might better be described by a gamma or beta distribution, wouldn't the null hypothesis mean that the diseased state has the same distribution as the healthy state? And that this distribution can be approximated by a normal distribution with a low variance? If so, the power of a t-test could be made much higher (at least in some cases). The classical test estimates the (pooled) variance from both samples, which gives a variance that is considerably larger than under H0. This estimate is not meaningful in this context, although the test statistic can be adjusted using the Welch approximation, if and only if both distributions are normal. I wonder if in such examples (see above) not only the variance heterogeneity doesn't play a role, but whether the whole distribution under H1 is even relevant for rejecting H0 (careful: I am in the Fisher regime, I am not considering accepting H0). Wouldn't it be appropriate here to say: if the disease has no influence on the protein concentration, then all samples will look like the "healthy" samples? And if the data is unexpected under this assumption, we would conclude that the disease must have some impact? If so, then the variance (or, better, the standard error) should be estimated only from the "healthy" samples, right? It would mean losing some d.f., but the estimate of the null distribution is much more precise.
Looking forward to reading your thoughts and critiques :)
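One way to make that idea concrete (a sketch under the assumption that the "healthy" data really are normal; all numbers are made up): estimate the SE from the control group only, in which case the statistic follows a t-distribution with n_control - 1 d.f. under this H0.

    import numpy as np
    from scipy import stats

    control = np.array([1.1, 0.9, 1.3, 1.0, 1.2, 0.8, 1.1, 1.0])   # hypothetical "healthy" values
    treated = np.array([1.4, 3.0, 1.2, 5.1, 1.8, 2.6, 1.3, 4.2])   # hypothetical skewed "diseased" values

    nc, nt = len(control), len(treated)
    se = control.std(ddof=1) * np.sqrt(1 / nc + 1 / nt)   # SE estimated from controls only
    t = (treated.mean() - control.mean()) / se
    p = stats.t.sf(t, df=nc - 1)                          # one-sided: is the treated mean larger?
    print(t, p)

Compared with pooling, this gives up the treated group's d.f. for the variance estimate, and the t calibration holds only if the control data are close to normal; otherwise the null distribution of the statistic would have to be checked, e.g. by simulation.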
Dear Jochen, Thank you for the up-vote. I appreciate that you expanded on the context that you work in. Quite some time ago, I ran into similar types of problems and decided to do away with classical tests and, instead, pursue approaches that deal with the comparison of whole distributions. Motivated by the Fisher information, I looked at the work of Kullback, Leibler and others, to determine the discrimination information (or information distance) between distributions from healthy controls versus distributions from disease states. Though Kullback's book on "Information Theory and Statistics" is not for the faint-hearted, the techniques serve for very clean decision making. You start with a Null Hypothesis stating that two distributions are close under an information measure ("there is no disease effect on the observables tested") versus the alternative, where "disease has effects". The latter corresponds to distributions that are far apart under the information measure. Often one can use the chi-square function as such a measure; in other cases the Kullback-Leibler distance applies, etc. From what I gathered from your message, the above approaches should answer the questions that you posed.
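A toy sketch of such an information distance (the two model distributions are my own choices; the integration range covers essentially all of the "healthy" density's mass):

    # Kullback-Leibler divergence between a tight "healthy" model and a skewed "diseased" model.
    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    healthy = stats.norm(loc=1.0, scale=0.2)     # tight, roughly symmetric "healthy" model
    diseased = stats.gamma(a=2.0, scale=1.0)     # skewed "diseased" model

    def kl(p, q, lower, upper):
        integrand = lambda x: p.pdf(x) * (np.log(p.pdf(x)) - np.log(q.pdf(x)))
        return quad(integrand, lower, upper)[0]

    print("KL(healthy || diseased):", kl(healthy, diseased, 0.01, 2.0))

The KL distance is not symmetric; a symmetrized version, or the chi-square measure mentioned above, can be used in the same way.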
Thank you Hans. This looks like a promising approach, and I will have to read and learn a lot there.
Just to clarify: I wasn't looking for some alternative to the "classical" hypothesis tests. I was just wondering if the assumptions we make are reasonable and if they could be made differently. In particular: the assumption that the variance in a simple t-test can be/should be/is estimated from both groups, not only from the group that represents the "null" case. The classical t-test is "symmetric" with respect to the samples, so to say, and thus (at least in part) ignorant of our knowledge of what is expected under H0 (i.e., a "treated" sample should look similar to the "control" sample, and *not* the other way around, and nothing in between).
Providing information distances instead of p-values seems very attractive to me personally, but I am a little worried that this would be not understood or misunderstood with similar severity as the p-values by the authors and reviewers I know. As long as these people simply "want to have significance stars in the plots" (this is, by the way, an original quotation from a co-author!), it doesn't really matter practically how these stars are produced... but that's all a quite different topic :)
Dear Jochen, Comments regarding the second paragraph first. I have heard things like this before, though not as blunt as your co-author put it. Fortunately, the information measures come with "their own" decision criteria in place of p-values; see, for example, the Akaike information criterion (often quoted as AIC). So they are "safe" as far as stars in the plots are concerned :). I saw that you posted a question concerning R, so I gather that you are using it for your work. In R, you will find ways and means to determine the AIC or other such quantities. In Mathematica, these are delivered as standard in the built-in statistics package, which, in my book, speaks volumes in favor of using Mathematica (the plots can have lots of stars, too, if warranted :) ).
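For instance (an invented toy example, just to show the mechanics in R), one could compare a "no disease effect" model against a "disease effect" model by AIC:

set.seed(3)
group <- factor(rep(c("healthy", "diseased"), each = 25))
conc  <- rlnorm(50, meanlog = ifelse(group == "diseased", 0.3, 0), sdlog = 0.4)

m0 <- glm(conc ~ 1,     family = Gamma(link = "log"))   # no group effect
m1 <- glm(conc ~ group, family = Gamma(link = "log"))   # group (disease) effect

AIC(m0, m1)   # the model with the smaller AIC is the better-supported one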
Now to the first paragraph. The assumptions that we make for any particular test are reasonable in the sense that they feed into the mathematics used to derive the validity of the test. It follows that those who apply a particular test should verify that the assumptions hold (and, really, should have designed their experiments beforehand according to the assumptions of that particular test). Variance must be estimated for both groups, no doubt, and not just for the one that we define to be the "healthy" control, since you have to establish whether the assumption of equal variances holds. Therefore, a comparison must be made in a separate test to ascertain whether they are, or are not, similar. And, alas, we must ascertain the likelihood that the data are generated by the same process (represented by one distribution). Hence the need to establish that there is only one distribution (or, at the very least, that both samples come from the same distribution family). To do this cleanly, you will inevitably end up doing a goodness-of-fit test, which is another way of saying "information distance". That assumptions can be made differently is evidenced by the fact that there are other tests that are variations of the t-test and have different names. Finally, there is the topic of variance-stabilizing transformations that can be applied to the data without harm. I do not like to do this, since I feel that variance is one of the most important pieces of information, particularly in biology and medicine, but the technique is useful for exploratory purposes.
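A minimal R sketch of the two checks I mean (with hypothetical data) might look like this:

set.seed(4)
healthy  <- rnorm(30, mean = 10, sd = 1)
diseased <- rnorm(30, mean = 10, sd = 2)

var.test(diseased, healthy)   # F-test for equality of variances (itself assumes normality)
ks.test(diseased, healthy)    # two-sample Kolmogorov-Smirnov test: same distribution?

The Kolmogorov-Smirnov test is, of course, only one of several possible goodness-of-fit checks one could use here.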
The story that you tell is prototypical for how statistical work is still perceived by a large part of the bio-medical and clinical research community. This behavior appears to be unique; I have never found engineers, finance people, economists, physicists, etc., to put up such a fight against clean procedure. Indeed, that's a whole different story.
Best, Hans
This is really nonsense!
If A implies B, then not-B implies not-A. One cannot conclude that B implies A. In statistics, high and low probabilities take the roles of A and B.
Hi Jochen,
In a previous post, I spoke of methods to test for effects on the variance, along a completely different line from most of this discussion. If you are interested, here is a paper using them, with most of the relevant bibliography.
Best of luck,
Jabi
https://www.researchgate.net/publication/260556979_Individual_quality_explains_variation_in_reproductive_success_better_than_territory_quality_in_a_long-lived_territorial_raptor?ev=prf_pub
Dear Dr. Corneli:
Regarding your comment "Only under the null hypothesis is the distribution defined and only under the null can the probability be calculated. Anything else is mathematically intractable"
Could you perhaps expand on this? For example, from particle physics to climate science, the central use of statistics is predictive models, not null hypothesis significance testing (NHST). I don't think you were critiquing the use of predictive models as opposed to NHST, but I'm also not sure what "anything else" refers to. Also, there is a concise critique of the underlying logic of NHST that, if sound, makes defined distributions rather meaningless:
"The mistake here is known in statistical logic as “the fallacy of the transposed conditional.” If cholera is caused not by polluted drinking water but by bad air, then economically poor areas with rotting garbage and open sewers will have large amounts of cholera. They do. So cholera is caused by bad air. If cholera is caused by person-to-person contagion, then cholera cases will often be neighbors. They are. So cholera is caused by person-to-person contact. Thus Fisherian science.”
(http://stephentziliak.com/doc/Transposed%20conditionals%20in%20Biology%20and%20Medicine.pdf)
Put simply, rejecting the null means accepting AN explanation, but not necessarily THE explanation. If the "hypothesis space" could be defined in terms of some set of mutually exclusive and collectively exhaustive hypotheses H1, H2, ..., Hn, then we could use NHST (although it would be more than a little unnecessarily complicated) to ensure that, for any hypothesis H, rejection of the null is also tested against all other possible explanations (which is almost always impossible, at least in practice). So rejecting the null tells you that one explanation has a less-than-alpha likelihood of being due to chance (granting a number of assumptions, e.g., adequate experimental design, statistical tests appropriate for the data, adequate sampling methods, etc.). But granting that we should reject the null doesn't mean that we should accept the particular hypothesis we are testing. As shown in the cholera example given in the quoted study above, null hypothesis significance testing can tell you to reject the null and accept one out of some set of alternative hypotheses without enabling you to test these, or even leave you with a method for determining which other alternative hypotheses would also lead you to reject the null. In fact, "[t]he most curious problem with null hypothesis testing...is that nearly all null hypotheses are false on a priori grounds"
(http://warnercnr.colostate.edu/~anderson/PDF_files/TESTING.pdf)
Nor is NHST the only hypothesis-testing method. It is true that other methods have their own problems, but after some 90 years of criticisms of the foundations of NHST that remain unanswered, the problems with the alternatives are perhaps better likened to the technical problems one may encounter in NHST than to the criticism that it is "sorcery" not science (Lambdin, C. (2012). Significance tests as sorcery: Science is empirical—significance tests are not. Theory & Psychology, 22(1), 67-90) or that it is a "cult" ritual "signifying nothing" (http://virgo.unive.it/seminari_economia/McCloskey.pdf). Or perhaps all statistical significance testing is problematic but usable under the right circumstances (or perhaps all of it is flawed at its foundations). Either way, that still leaves predictive models.
Andrew, if you use hypothesis testing with H0 and H1 that don't exhaust all possibilities then whatever happens to you afterwards is your problem, rather than the method's problem. Wouldn't you agree to that?
If H1 is not the negation of H0, why on Earth would anybody think that rejecting H0 would imply embracing H1?
To answer your question: they would "think that" because, without thinking that, there is almost no purpose to the entirety of null hypothesis testing. Arguably there is none, and this has been argued (for the most exhaustive list of references I know of up to 2001, see "402 Citations Questioning the Indiscriminate Use of Null Hypothesis Significance Tests in Observational Studies" (http://warnercnr.colostate.edu/~anderson/thompson1.html)). For more recent reviews, see e.g.,
"The Null Ritual What You Always Wanted to Know About Significance Testing but Were Afraid to Ask" (http://www.sozialpsychologie.uni-frankfurt.de/wp-content/uploads/2010/09/GG_Null_20042.pdf)
"Significance tests as sorcery: Science is empirical--significance tests are not" (http://psychology.okstate.edu/faculty/jgrice/psyc5314/SignificanceSorceryLambdin2012.pdf)
"The Unreasonable Ineffectiveness of Fisherian" Tests" in Biology, and Especially in Medicine" (http://stephentziliak.com/doc/Transposed%20conditionals%20in%20Biology%20and%20Medicine.pdf)
or, for more exhaustive yet more accessible treatments, the books The Cult of Statistical Significance and Making Social Sciences More Scientific, both of which provide simple yet thorough analyses of the problem.
True, in a certain sense it's not the method's problem, in that a method relying on correlation to identify causal factors is not a problem with the method itself. After all, correlation entails some causal connection (at least classically: either A causes B, B causes A, or C causes both). However, as correlation implies only that some causal connection exists, and certainly doesn't tell you which one, using it as the basis for identifying causal mechanisms is at best mostly pointless. An ideally designed experiment using NHST can only tell you that rejecting the null leaves the alternative as one explanation out of a set you can't identify, unless you use methods that make NHST redundant (or worse), or unless your null and alternative are used to answer questions nobody asks. For example, nobody uses hypothesis testing to determine that, if one flips a coin and the result is heads, then the hypothesis that the result would be tails is null. That's negation. Researchers do not ask such questions. They reason like this: "all other things being equal, given a treatment group and a control group and a statistically significant difference, I attribute the difference to the treatment." This is not a logical consequence of statistically significant differences between groups. In fact, we are so aware that it isn't that we actually compare placebo treatments to placebo control groups, using the same problematic designs, to identify how, e.g., activated vs. inactive placebos can change experimental outcomes. What null hypotheses have usable negations?
I'm aware of the problems with hypothesis testing. In my opinion, it would be great if at least some of the people routinely working on inventing new tests devoted their energy to coming up with a superior methodology.
But it's still true that any tool requires some understanding of how and for what it is supposed to be used. You can't just say: I applied the pencil to the door and it didn't open, the pencil is wrong!
If you find you need a methodology providing answers to the question "all other things being equal, given a treatment group and a control group and a statistically significant difference, I attribute the difference to the treatment" that's fine, adopt another or invent your own.
My point is simply that what you have called "the underlying logic of NHST" and explained with the quotation "If cholera is caused by person-to-person contagion, then cholera cases will often be neighbors. They are. So cholera is caused by person-to-person contact. Thus Fisherian science." has simply nothing to do with "the underlying logic of NHST".
The underlying logic of NHST is just "If A is implausible, I'll take it that not-A is plausible", not "If A implies B and B is plausible, then A is true and, furthermore, A is the cause of B". The latter is a logic of wishful thinking.
It seems to me that in some testing models it should be possible to generate hypothetical tests, but it is necessary to define precisely all the determinants that affect the entire testing model.