Can I say that the probability of making a wrong conclusion is lower when rejecting the null hypothesis at a p-value of 0.001 than at 0.045? In other words, can the probability of an accurate conclusion be gauged by the value of the p-value itself?
No. The p-value is calculated under the assumption that the null hypothesis is true. It tells you something about the chance to observe data that results in a more extreme test statistic, under the assumed model & hypothesis. It does not address the question of whether, or how likely it is that, the assumed model & hypothesis are correct.
Further, an individual p-value does not carry any case-specific information or evidence provided by a particular sample. The interpretation of p-values is not a "by-case" interpretation - it is a procedural interpretation. The procedure of calculating p-values from data has particular statistical properties. Therefore all we can say is that low p-values are unlikely under the assumptions, but we cannot turn this around to find out how likely the assumptions are to meet reality.
If you draw two samples from the same normal distribution, a t-test gives you a p-value for the mean difference. This p-value provides no information, as you can easily see if you repeat this "experiment": you get another, different p-value. When you repeat this, you will get p-values that vary all over the place between 0 and 1 (they will have an approximately uniform distribution). Having two p-values, one being smaller than the other, does not tell you anything about whether the assumption was "less correct" for the lower p-value.
If you sample from two populations with different expected values, the p-values you get will no longer be uniformly distributed. Their distribution is skewed towards smaller values, so most p-values you get are closer to zero. But the conclusion remains the same: the fact that one p-value is smaller than the other does not tell you anything about which assumption was "less correct".
In the correct "procedural interpretation", the difference is that in the first scenario you will only rarely see p-values close to zero. So in some cases you will reject the null hypothesis there, and all these (rare) cases will be false rejections. In the second scenario you will more often get p-values close to zero and more often reject the null. By definition, you cannot make any false rejection here, but it can still happen that the conclusion about the sign of the expected difference is wrong (the sample mean difference may be positive, statistically significant, but the "true" expected difference is negative). I'll call this a "sign error" (from A. Gelman: http://www.stat.columbia.edu/~gelman/research/published/francis8.pdf).
If we assume that the "true" expected difference is never exactly zero (but it can be arbitrarily close to zero), the you can never wrongly reject the null hypothesis, but among the rejected hypotheses you can have a higher or lower probability of making a sign error. This probability only and exclusively depends on the unknown true expected difference relative to the standard error (which itself depends on the sample size). If this is very close to zero, the sign error probability is about 50% (you can toss a coin to decide if the difference is positive or negative). If it is large, the sign error will approach zero.
A bad scientist testing stupid hypotheses will rarely get low p-values, and in those (rare) cases he will publish conclusions with a sign error probability of about 50%. He could increase the chance of getting low p-values by using large samples, but this costs more time and resources.
A good scientist testing sensible hypotheses will more often get low p-values with a low sign error probability and publish more papers with a low sign error probability.
Taking a single paper from two scientists and looking only at the p-values does not allow us to conclude which one is the "bad" and which one is the "good" scientist. This would be a "by-case" interpretation that just does not work with p-values.
The research community finds the papers from both scientists, some from the "bad" one and more from the "good" one, and most of the conclusions presented there about the sign of the analyzed difference will be correct (that is, the sign error probability of the published work is lower than 50%). This is the correct procedural interpretation.
Of course, scientists try to hack this system, literally, by applying "p-hacking" (e.g., doing many very small & quick experiments or making a huge number of stupid tests to get "some" significant p-values that will eventually be published). This is a huge problem for the scientific community.
To make all of what Jochen wrote about above resonate with you, I would highly encourage anyone to go into RStudio and try the following (perhaps oversimplified) code, which you can copy and paste to get results right away. If you didn't understand already, you will understand it better once you tinker with it yourself and see the results with different parameters (change the means, SD, etc.). Compare the results of populations with different parameters of mean and standard deviation and see what you get. (Source of inspiration: section 9.4 of https://sites.ualberta.ca/~ahamann/teaching/renr480/labs/Lab9.pdf)
# Test for p-values using a t.test where you compare the null to itself: this compares
# two samples with the same (or similar) mean and standard deviation and then produces a
# histogram of the p-values. With the current parameters it will give a roughly uniform
# distribution of p-values as long as x and y keep the same mean and SD. You can imagine
# x as the control population and y as a second sample from that same population; in
# reality you will not get exactly the same mean and SD when you re-sample, but I am
# oversimplifying to show a concept. Comparing a population to itself shouldn't skew the
# p-values towards lower values if there is no difference, right? If you want, you can
# make the mean of y 10.5 and slightly alter its SD as well, because no sample will
# really have identical sample parameters upon resampling (simplifying here to show the
# core concepts).
p <- c()
for (i in 1:10000) {
  x <- rnorm(10, mean = 10, sd = 5)
  y <- rnorm(10, mean = 10, sd = 5)
  p <- c(p, t.test(x, y)$p.value)
}
hist(p)
# Then you can change the mean parameter for the y variable (think of this as your
# treatment group and x as the control) and look at the distribution of p-values for
# this test. Try a difference that you think could reasonably exist in real life
# between control and treatment.
p <- c()
for (i in 1:10000) {
  x <- rnorm(10, mean = 10, sd = 5)
  y <- rnorm(10, mean = 15, sd = 5)
  p <- c(p, t.test(x, y)$p.value)
}
hist(p)
Hope these simplified examples help visualize these nuanced arguments. If you extract t.test(x, y)$statistic instead of the p-value, you will get the t-values, which are also interesting to see.
Peter Nam (OP), not quite. However, in a Fisherian frequentist framework you can say that you have less confidence in the null when observing a p-value of 0.001 compared to observing a p-value of 0.045. A p-value of 0.001 means the observed test statistic exceeds a 99.9% margin of error under the null. A p-value of 0.045 means the observed test statistic exceeds a 95.5% margin of error under the null. The first p-value represents stronger evidence against the null than does the second p-value.
The Fisherian frequentist calculates p-values for all hypotheses (not just a research null) and constructs confidence intervals of all levels (not just 95% intervals). Based on his understanding of their long-run performance, he bets on his observed confidence intervals covering the truth. A p-value of 0.001 testing a particular hypothesis means the 99.9% confidence interval excludes the hypothesis. Such an interval covers the truth (wherever it may be) 99.9% of the time in repeated experiments and misses 0.1% of the time. Equivalently, the complement of a 99.9% confidence interval covers the truth 0.1% of the time. In this way, we can feel 0.1% confident that the observed 0.1% confidence interval has covered the truth, and this interval coincides with the hypothesis in question.
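A small R sketch of this duality under the usual t-test assumptions (the data below are just simulated, all numbers arbitrary): the 100(1-p)% confidence interval has the tested hypothesis sitting exactly on its boundary, wider intervals include it, narrower ones exclude it.
set.seed(1)
x   <- rnorm(20, mean = 0.5, sd = 1)   # simulated data, arbitrary parameters
mu0 <- 0                               # hypothesis being tested
p   <- t.test(x, mu = mu0)$p.value
p
t.test(x, mu = mu0, conf.level = 1 - p)$conf.int      # mu0 lies (numerically) on the boundary
t.test(x, mu = mu0, conf.level = 1 - p / 2)$conf.int  # wider interval: contains mu0
t.test(x, mu = mu0, conf.level = 1 - 2 * p)$conf.int  # narrower interval: excludes mu0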
Here are some related links:
https://www.linkedin.com/posts/geoffrey-s-johnson_lets-talk-betting-odds-bayesians-laplacians-activity-7081240302056824832-p3Wi?utm_source=share&utm_medium=member_desktop
https://www.linkedin.com/posts/geoffrey-s-johnson_i-have-created-a-figure-i-wish-someone-made-activity-6947611213925081088-eHw2?utm_source=share&utm_medium=member_desktop
I do not agree, Geoffrey S Johnson. The individual p-value does not matter. The frequentist paradigm, as you stated correctly, is about the frequency properties. Research that accepts conclusions drawn only for cases where p is below a fixed alpha gets its frequency properties from that rule, not from the individual p-values.
I do not agree with Jochen Wilhelm that Neyman-Pearson frequentism is the only valid approach to frequentism. While it is sufficient for the purposes of decision making to test a single research hypothesis using a single margin of error, this does not preclude further testing. This is known as a closed testing procedure. Not only can we test each and every hypothesis, we can utilize every significance level! Thus, the p-value coincides with the smallest attained significance level and represents the weight of the evidence based on the observed data.
The p-value is a transformation of the test statistic. It is a test statistic. If a small p-value does not represent greater weight of evidence than a larger p-value, then there is no value at all in performing the hypothesis test. There is no value in the likelihood or in likelihood ratios. There is no value in forming a rejection region or in calculating power.
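As a hedged R sketch of the "smallest attained significance level" reading (arbitrary simulated data): apply the NP rejection region |t| > t_crit(alpha) over a fine grid of alpha levels; the smallest level on the grid that still rejects sits just above the p-value.
set.seed(2)
x  <- rnorm(15, mean = 10, sd = 5)   # arbitrary simulated samples
y  <- rnorm(15, mean = 14, sd = 5)
tt <- t.test(x, y)
t_obs <- unname(tt$statistic)
df    <- unname(tt$parameter)
alphas  <- seq(0.0005, 0.9995, by = 0.0005)      # grid of significance levels
rejects <- abs(t_obs) > qt(1 - alphas / 2, df)   # NP rejection region at each level
min(alphas[rejects])                             # smallest grid level at which we still reject ...
tt$p.value                                       # ... lies just above the p-value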
Thank you Geoffrey S Johnson for your feedback. But what is the "weight of evidence" of p = 0.0001 obtained under a true null? Shouldn't that be the same evidence as provided by any other p-value? There is no measure telling us whether this one particular p-value is from a uniform distribution or from a right-skewed distribution. Nothing tells us that we should bet on any particular shape of the distribution from which this p-value is drawn. Only when considering a rule, such as taking all observed p-values smaller than a small number alpha as "evidence against H0", is the weight of evidence provided in each case set by alpha, not by the observed p-values.
The confidence interval is the inversion of a hypothesis test based on a p-value. The NP frequentist is only concerned with the performance of a confidence interval in relation to his research null hypothesis under the assumption the hypothesis is true. The Fisherian frequentist is concerned with the performance of a confidence interval covering the unknown fixed true parameter value.
Since the N-P frequentist assumes a null hypothesis for the purposes of argument, the “reveal” is his test statistic. This is why he must define his rejection region and place his bet *before* the data are observed. He can very well construct a 100(1- α)% confidence interval, but he is really only concerned with its performance relative to his research null hypothesis. While the Fisherian also considers null values of theta for the purposes of calculating the p-value, he does so with the intent of his 100p% and 100(1-p)% procedures covering the *real* theta, not an assumed truth for the purposes of argument. For the Fisherian, the “reveal” is the true theta, which in practice may not be easily revealed. This doesn’t matter, though, because the performance of the Fisherian’s procedures is unconditional on the unknown fixed true theta (at least asymptotically). Thus, the Fisherian does not need to place his bet before the data are observed, but he can if he wants to.
N-P frequentism seeks to reduce Fisherian frequentism to its minimally sufficient components for the purposes of decision making. While this is commendable on one level, it results in a contrived caricature of frequentism.
@Jochen Wilhelm, you're not wrong in your application of NP frequentism, it's just not the complete picture of frequentism.
Jochen Wilhelm Without loss of generality, let's say it's a one-sided p-value=0.0001 testing Ho: theta le theta_o. This means the Fisherian frequentist's one-sided 99.99% lower confidence limit has excluded theta_o, and his 0.01% upper confidence limit has covered it. In 99.99% of repeated experiments, the Fisherian's 99.99% confidence interval covers the *true* theta (wherever it may be). Likewise for his 0.01% confidence procedure. Based on this long-run performance, the Fisherian is willing to bet $99.99 in hopes of a $100 return were it revealed that the *true* theta is covered by his observed 99.99% interval. Likewise, he would be willing to bet $0.01 that the *true* theta is covered by his observed 0.01% interval.
Equivalently, based on this observed p-value we could say that under the null the test statistic exceeds a 99.99% margin of error.
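A short R sketch of that correspondence for a one-sided one-sample t-test of Ho: theta le theta_o (simulated data, arbitrary numbers): the one-sided lower limit at level 1 - p lands on theta_o, and the complementary one-sided upper limit at level p just covers it.
set.seed(3)
x      <- rnorm(25, mean = 1, sd = 2)   # arbitrary simulated data
theta0 <- 0                             # bound in Ho: theta le theta0
p <- t.test(x, mu = theta0, alternative = "greater")$p.value
p
t.test(x, mu = theta0, alternative = "greater", conf.level = 1 - p)$conf.int  # lower limit = theta0
t.test(x, mu = theta0, alternative = "less",    conf.level = p)$conf.int      # upper limit = theta0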
The (1-a) confidence interval is usually (not necessarily!) constructed as the set of all hypotheses that cannot be rejected at significance level a. I would not call this an "inversion", but I think here we essentially agree.
However, I don't understand your division in N-P and Fisherian interpretations.
The N-P formalism allows one to find an optimal test strategy in order to maximize the net win (or minimize the net loss) given two alternative hypotheses, each associated with its expected win (or loss). This is often just handled via the type-I and type-II error rates, the expected error probabilities under the two alternative hypotheses. But choosing acceptable error rates requires some real-world connection, and this seems possible only via expected wins and losses. Anyway, this allows one to specify a desired power. The defined power and level of significance of the test give a rejection region, as you said, and the test can be based on the mere fact of whether or not the observed test statistic is inside the rejection region, which is equivalent to comparing the p-value against the chosen level of significance. I think we agree here as well. I don't know how confidence intervals come into play here. To my knowledge, CIs are placed around the estimate, and the particular value of the estimate is considered uninformative in the N-P context.
In contrast, Fisherian tests have no concept of "power" and they do not strive to optimize a decision strategy with regard to... anything. The aim is simply to check whether the available data provide sufficient information about the estimate relative to a hypothesis within a statistical model to draw a conclusion w.r.t. that hypothesis (e.g. should we reasonably expect the parameter value to be on the same side of the hypothesis as the estimate). The CI here includes the set of all hypotheses that are "statistically indistinguishable" from the estimate. I think I see your point, if I'm correct here, that one may consider the width of the CI for a given data set as a function of the confidence level. CIs with higher confidence are wider and include more hypotheses as being "statistically indistinguishable" from the estimate. So one could say that the data provide more evidence against a hypothesis further away from the estimate. Do you agree, or am I on a wrong track already?
If you agree, then my concern is that this is correct for a given set of data (sample), but it does not allow us to compare different samples. I don't know how this should be possible, and I cannot even clearly explain why I think that this makes no sense, because it seems so fundamental. But maybe I can understand when you explain why and how I am wrong here. To make that clear: I don't think that there is any way of comparing p-values or CIs from different samples concerning their evidence (against whatever). The only benefit is, again, the rule not to interpret estimates relative to hypotheses that are "too close" (for which the p-values are too large or which are inside the CIs). This leads to an in some ways reasonable (but not necessarily optimal) strategy - not a decision strategy as in N-P, but a "conclusion strategy".
Giving weight to evidence from data requires some (formal) prior knowledge, or a prior "frame" in which the information is embedded. This can, in my opinion, only be achieved in the Bayesian frame. I don't know how Fisher related to this. I used to think that Fisher was a strong opponent of the Bayesian idea and therefore developed the fiducial argument, which finally failed to be conclusive. But I also read papers stating that Fisher also promoted Bayesian ideas.
A confidence interval is the inversion of a hypothesis test, even if it's not explicitly labeled as such.
Assigned unfalsifiable Bayesian belief isn't evidence of anything.
While Fisher originally did not fully appreciate the concept of power, that shouldn't be taken to mean Fisherian frequentists of today have no concept of power or are unable to make use of it. Since there is an unknown fixed true parameter, there is an unknown fixed true power, and there is nothing stopping the Fisherian frequentist from inferring it; see the preprint "Decision Making in Drug Development via Inference on Power".
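As a naive plug-in sketch in R (not the method of the preprint above, just a made-up illustration): estimate the unknown true power of a two-sample t-test by plugging the observed effect and SD into the standard power formula.
set.seed(4)
x <- rnorm(30, mean = 10, sd = 5)   # arbitrary simulated "control" data
y <- rnorm(30, mean = 13, sd = 5)   # arbitrary simulated "treatment" data
delta_hat <- abs(mean(y) - mean(x))          # estimated difference
sd_hat    <- sqrt((var(x) + var(y)) / 2)     # pooled SD estimate
power.t.test(n = 30, delta = delta_hat, sd = sd_hat, sig.level = 0.05)$power  # plug-in power estimate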
Beliefs may not be falsifiable, but they may well be justifiable (the same applies to assumptions). The relevant question might be whether "evidence" is quantifiable at all outside any context of beliefs. I don't see how the (isolated) observed likelihood (function) carries evidence. One needs a reference frame, a context, to produce/measure/rate evidence. Frequentists do this by using only the likelihood but referring to expectations under assumed models/hypotheses. Bayesians do this by quantifying how much a likelihood alters the prior.
I fully agree with you that "power" is used in the Fisherian interpretation as well to plan studies. Your paper seems very interesting in this regard. Thank you for sharing.
Jochen Wilhelm, performance is evidence. The only scientifically defensible interpretation of the Bayesian's posterior integral (depicted as a CDF or folded CDF) is an approximate frequentist p-value function formed by scaling and integrating the meta-analytic likelihood. Any other interpretation of probability as assigned belief or of the parameter as a random sample (random variable) is indefensible.
Assigned belief of the experimenter isn't a statement of performance and therefore cannot be empirically investigated. It's not evidence of anything. The Bayesian allows himself to assign any credible level to a data-driven interval procedure and no matter what he will always be "right." Treating a parameter as though it were a legitimate random sample leads to contradictory prior and posterior sampling frames.
I think your first argument is circular. I am not sure what you mean by "meta-analytic likelihood", but what is "scaled" is the product of the likelihood and the prior. If your "meta-analytic" means that a prior is already accounted for, then the argument is circular. You cannot turn a likelihood into a posterior without having a prior. It may look like one could do so when using a "flat prior", but a flat prior is as strong a prior belief as any other non-uniform prior.
And I did not say that the researcher's belief itself is evidence of anything. You got me wrong here. I said that the belief is required as a frame to judge the evidence of observations (the likelihood).
I further think you misunderstand the Bayesian interpretation of probability. For a frequentist, a probability can be assigned only to a process ("sampling") that can be repeated (under not identical but somewhat(!) similar conditions). I'd guess you would agree so far. However, this is exactly the problem our conversation started with (see below). For a Bayesian, probability is assigned to our state of knowledge about something (e.g. about a parameter, but also about the value the next measurement or observation of something will take). If the entity (a parameter value, an observation) is numeric or countable, this can be formalized as a random variable with some assigned probability distribution. The formalism is identical for frequentists and for Bayesians. The frequentists just restrict the concept to something that must be a repeatable process. This has the advantage that statements about probabilities can be "probed" experimentally.
To the best of my limited knowledge, a Bayesian does not see a parameter value as a sample. Very similarly to the frequentist, the Bayesian sees the parameter value as some unknown quantity. The frequentist takes a sample estimate of the parameter value and then can say how surprising this estimate is under a given, assumed (hypothesized), arbitrary(!) value of the parameter. And he may also calculate a confidence interval as the set of hypothetical values of the parameter under which the sample would not be "too surprising". The Bayesian starts with some idea about the parameter value and uses the sample to refine this idea. These "ideas about the value" are fuzzy and formalized via probability distributions.
Please don't get me wrong: I am not advocating Bayesianism here. I am just trying to find out where my thoughts might go astray.
---
If we have a sample, we assume that this is an outcome of a random process. We can calculate all kinds of sample statistics, all of which are to be interpreted as outcomes of random processes. This includes the p-value. The process producing the p-values is formalized as a random variable, P, say. So p is a realization of P. P has a probability distribution with sample space (0, 1). As it is defined, the probability distribution of P depends on the true parameter value (of the test statistic, T), and since this true value is unknown, the distribution of P is unknown. We don't know from what distribution p is a realization. If we assume/hypothesize that T = t0, then the distribution is uniform, and otherwise the distribution should be right-skewed, but we don't know how much. So we go for assuming the null value. Under this assumption it is simple to get Pr(P < alpha) = alpha. If we interpret the direction of the observed difference (t > t0 implying T > t0) only in cases with a small p-value, we only rarely make such an interpretation when in fact T = t0, and we make interpretations more frequently the larger the difference between T and t0.
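A quick R check of Pr(P < alpha) = alpha under H0, in the style of the simulation code posted earlier in this thread:
set.seed(5)
p <- replicate(10000, t.test(rnorm(10, 10, 5), rnorm(10, 10, 5))$p.value)  # both samples from the same distribution
mean(p <= 0.05)   # close to 0.05
mean(p <= 0.20)   # close to 0.20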
I don't think our understandings are any different. I will say that frequentists do not assign probability. To them, probability is defined as the limiting proportion of repeated samples, a definition that is empirically investigable. Performance is the evidence. A likelihood is not a full statement of performance, but a confidence curve of p-values is. It outlines each and every confidence procedure based on its long-run performance. Both the likelihood and the confidence curve are functions of the hypothesis being tested. The likelihood simply provides relativistic inference. No prior belief assignment is required to judge it.
I agree that a flat Bayesian prior is as strong a belief assignment as any other, and it's not a factual statement about the parameter, a hypothesis, nor an experiment. Using a sufficiently flat prior amounts to using a uniform prior or an improper prior, which amounts to normalizing the likelihood. This is what I meant by the posterior being the scaled meta-analytic likelihood - the posterior is proportional to a flat or improper (or even subjective) prior multiplied with the full likelihood (historical likelihood times the current likelihood). It's a scaled meta-analytic likelihood. The prior is part of the scaling. I'm simply giving the Bayesian machinery a defensible interpretation.
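A tiny R illustration of this "scaled likelihood" reading in a binomial setting with a uniform prior (the counts are made up):
x <- 7; n <- 20                              # made-up data: 7 successes in 20 trials
theta <- seq(0.001, 0.999, by = 0.001)
lik       <- dbinom(x, n, theta)             # likelihood as a function of theta
lik_norm  <- lik / sum(lik * 0.001)          # likelihood rescaled to integrate to 1
posterior <- dbeta(theta, x + 1, n - x + 1)  # exact posterior under the flat Beta(1, 1) prior
max(abs(lik_norm - posterior))               # tiny (numerical integration error only)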
Bayesians will equivocate however needed to justify their paradigm. They begin by saying the parameter is fixed and probability is assigned belief... until it is stated that assigned prior and posterior belief is unfalsifiable and invalid as evidence. Then they will claim the prior is a modeling assumption, i.e., the parameter is considered an actual random sample... until it is stated that this causes contradictory prior and posterior sampling frames.
Nothing about the Fisherian use of p-values is at odds with your correct explanation of the p-value as a test statistic. You are applying the NP framework to construct an alpha-level rule for a single research hypothesis. I'd like you to go one step further and test each and every hypothesis using the NP framework, and do it for each and every alpha level, then summarize the results. You'll find that, for every hypothesis, the p-value coincides with the smallest attained significance level. None of this invalidates or changes your original alpha-level decision rule. You will still reject/retain as you would have otherwise. The only difference is now you have complete inference on the truth based on confidence intervals of all levels, rather than merely a decision to behave regarding the research null.
Jochen Wilhelm, according to Popper [1] no beliefs are justifiable (knowledge as justified "true" belief is therefore out). Therefore they cannot be "true", only supported or corroborated, not verified, as we do not know what might happen in the future. We cannot verify that our models and numbers are "true" or "exact" representations of reality (as in realism), not merely abstracts that have some representation of it. I do agree with you; perhaps the word "justified" invokes some nitpicking (from my side). Perhaps "epistemically warranted", in the form of inductive validity or cogency, might fit better. Perhaps this does not matter in any case, but I do agree with your held-back position. I also agree with Popper in the sense that if we accept our beliefs to be "true" then we might believe abstracts (metaphysical entities), and then we incorporate ontological things into the world of physical things (when we believe the p-value …)
According to Hájek [1-3] and more recently to La Caze [4], frequentism as such is also not justifiable. Or, as Keynes put it in 1923: "In the long run we are all dead".
The logical foundation of frequentism given by von Mises has proven invalid [2,3].
After all, the attempts to settle a physical basis of probabilities never worked. Frequentism is eventually as "unjustified" as subjectivism. Any argument in which probabilities are defined by frequencies and frequencies result from probabilities is ultimately circular and does not clarify anything.
After all these detours we are not much further than Bernoulli in the 18th century when he was working on Ars Conjectandi [5]; this title, imo, really hits the topic: it's not about physics - it's about the art of making "justifiable", in some sense reasonable, conjectures about things or states we do not know (precisely). And observed frequencies are all the empirical data we have to justify expectations or probabilities (as an alternative to Laplace's approach of logicism, which is often not applicable in real-world problems). It is rational to adjust probabilities to observed frequencies. I think both empirical frequentists and subjectivists do this to make "informed conjectures". The difference to me seems to be that frequentists try to keep out any experience from outside the actual sample (at the cost of implicitly making many and strong assumptions about data-generating processes), whereas subjectivists focus on the "external experience" and use the sample information for an update (at the cost that there is no objective way to define "external experience", but possibly an inter-subjective agreement).
Jochen Wilhelm, it is the falsifiability of the frequentist definition of probability that makes it scientific and valid as evidence. We have the opportunity to empirically investigate a hypothetical limiting proportion of repeated sampling using finite sampling. We don't have to actually live forever and sample forever. Based on finite sampling evidence, we can make the tentative decision to reject the notion of a long-run limiting proportion. Of course, we shouldn't be impulsive and reject the notion across the board without even investigating it based on the impetuous writings of Bayesian diehards.
Importantly, if we reject frequentism, then there is no objective information available in the likelihood, not even for the Bayesian. This technically doesn't stop the Bayesian, though, for he could assign his likelihood belief without any connection to repeated sampling and continue to use Bayes' theorem uninhibited. This demonstrates he is always "right," no matter what probability values he assigns. Bayes is unfalsifiable pseudoscience.
https://www.linkedin.com/posts/geoffrey-s-johnson_lets-compare-bayesian-and-frequentist-inference-activity-6977669862294708224-J4Bj?utm_source=share&utm_medium=member_desktop
Wim Kaijser, that certain elements of astrology are falsifiable does not mean the whole of astrology is scientific, or that falsifiability as a criterion is useless for demarcating science. Those elements within astrology that are unfalsifiable are unscientific.
Geoffrey S Johnson , as you say: "there is no objective information available". It seems to be the issue here. You assume that "objective information" does exist, and this assumption may not prove as useful as you hope.
I also stumble over this statement: "Based on finite sampling evidence, we can make the tentative decision to reject the notion of a long-run limiting proportion". I don't see how finite sampling will bring us closer to a limit that is infinitely far away. It does not matter how large a sample is from which you calculate a statistic: the sample may be taken from some "weird" region in the infinite sequence space. You may substitute each single measurement/observation by a statistic calculated from its own separate sample - the sequence space remains infinite and infinitely larger than the sample. You have zero coverage of this sequence space; you never get a relevant proportion of the sequence space. I think von Mises tried to apply the concept of a (mathematical) sequence to the behavior of relative frequencies with growing sample size. But stochastic sequences do not need to have a limit - such a limit was just postulated (and the proof that they should have one failed, afaik).
Of course we do learn something from samples, and we learn more from larger samples. But this is a subjectivistic view...
Geoffrey S Johnson, what do you mean by falsifiable and objective information?
If falsifiability is the main criterion, then your argumentation is also pseudo-science - or can you show me how to falsify your own argument? This was, btw, advocated by Popper. Then, if your argument is sound, it is unfalsifiable and so it is pseudo-science? Or perhaps logic is not scientific? If you say Bayesianism is not refutable, why are you refuting it at the moment?
The suggestion that those elements within astrology that are unfalsifiable are unscientific is to suggest that the statement "today will be my lucky day because some stars are in position x" is scientific because I was unlucky to make this post?
You are advocating frequentism over Bayesianism, but blaming the Bayesian (if there exists a Bayesian) of fundamentalism? What happened to plurality and anarchism as in Popper (The Myth of the Framework) and Feyerabend (Against Method)? I see "banning" (e.g., p-values or Bayesianism) as a form of intellectual totalitarianism that wants to ban the freedom of ideas, which are, btw, our own.
Moreover, we can perfectly well apply and refute any expectation, be it Bayesian or frequentist. I do not see why falsification should be a term monopolised by the frequentist. If some model parameters are estimated, E(y|x) = b0 + b1*x1, then either this results in an acceptable prediction (and acceptable M- or S-type errors) or it does not. If it does not have epistemic/pragmatic value, we are done; then we have also learned something.
----
The first time Fisher referred to the p-value, in his 1925 book, he addressed it as "… we can examine whether or not the data are in harmony with any suggested hypothesis." It gives no information about the hypothesis, only the information content against H0. Which is actually the most useful thing I learned (from Jochen Wilhelm, btw), but it is somehow not understood?
Jochen Wilhelm, Under Ho we assume, for the purposes of argument, that a limiting proportion indeed exists.
To test this hypothesis based on a specified 100(1-alpha)% margin of error or a type I error rate alpha, we would collect a finite sample and observe whether the sample proportion, as a sequence or function of sample size, remains within the margin of error. This would lead to a tentative decision to reject or retain the null hypothesis that a limiting proportion exists. If we tentatively retain the hypothesis that a limiting proportion exists, it is valid as evidence when investigating other claims.
If we assume for the purposes of argument that a limiting proportion does not exist, then there is no means by which to empirically refute this hypothesis. We will always retain the hypothesis that no such limiting proportion exists.
The only way we can "learn something from samples," is if this sampling converges to something in the limit.
Wim Kaijser, as an abstract philosophy, logic is not scientific. Only when it is applied can it be empirically investigated. Only then does it become falsifiable and scientific. Falsifiability itself is a philosophical principle. Only after this definition is applied to something can the notion and value of falsifiability be empirically demonstrated. Your statement about lucky day and stars makes no sense.
There is no way to empirically refute a Bayesian statement of belief. If a Bayesian assigns 73.4% belief to a proposition, this isn't a statement of performance. There is nothing we can do to demonstrate that this is right or wrong. He is always "right" no matter what number he assigns.
I'm not banning anything. People are free to have discussions and, if they so choose, make any unfounded and indefensible statements they want. The only weapon against bogus speech is more free speech.
https://www.linkedin.com/posts/geoffrey-s-johnson_is-scientific-decision-making-a-misnomer-activity-7079105888334008320-hBKf?utm_source=share&utm_medium=member_desktop
I respectfully disagree that "Under Ho we assume, for the purposes of argument, that a limiting proportion indeed exists".
All we assume is that the random variable has some particular probability distribution (at least approximately so). There is no need to give any meaning to the word "probability". If the distributional assumption is well calibrated to an experienced frequency distribution, then the resulting p-value is also calibrated to an (expected) frequency distribution (uniform, under H0). But this is not required. We can also interpret the probability distribution of the random variable as a way to quantify our relative expectation and in this case the p-value also reflects the relative expectation we should have to observe such or more "extreme" data under H0.
It is evidently good and reasonable to have a good "frequency calibration", but this is for the usability of the analysis, not for the definition of what probabilities are. It would be strange and of little use to assign high probabilities to outcomes that are experienced rarely. We can and should require that our assumptions are in line with our experience (up to date), but we cannot demand that anything will be defined in an infinite future. So what I say is that probability statements are necessarily local, not global. Trying to fix a global meaning of probability statements is to me like trying to fix an origin coordinate of the universe, or an absolute momentum of an object. It does not exist. And yet local coordinate systems and local definitions of momentum are very useful.
Jochen Wilhelm, any definition of probability other than as the long-run limiting proportion of events over repeated samples is unfalsifiable. If we try to be agnostic and say that, for an observable event, probability "just is," we are in fact relying on a causal propensity definition.
Here is a related post on Kolmogorov's original sin in his axiomatic formalism of probability theory.
https://www.linkedin.com/posts/geoffrey-s-johnson_lets-take-a-moment-to-consider-the-effects-activity-6980170246422732800-9u6C?utm_source=share&utm_medium=member_desktop
The frequentist *defines* probability as the limiting proportion of events over repeated sampling. He does derive a feeling of confidence as a result of understanding this long-run performance, but the feeling is not a probability.
https://www.linkedin.com/posts/geoffrey-s-johnson_are-you-a-bayesian-or-a-frequentist-read-activity-6990304914568531968-ua1Y?utm_source=share&utm_medium=member_desktop
Yes, I agree, but any definition based on long-run limiting proportions is unfalsifiable, too. An infinite sequence of proportions can behave like it approaches an (unknown, fixed) value over a period of 10^100 repetitions, but this contains no information about the next period of 10^100 repetitions, or about how the sequence will behave after 10^100^100 repetitions.
Btw, how can an infinite sequence of proportions have a limit between 0 and 1? This proportion is, as I recall,
Pr(X=x) = lim(N->Inf) n[x]/N
where n[x] is the number of times X takes the value x, and N is the number of observations.
Now there are two possible cases: n[x] grows to Inf. Then the limiting value is Inf/Inf, which is either 1 or undefined (as you wish). Or n[x] stays finite. Then, no matter how large n[x] is, the limiting value will be some finite number divided by Inf, which is 0. I don't understand how this definition can result in limiting values between 0 and 1.
For me, the term "long-run" makes sense (is useful), but not in conjunction with "limiting". They don't go together. "Long-run" indicates considerable "local" experience. "Limiting" is an unjustifiable extrapolation.
I would be happy to learn that, and how, I am wrong here.
Jochen Wilhelm, you are equivocating on the word falsifiable. It does not mean verifiable. A statement or hypothesis is falsifiable if it can be empirically investigated. Falsifiable does not mean we must be able to directly observe the hypothesis.
A hypothesis concerning the proportion of white marbles in an urn is falsifiable if we can at least sample from the urn. We don't have to observe every last marble in the urn to falsify a hypothesis concerning the proportion of white marbles in the urn. We just need to be able to gather evidence that, at the very least, is itself falsifiable.
Likewise, a hypothesis concerning the existence of a long-run limiting proportion of events over repeated sampling is falsifiable if we can at least perform finite sampling. We don't have to observe an infinite sequence to falsify (gather evidence against) a hypothesis concerning the existence of a proportion over an infinite sequence.
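A minimal R sketch of the urn example with made-up counts: a modest finite draw is enough to gather evidence against a hypothesized proportion of white marbles, without inspecting every marble.
white <- 13                          # made-up count of white marbles in the draw
drawn <- 60                          # made-up number of marbles drawn
binom.test(white, drawn, p = 0.5)    # small p-value: evidence against the hypothesis that half the marbles are white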
Importantly, we're not setting out to *prove* anything. We are merely gathering and presenting evidence to ultimately perform inference and make a decision. Presenting unfalsifiable statements is not evidence of anything.
The sample proportion is bounded between 0 and 1, whether a limiting proportion exists or not. n[x] is always greater than or equal to 0 and less than or equal to N.
*Long-run* and *limiting* are synonymous.
I don't understand, then, what you mean by falsifiable. If I may cite Wikipedia: "A theory or hypothesis is falsifiable (or refutable) if it can be logically contradicted by an empirical test." (https://en.wikipedia.org/wiki/Falsifiability). This may not be a correct source, but it aligns with how I understand it. This article also says a few lines later that "definitive experimental falsifications are impossible" and "according to Popper, statistical tests, which are only possible when a theory is falsifiable, can still be useful within a critical discussion."
I can falsify a hypothesis about a proportion of white marbles in an urn only when the urn is finite, the sampling is without replacement, and the sample is large enough in relation to the urn size.
You say that "Something is falsifiable if it can be empirically investigated" - but I don't see how infinite limits can be empirically investigated.
Of course a sample proportion is bounded between 0 and 1. But you insisted on defining probability as a limiting proportion, and under this definition, probability values cannot have any value other than 0 or 1. If you move away from the "limiting" case and say that a sample proportion is a reasonably good estimate, you abandon objectivity and substitute for it a local - subjective - view, representing your subjective experience/data/knowledge from the sample. I am sure that this is not your intention, so I am puzzled by your arguments.
Jochen Wilhelm, it is incorrect to say that the principle of falsification applied to hypotheses concerning a population is applicable only if the population is finite.
A sample proportion is bounded between 0 and 1 inclusive for any N. A limiting proportion is a limit operator applied to the sample proportion. This does not imply that, should a limiting proportion exist, it is necessarily exactly 0 or 1.
*Subjective* is a poorly defined term. Bayesians use this for argumentation because it can mean whatever they want it to mean. Sticking with falsifiable, I can say that a sample proportion is not only falsifiable, it is verifiable since it can be directly observed. Using *subjective* as a synonym for unfalsifiable, there is nothing subjective or unfalsifiable about a sample proportion.
"A sample proportion is bounded between 0 and 1 inclusive for any N", yes, for any finite N. That does not help in setting up an objective definition of a p-value.
"A limiting proportion is a limit operator applied to the sample proportion.", yes, but the result is mathematically not deducible. The series of proportions from sampling is not a mathematical function. What do you mean with "should a limiting proportion exist"? I thought this is a prerequisite to the objective frequentist definition of probability?
Jochen Wilhelm, a prerequisite that can be empirically investigated and, should there be enough evidence against it, tentatively rejected. That is what I mean by, "should a limiting proportion exist." This is what makes the frequentist paradigm scientific - we have the capacity to gather empirical evidence against it.
0 le n[x] le N for all N => 0 le n[x]/N le 1 for all N. Proof by induction. The word *objective* is also a poorly defined term used by Bayesians to mean whatever they want it to mean. Using *objective* as a synonym for falsifiable, the limiting proportion of events over repeated sampling leads to a falsifiable or objective definition of a p-value.
0 le n[x]/N le 1 for all N is undoubtedly correct, also for N being infinite. But for N being infinite, n[x]/N is either 0 or 1 (which satisfies the inequality), but does not allow any value in between.
Jochen Wilhelm, you can't simply state that all limiting proportions of events in repeated sampling are unequivocally exactly 0 or 1. You can only state that, if the limiting proportion exists in a particular setting, it is somewhere between 0 and 1 inclusive. You can then offer a hypothesis and investigate it using finite sampling.
Since you are adamant about the hypothesis that all limiting proportions are exactly 0 or 1, you could empirically investigate this claim in a particular setting.
Best of luck!
Jochen Wilhelm, you are equivocating again on the word falsifiable, confusing it for verifiable. No one is arguing with you that a limiting proportion of events over repeated sampling is unverifiable.
It's about your claim that "it is the falsifiability of the frequentist definition of probability that makes it scientific and valid as evidence."
Ok, back to the origin of the definition of "falsifiable". Karl Popper wrote in his 1934 book "Logik der Forschung":
A theory is to be called 'empirical' or 'falsifiable' if it divides the class of all possible basic statements unambiguously into the following two nonempty subclasses. First, the class of all those basic statements with which it is inconsistent (or which it rules out, or prohibits): we call this the class of the potential falsifiers of the theory; and secondly, the class of those basic statements which it does not contradict (or which it 'permits'). We can put this more briefly by saying: a theory is falsifiable if the class of its potential falsifiers is not empty.
Note: it's about a theory. The theory in question is the definition of the p-value, afaik. It says: the probability of an event is a limiting frequency of the event being observed in a repeated series of "trials".
What is the "class of its potential falsifiers" in this case? No finite sequence ever would prohibit the limiting frequency having any arbitrary value (and I still don't understand how the limiting frequency can have a value different than either 0 or 1). Something rhat requires an infinite series of observations isn't an empirical (that is, according to Popper: falsifiable) method.
Jochen Wilhelm, we can theoretically determine the rate of convergence of the sample proportion, were a limiting proportion to indeed exist. If, in a particular application of the theory, we take a long but finite sequence of repeated samples, and the sample proportion as a sequence of sample size dances wildly such that we cannot identify a single hypothesis to which the sample proportion appears to be converging at the rate theorized, this would be an observation *inconsistent* with the notion that a limiting proportion exists for this application. It's not proof, just evidence. It's enough of an empirical observation that we can make a tentative evidence-based decision to behave as though a limiting proportion does not exist. A decision subject to error. We can even quantify how often (in the limit!) we would make a type I error under the hypothesis that a limiting proportion does exist.
If, as you hypothesize, a limiting proportion of events over repeated sampling necessarily converges to either 0 or to 1 regardless of the application, then two things: 1) a limiting proportion exists!, but... 2) the strong law (or frequentist strong axiom) of large numbers is a bunch of crap, implying that even finite sampling is meaningless (what's the point?), so that both frequentist and Bayesian significance testing based on repeated sampling have no value. Importantly, yours is a hypothesis that, in a particular application, can be empirically investigated!
"Can I say that the probability of making wrong conclusion is lower when rejecting the null hypothesis at a p-value of 0.001 than at 0.045? In other words, can the probability of accurate conclusion be gauged by the value itself of the p-value ?"
The conclusion of your study should be made by considering the effect size and some descriptive statistics, and be based on your domain knowledge. A p-value cannot tell you the probability of making an accurate conclusion. At best, a p-value can be considered as an indicator of the reliability of the effect size. But the signal-to-noise ratio is a simpler and more direct measure of the reliability of an effect size than a p-value.
Hening Huang, huh? The p-value is a calibration of the signal to noise ratio. That a result is, say, 1.5 standard deviations away from the mean of all possible repeated experiments is still application specific. It says nothing about operating characteristics. Real confidence comes from understanding performance.
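For what it's worth, a closing R sketch of that calibration (all numbers arbitrary): with the degrees of freedom fixed, the two-sided one-sample p-value is a monotone transformation of the signal-to-noise ratio |t| = estimate / standard error.
df  <- 19                     # e.g. a one-sample t-test with n = 20
snr <- seq(0, 5, by = 0.5)    # signal-to-noise ratio |t|
p   <- 2 * pt(-snr, df)       # two-sided p-value as a function of |t|
round(cbind(snr, p), 4)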