p-values seem to be the accepted standard for publishing, yet I have the feeling that many do not ponder their use or know what they imply. The use of effect sizes with corresponding confidence intervals is more useful from my perspective and should be the goal in the first place. Furthermore, in most cases p-values do not answer any of the questions the authors ask. Yet, stating that p
Jochen Wilhelm
I stand corrected, it is true that it is about samples, not the population. Indeed, a more sophisticated answer would be: the p-value indicates the probability of obtaining results as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct. This is fine, and depends on the statistical test. But this is not my issue; the issue is what these tests compare. If you state µ1 = µ2, which is what a t-test does, this seems not relevant in most studies, especially field studies. In field studies we cannot exclude underlying hidden factors influencing the variation of the means (or otherwise) of our samples. In controlled experiments where only one factor is changed, the focus on the central tendency (mean/median) might well be justified. The variation of our samples is then assumed to be caused by biological difference (i.e. genetic difference between organisms). However, under field conditions you can never exclude this. If we only look at the t-test, it focuses on the variation of the mean in the samples. The t-test uses the mean and standard deviation to indicate something about the variation of the mean. It tells me how likely the mean of "sample A" is similar to, or as extreme as, "sample B" as observed results of a statistical hypothesis test. But the story remains the same: if I replace the term "population" by the term "sample", the p-value still does not tell me anything about the probability of being correct or wrong. As I show, simply randomly re-sampling the "samples" gives more-or-less similar results. I have the feeling that p-values in general do not answer the question researchers ask.
"As I wrote above, the p-value is linked to the amount of information provided by a sample of data w.r.t. a particular statistical model. It's not about an effect size at all!?" focusing on the t-test, the t-test compares the variation between the means. It is the variation of this what it compares. A common effect size estimator is the difference between means and a statistical test compares the variation of the effect-size estimator. Using only the p-value and not mentioning the effect-size reduces the information provided (Smidt and Hunter 1997).
"I think part of your problem is that you seem to confuse a statistical conclusion with predictive accuracy. For instance, a large clinical study may find a statistically highly significant improvement of drug A over B, but it may still be that almost every second patient would benefit more from drug B." First, this is not my confusion. My confusion is that it is "significant" or "non-significant" (i.e NHST) while we talk about a probability gradient. Not about a "yes" or "no" question, while most articles read like this (I would completely exclude the term "significant"). Secondly, "it may still be that almost every second patient would benefit more from drug B". In this case you refer to odds ratios I assume (?). This is often hardly mentioned if it would be an ecological study. So I can infer nothing about the strength or size of the effect, which what bothers me.
So what you suggest is to keep using p-values even if my sample size is large enough to show that everything is "significant"?
Schmidt, F.L., Hunter, J.E., 1997. Eight common but false objections to the discontinuation of significance testing in the analysis of research data, in: What If There Were No Significance Tests? Lawrence Erlbaum Associates Publishers, pp. 37–64.
Jochen Wilhelm
"And intentionally so." So why do so many use it if this is not the answer to the question to what is the probability I am wrong?"Models are models. There can be good, precise, useful, reliable etc." Lets assume we can only have normal and non-normal distributions (excluding, log normal, binominal, poisson etc.). How is "good", "precise", "useful", "reliable" than defined? How to quantify.
" but not 'correct' or 'wrong' models." If there are neither "correct" or "wrong" (simply falling back to a binary answer highlighting the issue of the NHST) models everything is justified, and everyone can do apply however the see fit. Neither my argumentation, the argument of a reviewer or conspiracy theorist is valid.
"But unfortunately there is no objective, "correct" way to specify the prior, and two scientists starting from different priers arrive at different posteriors after seeing the same data (the good thing is that the data will lead to the fact that the posteriors will be closer to each other than the priors)." This quite summarizes my thoughts, whereas I only see the "negative" point of the applications of most models. In some sense you can disprove (see the "negative" point) every analysis that is performed in any article. How to overcome this?
"Yes. And this is just what the p-values are supposed to do. They are a measure if the sample size is large enough to interpret the test statistic." This only tells me that the variation in the statistic is small enough to "refute" the null hypothesis, not if this effect is actual "note-worthy." "Note worthy" can be interpreted in the sense that the effect is large enough so, "we can actually do something with it" (not sure how to highlight this differently a.t.m.). Further, "significant" or "non-significant" < .05 is cut-off value which nobody really can justify. "Large enough" is quite subjective the American Statistical Association (ASA) simply published the following:
Q: Why do so many colleges and grad schools teach p = 0.05?
A: Because that’s still what the scientific community and journal editors use.
Q: Why do so many people still use p = 0.05?
A: Because that’s what they were taught in college or grad school.
Article The ASA's Statement on p-Values: Context, Process, and Purpose
Moreover, the definition of the p-value (according to the ASA) does not say anything about the sample size. It simply states: a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value (reference above). If I am wrong, I just do not get the right information (mark me as an idiot), but this is simply the information I get (perhaps a flaw in the "scientific" process, or is it just easy to blame someone else rather than myself?).
Let's say that I correlate the number of plant species lost over a 20 degrees Celsius range of air temperature increase (simple linear model with normally distributed residuals, and a t-test applied to the slope). I have an R-squared of 0.1 and a p-value of < .001 with 1000 sample points. Then it is concluded there is a "significant" effect of temperature increase. The R-squared states that only 10% of the variation is explained by the fit (not sure how to explain this non-visually). The p-value relates to the null hypothesis that the coefficient (slope) is 0. While there is a "significant" effect, is this effect a "noteworthy" effect? A 20 degrees Celsius increase is "huge". Without the slope and its confidence intervals I cannot judge whether this is useful. Let's say the slope indicates that 100 species disappear per 1 degree Celsius. Then I would judge the effect to be disastrous. But if the slope is 0.1 per 1 degree Celsius I would simply "not care", even if it is "significant". Already the R-squared of 0.1 makes me doubt the notion of "significant", however small the p-value might be.
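A minimal R sketch of this kind of situation (the seed, noise level and slope below are made up, chosen only to roughly reproduce an R-squared of about 0.1 with n = 1000); it illustrates how a weak slope can still come with a p-value far below .001:
##############################################
# Hypothetical illustration: a weak but "significant" slope with n = 1000
set.seed(1)
n    <- 1000
temp <- runif(n, 0, 20)                  # assumed temperature increase (degrees C)
loss <- 0.1 * temp + rnorm(n, sd = 1.7)  # assumed slope: 0.1 species per degree C
fit  <- lm(loss ~ temp)
summary(fit)$r.squared                   # roughly 0.1
summary(fit)$coefficients["temp", ]      # slope near 0.1, p-value far below .001
confint(fit)["temp", ]                   # 95% CI for the slope
##############################################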
Most articles only display p-values, and even if they mention some minimal form of effect size they get published (see my first reference). Yet, in my opinion (which is just an opinion), if they get published with a minimal effect size, I do not find them very convincing. Yet, I cannot find many "good" articles that I can use to prove a point, since none of them mention (strong) effect sizes, only p-values. To hand in a "negative" manuscript, however honest it might be, is simply not possible without any status.
p-values are also very subjective. Let's say I write a "nice" abstract where I say my results are "significant". The editor reads this and sends it to some reviewers. What I did not mention in my abstract is that my cut-off value for p-values was .1. The usual cut-off value for the p-value is .05, but there is no clear reason why it is .05. If I have the "right" reviewers, the manuscript is accepted.
My issue is the use of p-values if they do not answer the question of the researcher(s), or the question I expect them to answer. If I always doubt, how do I write a story that guides a reader into the topic of my study, even if I doubt my own study?
Wim Kaijser, your original question concerned means and distributions. The question was confused by the usual practice of p-values for differences of means. The difference in means using p-values relies on the central limit theorem and the average mean. Inferential statistics attempts to describe properties of the population based on a sample of the population. Jochen Wilhelm
correctly pointed out that the p-value is for the sample, not the population. Inference to the population requires that all assumptions are correct and the sample is representative of the population. There is a further complication that is not usually discussed. That complication is how the difference in the means translates to differences in the populations. You correctly observed that area under the curve has an interpretation that is different from a difference of the means. Jochen addressed this by observing that treatment A is not necessarily better than treatment B for some patients when A was deemed better for most patients. It is important to note that when A was determined better than B, this determination was made using samples, not the population. A difference of A and B based on a p-value is highly questionable given the overlap of area under the curve and the assumptions and conditions of the sampling and testing.
Difference of the means has a meaning only when the question asked can be answered by the difference of the means.
This brings us to the question asked and hypothesis testing as a function of statistical analysis. Statistical hypothesis testing is about data. Statistical hypothesis testing does not address the question asked. Statistics texts will formulate problems as: is A better than B? Form the null hypothesis that A = B. Establish a p-value for rejecting the null hypothesis. Test A against B using N samples.
The usual language of statistical hypothesis testing is a data quality method and has nothing to do with comparing A to B. It does not represent a scientific hypothesis. Nevertheless, stating the null as A = B leads inexorably to the association that A is better than B if the p-value threshold is reached.
The problem is that a statistical hypothesis is not a hypothesis in a scientific sense. Indeed, it is not a hypothesis; it is a data analysis procedure and nothing more. More unfortunate is that a difference in averages, based on the central limit theorem, is taken as sufficient to 'reject' the null. The scientific question includes much more than the difference in the means, except for the rare situation when all that matters is the difference in the means.
Interesting discussion and I agree with Jochen Wilhelm
's advice, especially the last one: “- try to get a better understanding what p-values actually are
- read about the concept of statistical power
- read about Bayesian statistics
- remain critical in your thinking”
I would like to recall something about models:
https://en.wikipedia.org/wiki/All_models_are_wrong
And I suggest always thinking about the mechanistic knowledge behind models.
It is not about p-values and what they are; how they are derived, what they indicate and how they are used is what seems to be the issue. It is the conclusion that is drawn from < .05 and what this p-value means. I perfectly understand how p-values are derived and what they indicate. This is clearly provided in the script (do not blame the simplicity of it). But a p-value itself only tells something about the variation of the chosen statistic (i.e. t-test = mean). What concerns me is the following:
(1) This variation does not, in any way, warrant the conclusion that p < .05 means there is a "significant" effect.
(2) The p-value is not derived from the size of the effect, but from the variation of the effect size.
(3) p < .05 neither proves nor assumes causality.
(4) Under field studies hidden correlations/effects cannot be excluded.
(5) The term "significant" has a value-laden meaning ("yes" there is an effect and the hypothesis must be "true"; "no" there is not and the hypothesis is false), while it only indicates something about the variation.
(6) The assumption that the p-value suggests your observation is not merely due to chance is also incorrect, but still commonly made.
(7) The focus of many studies is to reach
Jochen Wilhelm
I agree the p-value is not the issue; it is the interpretation and consequently the use of it. Perhaps I should have made this clearer in the first place (sorry for this). But still, if we can agree the cut-off value of .05 is an urban myth, then you can never claim, "Since m > 0 and we can reject M = 0, we can also reject any M < 0, so the information from the data is sufficient to conclude that M > 0." Rejecting M = 0 needs some cut-off value, since it is either M = 0 or M > 0. We can neither substantiate that .05 is well founded (or is there any study comprising many fields showing that p < .05 has some natural basis?), nor can we say that .05 means we have enough samples, since it is either "enough" or "not enough", resulting again in a dualistic approach requiring a cut-off value. The issue I have with the statement that it is unexpected is that including more samples always results in M > 0. Is it then still unexpected to conclude M > 0 if you know that simply including more samples automatically results in M > 0? An open issue is using p-values without any cut-off value and only in such a way that it is clear what they indicate, or not using them at all if they do not provide any benefit. While I can justify not using them, this does not mean someone else has to agree with this; perhaps they have other ideas about it.
"They are all about urban legends and misconceptions many people have about p-values." I am not sure if they are legends or real life villains?
Jochen Wilhelm
"He always calculated an (at least approximate) p-value and interreted this in the current context. There are examples where he found 0.01 not significant but 0.2 significant." Interesting, you have the name of the book?"Under M=0 you have a 50:50 chance or expectation to observe m0. If M is very close to 0, the odd ratio is still very close to 50:50. Thus, claiming that the sign of m is correct based only on the observed sign of m will by correct only with a probability of 0.5." I agree, but the p-value does not convey, the information how close you are to zero, but only the approximated variation of this odd ratio given your samples. You can have 0.53 and if the confidence level at 95% is 0.02, you have a "significant" effect. Yet, given the effect-size, possible knowledge, perhaps a clear figure, I would conclude there is no "significant" effect. Or I would simply state that the results would not be very convincing and effects are neglectable. The question is, if many would except this statement given p < .05.
"We must work more and harder to provide a better statistical education. This does not start with students of empirical sciences! We must train teachers to be able to teach statistical thinking." The issue I think is the large gab between statisticians and the practical application of statistics. I am not a statistician/mathematician and most technical articles are very hard if not impossible to decipher by me. So I need to go for 20+ other sources to get an idea what they mean. While the ASA statements on p-values are correct. This does not answer the question, why do they not do what they are mostly used for and what do they indicate then? The idea of most statistical test can be well explained with the simplistic example of a vase with papers and numbers written on them, show that resampling procedure leads to the to a probability distribution and explaining that this can be captured in a probability density function (while this sounds kinda childish it might more effective, at least, that is how I feel). But most introductions immediately start with probability density functions and fancy equations. Either people find it to boring, steps are to big, or it is too technical. Yet, from my perspective it is not about the actually mathematical prove of the test, but about the underlying idea. In ecology/biological field people are quite visual (for a lack of better wording), visualizing the process of deriving the p-value can clarify much more than displaying this in equation form. When you can grasp the underlying idea "pros" and "cons" are much easier to discuss. Needless to say, sifting to all information available is quite time consuming, you need to find the "right" information and right keywords (lots of jargon). Furthermore, books are expensive and you never know if it contains the "right" information to improve understanding. I buy a lot of books in other topics to read 1 chapter, and come to the conclusion that I already know this and the sales talk around the book have convinced me to buy it. Yet, neither can you publish a 30 page article that simplistic describes how a p-value for a t-test, chi-square or Wilcox-test are derived, which information it gives and what not. Even what I now now about statistics is quite pathetic, but this is, I think, an issue of communication and acquiring information. To simply tell someone to search of what p-values actually are is just a very inefficient way of communication and sharing information (neither am I a very good communicator, clearly displayed from how this discussion started to where we are now). Its just an easy way to say, I am right, you are wrong, just figure out yourself why you are wrong. If this is the process of scientific communication I am not not really optimistic about the future. I feel that the idea of being wright or wrong is more a status driven thing (I feel it is a logical fallacy, such as "appeal to authority") and jargon can be easily used by as "proof by intimidation". As long as this somehow keeps playing a role there less objectivity and sort to say free-speech in "science". What this last section had to add to the whole discussion I do not know, but I feel it is somehow important.
Since the discussion got stuck a bit: there are still a lot of articles out there that show things that aren't really there. I can repeat these studies with effect sizes, confidence intervals and, additionally, the p-values, and show it is not simply a yes-or-no question. However, I can never publish it, since the study isn't "novel" enough. So spreading this message to others is not as easy, sadly. The thing is, I write a lot and then think about it, coming to the conclusion that it is not novel, nor very surprising. It is a bit of an issue of communicating this information. I once had a discussion with an ecotoxicologist and she said she will not use effect-size estimators in addition to the p-value because no one will understand it. I am not in the position of convincing people otherwise, nor do I feel confident about it. However, the conclusions of most articles which interpret the p-value differently than intended are still used to guide management decisions. I am just not sure how to handle this issue.
"PS: if someone knows a good book: correct and understandable for a life science student, I would be VERY happy to know about!" Not really, the newer books go directly for R coding, but don't show a simple example with (too) small sample size to simply explain it. Mostly, I go for everything books, articles, websites even youtube (embarrassing too say). However, repeating examples by hand see if they match and the make up an example and repeat it with R to see if I get the same result is quite useful. You can create your own canonical correspondence analysis or decision trees (at least the principle behind it). Yet, this is very time consuming and I do not understand why simple books are so undesirable to publish. The older books are much clearer, they show examples you can calculate by hand, but they are often not available anymore. Either they are hard to find or very expensive. I guess a book with the idea behind it, a simple example able to calculate by hand and than an R code would be perfect. Too my knowledge this is non existing.
Hello, Jochen Wilhelm
do you think this better describes what I meant in the first place? Let’s assume we want to know what the p-value is when comparing the sample distributions of two groups, “A” and “B”, along a hypothetical gradient. We draw 100 samples for group “A” from a normal distribution with a mean of 0.35 and a standard deviation of 0.15; the same is repeated for group “B”, but with a mean of 0.32. We apply the t-test comparing the distribution of group “A” to “B” (Fig 1A).
The resulting p-value from the t-test is .03. This indicates that the mean (central tendency) of the distribution from group “A” has a probability of 3% to be similar to or more extreme than the mean of the distribution of group “B” (given the number of samples, the variation and the assumed model of the normal distribution). The p-value resulting from the t-test thus indicates something about the variation of the difference between the means of the sample distributions of groups “A” and “B”.
Simplifying the idea, we randomly draw 100 samples (with replacement) from the sample distribution of group “A” and calculate the mean, and perform the same for group “B”. After this, the mean of the re-sampled distribution of group “A” is subtracted from the mean of the re-sampled distribution of “B”. This process is repeated 10,000 times. The resulting differences of A minus B can be plotted in a histogram (Fig 1B).
The fraction of the differences mean A minus mean B (the effect) that was negative or zero represents the probability that the observed results were the same (similar) or larger (more extreme). This results in a pseudo-p-value (let’s call it pseudo) of .02, reasonably similar to the p-value of the t-test.
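A rough R sketch of this simulation (the exact values will differ a bit from the .03 and .02 quoted above because they depend on the random seed, and the one-sided fraction below zero is only loosely comparable to the two-sided t-test p-value):
##############################################
# Sketch of the simulation described above
set.seed(42)
A <- rnorm(100, mean = 0.35, sd = 0.15)
B <- rnorm(100, mean = 0.32, sd = 0.15)

t.test(A, B)$p.value                         # Welch t-test p-value (cf. Fig 1A)

# Bootstrap "pseudo-p-value": resample each group with replacement,
# take the difference of the resampled means, repeat 10,000 times
diffs <- replicate(10000, mean(sample(A, replace = TRUE)) -
                          mean(sample(B, replace = TRUE)))
hist(diffs)                                  # cf. Fig 1B
mean(diffs <= 0)                             # fraction of re-sampled differences at or below zero
quantile(diffs, c(0.025, 0.975))             # 95% interval for the difference in means
##############################################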
Considering that the y-axis represents a phosphorus gradient in mg L-1, groups “A” and “B” are both aquatic plants. We hereby focus on the difference between the means, which is considered one of many effect-size estimators: “An effect size is a quantitative measure of the magnitude of the effect.” In addition to this effect size, we included the lower and higher confidence intervals (LCI and HCI). The difference between the means of the species falls between 0.003 and 0.081 mg L-1 (at 95% confidence), and this difference might not be considered very large. However, let’s consider that group “A” is the fraction of people dying because the drinking water is contaminated and group “B” is the fraction of people dying without being affected by this contaminant. The difference might then be considered large, varying between 0.3% and 8%.
The resulting p-value of the t-test and the difference between the means do not focus on the overlap of the sample distributions, but on the variation of the central tendency (mean) of the two distributions; they do not say how often we are wrong. For this we can use another effect-size estimator: the probability of superiority (Ruscio and Mullen, 2012). The probability of superiority estimates the probability that a random sample taken from the sample distribution of group “A” falls higher along the gradient than a random sample from the distribution of group “B”. A probability of 50% indicates that both distributions exactly overlap (1/2*100% = 50%). The probability of superiority indicates that there is a probability (at 95% confidence) between 52 and 68% that a random sample of group “A” falls higher along the gradient than a random sample of species “B”. Considering “A” and “B” are aquatic plant species, suggesting that species “A” is found at higher concentrations than species “B” would be incorrect between roughly 32 and 48% of the time (with 95% confidence), given the samples. If our question was whether species “A” and “B” are highly discriminable, this seems not to be the case. Considering “A” and “B” are the fractions of people dying of a specific contaminant in the drinking water, 52 to 68% would not be acceptable.
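Continuing the sketch above, a minimal way to estimate the probability of superiority non-parametrically is the fraction of all (A, B) pairs in which the A value is higher, with a rough bootstrap interval around it (again, the exact numbers will differ from the 52–68% quoted here; this is only one of several ways to get such an interval):
##############################################
# Probability of superiority, Pr(A > B), for two simulated groups
set.seed(42)
A <- rnorm(100, mean = 0.35, sd = 0.15)
B <- rnorm(100, mean = 0.32, sd = 0.15)

mean(outer(A, B, ">"))           # fraction of all (A, B) pairs with A > B

# Rough bootstrap 95% interval for the probability of superiority
ps_boot <- replicate(2000, {
  a <- sample(A, replace = TRUE)
  b <- sample(B, replace = TRUE)
  mean(outer(a, b, ">"))
})
quantile(ps_boot, c(0.025, 0.975))
##############################################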
In this regard the questions were directed at the magnitude (effect size), not at whether p < .05. In some sense it becomes quite philosophical: how do we define a difference and how do we define an effect? This becomes extremely difficult when we do not know the reason behind the difference. p < .05 does not indicate that there is an effect, nor whether the difference is large. So we technically cannot conclude there is a large difference or an effect based on p < .05. We can conclude something about the variation of the effect. If the term significant is a synonym for important, was the effect important in both examples above (aquatic plants and drinking water contaminant)? For the aquatic plants I would say not really (based on the effect size and confidence intervals), but for the drinking water contaminant I would say yes. Perhaps this is why Fisher seemed so inconsistent in using p < .05 as significant? I also read some articles that proposed to simply replace the word significant, so as not to confuse it with going around saying "yes, there is an effect or a difference". Simply put, they want to stop the Null-Hypothesis-Significance-Testing approach. What do you think about the example?
Jochen Wilhelm
An unnecessary mistake: 0.08 should be 0.04, the LCI should be 0.003 and the HCI should be 0.081, not 0.11. It also would not make sense given this should approximate 0.35 - 0.32 = 0.03. I switched the + and - sign in the code; I corrected this mistake. The correct statement would be: given the distributions are normal with the same mean, then two samples of size 100 from these distributions will give a t-value more extreme than the one from your actual sample with probability 0.03. Note that there is no statement whatsoever about what the distributions actually are, or how likely they are similar or dissimilar. I drew both samples from a normal distribution and used a t-test, so I already work under the assumption of the normal distribution. I also indicate this at the end (given the number of samples, variation and the assumed model of the normal distribution). Or do you refer to "give a t-value"?
Jochen Wilhelm
I am afraid I do not follow. Group A consisted of a sample drawn from a normal distribution with a mean of 0.35 and a standard deviation of 0.15. Group B consisted of a sample drawn from a normal distribution with a mean of 0.32 and a standard deviation of 0.15. I tested Group A versus B with the t-test. Later I applied the bootstrapping method on Group A versus B. I applied both the t-test and the bootstrapping method on Group A (n=100) and B (n=100). Is my explanation too sloppy? What do you mean by the t-test assuming the variance is unknown? The t-test uses the standard deviations of both samples, and sd^2 = var? The t-test uses an estimate of the sd of the population; a z-test uses a known sd.
Looking at the discussion it might be helpful to read a conceptual or philosophical introduction to inference as fundamentally your issue seems to be about whether p values etc. address the scientific questions you are interested in. It is possible that you would be more comfortable with a likelihood or Bayesian approach.
For a gentle-ish introduction I'd suggest Zoltan Dienes' book which starts with philosophy and then moves to contrast frequentist, likelihood and Bayesian inference.
http://www.lifesci.sussex.ac.uk/home/Zoltan_Dienes/inference/
Separately there are issues around whether one should engage in prediction, modeling, interval estimation or hypothesis testing. This depends on your goals and you can explore each of these with any of the inferential approaches.
It is a shame that we tend to teach one framework for inference (e.g., frequentist) and then focus on one set of tools (e.g., NHSTs and p values) without considering that there are many other frequentist tools and alternative inferential approaches within that framework (confidence intervals, equivalence tests, model comparison) and outside it (Bayes factors, regions of practical equivalence, posterior probability/credibility intervals).
Even if you are happy with the frequentist tool set it is worth looking at other approaches (AIC, BIC etc. from likelihood inference) and especially Bayesian methods. The latter are useful because you can flexibly fit a broader range of models more easily (and in some cases handle problems that are in practice too hard to handle with other approaches - at least with finite time, money and person-power).
@ Thom S Baguley Indeed, Karl Popper's book "Objective Knowledge: An Evolutionary Approach" is a good read. Although, I must admit I only read parts of it. Some parts get somehow too confusing or cumbersome (heavy and repetitive). Perhaps the link you sent summarizes some of the broader ideas. Thank you.
Wim Kaijser, regarding your earlier question:
"PS: if someone knows a good book: correct and understandable for a life science student, I would be VERY happy to know about!"
Thom Baguley's answer is excellent, and what he is suggesting is perhaps a necessary step if one wants to understand and ultimately refute and abandon the use of frequentist p-values. Likelihood is arguably the foundation of all inferential statistics, and certainly the basis of frequentist and NHST nil-hypothesis testing. I would suggest any textbook with the word likelihood in it, since these books inform how commonly used procedures like the t-test are derived from exact likelihood ratio tests, and consequently how extending likelihood logic to NHST, in the way it is ubiquitously used now, is deeply flawed, if not entirely nonsensical.
You don't necessarily have to agree with the likelihood function approach to get a lot out of such literature, especially since these books also point out flaws in commonly used Bayesian tests, which are often presented, incorrectly in my opinion, as relatively unequivocal solutions to current issues in statistics in science.
I would recommend the excellent text "In All Likelihood." Older editions can be found online as downloadable PDF files.
https://global.oup.com/academic/product/in-all-likelihood-9780199671229
Edit: fixed link.
Miky Timothy I read the first few pages and ordered the book. Let's see which of the different ideas might break, strengthen or change my current view on the topic. Thank you!
Hi Wim
This debate rages in epidemiology. We're slowly getting editors and reviewers to move from p values to confidence intervals.
My understanding is that p values arose in hypothesis testing, and Fisher started with 0.05 only because that seemed like a reasonable place to start with respect to controlling Type I error.
Meanwhile, estimation statistics were being developed to provide confidence intervals around single sample statistics like means and proportions.
Then came the realization that most hypothesis tests can be phrased as estimation problems. For questions of difference, rather than the null hypothesis that mean1 = mean2 (leading to rejecting or failing to reject H0), the estimation question is: what is the sample estimate of the difference in means (delta), and what is the precision of our estimate (95% CI)? And if the 95% CI excludes zero, we're 95% sure that the difference in the population isn't zero. (The statisticians have a more technical interpretation for the CI but this one is close enough.)
Questions of association (correlation coefficients, Odds Ratios, etc.) can also be tested using conventional hypothesis tests or by calculating confidence intervals.
The great benefit of CIs over p values is that rather than a binary yes/no result they provide an indication of the precision of the estimated statistic. CIs always tell you what the p value tells you, and more.
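A small R illustration of this estimation framing (the group sizes, means and SDs below are made up for illustration): the same t-test machinery, but reported as a difference in means with its 95% CI rather than as a bare p-value.
##############################################
# Hypothetical two-group comparison reported as an estimate with a 95% CI
set.seed(7)
g1 <- rnorm(40, mean = 10.0, sd = 2)
g2 <- rnorm(40, mean = 11.2, sd = 2)

tt <- t.test(g1, g2)
tt$estimate    # the two sample means
tt$conf.int    # 95% CI for the difference in means (g1 - g2)
tt$p.value     # the CI excludes zero exactly when p < .05
##############################################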
Here's a great paper: Article The reign of the p-value is over: What alternative analyses ...
Wim Kaijser A p-value given by a t-test may not be comparable to the AUC (also known as the common language effect size (CL) proposed by McGraw and Wong (1992)). As you stated, "The AUC displays the probability a random sample of population A has a higher value than a random sample from population B." That is, AUC = Pr(A>B). The AUC or CL does not depend on the sample size because its calculation uses the estimated population standard deviations. On the other hand, a t-test compares the difference between two sample means standardized by its standard error (the pooled sample standard deviation scaled by the sample sizes). Because this standard error shrinks as the sample size grows, the p-value decreases with increasing sample size. Please refer to an updated preprint for more discussion Preprint Exceedance probability analysis: a practical and effective a...
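A small sketch of this distinction (the groups below reuse the hypothetical A/B numbers from earlier in the thread; the first line is the parametric CL of McGraw and Wong, which assumes normal groups):
##############################################
# Parametric common-language effect size / AUC vs. the t-test p-value
set.seed(42)
A <- rnorm(100, mean = 0.35, sd = 0.15)
B <- rnorm(100, mean = 0.32, sd = 0.15)

# CL = Pr(A > B) estimated from the means and SDs, assuming normal groups
pnorm((mean(A) - mean(B)) / sqrt(var(A) + var(B)))

mean(outer(A, B, ">"))     # nonparametric counterpart (pairwise comparison)
t.test(A, B)$p.value       # the p-value, by contrast, shrinks as n grows
##############################################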
Miky Timothy "I would suggest any textbook with the word likelihood in it..."
Wim Kaijser
Modesty forbids, but there is my book :)
Evidence-Based Statistics: An Introduction to the Evidential Approach - from Likelihood Principle to Statistical Practice, Wiley.
and accompanying R package:
https://CRAN.R-project.org/package=likelihoodR
Just to comment, although I am very interested in good books, it was Jochen Wilhelm
who asked the question :). I will open a new discussion to ask for books that would fit my description of a "good" book. Wim Kaijser Regarding your example, "Let's take a t-test: the p-value of a t-test in simple language tells me what the probability is that the mean of population A is similar to population B. If p = .03 this tells me that there is a probability of 3% that the means of my two populations are similar (excluding whether this is actually true or not). It does not tell me anything about the probability that I am wrong." If it is a one-tailed t-test, p = 0.03 means that the mean of group A is smaller than the mean of group B with an approximate probability of 3%, or A(mean) > B(mean) with approximately 97%. The sample mean of a group is a random variable.
My interpretation (prior assumptions) changed over the course of this discussion, follow-up questions and other discussions on RG. As such, the question starting this discussion was misleading. A p-value is perfectly useful; the issue is misinterpretation, the use of the p-value, and stark universal boundaries (NHST). But changing this seems incredibly difficult. As such, the first 3 chapters of the Andy Field book are very interesting to read.
As I work with "large" datasets every p-value is low, which is also exactly what the p-value should do. I often read the p-value is measure of noise (I like to say variability/scatter, perhaps an ecology thing) given the summary statistic, assumed model and sample size and can give useful information. The issue with datasets is that they are not carefully obtained. There is no factorial design where "effects" of other factors are excluded, thus dependency/correlations are the norm. Planning a field or laboratory experimental design cost a tremendous amount of expertise and planning, which question to answer, which statistic to used and based on that determining sample size (etc.). Over all these factors I have no control using datasets. Working with these datasets I like to describe the patterns I observe and summarizing this in different statistics. As I cannot control any of the mentioned points above I would prefer not to give a p-value and keep it descriptive to the patterns I observe as to prevent any misinterpretations. I would go for the effect size and confidence intervals (although confidence intervals seem just as misinterpreted as p-values).
However, when I do this, reviewers suggest the dominance statistic/probability of superiority/AUC (or any other effect size) is unconventional and that I need to use conventional statistics (p-values). With "is being critical useless" I meant that it is much easier to do a linear regression and suggest there was a "significant" correlation (while the p-value was for the regression coefficient being < .05, not for the correlation), wrap it in a fancy story following the mainstream ideas and get it published. It requires much less effort/stress/frustration than using effect sizes (here also the term effect is rather misleading) and going against the current.
Moreover, most questions in my field of study seem not related to central tendencies, but to the strength/magnitude of the separation (overlap, AUC, error rate, etc.). Without a meaningful effect-size estimator indicating this, and with only the suggestion that it was "highly significant", I cannot make any appropriate judgement of what I can expect to observe in my data (see some of my frustrations https://snwikaij.shinyapps.io/Debunked/). If no one mentions effect sizes I cannot obtain some quantification or compare my results to others, and I am therefore severely limited/biased to my own interpretations.
Also, the gap between mathematicians/statisticians and the applied field is tremendous. Basic information on the application of statistics is often easy to obtain, but if you want to know slightly more, it jumps directly to extremely complex mathematical calculus/notation/equations (I am not trained for this). Sometimes, searching for an answer to a basic question requires a huge amount of effort (several books, websites, videos, and then still feeling uncertain about the interpretation). I remember that a few years ago I was searching for the idea of variance partitioning (by hand), and it took me 3 weeks to figure out that it could be simply explained with elementary algebra in 3 sentences (?). In my opinion, obtaining basic information should not take 3 weeks. There seems not to be any middle ground where some more complex issues are explained more easily and more visually.
"I am more interested in the probability I am correct."
The probability that you correctly estimated the sign or direction of the effect can be estimated by plugging the p-value into this simple formula:
https://bityl.co/6Tco
This also does not address the issue that there is no factorial design where "effects" of other factors are excluded (sample size is in general the least of my worries in the large random datasets I obtain). If the p-value is a measure of signal versus noise, then without a clear factorial design (or expertise about possible interrelations that could be neglected by accepted ignorance): 1) I have no idea which signal I capture if I regress y ~ x1 + ... + xn, 2) whether it is correlated/dependent and, 3) which noise is caused by any of the other variables (again x1, ..., xn).
The only option is to describe the patterns and measures of "effect" size (e.g. R-squared, regression coefficient, etc.). Yet, I cannot conclude there is an effect (that's why "difference" size estimators might be a better term, but it doesn't sound good). Based on these large random datasets, I can suggest the magnitude of the regression coefficient of x1 is bigger than that of x2, ..., xn, while everyone else seems to suggest x2 is always "significant". Then, in my opinion, future research into the "effects" of x1 should be considered in a carefully designed factorial study to elucidate whether there is indeed a "real effect" of y~x1, y~x1+x2, y~x2, y~control.
For predicting y these dependent/independent variables do not matter, as you are literally interested in predicting (and also in the patterns, actually), not in inferring something about the effects of x1 and x2 on y. As such, how often I am wrong is for me more a predictive issue than elucidating something about the signal-versus-noise ratio of y~x1. Thus, "I am more interested in the probability I am correct", as I look at it now, is not directly connected to the p-value.
Personally (as an ecologist) I want to step outside, or use a random dataset, obtain some educated guess about a summary statistic and see if I can actually apply it. For example, if the difference in means between species A and B is 100 (some unit), species A often occurs at 100 and B at 200 along this gradient, in both my data and a factorially designed study. My question and goals are often: if I go outside and find species A, can I then observe this species often at ~100 (assuming some distribution)? To find out "how often I am wrong" my intention is to test this model/concept/knowledge/assumption on other, completely different datasets (not validation).
I think the discussion becomes quite chaotic and difficult to follow.
Ah, after re-reading, I see I read it wrong, my mistake! I will delete the last part of my previous message (yes, "Bayesian p-value" is contradictory). Then again, via the Bayes factor (or any other approach for that matter) you can also end up in a NHST approach, at least if it is improperly explained what you can infer from it, don't you think?
Maybe short and helpful:
https://www.youtube.com/watch?v=kTMHruMz4Is
Okay, it's clear, but I am a bit confused about this one:
"It is never, and will never be, a statement about 'what you tested' (a hypothesis)."
I (now) know it is only about the data. But if I have data, compare A = 100 and B = 200, and expect the difference to be ~100, I want to see if this is really the case (is it then still testing or comparing?). Then 200 - 100 = 100.
Or do you mean something different by testing, what do you mean by tested? Do you refer to an expectation (hypothesis)? And as such, do you refer to the p-value as a measure of the signal-versus-noise ratio that is not about the "hypothesis" or my expectation? Such as: I expected ~300, but it is actually 100, while in both cases p can be < .001, because I have "sufficient" data and the noise is low? I am sorry for this misunderstanding.
"She is inverting p(data|hypothesis) with p(hypothesis|data). That's a no-go."
P(hypothesis|data) = the probability my hypothesis is "true" given my data = 100%?
- There is no probability involved here.
exactly
- The observed 200 is then only a value considered produced by a process that will produce a different value each time, and so this particular value is not very interesting. What you want to know is the expected value of the process.
Yes, by process you mean "model" (as very loosely used term to represent a pattern [e.g. the process of bootstrapping or inferring a Gaussian distribution])? Can knowledge from publications also be regarded as a model/process? As such, I come across similar values (difference/effect sizes) and observe this in multiple datasets and without mathematical quantification form my expectations (e.g. tacit knowledge, if this is the right term). I mean, I could quantify prior knowledge based on the "process" of obtaining effect sizes from literature. Do we need to quantify this as Bayesian? As a side note, It is very difficult and tiresome to quantify every process from literature and datasets if you do not want to do any meta-analysis with it.
- Right?
In general it seems accepted. But, by trial and error on many different datasets, following discussions on RG, reading articles and playing around in R, I tend to say no, if this is okay. It is sad that this is not taught as a basic course in statistical inference, or that there is no booklet of 100 pages about this??? It took me ~5 years to get to this point while it could be explained on one A4!, or with some R code. I mean, I have a lot of datasets at my disposal, and I tried to select random subsamples from different datasets with different sizes and repeat a study. I (re)create some simple model or concept and test it to see if it works. As such, I am not too picky about deviations in effect sizes of 5 or even 20% from observations due to trial-and-error, given the size of these datasets, if this makes sense.
Thank you for responding with these lengthy answers/comments/critiques. Not sure where else to get explanations (books seem difficult and not very focused on inference by practitioners).
-
P(hypothesis|data) was meant more as a provocative/stupid/misplaced/wrong joke (I need to learn somehow and provoking seems to work???). If you build a Bayesian Belief Network you can formulate the conditional probability, e.g. P(species|value) = 100% = I always observe my species given any data/value (which seems actually very probable for some species). Then, P(hypothesis|data) = 100% = my hypothesis (being true/correct) is always 100% given my data (sorry for this).
Ah okay. So by hypothesis you mean: the quantified expectation (probability) that a future/next observation (summary statistic) would be observed similarly, given any possible sample size for a specified model under similar circumstances? If this is the case, isn't it then better to exclude the words hypothesis and test (if the word test implies hypothesis testing)? Currently, I try to deliberately avoid the words hypothesis (replacing it with expectation), test (replacing it with quantifying patterns in my data, or only using it when I apply a model to actual new unknown data) and "significant", and I try not to use p-values if I do not have a well-designed experiment (but reviewers ask for them anyhow). Yet, you can understand it becomes quite difficult in the current world to do this.
Dear George Stoica
Indeed one of the most direct articles, although I prefer the "original" article as the opening is quite strong and holds the attention: "Q: Why do so many colleges and grad schools teach p = 0.05?
A: Because that’s still what the scientific community and journal editors use.
Q: Why do so many people still use p = 0.05?
A: Because that’s what they were taught in college or grad school."
Article The ASA's Statement on p-Values: Context, Process, and Purpose
Otherwise, the article by Cohen is also "provocative" https://psycnet.apa.org/record/1996-13441-001 or the article by David Colquhoun
Article An investigation of the false discovery rate and the misinte...
Both highlight the "misuse/misinterpretation" of the p-value. Some of the examples from the later article come back in the most recent book of Andy Field (which in my opinion gives a nice summary). Yet, I am not sure articles would reach the general public, as most researchers imbedded in a specific top do not search for publications on statistics. Otherwise, nobody suggests you always need to use hypothesis tests or p-values to make your point and "descriptive" articles are extremely clear, easy to read and just as valuable. Personally I miss some of the basic/philosophical discussions on the meaning of statistics instead of bluntly using a cut-off. In this sense what the meaning and interpretation of a difference/pattern between the means of 0.1 (p = .03 ) and what if the difference is 0.1 (p = .07), how does this difference compare to other research, is it "similar", why not similar, etc?. I want to know what the researcher(s) think(s) about this difference, how are the results interpreted, strengths and weaknesses, different views. I want my bias to be broken, not confirmed, I want evidence to show me I am "wrong", I want to be challenged and not get stuck in a specific world view.
Just a minor update: it is surprisingly difficult to find a comprehensive definition of hypothesis in any of the n statistics books or articles (kind of funny and not surprising). Apparently: “there are several elements normally associated with hypotheses, though not all of them are always available. They include: assumptions, generalization and prediction, observation, experiment, induction and probability” (adapted from Darian, S. 1995). According to Karl Popper a "hypothesis" needs to be falsifiable (not sure which book anymore or if exact). Thus, it is not so that you need to "test" your hypothesis.
However, this "lack of" definability seems actually the same for any other value-laden term: life, sickness, biodiversity, p-value, fact, truth, reality, important, evidence, objectivity, model, probability etc. Perhaps because they are in themselves ideas/concepts/models, which makes them "wrong". But if they can be defined as "wrong" there must be a "right". Perhaps "right" can be defined as "reality", but ideas/concepts/models also arise from reality. Thus everything is "wrong", and if everything is "wrong", the terms "wrong" or "right" have no meaning (or do they? not sure). Thereby, putting effort into defining something is like a Sisyphean task?
Otherwise, if the p-value says nothing about the hypothesis and gives "information" on how compatible the "data" are with the null model, then p(data | hypothesis) seems also misleading; isn't it easier to say p(data | null-model)? Also, the p-value gives no information about the alternative "hypothesis", so isn't it better to avoid the terms H0 and H1? The term data is also a bit confusing, as it is about the magnitude-noise ratio, something like p(magnitude-noise ratio | null-model)??????
Yet, without a stringent definition of hypothesis, it seems you can make the p-value about the "hypothesis" by suggesting: our hypothesis is that the "data" are "incompatible" with the null-model, and proposing a strict cut-off defining "incompatible". Then, if p(data | null-model) < cut-off, you have found that the data are "incompatible" with the null-model.
Perhaps the focus on the p-value has something to do with humans having a tendency to cling to some idea that is "solid/objective/factual" without questioning it??? Yet, the latter seems to result in acceptance and nihilism (N. Gertz, 2019), while "science" is anti-nihilistic (adding meaning to results and being "subjective" [do not confuse "scientific objectivity" with "objectivity"]). Perhaps this is why "Bayesians" accept a (perhaps biased) prior assumption, since everything is uncertain and biased. The latter of course does not mean that you have to be either "Bayesian" or "Frequentist", although reading articles it seems that authors often favor one over the other.
P.S. I feel that my thinking and this thread is more a satire of life.
Wim Kaijser Jochen Wilhelm
agree with the points made here. It is curious that it is difficult to find a definition for hypothesis. One of the best definitions I have seen is by Edwards in his Likelihood book at the start of the section Statistical Hypothesis (p3 - 4): "A sufficient framework for the drawing of inductive inferences is provided by the concepts of a statistical model and a statistical hypothesis. Jointly the two concepts provide a description, in probability terms, of the process by which it is supposed the observations were generated. By model we mean that part of the description which is not at present in question, and may be regarded as given, and by statistical hypothesis we mean the attribution of particular values to the unknown parameters of the model, or of particular qualities to the unknown entities, these parameters or entities being in question, and the subject of the investigation. There is no absolute distinction between the two parts of a statistical description, for what is on one occasion regarded as given, and hence part of the model, may, on another occasion, be a matter for dispute, and hence part of a hypothesis. Every statistical inference is conditional on some model, and the universality with which it is accepted depends upon the general acceptability of the model. Probability itself is but a model which has found general acceptance when applied to events, though not when applied to statistical hypotheses." (his italics) In the last sentence he is referring to the frequentist and Bayesian approaches which use probability for hypotheses, contrasting with the likelihood approach which does not. It is funny that it is in the first pages of the book (perhaps the word curious has been replaced by funny as a “generational” difference; I don’t know. I highly appreciate the input though!). I think Descartes (not sure) said that the message/intention should always be clear in the “introduction/first part”.
To me the attempted definition reads as a difficult paragraph, although you clearly highlighted that this was the best definition you found (so far, I assume). This is "probably" because I am NOT a statistician/mathematician and the book might be directed at a different audience. My comments are not in ANY way meant as critique, but as a way in which a non-statistician/mathematician interprets the text. If I may be blunt, I will break it down from my perspective (ecological background). Perhaps you can highlight some points.
1.) A sufficient framework …
To me “sufficient framework” is rather vague. Does it mean the totality of all ideas and “evidence” coming together, forming a logical flow guiding to a question, like an introduction?
2.) … for the drawing of inductive inferences is provided by the concepts of a statistical model …
Concept to me is similar to model or idea. It is a way of expressing, generalizing and simplifying patterns in the “data” to see if it describes the “observer's” expectations of “reality” and can be easily interpreted (add meaning to it). Whether this is in the form of numbers or otherwise is irrelevant. Based on the latter, perhaps this is why statistical model is placed in italics?
3.) … and a statistical hypothesis.
It is also confusing that the definition of hypothesis contains the word hypothesis itself.
4.) Jointly the two concepts provide a description, in probability terms, of the process by which it is supposed the observations were generated.
Hereby, it seems to be assumed you need a probability for a hypothesis? To clarify this, in the e-book The Art of Statistics by David Spiegelhalter, the following was written: “The pattern does not require subtle analysis: the conclusion is sometimes known as ‘inter-ocular’, since it hits you between the eyes." Thereby, D. Spiegelhalter seems to highlight that not every question (hereby I substituted question for hypothesis, perhaps unjustly) needs an answer in the form of "probabilities" (I still need to read the suggested articles though; the second one is clear, but for the first one I need some time).
5.) By model we mean that part of the description which is not at present in question, and may be regarded as given, …
The model is one of the most important parts? Okay, as given, I will assume it is fine. Better to reduce the whole part to: the model is given (done), or define it beforehand.
6.) … and by statistical hypothesis we mean the attribution of particular values to the unknown parameters of the model, [or of particular qualities to the unknown entities, these parameters values or entities being] in question, …
I feel "unknown entities" adds to the confusion and the part between brackets could be left out.
7.) … and the subject of the investigation.
Where does the word subject come from? I guess by subject the author means data?
8.) There is no absolute distinction between the two parts of a statistical description, for what is on one occasion regarded as given, and hence part of the model, may, on another occasion, be a matter for dispute, and hence part of a hypothesis.
Does the author assume hypothesis and model are interchangeable? It indeed seems to be used like that. Hence, a t-test assumes a specific model. Yet, I do not feel like the model is the hypothesis (perhaps unjustly).
9.) Every statistical inference is conditional on some model, and the universality with which it is accepted depends upon the general acceptability of the model.
The word model I define generously vaguely as "our simplified ideas or concepts representing our observations/data of reality, expressing specific expectations", flawed in many ways, but not necessarily requiring numbers.
Otherwise, if a model requires numerical quantification, this would mean specific scientific topics, e.g. sociology/linguistics/political sciences, cannot propose a hypothesis if they do not use probabilities or numerical quantification?
10.) Probability itself is but a model which has found general acceptance when applied to events, though not when applied to statistical hypotheses."
Okay, my comments seem to be rather equal to this, you could basically ignore everything above.
Difficult to define for such a widely used term.
Indeed, Edwards’ description of “model” and “hypothesis” is somewhat complicated. Since his book is about likelihood, “hypothesis“ here refers to an ”unknown parameter“ in the “model“ that is given.
A bit of in-between searching on the term "hypothesis" and it becomes confusing really fast. Considering the name of this topic, p-value could better have been replaced by the NHST approach. As such, after some tinkering I have some arguments:
1.) If the p-value indicates P(data|H0) it addresses the data. Therefore, it does not address the hypothesis P(H0|data). If it does not address the hypothesis, then all studies that misconceive p < .05 as evidence against H0 are false. Hereby this does not mean the data, patterns and results are false, but that the hypothesis is not addressed. Not to say that most studies are predictive, exploratory and thus hypothesis generating (at least in ecology, as are my own) and the p-value is then misconceived (although it is not "wrong" to use it).
2.) A hypothesis seems to be a dichotomous statement: either H0 is accepted or rejected. This seems a tautology to me, since it has to be necessarily true. Hence, it is a necessary truth. A hypothesis (ignoring the difference between P(data|H0) and P(H0|data)) states: given the data, is the mean between my two samples different or not = P(data | yes, no = mean(A) - mean(B)) [more correctly: reject if p <= .05 ... ]. A question could be: given the data, what is the difference between my two sample means, P(data | mean(A) - mean(B))? If we suggest the hypothesis is not a dichotomous statement, but has to do with an estimation, then this should have been clearly defined somewhere (not to say that it then has to be clear whether a likelihood or posterior is addressed).
Considering the hypothesis, the t-test output in R prints "alternative hypothesis: true difference in means is not equal to 0" for any given sample > 2*2, even if the sample means are the same (it would even word it this way for x and y both being rep(1.2, 100000), although R then refuses to run the test because the data are constant). Hence, read naively, H0 is always false??? which is rather funny. This argument is some weird cook-up.
##############################################
t.test(c(1.2, 2.2), c(1.2, 2.2))
Welch Two Sample t-test
data: c(1.2, 2.2) and c(1.2, 2.2)
t = 0, df = 2, p-value = 1
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.042435 3.042435
sample estimates:
mean of x mean of y
1.7 1.7
##############################################
In either case, even if argument 2 is ridiculous: most journals, editors and reviewers are misled by suggesting that an article needs a clear hypothesis, with p < .05 regarded as the accept/reject region. For example: "... , the nature of the hypothesis or hypotheses under consideration ... ." (https://besjournals.onlinelibrary.wiley.com/hub/journal/13652745/author-guidelines). Or, complaints arise that no hypothesis has been stated and only a question; given the knowledge that the p-value does not address the hypothesis, it would technically be incorrect to propose a hypothesis (not considering other issues with ecological datasets or sampling procedures), which makes strong conclusions ridiculous (hence most studies are exploratory).
In this regard, a question is more useful, as it would give me an estimate (however "wrong" it might be) and not a reject/accept as if it were true/false to arrive at a conclusion. Hence, mean(A) - mean(B) stays exactly the same whether you propose a hypothesis or a question.
Personally, I thus see no merit in NHST (be it Neyman-Pearson or Bayes factors).
"A mini-literature-review: What have been said about Null Hypothesis Significance Test (NHST)":Presentation A mini-literature-review: What have been said about Null Hyp...