A p-value gives the probability of obtaining the result of a statistical test assuming the null hypothesis is true. A more intuitive definition I give my students is that "the p-value gives the probability that the result of a statistical test would be obtained by chance alone". Suppose you conduct a t-test and the result shows that p = .02. This means there is a .02 probability (or 2% probability) that you would find this outcome by chance alone - not because there is a real difference between the two groups (means) you compared in the population.
Type I Error
Now, the concept of a Type I error is closely related to the concept of the p-value. A Type I error is committed when a researcher incorrectly rejects a null hypothesis. Let's stick with our example above. We conduct a t-test and the p-value in the output = .02. Since this is below the conventional threshold of significance (p < .05) the researcher rejects the null hypothesis. However, further suppose that there really is NO difference between the two groups the researcher compared in the population. In other words, the p-value suggests there's a difference, but there really isn't one. In this case, the researcher has committed a Type I error.
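In code, such a two-group comparison might look like the following sketch (the group values are made up purely for illustration; only the mechanics of obtaining t and p are shown):

```python
# Hypothetical two-group t-test; the data below are invented for illustration.
import numpy as np
from scipy import stats

group_a = np.array([5.1, 4.8, 5.6, 5.0, 5.3, 4.9])
group_b = np.array([5.9, 6.1, 5.7, 6.3, 5.8, 6.0])

t_stat, p_value = stats.ttest_ind(group_a, group_b)  # independent-samples t-test
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```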
Sorry, but the answer from Blaine Tomkins is more confusing than helpful. It repeats some common flaws that lead to some of the present misunderstandings of tests and p-values.
If you want a correct definition of what a p-value is, it's best to look up the ASA statement about p-value:
The ASA's Statement on p-Values: Context, Process, and Purpose (Wasserstein & Lazar, 2016)
that says: "a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value."
To put it very simply: consider that you determine the weight of people. You don't know what weight you will measure for the next person. However, you might have a statistical model for the weight, say a normal distribution with mean 75 kg and a standard deviation of 5 kg. Given this model you expect a weight of 75 kg. Now suppose you actually measured a person, and it turned out that this person weighs 60 kg. This is 15 kg off from your expectation. Under your statistical model, this (being off by at least 15 kg) has a probability of about 0.003 - this is the p-value. The fact that it is so small tells you that the actual observation is very unexpected under your model. So either you observed some unlikely light-weight person or your statistical model is not very good.
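In code, this calculation might look like the following sketch, using the model parameters assumed above:

```python
# Weight example: p-value for being at least 15 kg away from the expected 75 kg
# under a normal model with mean 75 and standard deviation 5.
from scipy.stats import norm

mu, sigma, observed = 75.0, 5.0, 60.0
z = (observed - mu) / sigma            # standardized distance from the expectation
p_value = 2 * norm.sf(abs(z))          # two-sided tail probability, about 0.003
print(f"z = {z:.1f}, p = {p_value:.4f}")
```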
The type-I error is an error you can make in an A/B acceptance test. A and B are two different hypotheses, one of which you have to accept, which leads to an appropriate action. A type-I error is to accept B when actually accepting A would have been the better choice. A type-II error is to accept A when actually accepting B would have been the better choice.
Jochen Wilhelm Where precisely are the flaws in my description? Granted I've phrased things differently, but I don't see where it is flawed.
Your characterization of a Type-I error is uninformative since there is no explicit mention of the relationship between the decisional error and the null hypothesis. As it stands, there's no way to differentiate a Type I and Type II error since A and B haven't been specified.
The p-value does not measure the effect size, nor the probability that the null hypothesis (of no difference) is true or justified. As hinted by Prof. Jochen Wilhelm (and in the ASA 2016 statement), it is the calculated probability of finding the observed statistical summary (parameter value(s)), or one more extreme than the observed result (difference), assuming the null hypothesis (H0, as tested) of a study/research question is true. The smaller the p-value (compared to the specified alpha), the stronger the evidence that, at that confidence level, the data are unusual under H0 and that the observed difference is probably not due to chance alone. That does not make you accept the null hypothesis statement; you either reject it or fail to reject it. The confidence level (and interval) is also not a guarantee that a difference more extreme than assumed is impossible by chance alone. You wanted me to say whether "p-value is the same as type I error": no, they are not practically the same, just as Jochen had defined it.
you say " "the p-value gives the probability that the result of a statistical test would be obtained by chance alone". " this is a flawed (even a wrong) statement. This misinterpretation is also explicitely referenced in the ASA statements.
Regarding the type-I error: this error is not defined with regard to H0. It is defined with regard to A and B (what A and B are is as arbitrary as what H0 is - it depends on the research question, the task, and the context of the test). In NHST (a flawed procedure found in many textbooks), the type-I error is explained as the error of falsely rejecting H0. This is quite meaningless, as there is no alternative and thus no loss function that could be optimized, and it is practically meaningless because H0 has no real existence (so statements like "H0 is true or false" make no sense). And even if H0 were taken as something that existed, it is false by definition in most cases (one would not need data to reject it as "being false").
Jochen Wilhelm Interesting assessment. I do have a few follow-up questions. First, isn't there an implied H1 (whatever it may be) whenever H0 is posited? I haven't encountered a situation where a null hypothesis was posited without an alternative. Second, if H0 is meaningless, why bother making a distinction between a Type I Error and Type II Error in the first place? This distinction now seems rather meaningless. In either case, it's simply a decisional error. Lastly, how is H0 false by definition? Is it not a contradiction to state that something is false by definition in most cases? If it's false by definition, there are no exceptions.
Unfortunately, there is no full text available for the ASA 2016 statement you attached, so I am unable to read it in its entirety.
The classical definition dates from the era before meta-analysis and regression modeling, when there were no forest plots or coefficient plots.
A p-value is just P(value), i.e., the probability of the value in parentheses. So which value? The standardized value of the difference from a point estimate. This point estimate can be either H0 or H1. In other words, the zero of the test statistic (like z or t) can be aligned on either H0 or H1.
In the classical definition of the p-value, this alignment is on H0, and therefore it overlaps with the alpha area. In forest plots (meta-analysis) and coefficient plots (regression) it is the other way around. Of course, the two are often numerically the same.
This has been an informative discussion! I am not a fan of NHST or p-values given the problems and limitations inherent in this approach and instead prefer to use Bayesian methods.
If anyone has a full copy of the ASA paper, please share. I would like to learn more about the points raised by Jochen Wilhelm.
Blaine Tomkins, the summary and a link to the full text are here: https://www.crossfit.com/health/the-asa-statement-on-p-values-context-process-and-purpose. Whenever H0 is posited, is H1 hacked or certified? Seriously, the hypotheses - the 'null' and any 'alternative' defined (simple or composite) - belong to the same hypothesis SPACE of ideal values. H0 is NEITHER true nor false but an unconfirmed "reasonable ideal value" that we assume is true!
Here are important questions a scientist might want to ask after a study: 'What is the evidence, and how much of it is there?'; 'What should I believe, and why should I (i.e., is it a valid test with valid results)?'; 'What should I do, given statistical evidence of effects other than chance?'; and 'How should these be reported - everything, or by some rule only particular results?'
" First, isn't there an implied H1 (whatever it may be) whenever H0 is posited?"
Yes and no. It is mathematically okay to say that H1 is "not H0". This is simply the hypothesis space over which the likelihood can be maximized (H0 defines a subspace for the maximization). From the perspective of the test's purpose, this distinction does not make much sense. H0 is simply a benchmark, a reference point, against which the data are compared. This may be best explained with an example:
Say you suspect that the pH of your tap water is not okay. By EPA regulation, the pH should be between 6.5 and 8.5. You measure a couple of water samples from different taps and on different days. They all give different pH values. Some measurements may be lower than 6.5 or higher than 8.5, which can happen and would not by itself be a reason to conclude that the regulations are violated. However, if the measured values indicated that you should expect the pH to be too high or too low, you would contact your water supplier.
The expected value is estimated by the sample mean, and it may be the case that your sample mean is 6.3, indicating a pH that is too low. The question to be answered by a test is: does your sample data provide sufficient information to warrant this conclusion? Remember that if your sample were only a single measurement, it is understood that this single value does not provide sufficient information. But now you averaged n measurements and the average is still too low. Was your sample large enough to provide enough information about the expected pH - relative to the limit of 6.5?
Assuming the pH values you measure are approximately normally distributed, you can use a t-test, testing the hypothesis H0 that the observed data come from a model with an expected pH of just 6.5 (the chosen "reference point"). If the data (or the mean difference 6.3 - 6.5 = -0.2) are very unlikely under this statistical model, you would conclude that the information in the data is large enough to conclude that the expected value is lower than 6.5, so that you would contact the water supplier. More formally, you might consider H0 as the interval [6.5, 8.5] and H1 as the broken interval (-Inf, 6.5) ∪ (8.5, +Inf).
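A minimal sketch of such a one-sample t-test, with made-up pH measurements standing in for the real samples:

```python
# Tap-water example: test whether the expected pH is below the 6.5 reference value.
# The measurements below are invented for illustration.
import numpy as np
from scipy import stats

ph = np.array([6.2, 6.4, 6.3, 6.1, 6.4, 6.3])
t_stat, p_value = stats.ttest_1samp(ph, popmean=6.5, alternative="less")
print(f"mean = {ph.mean():.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value indicates the data carry enough information against the
# reference model with expectation 6.5 to conclude that the expected pH is lower.
```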
" Second, if H0 is meaningless, why bother making a distinction between a Type I Error and Type II Error in the first place? "
Type I/II errors are meaningless if the task is to reject H0. You either reject it or you fail to reject it. This test is not about the truth of H0. It is about the amount of information your data provides against H0.
These errors only make sense if the test is to accept one of two distinctly different alternatives. In the tap water example, consider that you are the water supplier and you have to make sure that the pH is okay. If it is okay, you continue to provide the water; if not, you have to stop the delivery and start searching for the cause of the problem. You can put this as an A/B test with A: the expected difference of the pH from 7.0 is 0, and B: the expected difference is 1.5. If you wrongly accept A, the water will corrode the water lines, which will cost money. If you wrongly accept B, you unnecessarily invest in engineers and staff and lose money from selling water. From the expected losses under these scenarios you can derive a cost function, from which you can then get the expected loss for any given probability of type-I and type-II errors. You can choose/balance these errors to get the minimum expected loss. Doing so will tell you the size of the sample you must analyze to achieve the minimum loss, if you do an A/B test.
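As a rough sketch of how this turns into a sample size: instead of a full loss-function optimization, a common shortcut is to fix tolerable error probabilities and use a normal approximation. The values of sigma, alpha, and beta below are made up for illustration:

```python
# Sample size to distinguish A (shift 0) from B (shift 1.5 pH units),
# given tolerated type-I and type-II error probabilities (all values hypothetical).
from math import ceil
from scipy.stats import norm

delta = 1.5   # difference between the two hypotheses (pH units)
sigma = 1.0   # assumed standard deviation of a single pH measurement
alpha = 0.05  # tolerated probability of wrongly accepting B (type-I error)
beta = 0.10   # tolerated probability of wrongly accepting A (type-II error)

# one-sided normal-approximation sample-size formula
n = ((norm.ppf(1 - alpha) + norm.ppf(1 - beta)) * sigma / delta) ** 2
print(f"approximate sample size needed: {ceil(n)}")
```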
" Lastly, how is H0 false by definition? "
H0 is a restricted statistical model (e.g., "the pH is modelled as a normally distributed variable with expectation µ and variance σ², where µ = 6.5"). Note that you can restrict µ to any value. There is nothing true-ish or false-ish about your choice of µ; it is just the reference value to check your data against. All together, this is a model, and as such an idealized representation of an aspect of the world. There is no point in arguing that the world is identical to that model. It surely is not. Therefore, H0 is "false" by definition. But that is the wrong question and the wrong explanation (it is not about being true or false!). A model might be a reasonably good description, and a model assuming µ = 7.0 for the tap water pH is clearly a more reasonable model than one assuming µ = 1.5 (and the reasons are not statistical; they come from context and subject-matter knowledge). But this does not make it a "true" model.

Testing H0 is just to see whether your observed data contain enough information that it becomes obvious to "see" that H0 fails to account for all aspects of the real data. Again: H0 might be reasonable, but it is (usually) only a matter of the sample size to accumulate sufficient information to reject it. And rejecting H0 is only required to see if we can interpret the sample estimate relative to H0 (like in the pH example: the estimate was 6.3, H0 was 6.5, so the estimate was smaller, not larger. Can we then conclude - with acceptable confidence - that µ < 6.5? Yes we can, if p < (1 - confidence), and if the test is less sensitive to the distributional model than to the tested parameter*).
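As a sketch of that last step, one can compute a one-sided upper confidence bound for µ and check whether it falls below 6.5; this is equivalent to checking p < (1 - confidence) for the one-sided test (again with made-up measurements):

```python
# One-sided upper confidence bound for the expected pH (hypothetical data).
import numpy as np
from scipy import stats

ph = np.array([6.2, 6.4, 6.3, 6.1, 6.4, 6.3])  # same invented measurements as above
n = len(ph)
conf = 0.95

upper = ph.mean() + stats.t.ppf(conf, df=n - 1) * ph.std(ddof=1) / np.sqrt(n)
print(f"one-sided {conf:.0%} upper bound for µ: {upper:.2f}")
# We may claim µ < 6.5 (with that confidence) only if this upper bound is below 6.5.
```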
Jochen Wilhelm This was a very clear and thorough explanation. I appreciate your taking the time to clarify each of these issues - particularly the habit of mistakenly treating H0 as a statement about the state of the world, rather than the assumptions of a particular model.
Blaine Tomkins, thank you for your appreciation; it is very welcome. I don't want to beat up on others. It is my honest interest to improve statistical thinking and to help cure some deep-seated misconceptions. I also want to provoke, so that such things are properly discussed and we all can learn.