Can I say that the probability of making a wrong conclusion is lower when rejecting the null hypothesis at a p-value of 0.001 than at 0.045? In other words, can the probability of an accurate conclusion be gauged by the value of the p-value itself?
No. The p-value is calculated under the assumption that the null hypothesis is true. It tells you something about the chance to observe data that results in a more extreme test statistic, under the assumed model & hypothesis. It does not address the question of whether, or how likely it is that, the assumed model & hypothesis are correct.
Further, an individual p-value does not carry any case-specific information or evidence provided by a particular sample. The interpretation of p-values is not a "by-case" interpretation - it is a procedural interpretation. The procedure of calculating p-values from data has particular statistical properties. Therefore all we can say is that low p-values are unlikely under the assumptions, but we cannot turn this around to find out how likely the assumptions are to meet reality.
If you draw two samples from the same normal distribution, a t-test gives you a p-value for the mean difference. This p-value provides no information, as you can easily see if you repeat this "experiment": you get another, different p-value. When you repeat this, you will get p-values that vary all over the place between 0 and 1 (they will have an approximately uniform distribution). Having two p-values, one being smaller than the other, does not tell you anything about whether the assumption was "less correct" for the lower p-value.
If you sample from two populations with different expected values, the p-values you get will no longer be uniformly distributed. Their distribution is skewed towards smaller values, so most p-values you get are closer to zero. But the conclusion remains the same: the fact that one p-value is smaller than the other does not tell you anything about which assumption was "less correct".
In the correct "procedural interpretation", the difference is that in the first scenario you will only rarely see p-values close to zero. So in some cases you will reject the null hypothesis there, and all these (rare) cases will be false rejections. In the second scenario you will more often get p-values close to zero and more often reject the null. By definition, you cannot make any false rejection here, but it can still happen that the conclusion about the sign of the expected difference is wrong (the sample mean difference may be positive, statistically significant, but the "true" expected difference is negative). I'll call this a "sign error" (from A. Gelman: http://www.stat.columbia.edu/~gelman/research/published/francis8.pdf).
If we assume that the "true" expected difference is never exactly zero (but it can be arbitrarily close to zero), the you can never wrongly reject the null hypothesis, but among the rejected hypotheses you can have a higher or lower probability of making a sign error. This probability only and exclusively depends on the unknown true expected difference relative to the standard error (which itself depends on the sample size). If this is very close to zero, the sign error probability is about 50% (you can toss a coin to decide if the difference is positive or negative). If it is large, the sign error will approach zero.
A bad scientist testing stupid hypotheses will rarely get low p-values, and in those (rare) cases he will publish conclusions with a sign error probability of about 50%. He could increase the chance of getting low p-values by using large samples, but this costs more time and resources.
A good scientist testing sensible hypotheses will more often get low p-values with a low sign error probability and publish more papers with a low sign error probability.
Taking a single paper from two scientists and looking only at the p-values does not allow us to conclude which one is the "bad" and which one is the "good" scientist. This would be a "by-case" interpretation that just does not work with p-values.
The research community finds the papers from both scientists, some from the "bad" one and more from the "good" one, and most of the conclusions presented there about the sign of the analyzed difference will be correct (that is, the sign error probability of the published work is lower than 50%). This is the correct procedural interpretation.
Of course, scientists try to hack this system, literally, by applying "p-hacking" (e.g., doing many very small & quick experiments or making a huge number of stupid tests to get "some" significant p-values that will eventually be published). This is a huge problem for the scientific community.
To make all of what Jochen wrote about above resonate with you, I would highly encourage anyone to go into RStudio and try the following (perhaps oversimplified) code, which you can copy and paste to get results right away. If you didn't understand already, you will understand it better once you tinker with it yourself and see the results with different parameters (change the means, SD, etc.). Compare the results of populations with different parameters of mean and standard deviation and see what you get. (Source of inspiration: section 9.4 of https://sites.ualberta.ca/~ahamann/teaching/renr480/labs/Lab9.pdf)
# Test for p-values using a t.test where you compare the null to itself: this compares
# two samples with the same (or similar) mean and standard deviation and then produces a
# histogram of the p-values. With the current parameters it will give a roughly uniform
# distribution of p-values as long as x and y keep the same mean and SD. You can imagine
# x as the control population and y as a second sample from that same population; in
# reality you will not get exactly the same mean and SD when you re-sample, but I am
# oversimplifying to show a concept. Comparing a population to itself shouldn't skew the
# p-values towards lower values if there is no difference, right? If you want, you can
# make the mean of y 10.5 and slightly alter its SD as well, because no sample will
# really have identical sample parameters upon resampling (simplifying here to show the
# core concepts).
p <- c()
for (i in 1:10000) {
  x <- rnorm(10, mean = 10, sd = 5)
  y <- rnorm(10, mean = 10, sd = 5)
  p <- c(p, t.test(x, y)$p.value)
}
hist(p)
# Then you can change the mean parameter for the y variable (think of this as your
# treatment group and x as the control) and look at the distribution of p-values for
# this test. Try a difference that you think could reasonably exist in real life
# between control and treatment.
p <- c()
for (i in 1:10000) {
  x <- rnorm(10, mean = 10, sd = 5)
  y <- rnorm(10, mean = 15, sd = 5)
  p <- c(p, t.test(x, y)$p.value)
}
hist(p)
Hope these simplified examples help visualize these nuanced arguments. If you extract t.test(x, y)$statistic instead of the p-value, you will get the t-values, which are also interesting to see.
Peter Nam (OP), not quite. However, in a Fisherian frequentist framework you can say that you have less confidence in the null when observing a p-value of 0.001 compared to observing a p-value of 0.045. A p-value of 0.001 means the observed test statistic exceeds a 99.9% margin of error under the null. A p-value of 0.045 means the observed test statistic exceeds a 95.5% margin of error under the null. The first p-value represents stronger evidence against the null than does the second p-value.
The Fisherian frequentist calculates p-values for all hypotheses (not just a research null) and constructs confidence intervals of all levels (not just 95% intervals). Based on his understanding of their long-run performance, he bets on his observed confidence intervals covering the truth. A p-value of 0.001 testing a particular hypothesis means the 99.9% confidence interval excludes the hypothesis. Such an interval covers the truth (wherever it may be) 99.9% of the time in repeated experiments and misses 0.1% of the time. Equivalently, the complement of a 99.9% confidence interval covers the truth 0.1% of the time. In this way, we can feel 0.1% confident that the observed 0.1% confidence interval has covered the truth, and this interval coincides with the hypothesis in question.
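A small R sketch of this duality under the usual t-test assumptions (the data below are just simulated, all numbers arbitrary): the 100(1-p)% confidence interval has the tested hypothesis sitting exactly on its boundary, wider intervals include it, narrower ones exclude it.
set.seed(1)
x   <- rnorm(20, mean = 0.5, sd = 1)   # simulated data, arbitrary parameters
mu0 <- 0                               # hypothesis being tested
p   <- t.test(x, mu = mu0)$p.value
p
t.test(x, mu = mu0, conf.level = 1 - p)$conf.int      # mu0 lies (numerically) on the boundary
t.test(x, mu = mu0, conf.level = 1 - p / 2)$conf.int  # wider interval: contains mu0
t.test(x, mu = mu0, conf.level = 1 - 2 * p)$conf.int  # narrower interval: excludes mu0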
Here are some related links:
https://www.linkedin.com/posts/geoffrey-s-johnson_lets-talk-betting-odds-bayesians-laplacians-activity-7081240302056824832-p3Wi?utm_source=share&utm_medium=member_desktop
https://www.linkedin.com/posts/geoffrey-s-johnson_i-have-created-a-figure-i-wish-someone-made-activity-6947611213925081088-eHw2?utm_source=share&utm_medium=member_desktop
I do not agree, Geoffrey S Johnson. The individual p-value does not matter. The frequentist paradigm, as you stated correctly, is about the frequency properties. Research that accepts conclusions drawn only for cases where p is below a fixed alpha gets its frequency properties from that rule, not from the individual p-values.
I do not agree with Jochen Wilhelm that Neyman-Pearson frequentism is the only valid approach to frequentism. While it is sufficient for the purposes of decision making to test a single research hypothesis using a single margin of error, this does not preclude further testing. This is known as a closed testing procedure. Not only can we test each and every hypothesis, we can utilize every significance level! Thus, the p-value coincides with the smallest attained significance level and represents the weight of the evidence based on the observed data.
The p-value is a transformation of the test statistic. It is a test statistic. If a small p-value does not represent greater weight of evidence than a larger p-value, then there is no value at all in performing the hypothesis test. There is no value in the likelihood or in likelihood ratios. There is no value in forming a rejection region or in calculating power.
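As a hedged R sketch of the "smallest attained significance level" reading (arbitrary simulated data): apply the NP rejection region |t| > t_crit(alpha) over a fine grid of alpha levels; the smallest level on the grid that still rejects sits just above the p-value.
set.seed(2)
x  <- rnorm(15, mean = 10, sd = 5)   # arbitrary simulated samples
y  <- rnorm(15, mean = 14, sd = 5)
tt <- t.test(x, y)
t_obs <- unname(tt$statistic)
df    <- unname(tt$parameter)
alphas  <- seq(0.0005, 0.9995, by = 0.0005)      # grid of significance levels
rejects <- abs(t_obs) > qt(1 - alphas / 2, df)   # NP rejection region at each level
min(alphas[rejects])                             # smallest grid level at which we still reject ...
tt$p.value                                       # ... lies just above the p-value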
Thank you Geoffrey S Johnson for your feedback. But what is the "weight of evidence" of p = 0.0001 obtained under a true null? Shouldn't that be the same evidence as provided by any other p-value? There is no measure telling us whether this one particular p-value is from a uniform distribution or from a right-skewed distribution. Nothing tells us that we should bet on any particular shape of the distribution from which this p-value is drawn. Only when considering a rule, such as taking all observed p-values smaller than a small number alpha as "evidence against H0", is the weight of evidence provided in each case set by alpha, not by the observed p-values.
The confidence interval is the inversion of a hypothesis test based on a p-value. The NP frequentist is only concerned with the performance of a confidence interval in relation to his research null hypothesis under the assumption the hypothesis is true. The Fisherian frequentist is concerned with the performance of a confidence interval covering the unknown fixed true parameter value.
Since the N-P frequentist assumes a null hypothesis for the purposes of argument, the “reveal” is his test statistic. This is why he must define his rejection region and place his bet *before* the data are observed. He can very well construct a 100(1- α)% confidence interval, but he is really only concerned with its performance relative to his research null hypothesis. While the Fisherian also considers null values of theta for the purposes of calculating the p-value, he does so with the intent of his 100p% and 100(1-p)% procedures covering the *real* theta, not an assumed truth for the purposes of argument. For the Fisherian, the “reveal” is the true theta, which in practice may not be easily revealed. This doesn’t matter, though, because the performance of the Fisherian’s procedures is unconditional on the unknown fixed true theta (at least asymptotically). Thus, the Fisherian does not need to place his bet before the data are observed, but he can if he wants to.
N-P frequentism seeks to reduce Fisherian frequentism to its minimally sufficient components for the purposes of decision making. While this is commendable on one level, it results in a contrived caricature of frequentism.
@Jochen Wilhelm, you're not wrong in your application of NP frequentism, it's just not the complete picture of frequentism.
Jochen Wilhelm Without loss of generality, let's say it's a one-sided p-value=0.0001 testing Ho: theta le theta_o. This means the Fisherian frequentist's one-sided 99.99% lower confidence limit has excluded theta_o, and his 0.01% upper confidence limit has covered it. In 99.99% of repeated experiments, the Fisherian's 99.99% confidence interval covers the *true* theta (wherever it may be). Likewise for his 0.01% confidence procedure. Based on this long-run performance, the Fisherian is willing to bet $99.99 in hopes of a $100 return were it revealed that the *true* theta is covered by his observed 99.99% interval. Likewise, he would be willing to bet $0.01 that the *true* theta is covered by his observed 0.01% interval.
Equivalently, based on this observed p-value we could say that under the null the test statistic exceeds a 99.99% margin of error.
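A short R sketch of that correspondence for a one-sided one-sample t-test of Ho: theta le theta_o (simulated data, arbitrary numbers): the one-sided lower limit at level 1 - p lands on theta_o, and the complementary one-sided upper limit at level p just covers it.
set.seed(3)
x      <- rnorm(25, mean = 1, sd = 2)   # arbitrary simulated data
theta0 <- 0                             # bound in Ho: theta le theta0
p <- t.test(x, mu = theta0, alternative = "greater")$p.value
p
t.test(x, mu = theta0, alternative = "greater", conf.level = 1 - p)$conf.int  # lower limit = theta0
t.test(x, mu = theta0, alternative = "less",    conf.level = p)$conf.int      # upper limit = theta0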
The (1-a) confidence interval is usually (not necessarily!) constructed as the set of all hypotheses that cannot be rejected at significance level a. I would not call this an "inversion", but I think here we essentially agree.
However, I don't understand your division in N-P and Fisherian interpretations.
The N-P formalism allows one to find an optimal test strategy in order to maximize the net win (or minimize the net loss) given two alternative hypotheses, each associated with its expected win (or loss). This is often just handled via the type-I and type-II error rates, the expected error probabilities under the two alternative hypotheses. But choosing acceptable error rates requires some real-world connection, and this seems possible only via expected wins and losses. Anyway, this allows one to specify a desired power. The defined power and level of significance of the test give a rejection region, as you said, and the test can be based on the mere fact of whether or not the observed test statistic is inside the rejection region, which is equivalent to comparing the p-value against the chosen level of significance. I think we agree here as well. I don't know how confidence intervals come into play here. To my knowledge, CIs are placed around the estimate, and the particular value of the estimate is considered uninformative in the N-P context.
In contrast, Fisherian tests have no concept of "power" and they do not strive to optimize a decision strategy with regard to... anything. The aim is simply to check whether the available data provide sufficient information about the estimate relative to a hypothesis within a statistical model to draw a conclusion w.r.t. that hypothesis (e.g. should we reasonably expect the parameter value to be on the same side of the hypothesis as the estimate). The CI here includes the set of all hypotheses that are "statistically indistinguishable" from the estimate. I think I see your point, if I'm correct here, that one may consider the width of the CI for a given data set as a function of the confidence level. CIs with higher confidence are wider and include more hypotheses as being "statistically indistinguishable" from the estimate. So one could say that the data provide more evidence against a hypothesis further away from the estimate. Do you agree, or am I on a wrong track already?
If you agree, then my concern is that this is correct for a given set of data (sample), but it does not allow us to compare different samples. I don't know how this should be possible, and I cannot even clearly explain why I think that this makes no sense, because it seems so fundamental. But maybe I can understand when you explain why and how I am wrong here. To make that clear: I don't think that there is any way of comparing p-values or CIs from different samples concerning their evidence (against whatever). The only benefit is, again, the rule not to interpret estimates relative to hypotheses that are "too close" (for which the p-values are too large or which are inside the CIs). This leads to an in some ways reasonable (but not necessarily optimal) strategy - not a decision strategy as in N-P, but a "conclusion strategy".
Giving weight to evidence from data requires some (formal) prior knowledge, or a prior "frame" in which the information is embedded. This can, in my opinion, only be achieved in the Bayesian frame. I don't know how Fisher related to this. I used to think that Fisher was a strong opponent of the Bayesian idea and therefore developed the fiducial argument, which finally failed to be conclusive. But I also read papers stating that Fisher also promoted Bayesian ideas.
A confidence interval is the inversion of a hypothesis test, even if it's not explicitly labeled as such.
Assigned unfalsifiable Bayesian belief isn't evidence of anything.
While Fisher originally did not fully appreciate the concept of power, that shouldn't be taken to mean Fisherian frequentists of today have no concept of power or are unable to make use of it. Since there is an unknown fixed true parameter, there is an unknown fixed true power, and there is nothing stopping the Fisherian frequentist from inferring it; see the preprint "Decision Making in Drug Development via Inference on Power".
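As a naive plug-in sketch in R (not the method of the preprint above, just a made-up illustration): estimate the unknown true power of a two-sample t-test by plugging the observed effect and SD into the standard power formula.
set.seed(4)
x <- rnorm(30, mean = 10, sd = 5)   # arbitrary simulated "control" data
y <- rnorm(30, mean = 13, sd = 5)   # arbitrary simulated "treatment" data
delta_hat <- abs(mean(y) - mean(x))          # estimated difference
sd_hat    <- sqrt((var(x) + var(y)) / 2)     # pooled SD estimate
power.t.test(n = 30, delta = delta_hat, sd = sd_hat, sig.level = 0.05)$power  # plug-in power estimate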
Beliefs may not be falsifiable, but they may well be justifiable (the same applies to assumptions). The relevant question might be whether "evidence" is quantifiable at all outside any context of beliefs. I don't see how the (isolated) observed likelihood (function) carries evidence. One needs a reference frame, a context, to produce/measure/rate evidence. Frequentists do this by using only the likelihood but referring to expectations under assumed models/hypotheses. Bayesians do this by quantifying how much a likelihood alters the prior.
I fully agree with you that "power" is used in the Fisherian interpretation as well to plan studies. Your paper seems very interesting in this regard. Thank you for sharing.
Jochen Wilhelm, performance is evidence. The only scientifically defensible interpretation of the Bayesian's posterior integral (depicted as a CDF or folded CDF) is an approximate frequentist p-value function formed by scaling and integrating the meta-analytic likelihood. Any other interpretation of probability as assigned belief or of the parameter as a random sample (random variable) is indefensible.
Assigned belief of the experimenter isn't a statement of performance and therefore cannot be empirically investigated. It's not evidence of anything. The Bayesian allows himself to assign any credible level to a data-driven interval procedure and no matter what he will always be "right." Treating a parameter as though it were a legitimate random sample leads to contradictory prior and posterior sampling frames.
I think your first argument is circular. I am not sure what you mean by "meta-analytic likelihood", but what is "scaled" is the product of the likelihood and the prior. If your "meta-analytic" means that a prior is already accounted for, then the argument is circular. You cannot turn a likelihood into a posterior without having a prior. It may look like one could do so when using a "flat prior", but a flat prior is as strong a prior belief as any other non-uniform prior.
And I did not say that the researcher's belief itself is evidence of anything. You got me wrong here. I said that the belief is required as a frame to judge the evidence of observations (the likelihood).
I further think you misunderstand the Bayesian interpretation of probability. For a frequentist, a probability can be assigned only to a process ("sampling") that can be repeated (under not identical but somewhat(!) similar conditions). I'd guess you would agree so far. However, this is exactly the problem our conversation started with (see below). For a Bayesian, probability is assigned to our state of knowledge about something (e.g. about a parameter, but also about the value the next measurement or observation of something will take). If the entity (a parameter value, an observation) is numeric or countable, this can be formalized as a random variable with some assigned probability distribution. The formalism is identical for frequentists and for Bayesians. The frequentists just restrict the concept to something that must be a repeatable process. This has the advantage that statements about probabilities can be "probed" experimentally.
To the best of my limited knowledge, a Bayesian does not see a parameter value as a sample. Very similarly to the frequentist, the Bayesian sees the parameter value as some unknown quantity. The frequentist takes a sample estimate of the parameter value and then can say how surprising this estimate is under a given, assumed (hypothesized), arbitrary(!) value of the parameter. And he may also calculate a confidence interval as the set of hypothetical values of the parameter under which the sample would not be "too surprising". The Bayesian starts with some idea about the parameter value and uses the sample to refine this idea. These "ideas about the value" are fuzzy and formalized via probability distributions.
Please don't get me wrong: I am not advocating Bayesianism here. I am just trying to find out where my thoughts might go astray.
---
If we have a sample, we assume that this is an outcome of a random process. We can calculate all kinds of sample statistics, all of which are to be interpreted as outcomes of random processes. This includes the p-value. The process producing the p-values is formalized as a random variable, P, say. So p is a realization of P. P has a probability distribution with sample space (0, 1). As it is defined, the probability distribution of P depends on the true parameter value (of the test statistic, T), and since this true value is unknown, the distribution of P is unknown. We don't know from what distribution p is a realization. If we assume/hypothesize that T = t0, then the distribution is uniform, and otherwise the distribution should be right-skewed, but we don't know how much. So we go for assuming the null value. Under this assumption it is simple to get Pr(P < alpha) = alpha. If we interpret the direction of the observed difference (t > t0 implying T > t0) only in cases with a small p-value, we only rarely make such an interpretation when in fact T = t0, and we make interpretations more frequently the larger the difference between T and t0.
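A quick R check of Pr(P < alpha) = alpha under H0, in the style of the simulation code posted earlier in this thread:
set.seed(5)
p <- replicate(10000, t.test(rnorm(10, 10, 5), rnorm(10, 10, 5))$p.value)  # both samples from the same distribution
mean(p <= 0.05)   # close to 0.05
mean(p <= 0.20)   # close to 0.20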
I don't think our understandings are any different. I will say that frequentists do not assign probability. To them, probability is defined as the limiting proportion of repeated samples, a definition that is empirically investigable. Performance is the evidence. A likelihood is not a full statement of performance, but a confidence curve of p-values is. It outlines each and every confidence procedure based on its long-run performance. Both the likelihood and the confidence curve are functions of the hypothesis being tested. The likelihood simply provides relativistic inference. No prior belief assignment is required to judge it.
I agree that a flat Bayesian prior is as strong a belief assignment as any other, and it's not a factual statement about the parameter, a hypothesis, nor an experiment. Using a sufficiently flat prior amounts to using a uniform prior or an improper prior, which amounts to normalizing the likelihood. This is what I meant by the posterior being the scaled meta-analytic likelihood - the posterior is proportional to a flat or improper (or even subjective) prior multiplied with the full likelihood (historical likelihood times the current likelihood). It's a scaled meta-analytic likelihood. The prior is part of the scaling. I'm simply giving the Bayesian machinery a defensible interpretation.
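A tiny R illustration of this "scaled likelihood" reading in a binomial setting with a uniform prior (the counts are made up):
x <- 7; n <- 20                              # made-up data: 7 successes in 20 trials
theta <- seq(0.001, 0.999, by = 0.001)
lik       <- dbinom(x, n, theta)             # likelihood as a function of theta
lik_norm  <- lik / sum(lik * 0.001)          # likelihood rescaled to integrate to 1
posterior <- dbeta(theta, x + 1, n - x + 1)  # exact posterior under the flat Beta(1, 1) prior
max(abs(lik_norm - posterior))               # tiny (numerical integration error only)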
Bayesians will equivocate however needed to justify their paradigm. They begin by saying the parameter is fixed and probability is assigned belief... until it is stated that assigned prior and posterior belief is unfalsifiable and invalid as evidence. Then they will claim the prior is a modeling assumption, i.e., the parameter is considered an actual random sample... until it is stated that this causes contradictory prior and posterior sampling frames.
Nothing about the Fisherian use of p-values is at odds with your correct explanation of the p-value as a test statistic. You are applying the NP framework to construct an alpha-level rule for a single research hypothesis. I'd like you to go one step further and test each and every hypothesis using the NP framework, and do it for each and every alpha level, then summarize the results. You'll find that, for every hypothesis, the p-value coincides with the smallest attained significance level. None of this invalidates or changes your original alpha-level decision rule. You will still reject/retain as you would have otherwise. The only difference is now you have complete inference on the truth based on confidence intervals of all levels, rather than merely a decision to behave regarding the research null.
Jochen Wilhelm, according to Popper [1] no beliefs are justifiable (knowledge as justified "true" belief is therefore out). Therefore they cannot be "true", only supported or corroborated, not verified, as we do not know what might happen in the future. We cannot verify that our models and numbers are "true" or "exact" representations of reality (as in realism), not merely abstracts that have some representation of it. I do agree with you; perhaps the word "justified" invokes some nitpicking (from my side). Perhaps "epistemically warranted", in the form of inductive validity or cogency, might fit better. Perhaps this does not matter in any case, but I do agree with your held-back position. I also agree with Popper in the sense that if we accept our beliefs to be "true" then we might believe abstracts (metaphysical entities), and then we incorporate ontological things into the world of physical things (when we believe the p-value …)
According to Hájek [1-3] and more recently to La Caze [4], frequentism as such is also not justifiable. Or, as Keynes put it in 1923: "In the long run we are all dead".
The logical foundation of frequentism given by von Mises has proven invalid [2,3].
After all, the attempts to settle a physical basis of probabilities never worked. Frequentism is eventually as "unjustified" as subjectivism. Any argument in which probabilities are defined by frequencies and frequencies result from probabilities is ultimately circular and does not clarify anything.
After all these detours we are not much further than Bernoulli in the 18th century when he was working on Ars Conjectandi [5]; this title, imo, really hits the topic: it's not about physics - it's about the art of making "justifiable", in some sense reasonable, conjectures about things or states we do not know (precisely). And observed frequencies are all the empirical data we have to justify expectations or probabilities (as an alternative to Laplace's approach of logicism, which is often not applicable in real-world problems). It is rational to adjust probabilities to observed frequencies. I think both empirical frequentists and subjectivists do this to make "informed conjectures". The difference to me seems to be that frequentists try to keep out any experience from outside the actual sample (at the cost of implicitly making many and strong assumptions about data-generating processes), whereas subjectivists focus on the "external experience" and use the sample information for an update (at the cost that there is no objective way to define "external experience", but possibly an inter-subjective agreement).
Jochen Wilhelm, it is the falsifiability of the frequentist definition of probability that makes it scientific and valid as evidence. We have the opportunity to empirically investigate a hypothetical limiting proportion of repeated sampling using finite sampling. We don't have to actually live forever and sample forever. Based on finite sampling evidence, we can make the tentative decision to reject the notion of a long-run limiting proportion. Of course, we shouldn't be impulsive and reject the notion across the board without even investigating it based on the impetuous writings of Bayesian diehards.
Importantly, if we reject frequentism, then there is no objective information available in the likelihood, not even for the Bayesian. This technically doesn't stop the Bayesian, though, for he could assign his likelihood belief without any connection to repeated sampling and continue to use Bayes' theorem uninhibited. This demonstrates he is always "right," no matter what probability values he assigns. Bayes is unfalsifiable pseudoscience.
https://www.linkedin.com/posts/geoffrey-s-johnson_lets-compare-bayesian-and-frequentist-inference-activity-6977669862294708224-J4Bj?utm_source=share&utm_medium=member_desktop
Wim Kaijser, that certain elements of astrology are falsifiable does not mean the whole of astrology is scientific, or that falsifiability as a criterion is useless for demarcating science. Those elements within astrology that are unfalsifiable are unscientific.
Geoffrey S Johnson , as you say: "there is no objective information available". It seems to be the issue here. You assume that "objective information" does exist, and this assumption may not prove as useful as you hope.
I also stumble over this statement: "Based on finite sampling evidence, we can make the tentative decision to reject the notion of a long-run limiting proportion". I don't see how finite sampling will bring us closer to a limit that is infinitely far away. It does not matter how large a sample is from which you calculate a statistic: the sample may be taken from some "weird" region in the infinite sequence space. You may substitute each single measurement/observation by a statistic calculated from its own separate sample - the sequence space remains infinite and infinitely larger than the sample. You have zero coverage of this sequence space; you never get a relevant proportion of the sequence space. I think von Mises tried to apply the concept of a (mathematical) sequence to the behavior of relative frequencies with growing sample size. But stochastic sequences do not need to have a limit - such a limit was just postulated (and the proof that they should have one failed, afaik).
Of course we do learn something from samples, and we learn more from larger samples. But this is a subjectivistic view...
Geoffrey S Johnson, what do you mean by falsifiable and objective information?
If falsifiability is the main criterion, then your argumentation is also pseudo-science - or can you show me how to falsify your own argument? This was, btw, advocated by Popper. Then, if your argument is sound, it is unfalsifiable and so it is pseudo-science? Or perhaps logic is not scientific? If you say Bayesianism is not refutable, why are you refuting it at the moment?
The suggestion that those elements within astrology that are unfalsifiable are unscientific is to suggest that the statement "today will be my lucky day because some stars are in position x" is scientific because I was unlucky to make this post?
You are advocating frequentism over Bayesianism, but blaming the Bayesian (if there exists a Bayesian) of fundamentalism? What happened to plurality and anarchism as in Popper (The Myth of the Framework) and Feyerabend (Against Method)? I see "banning" (e.g., p-values or Bayesianism) as a form of intellectual totalitarianism that wants to ban the freedom of ideas, which are, btw, our own.
Moreover, we can perfectly well apply and refute any expectation, be it Bayesian or frequentist. I do not see why falsification should be a term monopolised by the frequentist. If some model parameters are estimated, E(y|x) = b0 + b1*x1, then either this results in an acceptable prediction (and acceptable M- or S-type errors) or it does not. If it does not have epistemic/pragmatic value, we are done; then we have also learned something.
----
The first time Fisher referred to the p-value, in his 1925 book, he addressed it as "… we can examine whether or not the data are in harmony with any suggested hypothesis." It gives no information about the hypothesis, only the information content against H0. Which is actually the most useful thing I learned (from Jochen Wilhelm, btw), but it is somehow not understood?
Jochen Wilhelm, Under Ho we assume, for the purposes of argument, that a limiting proportion indeed exists.
To test this hypothesis based on a specified 100(1-alpha)% margin of error or a type I error rate alpha, we would collect a finite sample and observe whether the sample proportion, as a sequence or function of sample size, remains within the margin of error. This would lead to a tentative decision to reject or retain the null hypothesis that a limiting proportion exists. If we tentatively retain the hypothesis that a limiting proportion exists, it is valid as evidence when investigating other claims.
If we assume for the purposes of argument that a limiting proportion does not exist, then there is no means by which to empirically refute this hypothesis. We will always retain the hypothesis that no such limiting proportion exists.
The only way we can "learn something from samples," is if this sampling converges to something in the limit.
Wim Kaijser, as an abstract philosophy, logic is not scientific. Only when it is applied can it be empirically investigated. Only then does it become falsifiable and scientific. Falsifiability itself is a philosophical principle. Only after this definition is applied to something can the notion and value of falsifiability be empirically demonstrated. Your statement about lucky day and stars makes no sense.
There is no way to empirically refute a Bayesian statement of belief. If a Bayesian assigns 73.4% belief to a proposition, this isn't a statement of performance. There is nothing we can do to demonstrate that this is right or wrong. He is always "right" no matter what number he assigns.
I'm not banning anything. People are free to have discussions and, if they so choose, make any unfounded and indefensible statements they want. The only weapon against bogus speech is more free speech.
https://www.linkedin.com/posts/geoffrey-s-johnson_is-scientific-decision-making-a-misnomer-activity-7079105888334008320-hBKf?utm_source=share&utm_medium=member_desktop
I respectfully disagree that "Under Ho we assume, for the purposes of argument, that a limiting proportion indeed exists".
All we assume is that the random variable has some particular probability distribution (at least approximately so). There is no need to give any meaning to the word "probability". If the distributional assumption is well calibrated to an experienced frequency distribution, then the resulting p-value is also calibrated to an (expected) frequency distribution (uniform, under H0). But this is not required. We can also interpret the probability distribution of the random variable as a way to quantify our relative expectation and in this case the p-value also reflects the relative expectation we should have to observe such or more "extreme" data under H0.
It is evidently good and reasonable to have a good "frequency calibration", but this is for the usability of the analysis, not for the definition of what probabilities are. It would be strange and of little use to assign high probabilities to outcomes that are experienced rarely. We can and should require that our assumptions are in line with our experience (up to date), but we cannot demand that anything will be defined in an infinite future. So what I say is that probability statements are necessarily local, not global. Trying to fix a global meaning of probability statements is to me like trying to fix an origin coordinate of the universe, or an absolute momentum of an object. It does not exist. And yet local coordinate systems and local definitions of momentum are very useful.
Jochen Wilhelm, any definition of probability other than as the long-run limiting proportion of events over repeated samples is unfalsifiable. If we try to be agnostic and say that, for an observable event, probability "just is," we are in fact relying on a causal propensity definition.
Here is a related post on Kolmogorov's original sin in his axiomatic formalism of probability theory.
https://www.linkedin.com/posts/geoffrey-s-johnson_lets-take-a-moment-to-consider-the-effects-activity-6980170246422732800-9u6C?utm_source=share&utm_medium=member_desktop
The frequentist *defines* probability as the limiting proportion of events over repeated sampling. He does derive a feeling of confidence as a result of understanding this long-run performance, but the feeling is not a probability.
https://www.linkedin.com/posts/geoffrey-s-johnson_are-you-a-bayesian-or-a-frequentist-read-activity-6990304914568531968-ua1Y?utm_source=share&utm_medium=member_desktop
Yes, I agree, but any definition based on long-run limiting proportions is unfalsifiable, too. An infinite sequence of proportions can behave like it approaches an (unknown, fixed) value over a period of 10^100 repetitions, but this contains no information about the next period of 10^100 repetitions, or about how the sequence will behave after 10^100^100 repetitions.
Btw, how can an infinite sequence of proportions have a limit between 0 and 1? This proportion is, as I recall,
Pr(X=x) = lim(N->Inf) n[x]/N
where n[x] is the number of times X takes the value x, and N is the number of observations.
Now there are two possible cases: n[x] grows to Inf. Then the limiting value is Inf/Inf, which is either 1 or undefined (as you wish). Or n[x] stays finite. Then, no matter how large n[x] is, the limiting value will be some finite number divided by Inf, which is 0. I don't understand how this definition can result in limiting values between 0 and 1.
For me, the term "long-run" makes sense (is useful), but not in conjunction with "limiting". They don't go together. "Long-run" indicates considerable "local" experience. "Limiting" is an unjustifiable extrapolation.
I would be happy to learn that, and how, I am wrong here.
Jochen Wilhelm, you are equivocating on the word falsifiable. It does not mean verifiable. A statement or hypothesis is falsifiable if it can be empirically investigated. Falsifiable does not mean we must be able to directly observe the hypothesis.
A hypothesis concerning the proportion of white marbles in an urn is falsifiable if we can at least sample from the urn. We don't have to observe every last marble in the urn to falsify a hypothesis concerning the proportion of white marbles in the urn. We just need to be able to gather evidence that, at the very least, is itself falsifiable.
Likewise, a hypothesis concerning the existence of a long-run limiting proportion of events over repeated sampling is falsifiable if we can at least perform finite sampling. We don't have to observe an infinite sequence to falsify (gather evidence against) a hypothesis concerning the existence of a proportion over an infinite sequence.
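A minimal R sketch of the urn example with made-up counts: a modest finite draw is enough to gather evidence against a hypothesized proportion of white marbles, without inspecting every marble.
white <- 13                          # made-up count of white marbles in the draw
drawn <- 60                          # made-up number of marbles drawn
binom.test(white, drawn, p = 0.5)    # small p-value: evidence against the hypothesis that half the marbles are white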
Importantly, we're not setting out to *prove* anything. We are merely gathering and presenting evidence to ultimately perform inference and make a decision. Presenting unfalsifiable statements is not evidence of anything.
The sample proportion is bounded between 0 and 1, whether a limiting proportion exists or not. n[x] is always greater than or equal to 0 and less than or equal to N.
*Long-run* and *limiting* are synonymous.
I don't understand, then, what you mean by falsifiable. If I may cite Wikipedia: "A theory or hypothesis is falsifiable (or refutable) if it can be logically contradicted by an empirical test." (https://en.wikipedia.org/wiki/Falsifiability). This may not be a correct source, but it aligns with how I understand it. This article also says a few lines later that "definitive experimental falsifications are impossible" and "according to Popper, statistical tests, which are only possible when a theory is falsifiable, can still be useful within a critical discussion."
I can falsify a hypothesis about a proportion of white marbles in an urn only when the urn is finite, the sampling is without replacement, and the sample is large enough in relation to the urn size.
You say that "Something is falsifiable if it can be empirically investigated" - but I don't see how infinite limits can be empirically investigated.
Of course a sample proportion is bounded between 0 and 1. But you insisted on defining probability as a limiting proportion, and under this definition, probability values cannot have any value other than 0 or 1. If you move away from the "limiting" case and say that a sample proportion is a reasonably good estimate, you abandon objectivity and substitute for it a local - subjective - view, representing your subjective experience/data/knowledge from the sample. I am sure that this is not your intention, so I am puzzled by your arguments.
Jochen Wilhelm, it is incorrect to say that the principle of falsification applied to hypotheses concerning a population is applicable only if the population is finite.
A sample proportion is bounded between 0 and 1 inclusive for any N. A limiting proportion is a limit operator applied to the sample proportion. This does not imply that, should a limiting proportion exist, it is necessarily exactly 0 or 1.
*Subjective* is a poorly defined term. Bayesians use this for argumentation because it can mean whatever they want it to mean. Sticking with falsifiable, I can say that a sample proportion is not only falsifiable, it is verifiable since it can be directly observed. Using *subjective* as a synonym for unfalsifiable, there is nothing subjective or unfalsifiable about a sample proportion.
"A sample proportion is bounded between 0 and 1 inclusive for any N", yes, for any finite N. That does not help in setting up an objective definition of a p-value.
"A limiting proportion is a limit operator applied to the sample proportion.", yes, but the result is mathematically not deducible. The series of proportions from sampling is not a mathematical function. What do you mean with "should a limiting proportion exist"? I thought this is a prerequisite to the objective frequentist definition of probability?
Jochen Wilhelm, a prerequisite that can be empirically investigated and, should there be enough evidence against it, tentatively rejected. That is what I mean by, "should a limiting proportion exist." This is what makes the frequentist paradigm scientific - we have the capacity to gather empirical evidence against it.
0 le n[x] le N for all N => 0 le n[x]/N le 1 for all N. Proof by induction. The word *objective* is also a poorly defined term used by Bayesians to mean whatever they want it to mean. Using *objective* as a synonym for falsifiable, the limiting proportion of events over repeated sampling leads to a falsifiable or objective definition of a p-value.
0 le n[x]/N le 1 for all N is undoubtedly correct, also for N being infinite. But for N being infinite, n[x]/N is either 0 or 1 (which satisfies the inequality), but does not allow any value in between.
Jochen Wilhelm, you can't simply state that all limiting proportions of events in repeated sampling are unequivocally exactly 0 or 1. You can only state that, if the limiting proportion exists in a particular setting, it is somewhere between 0 and 1 inclusive. You can then offer a hypothesis and investigate it using finite sampling.
Since you are adamant about the hypothesis that all limiting proportions are exactly 0 or 1, you could empirically investigate this claim in a particular setting.
Best of luck!
Jochen Wilhelm, you are equivocating again on the word falsifiable, confusing it for verifiable. No one is arguing with you that a limiting proportion of events over repeated sampling is unverifiable.
It's about your claim that "it is the falsifiability of the frequentist definition of probability that makes it scientific and valid as evidence."
Ok, back to the origin of the definition of "falsifiable". Karl Popper wrote in his 1934 book "Logik der Forschung":
A theory is to be called 'empirical' or 'falsifiable' if it divides the class of all possible basic statements unambiguously into the following two nonempty subclasses. First, the class of all those basic statements with which it is inconsistent (or which it rules out, or prohibits): we call this the class of the potential falsifiers of the theory; and secondly, the class of those basic statements which it does not contradict (or which it 'permits'). We can put this more briefly by saying: a theory is falsifiable if the class of its potential falsifiers is not empty.
Note: it's about a theory. The theory in question is the definition of the p-value, afaik. It says: the probability of an event is a limiting frequency of the event being observed in a repeated series of "trials".
What is the "class of its potential falsifiers" in this case? No finite sequence ever would prohibit the limiting frequency having any arbitrary value (and I still don't understand how the limiting frequency can have a value different than either 0 or 1). Something rhat requires an infinite series of observations isn't an empirical (that is, according to Popper: falsifiable) method.
Jochen Wilhelm, we can theoretically determine the rate of convergence of the sample proportion, were a limiting proportion to indeed exist. If, in a particular application of the theory, we take a long but finite sequence of repeated samples, and the sample proportion as a sequence of sample size dances wildly such that we cannot identify a single hypothesis to which the sample proportion appears to be converging at the rate theorized, this would be an observation *inconsistent* with the notion that a limiting proportion exists for this application. It's not proof, just evidence. It's enough of an empirical observation that we can make a tentative evidence-based decision to behave as though a limiting proportion does not exist. A decision subject to error. We can even quantify how often (in the limit!) we would make a type I error under the hypothesis that a limiting proportion does exist.
If, as you hypothesize, a limiting proportion of events over repeated sampling necessarily converges to either 0 or to 1 regardless of the application, then two things: 1) a limiting proportion exists!, but... 2) the strong law (or frequentist strong axiom) of large numbers is a bunch of crap, implying that even finite sampling is meaningless (what's the point?), so that both frequentist and Bayesian significance testing based on repeated sampling have no value. Importantly, yours is a hypothesis that, in a particular application, can be empirically investigated!
"Can I say that the probability of making wrong conclusion is lower when rejecting the null hypothesis at a p-value of 0.001 than at 0.045? In other words, can the probability of accurate conclusion be gauged by the value itself of the p-value ?"
The conclusion of your study should be made by considering the effect size and some descriptive statistics, and be based on your domain knowledge. A p-value cannot tell you the probability of making an accurate conclusion. At best, a p-value can be considered as an indicator of the reliability of the effect size. But the signal-to-noise ratio is a simpler and more direct measure of the reliability of an effect size than a p-value.
Hening Huang, huh? The p-value is a calibration of the signal to noise ratio. That a result is, say, 1.5 standard deviations away from the mean of all possible repeated experiments is still application specific. It says nothing about operating characteristics. Real confidence comes from understanding performance.
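For what it's worth, a closing R sketch of that calibration (all numbers arbitrary): with the degrees of freedom fixed, the two-sided one-sample p-value is a monotone transformation of the signal-to-noise ratio |t| = estimate / standard error.
df  <- 19                     # e.g. a one-sample t-test with n = 20
snr <- seq(0, 5, by = 0.5)    # signal-to-noise ratio |t|
p   <- 2 * pt(-snr, df)       # two-sided p-value as a function of |t|
round(cbind(snr, p), 4)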