A hypothetical example (hehe, hypothesis): assume we have enough observations and apply both a "frequentist" and a "Bayesian" model (e.g. a linear model with Gaussian error distribution, with an uninformative prior for the Bayesian version to keep it vague). We look at the intervals, and both models result in the same intervals. Then, following [1], we can say it is similar* to suggest that the population estimate fell between those limits, if we know the intervals are similar. Are both then equally "wrong"? And do they actually quantify uncertainty, as both "want" to do? Or am I wrong, since they really seem to want to make probabilistic statements about the population from the data, albeit as P(data|estimate) versus P(estimate|data). Hence, the data are certain and the estimates are based on the data, so it seems certain that the estimate (assuming a perfectly sampled population and that this description makes sense) might approximate the population value, i.e. might take on a specified value (note that the confidence and credible intervals have converged). Again, the data are certain; what is uncertain is what is not in the data. I am just curious what more statistically educated people think of this and how they would communicate it, as this seems hardly discussed (or it is my ignorance).
Thank you in advance for your input.
*Not their words. I just remember a part from the text.
[1] Article The Fallacy of Placing Confidence in Confidence Intervals
I would not count myself among the targeted group of "more statistically educated people", but I will nevertheless participate in this discussion :)
If you use the same probability model for the response and the same structural model (so that the coefficients have the same meaning) in both the frequentist and the Bayesian analysis, and if you use a flat prior, then the *limits* of the "typical" (1-a) confidence interval are always identical to the *limits* of the "typical" (1-a) credible interval. "Typical" means that the intervals are central, leaving a/2 confidence or credibility on either side. So this is not related to having "enough observations". The sample size is relevant only when the prior is "informed". Such prior information is "overridden" or "overruled" by the information from the data in large samples, so that the limits converge with increasing sample size (if the estimators are consistent).
That it "is certain the estimate might approximate the population" is a consequence of the consitency.
So let's take the case that you have a confidence interval and a credible interval with identical limits. Then they still have an entirely different interpretation:
The confidence interval says that all possible estimates outside the interval are deemed statistically incompatible with the (certain) observed data. It stands for itself. It is a random interval (RI) that is derived from the probability distribution of the random variable (RV) that models the response, and from the sample size (observations are realizations of the RV; observed confidence intervals are realizations of the RI, which is a function of the RV that returns two limits). A new sample would give a new confidence interval, and this would have different limits. It may not even overlap with the current confidence interval.
The credibility interval says that you assign (1-a) probability to the event that the population value is inside this interval. A new sample would add information to your knowledge about the population value (forcing you to *update* the probability distribution assigned to that parameter). There are no two credible intervals - there is only one, and this (always) is based on everything you can reasonably know about the population value.
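A minimal R sketch (an added illustration, not from the original exchange) of the "random interval" reading above: repeated samples from an assumed N(3, 6) population give different realized intervals, and in the long run about (1-a) of them capture µ; the choices n = 20 and 10000 repetitions are arbitrary.

```r
# The confidence interval as a random interval: each new sample gives a new
# realized interval; the capture rate, not any single interval, is 95%.
set.seed(42)
mu <- 3; sigma <- 6; n <- 20
covers <- replicate(10000, {
  x  <- rnorm(n, mu, sigma)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})
mean(covers)  # close to 0.95 -- a property of the procedure, not of one interval
```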
Not sure if I touched your question...
Perhaps the more correct question is as follows
Confidence intervals and credible intervals?
Look at the pictures and the output. They are based on the same data.
Jochen Wilhelm Not really sure if I understand the question myself; it is always about asking a question and seeing if the answer removes any confusion.
I don't understand: "So this is not related to having 'enough observations'." So confidence and credible intervals converge when using a flat prior, but this is unrelated to the number of observations? But I noticed this happens when the sample size is "large" (e.g. > 100), and when it is "small" (e.g. 7) the credible intervals are very wide. Or is this incorrect (perhaps I created an improper model)?
"is a consequence of the consistency." What do you mean by consistency?
Although I understand the process/idea/meaning of confidence intervals, it is difficult to put this into words. One simple way for me to describe it would be: data = data and estimate = meanA-meanB. The procedure or model is P(), although data|meanA-meanB is often written as P(data|estimate). I feel it is more understandable that P() is the model. Then P() could be replaced by variability representing a model assumption, i.e. Gaussian. Then the confidence intervals describe the (Gaussian) variability of the data given that we would estimate meanA-meanB, or P(data|meanA-meanB). Does this make sense?
I am also confused by [1], which says: "We believe any author who chooses to use confidence intervals should ensure that the intervals correspond numerically with credible intervals under some reasonable prior." But they are confidence intervals, not credible intervals. This makes it seem we all need to go Bayesian and not use confidence intervals? It is also confusing that they say "Many confidence intervals cannot be so interpreted, but if the authors know they can be, they should be called “credible intervals”." I would say yes to P(data|estimate), but does it need to be P(estimate|data)? Any thoughts on this?
Anyhow, thank you for the response. Too much confusion, too many questions (too much chaos).
[1] Article The Fallacy of Placing Confidence in Confidence Intervals
1) independence of sample size:
consider a simple case of one sample of values x1, x2, ... xn. These values are assumed to be realizations of i.i.d. random variables X1, X2, ... Xn, say with a normal distribution. As they are i.i.d., they all have the same expectation µ and the same variance σ². We know the sample values, but we know neither µ nor σ².
The likelihood is a two-dimensional random function, one dimension for µ and one for σ². The observed likelihood is one realization of this random function and it is obtained from the observed sample. The maximum of this function is the combination of values for µ and σ² for which the observed data is most likely. This is the maximum likelihood estimate. For µ this happens to be the sample mean (the value is independent of σ²).
A significance test for the sample mean is based on the likelihood ratio. Since we are interested only in µ, the profile likelihood is used (σ² is "profiled out"), which is a one-dimensional function that has the shape of the density of a t-distribution. The set of hypothesized values for which the likelihood-ratio test gives p > a is the (1-a) confidence interval. The limits of this interval can be obtained from the quantiles of a t-distribution.
The Bayesian credible interval is obtained from the posterior distribution, which is the scaled product of the prior and the likelihood (scaled to make the resulting density integrate to 1). The likelihood is the very same as above. The multiplication with a constant prior does not change the shape, and scaling results in a t-distribution. The limits of the credible interval are taken from the same t-distribution as the limits of the confidence interval.
This does not change with the sample size.
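A small R sketch of this point (an added illustration, not part of the original post): it uses the prior p(µ, σ²) ∝ 1/σ², under which the marginal posterior of µ is the scaled, shifted t-distribution described above (whether this prior should be called "flat" or "Jeffreys" is debated later in this thread). The population N(3, 6) and the two sample sizes are arbitrary choices.

```r
# Frequentist t-interval vs. Monte Carlo credible interval under the prior
# p(mu, sigma^2) ~ 1/sigma^2, for a small and a large sample.
set.seed(1)
for (n in c(7, 100)) {
  x  <- rnorm(n, mean = 3, sd = 6)
  m  <- mean(x); s2 <- var(x)

  # frequentist 95% confidence interval from the t-distribution
  ci <- m + qt(c(.025, .975), df = n - 1) * sqrt(s2 / n)

  # posterior draws: sigma^2 | x ~ (n-1) s^2 / chi^2_{n-1},
  #                  mu | sigma^2, x ~ N(m, sigma^2 / n)
  sigma2 <- (n - 1) * s2 / rchisq(1e5, df = n - 1)
  mu     <- rnorm(1e5, m, sqrt(sigma2 / n))
  cri    <- quantile(mu, c(.025, .975))

  cat("n =", n, "\n"); print(rbind(confidence = ci, credible = cri))
}
# The limits agree (up to Monte Carlo error) for small and large n alike.
```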
2) consistency:
This is one of the requirements of "good estimators" (besides unbiasedness, sufficiency and efficiency). In short, for a consistent estimator, the probability that the estimate is further away from µ than some finite value d approaches zero as n approaches infinity. See e.g. here:
https://en.wikipedia.org/wiki/Consistent_estimator
https://personal.utdallas.edu/~scniu/OPRE-6301/documents/Estimation.pdf
https://eml.berkeley.edu/~mcfadden/e240a_sp00/ch6.pdf
https://people.math.umass.edu/~daeyoung/Stat516/Chapter9.pdf
3) what follows:
I must admit that I cannot follow you here.
4) The Fallacy of Placing Confidence in Confidence Intervals:
I don't know what the authors believe and why they believe that.
To second Jochen's answer and put it a little bit differently: both confidence and credible intervals incorporate the likelihood p(data|model). Where the frequentist approach uses the likelihood as the basis for calculating the confidence interval, in the Bayesian approach it is additionally multiplied by the prior. If the prior is flat, then all values of the likelihood are just weighted with the same value, like a scalar (in contrast to informed priors). Therefore, both approaches come to the same conclusion (within the range of simulation precision when using MCMC).
It is similar with the sample size. The prior can be seen as a weighting, but with increasing sample size the weight of the prior gets relatively smaller and less important (the weight of the likelihood increases). If the sample size is large enough, the influence of the prior becomes negligible and we end up with (nearly) the likelihood itself again (see the sketch below).
Does that sound plausible?
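A toy grid-approximation sketch of the weighting idea (an added illustration; σ is assumed known and all the example values are arbitrary):

```r
# Posterior = normalized (likelihood x prior) on a grid of candidate means.
set.seed(7)
sigma <- 2
grid  <- seq(-5, 10, length.out = 2000)

posterior <- function(x, prior) {
  loglik <- sapply(grid, function(m) sum(dnorm(x, m, sigma, log = TRUE)))
  post   <- exp(loglik - max(loglik)) * prior   # likelihood weighted by prior
  post / sum(post)                              # normalize on the grid
}

flat     <- rep(1, length(grid))                # every value weighted equally
informed <- dnorm(grid, mean = 0, sd = 0.5)     # fairly tight prior around 0

for (n in c(7, 100)) {
  x <- rnorm(n, mean = 3, sd = sigma)
  cat("n =", n,
      " flat-prior posterior mean:", round(sum(grid * posterior(x, flat)), 2),
      " informed-prior posterior mean:", round(sum(grid * posterior(x, informed)), 2), "\n")
}
# With n = 7 the informed prior pulls the posterior towards 0; with n = 100
# the likelihood dominates and both posteriors nearly coincide.
```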
You mean a flat prior is simply multiplying by 1, basically the Bayesian bootstrap. I get it now; different perspective, I guess. But sometimes I read about people going the Bayesian approach and then setting a flat prior. Yet I don't understand why ..., what is the benefit, from a pragmatic point of view?
I don't know a good reason for a flat prior.... Weakly informed priors can be very useful, and as Jochen said, one advantage of credible intervals is to incorporate information that you already have, which is reflected in the prior.
The paper you mention
Article The Fallacy of Placing Confidence in Confidence Intervals
Is a very bad paper.
See my comments on my site in RG
The paper: "The Fallacy of Placing Confidence in Confidence Intervals" has received 456 citations since its publication in 2016 according to Google. I think the paper clarifies the confusion about the concept of confidence intervals. The authors of the paper are prompting Bayesian credible interval. However, credible interval has its own problems.
Using “Bayes” for estimation of the mean of a sample and its “Credibility Interval” is like inventing a PRE-sample of given sample size and mean and variance, to be “combined” with the real sample…
@ Hening Huang
Do you think that citations are an index of “Good Quality” of a paper?
See my comments in my site in RG
With them I increased the citations….
@ Jochen Wilhelm
Your statements:
are misleading young (and old) researchers…
Confidence interval (CI) is a procedure, or random interval, or interval estimator. An observed CI is an estimate of the interval estimator or a realization of the CI procedure. Confidence level is a property of the CI procedure. It is meaningless for an observed CI. CI is nothing but a procedure to generate a collection of CI "sticks" that capture the true value (parameter) at a specified rate, known as "capture rate", i.e. confidence level. CI is not a procedure to estimate the parameter. CI does not conform to the classical point estimation of parameters.
Jochen Wilhelm
Regarding the derivation of Bayesian t credible interval, I think the Jeffreys prior 1/sigma is needed in the objective Bayesian approach. Any other non-informative priors such as flat priors will not result in the scaled and shifted t-distribution. However, there is a debate on the validity of using the Jeffreys non-informative priors among Bayesians. D’Agostini (1998), a leading proponent of Bayesian methods in particle physics, argued that “…it is rarely the case that in physical situations the status of prior knowledge is equivalent to that expressed by the Jeffreys priors, …” He also stated, “The default use of Jeffreys priors is clearly unjustified, especially in inferring the parameters of the normal distribution, ….” Therefore, the validity of the Bayesian t credible interval is questionable. On the other hand, the validity of the frequentist t-interval (a confidence interval procedure) is also questionable because it is affected by the t-transformation distortion (Huang 2018).
D’Agostini G 1998 Jeffreys priors versus experienced physicist priors: arguments against objective Bayesian theory Proceedings of the 6th Valencia International Meeting on Bayesian Statistics (Alcossebre, Spain, May 30th-June 4th)
Huang H 2018 Uncertainty estimation with a small number of measurements, Part I: new insights on the t-interval method and its limitations Measurement Science and Technology 29 https://doi.org/10.1088/1361-6501/aa96c7
Hening Huang , I don't understand the concept of or the need for "objective" Bayes. I also think that the word "non-informative prior" is a misnomer or just nonsensical. It's the whole purpose of Bayesian analysis to use and value information (about a parameter). In contrast, frequentist methods use prior information only about the data and are entirely "non-informative" about the parameter.
Jeffreys' prior is invariant under particular transformations of the variable. But it remains unclear whether this is any real advantage (why should we claim that the same data provide the same amount of knowledge about parameters that represent differently transformed variables?).
It's possible that I have weirdly wrong views on this, so please feel free to correct me.
@ Hening Huang
I need you, IF you want (can), to explain your statements to me:
In my opinion, you confuse a “Probability Interval” (which is a “Random Interval” that comprises the parameter to be estimated with a stated probability) with a “Confidence Interval”, which is a numerical interval computed from the estimate of the parameter…
Please, reply to the question: Do you think that citations are an index of “Good Quality” of a paper?
Jochen Wilhelm
I personally don't like the concept of "non-informative prior". However, Bayesian statistics is often divided into two approaches: objective and subjective. In particular, the scaled and shifted t-distribution can only be obtained by the objective Bayesian approach with the Jeffreys prior 1/sigma (I tested two other priors numerically). The use of the Jeffreys prior 1/sigma is actually subjective, so it is confusing to call it the objective Bayesian approach. In principle, a prior is used to modulate the likelihood, or vice versa, to generate a "reasonable or meaningful" posterior distribution. Without the modulation of the Jeffreys prior 1/sigma, the marginal posterior distribution of the location parameter will be different from the scaled and shifted t-distribution, which people may not like.
Massimo Sivo
Generally speaking, papers with high citation rates are considered high-quality. Think about it: if not, why do people cite them? But there will always be exceptions.
By definition, a probability interval (PI) is an interval with fixed bounds. A confidence interval (CI) is an interval with random bounds. Please refer to Willink (2012) for the difference between PI and CI.
Regarding your request for an explanation of my statements, please refer to the paper "The Fallacy of Placing Confidence in Confidence Intervals" and my paper (Huang 2020). If you don't have access to my paper, you can send a request and I will send you a private copy.
Huang 2020 Comparison of three approaches for computing measurement uncertainties Measurement 163 https://doi.org/10.1016/j.measurement.2020.107923
Willink R 2012 Confidence intervals and other statistical intervals in metrology Int. J. Metrol. Qual. Eng. 3 169–178 DOI: 10.1051/ijmqe/2012029
@ Hening Huang
I think that you could read several papers of my professor Fausto Galetto and see how many wrong papers (and authors) he cites: so he increases the citations!
@ Hening Huang
I cannot download your paper.
Can you send me a copy?
Thank you.
@ Hening Huang
I read the paper you suggested…
Apply the SAME rules (of Willink) to the following case:
Fix 90% for the Probability Interval.
Do you get the Confidence Interval for µ [MU] of 272.850–588.123?
I am unable to generate a link to my uploaded document
You can find it in my document
Just an observation. If credible intervals (CRI) with a flat prior are similar to confidence intervals (CI), and Bayesian (B) statistics seem to be advocated over frequentist* (F) methods, then the pragmatic distinction from a practitioner's perspective is not extremely apparent (e.g. CRI -0.03 to 0.1 and CI 0.020 to 0.12), given a specified prior. Are statisticians/mathematicians aware that a lot of confusion is created by claiming that B is "better" than F, or F than B, instead of recognizing and explaining that both have their pros and cons, that the likelihood is used by B as well as F, and indicating their meaning? For example, Article Editorial
excludes both p-values and CIs. To make a very extreme statement: suppose a flat prior is presented as reasonable given expert judgment and the literature (an assumption someone could market perfectly well past a reviewer). Then, applying a t-test and suggesting this is a B approach, there is no distinction; there will be no difference between the CRI and the CI. To me it feels "wrong" to then somehow "ban" the one and not explain to the practitioner why, since F approaches have B counterparts and vice versa. It then becomes an issue of communication that F is wrong** and B is correct. There are a lot of articles addressing this (at least in part), but they seem to bypass the practitioners. Any ideas/opinions (not sure what I am asking, perhaps a friendly comment or personal idea/observation)? *Perhaps a capital "F" is needed here as well. It is just confusing that we need to credit the reverend Thomas Bayes over Frequentism, which seems like marketing, or make a dichotomous distinction between F and B, as this seems to be the case.
**Literally a dichotomous statement of wrong/right, which B is supposed to prevent. To me this seems contradictory to the philosophy marketed for B over F. This (I feel) is how most of the literature is advocated and interpreted, while I know this is not necessarily the case and perhaps not the intention. Hence the confusion of practitioners.
1) What do you mean by "As all F approaches have a B approach and vice versa"? It is true that both incorporate the likelihood, but that's it.
2) I would not make a distinction "right" vs "wrong"; both are right(!) but have different implications. Therefore, in my opinion the problem is not with the methods, but with the education about them. Typically, practitioners interpret F CIs as credible intervals, which is strictly speaking wrong (except under specific circumstances, as mentioned above). Everything is fine with F statistics, but not with how it is taught. For example, CIs are often indirectly interpreted as a distribution within the CI bounds (e.g. like a normal distribution between the bounds, making the midpoint the most probable value), but they are really only bounds indicating where you would declare something as significant or not. Within the B framework you have a distribution within the CRI bounds, which may also be heavily skewed, and which really indicates the probability of the possible values (where the mode or median might then be a better representation than the midpoint or mean). This is apparent since the CRI is just a superordinate concept and there are different types of CRIs, like percentile CRIs or highest density intervals (HDI). For symmetric distributions they will converge, but for skewed ones they may differ (see the sketch below).
I am also just a practitioner and not a statistician, but I find the Bayesian distributional approach (like Kruschke or McElreath) very appealing, since it tells you something about the data and their parameters themselves and not only something about the model probability (in contrast to the Bayes factor fraction, which I do not like). There is not THE B statistics; there are different "flavours", so to speak.
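A small R sketch of the percentile-CRI vs. HDI point above (an added illustration; the Gamma(2, 1) draws merely stand in for a skewed posterior):

```r
# Percentile credible interval vs. highest-density interval for skewed draws.
set.seed(3)
draws <- rgamma(1e5, shape = 2, rate = 1)      # right-skewed "posterior"

percentile_cri <- quantile(draws, c(.025, .975))

hdi <- function(x, mass = 0.95) {              # shortest interval containing `mass`
  x <- sort(x)
  k <- floor(mass * length(x))
  widths <- x[(k + 1):length(x)] - x[1:(length(x) - k)]
  i <- which.min(widths)
  c(x[i], x[i + k])
}

rbind(percentile = percentile_cri, hdi = hdi(draws))
# For a skewed posterior the two intervals differ; for symmetric posteriors
# they coincide.
```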
1) I meant that an F method has a counterpart that can/will result in the same/similar results as B and vice versa (e.g. the CRI with a flat prior and the CI of a t-test). I understand that the meaning is different, but 1=1 from a pragmatic point of view. So Article Editorial
is not very helpful, although it is not meant to be, I guess; it has some moral intention to prevent misconceptions about results. By intentions I mean that they favor B over F, as B addresses P(estimate|data), whereas F addresses P(data|estimate). The results after this decision are only P(estimate|data), which prevents the misinterpretation where results from an F model are perceived as P(estimate|data). 2) "I would not make a distinction 'right' vs 'wrong'; both are right(!) but have different implications." I agree, this is exactly what I mean; that's why the quotation marks. I would say they give answers to different questions, but you need to find the correct method that fits the question. p-hacking turns this around: it searches for the right hypothesis (here incorrectly exchanged for "question") that fits the method. It formulates the hypothesis after all possible models and combinations have been tried and the null model has been "rejected" (note the quotation marks).
On the other hand, if both are "right" it is okay to say both are "wrong" (depending on your flavour, i.e. the glass is half full or half empty). As George Box seems to have suggested, "All models are wrong, but some are useful"*. Yet if both are "right" (or "wrong") it is no longer meaningful to suggest that a method (model/concept/idea) is "right", since if they are all "right" it no longer matters which method to apply. Of course, this is not how it works, nor how it was intended. It would be very useful to explain what a method does and what it means, which question it answers (quantifies), and why it is "wrong" (let's use the word naive instead), instead of suggesting which method to apply and which statement can be made based on the results.
"Therefore, in my opinion the problem is not with the methods, but with the education about them." This is what I think as well, but I feel out of line suggesting so.
Hope all above makes sense.
*Actually, what I could find was: "Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration." and "Since all models are wrong the scientist must be alert to what is importantly wrong." 10.1080/01621459.1976.10480949
Flat priors do NOT ALWAYS make the Confidence Intervals equal to the Credibility Intervals…
I am not an expert in Psychology, so I cannot discuss the BASP decision of banning NHSTP (and other statistical “measures”) in Psychology papers.
From my point of view the last statements of the EDITORIAL make very little sense in real (outside of Psychology) applications.
See the discussion
Wim Kaijser ,
"...result in the same/similar results...": "F" and "B" cannot just be interchanged. Although under some circumstances the limits are numerically similar or even identical, they have a very different meaning. You wouldn't say that 5.3 apples are similar or equivalent or the same as 5.3 seconds, would you?
It's also not the question if one should prefer F over B or vice versa. When time is relevant, I should ask for a time, and when the number of apples is relevant, I should ask for the number of apples. It depends on the task what question to ask. Given a task, I cannot just choose what kind of question I ask (and hence what kind of answer I get).
Maybe you can also use the following analogy: there is a thunderstorm and lightning will strike the earth. We don't know where. "F" asks for the probability that the strike will be sufficiently far away from his house, whereas "B" asks for the actual place where it will come down.
"B addresses P(estimate|Data), whereby F addresses P(Data|estimate)": this is slightly confusing. This has all nothing to do with "estimate". You should substitute "estimae" with "hypothesis" (about the value of a random variable or a parameter in a statistical model). The estimate is just a summary statistic of the data, so these two are equivalent (P(hypothesis|data) is equivalent to P(hypothesis|estimate)).
"feel out of line suggesting so": no, you are not. Statisticians warn since decades and suggest to improve the education. But in applied fields (life sciences, econometrics, etc). these warnings are largely ignored. Same with the "fickle p values" that grew to the monster the are today in applied sciences, despite the clear statements of statisticians. I think one reason is that these statements are not (ore barely) understood without mathematical training, and I think that there is a lack of education on both sides: statisticians need to learn to communicate things in a way that is easier to understand without a high math skills and empirical scientists need to get better math skills. My hope is that they can meet somwhere in the middle.
One more note: There is a lot of literature about "flat priors", "non-informative priors" and "objective Bayesian" stuff. I consider this nonsense. The Bayesian approach is a coherent, objective method to update knowledge. There is no point in claiming that knowledge is, could, or should be "objective" (we may agree on a kind of inter-subjective knowledge). And it is contingent on models (made-up (!) descriptions that we try to get in line with as many observations as we can make; you cited G. Box in this regard). I think it is a failure to understand this concept if people just try to avoid using their knowledge (or the knowledge agreed upon in the field or in the scientific community) in a Bayesian analysis. And trying to get a meaningful posterior without a meaningful prior is like trying to get a meaningful ratio when the denominator is zero (you wouldn't believe how many biologists try to do this!).
"...the limits are numerically similar or even identical." This is what I mean. I am not trying to compare apples with pears, although some pears are tastier than some apples and vice versa.
"It's also not the question if one should prefer F over B or vice versa." Yes, but let's say you have just been introduced to stats; the somewhat hidden or open messages in the literature (I feel) feed into the confusion. For example: p < .05 (where the ASA suggests it is arbitrary, but some psychology studies suggest people tend to go "naturally" for 1/20), [1] (September 30), use CIs because "they are `better` than p-values" (I think that comes from Cumming), P(data|stuff) vs P(stuff|data), data needs to be normally distributed, etc. etc. Hence, there is not necessarily a single source that briefly addresses these things, although the book by Andy Field has very nice parts on this.
"Maybe you can also use the following analogy" works for me. Although most of the time my questions are not really focused on H0, but on the estimate, e.g. the regression coefficient (RC) is 1 (0.5-1.5). I am not really interested in how far away the RC (strike) is from H0 (my house), or how probable the RC (strike) is. I am interested in whether y~ax+b actually works, whether this makes sense given other literature, and how and if I can apply it. I am interested in what might be a reasonable RC, looking at the CRI or CI and how wide they are. In both cases it is about the data, but I am interested in how it would perform on a completely different dataset. I am also a bit at a loss with the meaning of "hypothesis" (but this is a completely different topic). That's why I use the word estimate/effect-size/otherwise; or do you use the word estimate differently? As in a t-test you estimate the variability of meanA-meanB (the effect size)?
"I think one reason is that these statements are not (or barely) understood without mathematical training, and I think that there is a lack of education on both sides ...". Yes, I think so. But I think the meaning of some (if not most) statistics can be perfectly taught without higher-order mathematics. Simulations and visual representations of processes can be very helpful. For example, you can visualize the idea/meaning of intervals with bootstrapping and then introduce the equation, instead of bombing the person with an equation (see the sketch below). Perhaps an interest in programming (mostly puzzling) becomes very handy. You can program all your own machine learning models without any mathematics as long as you understand the meaning and process. I think there is a mismatch in that mathematics is taught as the "language" of equations, while the practitioner is often more visual or uses another "language". I think this is exactly what you mean (correct me if I am wrong). In some sense, either everyone is taught strong mathematics (impossible/improbable/unlikely) or you find a way in the middle. Not sure if you can find some resemblance with this, but I find this page very self-explanatory, and it often suggests a simulation as a start: https://www.econometrics-with-r.org/4-5-tsdotoe.html. You can also program your own simulation in R via Shiny. It is perhaps somewhat less good-looking, but it does the same.
"there is not necessarily a single source that briefly addresses these things" -- yes, because these things need some mental work. That's the problem: so many stats books and courses try to sell it as if it were easy-peasy, understood in two minutes, and all that without any concept of math. Courses like to sell easy-to-use recipes that can be taught during a 1-week course in a semester or so, instead of building up understanding, which would take a couple of semesters of regular mental commitment to the topic.
I am interested in whether y~ax+b actually works, whether this makes sense given other literature, and how and if I can apply it. -- A very valuable aim, but not (really) related to statistics. I am interested in what might be a reasonable RC -- yep, that's a Bayesian question. What can you reasonably know about the RC? In both cases it is about the data, but I am interested in how it would perform on a completely different dataset -- huh? That's something different again now. I still don't understand what you actually want.
do you use the word estimate differently -- an estimate is something calculated from data. The rule for how the calculation is performed (i.e., a mathematical function) is the estimator.
As in a t-test you estimate the variability of meanA-meanB (effect-size)? -- I don't think so. A t-test itself is not estimating something. A t-test gives the statistical significance of the difference of an estimate to a hypothesized value (for the t-test, the estimate is the sample mean or the sample mean difference, and the hypothesized value is that of the corresponding expectation [sometimes called the population mean{difference}]).
But I think the meaning of some (if not most) statistics can be perfectly taught without higher-order mathematics. Simulations and visual representations of processes can be very helpful. -- Yes, this seems to be the case. But simulations rely on drawing random samples or on bootstrapping, which again is a difficult concept to grasp. It's easy only on the surface. As soon as you try to understand what it really means, you quickly find yourself in circular references about probability. But don't get me wrong: I fully agree that simulations are very helpful. But they are not sufficient.
huh? That's something different again now. I still don't understand what you actually want. -- I think here lies the confusion in general: in which ways can we use models, and does it suit the question? Predictive, exploratory, ..., hypothesis testing, combinations, etc.?
1.) You can use a linear model (OLS) to explore the data and see if there is a strong relation between the target and predictor variables or which combinations.
2.) You can use the linear model to predict.
3.) You can use a linear model to assess the strength of the coefficient when properly standardized.
4.) You can use the linear model (I think it is the probabilistic model that estimates N(0, error)) to assess P(data|H0) (I am also not sure whether this is not a form of prediction, as somehow the error is used for prediction).
5.) Is the data okay to test a hypothesis, and was the hypothesis derived before the data was gathered and the experiment was set up?
Let's say lots of times 4 is used while actually something else was intended. Then, giving 1-4 equal probabilities of .25 (I still have no clue how to define probability) and point 5 a probability of .5, only in 0.25*0.5 = 0.125 (12.5%) of cases would it be okay to define a hypothesis and use p-values, or otherwise, to reject the data or hypothesis (depending on the flavor). In any case, it is still confusing: as for point 5, the data is often never properly gathered or obtained from an experimental setup. And I feel most questions are related to 1-3 and are more predictive and exploratory. As you said, "Do we need a safety belt in cars? - Yes, `when properly applied`." Then in the other 87.5% of cases we are not even driving cars and putting on seatbelts. The question is: is it still okay to use intervals in that other 87.5% of the cases?
"I don't think so." Yes, true. I mean, doesn't the assumption of the model (error distribution) estimate how compatible the data are with H0? Thus, based on the data and the assumption of t-distributed errors, the model estimates how variable meanA-meanB is. In this sense the t-test utilizes the model, but is not the model itself?
Yep, there is a confusion. The words used are bad, as they have different every-day meanings than they have in the context of statistics. The word "significance", for instance, is famously confused with "relevance" or "importance". And the word "mean" is used for an estimate as well as for the expectation of a random variable (aka sample mean vs. population mean). The word "population" in statistics is not the same as a finite set of entities that are sampled (although in special cases it can be reduced to that). I can go on with such examples. Almost every word used in statistics has a well-defined meaning that is different from the every-day understanding of the very same combination of letters :(
The word "estimate" is as misleading. It's something calculated from data. Not more, not less. It should certainly "stand for something", like the sample mean should give an idea about the expectation. But this is actually not assured by the fact that the sample mean is an estimate. You can use any other estimator (function) to get a value you can interpret an an estimate of the expectation. This may be the third value observed, or the smallest value, or the harmonic mean of the smallest and the largest value, or the sum of the 3 smallest values and so on. These (and any estimator you may invent) can all be interpreted as estimating the expectation (or the variance, or anything else). The interesting point is if the estimator is unbiased and efficient. It turns out that the sample mean is unbiased and the most efficient estimator of the expectation. That's why we use it. If we calculate the likelihood of the sample data, we also find the maximum of the likelihood if the assumed expectation equals the sample mean (this is, as fas as I am aware, independent of the distribution [at least for non-malignant cases and for distributions that have a finite expectation]).
So if you use the sample mean -let's denote it with "m"- as an estimate of the expectation µ, you may know that the underlying estimator is unbiased and efficient, but that does not tell you anything about how good that particular estimate is. The actual m can be arbitrarily far away from µ. You can only make statistical claims, but you can't say anything about this particular case. This is where "F" and "B" come into play:
"F": Using only the assumed distribution and the observed data, one can compare m to any hypothesized value of µ, say µ0, and quantify the probability of obtaining data that would give an estimate even further away from µ0 than the one observed assuming the data are generated with µ = µ0. This is the p-value. If this is small, you still don't know anything about µ, but you can at least say that µ < µ0 or µ > µ0 (depending on which side of µ0 is m) are too incompatible with the observed data.
"B": Using the assumed distribution and the observed data one can update a probability distribution over µ that represents what you know about the value of µ. This knowledge can still be arbitrarily wrong, but it is the best you can say about µ and it incorporated the information of the observed data in an efficient and coherent way. The posterior mode may be something that comes closest to the meaning of "estimate" in every-day parlance.
In this sense the t-test utilizes the model, but is not the model itself? - The model underlying the t-test is E(Y) = β0 + β1X, where X is a 0/1 dummy for the group and Y is a random variable with a normal distribution with expectation E(Y) and variance σ². The t-test is the likelihood-ratio test of this model against the model restricted at some hypothesized value of β1.
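A short R sketch of this equivalence (an added illustration with simulated data): the equal-variance two-sample t-test and the linear model with a 0/1 group dummy give the same t statistic and p-value for β1.

```r
# Two-sample t-test as the linear model E(Y) = b0 + b1*X with a group dummy.
set.seed(5)
g <- rep(0:1, each = 20)                 # 0/1 group indicator
y <- 5 + 2 * g + rnorm(40, sd = 3)       # toy data

t.test(y ~ g, var.equal = TRUE)          # classical equal-variance t-test
summary(lm(y ~ g))$coefficients          # same t and p for the group coefficient
```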
@ Wim Kaijser
It seems to me that you do not want to accept a simple truth:
But I never said I did not accept it. Yet acceptance leads to nihilism, i.e. if p < .05 is "important", then we can just as well stop with everything. I just try to squeeze the lemon here to get as many perspectives (knowledge) as possible and to search for similarities and dissimilarities so as to form my own view. I got a lot of useful perspectives. This thread is also not a question, but a discussion. I do not need to be convinced of (or accept) the differences between the meaning of CI and CRI, because this was already clear before; I just wanted to know what the perspectives of others are, how they give it meaning and communicate it. For example:
1.) "The confidence interval says that all possible estimates outside the interval are deemed statistically incompatible with the (certain) observed data." I find this quite a clear, straightforward explanation, without the words probability or uncertainty (you might agree or not). I would shorten it a bit to "The confidence interval indicates that all possible estimates outside it are incompatible with the certain observed data."
2.) "If the prior is flat, then all values of the likelihood are just weighted with the same value, like a scalar (in contrast to informed priors)." I never looked at it this way because I usually just write the simplest ABC rejection sampler that updates the prior (and then do not use it anyway). Although it is P(θ|D) ~ P(D|θ) * P(θ), which simply highlights that P(θ) is a weight, I missed this interpretation, which is clearly stupid of me.
Or 3.) "I don't know a good reason for a flat prior" and "It's the whole purpose of Bayesian analysis to use and value information (about a parameter)." In some sense it is quite enlightening to read someone else saying "I don't know ...", because I could not figure out the benefit and did not get why people sometimes advocate for it.
It is of no use to simply accept things or to see them as "correct" in relation to your own ideals. It is useful to find a simpler way to explain something, whether it fits or not. It is also of no use to always see things as "incorrect", or else we stay stuck trying to define them, i.e. probability, model, sickness or life. If it does not fit, we need to adjust the method, the model, or what I think I know or believe, and try to do better next time. This is the whole purpose of discussing things: to share perspectives and to change perspectives. We are searching for truth with the idea in mind that we are always "wrong" and not able to find it, but at least we get closer, although not knowing how close. Or else I have no idea what I am doing in general (which I still don't).
I would not use the word CERTAIN as you say in your point
Let’s consider this “macabre” “Gedanken Experiment”: 2 “practitioners”, in Rome, use 2 samples of mice (size=100) poisoned with food to see which poison is more lethal; they decide to “measure” the time (days) to death of the first 13 mice in each sample.
They use both the “classical” method and the Bayesian method, with flat prior: the collected data are
Sample 1: 2.304, 8.500, 9.899, 11.976, 12.578, 14.065, 14.098, 16.354, 17.512, 18.694, 20.626, 23.874, 30.559
Sample 2: 3.137, 6.542, 7.649, 8.621, 10.830, 15.404, 18.140, 18.191, 19.658, 23.180, 24.388, 24.731, 39.956
They use alpha=0.2; they draw opposite conclusions: one considers the “mortality” equivalent, while the other considers that “poison 2 (second sample) is more lethal”.
The current article Colling & Szucs (2021) Statistical reform and the replication crisis comes to a nice conclusion in my opinion, which summarizes what we already wrote:
"We do not think that the solution to the replication crisis lies in statistical reform per se. While there are undoubtedly problems with how people justify scientific inferences on the basis of statistical significance tests, these problems may lie less with the tests themselves than with the inferential systems people employ. And we have attempted to demonstrate how good inferences on the basis of statistical significance tests may be justified. We have also examined the Bayesian alternative to statistical significance tests and explored some of the benefits of the Bayesian approach. The argument for Bayesian statistics is often framed in terms of the macro level inferences that they permit and in terms of the perceived shortcomings of Frequentist statistics. However, we have argued that well-justified Frequentist inferences can often lead to the same gross conclusions. Rather, the key differences lie in their view of evidence and the role error plays in learning about the world. That is, rather than furnishing different inferences, per se, each approach provides a different kind of information that is useful for different aspects of scientific practice. Rather than mere statistical reform, what is needed is for scientists to become better at inference (both Frequentist and Bayesian) and for a better understanding of how to use inferential strategies to justify knowledge."
https://www.researchgate.net/publication/328758855_Statistical_reform_and_the_replication_crisis
I agree that
But a point is also important:
Nobody dared to find the solution……………….
@ Rainer Duesing
I read the suggested paper…
The statements
make no sense.
Apply them to the case I gave on 12 October 2021…
Massimo Sivo
To quote yourself in return to your request: "I will prove it at my will because YOU should be able to prove it." (emphasis from the original quote).
And you seem not to understand the difference between a necessary and a sufficient condition. Large samples do not guarantee valid inferences if you have chosen an inappropriate model in the first place. If the statements were wrong and sample size did not influence the power to detect effects, I would suggest using only samples of N=1, since then you would be maximally efficient.
@ Rainer Duesing
@ Ronán Michael Conroy
I am not “as good as you are” (your numbers!)…
So, please, explain to me the meaning of the statement
Thank you
Red dashed lines: SE of m; solid line: m. "N =" indicates the number of observations in a sample generated from N(3, 6). Based on the samples, the dashed blue line indicates the assumed distribution. If N = 1, m would be 6.77 while the true value is 3, overestimating by 6.77 - 3 = 3.77.
But isn't this partially the idea of the law of large numbers, although it seems to be ascribed to a large number of independent trials, or am I wrong? Yet it depends not only on the model selection, but also on the representativeness of the experiment/sampling procedure (horrible in ecological field studies or monitoring data, btw).
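A small R sketch reproducing the kind of figure described above (an added illustration; the choices N(3, 6) and the set of N values are arbitrary):

```r
# Sample mean m and its standard error for increasing N, sampling from N(3, 6).
set.seed(2)
for (N in c(1, 5, 25, 100, 400)) {
  x  <- rnorm(N, mean = 3, sd = 6)
  m  <- mean(x)
  se <- if (N > 1) sd(x) / sqrt(N) else NA   # SE undefined for a single observation
  cat("N =", N, " m =", round(m, 2), " SE =", round(se, 2), "\n")
}
# For small N the estimate m can be far from the true value 3; the SE shrinks
# roughly with 1/sqrt(N), which is the consistency idea discussed earlier.
```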
But Stefano Nembrini , if precision is unrelated to sample size, you don't need that, just take the point estimate ;-) *irony off*
@ Rainer Duesing
@ Ronán Michael Conroy
@ Stefano Nembrini
There are people “playing with words”….
Read what Duesing wrote:
HE never used the word “precision”, in the above statements……
HE actually used the word “SAMPLES”, in the above statements……not POPULATION N=1
Massimo Sivo you still have not provided evidence for why the statements from Colling & Szucs are nonsense in your opinion (as always).
But your own example from the other thread shows what they are talking about. Your own equation to calculate CIs for beta of an exponential distribution shows the relation between CI width and sample size. Maybe you are mixing up coverage and precision: whereas the former may be independent of sample size, the latter is not. And if you are trying to detect small effects, you need high precision and hence a larger sample size, ceteris paribus. In turn, if precision and sample size were unrelated, you should try to keep your sample size as small as possible, since otherwise you would waste resources. It seems apparent that this is not true.
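A short simulation sketch of the coverage-vs-precision distinction (an added illustration with toy normal data):

```r
# Coverage stays at the nominal level while precision (CI width) improves with n.
set.seed(9)
for (n in c(10, 50, 200, 1000)) {
  width <- mean(replicate(2000, diff(t.test(rnorm(n, 3, 6))$conf.int)))
  cat("n =", n, " average 95% CI width =", round(width, 2), "\n")
}
# The width shrinks roughly with 1/sqrt(n), even though the capture rate
# remains ~95% for every n.
```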
@ Rainer Duesing
@ Ronán Michael Conroy
@ Stefano Nembrini
Rainer Duesing your statement “””Massimo Sivo you still have not provided evidence for why the statements from Colling & Szucs are nonsense in your opinion (as always).“”” proves that you still do not understand what is WRONG in Colling & Szucs' statements
In the case I gave on 12 October 2021… n=sample size=100
Changing to n = sample size = 200, while keeping the same number of observed data (g=13, which is NOT the sample size), does not alter (significantly!) the “width of the Confidence Interval”!
SO
is CORRECT, while
is WRONG (in the case I gave)
HENCE, Rainer Duesing, your statement…
is WRONG (in the case I gave)
Dear Massimo, I think you are playing word games in this case. You are correct that in a Cox regression (which I assume you are referring to) it would not matter if you had 87 or 187 cases where the event has not occurred; the CI for exp(beta) would be nearly identical. But in my opinion (and please correct me if I am wrong) your example is missing a crucial point. I cannot imagine a research question where you are interested in knowing the group difference when exactly 13 cases have occurred, independent of total sample size. To what is this 13 related? Either you are typically interested in a time interval and count how many cases with events have occurred, or you may be interested in waiting until a fixed (percentage) amount of events has occurred (in this case 13%, but I am not sure if the p-value calculation would be valid with this stopping rule, maybe someone can tell).
In either case, the total sample size would change the results if the underlying data generating process were the same. In case 1) you would simply have more cases with the event in the same time interval, hence more precision, and in case 2) you would wait until more total cases with the event had occurred, hence a higher precision.
Therefore, in my opinion it is not valid to say that the sample size would not change anything, because it would not be the same underlying process. But as said above, you are right that the CI estimation would be nearly the same if you only add cases where the event has not occurred.
@ Rainer Duesing
Cox Regression? Nothing to do with my cases.
Research question as per my cases? Any time you have a “measured response” related to “time”, “cycles”, “space” …, such that the test cost is greatly related to the number of data (“measured responses”) collected: the more data you collect, the more you have to spend. The only way to reduce the cost is to “invest” (IF it is suitable…) in increasing the sample size. IF you have a penalty (you must pay… a lot of money) due to a late decision, you must invest in increasing the sample size. Increasing the sample size reduces the “time to draw a Decision”.
With n=100 and g=13, you get an “estimate” and a Confidence Interval. With an increased sample size, n=200 and g=13, the lower data can change; changing the sample size changes only the g=13 lowest data; BUT the “estimates” and the Confidence Intervals (with n=100 and g=13 AND with n=200 and g=13) are NOT statistically different, with a stated risk.
Wim Kaijser
This comment is about your previous statements and question: “Then applying a T-test and suggesting this is a B approach there is no distinction. There will be no difference in CRI as CI. …. Any ideas/opinions (not sure what I am asking, perhaps a friendly comment or personal idea/observation?).” It is true that the frequentist t-interval (CI) coincides with the Bayesian t-interval (CRI). Unfortunately, neither is correct. The frequentist t-interval (CI) is the result of distorted inference due to t-transformation distortion (Huang 2018), while the Bayesian t-interval (CRI) is the result of the objective Bayesian approach with the Jeffreys prior that is unjustified as pointed out by D’Agostini (1998). However, the problem of the t-interval has been overlooked and t-interval is taught as the truth in our universities.
As a practitioner, I have processed thousands of small samples in my work. Our customers such as the US Geological Survey and Environment Canada deal with small samples in flow measurements almost every day. We found that the t-interval or t-distribution has no place when dealing with small samples. The t-interval would give unrealistic or even paradoxical uncertainty estimates when the sample size is very small (say n
@ Hening Huang
I tried to download the paper
but I did not succeed.
So I cannot make any comment on your statements.
In any case, I am sure that your method can be applied ONLY to measurements with "Normal Distribution" (or BASED on the Central Limit Theorem)
There is a puzzle in Bayesian statistics that has been overlooked or ignored (I think). The Bayes Theorem in continuous form states that the posterior distribution (PDF) is proportional to the product of the prior distribution (PDF) and the likelihood function. That is:
posterior PDF ~ prior PDF x likelihood (1)
In the case of no prior information, a flat prior should be used according to Jaynes’ maximum entropy principle. If a flat prior is used, formula (1) becomes:
posterior PDF = standardized (likelihood function) (2)
However, formula (2) is wrong because a likelihood function is NOT a probability distribution. Fisher (1921) stated, “… probability and likelihood are quantities of an entirely different nature.” Edwards (1992) stated, “… this [likelihood] function in no sense gives rise to a statistical distribution.” Therefore, Formula (1) may be flawed. Moreover, formula (1) violates “the principle of self-consistent operation” (Huang 2020).
You will not find formula (2) in statistics textbooks; it is avoided. Instead, non-informative priors such as Jeffreys priors were invented and used. For the normal distribution model, the Jeffreys prior 1/sigma is usually used to modulate the likelihood, resulting in the scaled and shifted t-distribution. If a different non-informative prior, e.g. a flat prior for sigma or 1/sigma^2, is used, the posterior distribution will be different. So the Bayesian posterior distribution is essentially subjective even if a non-informative prior is used.
Preprint A new modified Bayesian method for measurement uncertainty a...
Hening Huang , you can understand the flat prior over the whole real line as the limiting case of -for instance!- N(0,s²) with s² -> Inf. So your standardization is a limiting case of the procedure and nothing qualitatively new or different from a multiplication with an (improper, though) PDF.
If the domain of the RV is a finite range (like a proportion, for instance), the flat PDF is that of a (possibly scaled) beta(1,1), so this is even a proper PDF.
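A numerical sketch of this limiting-case argument (an added illustration; grid approximation with σ assumed known and toy data):

```r
# A N(0, s^2) prior with huge s^2 gives (numerically) the same posterior as
# simply normalizing the likelihood.
set.seed(4)
x    <- rnorm(10, mean = 3, sd = 2)
grid <- seq(-10, 15, length.out = 4000)
lik  <- sapply(grid, function(m) prod(dnorm(x, m, 2)))

post_flat  <- lik / sum(lik)                      # "standardized likelihood"
post_vague <- lik * dnorm(grid, 0, 1e6)
post_vague <- post_vague / sum(post_vague)        # posterior with very vague normal prior

max(abs(post_flat - post_vague))                  # practically zero
```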
Hening Huang why was, or still is, there a need and desire for an "objective" Bayesian prior? I really do not get the point. Research is always subjective, but the "beauty" lies in the possibility to openly communicate the priors and compare the results of different priors if different opinions exist.
And is there a situation where you really know nothing about your data? With vaguely informed priors you may be able to reproduce the general form of your dependent variable (without considering or seeing any predictors) and even check this with prior predictive checks. But shouldn't it be a goal to do cumulative science, and so to incorporate all the information you already have? Again, why a need for "objective" priors?
Rainer Duesing
Indeed, the Bayesian inference depends on the choice of priors. However, there has been a long-standing debate about the choice of priors in the statistical community. García-Pérez (2019) showed that Bayesian analysis with informative priors (known as subjective Bayesian approach) is formally equivalent to data falsification because the information carried by the prior can be expressed as the addition of fabricated observations whose statistical characteristics are determined by the parameters of the prior. He argued that only the use of non-informative, uniform priors in all types of Bayesian analyses is compatible with standards of research integrity. However, the use of non-informative prior (e.g. Jeffreys prior 1/sigma) creates different problems as already discussed.
García-Pérez MA 2019 Bayesian estimation with informative priors is indistinguishable from data falsification The Spanish Journal of Psychology DOI: 10.1017/sjp.2019.41 https://www.cambridge.org/core/journals/spanish-journal-of-psychology/article/bayesian-estimation-with-informative-priors-is-indistinguishable-from-data-falsification/FFAB96BDC5EE3C64B144ECF8F90F31E9
@ Hening Huang
Well before that
My professor, Fausto Galetto, showed that
See one of his statements
The same happens for other distributions…
@ Hening Huang
I read your preprint, updated August 2021.
You did not take into account that the likelihood function is used for finding the estimator of the parameter.
Hence
Massimo Sivo
Indeed, as stated in the conclusion: "The proposed new modified Bayesian method is a self-consistent operation because it operates entirely on PDFs. As a result, it gives the correct inferences for the problem considered (Case 1 and Case 2): same solutions as its frequentist counterparts. In contrast, the traditional Bayesian method, i.e. the reformulated Bayes Theorem, is not a self-consistent operation because it operates on likelihood function and PDF. This is a flaw of the traditional Bayesian method. As a result, the traditional Bayesian method gives the incorrect inferences: invalid estimates of standard uncertainty (SU) in Case 1 and Case 2. A likelihood function is a distorted mirror of its probability distribution counterpart. The use of likelihood functions in Bayes Theorem is the root cause of the inherent bias of the traditional Bayesian method. However, the original Bayes Theorem, either in continuous or discrete form, is a self-consistent operation because it operates entirely on probability distributions or probabilities."