In recent times, it seems that almost all research articles (with some sort of data) validate their results with a "p-value". The question is how much we can believe these statistical values. I mean, was the research done before the emergence of the "p-value" not significant in nature?
Thank you, indeed.
The point of the p-value was to give some indication of the probability that the appropriate hypothesis was true. Before p-values we did not have that. Validating a model is something different. See this link for that discussion:
https://www.google.com/search?q=validating+a+statistical+model&rlz=1C1CHBF_enUS847US847&oq=validating+a+statistical+model&aqs=chrome..69i57j33.15859j1j7&sourceid=chrome&ie=UTF-8
Best, D. Booth
You can follow
https://www.nature.com/articles/leu2016193?draft=journal
Try reading https://amstat.tandfonline.com/doi/full/10.1080/00031305.2019.1583913#.XSyNOXmWxPZ
and a few of the articles written for this special edition.
More journals are accepting articles with "negative" results.
Stop thinking of results where p-value>0.05 as negative results.
I like Dr. Wilhelm's explanation. I have only encountered this need for p-value validation when dealing with cytokine levels or numbers of mice. It's murky waters because the term "significance" is so misused that it has no significance. Hopefully this type of thinking goes away as new scientists enter the field.
Part of the problem is with interpretation. A p-value is the probability of getting an observation as large as or larger than that observed, given that the null hypothesis is true. There is nothing in that statement about validation, proof, importance, or relevance. Validation comes when multiple researchers perform roughly the same experiment and all come to the same conclusion.
The p-value is also only an estimate. If you redid the experiment many times, you could build a 95% confidence interval about the mean p-value. With a single experiment, you have no idea where any one p-value might be within this distribution. With a good sample size the confidence interval about the p-value is fairly narrow, and one can generally accept the statement of probability at face value. However, that is still no closer to validating or proving anything.
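To make that variability concrete, here is a minimal simulation sketch in Python (the effect size, group size, and number of replications are made-up numbers, purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical setup: a true mean difference of 0.5 (in SD units),
# n = 20 per group, experiment replicated 1000 times.
true_diff, n, n_reps = 0.5, 20, 1000

p_values = []
for _ in range(n_reps):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_diff, 1.0, n)
    p_values.append(stats.ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
# The spread shows how much the p-value of a single experiment can vary.
print("median p:", np.median(p_values))
print("2.5th-97.5th percentile of p:", np.percentile(p_values, [2.5, 97.5]))
print("share of replications with p < 0.05:", np.mean(p_values < 0.05))
```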
The good news is that a brief reading of medical and dental practices from 200 years ago should rapidly convince you that there are real benefits to living now. The quality and quantity of food has likewise improved from what was available to most people at that time. Most of that improvement has come in the last 90 years (or so) and has been driven by experimental design and the analysis of the resulting data. Despite problems, this approach has resulted in considerable success, and it continues to be used extensively even though it does not prove anything.
Timothy A Ebert Right on. For those who want to replace p-values: with what do you replace them? Or do you not want to judge how likely it is that a hypothesis is true? What is science about, anyway?
David Eugene Booth I have seen people who want to replace p-values with 95% confidence intervals. However, the 95% confidence interval is no more accurate than the p-value. The upper 95% confidence limit is still an estimate. With another sample one would get a slightly different estimate. There is a distribution about the upper and lower bounds. It still does not validate or prove anything.
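A quick sketch of that point, under assumed (made-up) population values, showing that the upper confidence limit bounces around from sample to sample just like the p-value does:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical population: mean 10, SD 2; n = 25 per sample, 1000 repeated samples.
mu, sigma, n, n_reps = 10.0, 2.0, 25, 1000

upper_bounds = []
for _ in range(n_reps):
    x = rng.normal(mu, sigma, n)
    se = x.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    upper_bounds.append(x.mean() + t_crit * se)

upper_bounds = np.array(upper_bounds)
# The upper 95% confidence limit is itself an estimate with its own spread.
print("mean upper bound:", upper_bounds.mean())
print("SD of upper bound:", upper_bounds.std(ddof=1))
```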
Another alternative may be to evaluate minimum effects, equivalence, or inferiority based on the results presented, using two one-sided tests (TOST). This lets you determine whether an observed effect is really small, i.e., whether a true effect larger than the smallest effect size of interest (SESOI) can be rejected. The free jamovi program has the TOSTER module, which allows you to run this analysis in a simple way (a rough sketch of the mechanics also follows the reference below).
I recommend reading:
Equivalence Testing for Psychological Research: A Tutorial
Daniel Lakens, Anne M. Scheel, & Peder M. Isager
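For those who want to see the mechanics behind TOST outside of jamovi/TOSTER, here is a rough Python sketch (the equivalence bounds, the data, and the helper name tost_ind are all illustrative assumptions, not the TOSTER implementation):

```python
import numpy as np
from scipy import stats

def tost_ind(x, y, low, high):
    """Two one-sided tests (TOST) for equivalence of two independent means.

    low/high are the equivalence bounds for the mean difference (x - y).
    Returns the larger of the two one-sided p-values; equivalence is claimed
    when this value is below alpha.
    """
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    # Pooled standard error (Student's t, equal-variance assumption)
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    # H0a: diff <= low  (tested against diff > low)
    p_lower = stats.t.sf((diff - low) / se, df)
    # H0b: diff >= high (tested against diff < high)
    p_upper = stats.t.cdf((diff - high) / se, df)
    return max(p_lower, p_upper)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)
b = rng.normal(0.1, 1.0, 50)   # true difference smaller than the assumed SESOI of 0.5
print("TOST p-value:", tost_ind(a, b, low=-0.5, high=0.5))
```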
Cristian Ramos-Vera Why do a two-tailed test if the one-sided alternative is what you really want to test? Just like the point is not to get p
Jochen Wilhelm
If you think only in terms of R. A. Fisher and the original p-value, that is quite true. I believe that most present-day frequentist statisticians combine the Neyman-Pearson theory with the idea of a Fisher p-value, which is what I suggested, unless I have made an error somewhere. Decades after his death, R. A. Fisher no longer has the ego and force of will to prevent us from improving his ideas. BTW, Fisher never, as far as I know, accepted the Neyman-Pearson idea of an alternative hypothesis. If you have two hypotheses, one researcher may label the first as H0 and the second H1, or vice versa. I see nothing here that is problematic. Certainly the Bayesian idea of an interval is not the same thing, but I believe that I have captured the essence of how Neyman and Pearson gave us a modification of Fisher's view in both the work on hypothesis tests and estimation. BTW, I have just started reading Lehmann's book about Fisher and Neyman. I will be interested in his take, for he is certainly more of an expert in this topic than I. Best wishes, David Booth. PS: I did my work at the University of North Carolina when Senator Sam Ervin was representing it. He would often say, "Well, I'm just an old country lawyer. I don't know the finer points." Perhaps I didn't work through the theory as completely as I should have, so I will sign off by paraphrasing the Senator: I am just an old country statistician. I don't know the finer points. Again, best wishes, D.B.

Moving beyond p-values: in biology/agronomy etc. experiments, we should always keep in mind the distinction between the concepts of biological relevance and statistical significance:
https://efsa.onlinelibrary.wiley.com/doi/epdf/10.2903/j.efsa.2011.2372
How many ways can we repeat the p-value question?
How many ways can we say it is flawed, but useful?
It does not, as David Eugene Booth claims, provide an ESTIMATE OF THE PROBABILITY THAT A RESEARCH HYPOTHESIS IS TRUE.
Jochen Wilhelm
makes two interesting statements: “In the testing setting, a null hypothesis (H0) is either rejected or not.” “Hypothesis testing is not to find the probability of any hypothesis. It is useful to find out if the amount of data is sufficient to interpret the estimated effect as being below or above the tested value.” The statistical concept of hypothesis testing is not about the research, it is about data quality. The use of the term hypothesis in data quality analysis is superfluous and misleading.
Suppose we wish to know if A is taller than B. We do not assume A = B and then attempt to disprove it. We ask: if A is taller than B, how do we measure the difference to an acceptable level of certainty?
The difference in the heights of A and B is a problem of signal in noise. The signal is the difference D = A-B and the noise is the measurement uncertainty of A and B. Our estimates of A and B are A +/-a and B +/-b. We could propose a sufficiently low p-value based on a and b that is the minimum data quality to yield D +/-d. This conforms to Jochen's "It is useful to find out if the amount of data is sufficient to interpret the estimated effect as being below or above the tested value."
A null hypothesis A=B is worthless. There is no need to reject or accept something that is worthless.
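If it helps, the height example can be written down directly as a signal-to-noise calculation (a toy sketch with made-up measurement values, assuming independent errors):

```python
import math

# Hypothetical measurements: A and B with their measurement uncertainties (1 SD each).
A, a = 178.2, 0.6   # cm
B, b = 176.9, 0.8   # cm

D = A - B
# Assuming independent measurement errors, the uncertainty of the difference
# combines in quadrature.
d = math.sqrt(a**2 + b**2)

print(f"D = {D:.1f} +/- {d:.1f} cm")
# The question "is A taller than B?" becomes: is D large relative to d?
print("signal-to-noise ratio D/d =", round(D / d, 2))
```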
Statistics without probability. No dependence on the axioms of probability.
Www.statistics.sastram.com
"Statistics Without Probability" on ResearchGate.
I am late to this party again, but I would suggest that the straw man Dr. Alvarez considers does not arise in research, for there we make multiple measurements of A and B and wish to take all of the measurements into account. We thus have a problem in statistics because we rarely observe the same value twice. I thus reject Alvarez's solution and recommend consulting one of my favorite books: https://www.amazon.com/Statistics-Engineering-Sciences-William-Mendenhall/dp/0023805811/ref=sr_1_2?keywords=Mendenhall+and+Sicich+Statistics+for+Scients+and+Engineers&qid=1574604437&s=books&sr=1-2-spell
Best wishes to all, David Booth
David Eugene Booth explain the straw man accusation that leads you to believe I advocated taking only one measure of A and B. What is the solution I proposed that you reject?
As Jochen Wilhelm
stated, no one doing real research uses NP. P-values do not provide an ESTIMATE OF THE PROBABILITY THAT A RESEARCH HYPOTHESIS IS TRUE. They give an estimate of data quality and sample size sufficiency.
From my point of view, I can say that even when the p-value is as small as you like, there is still a very, very small probability of accepting your hypothesis. So, because most applied statistics uses this measure, it has become popular, and most non-statisticians ask you to find it and to use it. Therefore, in the social sciences some may use 5%, but others accept a 20% significance level.
Joseph L Alvarez Since A and B were undefined I made a common assumption. Normally if A has more than 1 member then such is reported. Scientists don't like guessing. I will let Jochen Wilhelm
discuss anything he would like to discuss; however, I would refer you to the management, economics, accounting, and finance literature, where the Neyman-Pearson paradigm is used. I will avoid commenting on what real research might be. Further, consider this link: https://www.google.com/search?q=Where+is+the+Neyman-Pearson+paradigm+used&rlz=1C1CHBF_enUS874US874&oq=Where&aqs=chrome.0.69i59j69i57j0l4.5888j0j7&sourceid=chrome&ie=UTF-8 Note that the first entry gives an important application of the N-P paradigm to radar. Is this real research? I will let the reader judge. Disclaimer: R. K. Ritt, who made contributions to radar research, was one of my graduate school mentors. The point I advocated above, that the N-P and Fisherian approaches are complementary, is described in the attached work by K. N. Berk. Disclaimer: K. N. Berk was one of my research advisors in graduate school. This paper lays the foundation for my own belief that the N-P and Fisherian approaches should be considered complementary. Next I have a paper by Erich Lehmann that discusses whether or not the two paradigms should be considered as one or two. Please read it. Lehmann, an outstanding statistician in his own right, was a student of Neyman. For those of you who would like an up-to-date look at Lehmann's views, see this link: https://www.google.com/search?rlz=1C1CHBF_enUS874US874&q=Lehman+fisher+and+Neyman&nirf=Lehmann+fisher+and+Neyman&sa=X&ved=2ahUKEwiBhM_KkYTmAhXLl54KHYKqB-cQ8BYoAXoECBQQJw&biw=1584&bih=812 This link also gives other people's views on N-P as being complementary to Fisher. I have to admit that I have not finished reading Lehmann's present book. Based on the arguments presented by a number of people, the complementarity seems reasonable. Further, as Berk remarks in his paper, there is a relationship between p, alpha, and whether or not a research hypothesis is rejected. You may then decide whether or not, as Alvarez suggests, p-values are only an estimate of data quality and sample size sufficiency. After reading and commenting on the above, another straw man has been laid to rest. RIP
David Eugene Booth You do not appear to know the difference between a straw man and a scare crow.
Joseph L Alvarez Dear sir, probably I don't but I do know the literature and material in my last post. If you would care to comment on that I would be happy to read it. Otherwise please enjoy a Happy Thanksgiving. David Booth
This is a nice and recent article on p-values and more reasons to exercise caution in interpreting p-values.
Article The reign of the p-value is over: What alternative analyses ...
David Eugene Booth, a couple of comments,
Miky Timothy Spare me. If you use an H0 and H1, that is N-P. Nothing fishy here. By the way, I haven't heard that any biostatisticians were having difficulty.
see: https://www.springer.com/gp/book/9789811086267 ,
Dive into Chapter 8; that is the combined approach, which is what 98% of competent researchers use whether they know it or not. Check the lit. There is a full ocean of the combined approach. Dive right in.
Hardly the bad guy, Jochen Wilhelm
. NHST is gibberish and not what is done in real research. H0 is also gibberish the way it is often used. This is not a research practice. It is a reporting practice. Nevertheless, as a reporting practice it is improperly used. What is often missed is Fisher's admonition to do the statistics before the experiment. A p-value is not the standard for accomplished research. The p-value is the minimum design value for the expected observation of an effect. An observation at the p-value is too uncertain to claim as an effect. Cookbook statisticians have the blame for a lot of gibberish.
In the interest of clarification and historical interest, and because I apparently am one of the few to attempt to understand the methods and history of mathematical statistics, I will go "Once more into the breach". First I suggest you read my posts above and the material provided there. The most important, and some new, material is attached here. This is REQUIRED READING for anyone that wishes to post anything else in this ancient thread with my name attached. If you don't Know the rules you don't get to play the game. NOW:
1. There is no such thing as NHST currently. By now you have read what the Fisher and N-P approaches are actually about.
2. If you don't follow these very well, then ask questions. However, you don't get to criticise what you don't understand. BTW, I believe Lehmann does a very good job of explaining why these two approaches are complementary.
3. I am not going to comment on a couple of silly statements that I have read in these postings. At this point you should be able to do that on your own.
4. The combined approach is exactly what was done in every social science thesis I have read. Further, it is what is applied in the STEM areas where appropriate. The use of confidence intervals falls under combined theory.
5. According to the Springer book I cited above, this is also advocated in the biostatistical area.
6. I do not advocate the use of the p-value as it is commonly used in medical journals, e.g., we report that up is not down (p
It is good to know we have the ultimate authority who can post:
"This is REQUIRED READING for anyone that wishes to post anything else in this ancient thread with my name attached. If you don't Know the rules you don't get to play the game. "
I am impressed and duly intimidated.
Jochen Wilhelm
I'd be glad to answer. The statement is meaningless because it gives no information about tests performed, models used, and so on. Research must be repeatable to be meaningful. There is no way in the world I could confirm that statement as likely true with some error probability. I am saying, I suppose, that a statement supported only by a p-value and nothing else is meaningless because we cannot verify it as being true with a known error probability. I admit I should have taken an example from a current journal. Hope this clears that up. Now the rest of it. The point I have been trying to make, and Lehmann makes, and Berk makes, is that modern applied frequentist statistics is NOT Fisher vs. N-P, but a very useful combination of the two, which gives us a much more powerful method to use in applications. I see N-P as a fulfillment of something Fisher started. If Neyman and Fisher had had different personalities, I believe this conversation would not have taken place. My position is that there is currently only one way to test a hypothesis in frequentist statistics, and that is what Sloughter calls the N-P paradigm. N-P clarifies Fisher, in my opinion. I was hoping people would read those papers instead of arguing over a p-value. If you tell people what you do, then it is clear what a p-value means; I can verify it and your conclusion. And so on. The problem is not the p-value but rather poor exposition. I hope I have finally made myself clear.

@Joseph Alvarez: In my 40+ years of teaching statistics at all levels, I have always required students to read appropriate material prior to class discussion in order to give them context. However he got it; @Jochen Wilhelm has that context and provides useful questions for clarification of pertinent statements. That tells me that I was on the right track as an educator to proceed in this way. May the force be with you and your students. David Booth
I agree with David Eugene Booth that there is nothing wrong with p values, rather, the problem is with the usage.
Certainly p is derived from the probability theory established by theoretical and core statisticians who taught the rest of us applied statistics. It neither validates nor invalidates research. It is a guide to the extent of achievement of an agreeable expectation in a research effort. So people play with different probability values, .01, .05, which are themselves arbitrary.
As already stated, I am in agreement with David that additional requirements may be necessary, but there is nothing wrong with use of p values.
History is made by human beings, and the point from which to start thinking is what David Eugene Booth wrote: "If Neyman and Fisher had had different personalities I believe this conversation would not have taken place".
Jochen Wilhelm
it is unfortunate that some find it necessary to distract instead of discuss. Claims (with references) that it must be right because everybody does it are not discussion. You are correct that NHST is logically flawed. The logical flaw is that the NHST makes objective, logical decisions. The NHST is, as you say, subjective. The problem with NHST starts with the misuse of H0. The problem with H0 is the confusion of a statistical hypothesis with a scientific hypothesis. Indeed, the statistical hypothesis is in support of an investigation into a scientific hypothesis. The error in naming H0 as a scientific hypothesis is the isolation of the data from the assumptions.
The data question is what data are needed to distinguish a difference from the null conditional data. The usual p-value approach of 0.05 selects a minimum data criterion that is barely distinguishable from the null and highly uncertain. This difference, in the context of assumptions and conditions of the scientific question, cannot support a decision.
Nevertheless, everybody does it, so it must be OK.
Jochen Wilhelm
I agree that other methods, perhaps Bayesian, would in some circumstances provide a better approach. This is well known. However, in our world I stand by my colleagues and mentors, Lehmann and Berk, as cited above. In the absence of a suitable prior I see no other solution. If we are still arguing about p-values, then I think a discussion of Bayesian methods is yet in the future. However, reporting only a conclusion with a p-value is unacceptable in science. All I can say is we would welcome a model that takes everything into account. However, I see none on the horizon. The difference between statistics and mathematics is that statistics is attached to the world as we find it. I have noticed in the postings that you have apparently accepted some of "my type" of models as useful.
May I respectfully ask what you do in your research should issues like these appear?
Now,
Joseph L Alvarez Since I have apparently been unclear previously, I hope you read the previous paragraph. That everybody does it does not make it OK. That those who actually understand the model and accept the model's limitations use it makes it helpful. As I said, a perfect model is not on the horizon.
David Eugene Booth my problem with your answers is the failure to offer discussion. Insistence that we read cookbooks to see what everybody else does is not discussion. Methods and models as solutions do not address the problem. The confusion starts with "The difference between statistics and mathematics is that statistics is attached to the world as we find it." No. Statistics is mathematics attached to the data as we collect it. Statistics, inasmuch as it can, attaches to the world as we find it by our assumptions and conditions of data collection.
Joseph L Alvarez I rather doubt that Erich Lehmann ever wrote a cookbook about anything. If you could find it, please give me the citation. That serves two purposes. I request the citation because I am quite capable of finding such a book and verifying your claim or not. Note that saves you the work of writing detail that many of your readers are familiar with and providing it for those who are not and would like to brush up. Perhaps you can provide a second citation that will allow me to remedy my problem with respect to my failure to offer discussion. Again, a citation would be helpful to save you the work of explaining it to me if, of course, you believe that I can read and understand it on my own. With all due respect and thanks, David Booth
David Eugene Booth Please read your answers to this thread. Your last one, in particular.
The p-value is the probability of rejecting the null hypothesis when it is true.
Ankit R. Patel Joseph L Alvarez David Eugene Booth Miky Timothy
Article Evidence against vs. in favour of a null hypothesis
Article Bayesian alternatives to null hypothesis significance testin...
Article Bayesian alternatives for common null-hypothesis significanc...
A p value (observed significance level) is the probability of obtaining a test statistic value equal to or more extreme than that obtained from the sample data when the null hypothesis is true.
An alternative approach to a hypothesis test uses the p-value rather than the critical value:
Reject H0 if the p-value < α.
For example, for a left-tailed test, the p-value is the left tail area beyond the observed test statistic.
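As a concrete illustration of that textbook recipe (made-up numbers, a left-tailed one-sample z-test for simplicity):

```python
import math
from scipy import stats

# Hypothetical left-tailed one-sample z-test:
# H0: mu = 100, H1: mu < 100, known sigma = 15, n = 36, observed mean = 95.
mu0, sigma, n, xbar = 100.0, 15.0, 36, 95.0
z = (xbar - mu0) / (sigma / math.sqrt(n))

# p-value = area in the left tail beyond the observed test statistic
p_value = stats.norm.cdf(z)
alpha = 0.05
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```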
Chung Tin Fah, that is indeed the textbook definition of a p-value. However, the topic question of the thread was not what is a p-value, but the more difficult issue of "What is the role of "p-value" to validate any results?"
My guess is that you probably agree with many of the other replies that a p-value cannot "validate" anything, especially in the context of NHST. However, can you please explain how one would practically make use of the textbook definition in a real-world scenario? Thank you.
Cristian Ramos-Vera, JASP is an awesome program that I use all the time, and there is much to be said for the adoption of Bayesian thinking in science. Nonetheless, the original poster is certainly speaking about frequentist statistics. By linking these three articles, is your answer to the original question that we should abandon frequentist hypothesis testing in favor of some kind of Bayesian alternative hypothesis testing? Please explain. Thank you.
By the way, while I have tremendous respect for EJ Wagenmakers and what he and other Bayesians are trying to accomplish, I have a lingering suspicion that the problem with statistics in science is mindless INFERENCE, rather than some kind of procedural issue. I.e., in my opinion, inferential statistics of any kind should be used only when necessary, and descriptive methods should be the main approach.
Miky Timothy - I will bootstrap my data to see what the outcome is. A lot of the time, the data I have may not conform to a normal distribution, as required for a p-value calibration. Thanks for the query.
Chung Tin Fah, really cool answer! Of course it depends on the ultimate goal, but I too would verge towards something like a simulation analysis using a non-uniform random number generator. This would be simply to get a sense of just how relevant my data are agnostic of real-world effect considerations.
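For what it's worth, a minimal bootstrap sketch of the kind mentioned above might look like this (Python, with made-up skewed data and a plain percentile interval for the mean):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical skewed (non-normal) sample, e.g. waiting times
data = rng.lognormal(mean=1.0, sigma=0.8, size=40)

n_boot = 10_000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(n_boot)
])

# Percentile bootstrap interval for the mean: no normality assumption needed.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {data.mean():.2f}")
print(f"95% bootstrap interval: ({lo:.2f}, {hi:.2f})")
```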
The p-value is an “estimate” of a “Random Variable P_value”, related to a Statistical Hypothesis H0 (about a parameter of a distribution, a type of distribution, …) used to assess whether H0 is “plausible” (according to the data). It is not a signal-to-noise measure… To understand, consider the wrong Minitab T charts… Where is the noise there?
Let H0 be “the process is In control”. Consider the case (Example 7.6 in the Montgomery book Introduction to Statistical Quality Control, 7th edition, Wiley & Sons) where he writes “A chemical engineer wants to set up a control chart for monitoring the occurrence of failures of an important valve. She has decided to use the number of hours between failures as the variable to monitor”. Here are the data (exponentially distributed), named lifetime:
286, 948, 536, 124, 816, 729, 4, 143, 431, 8, 2837, 596, 81, 227, 603, 492, 1199, 1214, 2831, 96
Montgomery and others conclude that H0 cannot be rejected… and is “plausible”!
YOU Jochen Wilhelm SAY: “you need a prior plausibility statement that is modified by the evidence in the data, resulting in the posterior plausibility.”
Do I really need it?
Massimo Sivo, your example from the textbook does not include which approach the engineers used to conclude that 'the process is in control'. Where are they getting their 'control limits' estimates, exactly? In my opinion, this is the crucial bit in the context of the conversation. It would be insane to apply identical or arbitrary 'six-sigma' criteria to assess assembly line reliability of flat-screen TV's and nuclear fuel rods.
Jochen Wilhelm
's comment "[that] you need a prior plausibility statement that is modified by the evidence" is perhaps of greater relevance to the sub-field of reliability analysis than to any other statistical approach. In a goodness-of-fit model, you would assess failures of a (preferably large) sample of objects in relation to, for example, an exponential distribution based on a priori determined parameters of scale and location. Resampling can come in handy here, and p-values actually become meaningful (though not in the context of mindless 'null hypothesis testing' dogma).
Finally, I encountered a site while replying to this thread. It relates to the failure control method mentioned. I haven't looked at the site much, but it seems quite relevant:
http://nomtbf.com/
A p-value is used in hypothesis testing to help you support or reject the null hypothesis. The p-value is the evidence against a null hypothesis. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.
I beg your pardon: I assumed “a priori” that you knew the Montgomery book Introduction to Statistical Quality Control.
Several times people run away from problems…
The problem is: is H0=[“the process is In Control”] plausible, given the data?
Simply that!
Do you not like the exponential distribution? Do you like the Weibull distribution? Do you like the Gamma distribution? Do you like the Normal distribution? Do you like ….?
The problem is still the same: is H0=[“the process is In Control”] plausible, given the data?
Following Nelson (1994) Montgomery solved this problem by transforming the exponential random variable to a Weibull random variable such that the resulting Weibull distribution is well approximated by the normal distribution: transformed data=(exponential data)^(1/3.6).
THEN Montgomery draws the Control Chart related to the data well approximated by the normal distribution.
AND finds that H0=[“the process is In Control”] does NOT have to be rejected: high “p-value”!
Montgomery and several others conclude that H0 cannot be rejected… and is “plausible”!
ACTUALLY the process is Out Of Control and H0 has to be Rejected!
“””Data is subject to noise and variance, and a few values we observe may fool us quite a bit.”””
Do the data [20 “rare data”] fool us?
The problem is still the same: is H0=[“the process is In Control”] plausible, given the data?
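For anyone who wants to check the mechanics being argued about, here is a rough Python sketch of the transform-then-chart approach described above (Nelson's x^(1/3.6) transform, then an individuals chart with the usual moving-range constant d2 = 1.128). Whether the chart signals depends on exactly which chart and constants are used, so treat this as an illustration rather than a verdict on the example:

```python
import numpy as np

# Times between valve failures (hours), from the example discussed above
lifetime = np.array([286, 948, 536, 124, 816, 729, 4, 143, 431, 8,
                     2837, 596, 81, 227, 603, 492, 1199, 1214, 2831, 96], dtype=float)

# Nelson (1994) transform: x^(1/3.6) makes exponential data approximately normal
x = lifetime ** (1 / 3.6)

# Individuals (I) chart limits from the average moving range (d2 = 1.128 for n = 2)
mr = np.abs(np.diff(x))
center = x.mean()
sigma_hat = mr.mean() / 1.128
ucl, lcl = center + 3 * sigma_hat, max(center - 3 * sigma_hat, 0)

print(f"center = {center:.2f}, LCL = {lcl:.2f}, UCL = {ucl:.2f}")
print("points outside limits:", np.where((x < lcl) | (x > ucl))[0])
```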
I agree with Jochen. Very often it is overlooked that p-values alone tell you hardly anything if you do not additionally account for other factors, like distributional form etc., and especially sample size. But if you can only interpret the p-value with regard to the other factors, why bother to use it in the first place?
Suppose all parameters of interest stayed the same across two studies, but one study had a very small sample size and the other a very large one, and the calculated p-values came to different conclusions, i.e., "not significant" for the small sample and "significant" for the large sample. How does this make any sense, if the estimated parameters were completely the same, and so was the decision of whether the size of the estimated parameters matters? The decision whether a process is "in" or "out of" control should not be made by p-values, but by a definition of what it means to be "out of control", i.e., at which parameter value it becomes harmful for, e.g., the production process, so that it will cost more money or even generate losses. These are hard questions to answer, but my impression is that p-values are used because researchers are lazy and don't want to think about the content of their models and variables, and strive for an easy way to make a decision instead of thinking about the real-world consequences of their variables. It is more important to interpret the parameters of interest themselves and their uncertainty than any artificial, dichotomous decision criterion like the p-value.
If you want to achieve this, you should use Bayesian inference, and I think Jochen meant this when he used the terminology "a priori". If you do not have much information "a priori", you can incorporate that directly into the prior distribution with vague priors. If you already have a lot of information at hand, this will lead to informed priors. But all of this does not excuse you from thinking about your variables and parameters themselves (as the p-value approach seems to suggest).
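Coming back to the sample-size point above, here is a small sketch (assumed summary numbers) of how the same estimated effect can be "not significant" in a small study and "significant" in a large one:

```python
import math
from scipy import stats

def p_from_summary(mean_diff, sd, n_per_group):
    """Two-sample t-test p-value from summary statistics (equal n and SD per group)."""
    se = sd * math.sqrt(2.0 / n_per_group)
    t = mean_diff / se
    df = 2 * n_per_group - 2
    return 2 * stats.t.sf(abs(t), df)

# Identical estimated effect (0.3 SD), different sample sizes
for n in (20, 500):
    print(f"n per group = {n:4d}: p = {p_from_summary(0.3, 1.0, n):.4f}")
```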
Jochen Wilhelm SAYS:
I cannot judge if the "process is in control" given a p-value alone (put aside that I don't understand exactly what is meant by "process is in control").
Rainer Duesing SAYS:
The decision if a process is "in" or "out of" control should not be made by p-values, but by a definition what it means to be "out of control", i.e. at which parameter value is it harmful for …
You both have no knowledge of what a statement such as H0=[“the process is In Control”] means, and you discuss it without using the provided data!!!
It’s rather amazing…
The approach that engineers used to conclude that "the process is in control" is basically: "The underlying concept of statistical process control is based on a comparison of what is happening today with what happened previously".
And "happened previously" is the goal of the engineer, based on legal product specifications or customer product specifications.
Process control is well and simply described in the link:
https://www.itl.nist.gov/div898/handbook/pmc/pmc.htm
Roberto Molteni
Thank you for the link, but here we do not find a solution for the basic statistical problem. Your link provided the following statement:
"Out-of-control refers to rejecting the assumption that the current data are from the same population as the data used to create the initial control chart limits."
Ok, and there it is: "...rejecting the assumption...". This is the crucial point Jochen was referring to: on what statistical basis are you willing to reject the assumption (or not reject it)? Typically, you fit a model (most often frequentist, and just mentioning: the topic of this discussion is "What is the role of the p-value to validate any results", so nothing else is considered here), fine. But what is the conclusion you can draw from it? As I saw in the link provided, the WECO rules for signaling "out of control" do not use p-values but formulate it in terms of mu +/- k*sigma, where there seems to be agreement on k=3 as the magic number where some evil things will happen, just like this mysterious 5% boundary (but I can also reformulate the p-value in terms of confidence intervals, e.g., mu +/- 1.96*sigma [or sigma/sqrt(n)...]). But why not k=2 or k=10? Are there situations where k=2 may already be detrimental, or others where k=10 is still ok? This is exactly what we meant. Instead of sticking to some arbitrary magic number, it should be formulated case by case, study by study, variable by variable which real-world consequences exist and how large the effect may/can be to be of interest (and I found in the link that there is also a discussion about the meaning of 3 sigma and an article for m, n, ARL calculation [so similar to a power analysis], which in turn just sets this magic 3-sigma boundary without any rationale for it, like 5%...).
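One way to see that the 3-sigma rule and the 5% rule are the same kind of convention: under an assumed in-control normal model, each k corresponds to a false-alarm probability per plotted point and an in-control average run length (single-point rule only; a rough sketch):

```python
from scipy import stats

# False-alarm probability per plotted point for a +/- k*sigma rule,
# and the corresponding in-control average run length (ARL0 = 1/alpha).
for k in (2, 3, 10):
    alpha = 2 * stats.norm.sf(k)     # two-sided tail area beyond k sigma
    print(f"k = {k:2d}: alpha = {alpha:.2e}, ARL0 = {1/alpha:,.0f}")
```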
So back to p-values and null hypothesis testing: a small p-value tells you that there is more signal than noise (loosely speaking), but it does not tell you if the signal is of practical relevance. This is because statistical significance is totally independent from practical relevance. You can have any combination of both:
1) the effect is large enough to be relevant and your test is significant (perfect!),
2) the effect is large enough to be relevant, but your study is underpowered and not significant (bad luck... but your effect is still relevant, isn't it?? Because we know relevance and significance are independent, right?--> you increase the sample size to get it significant, because you already know that it is relevant and you want to publish it...),
3) the effect is too small to be relevant, but your test is significant (good for you, at least you can publish anything, although not relevant [maybe for metaanalyses, but I am pretty sure most of the time it will be presented as something very "meaningful"...]),
4) the effect is too small and the test is not significant (fine as well, as intended)
So what can we take away from it? To determine if an effect is relevant, statistical significance can only help, if we make a lot of additional assumptions to interpret it correctly (e.g. you have to know the power of your test, which has also been discussed in the "out of control" link, how many samples are good enough). So imho, we should take care to have a good data collection (which represents the target population and the data generating mechanism appropriately) and such an amount of data that the uncertainty in the parameter estimation is below an acceptable level (which in turn has to be determined case by case and not an arbitrary value like 5% or 3sigma or what else...). It puts much more effort in the understanding of the processes and the variables, that is true and I am afraid a lot of people are not willing to do it, because it is convenient to rely on a simple system.
@Massimo: I am afraid you overlooked some basic principles, otherwise you would not have written your last post. This is rather amazing.
Jochen Wilhelm , Rainer Duesing , Roberto Molteni , others…
The attached file tries to put the problem in the right perspective.
Dear Massimo,
thank you for the document, but I have the impression it is still missing the point. In your document there is nothing about p-values; it is about transforming the data to get a normally distributed variable, I assume to be able to fit a model that is based on normally distributed errors or whatever. There is nothing in it that directly fosters the discussion of whether p-values should be used for decision making, since it is not about NHST.
Interestingly enough, the author did not use any p-value-based decision criterion to decide whether the data are normally distributed; he used a normal probability plot (a form of QQ plot) to visually inspect whether the fit is good enough!!! Again, no need to use p-values. Quite the contrary (and I repeat myself): goodness-of-fit tests for normality (e.g., the K-S test, the S-W test) suffer from the same problem as all other tests: with too few data points they will miss obvious deviations from normality; with a lot of data points they will detect even the slightest (maybe not harmful) deviations from normality and become significant.
Maybe I have to correct myself: your document provides evidence for Jochen's and my standpoint.
Rainer Duesing…
You do not want to consider the problem
YOU SAY:
I see that you do not use Minitab (that was used by Montgomery!!!)
If one uses Minitab to compare competitive models, "p-values" are used ….!!!
Montgomery decides that
WHY YOU do not want to ASSESS IF (instead of waffling….)
Use anything you want:
AND FIND IF
Dear Rainer Duesing ,
in some way we are in the same direction. You stated: " Instead of sticking to some arbitrary magic number, it should be formulated case by case, study by study, variable by variable which real world consequences exist and how large the effect may/can be, to be of interest"
I agree, and of course the "magic number" has to be different, case by case, based on the production being controlled: as a general rule, e.g., a pharmaceutical company should have a stricter "magic number" than a food company, and a food company a stricter one than an animal-food company.
The link provided was just an example, and it refers mainly to engineering. I am not an engineer, but I think that engineers are also able to work case by case: e.g., Project Apollo should have had a stricter "magic number" than producing an Italian coffee machine (although the latter has now improved a lot).
Dear Massimo,
apparently, you have no clue what this whole discussion is all about. You demonstrated with your last post your lack of basic knowledge.
1. You said "use anything you want...". P-values are the inferential result of methods like regression, ANOVA, etc.; they are not a method themselves. So our criticism of p-values covers all of them (except likelihood methods, which by contrast can be used to make statements about plausible parameter values instead of p-value decisions).
2. You stated: "I see that you do not use Minitab (that was used by Montgomery!!!)". Please tell me what this statement has to do with anything?? Minitab is just a tool, a program to do statistical analyses. You can use any tool you want, I prefer R, SPSS and MATLAB, but it really does not matter. If there are Bayesian analyses available in Minitab, they wouldn't report p-values either (never used it, so I don't know anything about Minitab). So it is not about the program, but the methods used within the program. Therefore, your statement is absolutely rubbish! (And who is Montgomery anyway? I never trusted arguments from authority... therefore I don't care, just bring good arguments). Normal probability plots can be produced by any program; I do not need to use Minitab to understand the output. If this ominous Montgomery used Minitab to produce the NPP, he achieved more than you so far.
3. Deriving from my point 2, your statement "if one uses Minitab to compare competitive models, p-values are used….!!!" is also of no value. It is exactly the point of this whole discussion started by Ankit R. Patel whether p-values should be used or not. It's not about comparing specific models. And generally speaking, your statement is false in itself: if you think that p-values are needed to compare models, you can also use Bayesian statistics to compare models against each other (e.g., Bayes factors) or information criteria like AIC or BIC.
4. Apparently you want us to make some statement about the data you presented. Jochen showed that it is not possible to make useful statements without knowing the context. He showed you that with a different model you can come to different conclusions. And again, this is exactly the point of the whole discussion: p-values ALONE tell you nothing. You have to know much more about the goal, the data-generating process, and the costs that are related to either the one or the other decision. THIS specifies the effect size of practical relevance. And with this effect size at hand, you can do useful frequentist statistics, but as said before, you have to consider all these prerequisites to interpret it correctly.
5. Jochen told you that he needed more information about the goal to come up with an answer, but you provided nothing. I can only speak for myself: I really don't care about your data. I only care about the original question asked by Ankit. Your contribution is less than zero, because it is destructive and does not help us talk about the problem and question at hand.
I really would like to have a decent discussion about the pros and cons of p-values.
P.S.: Dear Roberto Molteni , sorry, I forgot about your last post. Yes, I totally agree with you. Different applications should use different "magic numbers" ;-)
Rainer Duesing , Roberto Molteni , Ankit Patel, other wafflers…
Let's start from the beginning to see IF I am in line with Rainer Duesing's statement """I really would like to have a decent discussion about the pros and cons of p-values."""
In July 2019 (1 year ago!) Ankit Patel of the University of Minho WROTE
My first answer is dated 10 July 2020 (3 days ago):
The p-value is an “estimate (stupid name, corrected by Jochen to "realization", to which I agree)” of a “Random Variable P_value”, related to a Statistical Hypothesis H0 (about a parameter of a distribution, a type of distribution, …) used to assess whether H0 is “plausible” (according to the data). It is not a signal-to-noise measure… To understand, consider the wrong Minitab T charts… Where is the noise there?
YOU ALL relate the Statistical Hypothesis H0 to a parameter …
Soon after I set the Null Hypothesis H0=[“the process is In control”] about the case (Example 7.6 in the Montgomery book Introduction to Statistical Quality Control, 7th edition, Wiley & Sons) where he writes….
So we had a Statistical Hypothesis H0 without a parameter!
I provided the data to let people compute the “p-value” and decide about H0.
THIS is in line with the Ankit Patel question
THAT IS
NONE of YOU decided to use the data to answer.
You Rainer Duesing said “””I really don't care about your data.”””
The case is clear… Read it.
There is NO NEED of BAYES!!!
End of the story: the RESULT
H0=[“the process is In Control”] versus H1=[“the process is OUT OF Control”]
Dear Massimo,
you really seem not to get the point made several times now. Jochen stated:
" The problem is what is not provided: the subject matter knowledge about the system under study and what the statement "the process is under control" means in that context and what would indicate that the process might not be under control. "
YOUR decision that it is "in control" or "out of control" is a copy-and-paste example from some textbook, where the textbook wants to illustrate how a decision can be made if you apply the standard (arbitrary) rules.
Now the problem: you gave us YOUR result and decision --> H0 REJECTED by the p-value
Why?
Why do you reject H0 with this particular p-value? Because it is smaller than 5%? Why does it not have to be smaller than 0.5%, or p < 5*10^-10? THIS is one crucial point (besides others).
Jochen asked for the contextual factors, which are the real-world consequences of one decision or the other (e.g., costs if you decide "in control" or costs if you decide "out of control"). These are things your textbook example did not provide, and that's why we can't evaluate the consequences. A textbook example is fine to illustrate the mechanical calculations needed to obtain a result, but not more.
And you are right, there is no need for Bayes per se (as Jochen and I pointed out already), since frequentist and Bayesian inference answer different questions. And there is nothing inherently wrong with p-values themselves or with the method. It is the too-simplistic application and interpretation of them by many researchers. As I already wrote, if you consider all the prerequisites, it can be a useful tool, but I doubt that this is done most of the time.
Maybe one additional point:
Massimo, considering your decision "out of control": I assume this would result in stopping the production or whatever, right? Would it be possible (in the real world) to calculate the costs if you do not stop the process although your test told you "out of control"? Would it be possible to calculate the costs if you stop the process, as your decision suggests? Would it be possible to compare these two cost calculations? Is it possible that (although your test suggested stopping the process) the costs of stopping it would be higher for the company (because stopping a production line is very expensive) than keeping it running? I think all these questions can be answered with "yes", and that's why a p-value as the only decision criterion is not very useful.
The lesson to be learned by the “chemical engineer (that) wants to set up a control chart for monitoring the occurrence of failures of an important valve. She has decided to use the number of hours between failures as the variable to monitor” is twofold:
1. the wrong Montgomery method provided a wrong conclusion!
2. Due to the fact that “the process is Out Of Control”, the “chemical engineer” should inform the Company manager of the chemical process that the “important valve” MUST have better reliability… IF the Company wants to increase the Process Availability and Yield: 0.875 Steady State Availability is not enough for a Company that wants to make a profit.