On the web I found that three scholars wrote (special issue of The American Statistician):
1. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely.
2. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.
3. Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless.
Thank you
Dear Ette Etuk
I would say that it is a scientific attitude.
The convention is to use alpha=0.05 as a threshold level… for declaring "statistical significance".
If you don't like it, don't use it. However, don't ban it so nobody can use it. Fisher and Neyman were bright guys, and this stuff has been around for a century, so someone is using it. Do we really want to kill the concept? Can it be further improved? D. Booth
There is already a similar, lengthy, but worthwhile discussion on RG about NHST, which is the hybrid of Fisher's and Neyman & Pearson's methods.
https://www.researchgate.net/post/Should_Null_Hypothesis_Significance_Testing_NHST_be_eliminated
I agree with David Eugene Booth that the original methods have their value, but that their application is flawed, especially with NHST, the hybrid method in the narrower sense. Here is a good article on what the problems are:
https://www.frontiersin.org/articles/10.3389/fnhum.2017.00390/full
So my opinion: the indiscriminate use of "significance" should be dismissed, especially if you are testing exploratively, but not the concept per se.
Thank you, David and Rainer.
I do not think that the concept and the use of “Statistical Significance” have to be dismissed.
It is really important in GOOD applications…
The application is sometimes flawed, and I will not enter into the discussion on RG about NHST, the hybrid of Fisher's and Neyman & Pearson's methods.
At least, I would be pleased if there were a clear understanding of "statistical significance" and "biological significance".
I think that a sensible solution could be the use of Confidence Intervals: they form an equivalence class.
IF, for example, we used the symbol CI(H0, CL, dof, Exp) = A----B, where CL is the Confidence Level, dof the Degrees of Freedom and Exp the distribution of the collected data (e.g. Exponential), we could decide whether the Hypothesis H0 can be "statistically rejected" (Statistical Significance), and we can compute the p-value if we like!
IF we had TWO confidence intervals CI1(H0, CL, dof1, Exp) = A----B and CI2(H0, CL, dof2, Exp) = C----D, we could, as well, decide whether the means of the TWO samples can be "statistically considered different" (Statistical Significance), and we can compute the p-value if we like!
Doing that we could avoid MISLEADING statements like the following (in the attachment)
1) Confidence Intervals do not solve any problems, since they are just another representation of the p-value. If the p-value is smaller than the CL then the CI does not include the comparison value declared in H0. So, if you do not think differently about CIs, this does not help much.
2) What is misleading about your attached statement??? It does not concern statistical significance at all, quite the contrary in my opinion. Calculating (standardized or unstandardized) effect sizes to determine the practical value of your results seems more plausible than NHST.
Rainer Duesing
If I understand your point 1), there is something odd…
Let alpha = 0.1 and let the p-value = 0.15; THEN H0 cannot be rejected.
Since CL = 1 - alpha = 0.9, the p-value (0.15) is smaller than CL, so by your statement 1) the CI should not include the H0 value, although H0 cannot be rejected.
If you set alpha = 0.05, for example, and calculate a simple t-test or calculate the confidence interval with CL=1.alpha, both will agree in their outcome. Either the t-test is "significant" AND the CI does not include the target value, or the t-test is not "significant" AND the CI includes the target value. Therefore, both are in principle the same (unless you use different methods to calculate the one or the other, like bootstrapping the CI).
Rainer Duesing
I beg your pardon, BUT I do not follow you: I use your numbers…
If I understand your point YOU suggest to
· set alpha = 0.05… AND
· calculate the confidence interval with CL=1.alpha
Perhaps there is a mistyping: 1.alpha in place of 1-alpha
I repeat my argument with your data….
alpha = 0.05 and let the p-value = 0.15; THEN H0 cannot be rejected.
Since CL = 1 - alpha = 0.95, the p-value (0.15) is again smaller than CL, so by your statement 1) the CI should not include the H0 value, although H0 cannot be rejected.
Consider a model with parameter µ. We assume that µ takes some fixed unknown value u. You collect some data to estimate u. I will denote the estimated value with m. For any chosen value of µ (let's call this value H0) you can calculate a p-value. The more m and H0 differ, the smaller the p-value.
The significance level (SL) or "alpha" is the largest p-value for which you are going to reject H0, concluding that your H0 and the estimate m are sufficiently different to say whether u < H0 or u > H0.
The confidence interval (CI) is (usually) defined as the interval of µ containing values that are not rejected at some SL (all values of µ that would give p > SL). Under µ=u, the probability that such intervals miss u is just SL, and the probability that they contain u is 1-SL = the confidence level CL. Note that the probabilities refer to the CI as a random interval, or to the "process of creating such intervals", not to any particular observed interval (it makes no sense to say that any given interval has that probability! The interval either contains u or it does not). Therefore these intervals are not called "probability intervals" but "confidence intervals".
Now let lwr and upr be the lower and upper limit of a CI with confidence level CL = 1-SL. Any test of µ=H0 < lwr and any test of µ=H0 > upr (H0 being outside of the CI) will give p < SL, and any test of lwr < µ=H0 < upr (H0 being inside of the CI) will give p > SL.
Concrete: say SL = 0.05, then CL = 1-0.05 = 0.95. If you test µ=3.4 and you get p = 0.04, then 3.4 is not inside the 95%-CI. If you test µ=-2.5 and you get p = 0.87, then -2.5 is inside the 95%-CI. 3.4 would just be at the border of a 96%-CI, and -2.5 would be just at the border of a 13%-CI.
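For example, a small R sketch of this border property (the data and the H0 value below are made up):
# One-sample t-test: if the two-sided p-value for H0 is p, then H0 sits exactly
# on the border of the (1 - p)*100% confidence interval.
set.seed(1)
x  <- rnorm(20, mean = 1, sd = 2)                # hypothetical data
H0 <- 0.2                                        # hypothetical null value
p  <- t.test(x, mu = H0)$p.value
t.test(x, mu = H0, conf.level = 1 - p)$conf.int  # one limit equals H0 (up to rounding)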
1) yes it was a TYPO CL=1.alpha --> CL = 1-alpha
2) Apparently my explanation was not clear. If you calculate a p-value for the comparison of, let's say, a mean value against 0 (H0), and compare it with alpha (either p < alpha or p ≥ alpha, hence reject or do not reject H0, respectively), you will come to the same conclusion as if you calculate a CI around your mean value with CL = 1 - alpha and check whether 0 is within the CI (do not reject H0) or not within the CI (reject H0). Two sides of the same coin. Just try it yourself in R.
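A minimal sketch of that check in R (made-up data; 0 plays the role of the H0 value):
# p-value vs. alpha and CI vs. H0 always give the same decision for the t-test.
set.seed(123)
x     <- rnorm(30, mean = 0.4, sd = 1)           # hypothetical sample
alpha <- 0.05
tt    <- t.test(x, mu = 0, conf.level = 1 - alpha)
tt$p.value < alpha                               # reject H0?
!(tt$conf.int[1] <= 0 & 0 <= tt$conf.int[2])     # 0 outside the CI? Same answer.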
And yes, my statement "If the p-value is smaller than the CL then the CI does not include the comparison value declared in H0." is wrong in this form.
Correctly: If the p-value is smaller than 1-CL, then the CI does not include the comparison value declared in H0.
Of course if the CI is constructed with the CL.
Rainer Duesing, I believe the arguments Massimo Sivo is making are based on six-sigma dogmatism often taught in corporate settings. This approach has been extensively criticized and is a kind of p-hacking/HARKing for engineers. My impression is the approach is based on a misapplication of likelihoodist philosophy.
W. Edwards Deming must be rolling in his grave, since I can't imagine he would approve of the current state of reliability statistics.
But it's not quite NHST so try to understand what Massimo is getting at in that context.
Miky Timothy but he did not say anything about six sigma, only about "significance" and p-values.
Miky Timothy
I am very much against SixSigma!
I am very much a disciple of Deming: be sure, he is not rolling …
Massimo Sivo could you please state whether the information Jochen Wilhelm or I provided was helpful for understanding the problem with CIs when they are just substituted for p-values? Thank you for answering, I appreciate it.
Rainer Duesing, yes you are right. I assumed that the context here was from the six-sigma field because Massimo Sivo has quoted DC Montgomery extensively, who aside from occasional hedging articles on 'bad six-sigma,' appears to be a promoter and proponent of the status-quo approach, in general. In my mind, I was trying to differentiate this post from the countless identical ones on RG by attempting to get at what Massimo might be implying.
You are right Rainer Duesing, that this might be more about the 'New Statistics', a bright idea that I have been seeing in journal guidelines, as of late, and in my opinion has yet to yield, in its application in published research, a non-cringe-inducing result. I'd link the article 'The fallacy of placing confidence in confidence intervals' but I find the central submarine example an unhelpful reductio.
Estimation statistics seems to be a formula for taking real-world measured empirical information and transforming it into meaningless standardized 'data' inaccurately labelled as effect. The best refutation is actually the article ' Estimation statistics should replace significance testing' in Nature Methods which falls apart when one considers the practical implications of this philosophy, as well as the catastrophic failure of publication-biased, micro-effect-centric 'evidence-based' medicine in 2020 and well before.
Then again, smarter people than me surely disagree, so what do I know!
Massimo Sivo , I stand corrected about your analytical philosophy. I certainly did not mean that you are responsible for Deming rolling in his grave, which would be a terribly unkind thing to say - I meant six-sigma is the reason for Deming's rolling! Anyway, I too would appreciate your response to Rainer Duesing, Jochen Wilhelm, etc., and their answers, which are simply the literal definitions and implications of the terms you are discussing.
Finally, Joseph L Alvarez in other threads has repeatedly provided the true definition of 'significance' and, in my opinion, made short work of refuting NHST. Worth reading.
Miky Timothy many thanks for your detailed answer, and I agree with your statements. I believe that there is not THE best method out there, but some are more useful or informative than others; most (if not all) of them are prone to misuse, misapplication and misinterpretation.
A Fisherian p-value has its use (but misses effect sizes); the Neyman & Pearson method is a good idea (but how to specify a reasonable alternative hypothesis in many cases?); confidence intervals are quite informative (but are misused and misinterpreted as if they were something completely different from p-values, and give a false "confidence" -> see the Morey article); effect sizes are even more informative than confidence intervals, because they relate to the original variables (but are often reduced to standardized effect sizes used as mere cut-off points by citing Cohen (1988), without any consideration of their real-world meaning); Bayesian estimation is the most appealing approach, as it answers the questions researchers most often have implicitly in mind (but there is a trend to be reductionistic here as well, by "only" using Bayes factors and again just using cut-off values (e.g. by Jeffreys) to declare "strong" evidence, for example).
In my personal opinion, all methods have their rightful value and standing. For me the problem is not the methods per se, but their application. And why are they wrongly applied in so many cases? Because the current scientific system forces researchers to find and publish simple and quick solutions, since (nearly) all that counts is how much you have published. Therefore, it is much easier to use simple, mostly dichotomous decision criteria to declare findings "significant", which is synonymous with "important" or "publishable" in that context. And I am afraid even the best statistical method cannot avoid this problem. (And we should not condemn researchers caught in this system...)
An approach I find more and more appealing for myself is shown, for example, by John Kruschke (e.g. "Rejecting or Accepting Parameter Values in Bayesian Estimation" as a simple primer). It is not primarily focused on global tests of hypotheses (like Bayes factors), but more on the estimation of the parameters of interest (e.g. mean differences, regression coefficients etc.). You can also do "hypothesis" testing, by declaring a region of practical equivalence (ROPE) for the parameter of interest, which can be done in the original metric of the variable or with standardized effect sizes. Similar to the BF, you can find evidence for the H0 or for the alternative, but the results may also be indecisive.
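As a rough sketch of such a ROPE decision in R (the ROPE limits and the "posterior" draws below are invented, and a plain quantile interval stands in for a proper HDI):
rope <- c(-0.1, 0.1)                       # region of practical equivalence
post <- rnorm(1e5, mean = 0.3, sd = 0.12)  # stand-in posterior draws for a mean difference
ci   <- quantile(post, c(0.025, 0.975))    # simple credible interval (not a true HDI)
if (ci[1] > rope[2] || ci[2] < rope[1]) {
  "interval entirely outside the ROPE: reject practical equivalence"
} else if (ci[1] > rope[1] && ci[2] < rope[2]) {
  "interval entirely inside the ROPE: accept practical equivalence"
} else {
  "indecisive"
}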
Rainer Duesing , Jochen Wilhelm
Sorry for my late reply; please excuse me.
Your explanations about the p-value and the CI are perfectly clear: I agree with them.
I see one problem: the p-values “forget” the degrees of freedom and the distribution of the data, AFTER the computation.
That’s why I suggest the symbol CI(H0, alpha, dof, Exp) for showing the Confidence Interval.
I will come back when my example is ready…
Rainer Duesing , Jochen Wilhelm , Miky Timothy
At Politecnico I studied reliability. The test is stopped at a fixed time of 10 (units not shown).
The data are the “Time To Failure” (units not shown). The value 10 means that the item did not fail during the test.
10 items are tested for reliability in sample 1.
10 items, named “Improved?” are tested for reliability in sample 2.
We want to know if the “Improved?” items are “Statistically Improved”.
The symbol I suggest, CI(H0, alpha, dof, Exp), for showing the Confidence Interval allows one to "combine" the TWO Confidence Intervals and see if there is an improvement, by making a "sound statistical test of hypothesis".
See the file…
Sorry, but either I am missing something crucial here, or your numbers do not add up in the example. Could you please check your example and provide more information? Somehow it is not clear to me what you are trying to say or to prove. But maybe someone else can explain what I am missing here.
Massimo Sivo, below is a link to a video going over the concept and construction of Shewhart charts:
https://youtu.be/8Ln3emiwQzU
It is taken from this site:
https://learnche.org/pid/process-monitoring/shewhart-charts#shewhart-charts
Is this what your example seeks to convey?
Rainer Duesing , Jochen Wilhelm , Miky Timothy
My previous file is WRONG….
I made a very silly error!
Please see the CORRECTED file.
The Mean in each column is given by TOTAL/#Failures
My point is not related to Control Charts, BUT it has an influence on them…
Massimo Sivo sorry, but I still do not understand your example or what you are trying to show. I do not understand your numbers.
What is TOTAL in your example? What is #Failures? Your values represent time to failure? Why do you count the failures? Where do the mean values come from? Aren't you interested in the rate or scale of the distribution? So, please clarify what you are trying to accomplish.
Cheers, Rainer
Rainer Duesing
You wrote:
“Somehow it is not clear to me, what you are trying to say or to prove. But maybe someone else can explain, what I am missing here.”
I am trying to convey that, in my opinion, the concept of Confidence Interval, together with the SYMBOL I suggest, CI(H0, alpha, dof, Distribution), allows one to make "sound Statistical Significance (tests of hypothesis)" statements AND to "combine" TWO Confidence Intervals to assess (Statistically) whether there is a difference between TWO samples (Hypotheses).
See my Corrected EXAMPLE for the misleading statements_2 where I write:
At Politecnico I studied reliability. The test is stopped at a fixed time of 10 (units not shown).
The data are the “Time To Failure” (units not shown). The value 10 means that the item did not fail during the test. The data are Exponentially distributed…
10 items are tested for reliability in sample 1.
10 items, named “Improved?” are tested for reliability in sample 2.
· We want to know if the “Improved?” items are “Statistically Improved”.
For each sample (of 10 items) the data are “Time To Failure” (units not shown).
The data are Exponentially distributed…
ONLY the numbers < 10 show the time to failure of the items.
#Failures means the NUMBER of Failed items in EACH Column (Sample)
The value 10 means that the item did not fail: it arrived at the end of the test NOT FAILED
TOTAL is the sum of all the data in EACH Column (Sample)
Mean is TOTAL divided by #Failures.
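In R terms, a small sketch of these quantities (the times below are invented; 10 marks an item that survived to the end of the test):
sample1  <- c(7, 3, 10, 10, 10, 10, 10, 10, 10, 10)  # hypothetical Time To Failure data
Failures <- sum(sample1 < 10)                        # number of failed items in the column
TOTAL    <- sum(sample1)                             # sum of all the data in the column
Mean     <- TOTAL / Failures                         # TOTAL / #Failures
c(Failures = Failures, TOTAL = TOTAL, Mean = Mean)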
Massimo Sivo, your example seems to necessarily involve a complex multi-step procedure based on several statistical concepts. Since these steps aren't shown, it's hard to understand.
The focus on confidence intervals and a z-statistic for comparisons is very similar (down to the 10 items) to the example found here:
https://learnche.org/pid/univariate-review/testing-for-differences-and-similarity
But the statement that the data follow an exponential distribution with a range of interest makes it seem like you are discussing cumulative frequency analysis:
https://en.wikipedia.org/wiki/Cumulative_frequency_analysis#From_cumulative_frequency
Maybe a variation of a CUSUM chart would be the implementation goal?
I agree with Miky Timothy. I am not an engineer, and I find it rather confusing that, for example, all values are used to calculate the sum (isn't 10 rather an artificial value? the whole thing would have been different if 20 were the maximum, wouldn't it?), but only the count of failed cases is used as the denominator. This may be a convention in physical reliability analysis, but does not help to get your point, as Mike said.
Please provide an example which reduces to your main argument.
Miky Timothy
NO COMPLEX MULTI-STEP procedure….
As I said before at Politecnico I studied reliability.
Let n=sample size; let g=#Failures.
Just for simplicity let g=2: t1=7 and t2=3; the other data are all=10
Order the data in increasing order.
Consider the test as a "system made of n units all good at time 0"; the system ends its mission at t=10.
Compute the reliability of the system, R(10, MTTF), and let f(x) be the pdf of the "Time To Failure of the System": f(x) is just the likelihood of the collected data. It involves the Total Time on Test, named TOTAL.
This allows estimating the mean (MTTF) of any item as TOTAL/#Failures.
Let CL=1-alpha be the Confidence Level.
The solutions for MTTF of the two equations
R(TOTAL, MTTF)=1-alpha/2 and R(TOTAL, MTTF)=alpha/2
provide the Confidence Interval.
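A hedged R sketch of this interval: if the stand-by system is read as having g units (one per observed failure), the two equations reduce to the familiar chi-square form with 2g degrees of freedom (the numbers continue the simplified example above):
g     <- 2                       # number of failures (t1 = 7, t2 = 3)
TOTAL <- 7 + 3 + 8 * 10          # Total Time on Test = 90
alpha <- 0.05
MTTF  <- TOTAL / g               # point estimate of the mean time to failure
lower <- 2 * TOTAL / qchisq(1 - alpha / 2, df = 2 * g)
upper <- 2 * TOTAL / qchisq(alpha / 2,     df = 2 * g)
c(MTTF = MTTF, lower = lower, upper = upper)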
The symbol I suggest CI(H0, alpha, dof, Exp) shows the Confidence Interval.
You can find the Theory in my professor's books, by searching “fausto galetto quality reliability” on the web.
I hope that now you do not connect “Statistical Significance” with any type of Control Charts…
Rainer Duesing
My main argument is about Statistical Significance…
For Statistical Significance, I suggest the use of Confidence Intervals with the symbol
CI(H0, alpha, dof, DISTRIBUTION) to show the Confidence Interval.
The example shows the use of Confidence Intervals to compare TWO products, the second one considered Improved.
The Manager ASKS the Data IF it is “Statistically” so….
The paper by John P.A. Ioannidis that is shared in the project "Complexity in Medicine..." deals with your question in depth:
https://www.researchgate.net/project/Complexity-in-Medicine-Practical-Problems-Their-Definitions-Models-and-Solutions/update/5eb52749f155db0001fa4ba0
He goes deep into the mathematics and explains everything very well. Feel free to read it. I recommend it to everyone who works with statistics in his/her research, so that they know these issues.
Abstract of the paper:
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there are a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
Jiří Kroc
I read the paper “When Null Hypothesis Significance Testing Is Unsuitable for Research: A Reassessment” by Denes Szucs and John P. A. Ioannidis
It does not help in my case.
Thank you for the link.
I will read that paper…
Jiří Kroc
I tried to download the paper “””Why Most Published Research Findings Are False””” by John P. A. Ioannidis PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124
BUT I did not succeed
Rainer Duesing
You asked about the value 10 for the end of the Reliability test.
IF the value 20 were the end of the Reliability test, we would have more failures (maximum 10).
IN ANY CASE the computed mean would be AGAIN TOTAL/#Failures
You write “””This may be a convention in physical reliability analysis, but does not help to get your point, as Mike said.”””
Ok, my last attempt.
Please show, side by side, how a typical CI would be calculated in your example, and how your method (of representation) improves the information content and confidence of the results.
Rainer Duesing ,
Massimo wrote that he is a student of Fausto Galetto. It seems plausible, as he uses almost as much CAPITALIZATION and as many bold statements in his texts. I am only missing the many, many exclamation marks!!!!!!!!!
https://www.researchgate.net/scientific-contributions/2136505695-Galetto-Fausto
Here is my personal favourite: Method Open letter to Jochen Wilhelm
Massimo Sivo, you could have saved us a lot of time by simply saying that your example was showing MTTF (mean time to failure). This was never mentioned previously and is indeed a complex multi-step procedure.
In my first post in this thread, I mentioned to Rainer Duesing my guess that your "approach [was] based on a misapplication of likelihoodist philosophy" and that it had little to do with NHST. This was indeed the case. It appears your argument is based on generalizing the niche MTTF reliability hack to all of frequentist statistics. Here is the 'logic' behind MTTF:
"The MLE of the failure rate (or repair rate) in the exponential case turns out to be the total number of failures observed divided by the total unit test time."
This is quite convenient versus choosing and deriving (and understanding and justifying the use cases of) likelihood estimates for other distributions in a real-world, context-dependent manner. But ease of use doesn't mean it makes sense (nor would blindly using Weibull, etc., for that matter). And your numbers don't make sense, even if proper background were provided, since the method leads to arbitrary, plainly unintuitive estimates. There is a website devoted to debunking MTBF/MTTF reliability:
http://nomtbf.com/perils/
http://nomtbf.com/2014/02/calculate-mttf/
There are countless articles explaining why the assumption of a constant failure rate is irrational and leads to nonsensical circular conclusions that nobody believes or reads, but that are demanded by rubber-stamp corporate managers.
I've attached an article that again details the pitfalls of mindless reliability and engineering p-hacking.
Hopefully my assessment informs the context of Massimo's argument and reflects the situation accurately.
Is Fausto Galetto serious about it, or is this some kind of (bad) joke? I mean, among psychologists there is this joke that you can evaluate the sanity of a person by the visual appearance of their writing (Schriftbild). There is no substantial argument beyond personal attacks against you. I guess this is the deleted profile from some older discussions of yours that I already found, Jochen Wilhelm. Looks like you had some "fun time" here on RG.
Nevertheless, I would like to see more of the method in detail, shown by Massimo. And maybe a bit more explanation. To pick a quote from Fausto Galetto in the open letter: "clarify statistical concepts in a way that everyone is able to understand".
Miky Timothy many thanks for the links, THIS really helps me to see the problem and understand what approach Massimo took.
Rainer Duesing , I am afraid that he was indeed serious. I wasn't his only target. He was actually against everyone, and we all had a hard time understanding him, which considerably limited the ability to learn from him.
Jochen Wilhelm, Rainer Duesing, Miky Timothy, George Stoica
It seems I have a lot to read before answering.
Thanks
I will read everything
Here is another link to the open letter, where I was able to download it; didn't get it via RG either.
https://www.academia.edu/35218153/Open_letter_to_Jochen_Wilhelm
Jochen Wilhelm, Rainer Duesing (thanks for the link), Miky Timothy, George Stoica
Reliability and MTTF - USAF.pdf Downloaded. To be read…
1. Wrong definition of reliability
2. Wrong definition of Hazard Rate
3. … will follow
Jochen Wilhelm, Rainer Duesing, Miky Timothy, George Stoica
Reliability and MTTF - USAF.pdf Downloaded. To be read…
1. Wrong definition of "bathtub hazard function"
2. Wrong definition of "repair rate"
3. Wrong definition of "MTBF"
4. … will follow
I think this is not the point. The links helped me (I didn't read the pdf) to understand where you are coming from. Instead of pointing out flaws in third-party work, please show how your approach to CIs improves the understanding and correctness of CIs. I think this is what the whole discussion was all about, wasn't it?
Jochen Wilhelm, Rainer Duesing, Miky Timothy, George Stoica
Reliability and MTTF - USAF.pdf Downloaded. To be read…
1. Nonsense definition of "Upper Control Limit"
2. Nonsense definition of "Lower Control Limit"
3. Wrong definition of "Reliability" AGAIN….
Without the data, it is impossible to assess the various analyses….
Massimo Sivo, “Without data, you're just another person with an opinion.”
- W. Edwards Deming
“If you can't describe what you are doing as a process, you don't know what you're doing.”
- W. Edwards Deming
Imagine you were an engineer working under Deming: you need to convince him that your procedure provides a benefit. Who cares if it is theoretically right or wrong? You can disagree with the US Air Force and say they don't know the definition of reliability; that's fine. But as Rainer Duesing asks, you need to explain your own process and justify its application, not just dogmatically but practically. Confidence intervals for an exponential distribution, given a sufficient n, have a relatively straightforward derivation. But explain how CIs inform the problem at hand.
I believe Jochen Wilhelm tried to go through this with Galetto. Unfortunately, Jochen Wilhelm is clearly an abrasive monster who doesn't understand QUALITY and corrupts the minds of the youth with his statistical lies on RG!!! Bring out the hemlock! :-) [joke, if not obvious] By the way, I was not the Timothy in Galetto's thread. I had no idea this had all been gone through before.
Importantly,
Do you think Deming would approve of Fausto Galetto's behavior? Do you think Deming would accept a conclusion because it was the product of Galetto's 'Golden Integral'? Why even bother doing statistics? Your job is to justify your test in the first place.
I wouldn't bother closely reading Ioannidis' piece, if you haven't already. The title is absolutely true but his thought experiment 'proof' is absurd and the 'more research, less thinking' approach is ironically the reason his title is true. Goodman and Greenland and others' responses are better reads.
Finally, on the topic of Deming, attached is an alternative to Ioannidis' hand-waving solution to broken science, from the common-sense perspective on research quality that the authors think Deming would take. Open access, but link below:
Article Deming, Data and Observational Studies
Jochen Wilhelm, Rainer Duesing, Miky Timothy, George Stoica
Here you find data: all items failed. Data Exponentially distributed.
Comparison between Exponential and CLASSIC (NORMAL). See the file…
At Politecnico the teacher showed the drawbacks of the definition Mean=Total/#data
The readers in the previous posts were expected to know the Theory: the total has the distribution of the sum of Exponentially Distributed Random Variables.
Jochen Wilhelm, Rainer Duesing, Miky Timothy, George Stoica
I forgot to say that, because the total has the distribution of the sum of Exponentially Distributed Random Variables, the Reliability system (associated with the Reliability test) behaves as a Stand-by System, with reliability R(Total, MTTF):
MTTF_Lower_Limit is the solution of the equation R(Total, MTTF_Lower_Limit) = alpha/2
MTTF_Upper_Limit is the solution of the equation R(Total, MTTF_Upper_Limit) = 1 - alpha/2
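Reading the stand-by system as having g units (one per observed failure), so that R(Total, MTTF) = P(fewer than g failures in time Total) = ppois(g - 1, Total/MTTF), these two equations can be solved numerically, e.g. in R with illustrative numbers:
g     <- 2;  Total <- 90;  alpha <- 0.05
R_sys <- function(mttf) ppois(g - 1, Total / mttf)               # stand-by reliability
MTTF_lower <- uniroot(function(m) R_sys(m) - alpha / 2,       c(1e-6, 1e6))$root
MTTF_upper <- uniroot(function(m) R_sys(m) - (1 - alpha / 2), c(1e-6, 1e6))$root
c(lower = MTTF_lower, upper = MTTF_upper)   # matches 2*Total/qchisq(..., df = 2*g)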
Sorry..
Miky Timothy
You say: “Confidence intervals for an exponential distribution, given a sufficient n, have a relatively straightforward derivation. But explain how CI's inform the problem at-hand.“
I do not understand the statement “… given a sufficient n … “. Do you mean that for large n=number of data, you use the Central Limit Theorem?
To your question "Do you think Deming would approve of Fausto Galetto's behavior?" I can only say, with Deming's words, "A figure without a theory tells nothing." I think that Deming would not approve of the "form" of Jochen Wilhelm's "Here is my personal favourite: Method Open letter to Jochen Wilhelm".
On the contrary, I am sure that Deming would have approved of the content of the paper Hope for the Future: Overcoming the DEEP Ignorance on the CI (Confidence Intervals) and on the DOE (Design of Experiments), which was considered "like a witch-hunt" by Jochen Wilhelm.
Btw, did you notice that he did not dare to offer any analysis of the data (in the example)… Perhaps there is a reason … explained by Deming's teaching.
Dear Massimo,
I asked several times for a demonstration that your method is better, and why it is better. Until now, you have shown us nothing. You may agree or disagree with the way Jochen tries to teach/educate other people, but at least he gives explanations of the questions/problems at hand (whether you like it or not). Therefore, people have a chance to understand it.
The "article" Hope for the Future is the most unscientific writing I have ever seen. It is just a personal attack. I haven't found any solutions or proposals on how to do it better (something regularly demanded by the author, who fails to provide it himself [deja vu?]). How to teach better. How to educate better. Besides that, it is an advertisement for his own book. And just a quick side note: it has been published in a predatory journal where you can get anything published if you pay the money. Such "work" would surely not have been published in a decent journal with peer review.
So, if you really want to make a difference, do it differently: provide explanations everyone can understand and stop the personal attacks. They are not only unscientific, but ridiculous. I was really here to educate myself, with the goal of doing things better in the future, but until now there was just nothing.
Miky Timothy
I read the paper of Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
NOBODY was interested…
Rainer Duesing
Dear Rainer,
I think that telling the readers:
Surely is more Scientific than writing:
Deming:
· Without theory, experience has no meaning.
· Anyone that engages teaching by hacks deserves to be rooked.
· The result is that hundreds of people are learning what is wrong. I make this statement on the basis of experience, seeing every day the devastating effects of incompetent teaching and faulty applications.
Jochen Wilhelm, Ette Etuk, David Eugene Booth, Rainer Duesing, Miky Timothy, George Stoica, Jiří Kroc, Harish Kumar Gupta
DEAR ALL,
RG suggested that I read the paper: "P-values – a chronic conundrum" by Jian Gao, BMC Medical Research Methodology (2020) 20:167 https://doi.org/10.1186/s12874-020-01051-6
I do not know if BMC is a predatory journal ….
In the paper I found:
It is logical to ask:
ONLY Sound Theory assures the Quality of Papers.
Does Peer Review assure the Quality of papers?
It is like setting a prior on the trustworthiness of the results. Is my prior trust in peer-reviewed articles higher than in self-published articles without any control: yes. Is this prior p = 1: no. Am I right with my prior all the time: no. Is peer review the ultimate assurance of quality: no. Is it possible to publish sound articles without peer review: yes.
So do not fall into the logical fallacy of a false dichotomy. Quality is gradual, not an either-or thing.
I think I am out of this discussion. I asked you several times to show the superiority of your method (maybe because I was not able to understand it), but you kept on ranting about things completely off topic (your topic, btw) and provided no substance. Sad to see. That is not the education you put on your flag.
Rainer Duesing
Dear Rainer,
I am very sorry that you repeated several times:
Let’s stand back and recap my initial Question (and statements in The American Statistician)
Is “Statistical Significance” outdated?
On the web I found that three scholars wrote (special issue of The American Statistician):
1. We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely.
2. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.
3. Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless.
During the discussion, I suggested (or I meant to suggest, and I was not clear) that the idea of "Statistical Significance" should be retained, with the use of the symbol CI(H0, alpha, dof, Distribution of the data) for the Confidence Interval.
I presented some examples:
IF you do not understand that, I do not know what to do.
IF the other participants in the thread cannot explain the idea to you better than me, I do not know what to do either.
IF none of the RG scholars can explain the idea to you better than me, I do not know what to do either.
I suggested using the concept of “System-associated to the Reliability Test” BUT it is not necessary IF people know the Theory of Statistics.
Cheers
Massimo Sivo, whenever I set out to understand a new statistical test or concept, I make a point of using real data that actually has relevance to me, both in my daily life and in its practical and verifiable application in my field. Your example is contingent on data following an exponential distribution. I don't find made-up examples of people receiving phone calls per hour or meteor strikes to be informative.
So I tried to think of what kind of phenomena I would confidently presuppose to be a Poisson process from my own experience. I really couldn't think of much. I don't work with Geiger counters. I do work with image analysis for biological microscopy applications, but here the goal would not be to use inference to compare Poisson-random noise, per se, but to identify and remove it without sacrificing data integrity. So if anything, my interest would likely be in evaluating whether an image parameter follows an exponential distribution in the first place, given an expected base rate of photonic noise. I am a biologist and most definitely not a physicist, so I am not the person to ask regarding these details.
Anyway, I am assuming that you work with real engineering data, for example something coming out of a FIAT factory. :-)
Deming, like George Carlin, is eminently quotable, with diminishing results, so here's my last mention of his philosophy. When Deming talked about "data" he didn't mean fake data like that found in many examples in DC Montgomery's book. Arguably, those are akin to the uninformed "figures" he disliked so much. Thus, you should use real data that has been empirically gathered and analytically determined to follow your distribution, before even embarking on a two-sample test in your example, never mind using it to prove some kind of generalizable truth in statistics.
When I spoke about confidence intervals given X ~ Exp(λ), I meant a sufficient n in terms of the weak law of large numbers. This is another way of saying that if you define your system in such a way, your assumptions allow for a deterministic approach with nearly exact estimates (others can correct me if I am wrong).
I encourage you to pose your questions on the Cross Validated forum at Stack Exchange, which is the best place, I think:
https://stats.stackexchange.com/
There you can discuss the merits of MTTF, the exact likelihood ratio test for two exponential distributions, and other statistical issues. I would be curious to know what you find out, since ResearchGate Q&A deals more with the methodological application of statistics in an experimental setting.
So my last comment in this discussion: in previous threads, you used examples from DC Montgomery to very assuredly make a point. Those examples are why I thought, initially, you were a fan of six sigma bad statistics. I've attached an image from 'Management Versus Science: Peer-Reviewers do not Know the Subject They Have to Analyse' showing what Fausto Galetto thinks of DC Montgomery's defining text. Your guess is hopefully better than mine.
Best of luck,
-Miky Timothy
EDIT: Added Massimo's previous comment for context.
Miky Timothy
Here you find a case with REAL DATA about Engine Belts failures (km).
It can be dealt with by the same ideas as before, not with the same formulae…
Thank you for the link; I will see….
M. Sivo
Jochen Wilhelm, Ette Etuk, David Eugene Booth, Rainer Duesing, Miky Timothy, George Stoica, Jiří Kroc, Harish Kumar Gupta
To the question
I tried to propose the use of the Confidence Interval CI (H0, alpha, dof, Distribution)=
Obviously, the CI must be computed in the CORRECT way.
It is WRONG to say (as some incompetents say) that one has to “clarify statistical concepts in a way that everyone is able to understand“ IF the methods are WRONG
See the file Likelihood and Confidence Interval
Massimo Sivo
I saw this discussion immediately after posting a note related to the topic. I am adding a link here that I had mentioned in my note.
https://www.sciencedaily.com/releases/2019/03/190321092229.htm
Thanks.
Dear RAD,
unfortunately, I could not download the paper.
Can you please send it?
Thank you
Statistical evaluations in psychology, for example, are misleading anyway, as they often lead to a simplistic understanding of human behaviour and provide no validity for understanding states of mind or complex situations. They also tend to be predictive, confirming what the researcher is looking for.
What we really want to know is "scientific significance", not "statistical significance". But the latter is often misinterpreted as the former.
Hening Huang
If a guy is "Intellectually Honest" he does not confuse the two concepts: "scientific significance" and "statistical significance".
There is a great difference between "Scientific Truth" (e.g. in Mathematics) AND "Scientific Law/Truth" (e.g. in Physics): the latter to be "Scientific" needs an IMPORTANT "statistical significance".
Statistics are ubiquitous in science, whether needed or not. However, there are reviewers who doggedly require statistical analysis of experimental results regardless of the situation. It's hard to argue with them. I used to like statistical inference. It was interesting during the PhD defense.
Many scientists suggest abandoning "statistical significance" or p-values because they often do more harm than good. Scientists can (and should be able to) find out whether their discovery is of "scientific significance" without using "statistical significance" or p-values.
Hening Huang
You say:
Do you think that the "mass difference between the proton and the neutron" has "scientific significance" without using "statistical significance"?
See the attached file that shows the data of a presentation about Minitab, on the web.
The speaker does not find Interaction Significance.
HOW could you decide if there is "scientific significance"?
Massimo Sivo,
What do you mean "interaction" here? Please explain. Thanks.
Are the 12 values shown the data points/raw values or already mean values for the cells of each combination of the factors A (catalyst) and B (days)?
First of all, "statistical significance" in a statistical sense is a (Fisherian) synonym for the P-value.
What you are referring to is probably hard thresholding at the infamous 0.05.
In this last sense there can be plenty of appropriate substitutes, because many tools like Bayes factors or information criteria can be used in a mechanical way as well, and be manipulated in a similar sense as p-values can, which is why we are talking so much about the replication crisis.
P-values are directly connected with the variation of the underlying sampling distribution, and do not necessarily tell you much outside of a randomized experiment.
That's why effect sizes and other measures are known to be more useful.
The main problem with NHST (what you refer to as statistical significance) is that researchers use it to avoid taking responsibility, i.e. "the p-value is small so my research is interesting/relevant".
Statistical significance was outdated from day one.
It was never meant to be used as a formal inferential method; it was proposed as a way to see if you should keep going with your experiment.
The problems with p-values and their mechanical use have been a matter of discussion since the 1940s.
People do not want to be bothered and learn new things: for many researchers p-values are the only ingredient of the recipe they need to get published.
Hening Huang , Rainer Duesing , Stefano Nembrini
Interaction should be a known concept.
The data are single data, NOT means.
Stefano, what would you decide IF you had the data in the file?
You say: People do not want to be bothered and learn new things
I NEVER was SCARED of new things...
This seems interesting; there are no degrees of freedom left to test the interaction if these are raw values.
Rainer Duesing
Interaction has 6 df (degrees of freedom), and it can be "split" into 6 parts with df=1 each.
Using the combined smallest parts as the residual, we can assess the "significant part" of the interaction...
Is this "scientific significance"?
Stefano Nembrini
Since you say " Statistical significance was outdated from day one"
Yes, but how do you test/interpret an interaction without taking into account the main effects (at least in the research I know and am familiar with, this does not make sense)? So you need an additional 2 + 3 df, which including the interaction makes 11 df, the same as the amount of information you get from the sample. Please show me how you would do it. I can't come up with a solution, since you do not have any variation within the cells. How can you determine the uncertainty in each cell?
If we assume that the measurements are accurate (no uncertainty), the plot shows an interaction between "Day" and "Catalyst" for C4 and C6, but no interaction between "Day" and "Catalyst" for C5. The experimenter, who knows his/her experimental set-up and measurement uncertainty, should be able to explain why and should provide a scientific inference. Nevertheless, this example shows that a statistical significance test could be misleading.
Hening Huang
Are you saying that no interaction exists between "Day" and "Catalyst C5" by "looking at graphs" or by "making a statistical analysis"?
Rainer Duesing
There is a Significant Effect of "Catalyst". [df=2]
There is a Significant Linear Effect of "Day". [df=1]
There is a Significant Effect of Interaction given by "Catalyst*Linear_Day".[df=2]
Massimo Sivo, I did a little bit of regression analysis on the data. Let me clarify. I mean that, if the data for Catalyst C5 are excluded, there is an interaction between "Day" and "Catalyst" (R^2=0.746). Again, I think the experimenter should be able to provide a scientific inference because he/she must know the physics behind the data (experiment). It doesn't make sense for others to just look at the data to provide a statistical inference.
We must go back to the roots! Statistical significance in physics and in biology (medicine) are two completely different entities. SS in physics is easy: atoms do not change their properties during the experiment, they are identical all the time, and they do not have internal parameters affecting the experiment.
Contrary to this, biological systems violate all the above-given assumptions. They are alive, change all the time, and change their physiology and epigenetic setup constantly.
From my understanding, gained from the prediction of heart arrhythmias, complexity measures and entropy are much better tools for capturing the internal setup and properties of living systems. The hypotheses can be found using AI.
Rainer Duesing
There was a piece of stupidly wrong information I gave 1 day ago:
There is a Significant Effect of "Catalyst". [df=2], SS=11.16667
There is a Significant Linear Effect of "Day". [df=1] SS=24.06667
There is a Significant Effect of Interaction given by "Catalyst_Linear*Day".[df=3] SS=4.50
(this should be similar to Hening Huang's findings)
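For illustration, one common way to fit a split of this flavour in R (the data below are invented, since the attached Minitab file is not reproduced here, and this is not necessarily the exact decomposition listed above; Day enters as a numeric linear trend):
dat <- data.frame(
  Catalyst = factor(rep(c("C4", "C5", "C6"), times = 4)),        # hypothetical labels
  Day      = rep(1:4, each = 3),
  y        = c(55, 60, 57, 58, 62, 60, 61, 63, 64, 65, 66, 69)   # hypothetical responses
)
fit <- aov(y ~ Catalyst * Day, data = dat)   # Day numeric: linear effect plus its interaction
summary(fit)   # Catalyst: 2 df, Day (linear): 1 df, Catalyst:Day: 2 df, residual: 6 df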
BTW Hening Huang I'm still waiting for a reply to the question
Jiří Kroc , Johannes W. Dietrich , Tanvir Singh
Can you, please, provide some documents showing that "entropy is a much better tool (than Statistics)" when you have collected data (in your field)?
Thank you
Massimo Sivo regarding your question:"Do you think that the "mass difference between the proton and the neutron" has "scientific significance" without using "statistical significance"?" My answer is "yes". The mass difference is a "physical quantity" that has nothing to do with "statistical significance".
Dear Massimo Sivo,
The validity of deploying statistical significance analysis is highly dependent on the type of data and the relation between the original variables. Simply put, it ensures that a relationship between two or more variables was not caused by chance. It is used to provide evidence concerning the plausibility of the null hypothesis, which hypothesizes that there is nothing more than random chance at work in the data. For instance, in data mining, statistical hypothesis testing is used to determine whether the variation in the experimental results obtained on a dataset with different techniques is statistically significant or not.
Hening Huang
Do you assess the mass difference WITHOUT any measurement? Do all measurements have the same value?
All of this is described in detail in the paper on the prediction of heart arrhythmias: drug-induced Torsades de Pointes arrhythmias in a rabbit model.
https://www.researchgate.net/publication/340460356_Application_of_Machine_Learning_and_Complexity_in_Medicine_Prediction_of_Drug_Induced_Arrhythmias_in_Rabbit_Model_up_to_One_Hour_Before_Their_Onset_early_version_requested_due_to_COVID-19_unfinished_m
This is an early preprint with an unfinished/rewritten entropy section and methods, and with all other sections except the results improved. It was released early due to COVID-19 and the use of chloroquine drugs to treat COVID-19, which can induce TdPs. The final version is a very detailed paper.
The general procedure is simple. Permutation entropy is applied to raw ECG recordings. The entropy reveals arrhythmias and predicts their onset well before anything is observed on the ECG. Artificial intelligence can "see" it.
The paper shows results and figures demonstrating that statistics (simple and advanced) are completely blind for this prediction.
There exists only one other paper on the prediction of arrhythmias from human data, achieved by deep learning. Hence, it seems that the prediction of arrhythmias is possible.
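For readers unfamiliar with the measure, a rough sketch of a permutation (ordinal-pattern) entropy calculation in R, on a simulated signal rather than real ECG data:
perm_entropy <- function(x, m = 3) {
  n <- length(x) - m + 1                        # number of ordinal patterns of length m
  patterns <- sapply(seq_len(n), function(i) paste(order(x[i:(i + m - 1)]), collapse = ""))
  p <- table(patterns) / n                      # relative frequency of each pattern
  -sum(p * log(p)) / log(factorial(m))          # Shannon entropy, normalised to [0, 1]
}
set.seed(42)
signal <- sin(seq(0, 20, by = 0.1)) + rnorm(201, sd = 0.2)   # stand-in for a recording
perm_entropy(signal)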