Use the Ballico paradox as a standard test for the validity of any method for computing measurement uncertainty


In 2000, Ballico reported a paradoxical behavior of the expanded uncertainty (EU) estimated with the GUM’s WS-t approach in a real-world application (WS stands for Welch-Satterthwaite). According to Ballico (2000), during a routine calibration and the associated uncertainty calculation at the CSIRO National Measurement Laboratory (NML), Australia, a thermometer was calibrated on two ranges: a 1 mK range (higher precision) and a 10 mK range (lower precision). He observed a counter-intuitive result: the estimated EU for the 1 mK range was 37.39, which was greater than 35.07, the estimated EU for the 10 mK range! This result was later named the Ballico paradox (Huang 2016). Ballico (2000) suspected that the paradox was due to a limitation of the WS formula.
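For readers who have not used it, the WS formula referred to above estimates the effective DOF of a combined standard uncertainty as nu_eff = u_c^4 / sum(u_i^4 / nu_i), where u_c^2 = sum(u_i^2). The sketch below is a minimal illustration of that formula only. The function name and the input values are hypothetical (they are not Ballico's data), and the sensitivity coefficients are assumed to be already folded into each u_i.

```python
import math


def welch_satterthwaite_dof(u, nu):
    """Effective DOF of a combined standard uncertainty (WS formula).

    u  : standard uncertainty components (sensitivity coefficients
         assumed already included)
    nu : DOF of each component; use math.inf for a component that is
         treated as exactly known (typical for Type B components)
    """
    if len(u) != len(nu):
        raise ValueError("u and nu must have the same length")
    u_c_squared = sum(ui ** 2 for ui in u)
    # A component with infinite DOF contributes zero to this sum.
    denominator = sum(ui ** 4 / nui for ui, nui in zip(u, nu))
    if denominator == 0.0:
        return math.inf
    return u_c_squared ** 2 / denominator


# Hypothetical budget: one Type B component (infinite DOF) and one
# Type A component estimated from 6 readings (5 DOF).
u_example = [3.0, 4.0]
nu_example = [math.inf, 5]
nu_eff = welch_satterthwaite_dof(u_example, nu_example)
print(f"effective DOF = {nu_eff:.2f}")
```

In the WS-t approach, this effective DOF is then used to look up a t-based coverage factor for the combined standard uncertainty.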

Hall and Willink revisited the Ballico paradox in 2001. They presented a calculation example and used Monte Carlo simulation to generate t-intervals with the effective degrees of freedom (DOF) estimated by the WS formula. The mean width of their simulated t-intervals showed some anomalous behavior. However, Hall and Willink (2001) did not resolve the Ballico paradox.

The Ballico paradox remained ignored and unresolved until 2016, when I revisited it and provided a resolution with a proposed WS-z approach (Huang 2016). I concluded that the Ballico paradox essentially invalidates the WS-t approach. However, the paradox is not due to the WS formula. The WS formula is valid for estimating the effective DOF; the paradox is due to the use of the t-interval in uncertainty estimation (Huang 2016). I revisited the Ballico paradox again in 2018 and 2019 and resolved it with two alternative methods (Huang 2018, 2019).

Nobel laureate Richard Feynman (1964) stated, ‘If a theory disagrees with experiment, it is wrong. In that simple statement is the key to science.’ Feynman’s statement can be interpreted to mean that theories should be tested against experiment and only against experiment (White 2016). Although frequentists and Bayesians disagree in their views and methodologies, both agree that a statistical method should be judged by the results it gives in practice (Jaynes 1976, Kempthorne 1976).

Therefore, I propose using the Ballico paradox as a standard test for the validity of any method for computing measurement uncertainty. That is, a statistical method, regardless of whether it is derived based on frequentist or Bayesian statistics, must resolve the Ballico paradox. Otherwise, the method is invalid.

References

Ballico M 2000 Limitations of the Welch-Satterthwaite approximation for measurement uncertainty calculations Metrologia 37 61-64

Feynman R P 1964 Almost Everyone’s Guide to Science ed J Gribbin (Hyderabad: Universities Press) (see also www.youtube.com/watch?v=b240PGCMwV0)

Hall B D and Willink R 2001 Does “Welch-Satterthwaite” make a good uncertainty estimate? Metrologia 38 9-15

Huang H 2016 On the Welch-Satterthwaite formula for uncertainty estimation: a paradox and its resolution Cal Lab the International Journal of Metrology 23 20-28

Huang H 2018 A unified theory of measurement errors and uncertainties Measurement Science and Technology 29 125003 https://doi.org/10.1088/1361-6501/aae50f

Huang H 2019 Why the scaled and shifted t-distribution should not be used in the Monte Carlo method for estimating measurement uncertainty? Measurement 136 282-288 https://doi.org/10.1016/j.measurement.2018.12.089

Jaynes E T 1976 Confidence intervals vs Bayesian intervals Foundations of Probability Theory, Statistical Inference, and Statistical Theories and Science Vol. II 175-257 Eds. Harper and Hooker (Dordrecht-Holland: D. Reidel Publishing Company)

Kempthorne O 1976 Comments on paper by Dr. E. T. Jaynes ‘Confidence intervals vs Bayesian intervals’ Foundations of Probability Theory, Statistical Inference, and Statistical Theories and Science Vol. II 175-257 Eds. Harper and Hooker (Dordrecht-Holland: D. Reidel Publishing Company)

White D R 2016 In pursuit of a fit-for-purpose uncertainty guide Metrologia 53 S107–24

Alejandro Oharriz Calderon

Hello Dr. Huang

I find your research very interesting and of great impact. In fact, I was unaware of this behavior. Have you tested the methods on data other than Ballico's? I ask mainly because we take the use of the Welch-Satterthwaite formula for granted and do not even question it when developing an uncertainty budget. I would be grateful if you could help me work through what you set out in your discussion with a simpler, more practical example.

Regards

Hening Huang

Alejandro Oharriz Calderon

Thank you for your comments. As I mentioned in the discussion, the Welch-Satterthwaite formula is valid for estimating the effective DOF; the Ballico paradox is due to the use of the t-interval in uncertainty estimation. That is, it is the t-interval method that has a serious problem: it produces paradoxical results when the sample size is small (say, 2, 3, 4, …). It is true that the t-distribution (or t-interval) has been considered the standard way to deal with small samples for over 100 years. The GUM and many statistics textbooks take the t-based uncertainty as the Type A expanded uncertainty. However, after working on the subject for over 12 years, I have concluded that all t-based methods for the Type A evaluation of uncertainty are flawed. This might be an astonishing statement. But, believe it or not, the t-distribution (or t-interval) is actually a barrier to realistic evaluation of uncertainty. All t-based methods for the Type A evaluation of uncertainty produce unrealistic or paradoxical results. I have found that the root cause of the problem is the so-called “t-transformation distortion”. The t-distribution (or t-interval) barrier must be removed. That is, all t-based methods for the Type A evaluation of uncertainty must be abandoned. Otherwise, the Type A evaluation of uncertainty will never be right, whether the approach is frequentist or Bayesian.

By the way, if you are interested in some of my papers on this subject but do not have access to them, please send requests through ResearchGate and I will send you copies privately (I cannot post some of the papers publicly because of copyrights).

Hening Huang

The table below shows a comparison of the results obtained with six methods for Ballico’s data (unit: mK).

Method                            1 mK range    10 mK range
GUM’s WS-t (Ballico 2000)         37.39         35.07
Kacker’s Bayesian (Huang 2018)    41.06         43.89
t-based MCM (Huang 2019)          38.53         41.06
WS-z (Huang 2016)                 25.75         29.34
Unified theory (Huang 2018)       25.95         29.95
z-based MCM (Huang 2019)          25.99         29.94

Note from the table that the first three methods (GUM’s WS-t, Kacker’s Bayesian, and t-based MCM) all produce unrealistic estimates of uncertainty; that is, they all artificially inflate the uncertainty. These three methods have one thing in common: their formulations are all based on the t-distribution or t statistic. In GUM’s WS-t approach, the t score is applied to the combined standard uncertainty. In Kacker’s Bayesian approach, the standard deviation of the t-distribution is applied to each of the Type A components. In the t-based MCM, the scaled and shifted t-distribution is assigned to each of the input quantities having Type A uncertainty. The results indicate that these three t-based methods are invalid.
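The ordering reversal that defines the paradox in the opening post (a larger EU on the higher-precision range) can be checked mechanically from the table values. The sketch below is only an illustration: the function and variable names are hypothetical, and the check covers the ordering alone, not the separate judgement about the magnitude of the t-based estimates.

```python
# Expanded uncertainties from the table (unit: mK), stored as
# (1 mK range, 10 mK range) for each method.
results_mk = {
    "GUM's WS-t (Ballico 2000)": (37.39, 35.07),
    "Kacker's Bayesian (Huang 2018)": (41.06, 43.89),
    "t-based MCM (Huang 2019)": (38.53, 41.06),
    "WS-z (Huang 2016)": (25.75, 29.34),
    "Unified theory (Huang 2018)": (25.95, 29.95),
    "z-based MCM (Huang 2019)": (25.99, 29.94),
}


def shows_ordering_reversal(eu_fine, eu_coarse):
    """True if the higher-precision range has the larger EU."""
    return eu_fine > eu_coarse


reversed_methods = [
    name
    for name, (eu_fine, eu_coarse) in results_mk.items()
    if shows_ordering_reversal(eu_fine, eu_coarse)
]

for name, (eu_fine, eu_coarse) in results_mk.items():
    flag = "reversal" if name in reversed_methods else "no reversal"
    print(f"{name}: {eu_fine} vs {eu_coarse} -> {flag}")
```

With the values in the table, only the GUM’s WS-t row shows the reversal; the other two t-based rows differ from the c4-based rows in magnitude rather than in ordering.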

Also note from the table that the other three methods (WS-z, unified theory, and z-based MCM) all produce realistic estimates of uncertainty. These three methods also have one thing in common: their formulations are all based on the bias correction factor c4 for the sample standard deviation. In the WS-z approach, c4 is applied to the combined standard uncertainty. In the unified theory, c4 is applied to each of the Type A components. In the z-based MCM, c4 is applied to the PDF for each of the input quantities having Type A uncertainty.
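For reference, c4 is the standard factor such that E[s] = c4·σ for the sample standard deviation s of n independent normal observations, with c4(n) = sqrt(2/(n-1)) · Γ(n/2) / Γ((n-1)/2). The sketch below only computes c4 for small sample sizes; the function name is hypothetical, and it does not reproduce how each of the three methods applies the factor (see the cited papers for that).

```python
import math


def c4(n):
    """Bias correction factor for the sample standard deviation.

    For n independent normal observations, E[s] = c4(n) * sigma,
    where s uses the (n - 1) divisor.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    # Use log-gamma so the ratio stays stable for large n.
    log_ratio = math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


# c4 is well below 1 for very small samples and approaches 1 as n grows.
for n in (2, 3, 4, 5, 10, 30):
    print(f"n = {n:2d}: c4 = {c4(n):.4f}")
```

The factor is furthest from 1 at the very small sample sizes (2, 3, 4) where the t-based results are said to be most distorted.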
