It will be interesting to see how researchers will argue what effect is "biologically meaningful" and what effect is not.
And if it were really clear what a minimally biologically meaningful effect is, why are the data not tested against this effect (rather than against a "zero effect")?
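To illustrate what such a test could look like: a minimal sketch, assuming a hypothetical minimal meaningful effect of 2 and made-up data (the threshold, the data, and the use of SciPy's `alternative` keyword, available since SciPy 1.6, are all my assumptions for illustration, not anything proposed in the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=2.5, scale=1.0, size=20)  # hypothetical measurements

# Conventional test against a "zero effect"
_, p_zero = stats.ttest_1samp(data, popmean=0.0)

# Test against a minimal biologically meaningful effect instead:
# alternative="greater" asks whether the effect exceeds the threshold
delta = 2.0
_, p_delta = stats.ttest_1samp(data, popmean=delta, alternative="greater")

print(f"H0: effect = 0  -> p = {p_zero:.4f}")
print(f"H0: effect <= {delta} -> p = {p_delta:.4f}")
```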
Here they make a crucial mistake:
"The rationale for providing a second, independent set of observations is [...] to consider whether the magnitude of that more precise estimate is biologically relevant".
As I understand it, this is nonsense. What a relevant effect is is not in the data; it lies in the understanding of the phenomenon and is based on external reasoning. Nothing tells me whether an effect of, say, 1.5 (of whatever; I deliberately give no unit or context) is biologically relevant, no matter whether the confidence (or credibility) interval is 0...3 or 1.49...1.51. Only if I have an a priori reason to say that effects > 2 are biologically relevant, whereas effects < 2 are not, does the interval become informative: the wide interval is inconclusive, as those data would still be compatible with both relevant and irrelevant effects, but the second, more precise interval would be conclusive (the effect would not be relevant). Honestly, show me one scientist who has to write grant applications who would, in light of this precise estimate, be willing to say that effects lower than 2 are surely not relevant.
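To make this decision logic concrete, here is a minimal sketch using the threshold and the two intervals from my example above (all of them invented for illustration):

```python
def relevance_verdict(ci_low, ci_high, threshold):
    """Classify a confidence interval against an a-priori relevance threshold."""
    if ci_low >= threshold:
        return "conclusive: relevant"
    if ci_high <= threshold:
        return "conclusive: not relevant"
    return "inconclusive: compatible with relevant and irrelevant effects"

print(relevance_verdict(0.00, 3.00, threshold=2))  # wide interval   -> inconclusive
print(relevance_verdict(1.49, 1.51, threshold=2))  # precise interval -> not relevant
```

Without the external threshold, neither branch of this logic can even be evaluated; precision alone settles nothing.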
I am afraid that this suggestion will cost more animal lives and cause unpleasant experiences for unnecessarily many more subjects.
And there is another problem:
"However, rather than jumping to the conclusion that the inferences of the estimation stage were “false” (Ioannidis, 2005), this two-step procedure might shift the emphasis toward precisely estimating the magnitude and direction of an effect (“how much”) and away from a dichotomous (“Does the effect exist or not?”) question (Calin-Jageman and Cumming, 2019)."
A test cannot give the conclusion that no effect exists. The dichotomy "effect exists vs. effect does not exist" has nothing to do with the testing procedure; it is a deeply wrong way to understand what tests actually do. Failing to reject the tested hypothesis does not mean that the hypothesis (better: the state of nature that is modelled by this hypothesis) is true. It only says that the data are not sufficient to conclude on which side of the tested value the true effect lies.
This fact becomes quite clear (I think) when you are aware that the hypothesis you test does not need to be "zero effect". It can be any other effect, too. So if, instead of testing effect = 0, you test effect = (some very small non-zero value), you will (most likely) get a very similar "non-significant" result. Now, it cannot be true at the same time that the effect is zero and that the effect is different from zero. In fact, a 95% confidence interval is usually constructed by inverting a significance test: it is the interval of all hypotheses that cannot be rejected at the 5% level. By the faulty logic proposed above, this would be a whole interval of hypotheses all considered true at the same time.
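A small simulation makes this concrete. Assuming made-up data (sample size, true effect, seed, and grid range are all arbitrary choices of mine), testing against 0 and against a small non-zero value gives nearly identical p-values, and collecting all null values that are not rejected at 5% recovers the usual 95% t-interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=0.2, scale=1.0, size=15)  # hypothetical, small true effect

# Testing effect = 0 and effect = 0.05 gives nearly identical p-values
for h0 in (0.0, 0.05):
    _, p = stats.ttest_1samp(data, popmean=h0)
    print(f"H0: effect = {h0:<4} -> p = {p:.3f}")

# Invert the test: the set of all hypothesized values not rejected at 5% ...
grid = np.linspace(-2, 2, 4001)
not_rejected = [h0 for h0 in grid
                if stats.ttest_1samp(data, popmean=h0).pvalue > 0.05]
print(f"non-rejected values: [{min(not_rejected):.3f}, {max(not_rejected):.3f}]")

# ... matches the classical 95% confidence interval
m, se = data.mean(), stats.sem(data)
lo, hi = stats.t.interval(0.95, df=len(data) - 1, loc=m, scale=se)
print(f"95% t-interval:      [{lo:.3f}, {hi:.3f}]")
```

If "non-significant" meant "the tested effect is zero", every value in that interval would be "the" true effect simultaneously, which is absurd.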
I am sure that a better understanding of what tests do (and what they don't do) would already mitigate many of the problems. The rest of the problem is not of a statistical nature. Biological relevance has to be investigated. This might require thinking outside the box. And in many cases the honest answer likely is: "we don't know".