As we know, a lot of debate is going on regarding the current p-value threshold for declaring statistical significance (P < 0.05); some peer-reviewed journals have even stopped publishing papers with p-values. Recently, a more stringent threshold (P < 0.005) was proposed by a number of leading statisticians for declaring statistical significance for a new discovery (I believe the paper will appear in Nature Human Behaviour; the prepublication version can be accessed at https://scholar.harvard.edu/files/dtingley/files/sig-naturehumanbehaviour.pdf). This seems one step forward in addressing the issues with p-values (especially the reproducibility of findings), but how this proposal will be adopted in clinical trial design, particularly in handling the trade-off between type I and type II errors, is debatable. For example, to maintain the widely accepted 80% power, the sample size needs to be increased by about 70% under the new type I error (alpha = 0.005); alternatively, keeping the same sample size as under the standard approach (alpha = 0.05, two-sided test) would reduce the power to below 50%. I think the debate over the p-value threshold for statistical significance is not going to stop here, and this paper will spark plenty of new discussion, as long as the balance between budget (resources) and statistical power (the probability of finding an effect if it is there) remains the main factor in clinical trial design.
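For anyone who wants to check those numbers, here is a minimal sketch (Python, scipy only) using the usual normal approximation for a two-sided, two-sample test; the standardized effect size of 0.5 is just an illustrative assumption:

```python
from scipy.stats import norm

def n_per_group(alpha, power, effect_size):
    """Approximate n per group, two-sided two-sample test, normal approximation."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

d = 0.5  # assumed standardized effect size (illustrative only)
n_05 = n_per_group(0.05, 0.80, d)
n_005 = n_per_group(0.005, 0.80, d)
print(f"n per group at alpha=0.05:  {n_05:.0f}")
print(f"n per group at alpha=0.005: {n_005:.0f}")
print(f"relative increase: {100 * (n_005 / n_05 - 1):.0f}%")      # roughly 70%

# Power if the alpha=0.05 sample size is kept but the test is done at alpha=0.005:
z_beta = d * (n_05 / 2) ** 0.5 - norm.ppf(1 - 0.005 / 2)
print(f"power at alpha=0.005 with the old n: {norm.cdf(z_beta):.2f}")  # just under 0.50
```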
One size doesn't fit all. That was true for 0.05 and it will be true for any other threshold.
Changing a threshold does not solve the three most severe problems:
1) that researchers do not know what a p-value is
2) that A/B testing and hypothesis testing are still illogically intermingled, and that there is rarely any sensible argument given for why the power and size of tests are chosen as they are
3) that small experiments are repeated until they happen to show what the researcher wants, and these cases are published (ignoring the undesired outcomes)
I entirely agree with Jochen Wilhelm that changing the threshold does not solve the three main problems he presents. I will add some considerations about them.
Statistical inference involves two procedures. First, we need to verify the validity of several hypotheses on the population studied: this procedure is called hypothesis testing. Second, we need to estimate the population’s characteristics and conduct statistical tests to see if their effect is significant. The two procedures are actually very similar and closely interlinked: both aim to draw conclusions about the total population from information on the sample alone.
To this end, let us first examine the approach used by Karl Pearson (1900) and R. A. Fisher (1923, 1935). We want to determine whether a given factor influences the phenomenon under study or not. We shall estimate parameters linking the factor to the phenomenon. At this point, a question arises: can we explain the values of the estimated parameters by chance alone, or does the factor studied also play a role? Here the "Type I error" appears: it is the error we commit when wrongly rejecting the hypothesis that the observations can be explained by chance alone. The authors listed above devised hypothesis tests to verify whether the factor does or does not influence the phenomenon examined.
We can interpret the tests in strictly frequential terms. Suppose a population in which the hypothesis we want to test is true. Let us assume that we draw a large number of samples at random from this population, in the same conditions as the sample already selected. Some of these samples will be very rare, others far more frequent. If the probability of the sample drawn is too low—say, under 0.05—we shall reject the hypothesis. The solution consisting of a large number of draws—not actually performed, but supposed—does indeed enable us to work with probabilities open to a frequential interpretation. At this point, we are no longer examining the probability of a hypothesis, but only the probability of obtaining a particular sample, if the hypothesis is true.
In the wake of the authors above, Neyman and Pearson (1933) observed that another type of error can occur at the same time, which the "Type I error" leaves aside. This other type of error, called the "Type II error", is the one incurred by wrongly rejecting the opposite hypothesis, namely, that the observations cannot be explained by chance alone. We must estimate both types of error to obtain a more robust conclusion: when we guard against one, we necessarily increase the probability of the other if the information remains the same. However, this second risk is far more complex to analyze, for the contrary hypothesis actually comprises an infinity of possible deviations from chance: strictly speaking, therefore, we should compute an infinity of Type II error probabilities. That is why probabilists very often simply assign a low value to the Type I error, setting the Type II error aside. In any event, accepting a hypothesis after subjecting it to a statistical test does not mean that we declare it to have been verified, but only that we choose to act as if it were.
Often, the reasoning that we have just used to obtain a frequentist statistical inference from observed data is interpreted incorrectly. Let us take the statement that the 95% confidence interval for an unknown parameter, t (such as the mean age at first childbirth, in the French 1920 birth cohort, estimated from a representative sample of that cohort), lies between two values t1 and t2. This appears to indicate that the parameter has a 95% probability of lying in that interval. But that is incorrect, for we can apply the interval only to the parameter’s estimation and not to the parameter itself, which is unknown.
We would actually like to answer the following question: what is the probability that the unknown parameter lies in a given interval? But all we can state is that, if we draw many samples of identical size and build such an interval around the mean of each sample, then we can expect 95% of the resulting confidence intervals to contain the unknown parameter. That is an answer to a far more complex question than the first one, which seemed much clearer but has no answer in frequentist theory. The question actually answered is the following: if we draw a large number of different samples, N, in how many of them, n, will the constructed interval contain the unknown parameter? As the analysis is usually confined to a single sample, this question is of little practical interest.
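To make this frequentist reading concrete, here is a small illustrative simulation (Python/numpy); the true mean, spread and sample size are arbitrary choices, not estimates from any real cohort:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
true_mean, sigma, n, n_samples = 24.5, 4.0, 100, 10_000   # arbitrary illustrative values
z = norm.ppf(0.975)

covered = 0
for _ in range(n_samples):
    sample = rng.normal(true_mean, sigma, n)
    half_width = z * sample.std(ddof=1) / np.sqrt(n)
    m = sample.mean()
    covered += (m - half_width <= true_mean <= m + half_width)

# Roughly 95% of the intervals built this way contain the (here known) true mean.
print(f"coverage over {n_samples} repeated samples: {covered / n_samples:.3f}")
```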
Let us go further and try to establish more specifically what objectivist statistical inference can demonstrate. Suppose we want to see whether a given factor influences the phenomenon studied or not: for example, we want to determine if the fact of being a farmer influences a given sub-population’s probability of migrating. The question then becomes whether the differences between the estimated probabilities of migrating for farmers and the rest of the population can be explained by chance alone or whether they diverge significantly. If these probabilities prove to be different at a preset limit, for example at a 1% “significance level,” then we can conclude that they do diverge, since the result observed is hardly likely to have been obtained by chance. We therefore interpret these probabilities here in frequential terms, by imagining a population larger than the one observed, from which we can draw many sub-populations at random, including the one observed. If the probability of the observed sample is too low, we shall reject the tested hypothesis. This finding is consistent with our earlier statement: objectivist statistical inference makes it possible to test the probability of obtaining the observed sample—if the hypothesis is true—but not the probability of the hypothesis itself, which is either true or false (Matalon, 1967).
References
Fisher R.A. (1923). Statistical tests of agreement between observation and hypothesis. Economica, 3, 139-147.
Fisher R.A. (1935). The logic of inductive inference. Journal of the Royal Statistical Society, 98, 39-82.
Matalon B. (1967). Epistémologie des probabilités. In J. Piaget (Ed), Logique et connaissance scientifique (pp. 526-553). Paris, Gallimard.
Neyman J., Pearson E.S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289-337.
Pearson K. (1920). The fundamental problem of statistics. Biometrika, 13, 1-16.
I agree with the other answers that changing the threshold doesn't address the main problems with relying on p-values.
A.
I actually don't always agree with a lot of the critiques of p-values that I've read. They seem to criticize the use of p-values because researchers don't understand what they mean or how to use them, or because p-values aren't things they were never meant to be (like effect sizes). To me, none of this is a criticism of p-values themselves. It's a demonstration of the need to better educate our researchers and students, to emphasize the need to look at effect sizes and other practical information, and to understand that the results of one study don't prove something.
Changing the threshold doesn't address the main problem.
B.
I come from a background in agriculture and natural resources. In my experience in this field, a p-value threshold level of 0.05 is pretty reasonable.
1) In some cases it's reasonable to remember that we aren't curing cancer or preventing train derailments, so we can often accept some type-I errors. In some cases, we would choose to live with higher type-I errors to avoid type-II errors. In some cases, it's the opposite.
2) We usually don't have huge sample sizes, and there is plenty of natural variation in natural systems, so a p-value below 0.05 is often a good gauge of something interesting going on.
3) We are often dealing with measurements that are meaningful to our readers, such as corn yield in kilograms per hectare or concentration of a pollutant in milligrams per liter. And there are often understood costs associated with different treatments. So we are not likely to be deceived by a small p-value that points to a treatment with a small practical difference or a greater cost for a small difference.
I agree with the comments here. I'll just reinforce that training appears to be one of the major issues. There should be more emphasis placed on issues such as sample sizes, study design, poor data manipulation and outliers, power analysis, pilot studies, replication, and appropriate types of analyses.
There has been discussion about this with respect to the 5-sigma rule for "discoveries" in high-energy physics. Echoing the comments above, having a more stringent alpha could make people think that it is okay to play the p-value game to reach whatever value is required.
It is worth stressing that a lot of the problems relate to people using p-values to reach dichotomous decisions (decisions that NHST is not well matched to). Fisher gave different values different meanings: see Box 1 of https://www.researchgate.net/publication/228676951_Ten_statisticians_and_their_impacts_for_psychologists. One of the biggest problems in communicating statistics is trying to convey uncertainty, and phrases that convey certainty, like "reject", are problematic.
I think the influence of the p-value will continue; it is not going to disappear overnight, as it is well established in the public arena. Maybe in academia and research labs the change will come slowly, but in other areas such as industry it will take even longer. If your p-value crosses that threshold (P < 0.05), it is more likely that you can sell your product/drug or attract investment. After some peer-reviewed journals started banning p-values, the ASA organised an experts' discussion, and they pointed out most of the caveats (issues) with the p-value, but they didn't suggest any alternative; that is why the new proposal came up with the stringent threshold (P < 0.005).
I feel like just opening a discussion without context (e.g., links to and summaries of the arguments that have already been made for and against) is just a recipe for rehashing the same points that have already been said and already been addressed (at, e.g., http://sometimesimwrong.typepad.com/wrong/2017/07/alpha-wars.html and other places).
@Stephen Politzer-Ahles, you are right that there is a lot of discussion going on about this newly proposed p-value threshold, but not much has been said with respect to clinical trial design, which has a direct impact on my daily work. If a sponsor has a budget constraint, I can't simply advise increasing the sample size by 70% on the basis of this proposed type I error, or accept compromising the power of the study. If sponsors have to increase the sample size by 70%, imagine the impact on the overall cost of the study. Some sponsors would likely prefer to drop a study rather than substantially increase the budget for one whose outcome is very difficult to predict. The question is how we can balance the budget and the power of the study when designing clinical trials. If we simply adopt this new type I error, we will discourage many sponsors (who have potentially valuable products) from testing their products/drugs.
But that is exactly the rationale behind Neyman/Pearson's A/B testing procedure. One has to have some loss function, and based on that one can identify the values for alpha and beta (the size and power of the test) that are required to minimize the expected loss (or maximize the expected win, or at least make sure that the expected win is positive).
It is rather irrational to set some alpha and beta without having a loss function. This is why "standard" thresholds are rather nonsensical.
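To illustrate what I mean, here is a minimal sketch (Python) of choosing alpha by minimizing an expected loss; the error costs, the assumed effect size, the sample size and the prior probability that a real effect exists are all placeholder values that a real application would have to justify:

```python
from scipy.stats import norm

def expected_loss(alpha, n_per_group, effect_size, p_h1, cost_fp, cost_fn):
    """Expected loss for a two-sided two-sample z-test (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    power = norm.cdf(effect_size * (n_per_group / 2) ** 0.5 - z_alpha)
    beta = 1 - power
    return (1 - p_h1) * alpha * cost_fp + p_h1 * beta * cost_fn

# The "best" alpha depends entirely on these (hypothetical) inputs:
for a in (0.10, 0.05, 0.01, 0.005, 0.001):
    loss = expected_loss(a, n_per_group=64, effect_size=0.5,
                         p_h1=0.5, cost_fp=1.0, cost_fn=1.0)
    print(f"alpha = {a:<6} expected loss = {loss:.3f}")
```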
In 2015 the editorial board of Basic and Applied Social Psychology declared that anyone wanting to publish in the journal should avoid any reference to p-values. The Benjamin et al. manifesto triggering this thread does not go that far, but advocates ten-times-tougher standards before statistical significance can be claimed. To motivate their conclusions, the authors compare p-values with Bayes factors, saying that with prior odds at least as great as 1:5 in favour of H1 the conventional 0.05 threshold provides too weak an evidence against H0. B. Efron in his JRSS 2015 paper shows that Bayesian credible intervals are basically equivalent to frequentist confidence intervals when using convenience or reference vague priors. With prior odds closer to 1, the curve describing the false-positive-rate/power relationship shown in Benjamin et al.'s Fig. 2 would look lower for p = 0.05, with a minimum false positive rate not that far from the nominal 5%.
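A back-of-envelope calculation (Python) of the relationship behind Benjamin et al.'s Fig. 2 may help here: the false positive rate implied by a given threshold depends on the prior odds of H1 and on the study's power. The numbers below are purely illustrative:

```python
def false_positive_rate(alpha, power, prior_odds_h1):
    """P(H0 | significant) = alpha * P(H0) / (alpha * P(H0) + power * P(H1))."""
    p_h1 = prior_odds_h1 / (1 + prior_odds_h1)
    p_h0 = 1 - p_h1
    return alpha * p_h0 / (alpha * p_h0 + power * p_h1)

for prior_odds in (1 / 10, 1 / 5, 1):
    for alpha in (0.05, 0.005):
        fpr = false_positive_rate(alpha, power=0.8, prior_odds_h1=prior_odds)
        print(f"prior odds H1:H0 = {prior_odds:>5.2f}, alpha = {alpha}: "
              f"false positive rate = {fpr:.3f}")
```

With prior odds near 1 the false positive rate at alpha = 0.05 comes out close to the nominal 5%, which is essentially the point made above.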
@Jochen Wilhelm, that is interesting, but how could you define that loss function when it involves several varying parameters, unless you set some of them as constants?
@Luca Mancini, you are right to mention Bayes factors, and I will try to go further in this direction.
Confidence intervals and p-values cannot answer the question "What is the probability that the unknown parameter lies in a given interval?" As we already showed, they can only answer a far more complex question whose relevance is not self-evident.
Subjectivist statistical inference, with the notion of exchangeability (de Finetti, 1937), enables us to answer the question directly, under certain assumptions of course, but these can be clearly stated. It is the researcher who, given his or her subject and the available information, can say whether the events studied are exchangeable, conditionally exchangeable or non-exchangeable. Once the prior has been specified and the posterior distribution computed, one can determine α-credible intervals (Robert, 2006) in which the parameter lies with probability 1-α conditional upon the observations. Such a credible interval really is an interval in which the statistician is justified in thinking that there is, say, a 95% probability of finding the unknown parameter.
References
Finetti de, B. (1937). La prévision : ses lois logiques, ses sources subjectives. In Annales de l’Institut Henri Poincaré, 7, Paris, pp. 1-68.
Robert, C.P. (2006). Le choix bayésien. Paris : Springer-Verlag France.
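To make the contrast concrete, here is a tiny numerical illustration (Python/scipy) for a binomial proportion with a flat Beta(1,1) prior; both the prior and the data are chosen only for illustration. The posterior answers the probability question directly:

```python
from scipy.stats import beta

successes, trials = 27, 100          # hypothetical data
posterior = beta(1 + successes, 1 + (trials - successes))

lower, upper = posterior.ppf(0.025), posterior.ppf(0.975)
print(f"95% credible interval for p: ({lower:.3f}, {upper:.3f})")

# Probability that p lies in some interval of interest, e.g. (0.2, 0.3):
print(f"P(0.2 < p < 0.3 | data) = {posterior.cdf(0.3) - posterior.cdf(0.2):.3f}")
```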
@Michael
Defining a (sensible) loss function is really difficult and surely depends on the subject and on the context that is deemed relevant. That requires real experts in the field.
@Daniel
I just want to add a point: it is often argued that conclusions from Bayesian analyses are subjective and that this is a bad thing. This is possibly thought because one assumes that the priors are chosen to "fit the prior beliefs of the researcher", thereby making the desired outcome more credible. But in fact a Bayesian analysis allows one to use the prior of a reasonably sceptical subject and thus to infer whether the evidence from the data is sufficient to convince a sceptic. So a result like "the probability of an effect is 92%" may be subjective (depending on the subjective prior used), but if that statement is based on a sceptical prior, it should convince me to believe in an effect even if I was sceptical before seeing the data from the study.
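As a small illustration of that idea (Python/scipy), one can combine a sceptical prior centred on "no effect" with an observed estimate and ask how probable a positive effect is afterwards; the effect estimate, its standard error and the prior standard deviations below are all hypothetical:

```python
from scipy.stats import norm

def posterior_prob_positive(estimate, se, prior_sd):
    """Normal prior N(0, prior_sd^2) combined with a normal likelihood -> P(effect > 0 | data)."""
    post_var = 1 / (1 / prior_sd**2 + 1 / se**2)
    post_mean = post_var * (estimate / se**2)
    return norm.cdf(post_mean / post_var**0.5)

# Hypothetical trial result: observed effect of 2.0 units with standard error 0.8.
for prior_sd in (0.5, 1.0, 5.0):     # 0.5 = very sceptical, 5.0 = weakly informative
    p = posterior_prob_positive(2.0, 0.8, prior_sd)
    print(f"prior sd = {prior_sd}: P(effect > 0 | data) = {p:.3f}")
```

The more sceptical the prior, the more the same data are discounted, which is exactly the kind of statement a sceptical reader can act on.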
@Jochen
Yes, I agree with your point.
In order to go further, I think that there are two kinds of Bayesian approaches. I will call the first subjectivist (the one followed by de Finetti and Savage); it considers any prior probability, defined between 0 and 1, acceptable as long as this individual choice is coherent. The other can be called logical (followed by Harold Jeffreys, Richard Cox and Edwin Jaynes); it is not concerned with anybody's personal opinions but with specifying the logical prior information we have in the context of the current problem. A property of consistency is therefore necessary to this approach, as it is for any rules of logical inference; the property of coherence then follows automatically. To me the second approach appears preferable.
@Daniel,
I am still puzzled how logical probability (or priors) can be defined. I think you don't mean Jeffreys priors (which are invariant under reparameterizations, but available essentially only for one-parameter cases). The term "logical probability" (possibly in another context) is known to me from Laplace and is based on symmetry considerations (of which you will surely be well aware). But that applies only to assigning probabilities in situations where physical knowledge of the sampling process gives no hint that any of the possibilities would be more probable than any other. Even in that particular case, the assignment of equal probabilities is subjectivist, because it is based on the subjective (lack of) knowledge about the things that determine the sampling.
Consistency is guaranteed in the process of statistical inference, in the sense that (i) the same data lead to the same change in beliefs, no matter in what order the data are considered, and (ii) different beliefs are modified towards a common point (that is, there is always an amount of data that makes any initial differences in beliefs negligible, i.e. coherence).
I would be grateful to hear your opinion or explanation.
@Jochen
Yes, logical probability is known from Laplace and is based on symmetry considerations. It has been developed by Jeffreys, Cox and, more recently, by Knuth. I will not develop Knuth's approach here, but will give you some idea of how he provides a simple and clear foundation for finite inference that unites and significantly extends the approaches of Kolmogorov and Cox. I think it is necessary to read his paper on the foundations of inference (2012) in more detail to understand clearly what he proposes.
His starting point is to propose a synthesis not only of probability theory, but also of information theory and entropy. He begins by noting that probability and entropy describe our state of knowledge about both physical and social systems, but do not describe those systems themselves. From this observation, he shows that the theory of partially ordered sets (posets) and general lattice symmetries make it possible to unify the frequentist and epistemic logicist approaches to probability, and to unify information theory and entropy. However, it does not allow the inclusion of subjective probability theory, in which the degrees of belief may no longer be "rational"; in other words, everyone is free to hold his or her own opinion.
I will be very interested to have your reaction about this paper.
Reference:
Knuth, K., Skilling, J. (2012). Foundations of inference. Axioms, 1, 35-73.
I think that clinical trials feed a massive publishing machine: when a sample of N = 30 is treated as if it were approximately equivalent to an infinite sample, it becomes too easy to set up a trial: just collect 30 patients with the same diagnosis and one paper is ready.
Now, if the p
Agreed. Too many researchers, who should know better, do not understand what a P-value means. In the social sciences, my colleagues too often confuse "significant" with "important", and appear confused when I remind them that a "significant" p-value just means that the difference between two means is probably not zero, with several underlying assumptions ignored. Changing an acceptable p-value to 0.005 (1 in 200) instead of 0.05 (1 in 20) is liable to cause the same mistakes and misinterpretations.
Recently, as a reviewer, I sent back for revision a paper submitted to a well-regarded journal that ran a t-test comparing men and women on a 1-6 rating scale. Sample sizes were in the thousands, and of course the means were "significantly" different, well beyond p=0.005. A mean of 3.26 is different from a mean of 3.30. On a 1-6 rating scale. I had access to the same data and found two things: (1) eta-squared showed that gender accounted for less than a quarter of a percent of the variation in the scale; (2) a Monte Carlo simulation of paired comparisons showed that men scored higher than women 30% of the time, and women scored higher 27% of the time; the rest of the time they tied. This is analogous to the Common Language Effect Size (CLES). Based only on the p-value of his t-test, the author concluded that "generally men score higher than women on ..." After several exchanges, the author eventually, reluctantly, removed the offending section and the otherwise-good paper has been published. I only just learned that the author is a demigod in his area! Horrified.
We need to move away from reliance on simple p-values alone. Bayesian methods are a good alternative, but there are too many of us, me included, who were not brought up on them and don't understand them. But we can make small steps: effect sizes should be required in any report, whether R-squared, eta-squared, CLES, or whatever is appropriate for the audience and the data.
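To illustrate the pattern I described with simulated (hypothetical) data: with samples in the thousands, a difference of 0.04 on the rating scale will typically come out "significant", yet eta-squared and a CLES-style comparison show how little it matters. A rough Python sketch (group size, within-group SD and the random pairing are all assumptions for illustration):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n = 20_000                                   # hypothetical group size
men = rng.normal(3.30, 1.0, n)               # within-group SD of 1.0 assumed
women = rng.normal(3.26, 1.0, n)

t, p = ttest_ind(men, women)
print(f"t = {t:.2f}, p = {p:.2g}")           # typically a tiny p-value

# Eta-squared: share of the total variance explained by group membership.
pooled = np.concatenate([men, women])
grand = pooled.mean()
ss_between = n * (men.mean() - grand) ** 2 + n * (women.mean() - grand) ** 2
ss_total = ((pooled - grand) ** 2).sum()
print(f"eta-squared = {ss_between / ss_total:.4f}")       # well under 1%

# CLES-style check: how often a randomly drawn "man" scores higher than a "woman".
pairs = 100_000
wins = (rng.choice(men, pairs) > rng.choice(women, pairs)).mean()
print(f"P(random man > random woman) = {wins:.2f}")       # close to 0.5
```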
Unfortunately, I don't think that people unable to understand p-values can easily understand Bayesian methods. In my opinion, the only solution is to make people understand that they need to work together with statisticians. When I have high fever, I call a doctor. To each, his/her job.
Using Bayesian methods sensibly would require thinking about the data, which is exactly what many researchers try to circumvent. Substituting methods labelled "Bayesian" for the common (mal-)practice will presumably only lead to different words for the same old (mal-)practices.
For instance, key issues in using Bayes factors are:
- how is H1 selected? It would not make sense to use the MLE...
- how is the factor interpreted? It is likely that another standard will be adopted (like 0.05 for the p-value), irrespective of the sensible priors that should be considered... (a small numerical sketch follows the links below)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4809350/pdf/nihms768589.pdf
https://www.nature.com/polopoly_fs/1.17412!/menu/main/topColumns/topLeftColumn/pdf/520612a.pdf
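As a small illustration of the prior-sensitivity point above (Python/scipy): a Bayes factor for a point null on a binomial proportion against different priors on H1. The data are hypothetical; the point is simply that the factor moves with the choice of prior on H1.

```python
import numpy as np
from scipy.stats import binom
from scipy.special import betaln, comb

k, n = 60, 100                      # hypothetical data: 60 successes out of 100

def bf01(k, n, a, b):
    """Bayes factor for H0: p = 0.5 against H1: p ~ Beta(a, b)."""
    log_m0 = binom.logpmf(k, n, 0.5)                                    # marginal under H0
    log_m1 = np.log(comb(n, k)) + betaln(k + a, n - k + b) - betaln(a, b)  # marginal under H1
    return np.exp(log_m0 - log_m1)

for a, b in [(1, 1), (10, 10), (50, 50)]:
    print(f"H1 prior Beta({a},{b}): BF01 = {bf01(k, n, a, b):.2f}")
```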
Excellent discussion. I think that the "p-value problem" will not be overcome until the editorial system that currently exists in most journals changes (truly changes). I recall the recommendation of the International Committee of Medical Journal Editors (ICMJE): "Avoid relying solely on statistical hypothesis testing, such as the use of p values, which fails to convey important information about effect size" (Uniform Requirements for Manuscripts Submitted to Biomedical Journals, 2001). In contrast, I attach a recent article demonstrating that this recommendation is mostly rhetoric.
Best regards
@Jochen, I like your suggestion to use another approach, such as a loss function, to optimise the sample size calculation for clinical trial design, although, as you pointed out, it is not going to be easy to develop a method that accounts for several parameters (such as effect size, spread of the data, alpha, and beta or power). However, I think this is something we have to think about in order to deal with the current issue. This approach may also open an opportunity to account for other parameters, such as cost. I am wondering if anyone could suggest a loss function (even from another discipline) that we could adapt as a starting point; we could then modify it, run a simulation study and compare the results with the current approach.
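As one possible (very crude) starting point, here is a sketch (Python) that adds a per-patient cost to the error costs from the earlier sketch and grid-searches over alpha and the sample size; every number in it (costs, effect size, prior probability of a true effect) is a placeholder to be replaced with trial-specific values before any conclusion is drawn:

```python
import itertools
from scipy.stats import norm

def total_expected_loss(alpha, n_per_group, effect_size, p_h1,
                        cost_fp, cost_fn, cost_per_patient):
    """Expected error loss plus recruitment cost for a two-sample z-test (approximate)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    power = norm.cdf(effect_size * (n_per_group / 2) ** 0.5 - z_alpha)
    error_loss = (1 - p_h1) * alpha * cost_fp + p_h1 * (1 - power) * cost_fn
    return error_loss + 2 * n_per_group * cost_per_patient

alphas = [0.05, 0.01, 0.005]
sizes = range(20, 401, 20)
best = min(itertools.product(alphas, sizes),
           key=lambda x: total_expected_loss(x[0], x[1], effect_size=0.5, p_h1=0.3,
                                             cost_fp=100.0, cost_fn=100.0,
                                             cost_per_patient=0.05))
print(f"lowest expected loss at alpha = {best[0]}, n per group = {best[1]}")
```

The interesting part is how strongly the "optimal" alpha and sample size move when the cost inputs move, which is precisely the argument against a single universal threshold.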
The traditional alpha=0.05 is arbitrary. However, in drug development, statistical significance of a treatment versus placebo at the 0.05 level is already difficult to achieve. A smaller alpha in regulatory applications would result in effective treatments not being approved, for purely statistical reasons.
This discussion is VERY useful. I agree with Anna's statement: we need to interact with statisticians. The problem is that it is usually difficult to speak the same language and to get properly involved with the same aims. Recently, it was difficult for me to explain to a statistician what I needed him to do. Interdisciplinarity is very welcome, but the aims and procedures of the project must be very clear to all.
Leandro's answer and opinion reflect exactly the essence of the problem: researchers and statisticians must speak the same language, and this will result in a proper design, analysis and interpretation. But there is one problem in Leandro's statement: the researcher does not have to tell the statistician what to do; rather, the statistician helps the researcher find the best way to resolve the research problem. The statistician must be able to understand the research problem and the possible limitations for the analysis, and the researcher must understand enough statistics to be able to interpret the analysis.
Best regards
I am glad to be a part of this discussion. In my view, the cut-off p-value for calling results statistically significant is a topic that has been much discussed. The conventional value of p less than 0.05 can continue as it is. In research studies where the researcher has to be strict about the type I (false positive) error, such as when assessing the efficacy of a life-saving drug, one can bring the threshold down to 0.005. It is more important to look at confidence intervals and their interpretation. Clinical significance in terms of effect size and NNT, NNP, NNS or NNH is probably more helpful in deciding whether to carry the findings into day-to-day practice.
With regards to ALL
A discussion was organised regarding the new p-value threshold for statistical significance, in which some of the co-authors tried to address several potential objections/concerns regarding the new definition of significance.
I think the discussion is interesting, as several other leading experts (commenters) also expressed their views on the new p-value threshold. Here is the link:
http://philosophyofbrains.com/2017/10/02/should-we-redefine-statistical-significance-a-brains-blog-roundtable.aspx
Another part of the problem is that there are some people who will gladly spend years carefully gathering data, grudgingly spend 20 minutes analyzing the data, and then carefully craft a manuscript over the following weeks or months. The biology is what counts; the math is an annoyance.
It seems to me this discussion, despite being extremely interesting, is ultimately pointless. We have been around this subject many times. Swapping one arbitrary value for another will not be a game changer. P-values arise in the context of a statistical test. A statistical test is a tool. The fact that the tool is abused does not make it a bad tool. Tools, irrespective of how complex they might be, are only tools. You need a brain to interpret when and how a tool should be used and, perhaps even more importantly, when it is being abused.
I hope the discussion is not ultimately pointless. If the problems are not discussed, they get ignored.
Certainly, what's more important than the actual p-value used is the replicability of results across studies, i.e., the robustness of conclusions. No amount of statistical obfuscation changes such common-sense logic. I prefer to look at both statistical significance and effect sizes when undertaking biophysical analyses; the more tools in the toolbox the better! I certainly see no reason to eschew the p-value criterion of 0.05.
-Bob
Justify your alpha: In response to recommendations to redefine statistical significance to P ≤ 0.005, we propose that researchers should transparently report and justify all choices they make when designing a study, including the alpha level.
https://www.nature.com/articles/s41562-018-0311-x
I usually avoid thresholding p-values unless I am doing a simulation study and am investigating type I and type II error, in the Neyman-Pearson fashion.
P-values are extremely tricky, but humans are even more so; people will often run hundreds of models and report only some of them. Confidence intervals suffer from the same drawbacks because they can be interpreted in the same way.
Fisher himself advocated 1% as the more stringent alternative, so 0.5% is not that new or strange.
Decreasing the type I error probability increases the type II error probability. So it cannot be a standard recipe; it depends on which kind of error has worse consequences in each situation.
I agree with Anna G. Setting the p-value threshold is an optimization process to control both type I and type II errors. If one of these errors has more severe consequences, that might be a justification for moving away from the 0.05 level.
-Bob
Dear Michael Ghebre,
I have recently made a publication on this important topic:
"https://www.researchgate.net/post/Effect_Size_Report_Have_you_been_thinking_about_how_youre_doing_not_merely_if_you_do"
As I said in other posts, there are many ways in which we can improve the way we report (and interpret) our results. The important thing is not to rely on simplistic modes of reporting (e.g., just showing the p-value obtained). Enriching the report, for example by showing confidence intervals and measures of effect size, not only gives more information for an adequate reading of the results, but also makes subsequent meta-analytic studies possible (in pursuit of a cumulative science). Undoubtedly, we must also move from a simplistic reading based on arbitrary cut-off points to a more critical and comparative reading of measures such as effect size.
Ultimately, we should be doing multivariable, rather than single-factor, analyses to really understand biological relationships. There's no mention of that in this recent, provocative article:
Amrhein, V., S. Greenland, and B. McShane. 2019. Scientists rise up against statistical significance. Nature 567: 305-307 (https://www.nature.com/articles/d41586-019-00857-9).
-Bob
The p-value threshold will depend on the type of experiment. For example, P < 0.05 may be acceptable for a fertilizer trial, but for a drug trial on patients the same P < 0.05 may not be acceptable.
@Anna:
Only if the sample size remains fixed. In the planning stage of a trial, decreasing the type I error threshold with a fixed type II error threshold would result in a greater sample size. I think trials, or empirical studies in general, without a sample size calculation done beforehand can only be regarded as exploratory studies and therefore should not lead to any significance tests at all.
Interesting discussion. I need to follow this.
Thanks for the question!