I know that if we have a small sample size and we wish to be less orthodox, we could report 10% statistical significance as a significant finding. Is there any literature I could refer to in order to defend my choice? It must be a peer-reviewed paper or a book.
There is no authoritative reference for using 0.05 as the significance level. Au contraire, there are references from Neyman as well as from Fisher saying that the level of significance has to be chosen based on the whole context (scientific, economic, aims, limitations ...). In the Neymanian philosophy a "conventional level" makes no sense because the fixed constraint is the cost/benefit ratio of the research, and there is in any case no possibility to sensibly decide on an acceptable level a posteriori. In the Fisherian philosophy there is no cost/benefit ratio, and the criteria for selecting a level are not made so explicit; the judgment is necessarily a researcher's judgment, which always includes a major part of "personal opinion". Note that both philosophies use the "decision" for different purposes, in a different "frame of action". Neyman wants answers about hypotheses: which hypothesis can be accepted under optimum cost/benefit conditions. Fisher uses data to get a rough impression of the "significance" of a finding. If the data are too likely under the null hypothesis, it may not be worth/possible/feasible to investigate it further.
So you can cite both Neyman and Fisher (depending on the tests you do: null-hypothesis tests or significance tests). Both argue that there cannot be a "standard" or conventional level. There is, though, this quotation from Fisher:
“If P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05. . . .” (Fisher RA. Statistical methods for research workers. London: Oliver and Boyd; 1950. p. 80)
But this is one sentence, taken out of context, among many other statements of Fisher pointing to more flexibility. See here for a collection: http://www.jerrydallal.com/lhsp/p05.htm
I would almost always recommend reporting the actual p-value. Why not report a p-value of, for example, 0.093? That way every reader is able to see the significance or non-significance. I also recommend calculating and reporting effect sizes, because, as you know, significance is not the full story (everything will be significant with a large enough sample size). And if you don't want to report the actual p-values, you may follow the usual star convention: * p < 0.05, ** p < 0.01, *** p < 0.001.
Yes. There is a massive amount of literature on the problems and/or uselessness of significance testing (see, e.g., the attached link to "402 Citations Questioning the Indiscriminate Use of Null Hypothesis Significance Tests in Observational Studies"). Fisher's approach has been thoroughly questioned in its entirety since it originated, and there remains little defense for experiments using the "hypothesis testing" paradigm. See the attached papers for a small sample of critiques.
http://warnercnr.colostate.edu/~anderson/thompson1.html
Hi,
I agree with the comments stated before. But besides literature I would report not only the exact p values but also the effect size and even the statistical power.
For that, you can use the R package pwr or the freeware G*power.
You could also argue in your text that p-values are sensitive to sample size when the sample is not biased.
Hope this helps!
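A minimal sketch of the kind of power and sample-size calculation mentioned above, using the pwr package; the effect size (d = 0.5) and the target power (80%) are illustrative placeholders, not recommendations:

```r
# Requires the 'pwr' package: install.packages("pwr")
library(pwr)

# Sample size per group needed to detect a medium effect (d = 0.5)
# with 80% power, at alpha = 0.10 versus the conventional 0.05
pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.10, type = "two.sample")
pwr.t.test(d = 0.5, power = 0.80, sig.level = 0.05, type = "two.sample")

# Power actually achieved with a small sample of n = 20 per group at alpha = 0.10
pwr.t.test(n = 20, d = 0.5, sig.level = 0.10, type = "two.sample")
```

G*Power offers equivalent calculations through a graphical interface. Reporting the achieved power alongside the effect size makes it easier to explain why a 10% level was considered.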
If I were you, I would present the work as a pilot study and put emphasis on effect sizes rather than p-values. Do a power analysis, maybe, to support your assertion that sample size is too small to detect an effect.
Keep in mind that the results are unlikely to be generalizable anyway, so I suggest you collect more data.
(Andrew, thank you for the link. It's glorious!)
I think my problem is simpler. I am aware of the debates about the flaws of statistical significance testing but, in any case, it is a convention and I am not interested in going into these debates in my paper. I only need to cite one or two credible sources which state that statistical significance at the 10% level can also be worth attention.
Of course, I am reporting all statistical information - coefficients, S.E., statistical power, sample size, etc.
Another note: the paper is in the phase "revise and resubmit" so it won't make sense to do anything more than what the reviewers ask me to do.
Plamen - this is what you probably were looking for...
http://stats.stackexchange.com/questions/55691/regarding-p-values-why-1-and-5-why-not-6-or-10
Subhash, I need something like this, but I cannot cite a forum discussion in my paper. It must be a peer-reviewed paper or a book.
An internet search may help Plamen. I know of one reference I read ten or so years back, but am unable to recall - age is at times a nuisance!
Well ... here is another one if it is of any help
http://www.nature.com/news/scientific-method-statistical-errors-1.14700
I agree with Manfred. Report the p-value, effect size, confidence interval, r-squared or whatever makes sense... The alpha = 0.05 rule isn't magic. In a preliminary study, it may make sense to use alpha = 0.10. In a medical study, where people's lives are on the line, you may need a much smaller p-value... I would try to write the results as "In a preliminary study, we find that p < 0.10 is suggestive of a significant effect that warrants further study.", or something like that.
This isn't the strongest statement, but in this quote, Zar does mention that the alpha=0.05 rule is arbitrary. Zar is the bible for statistics in biology/agriculture. Who is the equivalent in social sciences?
'By experience, and by convention, an [alpha] of 0.05 is typically considered to be a "small enough chance" of committing a Type I error while not being so small as to result in "too large a chance" of a Type II error (sometimes considered to be around 20%). But the 0.05 level of significance is not sacrosanct. It is an arbitrary, albeit customary, threshold for concluding that there is significant evidence against a null hypothesis. And caution should be exercised in emphatically rejecting a null hypothesis if p = 0.049 and not rejecting if p = 0.051, for in such borderline cases further examination - and perhaps repetition - of the experiment would be recommended.'
Zar, Biostatistical Analysis, 5th ed., p. 79
If you can get your hands on Agresti and Finlay, Statistical Methods for the Social Sciences, I would see what they have to say. From reviews, it looks like they endorse models and confidence intervals rather than sticking to the alpha = 0.05 rule. But I don't have the book at hand to check.
Obtaining a sample size that is appropriate in both regards is critical for many reasons. Most importantly, a large sample size is more representative of the population, limiting the influence of outliers or extreme observations. A sufficiently large sample size is also necessary to detect differences among variables that are actually significant. However, for qualitative studies, where the goal is to "reduce the chances of discovery failure," a large sample size broadens the range of possible data and forms a better picture for analysis.
On the other hand, the variability of the experimental units imposes a threshold on the variation of the variables measured in an experiment. For example, if that threshold is high, then 10% statistical significance may be enough to justify the differences between a treatment and the control, even though the sample size is small. It is then important to repeat the experiment, because systematic repetition will tell us whether this result holds up under the conditions of the experiment.
Sincerely,
Angel
There is nothing magical about the famous 0.05/5%. However, I cannot stress enough how important it is that one understands what a p-value of 0.05 means: one in twenty times, you will get a significant result by pure chance! I.e. the result does NOT harbour anything biologically meaningful; it simply occurs by 'luck'.
In other words, with a p-value of 0.10 you may simply have been 'lucky' and 'hit' the one time in ten that a significant result occurs by pure chance... This is what your result is... A stroke of 'luck'!
Dear Leon,
you are completely right! You know it goes the other way around though - in reality there might be a phenomenon which "by chance" does not appear very robust in the empirical analysis with the non-perfect data we always use.
Leon, except that you can say that about any results. You'll end up sounding like the tobacco companies in the 1950's. "Scientists have a mountain of evidence linking cancer to cigarette smoking? Ha! Just meaningless correlation!"... In this now very theoretical conversation, far from what the OP was originally looking for, I think it's important to keep a couple of things in mind: 1) the idea of science-based evidence; and 2) how evidence is evaluated by science.
1) We should be looking at experimental results in a science-based way, not just in an evidence-based way. Of course if you conduct a thousand correlations on random sets of data, you will get all kinds of spurious correlations. (Results of such an exercise can be seen here: http://www.tylervigen.com/ ). Likewise, if the sellers of woo-woo conduct a bunch of experiments on homeopathy, they will find some significant results. But as far as we understand the world, homeopathy actually working is pretty much impossible. We have "evidence" that homeopathy may be effective, but no one who understands either chemistry or medicine will be swayed by a few spurious positive results.
2) Science doesn't see one p-value and declare that the result is clearly true. Instead we conduct preliminary studies, based on our understanding of previous science. If things look promising, we might get funding to do a more in-depth study. If that's positive, some others might try the same or related experiments. We use a web of evidence and a web of understanding to evaluate what we think is true about the world... True, our p-values might just be lucky, or our experiments might have been poorly done, or we might have missed the true cause of what we're seeing, but that's why we proceed this way.
I'm not writing all this to be pedantic. If we don't follow an initial "interesting" result (p-value = 0.10, r-square = not-great, etc.), we might miss something potentially valuable.
But also, there are different considerations in different disciplines. If we are trying to grow nice turfgrass, 90% certainty about some treatment might be acceptable. If we are considering scrapping chemotherapy for some other treatment for cancer, we will want to approach the evidence differently.
And there are different standards for different places you might publish.
Steve Cherry has some interesting thoughts on this issue in the attached paper.
http://warnercnr.colostate.edu/~anderson/PDF_files/Statistical.pdf
In my experience, a finding based on even 0.05 significance doesn't hold up very well. Add an explanatory variable, apply a log transform to one or more variables, change a criterion for inclusion/exclusion a bit, and the finding evaporates. So, perhaps you need to ask why you want to publish such a tenuous finding and carefully look at other options - getting more data, discussing the "findings" informally, using the result of an analogous precursor study to establish a prior and then applying a Bayesian approach, etc.
I did this blog post related to p-values and 'luck': https://leonjessen.wordpress.com/2014/11/13/why-do-we-need-to-be-careful-when-performing-multiple-tests/
A generally accepted significance level is 0.05. To the library! The Web of Knowledge is a non-exhaustive electronic database of published articles. For 2014 it lists 4,131,272 published articles. Some of these will lack any statistical test, but by the same token some will have many statistical tests. If we assume that each article has exactly one statistical test, then a little over 200,000 articles report findings that are simply chance alone (Type I errors). Of course, as Salvatore pointed out, some of this is mitigated by having multiple studies from multiple labs doing similar projects. If we move towards 0.1 as the critical value, then we need to ask ourselves how many spurious conclusions we are happy to read.
One could argue that I have taken too many liberties in the above example. So in my research I have a system where I have calculated 84 variables. I have 4 treatments, so I will run 84 Tukey HSD tests to look for significant differences. With alpha=0.05, I find one variable that shows significant differences between treatments. With alpha=0.10 I find three. In either case would you be comfortable with an article that talked about significant treatment effects? I argue that this is a good case for saying that there are no treatment effects (I have 20 replicates/treatment -- not a large number, but it takes about a half day of work to get one replicate in one treatment.).
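As a rough, back-of-the-envelope illustration of this point (it assumes all 84 null hypotheses are true and the tests are independent, which Tukey HSD comparisons on the same data are not, so treat it only as a sketch):

```r
m <- 84                  # number of tests
m * 0.05                 # ~4.2 expected false positives at alpha = 0.05
m * 0.10                 # ~8.4 expected false positives at alpha = 0.10

# Chance of at least one false positive if all 84 H0s are true and tests are independent
1 - (1 - 0.05)^m         # about 0.99
1 - (1 - 0.10)^m         # essentially 1
```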
Whatever alpha you choose, I will be unimpressed by a citation. I want to know what alpha you chose, and why. Give me some indication that you understand the consequences of your choice. If there are other approaches to the analysis that could avoid the problem (say linear regression rather than multiple comparison procedures), give some indication of why they were not used. What does one gain if one goes from 0.05 to 0.10 to 0.15? Why 0.1? Is this substituting one magic value for another? I agree with others that one should report p-values, but there is still a determination that "I will discuss these differences as important," and this usually involves some significance test.
Timothy, your last paragraph is great! Well said.
But I also want to put my finger on your statement that many articles "report findings that are simply chance alone (Type I errors).":
First I think that it is a common misconception to say that a finding "is (due to) chance (alone)". Chance is a measure of our expectation of something, and not a supernatural property of a result or data etc. The data is the data is the data. We may explore/calculate how likely such data are to be expected given some hypotheses about the data-generating process or the distribution of the data. This expectation is called "chance" and it is quantitatively measured as probability. The p-value is the probability of getting a "more extreme test statistic" assuming a particular probability model for the data and the null hypothesis about the test statistic. The words "chance" and "probability" are almost synonyms, except that "probability" adds a formal measure, whereas "chance" does not.
Second, I am not sure if one can talk about "type-I errors" in this setting. Most research articles have not specified any beta and they do not seem to follow the decision-theoretic approach of Neyman. But then there are no sensible error rates defined. What is rejected is just the "point-null hypothesis", i.e. the observed data had too low a chance to occur given that the effect was exactly zero. But this is generally not a sensible (research) hypothesis. If something is done, if something is systematically different, if there is some intervention, why should we expect an exactly zero effect at all? In this formal "kind-of-hypothesis-test" setting I would say that the type-I error rate is close to 0, no matter what the authors publish and at what level they test. The more important question is whether the size of the effect is relevant and, at least, whether the direction of the effect is correct. But this is not answered by claiming a "significant result".
I would be grateful to read other opinions and objections about this.
Jochen,
I agree with the first full paragraph. I am not sure about the second. I'll walk my way through, and try not to jump to the end before I have finished with the beginning.
"no sensible error rates define" I agree, in that the error rate used is just the standard 0.05, and this is seldom justified. It is simply accepted. So this defines, by default, the Type I error rate: If the null hypothesis is true and I get an observed value that is only seen once in 20 tries, then I will reject the null hypothesis just because I want to. The probability of identifying a false positive is 0.05. I choose not to argue if this is sensible, as I have other battles where I think I have a greater chance of winning. The other problem is that I know how to solve this problem, but I never have enough data and I have no expectation of ever being allowed to gather enough data. This is the "I have an experiment where I test a question" followed by "I repeated the entire experiment 100 times to evaluate the performance of the experimental design." However, this is a great exercise in data analysis using computer generated data.
My understanding is that the "straw man" null hypothesis is one of the main criticisms put forth by proponents of Bayesian methods. I see the value in such an approach, but I have yet to be convinced that a straw prior is a great improvement over a straw null. That said, I can also see research questions where prior data can be used to change our expectation of new results and a very good prior can be formulated.
"I would say that the type-I error rate is close to 0" is based on the idea that most data analysis is using tests where the null hypothesis is that the difference is exactly zero in an experimental design implemented for creating differences. The problem is that in my research I think that "zero" is a statement of probability. It is better to say that the null hypothesis is that the difference is indistinguishable from 0 given background variability, rather than it equals zero. Given that uncertainty, I can make a type I error. One can further point out that any p-value that we calculate is also an estimate. So if an experiment is done 10 times I would expect that I could get a p-value of 0.0069 the first time, 0.0073 the next time, and so forth. We never put 95% confidence intervals about our estimated p-values.
All that said, I will agree that it is unreasonable to expect that just because I choose to reject the null if I get a p-value less than 0.05, that I will get an actual type I error rate of exactly 0.05. My intuition says that it will be closer to 0.05 than to zero, but I have no proof that this is true.
Timothy, thank you for contributing.
I think you got my 2nd paragraph wrong, at least partly wrong. The key point is put by you: "If the null hypothesis is true [...] The probability of identifying a false positive is 0.05." - exactly this is it. The probability in this framework is defined as the limiting frequency of falsely rejecting H0, and this makes sense only if the tests are repeatedly done on true H0s. If only some of the H0s are actually true, the actual rate of false rejections will necessarily be lower. That's fine; alpha just gives the upper bound of the rate. But when almost all H0s are false on a priori grounds, then I do not see any point in a formal test. You cannot falsely reject any H0. And a non-significant result is then almost certainly a false negative result.
This is interesting: "It is better to say that the null hypothesis is that the difference is indistinguishable from 0 given background variability" - but this is not what hypothesis tests do. A correctly performed hypothesis test necessarily requires one to state a "minimum relevant effect" and a "minimum accepted power" (or type-II error rate, beta). Then everything starts making some sense. But without having specified the relevant effect and without having planned the study to obtain just the desired power, the absolute values of p (and of alpha!) are not well defined. If we did the experiment as well as we could and we get a "large" p-value, this can only be used as a hint that it is better to stop further investigations on this topic - at least with the current experimental design. A "low" p-value indicates that either further research can be useful or it is worth interpreting the effect size in the scientific context. But the p-value cannot be used to control any error rate in any sensible way. I am NOT discussing whether 5% is sensible (you may substitute any other value). It is the *rate* itself that cannot be defined, and even if it could, controlling such rates would not make sense in this context.
Regarding your last paragraph: The type-I error rate is the (limiting) rate of false rejections (which are possible ONLY when H0 is true). So when H0 is true in ALL the tests, this rate approaches alpha (0.05 or whatever value). The lower the proportion of tests on true H0s, the lower the actual type-I error rate will be. The chosen alpha (e.g. 0.05) still and always gives the upper limit of the (accepted, long-run) rate of type-I errors. So alpha may be always and constantly 0.05, but the actual type-I error rate depends on the proportion of true H0s among the tests that are actually performed. Simple extreme case: if H0 is always false, it is impossible to make any false rejection (any rejection is correct because H0 is always false). This demonstrates that the actual rate is zero, independent of alpha.
That said, you would be a bad researcher if your actual type-I error rate were close to 0.05, because this would mean that almost all of your H0s are true...
That is the researcher's perspective. It gets worse when we look from the literature's perspective. Here we do not see everything that was tested and how often it was tested. Here we only see more or less "selected results", most of them being "significant". If we look at these selected results only, then, among those, the rate of false positives can easily and considerably exceed alpha. If we imagine that all researchers always work only on true H0s, and only the "significant" results get published, then all of the published results are false positives (the actual type-I error rate within the published results is close to 100%) ...
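A small simulation sketch of this "literature perspective", with made-up proportions (90% of tested hypotheses truly null, 50% power when there is a real effect) chosen only for illustration:

```r
set.seed(1)
n.tests <- 1e5
p.null  <- 0.90    # assumed share of tests performed on true H0s
power   <- 0.50    # assumed power when H0 is false
alpha   <- 0.05

h0.true <- runif(n.tests) < p.null
# a test comes out "significant" with prob alpha if H0 is true, prob 'power' if H0 is false
signif  <- runif(n.tests) < ifelse(h0.true, alpha, power)

# among the "significant" (i.e. publishable) results, the share of false positives:
mean(h0.true[signif])   # roughly 0.47 under these assumptions -- far above the nominal 0.05
```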
There are three common levels of significance: 1%, 5% and 10%. The level of significance depends on the nature of the study. I am a student in the social sciences; for the social sciences, the level of significance is usually taken as 0.05, i.e. 5%.
For a 1% level, can I accept a p-value of 0.13, or is it out of range?
No, it's out of range. For a 1% level you would reject H0 only when p < 0.01.
With small sample sizes, the precision of your study is low. A p-value combines two important measures – the effect size and the precision with which it has been measured – in a way that loses our ability to think about them separately. A p-value of 0.05, corresponding to a t value of roughly 2, can come from a tiny effect measured with high precision (large sample, low variation between observations) or from a large effect measured in a low-precision study (small sample, lots of variation between observations). You cannot tell which.
I would forget p-values and hairsplitting over significance, and be honest: report the effect sizes and their confidence intervals. That way you are reporting what you found out and the margin of uncertainty around your findings.
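A minimal sketch of this style of reporting in R, using simulated data in place of a real study (the group means and sample sizes are invented):

```r
set.seed(42)
control   <- rnorm(12, mean = 10, sd = 3)   # hypothetical small study, n = 12 per group
treatment <- rnorm(12, mean = 12, sd = 3)

fit <- t.test(treatment, control)
fit$estimate[1] - fit$estimate[2]   # estimated effect: difference in means
fit$conf.int                        # its 95% confidence interval -- report both
```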
But what is the effect size?
The usual maximum likelihood estimate (MLE) is only a "large-sample estimate". For small samples, the MLE can be far off any reasonable value (and that applies to a CI around the MLE as well). If there is so little information in the data, the context (not to say the prior knowledge) becomes relevant. So just for small samples, relying on the MLE and the CI may not be satisfactory. It is the very core of the testing philosophy that neither the MLE nor the CI is (necessarily) a useful indication of the effect size, and that a conclusion is based on properties of the procedure rather than on the estimate (the poor man's solution, if one is unable to put it into a proper context, that is, to specify a prior).
I think, although it is often okay-ish, we don't do ourselves a favour when we unreflectively interpret MLEs as posterior modes (or means) and CIs as HPD intervals (or credible intervals).
Statistical inference is based on the idea that it is possible to generalize results from a sample to the population. How can we ensure that relations observed in a sample are not simply due to chance? Significance tests are designed to offer an objective measure to inform decisions about the validity of the generalization. However, the amount of information varies widely from problem to problem. Still, the most traditional statistics is anchored in fixed significance levels and constant tables of evidence to judge p-values. According to Pérez and Pericchi (2014), there is no clear justification for the use of fixed significance levels, except tradition. On the contrary, there is a vast literature implicitly critical of it, most of it published outside statistical journals. They recommend an adaptive alpha which changes with the amount of sample information. This calibration may be interpreted as a Bayes/non-Bayes compromise, and leads to statistical consistency. The calibration can also be used to produce confidence intervals whose size takes into consideration the amount of observed information.
I have enclosed the reference about it.
Pérez, María-Eglée and Pericchi, Luis Raúl (2014). Changing Statistical Significance with the Amount of Information: The Adaptive α Significance Level.
Stat Probab Lett. 85: 20–24.
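The exact calibration is given in the paper; purely as a hypothetical toy (not the Pérez-Pericchi formula) one can illustrate the qualitative idea of a significance level that shrinks as the sample grows:

```r
# Toy illustration only -- see Pérez and Pericchi (2014) for the actual calibration.
# Idea: an alpha judged adequate at a reference sample size n0 is shrunk for larger n.
adaptive.alpha <- function(n, alpha0 = 0.05, n0 = 100) {
  alpha0 * sqrt(n0 * log(n0) / (n * log(n)))
}
round(adaptive.alpha(c(100, 1000, 10000)), 4)   # 0.05, then smaller and smaller
```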
And what are the best tests I can use to assess goodness of fit (GOF) with a Tobit model?
I want to select the best model using GOF measures such as R², chi-square and the p-value.
Colleagues and I previously published a study using a p-value of 0.10 in an attempt to predict success in a high-stakes certifying exam. We developed a predictive formula for people who were at risk of failing, and we wanted to include those who had a possibility of failing, on the assumption that additional instruction for those predicted to pass (at the low end of the exam score) would not harm those students, while failing to add instruction for those at moderate risk might result in their failure. As Dr. Wilhelm notes above, there is no source mandating a p of 0.05 or less. I would, however, include your reason for a p-value of 0.10 in your methods section. See Haas, Rule and Nugent, October 2004, Journal of Nursing Education 43(10):440-446.
The choice of any level of statistical significance is arbitrary as it is not determined through optimizing an objective function.
Convention suggests the use of 1%, 5% and (less frequently) 10% levels.
Please stop promoting wrong layman's explanations.
In reality, the effect will be either positive or negative. It may be extremely close to zero, but looking precisely enough, it will still be either on the positive or on the negative side. When we reject H0: "the effect is 0", we claim that we do have enough data to "see" if it is positive or negative. If we don't reject, we say that our data is not conclusive (could be positive or negative, we don't dare to say). Thus, only when we reject we can make a claim that may be wrong.
Now consider the case where the "true" effect is so tiny (and/or the noise so huge, or the sample size so small) that we are practically unable to "see" it. Among tests under this scenario, we would reject about 5% of the H0s, claiming that we could see the direction (positive or negative). In fact, we must expect that about 50% of the rejected H0s will indicate a "positive" effect and 50% a "negative" effect. Thus, we must expect that we are right 50% of the time, not 5%. So when there is no power, the test is no better than mere guessing.
The highest possible risk of being wrong is 50%, and that is independent of the level of significance at which you reject H0! The level of significance only controls the (highest possible) risk of ending up with "no conclusion" (= 1- the level of significance).
In the other extreme, if the power is close to 100% (the effect is large, the noise tiny, the sample size large), you will reject H0 with certainty, in any test. I am not absolutely sure about that but I would think that the probability of a wrong conclusion in this case is almost zero. So you will actually never make a wrong decision.
Thus, taken the two extreme scenarios together, your risk that a conclusion is wrong is somewhere between 50% (guessing) and 0% (perfect data). And interestingly, these limits do not depend on the significance level, but on the power (which is usually not known to us)!
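A minimal simulation sketch of the low-power extreme described above, with invented numbers (a tiny true positive effect of 0.02 SD, n = 10 per group):

```r
set.seed(7)
n.sim  <- 20000
n      <- 10
effect <- 0.02   # tiny true (positive) effect, in SD units

res <- replicate(n.sim, {
  x <- rnorm(n, 0)
  y <- rnorm(n, effect)
  tt <- t.test(y, x)
  c(p = tt$p.value, est = unname(tt$estimate[1] - tt$estimate[2]))
})

rejected <- res["p", ] < 0.05
mean(rejected)                    # close to alpha: essentially no power
mean(res["est", rejected] > 0)    # among rejections, barely more than half get the sign right
```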
Can someone please share with me any social science research paper that uses statistical significance at the 10% level? I need it for my research. Preferably an agricultural economics or agricultural marketing paper. Thank you.
The only condition for using a "non-standard" significance level is to adopt this value before the start of the experiments and to report it prospectively (e.g. in a data analysis plan).
This may be the case in preliminary studies that aim at "suggesting" the effect. The possibility of false-positive results should be taken into consideration.
An alternative to choosing an arbitrary level of significance is to simply provide prob-values and let the reader decide.
That's not that simple. You (as the scientist/author) have to decide whether you trust your data enough to believe that it shows the correct sign (or direction). How else would you be able to build your story? It's OK to provide the p-values to the reader so that she/he can decide whether the evidence is sufficiently convincing to her/him as well.
I would recommend you to read for its elementary solutions: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
Article Choosing the Level of Significance: A Decision‐theoretic Approach
The above article has been accepted by the peer-reviewed journal Abacus.
A previous version of the above paper is here:
https://mpra.ub.uni-muenchen.de/66373/1/MPRA_paper_66373.pdf
I also attach my working paper, under review at a teaching journal, which uses my R package OptSig:
https://cran.r-project.org/web/packages/OptSig/index.html
In my opinion, if you plan the experiment with an adequate N, even if it is small, then minor changes in the data set will not change the conclusions. It also depends on the nature of the material under study, that is, whether it is quantitatively very variable or not.
No correct p-value. Read Jochen's discussion of the topic. The more appropriate approach is to work at estimating the magnitude of effects and quantify uncertainty bounds as confidence intervals, or Bayesian credible limits etc. The degree to which results are scientifically meaningful, and the uncertainty in the range of conclusions that can be drawn is much more important than the p-value selected as a binary determinant of meaningful or not meaningful.
I read that if you get a non-statistically significant result but your F-statistic is 4 or more, it is worth collecting more data to see if you get a statistically significant result.
Karla Kassey, I think this is a bit of a misconception. A non-statistically significant result tells you that your data are insufficient to conclude that the larger model explains the data better than the smaller (restricted) model (this is what the F-statistic represents: the increase in the residual sum of squares caused by the restriction of a subset of the model coefficients).
It tells you that your data are inconclusive (according to the standards you have set). It is trivial that more data give more information (about all the coefficients in the model), and there will be some amount of data that will surely be sufficient to be conclusive regarding F. This has nothing to do with the actual value of F you observed in your sample.
And if you decide to "get more data", this means getting more independent data. That means you must not use the old data again (by just adding some more data). You need an entirely new set of data and must analyze only this. And still you will have a problem with interpreting the statistical significance. Simply calculating a p-value from the new data does not account for the fact that the procedure that eventually led to this p-value included the possibility of a non-statistically significant result in a first round (using a smaller sample). You would need to correct the p-value for multiple testing, but since you have two samples and two tests, you might possibly also combine the two p-values (https://en.wikipedia.org/wiki/Fisher%27s_method). I don't know how the correction for multiple testing should be done here, as these are sequential tests (the second test is performed only if the first was not significant at some level*). I would be grateful if someone could contribute and explain how this would have to be done. There seem to be some resources, but I don't know how they can be transferred to this problem (https://en.wikipedia.org/wiki/Sequential_analysis).
---
* e.g. you may decide to go for a second round only if p in the first round is between 0.05 and 0.2, and not always whenever p > 0.05.
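For the combining step mentioned above, Fisher's method itself is simple to compute (the open question of how to handle the sequential, conditional nature of the second test remains); a minimal sketch assuming two independent p-values:

```r
# Fisher's method: under the joint H0, -2 * sum(log(p_i)) ~ chi-squared with 2k df
fisher.combine <- function(p) {
  pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
}
fisher.combine(c(0.09, 0.04))   # e.g. the p-values from the first and the second sample
```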
Karla Kassey, I'd add to Jochen Wilhelm's answer that F is a measure of surprise. It measures the reduction in prediction errors when you compare the fitted model with a null model, which is usually a model where the mean is used to predict all data.
The trouble is that F is not interpretable without taking into account the complexity of the fitted model. Each term added to a model will "soak up" a little of the variation around the mean, so the model errors will tend to decrease. So F has to be interpreted in the light of the model complexity. An F of 4 has no inherent meaning without this information.
A non-significant F ratio says that the reduction in error when comparing the fitted with the null model is no greater than you would expect by chance, given the number of extra parameters that the fitted model has used.
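A small sketch of this model comparison in R with simulated data; anova() reports the F ratio together with the degrees of freedom it has to be judged against:

```r
set.seed(3)
n <- 30
x <- runif(n)
y <- 2 + 0.5 * x + rnorm(n)

null.model   <- lm(y ~ 1)   # intercept only: predict everything by the mean
fitted.model <- lm(y ~ x)

anova(null.model, fitted.model)   # F, its df, and the p-value of the comparison
```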
I'd get away from the binary significant-or-not thought process. Identify the effects you are interested in, estimate them with error bounds, and interpret those estimates. Try to build models with a priori defined functional forms and compare differences in practical terms meaningful to the science, rather than just "different" or "not different". And then expect to get beaten up by peer reviewers for junking hypothesis testing.
Karla Kassey, it would be interesting to read the document you mentioned. Can you share this reference with us?
Echoing Jorge Ortiz Pinilla, it would be worth providing the quote you cite, Karla Kassey. If you have collected data and calculated F and p, then that p-value has been spent, so if you collect new data, these should be analyzed as a separate study. This is one of the points Bayesians often raise about frequentist stopping rules.
Jae-Hoon Kim I read this article before checking this thread. It was a good read and it helped me a lot!
This article is also useful in explaining the choice of level of significance
Article Setting an Optimal α That Minimizes Errors in Null Hypothesi...
Statistical significance is defined by the researcher and depends on the assumptions framed for the study; the significance level will differ with the nature and conditions of the experiment or research conducted. If the experiment is conducted at field level, we typically reject H0 at the 5% level.
The level of significance is arbitrary, so 1%, 5% and, to a lesser extent, 10% seem to be standard.
Stop setting levels of significance and estimate parameters with uncertainty bounds and interpret the information.
Only a pre-registered data analysis plan that specifies a 0.1 significance level, with a rationale provided, may justify the use of such a non-standard value. In any case, please refer more to effect sizes than to p-values.
I agree with John Kern. You have to draw your own conclusions. Generally, results significant only at the 10% level are not considered statistically significant, but in some situations you can argue that they may be (because of the small sample size, or because of the low representativeness of the sample).