A million dollar research grant was issued to reject null hypothesis X. Unlucky researcher A could not find statistical evidence to reject X. With his test, he found a non-significant p value of 0.1. You still believe in the alternative hypothesis and replicate Researcher A's study. You, too, find a p value of 0.1. What is your conclusion? How does this finding influence your beliefs about the null and alternative hypothesis?
As I am not a statistician, I leave the interpretation of the question from that point of view to the professionals (there are already great answers). Yet I'd like to point out a few general issues:
First: "the million dollar research grant" certainly was not issued "to reject null hypothesis X" (well, maybe it indeed was after all, as embarrassing as that might be...), but to find and investigate evidence, whether the hypothesis could be considered false. Second: Always remember that an isolated p-value doesn't tell anything about the practical importance of an effect, as p depends on the sample size. And because p isn't what we are actually interessted in (namely P(H_0 | x) != P(x | H_0) = p).
Whether the repeated finding of "a non-significant p value" would change my beliefs about anything would depend mostly on the quality of the study design and the reported (raw) data (if it actually *is* reported...), not on the outcome of some null-hypothesis-significance-test nonsense. Why is a p-value of < 0.05 significant and p >= 0.05 not? Right: it's an *arbitrary* decision. And so are the binary reject-accept conclusions drawn from such tests: arbitrary.
The idea behind Fisher's original concept of a p-value wasn't to "reject" or "accept" a hypothesis; he thought of it as a “… rough numerical guide of the strength of evidence against the null hypothesis.” (R. A. Fisher) In Fisher's framework there was no concept of a (disjoint) alternative hypothesis or of power/error rates. In Fisher's thinking, a small enough p could justify repeating an experiment, but it wasn't evidence enough to reject or accept anything.
Neyman and Pearson, on the other hand, were interested in minimizing the long-term error rate of *repeated* decisions. Think quality control and the like. This kind of thinking is - by design - fundamentally *not* applicable to single studies, but only to a series of repetitions: “... no test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not often be wrong.” (J. Neyman and E. Pearson, 1933)
Mixing both concepts together leaves us with the useless, yet seemingly objective, null-hypothesis-significance testing that is forced onto scientists these days... The only sensible advice to such a question I can thus provide is: don't be a slave to the p value or to any mechanically applied statistical testing procedure!
“But given the problems of statistical induction, we must finally rely, as have the older sciences, on replication.”
— Cohen, 1994
“If Fisher and Neyman–Pearson agreed on anything, it was that statistics should never be used mechanically.”
— Gigerenzer, 2004
Further reading:
- Belief in the law of small numbers (Tversky & Kahneman, 1971)
- Statistical inference: A commentary for the social and behavioural sciences (Oakes, 1986)
- Things I have learned (so far). (Cohen, 1990)
- The Philosophy of Multiple Comparisons (Tukey, 1991)
- p Values, Hypothesis Tests, and Likelihood (Goodman, 1993)
- The earth is round (p < .05). (Cohen, 1994)
- P Values: What They Are and What They Are Not (Schervish, 1996)
- Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy (Goodman, 1999)
- From Statistics to Statistical Science (Nelder, 1999)
- Calibration of p Values for Testing Precise Null Hypotheses (Sellke et al., 2001)
- Misinterpretations of Significance (Haller & Krauss, 2002)
- It's the effect size, stupid (Coe, 2002)
- Mindless statistics (Gigerenzer, 2004)
- The Null Ritual (Gigerenzer, Krauss and Vitouch, 2004)
- Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing (Hubbard & Lindsay, 2008)
- http://www.johndcook.com/blog/2008/02/07/most-published-research-results-are-false/
- http://www.stat.duke.edu/~berger/p-values.html
A question seems essential to me here: how was the sample size calculated? If it was calculated to be able to detect a difference of delta in both studies (or at least in the second one), then the null hypothesis cannot be rejected. Of course, with a large enough sample size you will be able to detect a very small difference as significant, but such a tiny difference is not clinically important. "Significant" is not the same as "important". Hope this helps...
If I a priori believe the null, as you said, why should I have done an experiment?
If I have at least some doubts about the null a priori, I could calculate an updated judgement based on the data, given that I also had some specific idea about the alternative (which I don't have anyway).
A judgment about hypotheses based only on p-values is never possible. It always ends in circular arguments. Anything else would be a "free lunch", and there is no such thing as a free lunch in nature.
What is possible is a statement about beliefs about the data, given the null hypothesis. There is a possibility to combine the two p-values: following Fisher's method, the combined p-value is 0.056, so I'd expect a fraction of 0.056 of duplicate experiments to give evidence at least as strong as these two p-values of 0.1, given the null was true. Unfortunately, like any p-value, it does not tell me anything about what I should believe about the null hypothesis. To make this clear: if the null was that a person has no psychic abilities, I would still believe this null; if the null was that drinking alcohol is not correlated with cigarette smoking, I would strongly disbelieve that null.
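For anyone who wants to verify that figure, Fisher's method refers -2 times the sum of the log p-values to a chi-square distribution with 2k degrees of freedom, where k is the number of independent tests. Purely as an illustration, in R:

p <- c(0.1, 0.1)
# Fisher's method: -2 * sum(log(p)) is chi-square with 2k df under the null
pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)  # ~0.056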
In fact, I had a typo before. You are supposed to not believe in the null hypothesis and try to replicate the study to reject it.
Audrey, let's assume the supervisors of both researchers A and yourself are ignorant of this issue and will only prolong your contract if you find evidence to reject X.
Andreas, even then my main counterquestion remains the same: If I a priori do not believe the null, as you said, why should I have done an experiment?
Right. However, in our scientific routine, the above situation is one that we face, isn't it? You are supposed to run a study to empirically show what you believe, and quantification of the results is expected to be in stats speak... or am I missing your point?
What I intended to find out is how people interpret the aggregation of multiple null results.
To make it short: the best way to combine the two p values is by Fisher's method, which gives 0.056. Thus, this would not be sufficient to reject the null at a 5% level of significance. Stop. Fin. There is nothing more to say or to interpret or to conclude.
The whole concept of "belief" makes no sense here and is incompatible with hypothesis testing. As I said, if you want to discuss beliefs, you have to have a defined prior belief and you have to have a specific alternative. Under these circumstances you could use the data from the two experiments to update your prior beliefs to posterior beliefs.
Patrice, I just read your post. Your last sentence is critically wrong: "At that point [p
Andreas, if one day my contract depended on my results, I would quit my job.
Then, Audrey, you are in a very privileged position in this respect... congrats!
My conclusion is that while "assume the null hypothesis unless alpha level x is met" can be useful, the number of ways in which it can fail tends to exceed those in which it is meaningful, at least in practice:
"criticisms of statistical significance tests are almost as old as the methods themselves (e.g., Boring, 1919; Berkson, 1938). These criticisms have been voiced in disciplines as diverse as psychology, education, wildlife science, and economics, and the frequency with which such criticisms are published is increasing (Anderson et al., 2000)."
(http://laits.utexas.edu/cormack/384m/homework/Journal%20of%20Socio-Economics%202004%20Thompson.pdf)
First, the most frequently used statistics are those that rely on the mean as a measure of central tendency and derive or extrapolate variance from deviations around a population or sample mean. However, while t-tests and the like are robust against type II errors, arbitrary departures from normality and/or heteroscedasticity can easily result in a spuriously significant result that such tests will not miss.
Second, the entire notion of hypothesis testing in this way is in general inadequate. We construct theories out of research questions/hypotheses not just to test them but to create predictive models. Predictive power is a far better indicator of accuracy than an arbitrarily chosen alpha level, which can be reached or not reached in a range of qualitatively different ways.
Third, superior methods have been around for almost a century, but thanks to the decades between the development of such methods and the arrival of computers with the capacity to make them practical rather than essentially impossible, they were and are widely ignored.
Finally, one of the most difficult challenges in research is determining whether a result indicates that one is wrong (and usually, by extension, that one's theoretical framework is flawed), or that one's methods are inadequate EVEN IF they are standardly used.
Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing
(http://wiki.bio.dtu.dk/~agpe/papers/pval_notuseful.pdf)
The ongoing tyranny of statistical significance testing in biomedical research
(http://peer.ccsd.cnrs.fr/docs/00/58/01/11/PDF/PEER_stage2_10.1007%252Fs10654-010-9440-x.pdf)
Ziliak, S. T., & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. University of Michigan Press.
How many discoveries have been lost by ignoring modern statistical methods?
(http://isites.harvard.edu/fs/docs/icb.topic477909.files/Wilcox_1998.pdf)
Taagepera, R. (2008). Making Social Sciences More Scientific: The Need for Predictive Models. Oxford University Press.
I think Audrey's first response to this question (re: sample size) is key to this discussion. Given p = 0.1 (which is greater than 0.05, assuming alpha was set to 0.05), there may be evidence of a weak treatment/exposure effect.
In addition to hypothesis testing, an effect size (e.g. eta, see http://en.wikipedia.org/wiki/Effect_size) can also be estimated. If the effect size is large and p > alpha, then the original study likely did not have sufficient power to detect a significant difference. If the effect size was small and p > alpha, then it is likely that the effect of treatment/exposure was not substantial.
Therefore, if the first study had sufficient power/sample size to detect a *biologically important* difference, then I don't think it would be necessary to repeat the experiment. If the original study did not have sufficient power to detect a biologically relevant difference between groups, then I would think it worthwhile to perform a similar study that WAS powerful enough to detect the difference.
However, repeating the exact same study reminds me of the old saying: "Doing the same thing over and over again and expecting different results is the definition of insanity."
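As a rough illustration of the power point above, a standard sample-size calculation might look like the following in R; the two-sample t-test design, the effect size of 0.3 SD, and the 80% power target are assumed for illustration only, not taken from the studies in question:

# Sample size per group to detect a 0.3 SD difference with 80% power at alpha = 0.05
# (design and effect size are hypothetical)
power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.8)  # n ~ 175 per group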
Unfortunately, every function involves many variables that are beyond what we can imagine, and this changes the meaning of replication.
You could consider combining the P-values using one of various methods that are available. A nice review is given in German (which will presumably be no problem for either Andreas or Jochen) by Sonneman (1), and it is also covered in my Encyclopedia article (2). However, some purists might still argue that the first test should be regarded as hypothesis generating and hence discounted.
References
1. Sonneman, E. (1991). Kombination unabhängiger Tests. In J. Vollmar (Ed.), Biometrie in der chemisch-pharmazeutischen Industrie 4. Stuttgart: Fischer Verlag.
2. Senn, S. J. (2003). P-Values. In S. C. Chow (Ed.), Encyclopedia of Biopharmaceutical Statistics. Marcel Dekker: 685-695.
Maybe in an experiment that you think is a replication, a difference in the answer can lead you to find a manipulating variable.
The multiple replication of a null result across a variety of different tasks and experiments becomes convincing evidence for the absence of single-stream task sequence learning.
Source: Journal of Experimental Psychology: Learning, Memory, and Cognition, 2010, Vol. 36, No. 6, 1492–1509
Patrice said "I meant to present the case of continually finding p≤0.1. Then (assuming a critical value of p≤0.05), one cannot reject the null. One concludes that the data does not support 'intuition'." But of course if you ran the test a third time and still found a p-value of 0.1, then the overall p-value would be 0.032, which is statistically significant at the 0.05 level. With six such tests, the overall p-value would be 0.0063. See Lou Jost's page http://www.loujost.com/Statistics%20and%20Physics/Significance%20Levels/CombiningPValues.htm on combining p-values.
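These figures are easy to verify with Fisher's chi-square combination, e.g. in R (illustrative only):

pchisq(-2 * sum(log(rep(0.1, 3))), df = 6, lower.tail = FALSE)   # three tests of 0.1 -> ~0.032
pchisq(-2 * sum(log(rep(0.1, 6))), df = 12, lower.tail = FALSE)  # six tests of 0.1  -> ~0.0063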
Personally, I believe we face a big question to be solved here. My opinion, however, independently of this specific problem, is that when the replication of a study gives the same result, even a null one, this should be emphasized, since it confirms that the line being pursued should not be followed further.
I'm very much with Patrice Corneli's response... And that's why I think a pure replication of A's research would never improve knowledge, only perhaps my certainty about how likely it is that rejecting the hypothesis would be wrong. If the power of A's study was low, I should at least increase the sample size; but if that was not the case, I would have to improve the model/hypothesis, e.g. in terms of covariates or reduction of measurement errors.
If I repeated exactly the same study several times and always got a p-value of 0.1, I would not conclude that it becomes more likely that the hypothesis has to be rejected; rather, I would conclude that I can be quite sure that I can't reject it, at least if I want more than 10% certainty that I'm doing things right. Regarded like this, the repetition of the study changes the certainty about the p-value itself, which is then treated as a random variable...
The distribution of P-values under the null hypothesis is uniform. That is to say, every value between 0 and 1 is equally likely. Thus if you regularly get low values (say not individually significant but generally low) this indicates the null hypothesis is not true. If the experiments were of a similar size then this would imply that there was an effect but that the power of the studies was moderate for the effect present.
However, even if the expected P-value were 0.1 individual P-values would vary considerably.
This by the by is my explanation for the typical finding that only half of the phase III trials in depression are significant. The treatments work but not very well.
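To illustrate the uniformity point, here is a small simulation sketch (assuming, purely for illustration, a two-sample t-test with no true difference between groups):

set.seed(1)
# p-values from t-tests on data generated under the null are ~uniform on [0, 1]
p_null <- replicate(10000, t.test(rnorm(20), rnorm(20))$p.value)
mean(p_null < 0.1)  # close to 0.10, so repeatedly seeing p around 0.1 is mildly surprising under H0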
Good answers about statistical significance above, but to me the inability to achieve significance in any one experimental run suggests the effect is weak and one would have to question whether the magnitude of the almost-effect is practically (not statistically) significant.
Of course, not being able to reject the null hypothesis is not the same as saying there is no measurable effect. For every correct way to carry out an experiment there are thousands or millions of incorrect ways, which is why proving a negative is difficult.
One would have to reassess whether the tests were appropriately powered. If the power was estimated in an appropriate way, then one has to question the value of beating a dead horse. If the power calculations had been inappropriate, possibly based on wild guesses before you had better information, then such a tantalising replication of an almost significant test would need a further iteration that is powered based on the new details.
I would take it that getting significance from combining multiple non-significant results only allows you to decide to further test a hypothesis using better approaches and methodology, not to formulate a theory or implement a procedure.
There is insufficient information - for instance, if the two observed effects turned out to be in different directions, that probably adds weight to the plausibility of a null or near-null effect. If they were in the same direction, that adds weight to the plausibility of a real effect.
In this situation I would combine the effect sizes via a mini meta-analysis to refine my views. However, as Stephen Senn noted, it is very likely that the studies are underpowered or that the practical effect size was small (if the study was powered to detect effects of practical or theoretical importance).
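For what it's worth, a minimal sketch of such a mini meta-analysis under a fixed-effect, inverse-variance model might look like the following in R; the effect estimates and standard errors are hypothetical, since none were reported in the question:

est <- c(0.20, 0.18)   # hypothetical effect estimates from the two studies
se  <- c(0.12, 0.11)   # hypothetical standard errors
w <- 1 / se^2                        # inverse-variance weights
pooled    <- sum(w * est) / sum(w)   # pooled estimate
pooled_se <- sqrt(1 / sum(w))        # its standard error (smaller than either study's alone)
2 * pnorm(-abs(pooled / pooled_se))  # two-sided p-value for the pooled effect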
If the second researcher also found p = 0.10, then the probability of a mistake is very small, so I have to accept that my impression was wrong and not insist, just for the $1 million. Otherwise, I am not a scientist but a fraudulent businessman.
There are many things which can be said about your question, Andreas.
First, we should know that any final conclusion has to be derived from replications and not from a single study. The result coming from a single study may reflect only the particular sample drawn. That's why meta-analyses are more powerful. I would remind you that failing to reject the null hypothesis does not mean that the null hypothesis is true. It means that from your data it is not possible to demonstrate that the null hypothesis is false; but if several studies do not reject the null hypothesis, we are allowed to believe that the null hypothesis is true (though you will never know). I would also point out that it is never possible to show that a hypothesis is true in experimental science.
Second, many people should read the numerous papers which exist in the statistical literature about statistical tests and the use of the p-value (for instance, see “Jonathan A C Sterne, George Davey Smith, Sifting the evidence—what's wrong with significance tests? BMJ 2001” or “John P. A. Ioannidis, Why Most Published Research Findings Are False, PLoS Medicine, 2005”):
- We should know that statistical tests were never designed to be declared "significant" or "non-significant". Personally, I do not use the adjective "significant" anymore. In statistical (frequentist) testing there are two approaches: Fisher and Neyman-Pearson. In the approach of Fisher there is only one type of risk (commonly symbolized by alpha) and thus only one hypothesis (the null one). Fisher never said that the test is significant if the p-value is < 0.05 or even < 0.01, as is claimed almost everywhere in the scientific literature. A p-value is a probability informing on the data given the null hypothesis: Pr(data|H0). The p-value measures the strength of evidence against the null hypothesis. Fisher indicated that around 0.05 nothing can be said against the null hypothesis and the experiment should be replicated; rejection of the null hypothesis should only be considered with p-values around 0.001. But I say again: there is no real threshold. In the approach of Neyman-Pearson, a threshold is first set (before carrying out the study) for the rejection of the null hypothesis AND an alternative hypothesis has to be specified (a second risk, beta, is thus introduced). In this alternative hypothesis we have to set the value of the parameter we test, e.g. an odds ratio or a regression coefficient. Frankly, are we able to set this alternative value? No, we almost always have no idea about this value. However, in practice, a bad approach is currently used in the scientific community: only the null hypothesis is used (which puts us in the Fisher approach) together with a threshold for the p-value - very often 0.05 (which belongs to the Neyman-Pearson approach). So we mix the two approaches, keeping only what is practical from each. It is totally dishonest!
- We should know that "non-significant" results are no less valuable than "significant" results. You are not "unlucky" because you get a "non-significant" result. However, in the literature a "non-significant" result is very often considered poorer. This creates publication bias, because editors more readily accept papers with "significant" results (and some researchers no longer submit their "non-significant" results because they know their article will be rejected).
- With an increased sample size you will be able to get "significant" results. This "significance" is not the most important part of the test. The most important part is the utility of your result, as Audrey said above, a result which will be reliable with a large sample size. For instance, suppose you get a p-value of 0.001 for a regression coefficient b1 and a p-value of 0.05 for another one, b2. The effect of b1 on the response variable is small, but the effect of b2 is large and has important implications in your field of research. Which is the more interesting result? It is the result about b2, and you should confirm this effect with results of other studies or with a new experiment.
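To make the sample-size point concrete, here is a small illustrative simulation in R (the slope of 0.01 and the sample size of one million are assumed values, not from any real study):

set.seed(2)
n <- 1e6
x <- rnorm(n)
y <- 0.01 * x + rnorm(n)          # true slope of 0.01: practically negligible
summary(lm(y ~ x))$coefficients   # yet highly "significant", simply because n is huge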
Frank Niemeyer may not be a statistician, but he has provided the most cogent argument about the interpretation of a p value. His discussion of accepting or rejecting based on a p value, and the false reliance on p-values alone is the most reasonable response to Andreas Brandmaier's question.
As an example of why relying on a p value alone can lead to uncertain results, consider a researcher who flipped a coin 6 times and it came up the same way 5 of the 6 times. Under the null hypothesis the p value is p
What is the power of the test? If you have not computed the error probability of a negative result there is no basis for making inferences about it.
This is a seemingly difficult question because it is cast as a hypothesis test. The course of action is clear if the same experiment is viewed as an attempt to estimate the (possibly zero) magnitude of an effect. If the effect size was estimated precisely and was close to zero in terms of practical significance, replication would be hard to justify. If, on the other hand, the effect size was potentially large but estimated imprecisely, a follow-up could be warranted.
Several have mentioned well-known disadvantages of p-values, but there seems to be some confusion about what they ARE capable of doing. Samir argued, and Aurelio seems to agree, that replication of a null result becomes convincing evidence for the absence [of an effect], which is perhaps true, but that is not what is supposed to have happened in Andreas' question. Shiv seems to say that both findings of non-significance are "useless". But these findings are only weak; they are not useless. Michael comes right out and asserts "If I repeated exactly the same study several times and always got a p-value of 0.1, I would not conclude that it becomes more likely that the hypothesis has to be rejected", but that is exactly what we SHOULD conclude. The reason is that p-values are uniformly distributed on the interval [0,1] under the null, as Stephen pointed out, so a value of 0.1 is suggestively low, albeit only weakly so. It is unlikely that chance alone would cause many studies to all yield low values like this. Even though none may be significant by itself, they may be highly significant in combination. Again, see Lou Jost's page for the formula to compute how unlikely it is.
Finally, getting five heads when you toss a coin only five times would be a significant result (p = 0.5^5 ≈ 0.031 < 0.05).
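As a quick check of that arithmetic, assuming a fair coin and a one-sided test, in R:

binom.test(5, 5, p = 0.5, alternative = "greater")$p.value  # 0.5^5 = 0.03125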
As previously mentioned, the critical value for rejecting the null hypothesis is an arbitrary decision. From the experimenter's point of view it really only reflects how much risk we will take in making such a decision. Thus p = 0.1 means we accept a 10% "risk" of making a wrong choice.
I think the most important issue is to "see the data". I find it a bit strange (unless we are discussing the mathematical aspects of statistical analysis) to discuss an abstract p value and how to interpret it without additional information.
Is this some kind of experiment on statistical knowledge? By the way the one million dollar funding was a nice touch.
As Scott Ferson mentioned, the key point is that two INDEPENDENT tests of the same hypothesis with a medium p-value do provide evidence against the null. If the studies are independent, the probabilities of obtaining consecutive p-values of 0.1 multiply, and, under the null, such a result would be met only 0.01 of the time. Those are strong grounds for rejection of the null.
That said, the fact that the p-values remain consistently low in several studies might point to effects that are not particularly strong. In my view it is very unlikely that the effects of anything are exactly zero, and once the null is rejected the focus should shift to measuring how strong the effect is. The value of rejecting the test of no effect is for the non-believers, not for the researcher, who did not have that in mind. The problem for the "believer" is more one of point estimation than of testing. In your formulation, the question becomes which of the possibilities within the alternative space is the right one.
Exactly Timothy Wojan. Need to know the power and what effect size is meaningful in terms of rejection of the null hypothesis.
In computing, a Null value is defined as the initial value of a pointer. The pointer itself is a location in memory that may hold a value of some data type (int, char, struct, object, etc.). Initially this memory location (pointer) points to no other location in memory, which means Null; once the pointer starts to point to another pointer (memory location), it no longer holds a Null value. Putting many pointers in series, we get linked lists; the end pointer (tail) of a linked list must be Null, but in any case we will never have an iteration from Null TO Null values.
I like this interpretation of "p-value" and "null result" ! Great, you made my day :)
Show me the confidence intervals for the estimate in both cases and the pooled case and we have a pretty good place to start the discussion. Without them, the whole situation is more confusing than it needs to be. That being said, it would clearly be a logical error to say that the second finding of non-significance confirms the first!
More generally, those who are interested in properties of P-values might like to look at my paper "Two cheers for P-values" http://www.phil.vt.edu/dmayo/personal_website/SENN-Two_Cheers_Paper.pdf
@ José Ortega... hmmm, I'd say there's something not adding up here. Suppose in the second test the p-value had been 0.4, which is evidence of... essentially nothing... but by the same reasoning you used, the probability of obtaining those two results is now 0.04, which would be considerable evidence of... what? So it seems to me you cannot do the maths you're doing!
I agree with you, Tiago. You can't just multiply P-values together to get a new P-value. This is because each P-value is bound to be less than 1, so you will always reduce the product by carrying out another test. You can see this by plotting two P-values (say P1 & P2) in the P1, P2 plane. Under the null hypothesis each has a uniform distribution, so any valid combined P-value, say Pc = f(P1, P2), must have the property that it would have a uniform distribution under the null. For example, values of Pc less than 0.05 should occur 5% of the time under the null. Fisher's approach is based on the product of P-values but requires a re-calibration: the combined P-value is a function of the product, not simply equal to the product. The attached graph shows how Fisher's rule works.
And here is an alternative scheme of Tippett's for combining P-values, based on using the most extreme one.
And this is a third one based on the average Z score. That is to say transforming the P-values to a Normal deviate, averaging the deviates and using the fact that the standard error of the average is 1 over root 2.
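For concreteness, the three combination rules just described can be evaluated directly for the two observed p-values of 0.1 (an illustrative R sketch only, not the code behind the plots):

p <- c(0.1, 0.1); n <- length(p)
pchisq(-2 * sum(log(p)), df = 2 * n, lower.tail = FALSE)  # Fisher (product-based): ~0.056
1 - (1 - min(p))^n                                        # Tippett (most extreme p): ~0.19
pnorm(sum(qnorm(1 - p)) / sqrt(n), lower.tail = FALSE)    # average z score: ~0.035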
Stephen, thank you for these instructive plots. They can't all be valid, can they? Could you comment on that? At least by simulation I found that Fisher's approach gives a statistic (for the combined p-values) that is in fact chi²-distributed under H0.
Yes, they can all be valid. The trick is to divide the P1, P2 space into regions that have the required probability, and there are many ways to do this. If you compare Fisher's to Tippett's for the 5% level you will see that Fisher's is a curve that cuts Tippett's twice. This means that they would both give you significance at the 5% level for some cases, but for some Fisher's would and Tippett's wouldn't, and vice versa. The frequencies of the compensating cases are equal.
Thank you very much, Stephen, for the plots. I agree with you and with the comment of Tiago. While the probability of obtaining more extreme results on each of the tests is the product, that only identifies the rectangle from the origin to the point in Stephen's plots. The reason that other areas are added, according to different criteria, is that other combinations of p-values that are less "extreme" can be seen, under different criteria, as providing the same evidence against the null; that is why you have the different plots that Stephen provided, each according to a particular criterion.
Thanks. That was interesting.
I now attach a combined plot of the Fisher boundaries (white) and the Tippett boundaries (red). I have drastically reduced the number of contours to make it easier to read. You should find that the area to the bottom and the left of a Fisher boundary is the same size as the area to the bottom and the left of a Tippett boundary.
As pointed out by others, the original question does not contain enough information to form a good answer. While the intent is well understood, more information is needed (or must be assumed) for a precise answer.
ASSUMING a well-posed question, a well-conducted experiment with ample sample size to show the proposed difference in effect between H0 and HA, and an appropriate and well-conducted analysis in both the original and replicate experiments (both having freely available data), there is ample reason to believe (trust, accept, ...) HA over H0, if pre-determined measurement criteria are met. For many clearly defined situations one experiment and one replication is all you get (e.g., a billion dollar site cleanup). In epidemiology or testing of a new medicine, sufficient control of the subjects and/or sample size may face severe limitations. One experiment and one replication that meet pre-determined measurement criteria might be a sufficient basis to continue experimentation.
Predetermined measurement criteria are the requirement for the experiment and replication, not a search for P-values after the fact. The exercise of combining P-values can hardly add any value. It is better to combine the experiment and the replicate in an appropriate manner, and then evaluate the combined result.
The only suggestion I have is to increase the sample size to get a better P-value to sharpen your decision. The new P-value might get bigger or smaller, helping you form a firmer conclusion.
For interpretation of P-value you may visit:
Statistical Thinking for Managerial Decisions
http://home.ubalt.edu/ntsbarsh/Business-stat/opre504.htm#rmip
But do you agree, Patrice, that if you got a THIRD failure-to-reject with a p-value of (say) 0.1, that you could conclude using a meta-analysis on the three probabilities that you CAN REJECT the null hypothesis, with a probability p = 0.032 < 0.05?
First, I would not expect a legitimate granting institution to give out money with the insistence that you find a predetermined result. That sounds more like a bribe. If a pharmaceutical house issued such a grant to back a product of theirs, one would hope they'd go to jail for it. Much better is to enter into a matter just wanting to know the truth about it.
Secondly, I would think two researchers both coming up with exactly .1 would just be a coincidence.
Thirdly, if I read both reports my conclusion would be that we now had that much more evidence that the null hypothesis was correct and I would be that much more confident in holding onto that.
The phrasing of the question seems to invite mixing of two different philosophies of statistics. The question is phrased in terms of results from classical NHST affecting beliefs or confidences, but talking about beliefs and confidences is a very Bayesian way of approaching statistical reasoning. If what is wanted is an estimate of how a new result affects belief, then Bayesian methods are much more appropriate.
The many logical problems of NHST have been extensively rehearsed before (by some of the posters in this thread, among others), so perhaps this problem should be attacked from a Bayesian perspective?
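One concrete bridge between the two views is the p-value calibration of Sellke et al. (2001), already listed in Frank's reading list above: for p < 1/e, the Bayes factor against the null can be at most 1/(-e·p·ln p). A quick illustration in R:

p <- 0.1
1 / (-exp(1) * p * log(p))  # ~1.6: even p = 0.1 corresponds to at most weak evidence against H0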
This asynchronous discussion makes me feel a bit like I'm taking crazy pills, but I wanted to respond to the reply by "Deleted" [Patrice?] who did not understand what I meant by a meta-analysis of three probabilities.
The value of 0.032 comes from combining the probabilities from three separate experiments (each yielding probability 0.1) using Jost's formula at http://www.loujost.com/Statistics%20and%20Physics/Significance%20Levels/CombiningPValues.htm.
We CAN analyze these probability values, not just the underlying sample data. And indeed we SHOULD do so whenever we must judge the overall significance of a series of experiments, especially if we cannot pool the different original data sets for some reason.
Jost's formula is a kind of meta-analysis. Meta-analysis is not always about effect size; it can be about significance level too. And that's what Andreas' question seems to be asking about. (The SIZE of the effect is also important, but that was not the concern in Andreas' question as he first posed it, nor did he give us any information relevant to that issue anyway.)
Yes, this is related to the issue of multiple comparisons, but the modern formula by Jost is much better than Bonferroni's which is well known to be way too conservative.
So you absolutely can combine probabilities from different tests to get an OVERALL assessment of the probability. The formula is not quite as simple as José Ortega first suggested, but it is still quite simple. For two probabilities, the formula is just
k - k * ln(k)
where k is the product of the two probabilities. For the Andreas case, then, where k = 0.1 * 0.1 = 0.01, this yields 0.056, which is not significant but is perhaps suggestive. For the general case, here is a little R function with some sample calculations:
jost <- function(p) { k <- prod(p); j <- seq_along(p) - 1; k * sum((-log(k))^j / factorial(j)) }
jost(c(0.1, 0.1)); jost(c(0.1, 0.1, 0.1))  # 0.056 and 0.032
A reference for computing power of a test that only uses your original sample and requires stipulating an effect size is here: http://aepp.oxfordjournals.org/content/early/2014/06/12/aepp.ppu013.short?rss=1
What to Do about the “Cult of Statistical Significance”? A Renewable Fuel Application using the Neyman-Pearson Protocol.
It overcomes the problem of fallacious results that plague simple post hoc power analysis.