The point estimate is not very important, but reporting the confidence interval would at least give a hint as to whether the data allow ruling out a relevant effect.
I agree with Jochen Wilhelm and Sal Mangiafico. Especially in the case of underpowered studies, you might get a non-significant test result even though there is a considerable effect size. Or, putting it the other way around: the effect size can help you draw further conclusions from your study (design), so it is always a good idea to report it.
Just to clarify my answer in response to Jens-Steffen Scherer:
"Statistically significant" is not a result (or a proof) but an interpretation.
The interpretation is that one is of the opinion that the data at hand provides sufficient information to recognize a signal in the noise.
If one believes that the data allows one to see a signal, then one also believes (at least) in the sign of the estimate. Being able to distinguish signal from noise does not imply that the size of the estimate (its absolute value) is interpretable. If the noise is huge and the data is "significant", then the estimate will also be rather huge, but this is still the sum of a "true effect" plus an unknown amount of noise (which is likely huge!). Just stating (and interpreting) the estimate can lead us seriously astray!
Therefore I said that one should give the confidence interval (not the point estimate!). It shows us with what hypothesized effect sizes the data at hand is still "statistically compatible". If the limits of the confidence interval exceed values that are considered "relevant", then the conclusion is that the data does not rule out that the effect size may be relevant. If not, we can conclude that, although we cannot say whether the effect is positive or negative, it is so small that we don't believe it is relevant.
Example:
A small study found a non-significant effect of exposure to atmospheric NO, at concentrations reached in polluted cities, on the blood pressure of adult humans.
Scenario 1: the confidence interval is from -12 to +27 mmHg. Conclusion: NO pollution may have a very relevant effect on blood pressure, but this study has not enough data to estimate it with sufficient precision.
Scenario 2: the confidence interval is from -0.2 to +0.7 mmHg. Conclusion: NO pollution is unlikely to have any relevant effect on blood pressure.
Now consider the study had a "significant" result:
Scenario 1: the confidence interval is from +0.1 to +38 mmHg. Conclusion: NO pollution is believed to increase the blood pressure, but this study has not enough data to estimate the increase with sufficient precision. In particular, the data is not sufficient to rule out that the increase may be irrelevant.
Scenario 2: the confidence interval is from +0.1 to +0.9 mmHg. Conclusion: NO pollution is believed to increase blood pressure, but not to any relevant amount.
Scenario 3: the confidence interval is from +4.2 to +9.3 mmHg. Conclusion: NO pollution is believed to increase blood pressure by a relevant amount. The precision is high enough to say that the increase is likely to be close to about 7 mmHg.
PS: I am aware that confidence intervals are not credible intervals. Actually, the estimates should be determined from the posterior. However, often a flat prior is not too bad, and in most cases confidence intervals and credible intervals will then be numerically identical. For non-extreme priors and reasonable models the intervals will be at least quite similar, numerically. So there is often no huge problem in taking the confidence interval as a good numerical approximation of a credible interval. The same applies to the ML estimate, which can be seen as a numerical approximation of the posterior mean.
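To make that PS concrete, here is a minimal sketch in Python (made-up data; a normal mean with known sigma and a flat prior on the mean are assumed, which is the textbook case where the two intervals coincide exactly):

```python
import numpy as np
from scipy import stats

# Hypothetical measurements (made-up numbers); sigma is assumed known
rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(loc=5.0, scale=sigma, size=25)

xbar, n = x.mean(), x.size
se = sigma / np.sqrt(n)

# Frequentist 95% CI for the mean (z-interval, since sigma is known)
ci = stats.norm.interval(0.95, loc=xbar, scale=se)

# Bayesian 95% credible interval with a flat prior on the mean:
# the posterior is N(xbar, sigma^2/n), so it is literally the same computation
credible = stats.norm.interval(0.95, loc=xbar, scale=se)

print("95% confidence interval:", ci)
print("95% credible interval:  ", credible)
```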
Agree with previous responses. I would like to add that the more information you can provide to the reader (succinctly) the more valuable your research becomes (to those interested in your area of research).
First, be clear about this: the CI is some interval. It can be anywhere. It can be wide or narrow, and it can include H0 or not.
The following statements all refer to the expectations (that is, in any concrete case this may not be as described, but "in the long run" it will be true "on average"):
For a *given* effect size and a *given* variance (of the measured variable), the position of the CI is determined by the effect size, and its width decreases with the sample size.
The CI includes H0 when its width is "large" compared to its distance from H0. The p-value is related to the ratio of the CI-width and its position relative to H0. If the limit of the (1-a)-CI just falls on H0, the p-value is just a. If the p-value is u, then the limit of the (1-u)-CI just falls on H0.
Since the p-value refers (so to say) to a ratio of the position and the width (of the CI), it provides neither information about the position (effect size) nor about the width (precision).
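A small numerical illustration of that duality between the p-value and the confidence level at which the interval just touches H0 (Python sketch, one-sample t-test on made-up data):

```python
import numpy as np
from scipy import stats

# Made-up measurements, testing H0: mu = 0
rng = np.random.default_rng(7)
x = rng.normal(loc=0.8, scale=2.0, size=15)

t_stat, p = stats.ttest_1samp(x, popmean=0.0)

# Construct the (1 - p) confidence interval: one of its limits falls exactly on H0
n = x.size
se = x.std(ddof=1) / np.sqrt(n)
tcrit = stats.t.ppf(1 - p / 2, df=n - 1)   # equals |t_stat| by construction
lo, hi = x.mean() - tcrit * se, x.mean() + tcrit * se

print(f"p = {p:.4f}")
print(f"(1 - p) CI: [{lo:.4f}, {hi:.4f}]   <- one limit sits exactly on 0")
```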
I should point out that the term "effect size" may be being used in two different ways in this thread.
I had assumed that the original post was asking about effect size statistics. That is, if you are comparing two means with a t-test you might report Cohen's d. If you are comparing two samples with Wilcoxon-Mann-Whitney, you might report Vargha and Delaney's A (VDA).
It is clear that Jochen Wilhelm is referring to the effect size as, e.g., when comparing two means, the difference in the means.
Reporting either of these can be useful, and they sometimes convey different information, so reporting both can be useful. For example, when using a Mann-Whitney test, the effect size statistic VDA refers to the probability that an observation in one group is greater than an observation in the other group. This is an appropriate effect size statistic for this test. But for practical reasons, the size of the effect might be reported as the difference in medians, or this difference reported as a percent.
I think it's relatively rare to report the confidence interval with an effect size statistic, though it is a good idea.
An example. (And also a caution in response to the reply by @Jens-Steffen Scherer.)
If we compare by t-test,
(1,2,3,4,5,6,7,8,9,10) and (4,5,6,7,8,9,10,11,12,13)
n = 10 per group
p = 0.04
Difference in means = 3 (95% CI = c. 0.16 to 5.8)
Cohen's d = 1, which would be interpreted as "large"
But note that the 95% confidence interval of this statistic is c. 0.1 to 2. That would be interpreted as "smaller than small" to [really quite] "large".
In this case, if we don't include confidence intervals, we might jump to an incorrect conclusion. Presumably a difference in means of 3 may be meaningful, but at the end of the confidence interval, would a difference in means of 0.16 be meaningful? [Since there's no context in this example, we don't know, but you get my point.] Likewise, the point estimate for Cohen's d of 1 suggests a large effect. (Here, the difference in means is about 1 pooled standard deviation.) But at the end of the confidence interval, a Cohen's d of 0.1 probably isn't too exciting (a mean difference of 0.1 standard deviations).
The p-value and the point estimates tell one story: a significant effect with a large effect size. But the confidence intervals urge caution!
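For anyone who wants to reproduce these numbers, here is a quick sketch in Python (the CI for Cohen's d uses a common normal approximation, so it will only roughly match the "c. 0.1 to 2" quoted above):

```python
import numpy as np
from scipy import stats

g1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
g2 = np.array([4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
n1, n2 = len(g1), len(g2)

t_stat, p = stats.ttest_ind(g1, g2, equal_var=True)
print(f"p = {p:.3f}")                                   # ~ 0.04

diff = g2.mean() - g1.mean()                            # difference in means = 3
sp = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2))
se_diff = sp * np.sqrt(1 / n1 + 1 / n2)
tcrit = stats.t.ppf(0.975, df=n1 + n2 - 2)
print(f"diff = {diff:.2f}, 95% CI = [{diff - tcrit * se_diff:.2f}, {diff + tcrit * se_diff:.2f}]")

d = diff / sp                                           # Cohen's d ~ 1
se_d = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))   # normal approximation
print(f"d = {d:.2f}, approx. 95% CI = [{d - 1.96 * se_d:.2f}, {d + 1.96 * se_d:.2f}]")
```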
Joseph L Alvarez , I'm not sure if you are making some kind of analogy, but as far as I know, the figure you included has nothing to do with p-values, confidence intervals, Ho, or Ha.
The p-value, confidence intervals, and H0 testing are frequentist concepts. The frequentist concept is that the system is closed and repeated measures will tend to the average, H0. An effect, HA, will be different from the average. The p-value is a difference from the average that is unusual. The attached control chart shows the average and confidence intervals. The average is H0. Assume the p-value as the upper line. Repeated sampling will find a value that is 'significant', or above the line. A value minimally above the line is simply unusual and hardly worthy of reporting. Any value below the line is not worth reporting.
, you got that correct: large effect and large noise gives a large CI that is far enough away from H0 to have a low p-value.
[EDITED] If the effect equals H0, then for n->Inf the width of the CI will approach 0, and so will its distance to H0. These effects (smaller width but also smaller distance) will just cancel out in a way that makes the p-values have a uniform distribution.
If a particular p-value for some particular sample of size n is just 1, it only means that the sample statistic of this particular sample happened to hit H0 exactly. Just as a low p-value means that the sample statistic of this particular sample happened to be far away from H0 (relative to the variance of the sample statistic).
@Nina I am confused by your remarks and the answer by @Jochen. I believe we have a problem with definition. This problem often arises with the usual language used for H0 and p-values.
The question concerned p-values and effect size. Assume we are measuring a possible difference from a known value A with standard deviation s. We obtain a value B. When B is sufficiently different from A, we suspect that B is indicative of an effect. We wish to know the effect size d.
d = (B-A)/s
Do we report d if the p-value of B is not significant? (The question was not clear as to the meaning of effect size.)
The usual null hypothesis language would be H0 equals A. The frequentist assumption is that n measurements of A would result in a series of values as in the graph I provided in an earlier answer. Significance is reached when a measured value B exceeds a line on the graph that has a low probability for the value A.
You discussed the possibility of having a large effect B and a large CI of A. The measurement of B should not broaden the CI of A, because we assume B to be different for the comparison. If we reject B as different then it must be included in the new CI of A.
A p-value = 1 means that every measurement of A yields exactly A. In your measurement system this is the frequentist definition of true value. Nevertheless, it does not ensure you have accounted for all sources of uncertainty.
"Assume we are measuring a possible difference from a known value A"... "H0 equals A", ...
Now you say "that n measurements of A would result..." and "... every measurement of A yields exactly A" - but if A is a known (or an assumed, hypothesized) value, it has no CI. Only B is estimated from some data, and for this estimate one gets a CI.
Further you write "... and a large CI of A". - again, it's not A that has a CI but the estimate of B (or the estimate of (B-A), which has the same CI, only shifted by A).
A p-value of 1 means that the sample estimate of B equals A.
"Assume we are measuring a possible difference from a known value A"... "H0 equals A", ...
Correct or incorrect?
Further, you state that I said
Now you say "that n measurements of A would result..." and "... every measurement of A yields exactly A" - but if A is a known (or an assumed, hypothesized) value, it has no CI. Only B is estimated from some data, and for this estimate one gets a CI.
I did not say that.
A p-value of 1 means that the sample estimate of B equals A
I did say that.
Then you say
Further you write "... and a large CI of A". - again, it's not A that has a CI but the estimate of B (or the estimate of (B-A), which has the same CI, only shifted by A).
Now this is totally confusing. How can you not have a variance in A? Otherwise, you cannot have a p-value.
I was hoping you would respond in your usual informative way. I feel this is not well intended.
"Assume we are measuring a possible difference from a known value A"... "H0 equals A", ...
-> correct. If we measure a difference D from a known value A, we may rephrase your H0 "H0 equals A" as: "H0: D=0".
A p-value of 1 means that the sample estimate of B equals A
I certainly agree here. But I am really confused when you write that A is measured. You later wrote that A has a CI, which adds to this confusion.
How can you not have a variance in A? Otherwise, you cannot have a p-value.
If A is known, we have its value. It's a fixed constant value. What we don't know is whether our measurements will scatter around this known value. A sample of such measurements has a variance.
The reason for some of the confusion may be that A may not be known (as you wrote) but rather an assumption (about the expected value of the random variable describing the measurements). The assumption is also a fixed, given value, used as a "benchmark value" to test the measurements against (and to get the p-value). So it does not matter if A is known or not. But in any case, A is a fixed, constant, given value.
I did not mean your question was off topic, but that it was a variation of the same. Nevertheless, the original question required a specific answer.
I do not agree that the average height, A, of a group of 6th graders is known without a CI. The heights have a distribution and this distribution has a CI. More importantly, A has measurement errors. If you repeat the height measurements, say, 20 times, the average A will approach the 'true' value with a CI based on the SE.
Jochen Wilhelm
A is a fixed, constant, given value? The confusion continues.
Please give an example of such an A*.
We want to know the difference between A and B. A and B are measured values. We use the uncertainty in A and B to decide if d is 'significant.' Am I confused in thinking this is where we establish p-values? When is it proper to assume A is a fixed, constant, given value and use it as a benchmark?
*Our best known constants have a CI. (Except the speed of light, which is defined.) The kilogram and meter are exact in their environment in Paris, but any transfer has a CI.
I am citing you: "Assume we are measuring a possible difference from a known value A"... "H0 equals A", ...
Here you wrote that A is a known value, and that A is the value you take your Null hypothesis to be.
Do not confuse mathematical constants and physical constants. The meter and the kg are being redefined in terms of counting processes. That's very interesting, but off-topic here. Avogadro's number is another ultimately defined physical constant.
A point hypothesis is a (mathematical) constant. This is in contrast to a random variable, which is a function that returns a value, but where we don't know precisely what value it will be. We can only make probability statements (most of the time, but not necessarily always!).
An example?
I have a solution and want to determine the difference of its pH from 7.0. Here A = 7.0. That's what is given. I take a sample of measurements. They vary. Not a single measurement hits 7.0000... exactly. Some are higher, some are lower. Maybe more often I get a measurement that is lower than 7.0. Is my data sufficient to conclude that the expected value of my measurements is lower than 7.0? I can "benchmark" my values against the hypothesis that the expected value is 7.0. Under this hypothesis (and a distributional assumption) I can calculate a p-value for getting measurements that are on average at least as far below 7.0 as the values of my actual sample, assuming E(B) = A, where B is a random variable for the measurements with an adequate probability distribution that is in line with my knowledge about such measurements.
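As a concrete sketch of this pH example in Python (the readings are made up; the one-sided one-sample t-test against 7.0 is the "benchmarking" described above):

```python
import numpy as np
from scipy import stats

# Hypothetical pH readings that scatter around, and mostly below, 7.0
ph = np.array([6.92, 6.97, 7.01, 6.88, 6.95, 6.99, 6.91, 6.96])

# H0: E(B) = A = 7.0, tested one-sided (are the readings on average below 7.0?)
t_stat, p = stats.ttest_1samp(ph, popmean=7.0, alternative='less')
print(f"mean = {ph.mean():.3f}, t = {t_stat:.2f}, one-sided p = {p:.4f}")
```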
Mea culpa. I said "known value" when I meant estimated or measured. Had the entire quote been included
"Assume we are measuring a possible difference from a known value A with standard deviation s"
it would be clear that I meant an estimated or measured value. I would have appreciated having that error in terminology pointed out. Is there a problem with estimated or measured in place of known?
Please let me know what is incorrect in my first answer to this discussion. I recognize your expertise and am willing to learn and correct my answer.
We learn from each other. Don't assume I am always correct ;)
I have no problem with the "measured difference". This measured difference is a quantity we don't know and that can turn out to be different in each trial. This difference can be handled mathematically via a random variable, and we can calculate sample statistics as estimates for properties of this random variable (like the expectation). We can assign confidence intervals to such estimates.
However, this difference has to be the difference from something. And this something may be a given value. This reference is given, fixed. It does not need to make sense in reality. But mathematicalyl, to describe and handle the problem, we take this as a fixed reference point to which we determine the difference. You called this reference value "A". This is how I read your post. This smothly translates into a hypothesis-testing problem. This value "A" could be seen as being a hypothesized expected value of the random variable we use to describe the measurements, which you seemingly called "B". So B is a random variable and A is a reference value. One can test the hypothesis that the expected difference to A is zero (H0: E(B-A) = 0), what is identical to testing that the expected value of B is just A (H0: E(B) = A).
Mostly, pleople have two groups of values, and they are interested in the expected difference between these groups. To not confuse this with the previous case I will use new letters. Let G and H be the two groups, or, more correctly, the ransom variables describing what we measure from these groups. A reasonable hypothesis to test is now H0: E(G-H) = 0, what is identical to H0: E(G)=E(H). Here, the fixed reference value is just 0 (we may test other values, but most often 0 makes pretty much sense, practically). This value 0 is a fixed constant. To map it to your example, this would be A = 0, and B = G-H. Here, B is a random variable that describes the difference between two other random variables. We want a p-value and a confidence interval for sample means of B, that is for sample means of G-H. We do not really need confidence intervals for sample means of G nor for H. This may be a source of confusion: many people do calculate confidence intervals (or standard errors) for G and H rather than for G-H, but test G-H.
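A short sketch of that last point (Python, made-up data, two independent groups assumed): the interval that matches the test is the one for the difference of the group means, not the two per-group intervals.

```python
import numpy as np
from scipy import stats

def mean_ci(x, level=0.95):
    """t-based confidence interval for the mean of one sample."""
    m = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))
    tcrit = stats.t.ppf((1 + level) / 2, df=len(x) - 1)
    return m - tcrit * se, m + tcrit * se

rng = np.random.default_rng(3)
G = rng.normal(10.0, 2.0, size=20)   # made-up group G
H = rng.normal(11.0, 2.0, size=20)   # made-up group H

print("CI for E(G):       ", mean_ci(G))
print("CI for E(H):       ", mean_ci(H))

# The interval that corresponds to testing H0: E(G - H) = 0 (pooled-variance version)
n1, n2 = len(G), len(H)
sp = np.sqrt(((n1 - 1) * G.var(ddof=1) + (n2 - 1) * H.var(ddof=1)) / (n1 + n2 - 2))
se_diff = sp * np.sqrt(1 / n1 + 1 / n2)
diff = G.mean() - H.mean()
tcrit = stats.t.ppf(0.975, df=n1 + n2 - 2)
print("CI for E(G) - E(H):", (diff - tcrit * se_diff, diff + tcrit * se_diff))
```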
I once read an article that stated in the introduction, "H0 is ..." In the conclusion was a statement along the lines of "This shows ... at p=0.47." I could not discern how the conclusion related to H0, even after reading the document several times and attempting, from the information provided, to find p=0.47 or what it could possibly mean. This may be an extreme example, but it illustrates one of the problems with NHST and even with the step from H0 to p-value. There seem to be too many ways to express the difference between values or the difference from a benchmark, as your answer demonstrates. Much of this is misunderstanding by those performing the analysis.
I used H0 in my first answer because it was already in the discussion. While it is tidy in basic theory to express concepts mathematically with generic terms and call the base case H0, it is not necessary in practice to resort to pure formalism. What is necessary is to state what you are doing and what is necessary to make a conclusion.
A is a baseline value with an average and standard deviation (say, background signal)
B is an experimental measurement with a standard deviation
Is D=B-A a difference that can be interpreted as a signal?
D is a usable signal if the uncertainty in D is less than 0.1 D. (Smaller Ds will be more uncertain. Ds with uncertainty greater than 0.6 D will not receive further investigation.)
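If I read this rule correctly, a minimal sketch would be the following (Python; the 0.1 D and 0.6 D thresholds are as stated above, while the example numbers and the Gaussian error propagation for D = B - A are my assumptions):

```python
import numpy as np

# Made-up example values
A, sA = 12.0, 1.5      # baseline (e.g. background signal) and its standard deviation
B, sB = 25.0, 2.0      # experimental measurement and its standard deviation

D = B - A
sD = np.sqrt(sA**2 + sB**2)   # uncertainty of the difference (independent errors assumed)

if sD < 0.1 * D:
    verdict = "usable signal (uncertainty < 0.1 D)"
elif sD > 0.6 * D:
    verdict = "not worth further investigation (uncertainty > 0.6 D)"
else:
    verdict = "borderline: between 0.1 D and 0.6 D"

print(f"D = {D:.2f}, uncertainty = {sD:.2f} -> {verdict}")
```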
"The point estimate is not very important, but reporting the confidence interval would at least give a hint if the data allows ruling out a relevant effect."
Didn't you read it, are you not satisfied with it, or do you just not like this answer?
, yes, the slope in such an ANOVA model is the mean difference, and if you get it and its CI, then this is just what you can "happily live with" :)
Not all software gives "slope" estimates for ANOVA models (e.g. they are not part of the standard ANOVA tables), and in more complicated models, the meaning of the slope coefficients also depends on the coding scheme*.
If the software is not that handy wrt ANOVA models, an easy** work-around is to manually code the groups (0 for the reference group and 1 for the treatment/experimental group) and do a (multiple) regression.
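A minimal sketch of that work-around in Python (made-up data): with 0/1 coding, the regression "slope" is exactly the difference in group means, and its CI is the CI for that mean difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control   = rng.normal(10.0, 2.0, size=15)   # made-up reference group
treatment = rng.normal(12.0, 2.0, size=15)   # made-up experimental group

y = np.concatenate([control, treatment])
x = np.concatenate([np.zeros(len(control)), np.ones(len(treatment))])  # 0/1 coding

res = stats.linregress(x, y)
df = len(y) - 2
tcrit = stats.t.ppf(0.975, df)
print(f"slope = {res.slope:.3f}  (difference in means = {treatment.mean() - control.mean():.3f})")
print(f"95% CI for the slope: [{res.slope - tcrit * res.stderr:.3f}, {res.slope + tcrit * res.stderr:.3f}]")
```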
Perhaps some of the confusion arises because a P-value is based on a ratio - the effect size divided by its standard error. So it conflates the effect size and the precision with which it has been measured. It's based on a signal-to-noise ratio – the effect size is the 'signal' in the data and the SE is the noise-generating potential of the data.
The noise-generating potential of the data is also measured using a ratio (so, again, it is scaleless) which is the amount of variation in the data divided by the amount of information in the data (measured by sqrt(n)).
If we unpack these three constructs – effect size, data variability, amount of information – we can see the complex relationship between effect size and statistical significance a little more clearly.
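To spell out those three constructs with a toy one-sample example (Python, made-up data; the decomposition t = effect / (variability / sqrt(n)) is just the usual one-sample t statistic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=2.0, size=30)   # made-up data, testing H0: mu = 0

effect = x.mean()                 # the 'signal'
variability = x.std(ddof=1)       # amount of variation in the data
information = np.sqrt(len(x))     # amount of information, sqrt(n)

se = variability / information    # noise-generating potential of the data
t_stat = effect / se              # signal-to-noise ratio
p = 2 * stats.t.sf(abs(t_stat), df=len(x) - 1)

print(f"effect = {effect:.3f}, SE = {se:.3f}, t = {t_stat:.2f}, p = {p:.4f}")
```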
I think it depends. If you are testing a variable and it turns out not to be statistically significant, reporting it might still be useful for other researchers when formulating their future hypotheses and expectations. In addition, it is sometimes very important to mention such a "non statistically significant" effect to show other researchers that an expected relationship between certain variables does not hold true in all circumstances and study areas.
If the variable is not statistically significant, I think you do not need to report the effect size, because there is not enough evidence to support the estimated coefficient; even the sign can be totally inverted.
Patrik Silva: "... to show other researchers that an expected relationship between certain variables does not hold true [in all circumstances and study areas]"
That would mean to interpret "non-significance" as "absence of an effect". That's wrong. Non-significance just means that the data was inconclusive to judge the direction (sign) of the effect/relationship. Claiming "no (or irrelevant) relationship" because the data failed to be significant is like claiming quantum physics is wrong because my daughter failed to explain it to me clearly enough.
Jochen Wilhelm, I agree with you, maybe I was not clear in my statement.
What I wanted to say is that, if the data was inconclusive, it also means that the expected relationship cannot be demonstrated in all circumstances/study areas. I did not want to say that a non-significant variable means the negation of some kind of relationship.
Yes, give means and the p-values as well. You can talk about statistic size (e.g. percentages). Significance also depends on the variability of the experimental units, e.g. p
If I may add a further comment: research is also about increasing our baseline knowledge of any research field. Based on the cumulative VARIABILITY knowledge, new experiments can be better designed, e.g. the number of replicates needed (a non-significant p-value with n1 d.f. may be significant if there were more replicates, which would lead to higher residual d.f., etc.).
!!! Whatever scientific knowledge you generate, let others know it !!!
A p-value combines two things: an estimated effect size and the uncertainty of the estimate. So a non-significant p-value could be either
A trivial effect size measured with high precision or
An important estimated effect size but measured with low precision
The latter case isn't evidence of absence, but an absence of evidence. It is therefore very important to report effect sizes and their confidence intervals, much more than p-values.
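A small illustration of those two cases from summary numbers (Python; all values are made up so that both scenarios give a similar, non-significant p-value but very different confidence intervals):

```python
from scipy import stats

scenarios = [
    ("trivial effect, high precision", 0.03, 0.02, 1000),
    ("large effect, low precision",    3.00, 2.00,   10),
]

for label, effect, se, n in scenarios:
    t_stat = effect / se
    p = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    tcrit = stats.t.ppf(0.975, df=n - 1)
    lo, hi = effect - tcrit * se, effect + tcrit * se
    print(f"{label}: effect = {effect}, 95% CI = [{lo:.3f}, {hi:.3f}], p = {p:.3f}")
```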
Group 1: (n = 71, mean = 4.13, SD = 0.07) Group 2: (n = 522, mean = 4.17, SD = 0.03). We calculated Cohen's d for this non-significant result between the two groups, using the independent samples t-test. We got d = 1.08.
How can this result be made sense of? Or: Is something wrong here?
Jochen Wilhelm, how do you make sense of an effect size of 1.08 when the difference between the groups is not significant? Could you say that this means that it is strongly/very unlikely that the two groups are different?
What do you mean with "not significant"? Statistically not significant?
The sample size is huge (593), the mean difference is 0.04 with a standard error of 0.0015*. The test statistic is t = 0.04/0.0015 = 26.2; this is definitely statistically significant: yes, there is sufficient information provided by the observed data, relative to the statistical model,** to interpret the sign of the mean difference. Whether this is relevant is a subject-matter question, not a statistical question.
---
* se = sqrt( (71*0.07^2+522*0.03^2)/(522+71)^2 )
** I don't know anything about the model here. I wildly assume that it's a simple t-test kind of thing.
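For the summary statistics above, here is a sketch of the standard calculations in Python (I use the usual pooled-SD Cohen's d and a Welch t-test from summary data; the SE formula differs from the one in the footnote, so the t value is not the same, but the conclusion of a clearly significant difference is unchanged):

```python
import numpy as np
from scipy import stats

# Summary statistics from the question above
n1, m1, s1 = 71, 4.13, 0.07
n2, m2, s2 = 522, 4.17, 0.03

# Cohen's d with the pooled standard deviation
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (m2 - m1) / sp
print(f"Cohen's d = {d:.2f}")          # ~ 1.08

# Welch t-test from summary data (no equal-variance assumption)
t_stat, p = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)
print(f"t = {t_stat:.1f}, p = {p:.2g}")
```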
Åsta, effect size is meaningless if the results are not significant. You have merely shown that your data does not provide evidence for a difference between groups. Among the reasons I can think of to present negative data - and it is a good one - is that the results of such studies are very, very common, and not reporting them biases our sense of what we can and cannot show using the data we have. Another very good reason to do so is if you have a lot of data with good power and you find no significant effect when one is definitely expected by many scientists. For example, we showed that among 18 genetic polymorphisms that were believed to affect risk of coronary artery disease, only one had a result that was only marginally significant when uncorrected for multiple tests. And that is precisely the random result one might obtain when none of the polymorphisms have a real effect. This negative result with thousands of observations was published because the geneticists and coronary artery disease specialists thought that the effect should manifest. But it didn't. This means that the risk of CAD is much more complicated than these 18 polymorphisms.
Åsta Haukås , a Cohen's d of 1 suggests that the difference in means between the groups is equal to 1 standard deviation. Usually this is a pretty notable difference.
In the attached figures, means are ~ 10 and 11 and the sd's are ~ 1.
Jacob Cohen * lists Cohen's d ≥ 0.8 as "large". He notes that the difference in height between 13-year-old girls and 18-year-old women yields a d of 0.8. [I didn't confirm this.]
But of course these kind of interpretations are dependent on the field, specific study, and what is practically meaningful.
* in Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd Edition. Routledge.
It's up to your interest. If you are more interested in the direction of association (positive/negative effect size), you may report it and further indicate that more data would be needed to test the hypothesis.
Well, step back from the significance: test statistics are RATIOS of the "extra variance" caused by the TREATMENT relative to the "error or residual variance" at a specific geo-position, time, and under that locality's environment. Our planet Earth's rotation around its own axis every 24 hours, together with its yearly path around the Sun, creates much of the background variation which may affect the denominator in your test statistic and hence the significance probability. Anyhow, all pieces of EXPERIMENTAL evidence add to our knowledge baseline, which in turn will help us do better in the design of further studies. Human knowledge is still minute when compared to nature, the solar system and the whole cosmos. So tell what you know...
Patrina Bevan, whatever logic you follow for reporting an effect size statistic for a parametric test, you would follow the same logic for reporting for a nonparametric test. ... That being said, there's some disagreement in this thread as to whether you should report an effect size statistic if the hypothesis test is not significant. ... I vote for reporting the effect size statistic. An effect size statistic simply gives different information than does a hypothesis test. There's no reason not to report both. ... But also, in cases where there is a large effect size, but the hypothesis test is not significant, this gives some suggestion that the effect might be meaningful, but that, for example, the sample size was too small or the data too variable to achieve a significant hypothesis test. ... It is helpful to report a confidence interval for the effect size statistic; this gives more information than just the point estimate for the effect size statistic.