I estimated the percent tree cover for two fields and obtained, as an example, the following values (95% CI): 13.67% (11.93-15.41%) and 9.27% (7.80-10.74%). As the 95% CIs do not overlap, can I say they are statistically different?
Yes, assuming you checked the assumptions and the computations are correct. Very good. As an example, see the Kaplan-Meier plot discussion in the attached. Best, D. Booth.
Here is some Stata code for Daniel Wright's example, which might make it a bit easier for others to copy the data.
clear
input byte grp y
1 1.2027143
1 1.1468594
2 2.5887360
2 1.6287144
2 0.6139361
2 1.1071487
2 0.9162051
2 3.5465277
2 0.1255809
2 3.5514520
2 3.1861957
2 2.9446761
2 1.5751574
2 2.2409351
2 1.7276551
2 1.5953351
2 2.5484082
2 3.3454902
2 2.2039619
2 2.2214461
end
* Student's t-test with pooled variances (Stata's default)
ttest y, by(grp)
It gives the same results Daniel posted. However, I would add that in this particular case, I would never use the pooled-variance version of the t-test, because the sample sizes are very far from equal (2 vs 18) and the sample variances are extremely heterogeneous. (As Dave Howell used to put it, heterogeneity of variance and unequal sample sizes don't mix.)
. display "Ratio of variances = " r(sd_2)^2 / r(sd_1)^2
Ratio of variances = 660.41091
Using Satterthwaite's t-test (i.e., with the -unequal- option in Stata), I get these results:
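(Bruce's Stata output is not reproduced in the thread. For readers working in R rather than Stata, a rough equivalent of the two tests on the same data would look something like the sketch below; the object names y1 and y2 are just illustrative, and the output is not shown here.)
y1 <- c(1.2027143, 1.1468594)
y2 <- c(2.5887360, 1.6287144, 0.6139361, 1.1071487, 0.9162051, 3.5465277,
        0.1255809, 3.5514520, 3.1861957, 2.9446761, 1.5751574, 2.2409351,
        1.7276551, 1.5953351, 2.5484082, 3.3454902, 2.2039619, 2.2214461)
t.test(y1, y2, var.equal = TRUE)   # Student's pooled-variance t-test
t.test(y1, y2, var.equal = FALSE)  # Welch/Satterthwaite test (Stata's -unequal- option)
var(y2) / var(y1)                  # ratio of sample variances (about 660, as shown above)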
Bruce, this is a perfect example of why tests of assumptions (even if it is just looking at the sample, not a formal significance test) can lead us astray. The samples were produced under true variance homogeneity. Having different observed sample variances is definitely just a sampling issue (the good thing here is: we know that).
The more relevant "problem" with this example is, in my opinion, the difference in sample size (2:18), which is quite extreme if the task is to evaluate the difference between two separate statistical populations.
I did a simulation with Daniel's example (see R code below) and observed the following results:
For the unequal sample sizes, Welch does NOT attain the nominal significance level of 5% (but Student does). The power of both tests (for d = 1) is similar at about 0.22, slightly better for Student. The probability of non-overlapping CIs is about 0.01.
For equal sample sizes, both tests attain the nominal significance level of 5%. The power of both tests is again comparable but clearly higher than for unequal sample sizes (0.58). The probability of non-overlapping CIs is also clearly higher than for unequal sample sizes (0.22). According to this, there was no need for Daniel to deliberately force different sample sizes.
I hope I did not make any major mistakes, but I would be grateful if any are pointed out.
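(Jochen's R code is not included above. Purely as an illustration of what such a simulation could look like, here is a minimal sketch under assumed settings: normally distributed data with sd = 1 in both groups, a mean shift d, and alpha = 0.05. The function name sim and the exact settings are my assumptions and need not match Jochen's setup.)
sim <- function(n1, n2, d = 0, nrep = 10000, alpha = 0.05) {
  res <- replicate(nrep, {
    x <- rnorm(n1); y <- rnorm(n2, mean = d)
    ci_x <- t.test(x)$conf.int; ci_y <- t.test(y)$conf.int
    c(student    = t.test(x, y, var.equal = TRUE)$p.value < alpha,
      welch      = t.test(x, y, var.equal = FALSE)$p.value < alpha,
      nonoverlap = ci_x[2] < ci_y[1] || ci_y[2] < ci_x[1])
  })
  rowMeans(res)  # rejection rates and proportion of non-overlapping CIs
}
sim(2, 18, d = 0)   # unequal sample sizes, H0 true
sim(2, 18, d = 1)   # unequal sample sizes, H0 false
sim(10, 10, d = 1)  # equal sample sizes, H0 false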
Concerning Daniel Wright's example, Jochen Wilhelm wrote, "The samples were produced under true variance homogeneity."
Fair point, Jochen. I'm not a (regular) useR, and I didn't study the code carefully enough the first time to catch that. Thank you. I certainly agree that the homogeneity of variance assumption is about population variances, not sample variances. But as you also hinted, when working with real data, we're usually not in a position to know whether the population variances are exactly equal or not.
As Jochen Wilhelm shows, other samples produce the same finding. I had just looped through random numbers until I got a case where the CIs do not overlap but Student's t-test is not significant. I took the first such case, since I only wanted to show that it occurs.
Well, I did some more simulations to compare the performance of Student's test and Welch's test and wonder if the results are correct. Please have a look at the pictures. The plots show the sorted p-values (the lowest 500 out of 5000 simulations). Under H0, the p-values should be uniformly distributed, which would result in all points lying on the Y = X line (shown in the plots). Under not-H0, the points would lie on a curve below that line. The flatter (less steep) this curve initially is, the more small p-values there are and the better the power of the test. I indicated the p < 0.05 condition by a horizontal line. The more of a curve lies below that line, the higher the power at alpha = 0.05. Note that the two right-hand diagrams in the upper row are actually identical (swapping the sample sizes does not matter when the variances are equal).
I tested several combinations of sample-size imbalance and heteroscedasticity. The total sample size is always 20, the "non-H0" effect is 0.5, and the variances are as indicated in the figures.
When H0 is true, the lines are all expected to be on the Y = X line. This is the case for both tests when the sample sizes are equal, no matter how the variances differ. When the sample sizes are not equal but the variances are, the p-values from Welch are systematically too low. This is much more severe for Student's test when the small group is the one with the large variance. If the large group is the one with the large variance, Welch's test behaves well and Student's test has too-large p-values.
When H0 is false, one can get an impression of the power under the different conditions. For small effects, the results are heavily influenced by the fact that the tests do not always hold the nominal significance level. For larger effects, both tests perform similarly when either the sample sizes or the variances (or both) are similar. There are only drastic differences when the sample sizes as well as the variances are different. When the smaller sample has the larger variance, Student's test drastically outperforms Welch's test. In the opposite case, it's the other way around. But these findings are hard to interpret, as the significance levels are not held by either test under these conditions. However, the Welch test is closer to the nominal significance level and may therefore give a less biased picture overall.
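(The pictures are not reproduced here. As a sketch of how such sorted-p-value plots can be produced, here is one assumed condition: total n = 20 split 2 vs 18, the small group with the larger variance, H0 true. The standard deviations used are assumptions, not necessarily the author's settings.)
nrep <- 5000
pvals <- replicate(nrep, {
  x <- rnorm(2, sd = 3); y <- rnorm(18, sd = 1)    # assumed condition, H0 true
  c(student = t.test(x, y, var.equal = TRUE)$p.value,
    welch   = t.test(x, y, var.equal = FALSE)$p.value)
})
idx <- 1:500                                       # lowest 500 of the 5000 p-values
plot(idx / nrep, sort(pvals["student", ])[idx], type = "l",
     xlab = "expected quantile", ylab = "sorted p-value")
lines(idx / nrep, sort(pvals["welch", ])[idx], lty = 2)
abline(0, 1, col = "grey")                         # uniform reference line Y = X
abline(h = 0.05, lty = 3)                          # alpha = 0.05
legend("topleft", c("Student", "Welch"), lty = 1:2)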
I certainly agree, Bruce, that "we're usually not in a position to know whether the population variances are exactly equal or not". However, it doesn't really matter whether the variances are exactly equal. Variances are a concern only if they are considerably different, and if they really are, we should ask why and model that appropriately (instead of taking this fact as an unwanted complication).
The discussion is interesting to this point, but the question was not answered. The discussion shows the problems with significance testing, particularly when tests are used without understanding the relationship of the test to the intent of the investigation. The testing is further restricted by the assumptions made for the study and in applying the test.
The question regards the difference in tree cover of two fields. One field has more tree cover than the other. The 95% uncertainty bounds of each tree-cover estimate do not overlap. The difference is 4.40% tree cover with an estimated propagated uncertainty of about 1% tree cover. So, is this significant? Statistical significance is the result of a statistical test that meets a criterion. No test or criterion was applied. The result is not statistically different, because there was no statistical test. The result has a relative uncertainty of about 25%. Is this a small enough uncertainty to form a conclusion? Is it good enough to warrant further study? Everything depends on the question asked. What can a difference of 4% in tree cover mean in the context of the question?
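(To make the "about 1%" figure concrete: if the reported intervals are symmetric normal-approximation CIs, their half-widths can be converted to standard errors and combined. A sketch in R, assuming independent estimates and a 1.96 multiplier:)
se1 <- (15.41 - 11.93) / 2 / 1.96        # about 0.89 percentage points
se2 <- (10.74 - 7.80) / 2 / 1.96         # about 0.75 percentage points
diff    <- 13.67 - 9.27                  # 4.40 percentage points
se_diff <- sqrt(se1^2 + se2^2)           # about 1.2 percentage points
c(diff = diff, se = se_diff,
  ci_low = diff - 1.96 * se_diff, ci_high = diff + 1.96 * se_diff)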
Correct. And it's hard to answer that, since the sampling scheme and the calculation of the CIs are not explained. It's strange that the CIs are perfectly symmetric around the point estimates given. Given that these are percentages, I would expect the CIs to be asymmetric (technically, they can be symmetric, as long as their construction rule ensures the desired confidence level; but under the reasonable assumption of a beta distribution for proportions, and constructing CIs as the inversion of hypothesis tests, they should be asymmetric).
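(For illustration only, with a made-up sample size, since the actual sampling scheme is unknown: a symmetric Wald interval versus an asymmetric Clopper-Pearson interval, the latter based on beta quantiles, for a proportion near 13.7%.)
x <- 137; n <- 1002                       # hypothetical counts, not from the question
phat <- x / n
se <- sqrt(phat * (1 - phat) / n)
wald <- phat + c(-1, 1) * 1.96 * se       # symmetric normal-approximation CI
cp <- binom.test(x, n)$conf.int           # asymmetric Clopper-Pearson ("exact") CI
round(100 * rbind(Wald = wald, ClopperPearson = as.numeric(cp)), 2)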
I agree with Joseph L Alvarez, ultimately any question on whether something is 'statistically significant' is indicative of the fundamental incongruities and flaws inherent to null hypothesis significance testing. This has been argued convincingly in papers by Sander Greenland, Gerd Gigerenzer, among others.
Jochen Wilhelm, while the analysis is interesting, I don't understand the goal in terms of context. The Welch and Student methods of arriving at a t-statistic are basically just alternative ways of dividing the difference between sample means by an error term.
A resampling/random number generator approach to looking at these formulas will have an entirely deterministic outcome. Indeed, you could probably just describe the behavior of the error functions geometrically, based on the metrics from your R script, without including the means at all (the shared numerator in the formulas), but just using the code below:
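(The code referred to is not shown in the thread. As a sketch of the two error terms in question, using the usual textbook formulas:)
student_se <- function(s1, s2, n1, n2) {
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)   # pooled variance
  sqrt(sp2 * (1 / n1 + 1 / n2))
}
welch_se <- function(s1, s2, n1, n2) sqrt(s1^2 / n1 + s2^2 / n2)
welch_df <- function(s1, s2, n1, n2) {                          # Satterthwaite degrees of freedom
  v1 <- s1^2 / n1; v2 <- s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}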
Exactly, Miky Timothy. Statistical significance means that the data have met a mathematical criterion. It says nothing about the question asked. One must look at the data in terms of the question. The statistical test, if relevant, answers a statistical question concerning minimum data, but not the scientific question.
The discussion concerning tests, and particularly the simulations by Jochen Wilhelm, shows that even when the data minimum is attained according to the test, the test might not be applicable to the data. The data must be examined in terms of the question asked. Significance means that the data might meet a threshold for further investigation, if the statistical test is applicable to the question.
Like Miky Timothy, I was also reminded of Daniel Lakens' blog post and the article by Delacre et al. Here are a couple more articles I found that may interest followers of this thread.
Article: A note on preliminary tests of equality of variances
Here is the abstract from the 2nd of those articles, by Zimmerman (2004).
"Preliminary tests of equality of variances used before a test of location are no longer widely recommended by statisticians, although they persist in some textbooks and software packages. The present study extends the findings of previous studies and provides further reasons for discontinuing the use of preliminary tests. The study found Type I error rates of a two‐stage procedure, consisting of a preliminary Levene test on samples of different sizes with unequal variances, followed by either a Student pooled‐variances t test or a Welch separate‐variances t test. Simulations disclosed that the twostage procedure fails to protect the significance level and usually makes the situation worse. Earlier studies have shown that preliminary tests often adversely affect the size of the test, and also that the Welch test is superior to the t test when variances are unequal. The present simulations reveal that changes in Type I error rates are greater when sample sizes are smaller, when the difference in variances is slight rather than extreme, and when the significance level is more stringent. Furthermore, the validity of the Welch test deteriorates if it is used only on those occasions where a preliminary test indicates it is needed. Optimum protection is assured by using a separate‐variances test unconditionally whenever sample sizes are unequal."
That final sentence summarizes my own approach fairly well.
When n1 and n2 are not equal (or very nearly so), use the unequal variances test.
When n1 = n2, use the pooled variances test unless the ratio of (sample) variances is quite large, and there is no compelling theoretical reason to believe that the population variances are equal.
This differs from the advice Daniel Lakens gives, but I think that one needlessly sacrifices power by routinely using the unequal variances test when n1 = n2. (Remember that even if the population variances are exactly equal, the sample variances will differ virtually all the time, and when the sample variances differ, the unequal variances test has less power than the pooled variances test.)
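(For concreteness, one way to encode this rule of thumb in R might look like the sketch below; the function name and the cut-off for a "quite large" variance ratio are my own assumptions, not Bruce's.)
choose_var_equal <- function(x, y, max_ratio = 4) {
  if (length(x) != length(y)) return(FALSE)              # unequal n: use the Welch test
  ratio <- max(var(x), var(y)) / min(var(x), var(y))
  ratio <= max_ratio                                     # equal n: pool unless the ratio is quite large
}
t.test(y1, y2, var.equal = choose_var_equal(y1, y2))     # y1, y2 as in the earlier sketch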
PS- Here's one more article with a great title: "Of rowing boats, ocean liners and tests of the ANOVA homogeneity of variance assumption"
I wanted to follow up my previous post by presenting a visual simulation/model of the question. Also, having re-read my previous post, it appears that I may have been critical of Jochen Wilhelm's model without bringing anything constructive to the conversation. My apologies to Jochen Wilhelm for this; I don't mean to imply that there is anything wrong with his approach to modeling the statistics, and I think the goal was simply to arrive at a set of meaningful metrics to make the tests practical. Thus, I'd like to elaborate my own approach to conceptualizing a question of 'percent area' that will hopefully clarify what I was trying to get at in the previous post; comments are welcome.
The core of an experiment like Evans Kyei's, I would argue, is not one of statistical inference. Rather, it is foremost a matter of descriptive measurement, followed by careful attention to sampling. Here's what I mean: I can quite confidently say, knowing nothing else about Evans Kyei's experiment, that his results ARE statistically significant. If one considers his data in terms of a Tukey quick test, I'm fairly certain (though I haven't run the test) that the exclusive location in the distribution of his two sets of measurements would meet the criterion of statistical significance.
But what can be made of this? Well, a statistically unlikely result may just as readily indicate something odd about the methodology used or one's approach to estimation, and it is especially likely that any such outcome is due to very low power, and in this case entirely theoretical data (Evans Kyei maybe can confirm).
However, recall my above point arguing that such an experiment should be thought of primarily in terms of descriptive measurement. What I was getting at in my previous post and perhaps what Joseph L Alvarez is also implying is that percentages cannot be usefully interpreted independently of the real-world measurements from which they are derived.
Critically, in terms of a description of spatial data in two dimensions, what is the resolution of the measurements? If these measurements were made in a forest using some kind of physical device, wouldn't the minimum resolvable fraction of the sampled space be a matter of the precision of that device? In other words, is the (significant-digit) precision of 13.67% warranted, or should it be rounded?
So my approach to thinking about the problem starts with the idea that spatial data, unlike other relatively undefined experimental measurements found in, for example, the social sciences, can be visualized and examined directly. While one should not naively conflate real-world phenomena with models or graphical simplifications, this is what experimental science is essentially doing anyway, so at the very least scientists can make the models they employ as clear to the reader as possible.
My model, which is an ImageJ macro, is intended to visually illustrate parameter expectations in an analysis. What is evident is that even though 'tree' placement and size between the sample items is 'random', in a sense, the outcome is nonetheless entirely predicated on the assumptions of the model. These assumptions include very high 'tree' resolution, random tree placement (except for edge offset for readability), limits to maximum tree size, etc.
Missing the forest for the trees:
What’s interesting is that while the resulting maps are very reasonable representations of the data as a statistical test would see them, they are most probably not at all representative of what tree cover actually looks like. This information gap is what makes science so difficult but also so valuable – when it is successful after much rigor and critical contemplation.
Anyway, I feel like this post is already way too long, so getting into the details of the model would be overkill. Something of a takeaway that I think is interesting: it is quite easy to tell the difference between ~9.5% vs ~13.5% pixel cover in an image (to my eyes). But that’s only because we’re looking at them side by side!
@Evans Kyei: That there is a statistically significant difference, perhaps, BUT it is better to be cautious with the meaning you are going to convey. In a clinical study, say, a statistically significant difference and a clinically important difference are two completely different concepts. There is the possibility that, in the case of a trivially small difference, if the sample is too large, the standard errors are tiny, the 95% confidence intervals separate, and the p-value is tiny, making us think the difference is statistically significant. For this reason, it is very important to work with an important difference that is worthwhile to detect (from the clinical, educational, etc. point of view), and to calculate the sample size required to detect it with a given power, of 80% say.
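(A sketch of that last step in R, with made-up numbers: suppose a difference of 4 percentage points of tree cover is judged practically important and the between-field standard deviation is assumed to be 5 percentage points.)
power.t.test(delta = 4, sd = 5, power = 0.80, sig.level = 0.05)
# returns the sample size needed in each group to detect the assumed difference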
Regarding Alaa Ali's pointless and disruptive double posts: I too noticed this. Perhaps I am mistaken, but my cynical conclusion is that they are trying to game ResearchGate's score system by spamming answers; it seems that a relatively "high" RG rank, either in one's institution or by some other measure, is incentivized.
ResearchGate is now one of the very few censorship-free platforms available for scientists to express critical positions and discussion, so the cure for this annoyance cannot be worse than the disease. I propose a simple fix with two easy changes:
Only one answer should be allowed at a time. There is no reason for double posts, given the edit function.
A minimal length requirement for answers. The current TWO character requirement is ridiculous.
By the way, since the original question is on practical statistics, how is it possible that you and I identified Ali's anomalous behaviour given a sample of only one or two of their answers? I believe we both used an informed RUNS HEURISTIC, wherein even a run of two identical double posts provides evidence that additional investigation is warranted. This can be conceptualized using a control chart, of which you are a great proponent, but with very strict control limits based on an empirical cut-off of only a SINGLE run of two identical double posts. Having observed such a run, we shut off the factory, so to speak, and take a look at Alaa Ali.
Massimo Sivo & Miky Timothy, I don't know if it will have any real impact, but you can also try reporting such useless (and probably self-serving) posts by clicking on the down-arrow beside Share.