I read several research papers on the motivation to study science. All of them used t-tests without first testing whether the data were normally distributed.
This was one set of data that I got for one analysis: for Levene's test, F = 0.537, p = 0.465; t = 1.612 with df = 141 (equal variances assumed) and t = 1.649 with df = 107.05 (equal variances not assumed). My sample size is 143: 50 in group 1 and 93 in group 2. The p values for the t-test exceeded 0.05 whether equal variances were assumed or not.
I can still report means and SDs, but I should use a non-parametric test like the Mann-Whitney, right? What is your advice? Is it ethical to use a t-test without stating/making sure that the data are normally distributed?
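For reference, this is roughly how numbers like those above can be produced in R (the scores below are made up, since I can't post the raw data; only the group sizes 50 and 93 match mine). Levene's test is done by hand as a one-way ANOVA on absolute deviations from the group means (the mean-centred version, as in SPSS):
set.seed(1)
g1 <- rnorm(50, mean = 3.6, sd = 0.6)   # hypothetical motivation scores, group 1
g2 <- rnorm(93, mean = 3.4, sd = 0.7)   # hypothetical motivation scores, group 2
score <- c(g1, g2)
group <- factor(rep(c("G1", "G2"), c(50, 93)))
absdev <- abs(score - ave(score, group))
anova(lm(absdev ~ group))                  # F and p correspond to Levene's test
t.test(score ~ group, var.equal = TRUE)    # "equal variances assumed" t-test
t.test(score ~ group)                      # Welch t-test ("equal variances not assumed")
wilcox.test(score ~ group)                 # Mann-Whitney U, for comparison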
Personally, I think that testing for normality prior to using a t-test (or ANOVA, etc) is a rather pointless exercise. Here's a conference talk I gave a few years ago that explains why I think that.
https://www.nosm.ca/uploadedFiles/Research/Northern_Health_Research_Conference/Weaver,%20Bruce_Silly%20or%20Pointless%20Things.pdf
Note that there are some slides in the "cutting room floor" section that you may find interesting. (I didn't have time to include them in the presentation.)
Re heterogeneity of variance, the unpaired t-test is incredibly robust to heterogeneity of variance when the sample sizes are equal. Some authors (e.g., David Howell) say that with equal sample sizes, the t-test is fine provided that the ratio of variances is no more than 4 or 5. They are probably being conservative--there is an old paper (I forget the reference right now) that shows pretty good t-test performance when the ratio of variances = 10 (with equal sample sizes). But the more discrepant the sample sizes become, the less robust the t-test is. In summary, I would not rely too strongly on Levene's test in deciding whether to report the Welch-Satterthwaite (unequal variances) t-test. Instead, I'd pay attention to the ratio of variances and to how discrepant the sample sizes are.
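For anyone who wants to check this sort of thing, here is a rough simulation sketch (my own illustration, not taken from the paper I mentioned) of the Type I error of the pooled t-test when the variance ratio is 10, with equal vs. unequal sample sizes; both population means are equal, so rejections are false positives:
type1 <- function(n1, n2, sd1, sd2, nsim = 20000) {
  p <- replicate(nsim, {
    x <- rnorm(n1, 0, sd1)
    y <- rnorm(n2, 0, sd2)
    t.test(x, y, var.equal = TRUE)$p.value
  })
  mean(p < 0.05)
}
type1(50, 50, 1, sqrt(10))   # equal n: close to the nominal .05
type1(20, 80, sqrt(10), 1)   # smaller group has the larger variance: inflated
type1(80, 20, sqrt(10), 1)   # larger group has the larger variance: conservative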
HTH.
Hi Dr. Yeoh, I just had a crash course on SPSS; I'm yet to use it. The instructor taught us how to compute and how to report. What I remember is that as long as your sample is more than 30, we can assume it is normally distributed, so you should use the t-test. If it is less than 30, then you need to test whether it is normally distributed. This part I'm yet to read for my analysis - I haven't gone through my papers yet.
Friends, I read so many papers today on this issue. Not a single one said anything about whether equal variances could be assumed, or why. They just reported the t-test values....
Thanks Fatimah, but I'm still not so sure. If the p of the F value is less than 0.05, it means that equal variances may not be assumed, right? But my p value is more than .05, so I assume equal variances. Is that just satisfying one condition, or is that enough? I will search for info too. Thanks
The t-test for two independent groups has three crucial conditions:
1) An equal size of two independent groups
2) A normal distribution of your dependent variables for each group
3) An equal variance for both groups
If your data satisfies two of those requirements, you can use the t-test.
So, as you mentioned, you have equal variance (the p-value of Levene's test is greater than 0.05), but you do not have equal group sizes (a chi-squared test has p < 0.001 in your case), so you should check the normal distribution of your variables for each group. If your data satisfy this requirement, you can use the t-test; if not, you should use the Mann-Whitney U test.
Here you can find precise information:
http://libguides.library.kent.edu/SPSS/IndependentTTest
I think your reflection is indeed relevant. Testing the assumptions is crucial before interpreting the results of any statistical test.
In the case you present, you ensure the assumption of equal variances. You do not have results for normality tests. Ideally, the sample size should not be too different between groups; generally, a ratio of 1:1.5 is accepted, with some guidelines accepting a ratio of 1:2.0.
In fact, if you do not use parametric tests to analyze your data, you should not report means and standard deviations. Instead, you should report medians as a central tendency measure and the inter-quartile range as a dispersion measure, for instance.
With respect to your last question, in my opinion it is plausible to use parametric tests when the assumption of normality is not met. It is relevant to know that a normality test (e.g., the Kolmogorov-Smirnov test) tests whether your sample differs significantly from a normal distribution. As your sample size increases, the same degree of deviation will become more and more significant. Therefore, I would not say that if you have violated the assumption of normality, you should not use a parametric test. Depending on the case, I'd say that, although you do not have a normally distributed sample, considering a large sample size, you can decide to use a parametric alternative.
Actually, all tests are based on assumptions. These assumptions should be reasonable/sensible by theory. If there is no adequate theory to derive these assumptions, one should ask oneself why, and what actually should be tested then.
Looking at the data may be helpful to see if the actual data strongly contradict the assumptions. If so, one needs to reconsider the model and rethink the theory. A formal test of the assumptions is logically nonsense (although often done, I know).
The t-test is a likelihood-ratio test. The likelihood approaches the normal distribution for large n. If the data are unimodal and not extremely skewed, n > 50 is very likely to guarantee a sufficiently good approximation (see "central limit theorem").
If you want to test the mean values, the t-test is the only method that does it. The Mann-Whitney does something different (it compares distributions, with particular - but not exclusive - sensitivity to a location shift). Your aim should decide what you do. An alternative still is to bootstrap the sampling distribution (of whatever you wish, for instance a mean difference).
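A minimal base-R sketch of that bootstrap idea (entirely made-up, skewed data; group sizes borrowed from the question only for concreteness):
set.seed(2)
x <- rexp(50, rate = 1)            # hypothetical skewed sample, group 1
y <- rexp(93, rate = 1.3)          # hypothetical skewed sample, group 2
obs_diff <- mean(x) - mean(y)      # observed mean difference
boot_diff <- replicate(10000, mean(sample(x, replace = TRUE)) -
                               mean(sample(y, replace = TRUE)))
quantile(boot_diff, c(0.025, 0.975))   # percentile bootstrap CI for the mean difference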
Following up on Jochen's point about the Mann-Whitney test being sensitive to things OTHER than location shift, see Fagerland's nice simulation study. I think it is available for download on ResearchGate--but here is the PubMed page:
http://www.ncbi.nlm.nih.gov/pubmed/19247980
Here's another article that may be of interest to those who believe (erroneously in my view--see my earlier post) that one should test for normality before using a t-test or ANOVA.
http://www.bmj.com/content/338/bmj.a3166
Cheers!
Thanks for your views. If you have some helpful thoughts, just post. I found this link helpful.
http://www.uwlax.edu/faculty/toribio/math145_spr09/Reading_SPSS2_Output.pdf
Friends, if I don't need to check normality, all the better for me. So as long as variances are equal, I can assume that my t-test is fine. Then I report the F value and its p value, and straight away I report t(df) = t value, p value. Does this seem OK to you? So often the F value and its p value aren't reported.
Short: Yes.
Longer: Assumptions should be based on theoretical considerations. Surely, the reasonability of assumptions may be checked (but not tested!) on the available data. As beautifully stated in the presentation linked by Bruce, the assumption of "normally distributed data" is required only for an *exact* test - which is never a sensible possibility with biological data anyway. So a t-test on real data will always be an approximate test, and the question is whether or not the degree of approximation is sufficient for the purpose. Unless the data have a very strange distribution, the approximation is generally quite good, especially when the sample size is relatively large (n > 30). Plot the data. For your sample sizes, a box-and-whisker plot would be OK (a scatterplot would be ideal); a quick R sketch follows below. From this plot one can see if anything very strange is happening. If not, using the t-test for testing the mean difference is absolutely OK.
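Something like this would do (made-up scores, only the group sizes 50 and 93 taken from your question); the stripchart overlays the individual points on the boxplots:
set.seed(3)
d <- data.frame(score = c(rnorm(50, 3.6, 0.6), rnorm(93, 3.4, 0.7)),
                group = rep(c("G1", "G2"), c(50, 93)))
boxplot(score ~ group, data = d)                      # box-and-whisker plot per group
stripchart(score ~ group, data = d, vertical = TRUE,  # individual points ("scatterplot")
           method = "jitter", add = TRUE, pch = 16)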
I only have the general problem with such tests that the result ("significant" or not) is often not useful (and does not mean what most people think it means), and that a "non-significant" result means "we cannot say anything" unless the experiment was designed to control the type-II error rate as well. Without having a particular, specific alternative hypothesis in mind, testing a null hypothesis is simply not useful. Then it would be better to only provide the confidence interval and interpret it in light of the whole context and with all available expertise (and this does particularly NOT mean simply rejecting H0 when the interval does not include H0, which would reduce the analysis again to a hypothesis test).
http://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf
Hello Miranda, Now that you have sorted out things in relation to significance testing, please note, that there is a long tradition of heated debate between statisticians, theoretical as well as applied, about the sense and merits of pure significance testing. Actually it is not just one debate, but there are several parallel discussions about alternative approaches and concepts. As you are a working researcher in an applied field, you are probably not interested in those personal quarrels between statisticians - I myself find them rather amusing as well as intriguing. However, if you are planning to do more research in which you are forced to rely on a lot of statistics, it may be worthwhile at least to find out about the importance of the concept of "statistical power of tests". See links below on two well-known books about that topic, as well as a general paper about p-values (2005)
http://www.amazon.com/dp/0805802835
http://www.amazon.com/Beyond-Significance-Testing-Statistics-Behavioral/dp/1433812789
https://www.youtube.com/watch?v=QW9_T8nrApU
Just saw your presentation, Bruce (Weaver) - that was just what I was looking for and have been mulling over for quite some time. Brilliant!
Thanks @Bruce and all of you, dear friends.
@Bruce, I like that humorous presentation.
Thanks very much. For all the t-tests that I carried out, equal variances could be assumed. And on each occasion, I stated the value of F and its p, followed by t(df) = t value, p = p value. :) Thanks.
There are several studies demonstrating that making choice of t test conditional on a prior significance test of assumptions plays havoc with the Type I error rate.
http://onlinelibrary.wiley.com/doi/10.1111/bmsp.12001/abstract
http://onlinelibrary.wiley.com/doi/10.1348/000711004849222/abstract
A well known approach is to use the Welch-Satterthwaite corrected test by default. It is also a reasonable approach to run a more conservative test such as the Mann-Whitney and switch to that if the t test and the Mann-Whitney differ.
Thank you, Thom and friends, for your answers on this thread. I have learned many things from you all.
Hi Thom. Before using the Mann-Whitney U in place of the unpaired t-test, I would carefully consider some of the simulation studies that have been published, like this nice one by Fagerland & Sandvik (see link below). Their simulations demonstrate that small differences in variance or skewness of the populations produce a lot of Type I errors when the MW test is used as a substitute for the t-test.
https://www.researchgate.net/publication/24045186_The_Wilcoxon-Mann-Whitney_test_under_scrutiny?ev=prf_pub
Cheers,
Bruce
Article The Wilcoxon-Mann-Whitney test under scrutiny
Hi Bruce, I agree. I was on a train when I sent the last message and wrote more hurriedly than intended.
Generally, if the Mann-Whitney U test and the t test differ to any great extent, it can make sense to switch to the more conservative test (the one with the larger p value). It is a sign that the assumptions of one or both tests are violated, though, so I would normally check the distributions (if I hadn't already).
This is the question where most people make their catastrophic mistakes, because most of them do a t-test without looking at the data distribution. You should first look at your data distribution and draw a histogram to see how the numbers are distributed. If they are distributed symmetrically (normal or symmetric or Gaussian distribution), then you can do a t-test to see whether there are any differences between the two groups, because the t-test can be used only in cases where the data are distributed normally. However, if you see in the histogram that the data are distributed asymmetrically (shifted to the left or to the right, away from the normal distribution curve), you then calculate skewness and kurtosis to estimate what percentage of your data falls outside the normal distribution curve, then do the Kolmogorov-Smirnov test to see if these calculations are correct, and then use one of the non-parametric tests, such as the least-squares test or the Mann-Whitney test.
Hamza I do not agree with some points you make:
# "first [...] do data distribution histogram to see how numbers distribute"
Well, looking at the data is a good idea, I agree. But still, the distributional form is an *assumption*. This should actually be given outside of the particular set of data, derived from an understanding of the kind of data and the way it is generated/measured. Adjusting the kind of test performed on the actual data based on the distribution of the very same data subverts the idea of the test and renders the interpretation of the test results difficult or impossible.
Further, many people have only few data (small sample sizes). Not much of the distribution will be visible anyway; histograms would be useless. Better to use quantile-quantile plots to check whether the data fit the distributional assumption.
Finally, it is not about the distribution of the data, but about the distribution of the residuals (this becomes important especially when covariates and interactions are considered, but it is valuable even in simple experiments because the distribution of the residuals can be based on all samples together, whereas the data has to be analyzed sample-by-sample).
# "t-test can be used only and only in cases data distribute normally."
Clearly: no. Data (or residuals) are never distributed normally. The normal distribution is a model, and it is only a matter of how closely you look at the data: you will always be able to show that they are not normally distributed. The t-test would be an exact test if and only if the data (or residuals) were normally distributed. When this assumption is only approximately true, then the test is an approximate test. The question thus is not whether the data/residuals are normally distributed but whether the approximation is good enough.
Further, the test is based on the likelihood function (or sampling distribution), and this tends to approximate the normal distribution with increasing sample size (central limit theorem).
And finally, most "bad" approximations will compromise the power of the test; it will be more conservative. Thus considering a result "significant" is safe (in most cases). The problem is that non-significant results cannot be interpreted based on the power determined under the wrong assumption.
# "However,if you see in a histogram that data are distributed asymmetrically [...]"
The following advice is not good in my opinion. If the aim was to test a mean difference, the suggested methods simply do not do this job. It's like having a look at the watch to get the time, but when recognizing that the watch is out of order, one instead simply takes the color or the shape of the watch as a substitute... I am only aware of the possibility to bootstrap the mean difference, but this will be inefficient for small samples (and for large samples the CLT will likely provide a good-enough approximation).
Jochen, you completely missed the point. If you read my comment correctly you would see that I talked about residuals, because the tests I mentioned are about residuals. It is not about approximation?! It is about looking at the real nature of the results. This is the best method there is. Why don't you suggest a better one if you know one? You just theorize. As regards bootstrapping, that method is in accordance with your understanding of statistics - the worst virtual method, without any connection with the real situation.
Hamza, Jochen is, I think, pointing out that the residuals and/or data are never perfectly normal in a sample. In addition, even if the errors are normal in the population (the residuals being a sample of the population of errors), there is no guarantee that a small sample will look particularly normal.
This is part of the problem with using tests of normality etc. to decide on the choice of test of differences. Another aspect is that the tests of assumptions tend to lack power and not to be very robust themselves.
Descriptive statistics and particularly graphical checks are good practice because they allow us to assess the degree of violations of assumptions. What I advise students to look for is evidence of large departures from normality. Procedures such as the t test perform well when the assumptions are not severely violated.
There may some confusion of terminology here. All statistical methods are approximations in the sense that the true model is unknown. Some methods are 'exact' (rather than approximations) in the narrow sense that if the assumptions hold perfectly then the method will hold particular properties with certainty (e.g., a particular Type I error rate).
I've not much to add to Bruce's excellent and entertaining presentation, but maybe this will help to convince believers, like Hamza, in numerical methods such as the Kolmogorov-Smirnov test that they are on the wrong track - not much theorizing here, just some simulations and useful graphs including R code:
http://www.statisticalmisses.nl/index.php/frequently-asked-questions/77-what-is-wrong-with-tests-of-normality
Until you suggest another procedure, other than the one I suggested, that would be significantly better, any further discussion here is really aimless.
Hamza, you obviously don't understand me and, to judge from your statement, I seem not to get you right either. So maybe we should both work on our communication skills first.
To the best of my knowledge, the t-test is the only available test for means. Your proposed U-test is not a test for means (unless the distributions are unimodal and symmetric, in which case the t-test would again be the better alternative). As a crude alternative I mentioned bootstrap methods, which you consider the "worst virtual method without any connections with real situation". Surely I and many others will not agree with your opinion here.
At the risk of theorizing, I aimed to correct some misconceptions you broadcast. Whether this aim has been reached will be decided by others, not by you or me.
Miranda,
Your two samples differ in size, so the t-test can be misleading. It depends on how the SDs are distributed... I published about exactly this problem - how robust are parametric and non-parametric tests? We wrote it for ecologists, but in many areas the problem is the same... here is the abstract:
"Ecologists, when analyzing the output of simple experiments, often have to compare statistical samples that simultaneously are of uneven size, unequal variance and distribute non-normally. Although there are special tests designed to address each of these unsuitable characteristics, it is unclear how their combination affects the tests. Here we compare the performance of recommended tests using generated data sets that simulate statistical samples typical in ecological research. We measured rates of type I and II errors, and found that common parametric tests such as ANOVA are quite robust to non-normality, uneven sample size, unequal variance, and their effect combined. ANOVA and randomization tests produced very similar results. At the same time, the t-test for unequal variance unexpectedly lost power with samples of uneven size. Also, non-parametric tests were strongly affected by unequal variance in large samples, yet non-parametric tests could complement parametric tests when testing samples of uneven size. Thus, we demonstrate that the robustness of each kind of test strongly depends on the combination of parameters (distribution, sample size, equality of variances). We conclude that manuals should be revised to offer more elaborate instructions for applying specific statistical tests"
and the link (it is open access publication):
http://eprints.iliauni.edu.ge/usr/share/eprints3/data/462/
In sum, I would recommend ANOVA (even though only two samples are analyzed).
@Zaal and friends, thanks for all your views.
I followed the method used by Prof Glynn et al. (2011), and I also got permission to use their questionnaire to measure student motivation to learn science. The ratio of male to female students in my sample was 50:93, which is representative of students in my college. The situation was similar in Glynn et al.: among their science majors, the ratio of men to women was 127:240.
I also followed their use of t-test to measure the differences in achievement and in motivation factors between boys and girls. They did not report F value and its p value, only the t values and p values.
'Science Motivation Questionnaire II: Validation With Science Majors and Nonscience Majors', Shawn M. Glynn, Peggy Brickman, Norris Armstrong, and Gita Taasoobshirazi (2011).
Miranda,
You did the right thing! But you still have some doubts, and to resolve them just use additional tests (ANOVA and Mann-Whitney) - I am almost sure the results will be the same. So you can safely report your results after that.
@Zaal:
This I do not understand: you write that you "found that common parametric tests such as ANOVA are quite robust to non-normality, uneven sample size, unequal variance, and their effect combined." and later that "At the same time, the t-test for unequal variance unexpectedly lost power with samples of uneven size."
The t-test is equivalent to ANOVA (for k=2); the F[1,v]-distribution is the distribution of t²[v]. How can one be robust and the other not? - Or is ANOVA robust only for k>>2? Was the t-test performed with the same total sample size? This would be important because the uncertainty in the estimate of the standard error decreases with the total sample size in both ANOVA and the t-test; both operate on the likelihood of the *whole* set of data (in (k-1)+1 = k dimensions [k being restricted to 2 for the t-test]).
@ Jochen:
I wonder if Zaal is comparing the Welch-Satterthwaite t-test (which does not assume homogeneity of variance) to ordinary one-way ANOVA?
FOLLOWUP:
Here is a relevant quote from Zaal's article:
"For example, ANOVA can compare two or more means, and, when applied to two samples, produces exactly the same p-values as t-test for equal variance. Therefore, ANOVA can replace t-tests for two samples, yet habitually we still use t-tests, and employ ANOVA only when we have more than two means. Why this redundancy? The advantage of t-tests may be that it has a special version adjusted to unequal variance. As far as we know, ANOVA does not provide any widely used procedure for correcting p-values for unequal variance, although tests for equality of variance, such as Bartlett’s test or Levene’s test, are routinely calculated for testing departures from ANOVA assumptions." (p. 67)
It would appear that the software Zaal uses does not have an option to display unequal variances versions of the F-test for one-way ANOVA. Many stats packages do have that option--e.g., the ONEWAY procedure in SPSS has both the Welch and Brown-Forsythe F-tests.* IIRC, these two tests are equivalent when there are 2 groups, and their F-value = the square of the Welch-Satterthwaite t-value (i.e., the "equal variances not assumed" t-test in SPSS).
HTH.
* You can see the algorithms here: http://pic.dhe.ibm.com/infocenter/spssstat/v21r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_oneway_forsythe.htm
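For R users, the Welch version of the one-way F-test is available as oneway.test; a small sketch with made-up data (group sizes and SDs chosen arbitrarily) showing that, with two groups, its F equals the square of the Welch t:
set.seed(4)
d <- data.frame(y = c(rnorm(50, 0, 1), rnorm(93, 0.3, 2)),
                g = rep(c("A", "B"), c(50, 93)))
oneway.test(y ~ g, data = d, var.equal = FALSE)  # Welch F-test for one-way ANOVA
t.test(y ~ g, data = d)                          # Welch t-test: here F = t^2, same df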
The paper explicitly states that it is calling the t test with equal variances ANOVA.
The power results seem weird as they, for example, contradict Zimmerman & Zumbo (1993), who show that the Welch-Satterthwaite corrected test outperforms t for six different types of distributions. The paper is also useful in pointing out that you can use the Welch-Satterthwaite t test on rank-transformed scores to overcome the problem with unequal variances of the Mann-Whitney U test.
I think I understand now. The claim "At the same time, the t-test for unequal variance unexpectedly lost power with samples of uneven size" is incorrect. The Welch-Satterthwaite corrected test doesn't lose power relative to ANOVA or Mann-Whitney when sample sizes and variances are unequal. What happens is that the Type I error rates for ANOVA (uncorrected t test) and Mann-Whitney U are sky high (see p. 72) when the smaller group has the higher variance (a well-known result).
The technical issue is that you can't compare power/Type II error between procedures with different Type I error rate. The Type I error rate is in effect the lower bound for power.
Just to check, I simulated the power and type 1 error rates for 20000 normal samples with the same sample sizes (6 v 60), sd (15 v 5) and means (100 v. 105 or 100 v. 100) for the t test and the Welch t test.
Pooled t test/ANOVA: power = .52, Type I error = .39
Welch-Satterthwaite t: power = .10, Type I error = .05
* these values jump around a little bit as even 20,000 simulated samples isn't that stable with these sample sizes and variances.
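For anyone who wants to reproduce something like this, here is a rough sketch of such a simulation (my own illustration, not the exact code used above; the numbers will wobble from run to run):
sim <- function(m2, nsim = 20000) {
  p <- replicate(nsim, {
    x <- rnorm(6, 100, 15)     # small group, large SD
    y <- rnorm(60, m2, 5)      # large group, small SD
    c(pooled = t.test(x, y, var.equal = TRUE)$p.value,
      welch  = t.test(x, y)$p.value)
  })
  rowMeans(p < 0.05)           # proportion of rejections for each test
}
sim(105)   # "power" of each test (inflated for the pooled test - see the caveat above)
sim(100)   # Type I error rates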
Miranda, you can encode your data in the form of table of frequency distribution in Excel using the Countif function, click insert, click an appropriate graph/chart, and from the graph make an eyeball estimate if the data are normally distributed or not. You do not need to assume, you will see the plot by your very own eyes. Ed
Thanks Ed and friends for all great advice :)
In SPSS, I can use the Q-Q plot or P-P plot to see if it 'fits' the normal straight line? Which plot is better, Ed et al.?
Miranda, I guess you, who would actually see the result, would be the best judge for that. Ed
There must be Shapiro–Wilk test for normality, and Q-Q plot is also useful.
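For example, in R (with hypothetical scores standing in for the real ones):
set.seed(5)
scores <- rnorm(93, mean = 3.5, sd = 0.7)   # invented stand-in for one group's scores
shapiro.test(scores)                        # Shapiro-Wilk W statistic and p-value
qqnorm(scores); qqline(scores)              # points near the line suggest approximate normality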
Friends, thanks very much: Got it!
'If the data points fall on the straight line, the sample data likely came from a population that is approximately normal.'
Nice that things got resolved. However, I wouldn't like to see a common misconception about tests carried over even to the interpretation of Q-Q plots:
Miranda cited: 'If the data points fall on the straight line, the sample data likely came from a population that is approximately normal.' - That's not always correct. Actually it means only that the *data* are expected under the hypothesis of the normal distribution. This statement cannot simply be turned around to say that the hypothesis (normal distribution) is likely given the data. To make it clear by a (rather silly) example: if your sample size is n=2, the points will *always* lie perfectly on a straight line - absolutely no matter what the distribution is.
@Jochen
Good point, thank you! Yes, this is why testing distribution normality in small samples is useless, and the tests like t-test and ANOVA work with such samples well. By the way, nonparametric tests also seem to work better with small samples.
@Miranda,
Anyway, your sample size is not small, so it is right to test normality and be sure that t-test is the valid one in your case.
Dear friends, now I understand why the studies that I read did not mention normality tests at all. During the time I did my PhD, testing normality seemed to be all important. But if I only investigate for my sample, and do not make any inferences, then I am 'safe'. The studies that I read were published in journals with IF.
Hi, you had more than 30 participants per condition, so you can invoke the central limit theorem (i.e., once the sample size exceeds 30, the shape of the sampling distribution is close to normal) - Andy Field provides a nice explanation of the central limit theorem in his research methods textbook.
Nadia - the central limit theorem applies asymptotically (and then only under certain conditions). For any finite sample there is no guarantee that the CLT applies and thus no fixed n that implies that the inferences from a t test would be OK.
That said, if you have a symmetrical distribution without heavy tails (i.e., not leptokurtic), then the CLT implies that with moderate sample sizes (e.g., 30 or so per group) you should be OK.
This is a direct quote from Andy Field's blog "However, the central limit theorem tells us that no matter what distribution things have, the sampling distribution will be normal if the sample is large enough. How large is large enough is another matter entirely and depends a bit on what test statistic you want to use."
(He is oversimplifying it a bit, as the central limit theorem requires that the distribution you are sampling from has finite mean and variance - which, perhaps counterintuitively, is not always true.)
Dear friends, thanks for all your posts. Two heads are better than one, this is the value of RG. I have written the 2nd draft of my paper, checked that my numerical values were correct. I would not claim anything but just say that the results obtained were confined to my sample.
ENJOY YOUR DAY :))
Thanks friends, Bruce Weaver, QuickDraw McGraw, Baba Louie et al.,
'As n increases, normality becomes less and less important for the validity of the t-test.'
https://www.nosm.ca/uploadedFiles/Research/Northern_Health_Research_Conference/Weaver,%20Bruce_Silly%20or%20Pointless%20Things.pdf
I check normal distribution of my data set before doing t-test. In case of skewness or kurtosis, I use log conversion or exponential treatment of data before t-test if it is mandatory for some reason.
I would like to point out in this discussion that the t-test, just like any null-hypothesis test, is sensitive to violations of prior assumptions, and that one should be careful with using transformations to make the data "fit" into what the analyses require. The distribution of the data may be a piece of information in itself.
I would very much recommend calculating confidence intervals for your mean scores (after verifying that means are the estimate you want to use here; if your data are heavily skewed, another measure of central tendency, e.g. the median, may carry the information you actually want). For this approach I strongly recommend the work of Geoff Cumming (who also provides an easy-to-use Excel sheet for calculations, but R has excellent packages, too).
Aenne, could you please explain why transformations making the data fit the assumptions of a test could be a problem?
I absolutely agree that the distribution of the data is a piece of information in itself. However, if a test is performed to control some error rate (i.e., the Neyman/Pearson philosophy), this is known before any reasonable test is even planned (not to mention the experiment and the data collection itself!). There is no effective control when the kind of test is chosen by the appearance of the data that is then tested. This is one important reason why "testing the normal distribution" before a t-test or ANOVA is logically nonsense.
And a note: confidence intervals, if calculated based on the t-distribution, require the same assumptions as the t-test. I would say that these intervals can be calculated using the t-distribution for transformed data (meeting the assumptions). But this raises the same question I asked above.
When the distribution is known for some theoretical reasons or from past experience, there might be anyway better solutions than using the t-distribution (think for instance of proportions, counts, rates, concentrations, fold-changes etc).
A last note: Sometimes transformations enable us to ask the right questions with simple procedures, which would not be possible in an easy way using the original data. For instance, consider the change in some hormone level. The absolute change is (to my understanding) not biologically relevant, but the relative change is (I skip the lengthy explanation; it should actually be quite clear). So we should not analyze the difference in hormone concentrations but instead the ratio of these concentrations. Now, the log ratio can be written as the difference between the logarithms of the concentrations. Thus, analyzing differences between the logarithms (e.g. by a t-test) actually means analyzing the ratios of the concentrations, and that is the relevant thing.
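A small sketch of that hormone example in R (all numbers invented): the back-transformed difference of log means is the estimated fold-change, and back-transforming the CI gives a CI for the ratio.
set.seed(6)
ctrl  <- rlnorm(30, meanlog = log(10), sdlog = 0.4)   # hypothetical control concentrations
treat <- rlnorm(30, meanlog = log(15), sdlog = 0.4)   # about 1.5-fold higher on average
res <- t.test(log(treat), log(ctrl))                  # t-test on the log scale
res
exp(res$estimate[1] - res$estimate[2])   # back-transformed: estimated fold-change
exp(res$conf.int)                        # CI for the ratio of (geometric) means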
n < 30 is not a small sample for all variables....
The Student's t-test was developed by William Sealy Gosset in his job at Guinness (approx. 1904).
the idea that n
Hi Dear All,
Sometimes, although the normality test indicates a non-normal distribution, the shape of the histogram is rather symmetric; then we can use parametric statistics. The other point is the size of the difference: in my opinion, if the difference between means is large, the results of the parametric and non-parametric tests will be the same.
The t-test has been shown to be pretty robust to deviations from normality, especially when you have large samples. However, you should be concerned about your unequal sample sizes and should definitely report the "equal variances not assumed" test. Finally, since your samples are fairly large, the p values have little importance: any inconsequential difference would be significant. You should make sure that your effect size is large enough.
I believe that more recent simulation work has shown that the t test (and related tests) are not generally robust to violations of normality. These problems are reduced but not eliminated with large samples (e.g., large samples will mitigate if there is skew). The biggest problem is with heavy tailed distributions (and the worst case scenario is that the CLT will not apply with some heavy tailed distributions).
A good reference here is:
Wilcox, R. R. & Keselman, H. J. (2003). Modern robust data analysis methods: Measures of central tendency. Psychological Methods, 8, 254-274.
It is unfortunate that many hypothesis-testing decisions based on t-tests and ANOVA are made without checking the assumptions, especially normality. Sometimes even the CLT does not apply, when the data follow something like a Cauchy distribution. So why not supplement the analysis with the appropriate nonparametric test?
Hypothesis tests are performed (and sensible) only when the assumptions are a priori known to be OK, either from theoretical considerations or from previous experience. Deciding on a test based on the same data that are to be tested undermines the whole idea of the test. Specifying both error rates for a hypothesis test is essential (otherwise it would be something like a significance test, where, in turn, a strict control of the type-I error rate is neither possible nor sensible). To achieve this, one has to calculate the sample size, which is only possible when (amongst other things) the test is known.
For significance tests, the test is just performed without having a specified type-II error rate. Here the p value must be interpreted in relation to the entire context, including the distribution of the data. So here it is not a general problem if the assumptions are not met, since this is considered in the interpretation. However, the bad common practice is to simply compare p to some fixed value ("alpha"), which is utter nonsense anyway. Here I wonder why one should then be concerned with a possible violation of some assumption.
@Saad et al., in most papers, the researchers reported only t-test. I also check with Mann Whitney, but should I report this as well? But I didn't report Mann Whitney test results because I reported means and SD's. What do you think?
But for my last paper, I used non-parametric stats. This was what I reported: The results showed that the minimum score for the control group was 1, and the maximum was 7 of 15 (total score). As for the experimental group, the minimum score was 14 and the maximum was 15. The medians of the experimental group and the control group were 15.0 and 4.5 respectively. The values of mean rank were 45.50 for the experimental and 15.50 for the control groups; and the sum of ranks were 1365 and 465 for the experimental and control respectively. A Mann-Whitney test was carried out to evaluate the scores of the two groups. A significant effect was found: W (58) = 465, Z = -7.076, p = .0005. How can I improve this?
But I didn't report Mann Whitney test results because I reported means and SD's. What do you think?
There is no intimate relation between data summaries and a test; they have nothing to do with each other. However, presenting the means indicates that they seem to be important to you. If it was also important to test differences in the mean values, then you can only use the t-test (there is no other test for differences in means; the MW test tests rank distributions - that is a different question; if this is important for you, you should use the MW test and not the t-test).
How can I improve this?
Visualize the score distributions and interpret this diagram. Mean ranks and rank sums are not really interesting; I do not understand what you mean by "values of mean rank". If you have to have a p-value, compare the rank distributions (MW test).
I assume you are analyzing real-world data. I would test for normality. If normality is supported, go ahead and do the appropriate t-test after checking equality of variances.
Dear Saad, have you read my comments above? If so I would apprechiate a comment. To my opinion, one of us has given a bad advice, and I would like to know if that one is me and why.
To complete my answer, I would also do a nonparametric test to see if conclusions differ.
If normality is not supported, I have to do nonparametrics, which I prefer to transformation techniques. If you plan to publish your work based on nonparametrics, be careful, because many journals and reviewers are biased against nonparametric statistics.
Dear Prof Zaal, Jochen et al., the reason I didn't use a t-test for this paper was that the experimental group did so well, with nearly every student getting the maximum score. If equality of variances were satisfied, could I go ahead and do a t-test? Please let me know.
Of course, means and SDs are more meaningful to me. As teachers we always place importance on mean test scores. But I read in one online link that if I used MW, I should report medians, so I just proceeded to give the MW output verbally.
Often, authors claim they compared mean values (directly or indirectly) but perform a non-parametric test (to support their claim that the means are different). This is logically clearly wrong and practically not necessarily correct. Therefore I can sometimes understand the "biased" reaction of reviewers. If you have reason to think that the errors are not symmetric, then the practical meaning/interpretability of the mean is in doubt anyway. I would suggest re-thinking the model.
Example: Given in a paper you read "The treatment increased the response from 2.3±0.2 to 2.8±0.3 (p
@Miranda,
Statistical tests are just tools, like any other method. But they are well standardized, which means there is no need to report all the details such as rank sums etc. You can report like this: "variances did not differ between the two groups (P=XXX, folded F-test), so we applied the t-test for equal variances". Then you report the mean and SD values and indicate the P-value (from the t-test). But I usually produce a figure (for your type of data, a bar chart showing mean values and standard errors). This is practically the same as Jochen writes above.
Jochen: what you wrote is theoretically accurate. My approach is that one must test for normality. You actually stated, "Looking at the data may be helpful to see if the actual data strongly contradicts the assumptions. If so, one needs to reconsider the model, rethink the theory." Then you stated, "A formal test of the assumptions is logically nonsense (although often done, I know)." How do we look at data and decide subjectively? As Thom stated, "The biggest problem is with heavy tailed distributions (and the worst case scenario is that the CLT will not apply with some heavy tailed distributions)." I gave an example, which is the Cauchy distribution. I agree with you that the MW test is about ranks, but it does detect differences in medians. One can report the median of grades. The question is, if normality is not supported, then what to do in practice?
When the sample sizes are the same (or nearly so)...
“To make the preliminary test on variances [before running a t-test or ANOVA] is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!”
- George Box, Biometrika 1953;40:318–35.
;-)
P.S. Alternatively, you can produce a similar chart again but without error bars. Instead of them you can indicate the p-value obtained from the Mann-Whitney test. Or, if you prefer only text without figures, then just give the mean values and the P-value from the Mann-Whitney test.
1) Thank you Saad for your response. The Cauchy distribution has no finite variance and, thus, no defined mean. So there exists no way to sensibly estimate it. It should be relatively obvious from the variable (how it is measured and so on) that it will have a Cauchy distribution. Then one would not consider comparing means in the first place anyway.
2) Some colleagues above suggested to plot the results, some specifically talk about plotting means (or medians) and standard deviations or standard errors or interquartile ranges. I also said that a plot should be produced and interpreted, but not a plot of means (or other summary statistics) but of the frequency distributions (plotting the data, so to say).
Friends, thanks. For the second study, the number of respondents in each group (control and experimental) was equal. Variances were not equal, nor were the data normally distributed. I thought I was right in using Mann-Whitney. I reported the medians, but the means are more important to me. Should I re-word my report and report the means as well?
I went on to report effect size: The effect size was calculated with the use of the formula (r = z/square root of N), and it was found to be 0.9135 which was a large effect, when compared to the effects of other mnemonics that produced a moderate effect size (Yeoh, 2013c).
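For what it's worth, this is roughly how such an r can be computed in R (the scores below are invented stand-ins for my data, near the ceiling of 15 for the experimental group; the z is recovered from the two-sided p of the Mann-Whitney test):
set.seed(7)
exp_grp  <- sample(13:15, 30, replace = TRUE)   # hypothetical experimental scores
ctrl_grp <- sample(1:7,  30, replace = TRUE)    # hypothetical control scores
w <- wilcox.test(exp_grp, ctrl_grp, correct = FALSE)  # ties give a warning; normal approx. used
z <- qnorm(w$p.value / 2)                       # z implied by the two-sided p (negative sign)
r <- abs(z) / sqrt(length(exp_grp) + length(ctrl_grp))  # r = z / sqrt(N)
r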
I agree Jochen. If we can document the change in distribution shape, it is also a very important result.
@Miranda, I would look at the distributions too, as Jochen recommends.
@Miranda,
Sorry, I just have read your last post.
So you have compared distributions and the experimental group shows skewness! Very good! Skewness can be measured and tested too, I recall there is D'Agostino's test on that. Then, in my opinion, no need to test mean differences any more, if you show that skewness is significant only in the experimental group in contrast to the control. And, importantly, you can use mean values for comparisons between the groups and assessing the effect size. I would also add the figure of two distributions.
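In R, for example, this can be done with the 'moments' package (assuming it is installed; the scores below are invented, just to show negative skew near a ceiling):
library(moments)
set.seed(8)
x <- 15 - rexp(60, rate = 2)     # hypothetical negatively skewed scores near a ceiling of 15
skewness(x)                      # sample skewness (negative here)
kurtosis(x)                      # sample kurtosis
agostino.test(x)                 # D'Agostino test of skewness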
Miranda, you will always find a reviewer that is not perfectly happy with whatever you do. Speaking for myself, still would prefer to see the distributions, and those should be interpreted and discussed. I also guess that the effect size measure is not sensible for this kind of data (it's a guess! You could link the cited paper). If some "relevant" statistic should be given in the text then I would prefer the mode (so neither median nor mean, haha). But this is your work, and if you think the means are interesting, then report them. If, in case, a reviewer moans, then you can explain why you reported the means (if not already done in the manuscript).
You should acknowledge that there are cases where it simply is not possible to summarize data in one or two statistics (using 5 or more statistics to "summarize" won't be very helpful, and it would be more straightforward to simply describe the relevant features of the data). If you do it anyway, you usually don't do yourself (and others) a favour.
For me it makes no sense to load a paper with numbers nobody is able to interpret, or -even worse- are usually interpreted wrongly (p-values are top candidates among such values). Discuss your data, not some (complicated, misunderstood, inappropriate) statistics.
Dear Profs and friends, thanks very much for your great advice. My college is one of the feeders for universities. The competition is intense for critical courses like medicine, pharmacy and dentistry. So you can be sure that our student marks must be skewed, with about 15-20% getting A's in all subjects (Bio, Chemistry, Physics or Computer Sc and Maths).
When test scores seemed normal and my Q-Q plot was OK, I used the t-test. When the treatment seemed to produce high scores in my experimental group, I used MW. I have not read a paper where the researcher reported using both the t-test and MW. Because I'm not so expert in stats, I run both tests, but I just reported one of them in my papers. Your good advice will prevent mistakes on my part and lead to fewer rejections by journals. Thanks.
@Miranda,
That is just fine, as long as you are getting good results with your novel and admirable teaching methods!
Thanks dear Profs. I got the data here. I can report the means and medians. Then I report on normality and non-normality, as below. Is it better now?
For the control group, Kolmogorov Smirnov test showed that the test scores did not deviate significantly from a normal distribution (D = .151, p = .079).
But the test scores of experimental group were negatively skewed (skewness = -5.477; kurtosis = 30.0).
(Imagine it, one of the functions of RG is to prevent rejection by journal! Thanks very much for all your help.)
Miranda, it is surely OK (i.e., accepted) if you leave it like this. However, you then commit a classical error: you compare a significant result with a non-significant result. You conclude that the distributions of the groups are different because the two tests give you p < 0.05 in one case and p > 0.05 in the other. The conclusion, based on this, that the two distributions are different cannot be drawn from these results. This makes no sense logically. Never compare p-values!
Thanks Dr Jochen. I will leave it like that. Thanks for the caution. For normal data, I can mention D and p. But for the non-normality data, I just talk of skewness and kurtosis. I took the hint from Prof Zaal's earlier post, thanks :)
The only and probably correct conclusion about testing for normal distributions from all of the discussion here is: if you have a rather small sample, you had better test for normality in order not to lose any piece of data to outliers; but if you have a large amount of data, you don't care about tests of normality, because as the sample size increases the sampling distribution converges.
@ Hamza: If the sample is rather small, the test of normality will have too little power to detect important departures from normality. On the other hand, if the sample is rather large, the test of normality will have too much power, and will detect departures from normality that do not matter.
Unless you are the first one to be working with the variable in question, there is probably information from outside your study that you can draw on when deciding whether it can be analyzed with a parametric test. See the Bland & Altman's BMJ Statistics Note on this topic, for example (link below).
HTH.
http://www.bmj.com/content/338/bmj.a3166
Yes, Bruce, thank you for pointing to the important remarks of Bland and Altman. I am pleased to see your comment, as I expected it. Indeed, small samples have too little power to detect significant deviations from normality, while large samples have too much power. Methods based on the t distribution can detect differences in samples as small as two for paired differences, but then no one can trust such small numbers. The problem is that rank methods, no matter which one we use, are least effective with small samples. And so we have here a small sample size with a low effect size, and these don't allow conclusion, replication or prediction. I am personally a clinician, occupied for the last 20 years with clinical trials. For us in the clinic small samples often do matter, and I am trying to be practical and short. Since rank methods cannot produce P
Please note a reference to Box (1953): conducting preliminary tests of the assumptions underlying ANOVA is like putting a canoe out to sea to determine whether conditions are safe enough for an ocean liner to leave port. In short, the t test is regarded as fairly robust with respect to violations of the assumptions underlying it.
I'm a fan of that Box quote - but I think that saying that the t test is more robust than most tests of assumptions doesn't logically imply that the t test is fairly robust. Modern statistical work has demonstrated that t tests and similar procedures are not at all robust by standard criteria such as breakdown point and bounded influence.
I have just analyzed another dataset. Both my experimental and control groups had scores that were normally distributed. But Levene's test showed that equality of variances could not be assumed (F=7.11; p=.01). But the value of t was the same:
t (60) = 8.092 (p= .0005, equal variances assumed) and
t (52.328) = 8.092 (p= .0005, equal variances not assumed).
So I just quote the second line. But that is behaving like a technician rather than a scientist! My question now is: how was the df of 52.328 generated/obtained? (My N was 62.)
The test for unequal variances is called Welch's test; you can find the formula, e.g., in
http://en.wikipedia.org/wiki/Welch%27s_t_test
It is just a slight modification of the t-test, and do not give it too much importance, especially in your case, as both tests coincide and your samples are large, of the same size, and normally distributed.
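To answer the df question directly, the Welch-Satterthwaite formula combines the two group variances and sample sizes:
  df = (s1^2/n1 + s2^2/n2)^2 / ( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )
A small R sketch (the SDs below are hypothetical, since the group SDs were not posted; equal groups of 31 are assumed from N = 62):
welch_df <- function(s1, s2, n1, n2) {
  v1 <- s1^2 / n1; v2 <- s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}
welch_df(s1 = 2.2, s2 = 3.2, n1 = 31, n2 = 31)   # about 53 here, below the pooled df of 60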
To: Mohammad Al-khresheh:
Most often, data in the behavioral sciences do not follow a normal distribution. I prefer a nonparametric test. I do not know how serious statisticians can avoid checking the assumptions of a specific testing method.
To be honest, I'd just use the t-test, as the central limit theorem should give us some protection for the normality assumption. The t-test is also more powerful than the non-parametric alternatives.
@Miranda, I found the thread of discussion very interesting. If I analyze your problem with (very) unequal sample sizes, my first tool would be to understand my objective and the target outcome of the analysis. Then I'll check the theory of change that I'm working with. Based on this framework, I'll select my analytical models, with tests and theories. I would be rather critical about validity concerns when comparing apples with oranges. Not to be repetitive, I agree with Bruce to keep the analysis as 'it is what it is' and to interpret the information with care. I understand your scenario, but if I put myself in your shoes, I would be careful at the design phase of the project/work. I usually tell my students, "Think rigorously at the research planning phase, and act smartly at the analytical phase". It's a learning for growth, no doubt.
You can do both. If you have more than 25 or 30 data points, the significance test will be the one applied in the normal case, not the one based on the chi-square distribution.
Thanks to all of you for your responses. I asked several related questions on this thread. At first, it was for a research where the sizes were unequal. That's because gender was a variable, and like many places, I have 25% male and 75% female respondents. After that I asked for another study where gender wasn't a variable, and the 2 groups were equal in size. I was able to confirm many things from the expertise of all of you, and I got 2 papers published in local journals. For us on RG, WE CELEBRATE SCHOLARSHIP.
Dear all, a very energetic and informative discussion. I got stuck at a point similar to this and need help: whether to go for a parametric or non-parametric test when testing the significance of a mean difference.
The population may be assumed normal, but the sample does not look normally distributed, and the sample sizes are uneven: 161 in one group and 92 in the other.
My Levene's test value is F=2.5, Sig.=.109;
t=1.97, Sig.=.049 (equal variances) and
t=1.94, Sig.=.053 (unequal variances).
At this close call, should I go for a non-parametric test, mentioning the above conditions?
Sorry Dr. Zaal! Sir, I just edited the question; it had a misprint. Sorry for the inconvenience.
It is OK...
First test normality in your samples... their size is large enough to run this test (e.g. the Shapiro test). If non-normal, then use a non-parametric test.
Otherwise, if a normal distribution is confirmed, Levene's test shows that the variances do not differ, so the difference is significant. (p
Ashish, with sample sizes of 161 and 92, tests of normality will be statistically significant even for very small departures from normality. So there is absolutely no point in carrying out those tests. Second, homogeneity of variance becomes more important as the discrepancy in sample sizes increases. So in your situation, I would go with the unequal variances version of the test. Finally, if you use a Wilcoxon-Mann-Whitney test as a test of location, it is bound to have an inflated Type I error rate, because it is sensitive to differences in variance and skewness, etc., not just to differences in location. It is only a pure test of location when the two populations have identical shapes--and that is very unlikely to be true when you're working with real data. See the Fagerland & Sandvik (2009) simulation study for more info about this final point. HTH.
Article The Wilcoxon-Mann-Whitney test under scrutiny
Ashish,
You got two very different recommendations! :-)))
I would rather learn from Bruce, a professional statistician. I agree: a small deviation from normality, even if detected by a test, shall not be taken into consideration in your situation.
I erred with Levene's test - it tests equality, so your results show variances are not equal! It means: "t=1.94, Sig.=.053 (for unequal variances)".