First you have to know whether your data are normally distributed or not. To check that, you can apply the Kolmogorov-Smirnov or the Shapiro-Wilk test. If they indicate a normal sample, use ANOVA; if not, use the Kruskal-Wallis H test (like ANOVA, but for non-parametric samples).
It depends on your question of interest. What do you want to know? Is the question whether there are any differences across the groups, or are you interested in particular group differences? Or something else?
You should calculate estimates for the differences you are interested in, together with their uncertainties.
I think it can be analyzed by univariate ANOVA. Let Xij (i=1,...,nj; j=1,...,8) be the body weight of the ith subject in the jth group. Use the following model to fit your data:
Xij = a + bj + eij,
where a - overall mean,
bj - the jth group effect,
eij - random error.
You can estimate the parameters by the least squares method and test the hypothesis H0: b1=b2=...=b8=0 with an F-test.
See Johnson and Wichern (2007). Applied Multivariate Statistical Analysis. 6th ed. Pearson Prentice Hall.
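For readers who want to see the computation, here is a minimal sketch of the F statistic for the one-way model described above, in plain Python. The body weights are made-up illustration values (not data from this thread), only three groups are shown for brevity, and the function name is hypothetical.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within, pooled_variance) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # The group means are the least-squares estimates of "overall mean + group effect".
    group_means = [mean(g) for g in groups]

    # Variation explained by group membership (between groups).
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation left over around each group mean (within groups).
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between, df_within = k - 1, n_total - k
    pooled_variance = ss_within / df_within
    f_stat = (ss_between / df_between) / pooled_variance
    return f_stat, df_between, df_within, pooled_variance

# Made-up body weights (g) for three groups, for illustration only.
weights = [[310, 320, 330], [320, 330, 340], [350, 360, 370]]
f_stat, df_between, df_within, pooled_variance = one_way_anova(weights)
print(f_stat, df_between, df_within)
```

The F statistic is then compared with the F distribution on (df_between, df_within) degrees of freedom, which a statistics package or a printed table provides.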
Jiancheng,
this tests the null hypothesis that the expected weights in all groups are equal. Nothing wrong with it, but in my opinion this hypothesis is not sensible, and it does not fit the question ("compare the differences between groups").
ANOVA can be used, but it depends on the question and on how the groups were obtained. So, how did you define your groups? What are their intrinsic characteristics other than weight? Are they equivalent? And why do you want to know this, that is, what is your independent variable? If it is a true independent variable, ANOVA can be used. If instead it is a PREDICTOR, then you should go with regression analysis.
As said above, check the normality of the variable first: if the variable is normally distributed, apply ANOVA; otherwise, apply Kruskal-Wallis.
Neither ANOVA nor Kruskal-Wallis gives any hint as to which groups are different, let alone what the actual difference between the groups is.
Sounds like ANOVA (F test) can be used. However, you need to be careful about the assumptions: all groups are independent, normally distributed, and have equal variances. Independence and equal variance may not be easy to validate.
Looking at the differences among the groups is easy with ANOVA: you should do post hoc tests (Bonferroni, S-N-K). With Kruskal-Wallis H, however, I think it is not possible to do the same.
Dear down-voter of my answer and of Jon Salmanton-Garcia's answer, could you kindly provide a clear answer on how to compare the difference in mean body weight for eight different groups?
Dear Pahlaj, I gave a down-vote. And I also think I explained why and also how Abdulsalam's problem is to be solved.
Again: ANOVA does not tell you anything about the differences between the groups. ANOVA can be used to assess if the whole variable ("group") with all its levels is a valuable predictor in a model. This is an entirely different question. If one is interested in the differences between groups, then there is only one simple solution: actually calculate these differences. These are still expectations or estimates, so they are associated with some uncertainty. This uncertainty should be given in the form of confidence intervals. The required standard error can be obtained from a pooled variance estimate, as it is calculated as one interim result in an ANOVA.
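A minimal sketch of this calculation in plain Python: every pairwise difference of group means, with a confidence interval whose standard error comes from the pooled (residual) variance. The data and group labels are made up for illustration, and the function name is hypothetical. The critical t value is passed in by the caller; 2.447 is the two-sided 95% value for the 6 residual degrees of freedom of this toy example. No adjustment for multiple comparisons is applied here.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def pairwise_differences(groups, t_crit):
    """Estimate every pairwise mean difference with a confidence interval.

    groups: dict mapping a group label to its list of observations.
    t_crit: critical t value for the chosen confidence level and the
            residual degrees of freedom (N - k), supplied by the caller.
    """
    n_total = sum(len(obs) for obs in groups.values())
    df_resid = n_total - len(groups)
    means = {label: mean(obs) for label, obs in groups.items()}
    # Pooled variance: the residual mean square of a one-way ANOVA.
    ss_resid = sum((x - means[label]) ** 2
                   for label, obs in groups.items() for x in obs)
    pooled_var = ss_resid / df_resid

    results = {}
    for a, b in combinations(groups, 2):
        diff = means[b] - means[a]
        se = sqrt(pooled_var * (1 / len(groups[a]) + 1 / len(groups[b])))
        results[(a, b)] = (diff, diff - t_crit * se, diff + t_crit * se)
    return results

# Made-up body weights (g); three groups with three animals each.
groups = {"A": [310, 320, 330], "B": [320, 330, 340], "C": [350, 360, 370]}
for pair, (diff, low, high) in pairwise_differences(groups, t_crit=2.447).items():
    print(pair, round(diff, 1), round(low, 1), round(high, 1))
```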
The advice to test differences between groups is not the answer to how groups are compared. Post hoc tests like Tukey's HSD test repeatedly test a null hypothesis while controlling the family-wise type-I error rate (FWER). But again, this does not provide any information about the differences. It only gives you those pairs for which you may reject the null hypothesis ("no difference") while controlling the FWER at some arbitrary level. Note that the FWER is controlled independently of an ANOVA.
When running Tukey's HSD tests, some software provides not only a classification regarding the rejection of the null hypothesis (and/or p-values) but also confidence intervals adjusted for multiple comparisons. Whether this adjustment is required and sensible has to be decided by the researcher. If not, then the differences and confidence intervals are best calculated manually.
Dear Jochen,
Oh, I see. Thanks for your explanation. I am interested to know a bit more.
Could you kindly provide a supporting reference (any textbook or paper) for what you state about ANOVA? I had a different idea about ANOVA. I appreciate your contributions, which are very useful for my knowledge about ANOVA.
The original question is: "What type of statistical analysis should be used to compare difference in mean body weight for eight different groups?" and not "comparing difference between groups?"
Thanks in advance. I am very new to this field.
For ANOVA it is extremely helpful to read Fisher's original work.
http://psychclassics.yorku.ca/Fisher/Methods/
There is a blog entry about the misconception (ANOVA vs. post hoc tests):
http://www.math.yorku.ca/Who/Faculty/Monette/Ed-stat/0525.html
More about the HSD test in
Seaman, M.A., J.R. Levin, and R.C. Serlin. 1991. New developments in pairwise multiple comparisons: Some powerful and practicable procedures. Psychological Bulletin 110:577-586.
Much of the confusion arises because of the deep misconceptions about hypothesis tests. There is a hell of a lot of literature about this topic. This may be a primer:
http://warnercnr.colostate.edu/~anderson/thompson1.html
I attached some papers about multiple-comparison procedures.
Dear Abdulsalam,
Dear All,
I agree with the comments of Jochen. I suggest a very simple and easily understandable homepage where statistical concepts and applications as well as online statistical programs can be found:
http://vassarstats.net/textbook/index.html
Dear All,
In a scientific discussion, participants exchange arguments and not down-votes. As a matter of principle, I do not down-vote any comments but state my opinion.
Dear András (and all others),
I agree that a scientific discussion must not be based on or driven by votes (neither up nor down) but by the exchange of arguments.
However, a scientific discussion is not the only purpose of this forum. Questions are asked and answers or help are given. Sometimes there are very useful and helpful answers, smart ones, or comprehensive ones. Tagging such answers with an up-vote may help to guide seeking readers to such valuable posts. On the other side, there are sometimes answers that are logically flawed, off-topic, misguided, and likely to put the reader on the wrong track. Tagging such answers with a down-vote may make seeking readers a little sceptical when reading these posts. However, down-voting without correction or explanation is bad (and I always explain what and why I down-vote [if I do so]). Voting (up or down) simply to express a similar or diverging opinion is also bad: collecting opinions is not really helpful. Thus I see that these votes do have a value.
So we have two aims: discussing (= learning) and providing help. The latter, as I see it, is supported by a voting system, provided the votes are given carefully and thoughtfully.
Dear Jochen,
Thanks for your answer.
Unfortunately, many RG participants have misused the down-vote option. Practically all of them have used it anonymously and mostly without explanation. You are the first one I have ever met who explicitly stated having down-voted some comments. As I said above, in science the discussion and the learning opportunities are what matter. Punishment (because I cannot call it evaluation) should not be an instrument of mutual education.
In your case many thread participants were not able to understand what you have written. Regrettably, statistical teaching is mechanical at many, even European universities. Not everybody can get as excellent education as a German citizen.
I also submitted a wish to the RG team that down-votes should be made non-anonymous, just to avoid that it is used as a means for punishment.
PS: there is a lot of very bad education in Germany as well, believe me! However, this RG forum is a place where we can, in part, overcome the local differences in educational quality. At least the willing students can get a lot of good ideas to think about and may encounter new aspects, new insights, and other philosophies.
Thank you, Prof. Jochen, for sharing historical resources. I am very interested and am reading them; I may ask for your further help if needed.
I agree with Dr. Jochen Wilhelm. You can use ANOVA and then post hoc analysis such as Scheffé test if mean differences were significant in ANOVA.
Dear Amin, this is NOT what I said. Which post are you referring to?
To compare the difference in mean body weight for eight different groups, you should use the t-test for data with a normal distribution, or the Mann-Whitney test if the data are not normally distributed.
Note: to test the data distribution, use the Kolmogorov-Smirnov test.
Best regards
Mahmoud, the T test (or the MW test*) is not a comparison. It is a test. It does not provide any estimate of the difference; instead it only gives you the probability of the observed and more extreme t-statistics under the null hypothesis that the groups have the same population means.
* The null for the MW test is more complicated. Given the assumption that the distributions are identical except for the location, the null is the equality of the population medians. Otherwise the null is that a value from one group is as likely to be larger as to be smaller than a value from the other group.
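To make the second formulation of the MW null concrete, here is a small sketch in plain Python that estimates the probability that a value from one group exceeds a value from the other (ties counted as one half). This is the Mann-Whitney U statistic divided by the number of pairs; a value of 0.5 corresponds to the null. The function name and the numbers are made up for illustration.

```python
def prob_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X = Y) from two samples."""
    # Count, over all pairs, how often the x value beats the y value.
    wins = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    return wins / (len(x) * len(y))

# Made-up values for two groups.
group_1 = [3, 4, 5]
group_2 = [1, 2, 3]
print(prob_superiority(group_1, group_2))
```

Unlike a p-value, this number is an estimate of how the groups differ, so it can be reported together with an interval.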
Dr. Jochen Wilhelm
Firstly, thank you for your comment.
Secondly, I mean that he can use the t-test to find out whether the groups have the same population means or different population means. If they are found to have different population means, that means the groups have different opinions about the variables or the item we measure.
Thirdly, the MW test is used for finding out the equality of the population medians, or the equality of the population probability that a value of the one group is larger than a value of the other group. So here you compare between groups, as you said.
The question is: what if the groups are unequal in medians? What does it mean if they have different values?
I think that means they have different opinions about the same item or variable.
In the end,
the t-test and MW are used to find out whether or not there are differences between groups in the areas shown in the last comment, and they are not used to compare the differences between groups.
Mahmoud, thank you for your reply and I see your points.
I still have to point out that tests do not identify differences, and they also do not identify similarities. There are two (mutually incompatible) philosophies behind tests, the philosophy of Neyman and the philosophy of Fisher. Typically, tests are interpreted in the realm of Neyman's philosophy, that is, the rates of false decisions are controlled, often with the strange restriction that only "rejecting the null hypothesis (H0)" is considered a decision (that can be right or wrong), and only the rate of false rejections is controlled.
Based on this, the result of a test is a decision, not a comparison. The test gives no indication of how likely this particular decision is to be right or wrong. The repetitive application of the testing procedure only guarantees an upper limit of the rate of false rejections (type-1 error rate). This is all. After a test I might decide to behave as if the population means were different, but this is conceptually completely different from actually comparing the groups or making an inference about how different the population means may be.
In the vast majority of occasions, performing a test on a point null hypothesis is nonsensical on a priori grounds. If you have different groups that are different in some respect, there is no reason that the theoretical populations of these groups should have exactly identical population means or medians (or whatever). I would bet my right hand that this is not the case. So what do I actually test? I already know that they are different. The key questions are: "how much different are they?" and "is the difference relevant?". Tests do not answer these questions, but a comparison does.
There's one problem: t-tests can only be used to compare means of TWO groups. For comparing means of more than two groups, ANOVA is the appropriate method. If you use t-tests repeatedly in more than two groups, there will be error propagation: the chance of at least one false positive grows with each additional test.
Gloram, ANOVA is not a comparison between groups. ANOVA is at best a "comparison" of a model with the respective categorical predictor (with k>2 different categories) and a model without this predictor. ANOVA is about the reduction in the residual variance that can be attributed to a (set of) predictor(s). This is not identical to "comparison of k>2 group means".
ANOVA: ANalysis Of VAriance between groups. ANOVA is NOT a comparison between predictors (that is MULTIPLE REGRESSION). ANOVA is exactly used to compare groups subjected to interventions. Please refer to:
"Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences between group means and their associated procedures (such as "variation" among and between groups), developed by R.A. Fisher. "
http://en.wikipedia.org/wiki/Analysis_of_variance
Here's also an excellent source:
"Analysis of variance (ANOVA) refers to statistical models and associated procedures, in which the observed variance is partitioned into components due to different explanatory variables. It provides a statistical test concerning if the means of these several groups are all equal. In its simplest form, ANOVA is equivalent to Student's t-test when only two groups are involved. The analysis of variance were first developed by R. A. Fisher in the 1920s and 1930s. Thus, it is also known as Fisher's analysis of variance, or Fisher's ANOVA."
http://brainimaging.waisman.wisc.edu/~perlman/R/Chapter%20seven%20%20Analysis%20of%20Variance.pdf
Seems to be an example where the Wiki article can be improved.
ANOVA, btw, is a procedure that is used with (simple and) multiple regression models. It is not in opposition to regression, it is a tool that can be applied to regression models.
No, that is NOT it. ANOVA is NOT a multiple regression technique. Here's a quote from "Essentials of Behavioral Research" by Robert Rosenthal and Ralph L. Rosnow, 3rd ed., McGraw-Hill:
"The F test can be used to test the hypothesis that there is, in the population from which we have drawn two or more samples, (a) no difference between two or more group means or, equivalently, (b) no relationship between membership in a group and the score on the dependent measure." Chapter 14, Analysis of Variance and the F test, page 409.
ANOVA is usually used on true experiments to see if the treatment had an effect in various experimental groups compared to the control group which did not receive the treatment, effect being defined as DIFFERENCE BETWEEN MEANS. T-tests are a special case of ANOVA, for 2 groups only. ANOVA requires normal distribution and homoscedasticity. Since those can be tricky to find, and true experiments are even harder, many researchers are using a Logistic Regression technique which does not require as many statistical assumptions and, since it is a multiple regression procedure, it can be applied to quasi-experiments.
One more thing: ANOVA requires a continuous dependent variable. Logistic regression deals with a binary Criterion (or Outcome, in Epidemiology) variable.
Dear All,
Now, I can realize that we successfully have confused our dear Abdulsalam who asked this question.
Dear AbdulSalam,
Can you kindly explain a bit more about your problem/question?
Is it about comparing differences between means of body weight of 8 groups?
or
Comparing 8 different groups?
Give us a bit of background: are you looking at it from an experimental angle, or for mathematical statistics use only? What is your purpose in finding the difference?
For this question at least 3 of us have received down-votes for suggesting ANOVA to compare the difference in mean body weight of 8 groups. Prof. Jochen thinks that we have given a wrong answer, so we deserve the down-votes.
I am reading the materials provided by Prof. Jochen. May I invite all of us to read the references he provided, to help ourselves update our knowledge, and to assist our dear Abdulsalam in finding a solution to his problem? My apologies for this message.
Thanks
Ideally, the normality assumption should be tested before selecting the technique. If the assumptions of ANOVA are met, it can give reliable results. Otherwise, use the Kruskal-Wallis test, which is a non-parametric alternative to ANOVA.
But note that testing assumptions on the same data set that is used to test the hypothesis will lead to a bias. Further, take care that the hypotheses tested by ANOVA and KW are not the same. Actually, one should be clear about what hypothesis one wants to test and then use the respective test. It is not very sensible to choose a test (and, thus, the hypothesis) based on properties of the particular sample. This is part of an exploratory data analysis (EDA), which has entirely different aims than the test of a hypothesis.
Another remark: What do tests of the assumptions tell you? If the result is "significant", the question remains whether the detected difference (between the assumed model and the real-life data) is relevant to the problem; if it is "non-significant", the question remains whether an existing relevant difference was just not detected (false negative). So whatever the result is, it does not answer the central question. In an EDA, the difference may be estimated and its relevance might be assessed. Once this has been done, one might consider planning an experiment to gather new data with which a specific hypothesis might then be confirmed under the control of some error rate, using the assumptions that are justified by theoretical considerations and the experience of similar and previous experiments.
There are two issues that have arisen from this question: 1) testing the assumptions of the model; 2) How does one compare means.
Part 1) I agree with Jochen's latest post but will rephrase it. In testing for normality there are two outcomes: significant or not-significant. If we reject the null hypothesis that the data are normally distributed then we either transform the data or we try a nonparametric statistical approach. If we do not reject the null hypothesis we then pretend that the null hypothesis is true. However, a failure to reject the null hypothesis is not the same as proving that the null hypothesis is true. If you want to prove the null hypothesis you could apply symmetry: if a p-value less than 0.05 results in rejection, then a p-value above 0.95 would be grounds for accepting the null hypothesis.
In any case, you need to be aware that problems in your data can result in problems with your analysis. Always make a scatter plot of your data to find outliers. This is a great way to check for mistakes: 79 grams becomes 790 grams if the finger accidentally hits the 9 and the 0 key at the same time. A scatter plot also helps identify distribution problems. Are your data bimodal, or highly skewed? Is the variance very large in one treatment and small in another?
Part 2) The answer to the original question depends on the experiment. It also depends on the software that you have available for the analysis. I'll give two examples.
Example 1) I measured insect body weight for sublethal doses of eight insecticides. The categories have nothing in common so I cannot use regression methods. I should use a multiple comparison procedure that controls the experimentwise error rate. With eight categories, something like a t-test will be almost certain to find at least one significant difference by chance alone. There are a large number of options here that all depend on what you want to assume and the nature of your data. There is a very nice small book that discusses many of the different options and why you should choose one over another: Larry E. Toothaker, 1993, "Multiple Comparison Procedures", SAGE University Paper series on Quantitative Applications in the Social Sciences, #07-089, SAGE Publications Inc., 95 pp. ISBN 0-8039-4177-3. There are other sources, and the user manual for your statistics package may provide a good discussion. A few options include: LSD, Dunn, Tukey, Scheffé, Newman-Keuls, Ryan-Einot-Gabriel-Welsch, Shaffer-Ryan, Dunnett, and many more.
Example 2) I measured insect body weight for insects feeding on plants treated with 1 kg/ha nitrogen, 2 kg/ha nitrogen, ... 8 kg/ha nitrogen. In this case, the treatments were applied as categories. However, the amount of applied nitrogen is continuous and you have sufficient range in the data to justify treating nitrogen as a continuous variable. Please do not use a multiple comparison procedure on this type of data. Some form of regression analysis should be used. The easiest might be polynomial regression if low application rates stimulate growth while high rates are toxic. There are also more complex dose-response models that would include these effects. Such models were developed in one case for modeling glyphosate toxicity to plants. At low doses glyphosate stimulates growth. At intermediate doses glyphosate is a potent herbicide. At high doses glyphosate creates localized lesions, the herbicide cannot become systemic, and therefore toxicity is reduced.
This said, if I have eight treatments (nitrogen at two levels, potash at two levels, and phosphate at two levels), then I don't have enough range in any one of these continuous variables to justify a continuous model. It is easy to draw a straight line between two points, but there isn't enough data to determine if a straight line is appropriate. So in this last case, stay with the multiple comparison methods.
Timothy,
your symmetry argument is not correct. A simple proof by example: take the extreme case in which you have only a single value in each of two samples (the values are not identical) and you perform a Wilcoxon test; the p-value is 1 (no matter what values you have; only for identical values is the result undefined because of 100% ties). By your argument this would be evidence for the identity of the population distributions. Here it is obvious that the (severe) lack of power is simply responsible for this result. Only when the power is specified and reasonably large may a "non-significant" result be used to decide that the population effect is lower than the minimum relevant effect (as used for the sample-size determination to get the desired power).
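The single-value example can be checked with a small exact rank-sum calculation. This is a sketch in plain Python that enumerates every possible assignment of ranks to the first group; it assumes no ties, and the function name is hypothetical.

```python
from itertools import combinations

def exact_rank_sum_p(x, y):
    """Two-sided exact p-value of the Wilcoxon rank-sum test (no ties assumed)."""
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n1, n = len(x), len(pooled)
    observed = sum(rank[value] for value in x)
    expected = n1 * (n + 1) / 2
    # Rank sums for every possible way to give n1 of the n ranks to group 1.
    sums = [sum(c) for c in combinations(range(1, n + 1), n1)]
    # Count the assignments at least as far from the expectation as observed.
    extreme = sum(abs(s - expected) >= abs(observed - expected) for s in sums)
    return extreme / len(sums)

# One value per sample: only two assignments exist and both are equally extreme.
print(exact_rank_sum_p([1.0], [100.0]))
```

With one value per sample the p-value cannot be anything other than 1, however far apart the two values are, which is the lack of power described in the post.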
Jochen,
You are partly correct. The problem is that I left out a large number of details from a post that was already getting too long. How about the other extreme: I have 10,000 samples from each of two populations that are both normally distributed. I then use a randomization test to see if the means are different. I find that the observed mean from my sample is exceeded by 95% of the mean differences from all possible combinations of the data. Would that really not be sufficient to say that the two samples are from the same population?
Now we have example and counter example. The take home message is that one needs to be careful in how you apply the idea. But that could be said of all of statistics. There is no single statistical analysis that applies to all data in every case. Even something as universal as the mean doesn't always work (consider bimodal data). I study insect feeding behavior. The insect feeds briefly, then moves. It is trying to find a good feeding spot. Once it finds a good spot it will stay there for hours. The mean feeding time is 40 minutes. However, it is not possible to find a feeding time of 40 minutes in these data. Since I can find a case where the mean is invalid, do we stop using it?
The two way ANOVA approach is the equivalent of an LSD test with no protection for the experimentwise error rate. I assume that one assigns significance at p=0.05 and that each test is independent. With eight treatments there are (8!/(2!(8-2)!))=28 possible pairwise comparisons. The experimentwise error rate is then 1-(1-0.05)^28 = 0.76. You are quite likely to find at least one significant difference at this rate.
You could solve the problem by setting p≈0.0018, but that is too restrictive for each pairwise comparison. Hence there are tests that have been specifically designed to achieve the proper balance between the power of pairwise comparisons and control of the experimentwise error rate.
This assumes that you want all comparisons. It isn't so bad if you are just comparing seven treatments to a control. 1-(1-0.05)^7=0.3. Still, not great either.
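The arithmetic in this post can be reproduced with a few lines of Python. This is a sketch under the same assumption of independent tests; the function names are hypothetical. Two common per-comparison thresholds are shown, Bonferroni (alpha divided by the number of tests) and Sidak (the exact inverse of the formula above); both come out near 0.0018 for 28 comparisons.

```python
from math import comb

def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests

def sidak_alpha(target_rate, n_tests):
    """Per-test level that gives exactly target_rate across independent tests."""
    return 1 - (1 - target_rate) ** (1 / n_tests)

def bonferroni_alpha(target_rate, n_tests):
    """Simpler, slightly more conservative per-test level."""
    return target_rate / n_tests

n_pairs = comb(8, 2)  # all pairwise comparisons among 8 groups: 28
print(n_pairs)
print(round(familywise_error_rate(0.05, n_pairs), 2))  # about 0.76
print(round(familywise_error_rate(0.05, 7), 2))        # 7 treatments vs. one control: about 0.3
print(sidak_alpha(0.05, n_pairs), bonferroni_alpha(0.05, n_pairs))
```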
Dear all,
thank you for your discussion and comments;
it's really helpful.
Dear Pahlaj Moolio,
my question is: how can I compare the "mean body weight" of eight different groups?
I am working on rats. I want to examine the effect of smoking in combination with vitamin E and exercise on the mean body weight for 8 different groups.
According to these parameters, my experiment was designed with eight groups:
smoking groups, exercise groups, vitamin E, smoking and vitamin E, smoking and exercise ........
ANOVA will work very nicely, because you actually have control of the participants and their conditions. Make sure you have sample sizes large enough to have enough power.
http://www.andrews.edu/~calkins/math/edrm611/edrm11.htm
ANOVA. If S=smoking, E=exercise, V=vitamin, then the model is weight=S+E+V+SE+SV+EV+SEV. S, E, and V are categorical variables. Then report least squares means. All the variables are binary, so all you need is to know if the main effect is significant. You need to see if the model fits. Check for outliers in the data that could be highly influential. Check for normality and heteroscedasticity. If there are non-significant terms, consider examining a reduced model. Do not remove a main effect unless all interactions with that effect are non-significant. It would be best to present both the full model and the simplest model. Note that some people will only want to see one or the other model.
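As a rough illustration of the model weight=S+E+V+SE+SV+EV+SEV, the sketch below estimates each term and its F statistic for a balanced 2x2x2 design using +/-1 contrasts, in plain Python. It assumes equal group sizes (the shortcut does not hold for unbalanced data), the weights are made up, and the function name is hypothetical. In practice a statistics package would fit this model and also supply p-values.

```python
from itertools import product
from math import prod
from statistics import mean

# Positions of the factors in each group key (s, e, v).
TERMS = {"S": (0,), "E": (1,), "V": (2,),
         "SE": (0, 1), "SV": (0, 2), "EV": (1, 2), "SEV": (0, 1, 2)}

def factorial_effects(cells):
    """Effect estimate and F statistic per term of a balanced 2x2x2 design.

    cells maps (s, e, v), each 0 or 1, to the list of weights in that group.
    """
    n_total = sum(len(obs) for obs in cells.values())
    cell_means = {key: mean(obs) for key, obs in cells.items()}
    # Residual mean square: variation of animals around their own group mean.
    ss_resid = sum((y - cell_means[key]) ** 2
                   for key, obs in cells.items() for y in obs)
    ms_resid = ss_resid / (n_total - len(cells))

    table = {}
    for name, factors in TERMS.items():
        # Contrast sign: product of the +/-1 codes of the factors in the term.
        signs = {key: prod(1 if key[i] else -1 for i in factors) for key in cells}
        coef = sum(signs[key] * y for key, obs in cells.items() for y in obs) / n_total
        effect = 2 * coef                        # mean at "+" minus mean at "-"
        f_stat = n_total * coef ** 2 / ms_resid  # one numerator df per term
        table[name] = (effect, f_stat)
    return table

# Made-up weights (g): smoking lowers weight by 20, exercise by 10, no interactions.
cells = {}
for s, e, v in product((0, 1), repeat=3):
    base = 300 - 20 * s - 10 * e
    cells[(s, e, v)] = [base - 1, base + 1]

for term, (effect, f_stat) in factorial_effects(cells).items():
    print(term, effect, f_stat)
```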
A multiple comparison procedure approach was discussed. It was a natural choice based on the question asked and details that were provided. However, I now don't see much value in a multiple comparison procedure. Is there a reason why you care if the SEV treatment is significantly different from the SE treatment? It would seem more useful to know the effect of smoking, the effect of exercise, the effect of vitamins, and how these treatments interact. The table/figure from a multiple comparison procedure that compares all the treatments will be much harder to interpret.
If you have followed the principles of design of experiments and your data are normally distributed, directly use the ANOVA for RBD factorial experiments. Otherwise, use non-parametric tests.
A multiple comparison test adds value by identifying the significant differences between pairs of treatment means.
Eight groups mean there are 28 pairs, with the difference tested between each pair. A multiple comparison test may be needed (Duncan's, Tukey's, Scheffé's, and even the t-test or Z test).
In the classic process (no statistical software available) ANOVA is done first to ascertain that there is a significant difference between at least two groups. Then one proceeds to the multiple comparison test.
Nowadays, one can directly proceed to multiple comparison tests, no sweat.
Multiple comparison tests are not suggested. You have three treatments: Exercise, Smoking, and Vitamin treatments. So what does it mean if the treatment with exercise, smoking and vitamins is significantly different from the treatment with only smoking? Was it the exercise, the vitamin, or the interaction that yielded the significant difference? The obvious answer is that you look at the significance of the other treatments, but that only sort of works because you are using pairwise comparisons to try and recreate an ANOVA. The two are not the same thing.
Abdulsalam stated that there were eight groups and there were three treatments (smoking, vitamin, exercise). It happens that with three treatments there are seven possible combinations (S, V, SV, E, SE, VE, SVE), and one more treatment where nothing was done gives a total of eight. For argument's sake, say that in this experiment my pairwise comparisons showed that there were significant differences between all treatments. What is the average effect of smoking? Was the effect of smoking stronger or weaker than the effect of exercise? Did smoking hurt everyone equally? Least squares means will give a quantifiable answer to these questions. This is so much better than just saying that there was a significant difference between two treatments.
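For the full model with all interactions, the least squares mean of one factor level is the unweighted average of the group means at that level, so every treatment combination counts equally even when group sizes differ. A minimal sketch in plain Python, with made-up and deliberately unbalanced data and a hypothetical function name:

```python
from itertools import product
from statistics import mean

def ls_means(cells, factor):
    """Unweighted average of the group means at each level of one factor.

    cells maps (s, e, v), each 0 or 1, to the weights in that group;
    factor is the position (0, 1 or 2) of the factor of interest.
    """
    cell_means = {key: mean(obs) for key, obs in cells.items()}
    return {level: mean([m for key, m in cell_means.items() if key[factor] == level])
            for level in (0, 1)}

# Made-up weights (g); the untreated group has more animals than the others.
cells = {}
for s, e, v in product((0, 1), repeat=3):
    base = 300 - 20 * s - 10 * e + 5 * v
    cells[(s, e, v)] = [base - 2, base + 2]
cells[(0, 0, 0)] = [296, 298, 300, 302, 304, 300]

smoking = ls_means(cells, 0)
smoking_effect = smoking[1] - smoking[0]  # average effect of smoking
print(smoking, smoking_effect)
```

A plain average over all non-smoking animals would instead be pulled toward the larger untreated group, which is why the two summaries can disagree in unbalanced designs.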
Dear Abdulsalam,
I expect I can help you a little bit more with your question. As other forum members have already pointed out, it is very important to detail your experimental design.
Apparently, you have three between-subject factors, with two levels each:
Factor 1: Smoking (absence or presence)
Factor 2: Vitamin supplementation (absence or presence)
Factor 3: Physical Exercise (absence or presence)
This would give you 8 groups (2x2x2=8).
Since you are going to measure body weight over time in the same animals, you also have a within-subject factor (the measures themselves, or "days", also called repeated measures) that must be included in the analysis. In fact, you will be determining whether these factors alter body weight over time, and whether they modify the effect of each other. The interval between measures should be constant (every day, or every other day, or every three days, whatever...). This is important for the repeated measures analysis.
This is a very complex design (three-way ANOVA with repeated measures), and it will be hard to interpret the output of the statistical package. For this reason, you should talk to a statistician before starting your experiment. Moreover, selecting the exercise regimen or vitamin dose will be critical, because you have only two levels of each factor. Therefore, you should discuss carefully your experimental design with your advisor (make your hypothesis very clear) and with a statistician, who will help you to plan your experiment and deal with some other points that I have not mentioned here, such as the assumptions for this analysis. I hope this may help you. All the best, and good luck! Carlos.
Carlos' answer was good but it depends on what data you have. There are at least four choices:
1) I took a bunch of rats, treated them with 8 different treatments and measured them 7 days after treatment.
2) I took a bunch of rats, measured them, applied treatments, and then measured them again 7 days after treatment.
3) I took a bunch of rats, applied 8 treatments, and then measured them at 1, 2, 5, 8, 15 days after treatment.
4) I took a bunch of rats, applied 8 treatments, and then measured some of them at 1 day after treatment, measured others at 2 days, ..., and others at 15 days after treatment. All rats were measured only once.
Carlos' answer is the correct one for #3. You can avoid a repeated measures design by using #4. If rats were measured before treatment in #3 or #4, then convert the data to a difference from initial conditions. Likewise, #2 can be converted into #1 by taking the difference between before and after treatment for each rat.
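Converting design #2 into design #1 amounts to replacing each rat's two measurements with a single change score, which can then be analysed like any single measurement. A tiny sketch in plain Python; the rat identifiers and weights are made up, and the function name is hypothetical.

```python
def weight_changes(before, after):
    """Per-rat change (after minus before); both map rat id -> weight in g."""
    return {rat: after[rat] - before[rat] for rat in before}

# Made-up weights (g) before treatment and 7 days after treatment.
before = {"rat1": 250.0, "rat2": 262.0, "rat3": 245.0}
after = {"rat1": 258.0, "rat2": 259.0, "rat3": 251.0}
changes = weight_changes(before, after)
print(changes)
```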
Thank you Carlos for broadening the scope of the answers and helping us avoid a serious oversight.