In the case of small data sets, a test of significance for normality may lack the power to detect a deviation of the variable from normality. Therefore, I advise taking a subjective route and looking at two things: first, what the literature says about the normality of the variable under consideration, and second, the descriptive statistics, namely the mean, median, mode, range, and quartile deviation.
Good question. Unfortunately, just as for other statistical tests, the chance of detecting an effect if an effect is present is greatly reduced by such small samples.
But, are you asking this so that you can decide whether or not to transform your data prior to some other analysis? If so, remember that many statistical tests actually do not require normality of the raw data. It's the normality of the model residuals that you're most concerned about, since this tells you if the model is explaining the distribution of your data or not. In some cases, in order to improve residual normality, you may need to resort to data transformations. But, I would say be very careful of these as well.
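To make this concrete, here is a minimal sketch in R; the data frame and the linear model are made up purely for illustration, and the point is only that the checks are applied to the residuals, not to the raw response.

```r
set.seed(1)
df  <- data.frame(x = 1:10, y = 2 + 0.5 * (1:10) + rnorm(10))  # hypothetical small data set
fit <- lm(y ~ x, data = df)
qqnorm(resid(fit)); qqline(resid(fit))   # visual check of residual normality
shapiro.test(resid(fit))                 # formal test; very low power at n = 10
```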
Budi: could you please clarify why you suggest this particular test?
J. Patrick: One of the reasons I am interested in this is to decide whether I should choose parametric or nonparametric bootstrapping to estimate confidence intervals for the parameters of a model (e.g. nonlinear) fitted to the data set.
I would examine your model residuals to see if they are normal. With such a small data set (the kind that I also deal with), such an assessment of residual normality may be quite subjective (i.e. does the QQ-plot show a straight line?). With such a small sample, you're right that some bootstrapping is important.
I've only worried myself with parametric bootstrapping. So, perhaps others can weigh in here...It seems like the sample size alone would prevent you from adequately making an assumption about the distribution, so wouldn't a nonparametric bootstrapping approach be the most conservative?
The Shapiro-Wilk test was designed to test for normality with small sample sizes (n < 50). This test is more powerful than the Lilliefors, Kolmogorov-Smirnov, Anderson-Darling and other tests for small samples. (See the Shapiro-Wilk test.)
Thank you, Budi, this is the kind of information I was looking for! Perhaps you can suggest a paper about this topic?
Thanks again, J. Patrick! I am also thinking that nonparametric bootstrapping would be a first choice in such situations, but a concerning factor (if I am not mistaken) is the effect of outliers. What method of outlier detection/data transformation would you recommend for small data sets?
Just to clarify, normality of the raw data is not an assumption for models like ANOVAs, GLMMs, etc. You're only concerned about normality of the residuals. Sorry if I'm sounding like a broken record. I just think people spend a good bit of time messing with their raw data unnecessarily to make them normal, and that's not necessarily required. That said, any outlier test may also be sensitive to sample size. The outlier tests I know of still need a set of values against which to assess an individual point's leverage (for example). I can imagine a case - but I haven't simulated this - where an outlier detected with n=10 might turn out NOT to be an outlier with n=100.
Thanks again, J. Patrick! So, does the following procedure make sense for small data sets?
1. Fit the model (e.g. non-linear) to the raw data.
2. Test residuals for normality (e.g. with Shapiro-Wilk test).
3. If residuals are normal, use parametric bootstrapping to estimate model parameter confidence intervals. If not, use nonparametric bootstrapping.
I am still unclear about what is best to use for outlier detection, particularly keeping in mind what you mentioned about sample size effects. What do you generally use in this case?
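A rough sketch of the three-step procedure above in R, assuming a made-up data set and a simple exponential model (the model form, starting values, and 5% cut-off are illustrative assumptions, not a recommendation):

```r
set.seed(1)
df  <- data.frame(x = 1:10, y = 5 * exp(0.2 * (1:10)) + rnorm(10))       # hypothetical n = 10
fit <- nls(y ~ a * exp(b * x), data = df, start = list(a = 1, b = 0.1))  # step 1: fit the model
res <- resid(fit); yhat <- fitted(fit)

refit <- function(new_y)   # refit the model to a bootstrap response
  coef(nls(new_y ~ a * exp(b * x), data = df, start = as.list(coef(fit))))

B <- 2000
if (shapiro.test(res)$p.value > 0.05) {                                  # step 2: test residuals
  # step 3a: parametric bootstrap, simulating Gaussian errors around the fit
  sims <- replicate(B, refit(yhat + rnorm(length(res), 0, sd(res))))
} else {
  # step 3b: nonparametric bootstrap, resampling the observed residuals
  sims <- replicate(B, refit(yhat + sample(res, replace = TRUE)))
}
apply(sims, 1, quantile, probs = c(0.025, 0.975))   # percentile CIs per parameter
```

(In practice a few bootstrap refits may fail to converge; wrapping the refit in try() and discarding failures is a common workaround.)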
That sounds like a great approach. With n=10, I would likely just view the residuals with some level of subjectivity and simply acknowledge that there's going to be noise either in the normality test itself or in my subjective characterization of normality. Of course, that means it would be good to put the QQ-plot in the supplementary materials so that others can see what you judged to be normal, regardless of what your normality test says about your residuals. It's just tricky with small sample sizes.

Re: outliers. I tend to keep all data points, since I usually can't justify why a particular data point should be excluded. We've all noticed that one individual or subject that has some unexpected character (sort of a Black Swan). Those entities are still in the distribution, but they may not look like it. So, that's the reason I keep them around. (I have one set of eggs from a clutch of my focal species that built a nest out of completely different materials and laid a strangely colored egg - with video data of the parents!...it's quite evil for graphs and analyses, but I always keep that nest around.)
First of all, what is your specific need to perform a normality test, if I may ask? I would be able to address your query better if I knew that. Thanks, and I look forward to the questioner's response.
J. Patrick, thanks again! I agree with the reluctance to completely remove points that look like outliers. But what about transformations (e.g. log) to reduce their influence?
Thank you, Murali and Shree! For the purpose of publishing results, would it not be useful to perform and report formal normality testing rather than a subjective analysis only (even though it does make sense of course that power of such tests may be low)? I would be grateful for your suggestions.
Congratulations on your publication (nearly ready, I suppose). When you say "purpose of publishing results," what purpose does the normality test serve in the work you are doing, if I may ask? To avoid any IP conflict, you can keep it very general, e.g. "I am conducting a DOE" or "I am assessing a process improvement." By understanding the statistical purpose within your process (rather than the professional purpose of publishing), I would be better able to articulate my thoughts. My apologies for not clarifying this in my first response. Thanks for your patience; I look forward to hearing from you so I can help you move ahead.
Forget about testing normality when you have 10 statistical units, especially because in the great majority of practical cases knowing that a sample is normally distributed is less important than usually thought...
Just look at this (http://stats.stackexchange.com/questions/13983/is-it-meaningful-to-test-for-normality-with-a-very-small-sample-size-e-g-n) if you wish to disregard the purpose of the test (a statistical explanation for small sample sizes). Knowing the purpose of your test will also tell me whether you need some other test or information in the first place, other than the "normality" test. This is why I was insisting on the purpose from the very start.
Rahul, this is strange advice. Usually, the NP "equivalents" of the parametric tests test something different, which may not actually be what is intended. Also, science is more about modelling than about testing. There are often no reasonable models using just rank information.
...adding to Jochen's answer, to which I subscribe: you probably want to have a rough idea of whether 'something is there', and in that case NP is perfect. As for normality, it is important to stress that any significance test has to do with the sample and not the population variance, and when you have n > 30 the sampling distribution can usually be considered approximately normal. NP methods are also OK for reducing the influence of outliers...
If you have "large" samples, some few outliers don't matter much (one still can analyze their influences). If you have a lot of outliers, you should think harder about the process/model used to explain the responses. Thus, few/sporadic outliers[*] are not a problem, and a larger group/frequent outliers are less a problem but rather an important information you should use appropriately instead of getting rid of them.
If you have "small samples", meither the shape of the error distribution nor the presence of outliers can be judged with resonable confidence. Here only theoretical considerations help. If you think that there is a stable common center and a finite variance, and if this is all you know, then the normal error model is the one with the highest entropy, i.e. adding the least specific information to the analysis. There is nothing wrong with this. If a hypothesis test is performed, it can be done within the Newman-Pearson's regime or withing Fisher's regime. Only the former is illegitimate, because you can't claim that any particular long-run error-rate will be held (this frequentistic feature holds only when the error-frequency distribution matches the error probability model). The latter is still ok, because the p-value is just one of many indicators and is not interpreted all alone in outer space, and it is not used to control fixed error-rates. Fisher would call this a potential "type-III error".
[*] Outliers are meant to be values showing "unexpectedly large" residuals, way off from the majority of other residuals. Clearly, any obviously wrong, unreasonable, or impossible values *must* be excluded from the analysis. Here, a measurement error, typo, or some other physical incident may be responsible, so that the value cannot contain any information about the process under study. Outliers not recognized as "measurement failure", "unphysiological", "unreasonable" or even "impossible" values may tell an important story. A good model should appreciate this.
Formal normality testing is never very useful - either it is underpowered and relevant deviations are not recognized, or it shows "significant" deviations, but then it is still not clear whether these deviations are *relevant*. Many common procedures (linear models, for instance) are quite robust against deviations from normality. The biggest problem is usually a loss of power.
Bootstrapping CIs is the only option if the distribution of residuals is very asymmetric or not unimodal (that is: when it is quite strange). Otherwise, bootstrapping does not perform any better than the usual methods - the small sample is just taken to be perfectly representative of the population, not only for the mean and variance but also for all other properties of the distribution; this is surely at least as hard to justify as assuming it for the mean and variance alone (while further assuming there is no severe asymmetry and/or non-unimodal shape).
Thank you very much, Jochen! I thought that bootstrapping on small data sets is not very difficult or time consuming, and has the advantage of not making any assumptions about the data distribution. Does this make sense to you? Or perhaps an alternative would be to generate synthetic data sets assuming say Gaussian (or log-normal) errors?
My question is: why do you need to perform the normality test, other than it being a statistical test? What is the purpose behind this with respect to the content you are trying to publish? To put it another way, how would your readers challenge you if you were to publish the results without sharing the normality test results? I couldn't be more blunt, I guess. I hope you don't misunderstand this probing.
I don't think it is worth doing a normality test at all, not only for small samples but also for big ones. Normality is an assumption that has been used so widely that it will not add more credibility to your research work. Perhaps you should try alternative ways of judging your data. Just out of curiosity: can you provide us with this small data set?
As far as I know, normality tests such as the K-S test only have power when your data number more than about 30; otherwise you can only compare the individual results with each other, without any formal statistical analysis.
I prefer NON-PARAMETRIC TESTS in this kind of situation. :)
Igor, you might want to check out work by Stephan Morgenthaler (http://statwww.epfl.ch/morgenthaler/people/morgi.shtml), recently he has been working on estimators in small to extremely small samples. He doesn't directly deal with tests of normality, but you may be able to extract something useful from his work.
Shree, thanks for your question: my original thought was to test the residuals of a fitted model for normality. In case they were normal, I could use parametric bootstrapping to generate model parameter confidence intervals. If they were not normal, I would use nonparametric bootstrapping. But after reading the comments here I am now leaning towards skipping the normality testing as not useful, and going straight to nonparametric bootstrapping. Does this make sense to you?
Thank you, Demetris, Scott and Valeriy! After reading the comments here I am indeed much less inclined to use normality testing. I have no particular data set of interest yet - my question is generic. I am trying to figure out what would be a good approach for analyzing small data sets in general and reporting the results.
Just for identifying outliers, the best approach is to draw box plots and remove the outliers, but the real issue is how to analyse the data with the outliers. You may not be able to apply any parametric test if the sample size is too small.
@Igor: "But after reading the comments here I am now leaning towards skipping the normality testing as not useful, and going straight to nonparametric bootstrapping. Does this make sense to you?"
If you have a model (which typically means that you use the data to fit model parameters), then the model provides standard errors and/or confidence intervals for the parameter estimates. If the probability distribution used by the model matches the frequency distribution of the residuals, then the confidence intervals will have the expected frequency properties (i.e., in the long run, not more than 5% of such intervals are expected to miss the "true" values). But this is IMHO not essential; the confidence intervals can be interpreted as "highest likelihood regions", without claiming any frequentist properties. This likelihood function tells us how strongly we can modify our expectations (about the parameters) after knowing these data. In this regard, if the residuals do show a frequency distribution systematically deviating from the underlying probability distribution, then it tells us that the misspecification of the model is already visible in the available data (models are ALWAYS wrong! - but some are useful). This in turn can make us think about a better/different model, i.e., we *learn* something more from the data that we previously might not have thought of (missing predictor, interaction, non-linear relationships, ...). But even if we have no idea or no chance to modify the model (probably because the data just do not provide information about other possibly important predictors), then the model at hand is the best guess we can make, being well aware that there is something more hidden in the data that might bias our conclusions.
Bootstrapping makes an assumption: the data (or the residuals) are representative of the population, in all their properties. Instead of assuming that the population is a normal distribution with a mean and variance to be estimated from the data, the entire shape of the population distribution (including mean, variance and, theoretically, infinitely many more parameters) is assumed to be identical to that of the sample.
The uncertainty of some statistics cannot (or only with great difficulty) be determined analytically. Here, bootstrapping is the only option for getting an estimate of this uncertainty.
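For example, a percentile bootstrap CI for a median (a statistic with an awkward analytic standard error) could look like this in R; the data are simulated just for illustration:

```r
library(boot)
set.seed(1)
x   <- rlnorm(10)                         # hypothetical small, skewed sample
med <- function(d, i) median(d[i])        # statistic(data, resampled indices)
boot.ci(boot(x, med, R = 10000), type = "perc")
```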
Thanks a lot, Jochen! Your explanation is very detailed and useful! I am thinking now of using two methods (1. nonparametric bootstrapping and 2. generating synthetic data assuming Gaussian errors), fitting the model with both techniques, and comparing the results (i.e. the parameter confidence intervals). Does this sound useful? Would be grateful for your suggestions and suggestions from any other contributors!
Nonparametric analysis is a good suggestion. In case you still want to test normality, the Shapiro-Wilk test can be done, and you may also obtain a normal probability plot of the data or residuals.
Igor, have you observed that Lorenz curves of samples from normal distributions and of normal models are very close to the diagonal line from point (0,0) to (1,1)? This is due to their small dispersions. When the sample does not follow normality, you may compare them by graphing the point values of
Fi = ln(Li)/ln(Xi), for data ordered from top to low, with L = cumulative fraction of distributed mass and X = cumulative fraction of population. If you compare the (Xi, Fi) values of two samples of any size, you get a graphic picture that shows whether or not they have a similar distributive structure. Pareto distributions show horizontal graphs. Try it and draw your own conclusions. This has important consequences for nonparametric modelling of any kind of small or big sample. Thanks, emilio
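My reading of this construction, as a hedged sketch in R (assuming positive-valued data; the last point, where ln(X) = 0, is dropped to avoid division by zero):

```r
FX_points <- function(v) {
  v <- sort(v, decreasing = TRUE)      # order the data from top to low
  L <- cumsum(v) / sum(v)              # cumulative fraction of distributed mass
  X <- seq_along(v) / length(v)        # cumulative fraction of population
  keep <- X < 1
  data.frame(X = X[keep], Fi = log(L[keep]) / log(X[keep]))
}
plot(FX_points(rlnorm(100)), type = "b",
     xlab = "X (cumulative fraction of population)", ylab = "Fi = ln(Li)/ln(Xi)")
```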
I have performed the Shapiro-Wilk test on my result set. I do not have a statistics background. I have attached an Excel file. I want to know how I should infer from my data which algorithm is better.
I had a look at your data and the analysis you have done. There is clear evidence of non-normality in your data: the variables you have do not follow a normal distribution. Therefore parametric tests will not be valid. Hence, I advise you to analyze and present your results using non-parametric methods. If you are not that confident in statistics and you are conducting a research study requiring statistical analysis, I advise you to get in touch with an experienced statistician to collaborate with you.
Thank you, that is very true: my data do not follow a normal distribution. I have done the Wilcoxon signed-rank test, one-tailed and two-tailed, but I don't know how I should interpret it. I have attached the file; please comment on it.
Your data show values for the groups "MCT" and "MET", each for several numbers of tasks (100, 1000, 5000, and 10000).
The data within a group are unlikely to be sampled from a normally distributed variable. However, it is not unlikely that such data were sampled from a log-normal distribution. If the Shapiro-Wilk test is applied to the log values, no p-value is even close to any conventional level of significance. The smallest p-value, after Holm's correction for multiple testing, is 0.113.
Apart from testing the *data*, I wonder what you want to analyze. The MET and MCT values are highly correlated. Do you want to show that correlation? Or do you want to analyze whether/how much the MET and MCT values are systematically different (MET > MCT), and whether this depends on the number of tasks? Or do you want to show that both values approach some upper limit with increasing number of tasks, probably different limits or at different speeds? Do you have some model of how MCT, MET and the number of tasks should be related?
I attached diagrams showing log(MET) against log(MCT) for the different numbers of tasks and the normal-QQ plots of the residuals, once for all data and once for the data after the three "outliers" have been removed (just for illustration! you should think why these three values might be outliers and if they conceal or rather reveal important insights!).
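For reference, a sketch of this check in R; since the attached file is not reproduced here, 'dat' below is a synthetic stand-in with the same column layout (tasks, MCT, MET):

```r
set.seed(1)
dat <- data.frame(tasks = rep(c(100, 1000, 5000, 10000), each = 10),
                  MCT   = rlnorm(40, meanlog = 5, sdlog = 0.5))
dat$MET <- dat$MCT * rlnorm(40, meanlog = 0.1, sdlog = 0.1)

# Shapiro-Wilk on the log values, per group and variable, with Holm's correction
p <- by(dat, dat$tasks, function(g)
        c(MCT = shapiro.test(log(g$MCT))$p.value,
          MET = shapiro.test(log(g$MET))$p.value))
p.adjust(unlist(p), method = "holm")
```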
Thank you for the quick reply. I do not know much about all this. I have just started reading statistics, because I have submitted a paper to a journal in which I compared my results using the averages of the above data set and showed graphs (file attached). The reviewers have asked me: "In the simulations, the authors average over 10 trials. This can be sufficient or not, depending on the randomness of simulation parameters. Therefore, confidence intervals are of paramount importance to take any conclusion from the results."
I do not know about confidence intervals. By reading here and there, I understand that if the data are not normal, a non-parametric test should be performed, such as the Wilcoxon signed-rank test for paired samples (instead of the paired t-test). I have performed this using the following tool
First of all, an important cue: do not confuse "tests" and "confidence intervals". A confidence interval (CI) is a region (or set) of non-rejectable hypotheses, whereas a test gives you just a p-value indicating the probability of getting a test statistic more extreme than the one observed, given a particular (null) hypothesis. The connection between a test and a CI is that a test at level alpha will lead to rejection of H0 whenever the (1-alpha)-CI does not include H0.

I strongly recommend that you first learn what empirical science is about (building models), what information is contained in data, and how statistics is used to elaborate the information content of data and to tell us what and how much we can learn from a given set of data. You should (at least) know what a likelihood is, and what maximum likelihood estimates and likelihood intervals are. This includes learning the meaning of probability distributions, how they are derived and what they tell us. Then you can go one step further and plan experiments that give you data that can reasonably be interpreted and deliver the most information about the interesting aspects. Then it will also be obvious how to analyze such data after the experiment has been done or the data have been collected. People in academia should be trained more in what science and learning actually are before they start to do experiments or to somehow analyze data (often in a particular way just because others did it this way). But now to your problem:
I will first show you how I would start (actually not knowing much about anything here, just looking at the data), and then I will also present a solution closer to the analysis you have already started (as shown in your file rg3.xlsx).
As you can see from the picture I attached to my last post, there is a clear linear relationship between log(MCT) and log(MET). The residuals of the regression line through these points can be seen as normally distributed (as shown by the normal QQ-plots). The slope of the regression line is greater than 1. For the line log(MET) = B*log(MCT), the slope (B) is 1.017, with a 95% confidence interval from 1.003 to 1.030. After removal of the three "outliers" with the large negative differences, B is estimated as 1.027 (1.019...1.035); the estimate of the slope is steeper now.
This indicates that the log(MET) values are consistently higher than the log(MCT) values, and that this difference is larger for higher log(MCT) values. Going back to the original scale, it says that the ratio MET/MCT is greater than 1, and becomes greater for higher values of MCT. (Note: log(MET/MCT) = log(MET) - log(MCT))
Now you might want to separate this for the different groups (numbers of tasks). You could calculate the mean difference in logs (i.e. the mean log-ratio) per group and state the confidence intervals (CIs) for these estimates. Since there is no reason to believe that the distribution of the residuals is considerably different from a normal distribution, you can use the "standard procedure" to calculate the CIs of the means. Your data are the log-ratios. Any conventional statistics program will calculate these CIs.
Given all the data (including the three "outliers") I get (using the natural logarithm)
tasks | mean | lower | upper
100 | 0.1174 | 0.0049 | 0.2299
500 | -0.1484 | -0.5829 | 0.2860
5000 | 0.2237 | -0.0228 | 0.4702
10000 | 0.5481 | 0.3846 | 0.7116
You can get the values for the ratios simply by anti-logging these values. For instance, the mean ratio for 10000 tasks is exp(0.5481) = 1.73, so the MET values are expected to be 1.73 times as high as (or 73% higher than) the MCT values. The 95%-CI ranges from 1.47 to 2.04 (or from +47% to +104%).
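A sketch of this group-wise calculation in R, using the synthetic stand-in 'dat' from the earlier sketch (columns tasks, MCT, MET): the mean log-ratio per group with a standard t-based CI, back-transformed to the ratio scale.

```r
by(dat, dat$tasks, function(g) {
  lr <- log(g$MET / g$MCT)                    # log-ratios within the group
  ci <- t.test(lr)$conf.int                   # standard CI for the mean log-ratio
  exp(c(mean = mean(lr), lower = ci[1], upper = ci[2]))   # back to the ratio scale
})
```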
These group-wise analyses estimate the variances only from a subset of the entire available data, which is a waste of resources. We saw that the variance of the log-ratios did not depend on the group (otherwise the distribution of the residuals would clearly not have been normal), so it would be advantageous to estimate the variance (and, hence, the CIs) from all the available data together. That means we can calculate the CI for the residuals as a whole. To avoid a bias here, the residuals are calculated from the regression line including an intercept term (A): log(MET) = B*log(MCT) + A. The CI for the residuals in this model is -0.1423 ... +0.1423 (it is symmetric around zero). So better (more robust) estimates of the CIs of the log-ratios can be obtained by adding 0.1423 to, and subtracting it from, the group means, giving
tasks | mean | lower | upper
100 | 0.1174 | -0.0249 | 0.2597
500 | -0.1484 | -0.2907 | -0.0061
5000 | 0.2237 | 0.0814 | 0.3660
10000 | 0.5481 | 0.4058 | 0.6904
Note two things: 1) the three "outliers" are included, leading to a relatively low log-ratio for the group "500". 2) The conclusion obtained from looking at all the data together ("consistently higher MET values than MCT values") cannot be seen if the data are analysed separately by group (the groups "100" and "500" have CIs that include the zero log-ratio, or a ratio of 1, indicating equality or a non-difference of MET and MCT).
You may calculate this after excluding the three "outliers" to see how much the results change. As I said before: there is no statistic that tells you something about the importance of these "outliers". Including them may invalidate your conclusions, or they might be trying to tell you the actually interesting story.
Now, you have already presented the results as differences (not as (log-)ratios). So you might want to get CIs for these differences. Since the differences are clearly not normally distributed, the CIs cannot be calculated using standard techniques. I would recommend using the bootstrap to get the CIs for the differences directly. You will need software that can calculate bootstrap CIs.
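A minimal sketch of such a bootstrap CI in R (not the exact code used for the result below; 'dat' is the synthetic stand-in from the earlier sketch, and only one group is shown):

```r
library(boot)
d10k  <- with(subset(dat, tasks == 10000), MET - MCT)    # pairwise differences in one group
bmean <- function(x, i) mean(x[i])                       # statistic(data, resampled indices)
boot.ci(boot(d10k, bmean, R = 100000), type = "perc")    # percentile bootstrap CI
```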
My result (obtained with R and 100000 bootstrap-samples on the pairwise differences, including the three "outliers") is
Testing normality with fewer than 10 observations seems to me to be almost nonsensical. With such a small amount of data you could only ever detect rather gross deviations from normality. Certainly it does not make much sense to apply an "omnibus" test - a test having some power against any alternative at all - which pays for this by having low power against every alternative.
If in advance of looking at your data you do have some idea of what deviations from normality might be expected in your field - or what deviations might be particularly harmful to the further analyses which you plan - then you should use a test specially designed to have high power against those particular alternatives. As long as the test is "exact" (does not depend on asymptotic theory) then it is reliable.
I have a similar question. Recently, a researcher published a paper in which he compared two populations of lizards by size. One had n = 2, and for the other the number of samples was not indicated. Normality was assessed with the Shapiro-Wilk test and the groups were compared with Student's t-test. I know this is really bad, but even if one group had an adequate number of samples, is it possible to compare it against another group that has only 2?
Glad to see your question here. In order to better assist with sample size determination, it would help to know what your null hypothesis is. What is it that you are trying to prove or disprove?
Yes, more precisely, a hypothesis is either rejected (disproved) or failed to be rejected (yet to be disproved or accepted as a hypothesis) based on the data provided and assumptions of risk levels. The null or the alternate stays statistically significant till proven otherwise with supporting data.
Shree, I would not use words like "proven" and "disproven" in the context of hypothesis tests. The tests neither prove nor disprove anything. Tests only suggest how you should act when adhering to a strategy that will balance the expected losses in a defined way. Nothing else.
The tests show whether or not a factor that is deemed influential is proved or disproved to be statistically significant at a chosen level of statistical confidence (risk). Does that sound better?
No, not at all. There is no such thing as proof here. Statistical significance is only a measure of how likely the data (or "more extreme" data) are, given a particular probability model and a particular hypothesis.
If what you were saying were correct, then I could easily prove psychic powers in people:
Experiment: play lottery
Null hypothesis: the player is only guessing
Observation: Mrs. F. won the lottery (she's a lucky millionaire now)
Significance = p = P(winning|guessing) < 0.0000001
That is statistically significant at any reasonable level.
You say: this disproves the null hypothesis or proves the alternative (= "not guessing"). So people (at least Mrs. F) have psychic powers and can magically foresee the next lotto numbers. Really?
I'd say: no, by no means is this proof of anything. It just says that I would have been quite sure that Mrs. F would not win the lottery. But she won, so I am quite surprised.
----
NB:
Surely, the mean trick here is that Mrs. F was not mentioned before. She could be one of many people who actually played lotto, and she was just selected because she did win. This is the multiple testing problem, and it demonstrates that the interpretation of a p-value requires a context, and that changing the context changes the interpretation/meaning of the p-value. But if the interpretation depends on a context, it cannot be a proof.
However, we do not need to consider a multiple testing scenario. Consider that I have the hypothesis that Mrs. F has psychic powers, so that she should be able to foresee the lotto numbers. Now we specifically ask Mrs. F to play lotto to test this hypothesis (well, actually to test the null hypothesis). Now she really wins. Wow - we are surely surprised by this result. Very, very surprised, if we think that she was just guessing. But even in this case we would not believe that we have now proven her to have psychic powers. Again, the context actually defines what the result tells us, but the result itself is neither a proof nor a disproof of anything.
It is best to refrain from using tests for assumptions at all. It has long been known that this involves flawed logic. One issue is that with small sample sizes, you will almost never get a "significant" rejection of normality, whereas with very large data sets, minor (i.e. negligible) deviations lead to rejection.
For large data sets the best strategy is visual assessment, i.e. residual or Q-Q plots. For very small data sets, the only viable strategy is to think up front about whether the residuals can plausibly be normally distributed at all. Generally, normality is an idealization that strictly never occurs in this universe. Why? Because all measures have one or two boundaries, and distributions with boundaries are inevitably skewed.
Practically, the researcher should first consider the type of measure (discrete/continuous, one boundary, two boundaries), select a fitting response distribution and use it in a Generalized Linear Model, for example (see the sketch after this list):
Waiting time (not response time): Gamma
Counts: Poisson or negative binomial
Successes in a fixed number of trials: binomial (aka logistic regression)
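A sketch of these choices in R; the data frame 'd' and the single predictor x are made up solely to show the family arguments:

```r
set.seed(1)
d <- data.frame(x         = runif(30),
                time      = rgamma(30, shape = 2, rate = 1),    # waiting times
                count     = rpois(30, lambda = 3),              # counts
                successes = rbinom(30, size = 10, prob = 0.4),  # successes out of 10 trials
                trials    = 10)

fit_time  <- glm(time  ~ x, family = Gamma(link = "log"), data = d)
fit_count <- glm(count ~ x, family = poisson,             data = d)
fit_binom <- glm(cbind(successes, trials - successes) ~ x,
                 family = binomial, data = d)
# Overdispersed counts: negative binomial via MASS::glm.nb(count ~ x, data = d)
```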