I wonder about this because for such small data sets it is not always clear whether the data are normally distributed, so perhaps results (p values) from both parametric and nonparametric tests could be more useful than either test alone?
Many things were said above and with most I can't agree. I see two key issues arising:
(1) Why the heck is a formal test of a null hypothesis so important? The data were generated, and surely with some aim in mind. This is what should count. Questions like "I have some data, now what kind of test should I apply?" seem absolutely nonsensical (to me). However, I know well that bosses can push one into such a situation, and that reviewers require nonsensical analyses to get results published. That's one of the dark sides of life in the sciences... but then, any one stupid test of anything is ok. Why bother? If a reviewer wants a p-value, give one. If he/she asks for a particular test, do it. It doesn't really matter if it is "correct" or "wrong". This is not scientific, I know, but neither is the compulsion and/or habit of applying some test just to apply some test. Further, these kinds of experiments are usually explorative, not confirmatory. The question is *what* effect do you see, rather than how likely the observed (or a stronger) effect would be under some (typically irrelevant) null hypothesis. If the second question is answered based on rank distributions and not based on our expectations about the effect, it gets really strange.
(2) Why is it not discussed what the data and the results/effects mean? What is actually determined/measured? What do we know/expect about errors? What should actually be modelled, or what models should be modified (supported or questioned [I try to avoid using the word "tested"])? Would it help to know a long-run frequency distribution of rank statistics?
But to answer the original question:
If you really have to do some test, both are ok, but you should decide on one, ideally *before* you do the experiment, and at least before you actually do the test. Therefore, reporting two p-values from two tests is a no-go.
Which test you use does not matter much. The t-test gives correct results only if the *residuals* are normally distributed (which can't be tested or seen - it can only be assumed! [more or less reasonably]). However, violations of this assumption are not so serious. It is likely that the p-value is higher than the long-run frequency of such t statistics under the true null hypothesis. The U test gives correct results only if the shapes of the distributions in both groups are the same (except for the location), which again is a harsh assumption, typically not true, and also impossible to know or test. Again, not-too-extreme violations won't be harmful. So it practically does not matter.
My personal favorite would be just to present the mean difference (if this is the interesting effect) and to give a rough estimate of its precision. Here I'd prefer a confidence interval, which may be calculated using the t-distribution again, when the assumption of normally distributed residuals is not too unreasonable. Sometimes errors are not additive but rather multiplicative; then the logarithms of the values can be analysed, and instead of analysing the mean difference, the mean (log) fold-changes are analysed. I know that confidence intervals and p-values from t-tests are quite related, but the intervals are less prone to misinterpretation and much easier to understand.
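A minimal sketch of this suggestion in Python (hypothetical values, NumPy/SciPy assumed; a Welch-type t-interval is used here purely as an illustration, not as the poster's own code):

```python
# Sketch: mean difference with a t-based 95% CI, and the log-scale analogue
# for multiplicative errors. Data values are made up for illustration.
import numpy as np
from scipy import stats

control = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 4.4])   # hypothetical group 1
treated = np.array([5.9, 6.4, 5.1, 7.0, 6.2, 5.8])   # hypothetical group 2

def mean_diff_ci(x, y, conf=0.95):
    """Welch-type CI for the difference of means using the t-distribution."""
    d = y.mean() - x.mean()
    se = np.sqrt(x.var(ddof=1)/len(x) + y.var(ddof=1)/len(y))
    # Welch-Satterthwaite degrees of freedom
    df = se**4 / ((x.var(ddof=1)/len(x))**2/(len(x)-1) +
                  (y.var(ddof=1)/len(y))**2/(len(y)-1))
    t_crit = stats.t.ppf(0.5 + conf/2, df)
    return d, (d - t_crit*se, d + t_crit*se)

print(mean_diff_ci(control, treated))           # additive scale: mean difference
# Multiplicative errors: analyse the logs; the back-transformed CI is a fold-change.
d_log, (lo, hi) = mean_diff_ci(np.log(control), np.log(treated))
print(np.exp(d_log), (np.exp(lo), np.exp(hi)))  # geometric-mean fold-change and CI
```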
Apply the concentration index (based on U; you can find it in the attached file). The point in any case is not normality but the disproportionate effect that a single outlier can exert in small data sets. The CI, explained in the attached paper (see the methods at the end), solves exactly this problem and gives you both a probabilistic statement (based on U) and a descriptive index of the distance between the two (small) groups.
Dear Igor. It doesn't make any sense to use parametric tests with such a small sample size. You should use the Mann-Whitney test to compare distributions or medians. If you are using SPSS, go to the independent samples option instead of the legacy dialogs, in the nonparametric menu. I attached a nice document with some examples which you can easily follow. Hope it helps. Cheers
Alessandro, thank you very much! The confrontation index you describe indeed sounds very useful! If it is used to sequentially compare multiple "treated" groups to a single "control" group, what should be used to solve the problem of multiple comparisons (e.g. some form of Bonferroni correction)?
I would focus on estimating the effect size and reporting its 95% CI using a bootstrap approach. A permutation test is also fine if a simple yes/no conclusion is sufficient.
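A rough sketch of both ideas (a percentile bootstrap CI for the mean difference and a label-shuffling permutation test), with hypothetical values and NumPy assumed:

```python
# Sketch: percentile bootstrap CI for the mean difference and a simple
# permutation test. Data values and names are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
a = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 4.4])
b = np.array([5.9, 6.4, 5.1, 7.0, 6.2, 5.8])
obs = b.mean() - a.mean()

# Bootstrap: resample each group with replacement, refit the summary each time.
boot = [rng.choice(b, b.size, replace=True).mean() -
        rng.choice(a, a.size, replace=True).mean() for _ in range(10000)]
ci = np.percentile(boot, [2.5, 97.5])

# Permutation test: reshuffle the group labels and recompute the difference.
pooled = np.concatenate([a, b])
perm = []
for _ in range(10000):
    rng.shuffle(pooled)
    perm.append(pooled[:a.size].mean() - pooled[a.size:].mean())
p = np.mean(np.abs(perm) >= abs(obs))  # two-sided Monte-Carlo p-value
print(obs, ci, p)
```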
Dear Igor, if you have the intention of presenting a measure of effect size, r can be calculated by dividing Z by the square root of N (r = Z / √N). Check out this document. Cheers
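A small sketch of this, assuming the Z value is taken from the normal approximation of the Mann-Whitney U (as statistics packages such as SPSS report it); the data are hypothetical and Python/SciPy is used only for illustration:

```python
# Sketch: Mann-Whitney U, a Z approximation, and the effect size r = Z / sqrt(N).
# Hypothetical data; no tie correction is applied in this sketch.
import numpy as np
from scipy import stats

a = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 4.4])
b = np.array([5.9, 6.4, 5.1, 7.0, 6.2, 5.8])

u, p = stats.mannwhitneyu(a, b, alternative='two-sided')
n1, n2 = len(a), len(b)
mu_u = n1 * n2 / 2                                 # mean of U under H0
sigma_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # SD of U under H0 (no ties)
z = (u - mu_u) / sigma_u
r = abs(z) / np.sqrt(n1 + n2)                      # effect size r = Z / sqrt(N)
print(u, p, z, r)
```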
I do agree with some answers; however, as long as the sample size is small, less than 8-10, it goes without question that a non-parametric test such as the Mann-Whitney U test should be used. Of course this test is strong and is used under these conditions.
Instead of asking how to use a tool that is otherwise not normally applicable, it would greatly help if you could describe the specific situation in which you are restricted to only small sets of data.
Would you please elaborate on what situation requires the application of inferential statistics in which you would only have access to small sets of data?
Thanks and I am looking forward to hearing from you on this.
Shree, no problem! Consider the following example: there is a pilot study of a certain treatment on animals (e.g. mice), with small numbers of animals in each group (e.g. 6): control (untreated) and a few treatment intensities and/or durations. Suppose the analysis should compare certain parameters of the animals (e.g. body weight, concentrations of certain metabolites in urine, etc) between treated and control groups and between treated groups with different intensities.
Would be grateful for your suggestions!
Thanks Igor, and this is common in the drug industry, as testing in animals is usually the first stage. So the reason for testing small numbers is primarily cost and/or time. Let us disregard time, as we are focused on the efficacy of the drug, so to speak, even at the expense of time. Now we are left with the number of animals tested. We know that there is no scarcity of animals to be tested (no offense to animals) here, as I am also sensitive to their sacrifice for the benefit of human beings.
While we look for the minimum sample size we are facing the following parameters (say for the 2-sample t-test in your scenario):
A - Alpha - Type I error (risk to producer), B - Beta - Type II error (risk to customer), C - Difference to detect (the difference between the signals you are looking for), and Std Dev (the variation in your process). All of these indicate a sample size "n" that would be the minimum to demonstrate the drug efficacy, I suppose.
I would also appreciate knowing the rate at which data become available and the calendar time in which you wish to execute the project or be able to get measures of the results you are looking for. If the deadline for the project is much shorter than the time it takes to measure the relevant sample size, we need to do a few other things, which we can discuss if necessary.
Am I right in this process? Kindly correct me before I go to the next step. Thanks for being open on this and look forward to your response.
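As a rough sketch of how the parameters listed above translate into a minimum n (a normal-approximation formula for a two-sided two-sample comparison; the numbers plugged in are purely illustrative assumptions, not values from Igor's study):

```python
# Sketch: minimum n per group for a two-sample comparison, from alpha,
# power = 1 - beta, the difference to detect, and the SD.
import math
from scipy import stats

def n_per_group(diff, sd, alpha=0.05, power=0.8):
    """Normal-approximation sample size for a two-sided two-sample test."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return math.ceil(2 * ((z_a + z_b) * sd / diff) ** 2)

print(n_per_group(diff=1.0, sd=1.0))              # about 16 per group
print(n_per_group(diff=1.0, sd=1.0, power=0.9))   # larger n for higher power
```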
Shree, thanks for your interest! For simplicity, assume that the study was already done by someone with a certain small number of animals per group, and the data have been collected. The question is what analysis approach could give the most useful preliminary description of differences between groups?
Two main considerations for all data -- be the sample sizes small or large:
1. Always look at the distribution of the parameter -- graphically -- as this will give you an idea of whether it follows a roughly normal distribution, which can happen often even with small sample sizes. If it looks roughly normal and your hypothesis is that it is/should be normally distributed (even if it looks somewhat skewed) -- use parametric tests, as this will give you more power! You can even transform the data to make it more normally distributed in order to use parametric tests. However, if the data are bimodal or extremely skewed, use non-parametric tests, as these will give you the most power for finding a difference.
2. Good theoretical practice: which tests to use is based on whether you think/hypothesize that the parameter you are interested in (the one you are looking for a difference in) is distributed normally (use a parametric test) or non-normally (use a non-parametric test).
So for any data, my approach -- especially because I work with small samples -- is:
1. Clean up data
2. Graph data and see if parametric or non-parametric.
3. Do descriptive statistics (parametric and non-parametric) -- yes, I do explore the data nowadays, as I have become more Bayesian in my thought process of analyzing data. (A minimal sketch of steps 2 and 3 is given below.)
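A minimal sketch of steps 2 and 3 with a made-up sample (Python with NumPy/Matplotlib assumed, purely for illustration):

```python
# Sketch: a quick plot plus parametric (mean, SD) and non-parametric
# (median, IQR) descriptives for one hypothetical small sample.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 12.9])   # made-up small sample

print("mean, SD   :", x.mean(), x.std(ddof=1))
print("median, IQR:", np.median(x), np.percentile(x, 75) - np.percentile(x, 25))

plt.hist(x, bins='auto')    # with n = 6 a dot plot or strip plot is often clearer
plt.show()
```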
Hope that helps. Any joy with your data?
If we know a priori that the measurand is a normally distributed quantity (such as in the case of repeated measurements), it is possible to use a parametric test. If the distribution of the measurand is unknown but it is believed that the data sets belong to the same population, then a parametric test is also possible (with some caution). Otherwise the Mann-Whitney test is preferable. So, everything depends on the task under study.
My suggestion in this case is to use another approach where the goal is to measure productivity, such as DEA.
Igor, I had left a note to ask some questions of you; however, your "Thank You" reply tells me you would rather close this thread. Please let me know either way. Hope to connect and speak with you. How do we do that? Have a pleasant week now.
Please visit this site http://www.uq.edu.au/economics/odonnell-chris and check it for my colleague Professor Chris O'Donnell.
Why would you want to test the normality of a small sample? Why not get a larger sample? I am yet to receive a response from Igor on this, based on my first response to his feedback that they are testing animals for a specific drug's efficacy.
Thank you again, everyone! Based on your responses I am inclined to go with nonparametric tests (e.g. U test) for small samples, and calculate the confrontation index suggested by Alessandro as an additional measure of the extent of data overlap. Does this make sense?
Shree: thanks for your interest, and of course you are right that designing an experiment in an optimal way is important, but my question was about a specific situation: how to handle small samples.
Dear Igor,
Yes, your approach is ok. You may not even need to use the confrontation index, just the Mann-Whitney U test. The truth is that you always need a p-value, since this is just a way to show how likely it is that your results are due to chance or are real. And, for sure, it does not make sense to use both parametric and non-parametric tests. You just have to choose the most correct one and use it. In this case, with six samples, it is crystal clear that you have to use non-parametric tests.
Best wishes
Miguel,
I am sorry to differ, but the case for non-parametric testing is not as clear to me as you present it. You just threw the entire concept of minimum sample size out of the window by justifying the use of non-parametric testing with 6 samples. Igor is working with animals on some kind of drug effectiveness or efficacy study where the risks could be high if they plan to convert this into a pharmaceutical product for use in human beings. Any kind of prototype testing at this stage is going to be susceptible to sample size, as higher amounts of money, effort, and time are needed when they take it to the field. At that point, if they just used a sample size of 6, I am sure someone's head is going to roll (not literally) or someone is going to be shamed for this decision.
It is Igor's call anyway on how he would like to conclude this.
Dear Igor.
The report depends on your result from the normality test, i.e. on whether your data set has a normal distribution (bell shape) or not. As a statistical method, the researcher should use a parametric test if the data set is normally distributed and a non-parametric test if the data set is skewed.
But for established research published in local and international journals with a high index, authors always refer to previously published research papers to determine whether to use a parametric or non-parametric test.
The number of research samples is very important for showing the accuracy of the statistical analysis. More variation within the same group of the data set will lower the accuracy of the statistical analysis. For a small number of samples (i.e. < 5 samples), most researchers generally use simply presented data like percentages, chi-square, etc.
This is my opinion based on my little experience in statistical analysis. Please consult a statistician for more suggestions and information.
Dear Igor,
with small samples, or fewer than 10 subjects in each group, you must use non-parametric tests. The reason for using a nonparametric analysis is to avoid a so-called "false-positive result". As for which non-parametric test you should use, this will depend on the variables of your sample (your research), i.e. whether these variables are categorical or not.
Best wishes, Livia.
Dear Shree,
I am sorry but I don't catch your point. Of course it would be great to have a bigger sample size, but Igor's question was about which test would be more appropriate with that sample size.
Best wishes
What a strange thread... Wrong things don't become right just because they are repeated over and over again. Sample size does NOT determine the kind of test. This is a stupid idea.
Thank you, Miguel, Ahmad, Livia, and Jochen! Generally the suggestions to use nonparametric tests with small samples make sense to me. The confrontation index also seems to be a good idea for showing the degree of separation between small data sets.
Igor, the main argument given for non-parametric tests was that you can't confidently judge whether the (frequency!) distribution of errors is normal. What they do not say is that these tests only test a location shift if - and only if - all other features of the distributions are similar. But this cannot be checked in small samples either. So these non-parametric tests test the equality of distributions (not only location). If this is intended, ok. But usually this is not intended.
Thanks again, Jochen! I appreciate your input. If I understand correctly then, with small samples the problem with parametric tests is that it is hard to tell (impossible to reliably test) whether or not the error distribution is normal. The problem with nonparametric tests is that it is hard to tell whether or not the data distributions are the same. I was not sure which is "worse", which is why I asked whether it makes sense to use both tests. But it seems to me now that with small samples showing a "location shift" may be more intuitively useful than comparing the means by a parametric test. Does this make sense to you?
For me, showing a "location shift" is not very attractive, because it lacks a proper model behind it. Science is modeling. There should be a proper measure for the effect. How *big* is this shift? How precisely can you estimate this shift based on the data? How would you measure it? How do you translate the information in the data into a "location estimate"? These are all far more interesting questions than "which test will I perform?". If you have a model (at least in your mind), such a model will explain what we can expect, and why we can expect this. A neat statistical handling of the data should use all the information from the data to answer these questions and leave over as residuals/errors what can't be explained. If we talk about "location", it turns out that we only need two assumptions to fully explain the data: the errors must be independent (knowing one will not tell us anything about any other) and positive and negative errors must be equally likely (this symmetry is natural for unbounded values). Based on ONLY these two assumptions it turns out that the normal probability distribution describes our expectations, and, further, that the arithmetic mean is an adequate location parameter.
It may be that the values are bounded; then we get a gamma probability distribution. Or the errors can be modeled as products of stochastic processes; then a log-transformation would solve the issue. Other ancillary conditions may change our probability model. But as long as we don't have any particular knowledge about such conditions, the normal distribution model is the least specific and most appropriate.
Igor, in addition to what Jochen says, you can get samples this small in certain types of experiments, and therefore it is necessary to observe certain assumptions (normality, independence). It also depends on the type of variable.
As a recommendation, also calculate the power of the test for that sample size; that helps clarify the impact of the results.
Hi Igor,
if the samples are independent, I'd suggest the nonparametric solution, mainly because the samples are too small. Namely, as mentioned above, the Wilcoxon-Mann-Whitney test to compare central tendencies or sample distributions.
But usually, e.g. in economics experiments, we deal with about 12 to 25 subjects per sample, so I also maintain a small doubt about the statistical power.
K.
I also want to stress that sample size does not a priori determine the test you have to use! Non-parametric tests also have assumptions, and I remember a case where results were strongly biased by differences in the distributions between groups (e.g. skewness or inhomogeneous variances). Also, using non-parametric tests does not save you from a bias that could have been introduced by a small sample size. In general you should have determined the test to use a priori, depending on your experimental design and the hypothesis you test - though I know this is often not the case.
Ok, so regarding your question, consider first what you are interested in... In most cases this is whether you have an "effect". Using statistics, people mostly assume that there is an "effect" if the test becomes "significant". What you have to keep in mind here is that all frequentist statistical theory depends on stochastics and sampling theory. So significance does not necessarily indicate a meaningful effect!
Having "only" six samples, you risk that just by chance you picked different samples for the two groups. What to do about that? Well, first recognize it... then look at how big the difference is between the groups - i.e. look at the effect size - and how the data points spread - i.e. the variance. A huge difference and small variances within both groups could make you more confident there is a difference. If you have a huge difference in the means but also huge variances... you are either unlucky, because with your sample size you are not able to "confidently" say there is a difference, OR there is in fact no difference (bad luck, no conclusion possible). If there is only a very small difference, well... does that really make a difference in your context?
Calculating a mean and confidence interval (with "standard" formulas you can only do this if you can reasonably assume that the parameter under investigation is normally distributed) should give you a good feel for whether there is a difference. In the end you are interested in the question of whether the effect you see has a meaning in your "world"! Statistics are just the tool to check whether you can reasonably assume you are not seeing a random difference... and this depends strongly on the data structure. So...
Finally, give it a good thought whether your data points are really independent and randomly sampled! If that's not the case, no standard statistical test - either parametric or non-parametric - will rescue you!
Go for meaning not for ** behind your estimates! Good luck!
This question really supports the use of the book "Common errors in statistics" (I leave the search to the interested reader). Actually, when the sample size is less than 6-7, then only the parametric test is reasonable, because permutation tests have very low power. Use of the Mann-Whitney test should also come with the sign "proceed with caution". The reason is that the exact WMW test has low power and the asymptotic one does not hold (asymptotics for n=6 is nonsense). So, if normality does hold for the small sample, then the parametric test is very appropriate. If not, use both as you already do. Conflicting results need some elaboration (check better, understand your data). To conclude, many people in this thread supported (without any doubt!) the use of non-parametric tests. Although I am a huge proponent of non-parametric tests, very small samples (n
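To illustrate the low-power point with a quick calculation (a sketch, not taken from the book): the smallest two-sided p-value an exact WMW test can ever produce with n observations per group is 2 / C(2n, n), so with 3 per group you can never reach 0.05 no matter how extreme the data are:

```python
# Sketch: minimum attainable two-sided p-value of the exact WMW test
# for equal group sizes n, i.e. 2 divided by the number of rank assignments.
from math import comb

for n in (3, 4, 5, 6):
    print(n, 2 / comb(2 * n, n))
# n=3 -> 0.10, n=4 -> ~0.029, n=5 -> ~0.0079, n=6 -> ~0.0022
```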
Thank you, Jochen, Carlos, Konstantinos, Christian, and Christos, for your stimulating comments! This is an interesting discussion! It looks like there are differences of opinion: for small samples some favor nonparametric tests (due to problems with determining normality on small samples), others favor parametric tests. Perhaps using both is not unreasonable after all? Or just using a subjective normality check (e.g. QQ plot) at the beginning to decide which way to go? Would be grateful for your input!
It is in fact common practice to check a QQ plot to see whether the residuals seriously contradict the expectations about them. Actually, this is one of several ways to find out whether something more can be learned from the data beyond the model that was used and that gave these residuals. But again, it is not a good idea to simply select the analysis based on this (same) data. You can check the QQ plot to decide which way to go, but then go this way using DIFFERENT data. Otherwise you get a serious selection bias.
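A minimal sketch of such a QQ-plot check (hypothetical values; the residuals are taken simply as group-mean-centred data; Python with SciPy/Matplotlib assumed):

```python
# Sketch: normal QQ plot of residuals, used as an informal check rather
# than a formal normality test. Data values are made up for illustration.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

a = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 4.4])
b = np.array([5.9, 6.4, 5.1, 7.0, 6.2, 5.8])
residuals = np.concatenate([a - a.mean(), b - b.mean()])

stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```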
Thank you, Jochen! I think I get your point about selection bias. But what do you mean about using different data? Do you suggest using a QQ plot on just a subset of data first, making a decision about normality, and then applying this decision to the rest of the data? Also, I wonder what would be a reasonable strategy if in the small sample there are a couple of outliers (e.g. subjectively detected). Would this justify a nonparametric testing approach or, alternatively, transforming the data (e.g. log) and trying a parametric one?
It is next to impossible to judge whether a value is an outlier in small samples. If the outlier is suspicious for other reasons (failed experiment, sick animal, something like this) or if it is an impossible, unphysiological value, then this value should (must) be removed. If the (possible, physiological) value itself is the only information that gives rise to calling it an "outlier", then - in a small sample! - it is unlikely to have one or more (per definitionem!) rare outliers. If you do have them, then why? Have you been really unlucky? Or aren't such values as rare as you thought (so actually NOT outliers)? If the former is the case... well, this is one of these rare unlucky incidents. Shit happens. The precision of your estimates will be lower than it could have been. An analysis with and without these outliers could be performed, compared, and carefully discussed. In the latter case... *why?* is the question. This requires more research, more brains, and developing a better model that better fits the data. A log-transformation changes an additive model into a multiplicative one. Instead of analyzing differences, it analyzes fold-changes.
Again: a non-parametric test is justified by the wish to compare distribution functions. For the (very special!) case that one assumes that all properties of the distributions are the same except one parameter (haha, a parameter), then such a test would specifically tell us something about this parameter. A parametric test is justified by a (reasonable) model, since it tests such model parameters. Actually, there is nothing in the data itself justifying the test.
Thanks again, Jochen, what you are saying is a useful comparison (especially the last paragraph).
Dear Igor,
Methods for checking normality are regrettably of very little use with just 6 samples. As Jochen pointed out, it would be impossible to determine what is an outlier or whether the sample follows a normal distribution or not.
Thank you again, everyone! So as I take it then, there is a lot of subjectivity in choosing parametric or nonparametric tests for small samples, and a lot depends on the type of data and the researcher's goals.
I agree with Miguel; I would add that if you want to compare, it is necessary to organize the data as panel data.
A very interesting discussion! First of all, in these situations, I call on the clinicians (and the bosses) to give me information about the measurements that have been taken, the biological meaning, what the aim was, etc. I show them a graphical description of the data and I explain to them what a sample is, what uncertainty is, what clinical significance as opposed to statistical significance and the p-value are, finally explaining what a type I error and a type II error are; at this point I call for a larger sample size, which is the most appropriate answer to the problem. If this is not possible, I give measures of central tendency and of dispersion together with the difference between means, and I assume that the data can be handled parametrically and go on under this explicit assumption.
I think that the double calculation of the p-value, by parametric and non-parametric tests, is not a solution and adds some confusion, but I confess that I cannot resist testing the normal distribution hypothesis (with a Kolmogorov-Smirnov test, recognizing that its result depends on sample size!).
When describing the data a log-transformation can help; lastly, I cannot understand why Chebyshev's inequality (quoted in the textbook of Pagano) is not used and is not calculated by statistical software.
I point out to my colleagues (I have worked as an internist) that studies with few cases overlap with another type of scientific publication, namely case reports, which carry their own weight of evidence; so it is easier to think that evidence is on a continuum and that few cases carry limited evidence, whatever statistics, especially inference, can do!
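For reference, the Chebyshev bound mentioned above is P(|X − μ| ≥ kσ) ≤ 1/k², valid for any distribution; a tiny sketch of what it guarantees (illustration only):

```python
# Sketch: Chebyshev's inequality gives distribution-free (but conservative)
# coverage bounds, without any normality assumption.
for k in (2, 3, 4):
    print(f"at least {100 * (1 - 1 / k**2):.1f}% of values lie within {k} SD of the mean")
# 75% within 2 SD, ~88.9% within 3 SD, 93.75% within 4 SD
```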
Well said. So one of the most interesting questions for an experiment in a clinical setting is: is the precision of the estimated effect high enough to exclude a medically/biologically irrelevant/insignificant effect with high enough confidence?
This question implies: (i) a formal null hypothesis test is not appropriate/useful anyway and (ii) a medically/biologically meaningful effect has to be inferred anyway.
Two remarks: (1) The calculation of estimated precisions includes very similar or the same calculations as used in (parametric) null hypothesis tests. However, the interpretation is considerably different. (2) Type I & II errors are not errors known or expected to be committed in any particular test. The tests only control the *rates* of such errors in the long run, given constant conditions. Rejections based on controlling these rates do not help to make inference about a particular case/result. If H0 is rejected, HA is taken to be the truth, and there is no (known) uncertainty associated with it.
In this estimation it may be that tests only control the *rates*, but I don't know whether it is possible to do this for errors in the long run, given constant conditions.
I agree with your H0 and alternative H. For the precision of the estimated effect it is necessary to observe more data.
Thanks a lot, Carlos, Vincenzo, and Jochen, for your input!
Vincenzo: am I right to understand that you are in favor of normality testing even with small sample sizes? What would be your plan then if the test suggest a non-normal distribution? Also, why do you suggest the K-S test specifically (e.g. why not Shapiro-Wilk)?
[EDIT] I just saw that you specifically asked Vincenzo. I should delete my post, but I will not. It might be interesting anyway, but please don't take anything personally. My bad. Sorry. [/EDIT]
Igor, I am *not* in favor of normality testing, if you mean any formal test. In large samples you might have a chance to see patterns in the distribution of the residuals (as deviations from the normal distribution), which would tell you that your present model misses some of the information provided by the data. Even then one has to think about whether or not this unused information would touch important aspects of the model. Small data sets just can't contain that amount of information, so there is typically no chance to recognize any patterns in the few residuals one has.
Performing formal tests of normality to justify the further analysis is an intellectual hurdle [but not necessarily the inspection of possible patterns, as in normal QQ plots, for instance]. The normal distribution of the residuals is never a fact anyway; it is always an ASSUMPTION, logically following from the 3 (actually 4) basic assumptions:
1) the model used IS correct
2) the errors ARE independent
3) positive and negative errors of the same size ARE equally likely
4) errors can be of any size
Taken to the extreme, 3) and 4) are almost always violated. To what extent can we tolerate these violations? Nobody knows. 2) should not be violated, and often the experimental design can assure this quite well. I don't consider this a serious problem. Then 1) is left, where we - by design! - must admit that this is also generally violated ("all models are wrong - but some are useful"). Again, to what extent can we tolerate this violation? We can look at the performance of the model and see if it is "useful enough". That's it. Having a small dataset may not allow us to check the usability; it might just be sufficient to provide a model that is reasonably consistent with the known data. If it happens that we get more data we may confirm or refine the old model. This is called "learning"...
So you have 2 options: either go for a quantitative but necessarily wrong description of the world (i.e. modelling) or leave it (i.e. paint black & white pictures advising the decisions/actions you should take, for which you can, with a little luck, at least control the long-run error rates of false decisions). Option 2 seems not very scientific to me (personal opinion! I might be convinced to change my mind). Going for option 1, one has to acknowledge that limited data only give limited information that can be used to learn something, usually giving only a relatively blurred picture anyway. The great value of statistics is to quantify what and how much we can learn from a given set of data.
Once again: I DISFAVOR formal testing of normality. The only scenario where I find it appropriate is to filter data sets (automatically, in a screen of many sets) that are *not* normally distributed, but even there I doubt the usability (and I don't know any practical reason for doing it), because the kind of deviation ("pattern") is much more interesting than the fact that a distribution is unlikely to have been sampled from a normal population...
First I try to answer Jochen; as you raise the issue of the precision of the effect, my answer is that I cannot exclude an effect with confidence, but the pendulum of evidence begins to prefer a "no", with a number of limitations and doubts. I can try to see, for example, how many patients I would have to enroll to test the opposite hypothesis (some effect exists): if I need thousands of observations to assess the point - say with a power of 0.8 - why bother? (I recognize now that this can be misleading.) This is somewhat different from point (i) of the assumptions (your second paragraph), because a formal null hypothesis is anyway "just around the corner", and in the first of your remarks you rightly recognized that the calculations follow the same logic while the interpretations are different (I say they are the same, but from different points of view). The main assumption is that, after the first 6 observations of Igor, the next 30 clarify that Igor is working with a normal distribution; the biologist should be aware of this assumption and has to add observations to proceed in the direction of better-evidenced conclusions.
Second question: probably my answer is yes (remember that I was a clinician). The biological/clinical scientist is questioning me, like a patient or a colleague questions me about a health problem of a single patient, and I have to give my answer, after explaining its limits. I can call for further investigations, I can "wait and see", but I am beginning my process of "learning", as you rightly say, also in the sense that I learn something that I will apply to the next patient, not to this one.
Last point (second remark): "If H0 is rejected, HA is taken to be truth, and there is no (known) uncertainty associated with it": what about the cut-off level that splits the world into only two categories, reject/do not reject?
You have a drug that did not prevent death from digestive hemorrhage in a well-done clinical trial, the p-value was 0.05 with a non-significant advantage in the treated patients (intention-to-treat analysis, power 0.9) versus not treated patients (significance with p
@Vincenzo:
>>
You have a drug that did not prevent death from digestive hemorrhage in a well-done clinical trial, the p-value was 0.05 with a non-significant advantage in the treated patients (intention-to-treat analysis, power 0.9) versus not treated patients (significance with p
Thank you Jochen, I appreciate your framework of significance levels, which I have never approached so deeply: I now have a lot to think about; I hope that Igor now has a clearer perspective on how to go on with his/our problem.
Just a little story: a clinician tells an epidemiologist about a surprising finding and wants some mathematical confirmation; the epidemiologist answers: interesting, return when you have 99 more cases.
The clinician knows a lot about the disease, the biology, the circumstances, and why and how this finding is surprising. The epidemiologist (let it be a mathematician/statistician) does not have this background. The clinician, as an expert in this field, can well judge the "significance" of this finding. Actually, there is no data required at all. Having made this observation might just be a primer to think in such a direction. The epidemiologist, on the other hand, cannot make a judgement based on a single observation that - for him/her - is just floating in a vacuum. Only numerous replicate observations can enable the epidemiologist to calculate the likelihood of these observations under some given hypothesis. Still he is left with no clue whether the hypothesis is meaningful in some way, but this is the only thing he can do.
The clinician can argue based on knowledge and logical arguments, and maybe has a good idea of what might be responsible for this particular observation, and he might learn or become aware of something new (being helpful in some way for some purpose). However, there is a (more or less high) risk that the clinician *wants* to see relations that would nicely fit his concepts but are just not "true" (the conclusions will then turn out to be not very helpful). Here the epidemiologist could help, given he had data from more observations, by estimating the chance of observing such surprising things under the hypothesis that the concept of the clinician was wrong. This is an insurance for the clinician against being too over-optimistic with his conclusions. However, insurance costs something, and here it costs additional observations.
Thank you again, Jochen and Vincenzo! This is a very stimulating discussion. As a very simplistic "message", what I take away from this is the following: In situations where sample sizes are limited, one can look at the data say by QQ plot to assess normality subjectively. Unless this shows "obvious" inconsistency with the normal distribution, proceed with parametric testing. Does this make sense?
>>
Unless this shows "obvious" inconsistency with the normal distribution, proceed with parametric testing.
Thanks, Jochen! By saying to "re-think the model" do you mean trying known non-normal distributions (e.g. exponential) which would better represent the data?
This discussion confirms that it is necessary to include a greater quantity of data in our models and to organize it as panel data to obtain a significant validation. For me it is convenient to use the DEA approach.
Igor: yes. It can also mean looking for missing/unconsidered predictors, interactions, non-linear relationships...
Jochen, thanks for the clarification! So if the data (e.g. seen on QQ plot) or model fit are inconsistent with normality (e.g. residuals are non-normal), you recommend altering the model function and/or assumed data distribution, before going to nonparametric testing?
Yes. But "going to nonparametric testing" would be neither the emergency solution nor a stopgap. I would just stop with the best model I can think of and state that there are still some inconsistencies in the residuals, indicating that there might be a still better model - which is impossible to derive at the present state of knowledge.
I see Jochen, thanks! Actually when I am saying "nonparametric testing" I mean also methods like developing a model and fitting it to the data, but assessing sensitivity to parameter values by generating multiple synthetic datasets by nonparametric bootstrapping and fitting the model to each of them. I guess the parametric alternative to the latter would be to generate synthetic data sets using measured means and standard deviations, and assuming the normal distribution. Does this sound reasonable to you?
Yes, this sounds reasonable. I am not aware of the differences in the drawbacks/shortcomings of the two methods. Surely, "parametric methods" rely on assumptions that might not be perfectly correct, but the generation of synthetic datasets also relies on assumptions - which might not be perfectly true either. I don't know where the advantage of one method over the other lies, but generating synthetic datasets is computationally more intensive.
Thanks, Jochen! To me these types of methods (bootstrapping and synthetic data sets) are attractive because they are in some sense conceptually simple, and I implement them writing my own Fortran code typically. In terms of computational intensiveness - you are right of course, but for small data sets doing say 10000 iterations is no problem at all, it is quick.
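A sketch of the two schemes in Python (Igor uses Fortran; the data, the trivial "model" being refitted, and all names here are made up purely for illustration):

```python
# Sketch: refit a simple "model" (here just the group-mean difference) to many
# synthetic data sets - nonparametric bootstrap vs. normal-theory parametric.
import numpy as np

rng = np.random.default_rng(0)
a = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 4.4])   # hypothetical control
b = np.array([5.9, 6.4, 5.1, 7.0, 6.2, 5.8])   # hypothetical treated

def fit(x, y):
    return y.mean() - x.mean()                  # placeholder for the real model fit

nonpar, par = [], []
for _ in range(10000):
    # nonparametric: resample the observed values with replacement
    nonpar.append(fit(rng.choice(a, a.size, replace=True),
                      rng.choice(b, b.size, replace=True)))
    # parametric: simulate from normal distributions with the observed mean/SD
    par.append(fit(rng.normal(a.mean(), a.std(ddof=1), a.size),
                   rng.normal(b.mean(), b.std(ddof=1), b.size)))

print("nonparametric 95% range:", np.percentile(nonpar, [2.5, 97.5]))
print("parametric    95% range:", np.percentile(par, [2.5, 97.5]))
```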
Hi guys:
I just registered in Research Gate and found this interesting discussion.
I work in a different field than you: marine ecology and aquaculture where experiments usually involve less than 5 replicates (sometimes 2 or 3). Two comments:
1. On t or U tests with small sample sizes. As some of you stated, both tests imply assumptions about the distributions, which cannot be adequately tested with small sample sizes, because of power and other issues. So, if a formal H0 test is required, either would be fine (or not). Anyway, significance tests attempt to make inferences about population parameters or differences based on samples or experiments, but valid inferences of this type usually can't rely on a single trial (with a small or large sample size or p-value), but rather on the repeatability of results in different conditions, labs, times, etc. For example, if a clinical trial with few patients suggests an important advantage of a new drug over the conventional one ("significant" or not), yes, repeat the trial with more replicates, but most importantly, try to repeat it with other kinds of patients (e.g. age groups, etc.) and convince other colleagues to do it in other labs and parts of the world, to try to expand the scope of your findings and the potential benefits of this drug (certainly the publication of your results in a well-respected journal would help here). This would be more helpful in reaching a general conclusion about the potential benefits of the new drug than a single "perfect" experiment, with lots of replicates and the best "state of the art" statistical tests.
2. On the "significance levels" issue. This is the part of the discussion I enjoyed more, and I am glad to see this argument is spreading. Yes, I agree with you that the "conventional" alpha value of 0.05 has been and still is damaging science "significantly". We don't live in a black and white world. For a more thorough discussion of this issue please read the attached paper written by one of my Statistics teachers (certainly the most influential for me). I hope it goes through.
I would like to add to the Cornell question that the significance for that sample is not clear. Comparing small samples raises the problems of heteroscedasticity and autoregression. In fact, this is the problem most often found in some works.
Thank you, Ricardo, Patrice, and Carlos!
Patrice: most of my questions here are general in nature, mainly how to handle data which has not yet been produced. However, since you are interested, I attach a sample data set. Here there were 2 cell types treated with 3 different intensities, with 6 samples per intensity. Significant differences can be found using both parametric and nonparametric tests, and a simple model can be constructed and fitted. I would be curious and grateful to hear what you (and anybody else interested in this) think!
Dear Igor, with that sample I think that negative values can cause some trouble for testing, so it depends on the model used, but it is better to use positive values. The construction of panel data seems fine to me if they come from the same source.
Patrice, thanks for your input and analysis!
The questions related to this data are the following:
1. Do the responses of the 2 cell types differ at a particular treatment intensity? What would be the right test for this (i.e. to compare 2 groups of 6 samples)?
2. Can the responses of the 2 cell types for all intensities be described by a simple model?
3. Do the model parameters differ for the 2 cell types?
I would be grateful for your thoughts, if you are interested.
No additional information is gained when p-values from two alternative (parametric vs. nonparametric) tests are used. When there are only six experimental units, you cannot make a strong declaration about what kind of distribution you have, if an interval data set is presented. Therefore, you can use either of the alternative tests. If the t-test is to be used, a natural-logarithm transformation to normalize the distribution of the data set should be done prior to analysis. The U-test can be used without any preparation before analysis. But in any case do not use dichotomous statistical inferences (significant vs. insignificant). The statistical inference should be presented as "It seems to be positive", "It seems to be negative", or "The judgment is suspended". The statistical inference cannot be based on a fixed alpha level (see Hurlbert and Lombardi, 2009, Final collapse of the Neyman-Pearson decision theoretic framework and the rise of the neoFisherian. Annales Zoologici Fennici 46:311-349).
Igor: I understand you had 36 experimental units to which treatments were independently applied. Am I right?
Thanks, Oleg and Ricardo!
Ricardo: yes, you are certainly right. But suppose you want to compare the responses to one particular treatment intensity - then there are 12 units, 6 in each group. What do you think is a good approach in this situation?
I would report confidence intervals for the difference (or ratio) of means or other summaries you are interested in. The intervals convey the strength of evidence you have for these comparisons.
Igor, your data is fine. I started to work on it today, so in a few days I will have results. However, I need to make different assumptions to deal with the samples. So far I have observed that the first two samples have completely different means and somewhat different distribution structures. So please, give me some time. emilio
Hello Igor, your database is small, but that does not mean you cannot use a parametric test. What you should know before using a t-test or ANOVA is whether your data have a normal distribution. A good statistical software package that can help you clarify many doubts is GraphPad Prism®. Find out more about it at: http://www.graphpad.com/
I think you will benefit from this help.
Good luck with your search and its results.
All the best,
Livia Valentin
Igor, we have only 6 data values per sample. Let's assume the following hard premises to make a rough test for the two samples, knowing that it cannot be precise but gives a proxy picture of their structures. Order the data in descending order of the variable. The premises are:
1) Each data point has the same frequency of 1/N = 1/6; 2) each value measured corresponds to the mean of its quantile. Of course the chance of these premises holding is not big, but it gives a preliminary idea.
a) The mean U is the average of the six known values of each sample.
b) If you divide each quantile mean by U, you obtain its value in dimensionless mean units (K@ i), and if you multiply it by the frequency 1/6 you obtain the fraction of the total distributed mass for each quantile: Yi = K@ i * (1/N), where N = 6. Then you accumulate the quantiles to build the six points of the Lorenz curve (Xac i; Li).
c) By estimating Fi = ln(Li)/ln(Xac i) you obtain the points (Xac i; Fi) as proxy structural values for the two samples; graph them and analyse them. The lower the curve, the higher the dispersion.
d) Join the conclusions from point c) with the conclusions from point a), adding your own experience from the measurement process.
When the number of quantiles N is higher, the results improve, because you may compute the mean of each quantile with better precision, and the frequencies get somewhat closer to the real values.
The researcher's experience and judgement about the minimum and maximum values of the distributed variable are very important, because it always occurs that F(1) = K(1) in dimensionless mean units, and those points permit fixing the extreme values in some cases.
In the samples given by Igor there are negative values. This may be solved by adding the absolute value of the largest negative value of the two samples, to make the minimum value equal to zero. It is the same as when you convert negative temperatures to absolute temperatures by adding 273 degrees. At the end you may transform them back to the original units if you need to.
Try it with Excel. The graphs obtained for just the two samples are shown in the attached short file.
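A rough transcription of these steps, under one possible reading of them (Emilio works in Excel; the sample below is hypothetical, not Igor's attached data):

```python
# Sketch of the quantile / Lorenz-curve procedure described above,
# as I read it; all values are made up for illustration.
import numpy as np

x = np.array([5.9, -0.4, 5.1, 7.0, 6.2, 5.8])   # hypothetical sample with a negative value

x = x + abs(x.min()) if x.min() < 0 else x       # shift so the minimum is zero (handles negatives)
x = np.sort(x)[::-1]                             # order in descending value
n = x.size
U = x.mean()                                     # mean of the sample
K = x / U                                        # values in dimensionless mean units
Y = K / n                                        # fraction of total mass per quantile
L = np.cumsum(Y)                                 # Lorenz-curve ordinates
X = np.arange(1, n + 1) / n                      # cumulative frequencies
F = np.log(L[:-1]) / np.log(X[:-1])              # Fi = ln(Li)/ln(Xac i); the last point (1, 1) is excluded
print(np.column_stack([X[:-1], F]))
```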
Ok, thanks, Emilio
I would report and plot the 12 observations. Why not list them in your question, and also tell us what the scientific problem is that you are studying (i.e., what did you hope to learn from these 12 numbers)?
I am sure all of the data will give a certain indication and should be used, but the necessary thing is that the precision of these data should be high in order to give good accuracy.
I would like to add that a data panel seems to be a good way to organize the data when comparing 2 small data sets.
Patrice, I would appreciate your opinion after presenting my results one week ago. I recall your words: "Emilio. The data are so sparse that a) the underlying distribution cannot really be inferred with confidence and b) how could analysis of so little data take a few days?" Best 2014 wishes, emilio.
The data set is so small that precision issues would need to be addressed for the sake of accuracy, and the generalization issues too need to be addressed before finalizing; but there is nothing against analyzing it if you want to.
Muhammad, I welcome your comment. It is clear that the more data points you have, the more accuracy may be expected. Even so, the confidence problem remains. The point I want to make is that even in this case of only six data points, they carry information that may be analyzed as a first step in the preliminary exploration of some research project in its early stages.
I would also do a power analysis to determine the power you have to detect a difference, since the data sets are so small. If you're doing a t-test comparing the means of the two groups, for example, a non-significant result may be due to the small sample size and not to a small difference between the two groups' means.
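A sketch of such a power calculation for a two-sided two-sample t-test with n = 6 per group (the standardized differences plugged in are illustrative assumptions):

```python
# Sketch: power of a two-sided two-sample t-test with n per group as a
# function of the standardized difference d = (mu1 - mu2) / sigma.
from scipy import stats

def power_two_sample_t(d, n, alpha=0.05):
    df = 2 * n - 2
    nc = d * (n / 2) ** 0.5                      # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return 1 - stats.nct.cdf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

for d in (0.5, 1.0, 1.5, 2.0):
    print(d, round(power_two_sample_t(d, 6), 2))
# even d = 1 (a "large" effect) gives power well below 0.8 with n = 6 per group
```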
I think that so few observations are not significant, so I recommend that you add more observations; otherwise you will be in the situation that the other colleagues have been describing.