Empirical data collected by ecologists (and also scientists from other areas) are known to often fail to fulfil the assumptions needed in parametric hypothesis tests. After a long history of scientists trying to hammer their data from every side so that they would fit the assumptions, or even ignoring some "mild violations" because the parametric test would still be more robust, we are in a time when everyone has access to (more or less) powerful computers.
This has led to an increasing use of non-parametric statistics that are able to test hypotheses and make predictive models based on "real data", which can be achieved either through large datasets or data generation based on simulations, re-sampling, permutations, etc. My question comes from the fact that when I ask for advice on data analysis I often get a lot of answers regarding the fulfilment of assumptions or suggesting the use of parametric tests after data transformations.
Of course data transformations are still very important to standardise units of measurement or to reduce the effect of some variables, but do you think normality assumptions are still an issue? Homogeneity of dispersions can still be an issue of course, but it can also be seen as an important feature of the system itself (the fact that one "treatment" has more dispersed or erratic results than another). Are parametric tests becoming the floppy disks of data analysis? Or are there still areas where parametric approaches will always perform better?
I'm an Ecologist and I use both parametric and non-parametric methods depending on my datasets.
I am an ecologist and 90% of my data do not meet parametric assumptions. It is really a problem, because sometimes I need to use certain tests (e.g. two-way ANOVA) and the data just don't fit the assumptions of the test. I am not a fan of data transformation; I mean, you collect real data in the field and then you "transform" your data into other numerical variables that do not reflect what you've found in nature. But the worst thing about using non-parametric stats is that sometimes, even when two groups of data are very different, the test gives P > 0.05. And then you do a parametric test just to check, and P < 0.05. I think we need to re-discuss the application of normality and variance homoscedasticity assumptions in ecological studies.
To my mind, when you have strong reasons (or evidence) regarding the distribution of your data, it is always better to use parametric tests, inferential statistics, etc. In any case, I think it is essential to keep an open mind and to use non-parametric, parametric or Bayesian statistics according to the data (particularly how plentiful they are) and to your personal conceptions (would you prefer to use your subjectivity or not?). Also keep in mind that Bayesian stats are congruent with parametric stats when the data are numerous enough.
Cheers.
What I was taught to do in subjects like biostatistics and numerical ecology 10 years ago was exactly what Vanda and Patrick said: Check your data, if it fulfils the assumptions, go parametric. But nowadays, like Estevão, I find myself using almost exclusively non-parametric approaches (I intend to go deeper into Bayesian approaches when I find some time to study them, Alejandro). It just feels more comfortable to let your data reveal its distribution, I guess. Even if it's ugly, it feels "real", with all the imperfections and oddities that were measured in the field / lab (you still have to clean your data to get rid of outliers).
I don't mean only rank-based approaches. There are lots of relatively modern techniques that perform tests based on distance matrices and calculate P-values by doing tens of thousands of permutations of the actual data to represent the null hypothesis. And I am not even mentioning complex machine learning algorithms that are being increasingly imported from the domains of artificial intelligence, such as artificial neural networks, self-organising maps, support vector machines, complex pattern recognition algorithms that can learn from past experience without the user having to input a single mathematical equation. Although the algorithms are often fitting several functions automatically (it's called a model after all), they are based on raw data and don't have to assume anything in advance.
Robots can nowadays look at patterns and make decisions without having to wonder if the data they collect has normally distributed errors or if the relationship among inputs is linear or not. They just learn based on real data with all the "imperfections" and can decide between good quality/bad quality products, civilian/army vehicle, sick/healthy patient, secure/risky investment, etc.
What I often find is that there is a language barrier between computer experts that develop such algorithms and ecologists who have a totally different background and need guidance or a more user-friendly interface than "The C++ code for this is in appendix A". I often find these things scary and tend to run away from them. :) Should all ecologists go and learn programming languages? Or should algorithm developers and computer experts make their tools available to a wider audience? I have tried some of these techniques but I had to restrict my choice to methods that were incorporated into software with relatively friendly interfaces and some guidance (anything beyond R scares me away with ease).
Substituting a known mathematical function for the actual probability distribution greatly decreased computation time in the past, when you had to calculate things by hand or with extremely slow computers, but nowadays you can take 100000 bootstrap samples or calculate a statistic 10000 times with permutations in a matter of seconds. If the data happen to be normally distributed, there is still no problem with that, but it is often not a requirement of such approaches. As these sorts of methodologies become more and more accessible, will tests and models that require assumptions still make sense?
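Just to illustrate how cheap this has become, here is a minimal base-R sketch (with made-up skewed data, nothing from a real survey) that builds a bootstrap confidence interval for a difference in medians in a couple of seconds:

# Percentile bootstrap CI for the difference in medians between two groups
# (illustrative simulated data only)
set.seed(42)
control   <- rlnorm(30, meanlog = 1,   sdlog = 0.8)   # skewed "abundances"
treatment <- rlnorm(30, meanlog = 1.4, sdlog = 0.8)

n_boot <- 10000
boot_diff <- replicate(n_boot, {
  median(sample(treatment, replace = TRUE)) -
  median(sample(control,   replace = TRUE))
})

# 95% percentile interval for the difference in medians
quantile(boot_diff, c(0.025, 0.975))

Ten thousand resamples take a blink of an eye on a laptop, which is exactly the point.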
I would argue that the assumptions of parametric statistics cannot even be tested appropriately for many ecological datasets, which are often characterized by a small number of data points (limited replication) and a large number of dependent variables (multivariate). How should we identify whether a selected distribution corresponds to our data in such cases, and why would we bother at all? There is a very nice framework for using permutational and null-model based statistics available in several commercial and freeware software packages. I would certainly not say that parametric statistics have no future, but their role is already declining in certain fields like community ecology.
Yes, replication is often a problem, as in some cases a single value can come at a great cost in terms of time and money. And when one tries to see whether that chunky histogram fits a particular distribution, it always leaves a bitter taste if we simply assume it does. However, as Ruben pointed out, sometimes we may want to understand the relationships between our inputs and outputs, and things like GLMs are now widely used in ecology.
There are of course alternative models (like the ones I mentioned regarding machine learning) where we don't have to worry about the relationships among inputs, because the models learn those relationships from your data. You simply introduce a large dataset with input variables and known outputs, and the algorithm learns the relationships that go from one side to the other. Applying methods like single- or multi-layer perceptrons looks like magic, since they are universal approximators and can learn complex relationships among several input variables and generate values for continuous or categorical outputs. The problem with these approaches is that they often need large datasets, which, as Klaus pointed out, are not usually available in ecology. If your dataset doesn't cover every possible combination in the behaviour of the input variables, the model probably won't work well outside the dataset it learned from. The other thing is that interpreting the relationships can be hard and you kind of feel you're not in control. It's like you are replacing your brain with a few hundred or thousand artificial neurons and have to accept the output because it "magically" performed well at solving your problem. There are other alternatives which give you more control, but my knowledge is still limited on this subject.
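For anyone curious, a hedged sketch of the kind of thing I mean, assuming the nnet package that ships with R (a single-hidden-layer perceptron); the data and variable names here are purely simulated placeholders:

# Single-hidden-layer perceptron for a regression-type problem
# (simulated environmental data, hypothetical variable names)
library(nnet)

set.seed(1)
env <- data.frame(temperature = runif(200, 5, 25),
                  depth       = runif(200, 1, 50))
# a non-linear "true" response plus noise
env$abundance <- exp(1 + 0.1 * env$temperature - 0.002 * env$depth^2) +
                 rnorm(200, sd = 0.5)

fit <- nnet(abundance ~ temperature + depth, data = env,
            size = 5, linout = TRUE, trace = FALSE)  # 5 hidden units, linear output

# predictions for new conditions
predict(fit, newdata = data.frame(temperature = 15, depth = 10))

Of course, with real data you would want far more observations, scaled inputs and some cross-validation before trusting any of it, which is exactly the large-dataset problem I mentioned.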
So the view expressed by Ruben is that the classic normality and homoscedasticity assumptions are probably a thing of the past, but modern parametric approaches don't need them and have the advantage of making us hypothesise about the mathematical relationship between inputs and outputs. This seems to be a good niche for parametric approaches to thrive in, but can we deal with complex multivariate relationships in this manner? Can we build a model to generate values for 40 output variables based on 20 input variables, for instance, with great variability of input-output relationships and complex interactions? Or should we just give up and deliver the problem to an almighty CPU? Or both (as with automatic distribution fitting)? Maybe parametric models have clear advantages, but is that the case for parametric hypothesis testing?
It seems to me that this issue was largely solved by the GLM (the "generalized" rather than the "general" linear model) in the 1970s. Since then we have been in a position to run parametric analyses without the need to transform non-normal variables into artificially normal ones.
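For example (a minimal sketch with simulated counts, not real data), a Poisson GLM with a log link models count data directly instead of log-transforming the response:

# Poisson GLM along a gradient (simulated counts for illustration)
set.seed(7)
gradient <- seq(0, 10, length.out = 60)
counts   <- rpois(60, lambda = exp(0.5 + 0.25 * gradient))

fit <- glm(counts ~ gradient, family = poisson(link = "log"))
summary(fit)

# overdispersion is common in ecological counts; a quasi-Poisson
# (or negative binomial) family would then be the usual next step
fit_quasi <- glm(counts ~ gradient, family = quasipoisson)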
To follow up on Ruben's quote "parametric statistics has a bright future", here is a recent paper that might feed this very interesting and exciting debate:
Oswald, S. A., Nisbet, I. C. T., Chiaradia, A. and Arnold, J. M. (2012). FlexParamCurve: R package for flexible fitting of nonlinear parametric curves. Methods in Ecology and Evolution 3: 1073-1077. doi:10.1111/j.2041-210X.2012.00231.x
Hope this answers part of your question, Miguel...
Parametric analytical statistics are always only a crutch in analysing ecological data sets, for many reasons, although many people insist on them. Maybe they don't know better. Try going into Bayesian statistics, or stay with good descriptions of your data by conventional methods (e.g. trends, ranges, etc.).
May I point out here that both frequentist and Bayesian non-parametric machine learning algorithms are used in ecological studies. Bayesian non-parametric algorithms refer to models such as the relevance vector machine, the probabilistic neural network, several semi-naive Bayesian algorithms (e.g. the Lazy Bayesian Rule classifier), and Gaussian processes. I guess it will just take a few more years, or a decade, until ecologists/analysts get a bit more familiar with non-parametric machine learning tools. I also find myself using almost exclusively non-parametric approaches and, if possible, I then compare the non-parametric results with parametric ones.
The normality assumption is still an issue because, often, even after data are transformed (e.g. using log-transformations), the distribution will still be skewed. It is on such occasions that you test for normality using the Kolmogorov-Smirnov test or any other suitable test.
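For instance (a base-R sketch on simulated skewed data, purely for illustration):

# Checking whether a log-transformation helped (simulated data)
set.seed(3)
x <- rlnorm(50, meanlog = 0, sdlog = 1.5)   # strongly right-skewed

shapiro.test(x)        # Shapiro-Wilk on the raw data
shapiro.test(log(x))   # ...and on the log scale

# Kolmogorov-Smirnov against a normal with estimated parameters is only
# approximate (the Lilliefors correction addresses this)
ks.test(log(x), "pnorm", mean = mean(log(x)), sd = sd(log(x)))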
I would like to say that, despite the flexibility and more accommodating nature of non-parametric tests for environmental data that often don't fulfil the assumptions of normality, the importance of parametric tests should not be dismissed because of their strict requirements.
From experience, parametric tests are more sensitive for environmental data and perform better than non-parametric tests, particularly when the data fulfil most of the assumptions.
This discussion seems to start, and continue, on a high, sophisticated level. First and foremost, one needs to look at the beginning of all analyses: what is the nature of the data? Numerical? Categorical (e.g. type of niche)? Continuous? It sounds trivial, but ecologists collect and analyse all kinds of data, and they will continue to do so; in fact, in times of 'big data', more and more data are collected and analysed together that are suitable for one but not for another type of analysis.
Rather than trying to abandon one kind of analyses (parametric or non-parametric), we should better treasure all kinds of data, and think about how best to analyse them - independent of the nature of the statistical approach.
OK, there is enough material here to turn a defence of non-parametric stats into a good paper :)!
Lucie Salwiczek: Sure! We do collect a lot of different kinds of data! The thing is that the beginning of the analyses often stays the same regardless of your approach. And you have both parametric and non-parametric alternatives for every kind of data. You can simply use a distance-based approach and you just have to select the distance (or dissimilarity) measure that best describes your type of data and then apply a test based on that distance measure that uses permutations to calculate a p value instead of a known distribution.
The question is centred on the point where you need to ask "how different is this from my null hypothesis?" "How certain can I be that something is happening here?". I think no-one will deliberately abandon parametric tests for no particular reason. In my case that just happened naturally. Maybe because I felt some type of discomfort when I had to transform my data to fit the assumptions. Or I felt bad for ultimately giving up and ignoring the assumptions, once again assuming that the parametric tests were still more robust than their non-parametric counterparts.
But there are now loads of non-parametric univariate and multivariate analyses that don't need to be rank-based and can be performed using any type of data, and I only have to worry about transformations that are ecologically meaningful. I can still transform my data so that rarer species have their importance increased in the final result, because the numbers would simply not do that by themselves and abundant species would dominate the analyses. Those kinds of transformations bring the mathematics closer to reality, and I am more comfortable with that than with the opposite. You are right when you say we shouldn't discard anything a priori and it is ultimately a choice. But is that choice (parametric vs non-parametric) based on our data, or based on our "beliefs" or "comfort zones"?
Andy Royle and Bob Dorazio wrote a very nice book on hierarchical models in ecology. In this book, they also comment on parametric statistics. Here's the quote from chapter 13.4.
13.4 NO SUCH THING AS A FREE LUNCH
[...] inference in ecology. But this flexibility and generality does not come for free. We rely exclusively on parametric inference in our approach to ecological analysis. This means that we must behave as though the parametric model is the data generating model (i.e., 'truth') when computing inferences or predictions from data. While parametric inference is something of a unifying concept for both Bayesians and frequentists, many practitioners and statisticians are uncomfortable making explicit model assumptions because assumptions of a parametric model are not always testable, and, besides, we know that they are false a priori ('all models are wrong', right?). In response, this can lead to a focus on procedure-driven analysis, procedures that are poorly defined or contain vaguely stated assumptions, or models that are overly complex and cannot be understood. We think a word of caution is in order against these free-lunch approaches.
One reaction against parametric inference is to adopt procedures that are described as 'robust' or 'non-parametric'. Design-based procedures are also enormously popular. But these procedures are often just Red Herrings that effectively produce analyses which are irrefutable, unfalsifiable, or immune to criticism. The analysis of Little (2004, p. 550) in the context of the Horvitz-Thompson estimator (HTE) refutes the non-parametric 'free lunch'. Little notes that the HTE has a precise, model-based justification. While the assumptions of the implied model need not be stated explicitly in order to justify the estimator, he notes that '... the HTE is likely to be a good estimator when [the model] is a good description of the population, and may be inefficient when it is not'. The point being that models are not necessarily irrelevant (or their effects innocuous) for procedures that are claimed to be 'model-free' or 'robust'. Because a procedure can be developed absent an explicit statement of a model does not render it robust against model pathologies.
Some statisticians routinely engage in the development of complex models under the guise of 'hierarchical modeling' to introduce complexity – hierarchical modeling for the sake of hierarchical modeling. While the motives are pragmatic, the end result is a model which is unassailable, unrepeatable, unfalsifiable, and beyond comprehension – metaphorically, if we may, a hierarchical Rube Goldberg device.
In Chapter 1 we quoted Lindley (2006) from his book 'Understanding Uncertainty':
"There are people who rejoice in the complicated, saying, quite correctly, that the real world is complicated and that it is unreasonable to treat it as if it was simple. They enjoy the involved because it is so hard for anyone to demonstrate that what they are saying can be wrong, whereas in a simple argument, fallacies are more easily exposed."
Words to live by.
We are enthusiastic proponents of model-based, parametric inference as a general framework for ecological analysis because, in the words of our colleague W. Link (Link, 2003), ‘Easily assailable but clearly articulated assumptions ought always to be preferable’. In matters of scientific inquiry, simplicity is a virtue. Not necessarily procedural simplicity, but conceptual simplicity – and clearly assailable assumptions.
The problem with transforming data to meet parametric assumptions is then how to interpret the results.
Many of the comments make excellent points. I suggest looking at Wilcox, R. R. (2012), Modern Statistics for the Social and Behavioral Sciences: A Practical Introduction, for more details about when and why transformations can fail, the many new and improved parametric methods, and modern non-parametric methods.
Briefly, all of the classic methods, including non-parametric methods, perform well in terms of Type I errors when distributions do not differ in any manner. But under general conditions, when distributions do differ, they perform poorly. One reason is that the wrong standard error is being used. Modern methods for dealing with outliers can make a substantial difference in terms of power and in getting a deeper and more accurate summary of the data. Modern methods are being used by many researchers, but it is evident that most are not yet aware of them. The attached paper (Wilcox, R. R., Carlson, M., Azen, S. & Clark, F. (in press). Avoid lost discoveries, due to violations of standard assumptions, by using modern, robust statistical methods. Journal of Clinical Epidemiology) gives you some sense of the practical utility of modern techniques.
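As a small illustration of one robust idea (a base-R sketch on simulated data, not taken from the paper): compare 20% trimmed means with a percentile bootstrap rather than relying on the usual standard error.

# Percentile bootstrap for the difference in 20% trimmed means
# (simulated groups with unequal variance, as discussed above)
set.seed(11)
g1 <- rnorm(25)
g2 <- rnorm(25, mean = 0.6, sd = 3)

trim_diff <- function(a, b) mean(a, trim = 0.2) - mean(b, trim = 0.2)

boot <- replicate(5000, trim_diff(sample(g1, replace = TRUE),
                                  sample(g2, replace = TRUE)))

quantile(boot, c(0.025, 0.975))   # 95% CI for the trimmed-mean difference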
Kikvidze, Z. and Moya-Laraño, J. (2008). Unexpected failures of recommended tests in basic statistical analyses of ecological data. Web Ecology 8: 67-73.
Ecologists, when analyzing the output of simple experiments, often have to compare statistical samples that simultaneously are of uneven size, unequal variance and distribute non-normally. Although there are special tests designed to address each of these unsuitable characteristics, it is unclear how their combination affects the tests. Here we compare the performance of recommended tests using generated data sets that simulate statistical samples typical in ecological research. We measured rates of type I and II errors, and found that common parametric tests such as ANOVA are quite robust to non-normality, uneven sample size, unequal variance, and their effect combined. ANOVA and randomization tests produced very similar results. At the same time, the t-test for unequal variance unexpectedly lost power with samples of uneven size. Also, non-parametric tests were strongly affected by unequal variance in large samples, yet non-parametric tests could complement parametric tests when testing samples of uneven size. Thus, we demonstrate that the robustness of each kind of test strongly depends on the combination of parameters (distribution, sample size, equality of variances). We conclude that manuals should be revised to offer more elaborate instructions for applying specific statistical tests.
Non-parametric statistics were developed as an alternative to parametric statistics, to provide statistical interpretation when the characteristics of the data undermine the results of parametric tests. I agree that the use of each test should be well justified. I personally always run parametric and non-parametric alternatives simultaneously, but in papers I report only the one that gives the most robust results and confirms or rejects a hypothesis in the clearest and most reliable way.
There is a recent trend to use effect sizes instead of hypothesis testing. It is an interesting idea that should be looked at more carefully.
David Vaux published this interesting note about statistics & scientists in one of the recent Science issues (attached). I think he is right about the need to go back to basics in statistics. Bayesian methods & multivariate analyses seem to be the dominant methods in ecology these days, but to master those kinds of complicated analyses you have to have your basic statistics toolbox clear & straight. Parametric analysis will always be an important part of the toolbox. It is the user & the data that will make the analysis powerful or meaningless.
Ruben, I agree with your comments about the note in Science, but I don't think it is as bad as "misleading". The author made some points that are valid, regardless of your criticism. I mentioned the note as "interesting", not authoritative. Here are two articles about the topic of hypothesis and significance testing that I do think are authoritative. Best regards.
Here is the other one (it seems you can upload only one at a time here). Sorry for the quality of the copy.
Indeed, Kathryn. I find that this is happening so fast that postgraduates are often teaching these things to their supervisors. :) If you know about papers that can contribute to the discussion or even help anyone get introduced to all these "new" approaches, please share the references or upload them here!
I am glad that I asked this question. My RG score is growing steadily so I don't even have to publish papers anymore. Hahaha!
Jokes aside, I really think this is an important issue. More and more I regret not having an expert on these matters sitting next to me all the time. Every science department should have a team of experts in these methods who could dedicate their time to keeping up to date with what can be done nowadays. Usually these advanced algorithms take some time to be incorporated into more user-friendly software, often they never are, and there is always a language barrier. And ecologists venturing into those areas may make mistakes due to a lack of solid background experience and create more confusion.
I guess parametric is not dead, from what I've read, but it also feels like a rather subjective choice. The only thing that can be objectively better or worse is your sampling design and your power to detect a given effect size with the tests you choose. Perhaps that is what matters in the end. It is fine if you go either non-parametric or parametric, as long as you can objectively support your decisions regarding your hypotheses or models. Did I get it right?
Ruben, maybe it is just a case of too many papers and not enough care?
There are many methods of making inference about the world but they differ in their assumptions, their ability to accommodate existing information relevant to the topic under study, their ability to accommodate information regarding how the data was collected (sampling design), their ability to accommodate how the measurements were summarized ('response design'), and the 'quantities of interest' for which inference is desired. Some questions are best addressed through fitting models, some through simply summarizing characteristics of the underlying distribution of the observations, some through thoughtful data visualization or interval estimates of differences or even p values. From my perspective, these various methods are all part of Statistics since they are concerned with inference. The challenge is to help our colleagues and collaborators understand the requirements and tradeoffs among this ever growing suite of tools.....and not to forget the fundamental lessons of the intimate relationship between how the data is collected, how the data can be analyzed, and the questions of interest. These fundamentals seem to get lost more easily as the method choices multiply and splinter into 'interest groups'.
I've enjoyed this discussion - thanks for initiating it.
Cheers.
Great discussion... I feel that I am somewhat in agreement with Miguel's opening statement. Parametric statistics are specific and often totally useless in the face of huge, weird and interesting ecological data sets. Yet they are very powerful if you have to answer the particular question they were designed to answer. If I have normally distributed data with equal variances and I want to see if the means are different, you can be damn sure I will do a t-test; it's quick, easy and powerful. I won't be designing a likelihood-based or Bayesian test from scratch every time I need to check whether means are different.
But if you need a specific tool to handle a difficult problem (one of my recent ones was spatial clustering with non-normal diffusion), you can't just try to fit it into a pre-made box (we ended up using a Bayesian Dirichlet process mixture model). Parametric off-the-shelf tests can save you time, but they can also waste your time as you struggle to meet requirements and find the right transformations.
I agree with Kathryn: we need a range of different tools in our toolboxes, ranging from the blunt, practical hammer of the t-test to the laser-guided surgery of designing your own Bayesian approach, with all of the likelihood- and AIC-based stuff in between.
Personally, I'm a pretty heavy Bayesian (you may have noticed!). I think the main reason for this is a philosophical one (I think inductive inference, or forming hypotheses given data, is really what we do when we do science, not just failing to be wrong a certain number of times), and like all philosophy I think you shouldn't go around bullying other people because you feel a different approach is true. It was the bullying and shouting down of other approaches that meant we only got Fisher's parametric statistics for years anyway!
Mark - I unfortunately don't know enough about Bayesian methods to be able to use them. I am afraid it is not something you learn in a couple of days and I suffer from a suffocating lack of time as I approach the last months of my PhD grant... (*screams*).
Anyway, nowadays I find myself using almost only distance-based methods that calculate significance with permutations. I don't even check for assumptions to consider doing a t-test or an ANOVA. I just check for outliers and I do a permutation-based test to check if multivariate dispersions are similar (that can still affect your results). I deal mostly with community ecology and I often have to test the response of 40+ species to multiple factors, so it is much simpler to me to translate everything into a distance measure of my choice (Bray-Curtis, modified Gower, Euclidean distance, etc.) and then use PERMANOVA instead of ANOVA and permutation-based t-tests instead of a traditional t-test. The only difference is that I am not comparing my data to a theoretical distribution, but I am generating a distribution from my data by shuffling observations between groups ten thousand times or so. The only assumption here is that observations can be shuffled to represent a null hypothesis. For me this feels more "real" and down-to-earth, because I'm shuffling real field observations. If one group is not significantly different from another, then I could have taken any of my samples from any of the groups.
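In case it helps anyone, this is roughly what that workflow looks like as a hedged sketch, assuming the vegan package and a simulated community matrix (my real data are of course different):

# Bray-Curtis dissimilarities, PERMANOVA and a dispersion check
# (simulated 20 samples x 40 species, hypothetical groups)
library(vegan)

set.seed(5)
comm  <- matrix(rpois(20 * 40, lambda = 3), nrow = 20, ncol = 40)
group <- factor(rep(c("impact", "control"), each = 10))

d <- vegdist(comm, method = "bray")

adonis2(d ~ group, data = data.frame(group = group), permutations = 9999)  # PERMANOVA
permutest(betadisper(d, group), permutations = 9999)  # homogeneity of multivariate dispersions

The dispersion test is there because differences in multivariate dispersion alone can also produce significant PERMANOVA results, which is the point I made above about checking dispersions first.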
Even if a fully parametric alternative could be more robust (and it is not guaranteed), I would spend much more time trying to make everything fit the assumptions and transforming and checking and double checking. But this is a personal choice in the end, I guess. It is a philosophical matter, as you said, and ultimately a matter of personal comfort or even a matter of tradition in your institution or research team.
Oh, and if I saw a Bayesian Dirichlet process mixture model walking towards me, I would move to the other side of the road. I don't know what it is, but it sounds scary. :)
Miguel, I am sure that some people find your bootstrapping(?) methods scary, but then this question thing is a teach-in, isn't it?
Of course it is, Peter! But the fact that a particular subject is interesting doesn't make it less scary on a first encounter! :) I was exaggerating, of course. I do want to learn as much as I can to increase my analytical toolbox, but complex methods tend to have a rather steep learning curve that takes some serious climbing to get to an acceptable level of expertise.
Just an FYI re: permutation methods - Julian Besag, who very much knew his history of statistics, once pointed out to me that many of the normal-distribution-based hypothesis-testing tools were originally developed as analytical approximations to permutation-distribution approaches. That is, the permutation approach motivated the development of the test statistic but could not be implemented computationally except for very tiny data sets.
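You can see Besag's point in a few lines of base R (simulated data; the two p-values usually land very close to each other):

# Two-sample permutation test vs. the classical t-test (illustrative data)
set.seed(9)
a <- rnorm(12, mean = 10); b <- rnorm(12, mean = 11)
obs <- mean(a) - mean(b)

pooled <- c(a, b)
perm <- replicate(10000, {
  idx <- sample(length(pooled), length(a))
  mean(pooled[idx]) - mean(pooled[-idx])
})

mean(abs(perm) >= abs(obs))   # two-sided permutation p-value
t.test(a, b)$p.value          # usually very close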
Thank you, all participants... I enjoyed and learned. I too am a user of statistics in biology/ecology, not a statistician, and I also offer a simple course to my PG students. I love to use statistics because it is very useful for reaching a conclusion. There is so much variation in biological data; often we want to compare the sources of variation and reach a conclusion, a conclusion that is more reliable for making our point.
So I use the t-test or ANOVA for data that approximate a normal distribution, and non-parametric tests like the contingency chi-square and rank tests for data that are non-normal... it works for me and many of my students.
Of course there are more complicated issues/data and also more complicated tools... what I like to emphasise is that we have to understand why we need the tool, what the logic of using statistics is, and also which particular tool we need and what its purpose and abilities are.
And the simpler the logic, the more appropriate the use and interpretation will be.
I hope to learn more from you. The discussion is simple but useful, at least for a novice like me... thanks again.
This comment is only indirectly related to this topic, but anyway… Knowing real distributions and comparing them to the expected ones (expected distributions can be normal, Poisson, uniform, etc, or generated by randomization null-models) can be very useful because most theories and models in ecology and evolution are based on statistical mechanisms. So it is important to look at distributions not only for knowing how legitimate the use of parametric tests is.
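For example (a base-R sketch on simulated counts, purely for illustration), one can compare an observed count distribution against a Poisson expectation with a randomisation null model:

# Compare observed aggregation against a Poisson null via simulation
set.seed(21)
obs <- rnbinom(100, mu = 4, size = 1)      # clumped "field" counts
obs_ratio <- var(obs) / mean(obs)          # variance-to-mean ratio

null_ratio <- replicate(10000, {
  sim <- rpois(length(obs), lambda = mean(obs))
  var(sim) / mean(sim)
})

mean(null_ratio >= obs_ratio)   # how extreme is the observed aggregation under the null?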
I believe both parametric and non-parametric tests have their place in science and research but, as you rightly said, meeting the assumptions of parametric tests is becoming elusive even in the social sciences. I still believe the two are very important. Perhaps we should look at our sampling techniques as well.
I can see that the actual mathematical model underlying our observations can sometimes be the main question, so it must be answered through parametric methods. But can we say that non-parametric methods are worthless? Can we totally put them aside and figure out everything using parametric alternatives, adjusting our data to distributions or vice versa? In Ruben's view, if I got it right, these underlying processes (response patterns or probability distributions) should always be known and described, regardless of the question we ultimately want to answer.
The need to describe "laws" of nature, in terms of mathematical models is, by definition, parametric. So this seems to answer the part of my question where I asked if there were any areas where parametric will always be required. Is there any case where non-parametric brought us an advantage (other than a "free lunch")? Are non-parametric approaches just a computer-intensive, complex way of being lazy, or can we actually achieve things that are not possible with parametric approaches?
There must be many multivariate problems that can only be approached using computer power. Some of these methods need to use appropriate minimal data structures (i.e. not default data structures), together with clever programming, if they are to run in real time. My own experience with this was in the area of species identification using a reference base of more than 50 traits and 500 species. The problem was how to compare identification performance between tree groups having different numbers of species. See the free download of my 2009 paper. The related area of phylogenetics has far bigger and less tractable problems using similar techniques.
I agree with those who argue the 11th Commandment -- Thou shalt not worship the normal curve and two standard deviations. Parametric statistics, as I understand it at an unfortunately limited level, assumes normality. My position is not based on statistical expertise, or lack of it, so much as it is based on the generation of a large set of data demonstrating that major areas of morphological (i.e. morphometric) data taken from Nature do not show parametric, i.e. normal, distributions. The data plots are often multi-modal, there is no correlation between mean and primary mode, mean values in replicate populations are rarely similar (in fact they may indeed represent chaos, if I truly know what chaos is), and the location of (multi-)modes varies and is unpredictable. Parametric statistical probability is not productive (not appropriate? not applicable?), whereas non-statistical comparison of patterns of distribution of variation (i.e. variametric analysis) can reveal critical insights not revealed by parametric statistical analysis. Variametric analysis is advantageous, is not necessarily complex, and is neither a free lunch nor lazy. Moreover, advocates of parametric statistics should not be threatened by the observation that statistical analysis is not a universal panacea. We must accept the likelihood that major areas and aspects of Nature are not "normal."
The Poisson distribution is not normal, and a good deal of modern statistics is based on assumptions of a Poisson distribution. And these statistics are parametric (e.g., modelling species abundance along gradients).
Also, computer-intensive methods such as permutations and randomisations are not "non-parametric" in the usual sense, but rather free from assumptions about distributions, because these methods produce empirical distributions under a given null model and from the given data.
To answer quickly, I think there is no future for parametric statistics in biology in general.
Parametric tests need normal data, and I have never seen convenient data in biology: they are either non-normal or have non-homogeneous variance.
My response is not meant to add new insight into the problem; all the scholarly responses have given deep insight and suggested invaluable solutions. In fact, my effort is to summarise the respected opinions expressed in this thread:
1. The first point that needs to be emphasised is that statistics cannot be divorced from its mathematical foundation, which is based on the assumption of continuity in space and time. Spatial and temporal lags are not uncommon in ecology.
2. Natural systems are complex, heterogeneous and diverse. If we look in detail, in fact, we see that each system is unique, differing from all others in various characteristics. Scientific investigation is largely a process of simplifying and selecting from such systems a small set of key components, governing factors, and relationships that are sufficient to describe how the system works. Ecological systems consist of the biological community and the abiotic environment. Many simultaneous and complex interrelations exist among the environmental variables, between the environment and the community, and within the members of the community. Since a multitude of relationships typically implies redundant information, important predictors either have to be selected or the information must be compressed. Selection means simplification or, in philosophical terms, a reductionist approach, while compression of information means loss of information.
3. The major problem in ecological analysis/modelling is not only complexity, as pointed out above, but also uncertainty. Uncertainty in the present context concerns the uncertainty of data and vaguely defined expert knowledge. A large inherent uncertainty of ecological data results from the presence of random variables, incomplete or inaccurate data, approximate estimations instead of measurements (due to technical/financial problems or failure/malfunctioning of measuring instruments) or incomparability of data (resulting from varying measurement or observation conditions). In many cases, data may simply be guesses or estimates, such as the number of aquatic species and their respective populations in a lake. Or there may be a lag between cause and effect. Thus, the large amount of data uncertainty results in what you pointed out in your post: "Empirical data collected by ecologists (and also scientists from other areas) are known to often fail to fulfil the assumptions needed in parametric hypothesis tests."
4. The best solutions for ecological analyses may be found in multiple non-linear regression analyses, Fuzzy Logic, ANNs etc.
Ruben - Indeed the methods I tend to use more frequently (multivariate ANOVA using permutations, permutational t-tests and things like that), which I described in a previous answer, may be more accurately called semi-parametric, I guess. The thing is that the null hypothesis is represented by permuted observations from real data, so it is like generating hypothetical values from empirical data to represent the hypothesis of no difference (which is then compared ten thousand times or so with the real scenario to generate a probability distribution).
Regarding my question, I guess both parts still stand. In the beginning (due to some preconception, perhaps), I only stated two options, which were "non-parametric only" or "both", assuming that, as for myself and a lot of people I work with, non-parametric approaches were becoming more and more frequent. However, I saw that some people (Ruben being an avid supporter) go for "parametric only", which is a viewpoint that I think should be brought into the discussion by more people. :)
Perhaps if I had asked "what is the future of normality and homoscedasticity assumptions in ecology?" the answer would be more unanimous (I am asking this now, though).
When working with species distributions (besides past experience of convincing myself that the probability distribution looked normal), I have already tried to run an automatic distribution-fitting routine on the abundance of every species, and in most cases there is nothing that comes close to the actual histogram generated from field data... The other options always seemed a bit far-fetched... Often the histograms don't have many values, because one sampling unit can be costly (in terms of time and/or money) and we end up with 5 or 10 or 20 values per sampling site, leading to a very awkward histogram full of holes in the middle. And you may say "you have a small sample size", but sometimes we either have that or nothing... Sampling design always has a lot of constraints and limitations, usually in time and money, that need to be taken into account. If a single value on a spreadsheet costs 5 hours of work on average, you may not be able to afford having 100 or 200 or 1000 data points to work with. And, while we may have enough power to detect a meaningful effect with a given sample size, we may not have enough evidence to look at the probability distribution and say it looks like this or that theoretical shape...
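For what it's worth, the kind of distribution-fitting exercise I mean looks roughly like this (a hedged sketch assuming the MASS package, with simulated counts standing in for my awkward field histograms):

# Fit candidate distributions to a small set of abundances and compare by AIC
library(MASS)

set.seed(13)
abund <- rnbinom(15, mu = 6, size = 0.8)   # few, highly variable counts

fit_pois <- fitdistr(abund, "Poisson")
fit_nb   <- fitdistr(abund, "negative binomial")  # may warn with so few points

AIC(fit_pois, fit_nb)   # with 15 values, neither fit may be very convincing

Which is exactly the problem: with so few points, the comparison rarely settles anything.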
This is an excellent discussion; thanks all of you for it. I think that, beyond the point of parametric statistics and its future, one thing that is central to statistics and ecology is that, despite many recent (and not so recent) calls to use statistics wisely in ecological research (see Stephens et al., Inference in ecology and evolution, TREE, doi:10.1016/j.tree.2006.12.003; Romesburg 1982, Wildlife science: gaining reliable knowledge, JWM), many journals in ecology and conservation keep publishing questionable research with a lot of statistical problems, some already mentioned here, such as false inference (type I or II errors), pseudoreplication, and lack of fulfilment of the "assumptions" of the applied tests, among many other problems. These problems end up rendering this research unreliable or even spurious. A good recent example is this paper in Conservation Biology (Ramage et al. 2013, Pseudoreplication in Tropical Forests and the Resulting Effects on Biodiversity Conservation, Conservation Biology, DOI: 10.1111/cobi.12004). The authors found in their review of the literature that, of 77 studies, only 7% were free of pseudoreplication! So, to me, the key point is statistical inference. You can use whatever statistical tool you want, parametric or other, but you have to use it wisely and with quality data (otherwise, as C. J. Krebs says, "garbage in, garbage out")... and please, please, journal editors, do not publish statistically unreliable results.
Thanks to all for this discussion, it is really excellent. I will try to contribute with an abstract comment, restricting myself to topics related to ecology. I cannot see any reason to compare the reliability and/or benefits of parametric versus non-parametric statistics, since both are tools for analysing a set of data. Moreover, I also think that their potential to deal with a high or low degree of complexity and/or heterogeneity is not a criterion for the superiority or inferiority of one against the other. The problem lies in our preference for what type of statistics to apply to the data set we deal with. This is a particularly long-lasting challenge in ecology, because most of the data we obtain with various methods or tools, despite continuous technological improvements, still contain defects in accuracy and precision, mainly originating from the extreme range of spatial and temporal scales among the components of a given ecological process (nanometres to thousands of kilometres, milliseconds to thousands of years). Unfortunately, most (though not all) of our measurements are not "real data" but "approximations" with a degree of heterogeneity. Thus, I think that ecologists need more advanced measurement technology, and they have to be persistent in demanding data sets that represent the actual temporal and spatial scales of the ecological processes they investigate, namely with better (appropriate) accuracy and precision. Consequently, the statistical tools, within their mathematical foundations, are very well-defined engines working under very well-defined conditions. It is not fair to judge their merits or benefits when they fail to extract any information from our data set. But of course this does not necessarily mean that there is no need to improve the statistical tools currently available.
Very good discussion. I think that the kind of statistics you use depends on your sampling strategy (short term, long term), the characteristics of the variable to be measured (categorical, continuous...), the objectives (parameter estimation, comparison...), etc. The key to using parametric statistics is that you consider your sample as a "photo" of an actual situation, a view at an instant of time, so that the population parameters exist as unique (fixed, invariable) quantities. If the phenomenon implies following some trend, time could affect our assumption of "constant parameters". Another important aspect is the sampling design: you know that if you have a good sample (sample space well defined, good sample size, randomization... etc.), a probability distribution arises in a parametric way, independently of the distribution of the variable in the individuals used to take the data.
I'm not a statistician and have little idea about this discipline's intricacies. My involvement with this subject is that together with the fishing industry I find myself at the receiving end of its often fallacious products.
The art of fisheries management represents in fact a branch of marine ecology. It employs statistics as a tool for fish stock assessments, for which many models are available. But I think they are all plagued with process errors, uncertainties and other maladies. If half the money and effort spent on working out more and more sophisticated computer models had been directed at improving and developing the collection and laboratory processing of fishery and environmental data, models containing more detailed values representing the plethora of factors acting in marine ecosystems might bring stock assessment closer to a real science.
Independent scientists, for various reasons, have increasingly expressed skepticism regarding the present status of the computerised basis for fishery management. In my opinion, a true scientific approach requires that the models comprise all the major variables that affect stock size, including everything that is today covered by the term "natural mortality" (i.e. all that is not "fishing mortality"), which should be split into its main elements, such as predation, food scarcity, major hydrographic shifts (temperature, salinity, flows), and pollution (permanent or sporadic). All the more so because in most assessments the value of natural mortality, across the whole variety of commercial fish populations, is assumed to be a sort of constant chosen by a scientist some 100 years ago.
Where quantification is impossible, or can only be approximated, techniques that can handle ranges (from… to…) rather than precise figures must be involved. Obviously, truly scientific fish stock assessment models cannot produce precise figures. For the application of statistical modelling to marine and fisheries ecology to resemble real science and make more practical sense, its products must be presented not in precise figures (today's practice), but rather in the form of uncertainty estimates, fuzzy logic being one of possible methodologies. MB-Y
I am not an ecologist, but I think this question is about something all scientific areas have in common: is there a future for "simple" models when one can take a computer and crunch the data? I guess that "simple" models, in the sense of making it much easier to understand the mechanism, are always good. The point is that one should try to rely on the simplest possible model, even a semi-empirical one, rather than on huge computational power, which always shadows real understanding and always has hidden "tricks" that can create wrong answers.
Philippe's comment is a bull's eye that displays the need to improve mathematical analyses so that they can deal with huge computational power without hidden tricks, and thus the value of advancing slowly and modestly, but understandably, in our analyses under currently prevailing circumstances.
Please note: if you are going to use R's mt.maxT for permutation testing and you run into computational performance issues, then you should try the R SPRINT package. It provides a parallel implementation that has been shown to give good performance across a range of platforms, from Linux PCs to supercomputers and the cloud.
You can find out more about its performance in the Concurrency and Computation journal article at
http://onlinelibrary.wiley.com/doi/10.1002/cpe.1787/full
While a great deal of advancement has occurred in computer intensive non-parametric analyses, these advances are generally related to frequentist inference with strong emphasis on hypothesis testing.
In contrast, much development in statistical literature as applied to ecology has also been focused on the fallacy of an over-reliance on hypothesis testing with recommendations to focus more intently on estimation and model development followed by careful statement of uncertainty bounds for models and their predictions.
The latest advances in this area have come in two forms: 1) classification and regression trees (CART), and 2) Bayesian hierarchical models. The CART methods are purely predictive and have limited application for identifying cause-and-effect relationships, whereas the Bayesian advances are strong for both prediction and causal analysis, but both come at the expense of very heavily model-based (read: parametric) analysis.
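For illustration, a minimal CART sketch assuming the rpart package (simulated data and hypothetical variable names, not a real study):

# A regression tree splits on the predictors without distributional assumptions
library(rpart)

set.seed(17)
dat <- data.frame(temp = runif(300, 5, 25), ph = runif(300, 6, 9))
dat$richness <- ifelse(dat$temp > 15 & dat$ph > 7, 30, 10) + rpois(300, 3)

tree <- rpart(richness ~ temp + ph, data = dat, method = "anova")
printcp(tree)                                        # complexity table for pruning
predict(tree, newdata = data.frame(temp = 20, ph = 8))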
These tools in addition to the old standbys will remain with us, and emphasis should be on study designs suitable to support intended statistical methods, followed by careful selection of statistical methods suitable for the complexity of the problem under study. A simple permutation test may be all that is needed to evaluate a simple one factor designed experiment, whereas a much more complicated hierarchical model may be required to integrate multiple data sources from an observational study.
The skillful will be adept at matching the proper tool(s) to the problem at hand, recognizing that not all problems are nails requiring a hammer.
This may be of some help:
https://www.researchgate.net/publication/292330226_Non-Parametric_versus_Parametric_Methods_in_Environmental_Sciences
Article: Non-Parametric versus Parametric Methods in Environmental Sciences