Data analyzers inspecting tables or figures might decide to exclude from statistical analyses unusual data points sometimes called 'outlier' data points. Statistical patterns and conclusions might differ between analyses including versus excluding outliers.
The exact underlying mechanisms that create outlier data points are often unknown. People might always find arguments to exclude or keep data in analyses. How important is familiarity with model species or model systems in the justification of data point selection, or the definition of statistical rules in general?
Dear Marcel,
An outlier is an observation that appears to deviate markedly from other observations in the sample. An outlier may indicate bad data. For example, the data may have been coded incorrectly, or an experiment may not have been run correctly.
If it can be determined that an outlying point is in fact erroneous, then the outlying value should be deleted from the analysis (or corrected if possible).
In some cases, it may not be possible to determine if an outlying point is bad data. Outliers may be due to random variation or may indicate something scientifically interesting. In any event, we should not simply delete the outlying observation before a thorough investigation. When running experiments, we may repeat the experiment. If the data contain significant outliers, we may need to consider the use of robust statistical techniques.
An excellent review article on the subject:
Rousseeuw, P. J., & Hubert, M. (2011). Robust statistics for outlier detection. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 73-79.
http://onlinelibrary.wiley.com/doi/10.1002/widm.2/abstract
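To make the "robust statistical techniques" suggestion concrete, here is a minimal sketch in Python (with NumPy) of a median/MAD-based modified z-score, one common robust alternative to mean/SD-based flagging. The 3.5 threshold follows a widely used rule of thumb, and the data values are invented for illustration.

```python
import numpy as np

def mad_flags(x, threshold=3.5):
    """Flag points whose modified z-score (based on the median and the
    median absolute deviation, MAD) exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 0.6745 scales the MAD to be comparable to a standard deviation
    # under normality (the usual modified z-score formulation).
    modified_z = 0.6745 * (x - med) / mad
    return np.abs(modified_z) > threshold

data = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]  # one suspicious value
print(mad_flags(data))  # only the last point is flagged
```

Unlike the mean and SD, the median and MAD are barely moved by the extreme point itself, so the flagging is not distorted by the very value being tested.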
I agree that outlier data are not always 'errors' (e.g. resulting from experimental artifacts or typing errors in data files), but just the result of an unusual event/factor that was missed during the study.
Is it scientifically wise to define outlier data when the data analyzer only has access to the data distribution pattern? This will happen when data analyzers are not familiar with the study system or model species involved.
You could also say: How important is it to reveal what people call 'statistically significant patterns'? Each individual data point in the cloud of points will probably be influenced by a unique cocktail of underlying mechanisms.
Dear Marcel,
When we have bad data, it is easy: we just delete the outlier(s).
You are right that in some cases the outliers may result from an unknown/unusual factor. These cases are hard to deal with, as keeping or deleting the outliers results in very different conclusions! As I wrote in my previous post, I think that one possible way is to apply robust statistical methods. I am going to close the RG session for now. Good day!
Perhaps people just increase the sample sizes until statistically significant patterns are obtained, and one reason might be to reduce the statistical impact of so-called outlier data points without excluding them from the data set...
The question then is: Should the study or sampling period stop once statistically significant patterns are obtained, for instance to avoid the risk of the reappearance of new so-called outlier data points?
Another point for discussion:
Can the same data point be considered as an 'outlier' in one study domain, but accepted as a 'normal' data point in another study domain? Any examples?
An outlier is an observation that is distant from the mean of observations. In SPSS, we may delete outliers if they affect the results.
Dear Marcel,
I believe the same data point can be rejected as an outlier in one study domain but accepted in another. The sensitivity and risk of data collected for medical studies differ from those of data collected for marketing studies.
Dear Mahfuz,
Concerning your point related to SPSS, you assume that only the pattern is important, not the underlying mechanism. A data point caused by an error or an unusual event is therefore considered of equal importance.
Concerning your second point, let's take following two examples for more discussion:
1) A study in human sciences interested in the underlying causes of human errors may assume that individual errors are just a part of human nature. Unusual data points in a data cloud where body length is plotted against body mass could in this framework be considered 'normal' data points, even if the points result from measurement errors or typing errors. However, the same unusual data points (e.g. an unusual length) might not be accepted in a business study interested in marketing and clothes manufacturing. Producing clothes for people with unusual dimensions would perhaps be considered too expensive.
2) When body length is plotted against body mass, there may be a nice positive relationship with scatter when different human age classes are combined, when data from men and women are combined, or when data from a single age and gender class are considered. But we also know that each individual data point representing body mass and body length from a single individual may be influenced by a unique cocktail of underlying biology-based mechanisms related to the rearing environment, genetic background, culture-based diet, etc. Perhaps in these conditions a data point A representing an individual A that is situated in the middle of a data cloud plotting body length against body mass might have been caused by an unusual, unidentified biology-based mechanism.
The definition of 'unusual' might be scale-dependent: either 'pattern-based' (e.g. SPSS) or mechanism-based (research domain dependent), or something else.
It is justifiable to exclude an outlier: 1) if you believe, for a good reason, that the result contains measurement error; 2) if you know that the result is associated with a one-off event that is unlikely to ever happen again.
Yes Marcel. Concerning my point related to SPSS, I assume that only the pattern is important, not the underlying mechanism. In Humanities, the most important thing is the pattern of data. As for the second point, you have mentioned a good example.
So one option might be to systematically exclude, before the statistical analyses, the upper (5%) and lower (5%) extremes from a data set, accepting that there is always a high probability of making errors just by chance alone...
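A hedged sketch of this 5%-trimming idea, using NumPy and SciPy (the simulated data are arbitrary, chosen only to show the effect):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = np.append(rng.normal(10, 1, 98), [40.0, 55.0])  # 98 ordinary points plus two extremes

print(np.mean(data))                # the raw mean is pulled upward by the two extremes
print(stats.trim_mean(data, 0.05))  # drops the top and bottom 5% before averaging
```

The trimmed mean stays near the bulk of the data without requiring a case-by-case decision about each extreme point, which is exactly the appeal (and the danger) of such a blanket rule.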
@Marcel
I think that in zoology the upper and lower 5% are the most interesting - or at least could be the most interesting.
I exclude outliers only if a mistake/error is obvious. My last exclusion, with fully recalculated statistics: an error in the recorded body length of trapped mice. It was 67 mm, while the body weight was over 40 g. Having been in the field for more than 30 years, I know that this species simply cannot be that small in size (or that obese). So I excluded both measures from the data, as the error cannot be corrected.
On the other hand, if the 5% of the mice with the highest body condition are all from the control zone, and those with the worst body condition are from the most heavily polluted zone, this is a nice finding, don't you agree?
Dear Linus,
I fully agree. We use similar procedures of judgement before the data are entered in the long-term data base of small passerine birds. When values are extreme based on >30 years of observation, they are not included in the data base. For instance, a great tit has a wing length of 70-78 mm (no values below 70 mm). If '66 mm' is written in the note books, this must have been a writing error and is therefore not considered. On the other hand, there may indeed be border cases, and there may be exchange between populations differing in phenotype, like wing length.
Thus, how many 'outlier data points' that are found in field note books will not end up in electronic data bases, and how do data managers varying in background information about model species handle these special cases when the files are constructed?
@Marcel, geographic differences in small mammals may be exciting; we just got a new species, where diagnostic characters are out of the range compared to southern populations. These were not outliers, just smaller values. The paper is accepted; soon I will present it on RG.
An outlier is not always out of the analysis, unless the results would be out of context or not meaningful. Sometimes we should be cautious in considering an outlier as an error. In biomechanics research, a filter can be used to separate the range of real values from fake ones due to error.
I do not like to exclude data just because it is an outlier, especially if it is based on enough data. However, I deal with consumer review data and often some of the data points are based on one review. In this circumstance I do two analyses of the data, one with this data and one without and explain the difference. That way the user of the data can see both sets but are aware of the lack of data supporting some of the data points.
So we end up with the question of what we consider an outlier, and how we define an outlier. In descriptive statistics, exclusion of a few extreme observations within a large mass of data can be very helpful: the difference in the results with and without the extreme observations might be the issue of interest. More generally, comparing results from robust methods with standard methods follows the same idea.
In applied statistics, learning from evidence often is the focus of the exercise. In such a situation, the extreme observations are possibly those observations from which we learn most.
Extreme outliers will affect the mean a lot, but will not affect the median. So you can include outliers (if there is no other compelling reason to remove them) if you are computing a median, or a mode.
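A quick sketch of this point, using only Python's standard library (the numbers are invented for illustration):

```python
import statistics

sample = [12, 14, 13, 15, 14]
with_outlier = sample + [95]  # one extreme value appended

# The mean shifts a lot; the median does not move at all:
print(statistics.mean(sample), statistics.mean(with_outlier))      # 13.6 vs ~27.2
print(statistics.median(sample), statistics.median(with_outlier))  # 14 vs 14
```

This breakdown resistance of the median is why robust summary statistics are often recommended when a few points are suspect but cannot be confidently excluded.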
As others have said, if an outlier is too extreme to be believable, such as being likely due to measurement error, then it is best to exclude it. If the outlier is plausible, it may be best to analyze the data both with and without the outliers.
In logistic regression, it can be useful to show the risk factors that predict the outliers. But including outliers in the data may also mask the effect of predictors on less-extreme data that are not outliers. In linear regression, outliers can greatly affect the regression (the slope, r-value, and r-squared). It may be best to remove them from the linear regression, and then explain and describe them separately in some other way.
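To illustrate how strongly a single extreme point can move a least-squares slope, here is a small sketch with NumPy (the data are synthetic: a perfect line with one corrupted response value):

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0     # perfect line with slope 2
y_out = y.copy()
y_out[-1] = 60.0      # one extreme response value (the true value was 19)

slope_clean = np.polyfit(x, y, 1)[0]
slope_out = np.polyfit(x, y_out, 1)[0]
print(slope_clean, slope_out)  # the single outlier steepens the fitted slope considerably
```

One corrupted point out of ten more than doubles the estimated slope here, which is the kind of leverage effect that motivates either removal or robust regression.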
Dear Jerry,
thanks for your advice.
The impact of outliers will depend on the proportion of outliers in a data set (thus sample-size dependent) and on the values of the outliers in relation to the values frequently observed (the median). Perhaps one outlier is enough to create a biased (statistical) pattern when the value is really extreme. Extreme values can often be spotted just by looking at the values in a data set, also based on past experience with a model species/system.
Potentially there are several types of data filtering at different levels:
From observations to field notes (did I really see that phenomenon, probably not? What observations will be noted down?)
From the field notes to the computer file (did I really note this down in the field? It must have been an error)
From the computer file data set to the data set used for statistical analysis....
Perhaps some extreme values just result from typing/copy errors.
I prefer to share data sets/statistical outputs among potential contributors so that each person has the possibility to have a look at the note books, the computer data set, the statistical analysis, etc. If different persons come up with the same findings/conclusions, and if results are repeatable in time, I think the analysis is OK.
Dear Srikanta,
very good remark. Some examples mentioned above show that when two populations significantly differ in phenotype (e.g. L = Large versus S = Small), and there is some exchange between the two populations such that one individual L ends up in a population of S, L might be defined as an outlier for biological reasons, not methodological reasons.
So what is attributable to real physics/biology and what is attributable to an experimental/methodological artifact?
Intuitively, I guess you can use a simple test. Include the outlier, see what you get, and then exclude it and see what you then get. If it affects the mean significantly, then it must be eliminated from the sample.
Dear Mohamed,
Imagine two populations A and B that you would like to compare. Interesting is that your decision to keep or exclude an outlier value in population A will depend on the mean value and SD in population B. If there is a large difference in mean between population A and B you will decide not to exclude the outlier from population A because the conclusion that population B is larger than population A will not be altered when analyses include versus exclude the outlier from population A. However, when there is a small difference between the two populations, analyses with versus without the outlier from population A may change conclusions.
Thus, in this case, it is not the population that contains the outlier that will decide whether you keep or exclude it from the analysis....
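A hypothetical numeric sketch of this situation, using SciPy (the values are invented): a single suspect point in population A can decide whether the A-versus-B comparison reaches significance.

```python
import numpy as np
from scipy import stats

a = np.array([10.1, 10.4, 9.8, 10.2, 9.9, 30.0])  # last value: the suspected outlier
b = np.array([10.8, 11.1, 10.9, 11.3, 11.0, 10.7])

p_with = stats.ttest_ind(a, b).pvalue          # outlier kept
p_without = stats.ttest_ind(a[:-1], b).pvalue  # outlier excluded
print(p_with, p_without)  # the single point decides whether p crosses 0.05
```

With the outlier included, the inflated variance of A swamps the group difference; without it, the same comparison is clearly significant, which illustrates why the decision cannot be made by looking at population A alone.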
Personality profiles and selection/exclusion of outliers?
Data analyzers inspecting tables or figures may decide to exclude from data sets unusual data points named ‘outliers’. The outcome of statistical analyses will probably differ in analyses with versus without outlier data points. The identification of outliers may depend on statistical rules taking observed variation into account (e.g. points exceeding or not exceeding standard deviation values, which may be sample-dependent) or familiarity with model species or model systems (e.g. data points considered to be biologically impossible, perhaps caused by copy errors). However, underlying mechanisms creating outlier data points are usually unknown. People might always find published or unpublished arguments to exclude or keep ‘unusual’ data points in analyses. Perhaps selection of data points will depend on baseline knowledge of data analyzers and, why not, personality profiles, like more or less critical people independent from education background.
If you find data higher or lower than +/- 3 SD from the usual data points, a second review of the original data in your laboratory notes is useful, to see if you find an explanation for the appearance of this unique point, which may have originated from a methodological or copy error. In this case you can see this unique point as an outlier. Of course, data obtained in duplicate or triplicate are useful to find logical outliers. Comparing group A with the possible outlier against group B without it is a simple way to detect outliers, but a set of "outlier" points may be the result of another unknown biological mechanism and a source of doubt about the first working hypothesis. After writing my answer, I see that I agree with other RG members.
In reference to Regina's comment, simply plotting the data in a graph can show you how extreme any potential outliers really are. If you have a lot of datapoints, there will always be some that are beyond +/- 3 standard deviations from the mean. So you cannot automatically consider anything beyond three standard deviations as an outlier. (if you remove such points, you will then have a new mean and standard deviation, with new points that are beyond 3 standard deviations!) Something may look like an outlier in a small sample, but as the sample gets larger it becomes less like an outlier and simply fits somewhere on the bell-shaped curve, in a normally distributed sample.
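The circularity of the "+/- 3 SD" rule can be demonstrated directly; a sketch with NumPy, applied to perfectly clean simulated normal data that contains no true outliers at all:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(0, 1, 100_000)  # clean, normally distributed data with no true outliers

removed = []
for step in range(5):
    m, s = data.mean(), data.std()
    mask = np.abs(data - m) <= 3 * s
    removed.append(int(data.size - mask.sum()))
    data = data[mask]

print(removed)  # every pass flags a fresh batch of points beyond +/- 3 SD
```

Each deletion shrinks the standard deviation, so previously acceptable points now fall outside the new 3-SD fence, exactly as the comment above describes.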
For some basic measures such as the median, I do not ignore any. Other than that, trimmed data with 5% cut from above and below won't hurt, unless someone can show me a mathematical contradiction.
It is OK to exclude outliers if a) you have good reason (see above) AND b) you clearly state what you are excluding and why (to avoid loss of data, allow for alternate interpretations of the data, and otherwise protect readers from being misled).
The case of the outlier should be treated with much care, depending on the nature of the data and on the knowledge one has about the process of obtaining it. On some occasions the outlier is simply erroneous data, but on others it represents an important deviation from the average behavior of the sample.
Exclusion should not be the first step to think about; remember that, much as it may be an outlier, it may have a vast influence on the overall outcome. I would suggest that you first ascertain the influence of that particular outlier before you think of exclusion. Obtaining the influence will not be a single step; you may have to work out several procedures, since that outlier may as well be influential when combined with some other explanatory variables. In other words, whether to exclude will largely be determined by the level and importance of the model being built.
I think noting exceptions to the general rules will take care of all the concerns mentioned.
I exclude the outlier if it is produced by a measurement error; otherwise I prefer to use methods that are less susceptible to outliers.
Dear Stefan,
I agree with your comments. All that we call 'replicated' phenomena deserve more study, even when they are defined as outliers.
In the physical sciences, Chauvenet's Criterion - a test based on the Gaussian distribution that assesses whether one data point lying many error bars from the mean (an outlier) should be regarded as spurious and hence discarded - is often used. This procedure should be applied with care, and one should always ask if there is a possible reason why the outlier occurred.
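For readers who want to see Chauvenet's Criterion in code, here is a compact sketch using only the Python standard library (the readings are invented; the 0.5 expected-count cutoff is the usual formulation of the criterion):

```python
import math

def chauvenet_reject(x):
    """Return flags: True where Chauvenet's criterion suggests rejection.
    A point is suspect if the expected number of observations at least
    as far from the mean is below 0.5 (assuming Gaussian errors)."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    flags = []
    for v in x:
        z = abs(v - mean) / sd
        # two-tailed probability of being at least |z| SDs from the mean
        p = math.erfc(z / math.sqrt(2))
        flags.append(n * p < 0.5)
    return flags

readings = [9.9, 10.0, 10.1, 10.0, 9.8, 10.2, 10.1, 14.0]
print(chauvenet_reject(readings))  # only the 14.0 reading is flagged
```

As the comment above warns, this should be a prompt for investigation rather than an automatic deletion rule.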
I would look into the outlier data case by case and try to verify whether the outlier value corresponds with other variables in the same case; for example, if an income is too high, it is reasonable to check whether the person owns land, and so on. That is, whether the data makes sense. Often outlier data might not be meaningful to interpret, and the easy techniques mentioned above might be more helpful.
In addition to being a problem to be fixed, an outlier can reveal a unique phenomenon that can lead to new theoretical insights.
Hi Marcel,
Outlier values can come from poor recording of data during data collection in the field or during data coding and input in the office, among others. There are cases where it is difficult to identify outliers, and cases where they can be easily identified, such as when the possible range of the data is known, as with human age measured in years. Maybe an age of 25 yrs was erroneously recorded as 250 yrs. A human height value of 4.2 meters must be due to error and will definitely be discarded. In this case it is statistically justifiable not to include such data points in the statistical analysis. The researcher must know what is reasonable and justifiable data for his analysis. Plotting a boxplot diagram helps to identify where the outliers are coming from.
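The boxplot rule mentioned above can be made explicit with Tukey's IQR fences; a sketch with NumPy (the ages are invented, including the hypothetical 250-yrs typo):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Return values outside the Tukey boxplot fences Q1 - k*IQR and Q3 + k*IQR."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x < q1 - k * iqr) | (x > q3 + k * iqr)]

ages = np.array([23, 31, 27, 45, 38, 29, 33, 250])  # 250 is likely a typo for 25
print(iqr_outliers(ages))  # only the impossible age is flagged
```

The k = 1.5 fence is the conventional boxplot whisker rule; k = 3 is sometimes used to flag only "far out" points.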
In other words, you have to know somehow the model system which helps to decide what is 'normal' versus 'not normal', ideally before the data analyses start?
Thanks!
Yes, but not necessarily before the analyses start. It could be during the analyses or preliminary analyses. For instance the boxplot, which is part of the analyses, can suggest to the researcher where outliers might exist in the data. This could be handled before the final analyses begin.
Thanks
But then you do not know the model system very well, but start to discover it during the course of the analyses?
Dear Marcel,
You pose an interesting question. It is interesting because you assume to know that a certain observation is an outlier. How do you know that?
In my understanding, an outlier is an observation whose generating process is different from that of the rest of the data. The first issue: How can we identify an outlier? Various techniques have been designed for this task; ask, e.g., Google for outlier detection. The second issue is your question: How can/should this observation be used in the analysis of the data? Options are: to exclude or not to exclude; or you can use robust methods which are designed such that the influence of outliers, e.g. on parameter estimates, is limited.
However, the crucial point in this situation is that we should understand the relevance of the outlier: Is it bad data, e.g., erroneously recorded data, is it due to a local change in the data generating process, is it an indication that my understanding of the data generating process is incorrect, etc. The outlier can teach us more about the data generating process than the rest of the data. In any case, the first thing to do when an outlier is indicated: Try to explain and understand how this observation has been generated.
What is an outlier for individual A is not an outlier for individual B. This makes the details of the analyses of the same data set individual-specific and unreplicable across research teams?
The analysis of data is in any case a subjective task. Considering an observation as an outlier or not, choosing a distribution for the disturbance term of a model, using a parametric or a non-parametric approach to analyse the data: the analyst should give enough metainformation (why is observation A an outlier for me, etc.) to allow the consumer/reader of the results to understand the relevance of the analysis.
Are all these details allowed to be presented in methods sections of journals, and even when they are presented, do you think that readers are interested in exploring/replicating them?
If you give the same data set to 10 researchers and ask them to analyse that data set to test hypothesis A, do you think that the details of the 10 analyses applied will be exactly the same? And I do not even mention the fact that the 10 researchers might not have access to the same statistical tool/program (e.g. GLIM versus SAS versus R versus Statgraphics versus.......)
Interesting is that statistical tools apparently follow fashions. More than 15 years ago it was GLIM, right? Is GLIM still used today?
https://en.wikipedia.org/wiki/GLIM_(software)
What will be the future of R in 20 years from now?
There is no standard and universal definition of "outlier" so you have to first ask what criterion is used to label a value as an outlier. Note that in the case of spatial data, i.e. where each value is associated with a location, being an outlier also relates to where it occurs (in relationship) to values at other locations.
Some statistical computations are moderately insensitive to "outliers" whereas others are very sensitive. Also note that in general the validity of the results of a statistical analysis are always dependent on whether various statistical assumptions are satisfied. For some problems it is possible to design the experiment to ensure that the assumptions are satisfied but for other data sets it is not.
In some cases the appearance of "outliers" may indicate the sample is from more than one population.
Statistical analyses are often used to make a decision or to take an action. A key question about whether to include or exclude "outliers" is whether that choice would change the decision, and what the consequences would be of an erroneous choice.
It is important and useful to ask about "outliers" but don't expect simple answers that are appropriate in all situations.
Context is critical to the formulation of the research problem. That context often decides the fate of outliers. If, even without outliers apparent in the graphical data analysis, one feels the boundary data at the minimum and maximum are not valid and share a different relationship, one may winsorise the data or consider a tobit regression. If we have a paucity of data, we may find outliers showing up significantly and changing the meaning of the relationship between the dependent and independent variables. Here one has to decide on the meaningfulness of the relationship, including both direction and relationships - correlations and eventually causation. I would strongly advise the use of subject-matter expertise and a consideration/caveat that your answer changes in the presence of outliers, or that there are outliers.
Thus one is likely to find the problem of outliers more tractable if sufficient data points are available, in which case the outliers, seen best in graphical analyses including box plots and QQ plots, must be removed from the analysis.
Dear Marcel,
Outlier detection is not an easy subject. In regression context, robust regression methods are recommended if there are outliers in the data. There are many references in statistical and data mining journals which deal with outlier detection. I have attached a few:
Hodge, V. J., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial intelligence review, 22(2), 85-126.
Rousseeuw, Peter J., and Annick M. Leroy. Robust regression and outlier detection. Vol. 589. John Wiley & Sons, 2005.
Aggarwal, C. C., & Yu, P. S. (2001, May). Outlier detection for high dimensional data. In ACM Sigmod Record (Vol. 30, No. 2, pp. 37-46). ACM.
Zhang, K., Hutter, M., & Jin, H. (2009, April). A new local distance-based outlier detection approach for scattered real-world data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 813-822). Springer Berlin Heidelberg.
Aggarwal, C. C. (2015). Outlier analysis. In Data Mining (pp. 237-263). Springer International Publishing.
Akoglu, L., Tong, H., & Koutra, D. (2015). Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 29(3), 626-688.
Wang, Y., Wang, X., & Wang, X. L. (2016). A Spectral Clustering Based Outlier Detection Technique. In Machine Learning and Data Mining in Pattern Recognition (pp. 15-27). Springer International Publishing.
von Brünken, J., Houle, M. E., & Zimek, A. (2015). Intrinsic Dimensional Outlier Detection in High-Dimensional Data. Technical Report 2015-003E, NII.
Kim, Y. G., & Lee, K. M. (2015). Association-based outlier detection for mixed data. Indian Journal of Science and Technology, 8(25).
Jiang, F., & Chen, Y. M. (2015). Outlier detection based on granular computing and rough set theory. Applied Intelligence, 42(2), 303-322.
http://www.jstor.org/stable/3054624?seq=1#page_scan_tab_contents
http://arxiv.org/pdf/1404.4679
Conference Paper A Comparative Study of Outlier Detection for Large-scale Tra...
All these statistical methods assume somehow that outliers can be treated in a similar way in quite different model systems, whatever the underlying mechanisms of the outliers?
I would be worried if the conclusions of statistical analyses changed when a single data point was removed from the data set.
Wikipedia has a good page on outliers:
Outliers can have many anomalous causes. A physical apparatus for taking measurements may have suffered a transient malfunction. There may have been an error in data transmission or transcription. Outliers arise due to changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. A sample may have been contaminated with elements from outside the population being examined. Alternatively, an outlier could be the result of a flaw in the assumed theory, calling for further investigation by the researcher. Additionally, the pathological appearance of outliers of a certain form appears in a variety of datasets, indicating that the causative mechanism for the data might differ at the extreme end (King effect).
Retention
Even when a normal distribution model is appropriate to the data being analyzed, outliers are expected for large sample sizes and should not automatically be discarded if that is the case. The application should use a classification algorithm that is robust to outliers to model data with naturally occurring outlier points.
Exclusion
Deletion of outlier data is a controversial practice frowned upon by many scientists and science instructors; while mathematical criteria provide an objective and quantitative method for data rejection, they do not make the practice more scientifically or methodologically sound, especially in small sets or where a normal distribution cannot be assumed. Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known. An outlier resulting from an instrument reading error may be excluded but it is desirable that the reading is at least verified.
https://en.m.wikipedia.org/wiki/Outlier
There is another recent question (a more harmonious discussion and without DOWNVOTEs!)
https://www.researchgate.net/post/when_an_outlier_must_be_relevant_to_an_investigation_and_when_should_be_discarded
In my view, a researcher must try to search for the reasons why it is outlying instead of excluding it. Thanks.
A good article:
Wiggins, B. C. (2000). Detecting and Dealing with Outliers in Univariate and Multivariate Contexts.
http://eric.ed.gov/?id=ED448189
I am currently doing a project using MANOVA, and in my data set I have an outlier that skews the data by a rather large amount. However, it doesn't really affect the outcome: my test is still not significant. If I exclude the value, p is much closer to .05, but still non-significant. Any ideas?
To expand a bit on some of the answers you got: if there are values of the dependent variable that are outliers but valid, removing them and applying standard methods to the remaining data results in using the wrong standard error. There is a vast literature on how to do this in a technically sound manner, and it is easily demonstrated that this can make a substantial difference in the conclusions reached. For details see
Wilcox, R. R. (2017). Introduction to Robust Estimation and Hypothesis Testing, 4th Edition. San Diego, CA: Academic Press.
Congratulations to Rand on the appearance of the 4th edition of this successful book. Otherwise, I wonder whether anything new can be expected after 1.5 years of discussion of the outlier issue. Does somebody know any mechanism to finish the "question", ideally with a summary of the most relevant answers?
What about unequal sample sizes? I am trying to run a MANOVA with two levels on the IV, I have n=14 on one,and n=36 on the other...
Take all at n=14 or include only those with n=36 and see if the results change. Here again, present the two analyses
To correct for multiple tests on the same data set? How often do people analyse their own data sets in private (e.g. trying out several analyses to see what happens) before exposing often only one analysis to the public? How to handle situations where individual A analyses data set A in year 1 and individual B analyses data set A in year 2, perhaps also using different methods of analysis in different years, as is often the case in long-term studies focusing on a single study population?
I don't know how to handle this at the between-publication level. For instance, analyses of data set X from long-term study X have been presented in 10 publications, each adding one more study year or one more factor. In addition, the data for factor X have been used 10 times, whereas the data for factor Y only once... You also 'peek' at the data just by reading the former publications dealing with the same data set...
If you sample data, calculate a p-value, sample data again, then calculate a p-value again, you need to correct.
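A minimal sketch of what such a correction looks like, using made-up p-values from three hypothetical analyses of the same data set (Bonferroni and Holm are only two of several standard adjustments):

```python
# Hypothetical raw p-values from three tests on one data set.
p_values = [0.04, 0.01, 0.03]
m = len(p_values)

# Bonferroni: multiply each p-value by the number of tests (capped at 1).
bonferroni = [min(p * m, 1.0) for p in p_values]

# Holm step-down: sort ascending, scale by (m - rank), and keep the
# adjusted values monotone so a smaller raw p never ends up larger.
order = sorted(range(m), key=lambda i: p_values[i])
holm = [0.0] * m
running_max = 0.0
for rank, i in enumerate(order):
    running_max = max(running_max, min(p_values[i] * (m - rank), 1.0))
    holm[i] = running_max

print([round(p, 4) for p in bonferroni])  # [0.12, 0.03, 0.09]
print([round(p, 4) for p in holm])        # [0.06, 0.03, 0.06]
```

Note how all three raw p-values are below .05, but after either correction only one survives; this is exactly the situation the repeated-sampling warning above is about.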
This occurs across publications dealing with the same long-term data set, right? You will not correct your p-value in publication 3 based on the p-value exposed in a former publication 2, will you?
Example
In a long-term data set, you 'sample' every year in the same population/study plot.
In publication 1, you analyse the sample from years 1995-2000 and calculate a p-value. In publication 10, you analyse the sample from years 2014-2015 and calculate a p-value. Will you correct the p-value in publication 10 because of the published p-value from publication 1?
If it is all about 'philosophy', can statistical specialists who do not know the model system get it all wrong? Do some start to use one-tailed tests simply based on the results and hypotheses of former publications?
Indeed, never. Try our new paper:
Article TO DETERMINE SKEWNESS, MEAN AND DEVIATION WITH A NEW APPROAC...
It is justifiable to exclude 'outlier' data points from statistical analysis at a significance level of 0.005 or less, according to R. A. Johnson and D. W. Wichern (2007), Applied Multivariate Statistical Analysis. However, choosing a significance level for outlier detection is one of the problems. The second problem is that well-known statistical methods detect outliers in a data set under the assumption that the data are generated by a Gaussian distribution, but this assumption is valid only in particular cases. The second problem can be solved by applying normalizing transformations. For example, follow the links below.
Conference Paper Statistical anomaly detection techniques based on normalizin...
Conference Paper Detecting bivariate outliers on the basis of normalizing tra...
Conference Paper Multivariate Outlier Detection Technique Based on Normalizin...
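The second problem described above (the Gaussian assumption) can be illustrated in a few lines. The data below are invented and skewed; a point that looks extreme on the raw scale is unremarkable after a simple log transformation, one example of a normalizing transformation:

```python
# Sketch: the largest z-score in a skewed sample, before and after a
# normalizing (log) transformation. Data are made up for illustration.
import math
from statistics import mean, stdev

def max_z(data):
    """Largest absolute z-score in the sample (naive Gaussian yardstick)."""
    mu, sigma = mean(data), stdev(data)
    return max(abs(x - mu) / sigma for x in data)

raw = [1.1, 1.4, 2.0, 2.7, 3.1, 4.5, 6.0, 8.1, 12.2, 33.0]
logged = [math.log(x) for x in raw]

print(round(max_z(raw), 2))     # large: 33.0 looks like an outlier
print(round(max_z(logged), 2))  # modest: on the log scale it fits the spread
```

In other words, whether a point is flagged can depend as much on the assumed distribution as on the point itself.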
What to do in the framework of studies that wish to promote exact replication (see Kelly 2006)?
Exact replication implies the use of the same study system, the same model species, the same methods in different studies. What to do if one study presents data distribution A (with or without an outlier) and the other study aimed to be an exact replicate presents data distribution B (with or without an outlier)? Should statistical methods also be replicated in exactly the same way in different studies aimed to promote exact replication, whatever the data distributions of the different samples?
PS: 'Exact replication' becomes impossible given the fast evolution of techniques and methods. Given the fast technical advancements of the last 20 years (e.g. to identify outliers), will the methods and analyses conducted today be outdated tomorrow? And if so, why do the analyses today if we know they will be outdated tomorrow?
The original question really pertains to a very specific data set to which none of the responders has access. The various responses are conditioned on specific but different data sets that the responders have encountered, or on abstract circumstances that are not applicable to most data sets. Thus none of the responses, including my own, really provides a reliable guide to the query. It is like saying we have a variety of statistical tools at hand: hunt around among them, maybe one of them will be useful, but there is no way to tell (perhaps it should be one we don't know about).
I re-read the whole thread today... why?
In my data, one outlier was found, and I was recommended to exclude it.
I recalculated statistics - nope, conclusions are the same, differences near p=0.05 and p
On a different note, in qualitative studies, outliers are very important for us. They provide new insights and therefore we do not exclude them.
If the results of your statistical analyses depend on the presence/absence of some unusual data points, how reliable/robust are the interpretations of those results? Does this imply that your data set is not big enough?
PS: People spend a lot of time tracking one 'error' in their data set, but at the same time ignore the errors that have been committed during data sampling and data interpretation.
Accepting that your statistical results will depend on the statistical tool used (e.g. SAS vs R), e.g. to discover subtle effects, what will be the probability of finding the same results in a new study?
“Data analyzers inspecting tables or figures might decide to exclude from statistical analyses unusual data points sometimes called 'outlier' data points.”
Any scientific data must represent real phenomena. What is a data analyzer that inspects tables and figures? Should not such an expert rather investigate the phenomena under consideration?
Linas
You say you found one "outlier". What criteria did you use to determine that it was an outlier?
One that comes to mind (but which would almost never be usable) is that you have non-statistical evidence that the value is incorrect, e.g. a typo, a mis-reading of an instrument, an impossible value (violating a physical condition).
Certainly, for a given data set, an investigator may know enough about possible outcomes to recognize a data value as suspicious, but that is not the same as an outlier.
Statistical criteria: the value is out of range.
If there is non-statistical evidence, I'd call it a "mistake", not an outlier, and such values should definitely be excluded.
Why not accept that a statistical outlier might be caused by an identified unusual natural event, e.g. an unusual weather event, an unusual mutation, an unusual interspecific interaction, etc.?
It is a stable isotope value. Of course, deviations are understandable and acceptable.
You should familiarize yourself with the statistical models that you will apply to your data. Sometimes outliers can cause the statistical assumptions of a procedure to be violated. The most obvious is a violation of normality (i.e., a non-normal distribution): outliers skew your data, can cause residuals (from regression) to be non-normally distributed, and can of course make a mean misleading. But there are statistics for describing non-normality too: medians, tests of normality, and tests of skewness, to name a few.

It is often best to do your statistics both with and without the outlier(s) and see if the results are appreciably different. Sometimes you can do a procedure anyway and simply note the presence of outliers and how they could affect the statistical validity or interpretation of the results. Most readers of research study results will want some kind of familiar statistical analysis to be presented, even if the data violate the required assumptions to some degree.

One way to constructively use outliers is to incorporate them into the analysis of a trend. Outliers may ruin linear regression results but might be perfect for an exponential curve or a test of trend.
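The with-and-without comparison suggested above can be sketched in a few lines; the x/y values and the suspect last point here are invented for illustration:

```python
# Sketch: fit a simple least-squares slope with and without a suspect point.

def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 30.0]   # last point is the suspect outlier

slope_all = ols_slope(x, y)           # about 4.54
slope_trim = ols_slope(x[:-1], y[:-1])  # about 1.95

print(round(slope_all, 2), round(slope_trim, 2))
```

When the two estimates differ this much, the conclusion rests heavily on a single point, which is worth reporting explicitly rather than hiding behind one analysis.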
Let's discuss when to decide to exclude outliers.
Example:
Outliers decrease the probability of finding statistically significant results. When you analyse a large data set that includes outliers and yields statistically highly significant results, will you decide to exclude the outliers? And conversely, when you analyse a small data set that includes outliers and yields statistically non-significant results, will you decide not to exclude them?
INTERLUDIUM
Daniel Meurois and Anne Givaudan. Récits d'un voyageur de l'astral. J'ai Lu, Aventure secrète.
Page 222 (translated from the French): "Science is neutral. It is only those who practice it who are not."
If it is not a human or machine error, and excluding it from the data set does not change the solution drastically, I would keep it, as there is no perfect criterion for defining an outlier. One must weigh the risk of losing information (by excluding) against the deviation from the estimated solution.
@ Marcel, good point!!! I definitely keep outliers, when possible.
There are two reasons for an outlier measurement. The first is an erroneous measurement, and the second is a low-probability but correct measurement. The decision is statistical. First, I would calculate the probability of the "strange outcome" according to the distribution of the other points. For example, if the distribution is Gaussian and the probability of the outlier is 10^-80, the point should be considered an error. However, if its probability is 10^-3 and I measured 10 points, it should not be deleted. Other distributions, like long-tailed distributions, assign a high probability to outliers. Therefore I would not delete, e.g., the bank balance of Bill Gates as an error.
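The rule of thumb above can be sketched directly: compute the tail probability of the suspect value under the distribution fitted to the other points, then ask how many such values you would expect among your n measurements. The Normal(100, 5) model and the two candidate values are hypothetical.

```python
# Sketch: two-sided Gaussian tail probability of a suspect measurement.
import math

def tail_probability(x, mu, sigma):
    """P(|Z| >= |x - mu| / sigma) for a standard normal Z."""
    z = abs(x - mu) / sigma
    return math.erfc(z / math.sqrt(2.0))

# Hypothetical: 10 measurements believed to follow Normal(100, 5).
n = 10
p_rare = tail_probability(115.0, 100.0, 5.0)   # 3 sigma away
p_error = tail_probability(160.0, 100.0, 5.0)  # 12 sigma away

# Expected count of values this extreme among n draws:
print(n * p_rare)   # about 0.027 -- rare but plausible; keep the point
print(n * p_error)  # astronomically small -- almost surely an error
```

The same calculation under a long-tailed distribution would give a much larger probability to the extreme value, which is exactly why the Bill Gates bank balance should not be deleted.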
The simple answer is that you should not exclude outliers unless you determine a specific reason that they are not valid. Any other action would just introduce bias into the process.