Outliers should be rare. If they are not rare, the method (and hence the entire data set) is bad and/or not trustworthy.
If outliers are rare, they have no statistical impact. In small samples they will be extremely rare (which is not a statistical problem, although they may have a considerable impact in the particular cases where they do occur); in large samples they won't have any considerable leverage or impact - so why care?
In small samples there is another "problem": values may be outlying - not because these outlying values are "wrong", but because the rest of the values clump together more tightly than they should. So the "outlier" is actually the only datum "putting things right". Removing it would unnecessarily introduce bias, which is a rather bad thing.
Clearly, outliers with considerable leverage can indicate a problem with the measurement, the data recording, communication or whatever. In *such* cases it is absolutely recommended to remove these values. But the judgement about this is based on reasons external to the data. Either the values are known to be "impossible" (e.g. a recorded body weight of 87653 kg [the mistake could be that the weight was wrongly given in grams], or a hospital stay of more than 153 years, and such). Other outlying but not-impossible values might be caused by special circumstances, like a diseased study subject, a change of operator (because the original operator was sick on the day the suspect measurement was recorded), a power failure, something like this. But these reasons have to be identified to know whether the removal of the outlying value would improve the results or possibly introduce bias. It may not always be so easy to find the reasons, especially when looking at multivariate outliers.
The example given in Fig. 1 of the paper linked by Witold is a good example for all this, although it is unfortunately not well addressed there. The example shows about 30% "outliers". Given you had no idea what caused this strange pattern, it would be risky to decide to ignore either these 30% or the other 60%. If you had the information that more than 100 million phone calls per year in Belgium is factually out of range, then you do have an external reason to decide for the remaining 60%.
If there was only one year with a mistaken recording the impact on the regression line would have been really small (negligible, as I would say).
I assume that you already have a method to detect the outliers. For removal, a simple and good approach is to delete them. If you try to replace them with the mean, median or something else, you can easily create a bias. And if you are into KDD, there is not much use in discovering a bias you created yourself.
If the data set after deleting outliers still forms a representative sample you can carry out your planned analysis.
"Please understand my question. What if I want to ask, what is the procedure (techniques) for removing outliers in a data set."
Well, possibly by removing them - by not using them in your analysis?
Do you want technical examples?
In Excel, select the cell containing the "outlier". Press the Delete key on the keyboard.
In R, given the data.frame containing the data is named "df" and row i contains the "outlier", you get the data.frame with this line removed by df[-i,]. If you identified the "outliers" by a comparison that gives you a logical vector b, the removal of the lines fulfilling your comparison criteria is achieved with df[!b,].
Other software may have some menu item to mark values or lines as not to be used for subsequent analyses. Reading the manuals will help.
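For completeness, the same kind of removal in Python with pandas (assuming a DataFrame named df and a logical mask b, analogous to the R example above; the weight column and its values are made up for illustration):

```python
import pandas as pd

# Made-up data with one "impossible" value (a weight recorded in grams by mistake)
df = pd.DataFrame({"weight_kg": [72.4, 65.1, 87653.0, 80.2]})

# Drop row i (here the third row, index 2) - analogous to df[-i,] in R
i = 2
df_without_row = df.drop(index=i)

# Or build a logical mask b and keep only the rows NOT fulfilling the
# outlier criterion - analogous to df[!b,] in R
b = df["weight_kg"] > 1000   # the threshold is an assumption for illustration
df_clean = df[~b]
```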
In Excel you can make a scatter plot of the X and Y values to check the variation, and then delete the values (outliers) that lie far from the central value. The plot lets you identify the row and column of each such point, so you can find them.
Outlier detection is highly correlated with the analysis you want to do afterwards. For example, in variance-based algorithms like PCA, a small number of outliers won't have a huge impact. In distance-based models or classifiers, outliers will have an impact on the robustness of the algorithm, so removing outliers can be important. In other algorithms like Archetypal Analysis (a.k.a. Principal Convex Hull), outliers will have a huge impact.
So if you are sure the outliers don't add any valuable information to the dataset, you can remove them based on the optimisation criteria of your analysis. If you decide to use a distance-based analysis like the clustering algorithms k-means or k-medoids, you can use the Mahalanobis distance to detect outliers (see the 'mvoutlier' package in R) [1].
If the analysis has another optimization criterion, then you can use M-estimators to detect outliers within the objective function [3]. This is done, e.g., for archetypal analysis [2].
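A minimal sketch of the Mahalanobis-distance screening mentioned above, in plain Python rather than with the 'mvoutlier' package (the simulated data, the two planted outliers and the 0.999 chi-square cutoff are all assumptions for illustration):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Simulated correlated bivariate data plus two planted outliers
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=200)
X = np.vstack([X, [[8.0, -8.0], [10.0, 10.0]]])

# Squared Mahalanobis distance of each row to the sample mean
mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d = X - mu
d2 = np.einsum("ij,jk,ik->i", d, cov_inv, d)

# Under multivariate normality, d2 is roughly chi-square distributed with
# df = number of variables; flag observations beyond a high quantile
cutoff = chi2.ppf(0.999, df=X.shape[1])
outliers = np.where(d2 > cutoff)[0]
```

Note that the sample mean and covariance are themselves influenced by the outliers; robust estimates (as provided by packages like 'mvoutlier') are preferable when the contamination is heavy.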
regards,
David
Book Outlier detection
Article Weighted and robust archetypal analysis
Article Optimization Transfer Using Surrogate Objective Functions
An outlier is supposed to be a datum which is included by mistake; a point outside of the population of interest, or perhaps with so much measurement error that it cannot reliably be used. You should not remove a datum lightly, but there are times when including an outlier would invalidate results. ("Data editing" can be an area of importance.)
An outlier may or may not look like your "legitimate" data points. If one also happens to fall in an extreme tail of your legitimate population, it may cause a lot of trouble. On the other hand, the tails of that distribution contain data you do not want to throw out if you collect them.
So, you find data in the tails of your sample, and keep them if you have no good reason not to keep them. Those points may need special attention, to be fairly certain that they are supportable by subject matter theory and a review of the data collection methodology. If you do remove a datum or data, your report should explain why, and perhaps show in an appendix the overall results that would have been found had you kept those 'potential outliers.'
.....
You might pick a reasonable confidence limit, and examine the data which fall outside of it to see if there are any legitimate problems. But don't just throw out "inconvenient" data. (I assume you are talking about continuous data. Similar ideas may apply elsewhere.)
.....
Cheers - Jim
PS - For example: In official statistics, one may find data that were reported in one set of units at one point, say thousands of gallons of oil, and then the wrong units in a few cases at another time, say barrels of oil, and that shows up in a graph. Data editing is important for official statistics, comparing a population's previous period data to current values, for example, but if you overedit, removing or imputing for suspicious data that were actually correct, then you will likely indicate less change in a market than there actually was. You should assume data are correct unless you have good reason to think otherwise. Thus, as one expert in data quality I remember once saying, there is no substitute for good data collection in the first place.
I would like to answer Manoj's question: the simplest way to remove outliers is to erase them, but that does not always respond to what is needed.
Humbly, the question should be: what to do with the atypical values?
But before doing anything with them:
1. Study their nature; 99% of the time it is human error.
2. Analyze the possibility of incorporating new independent variables; this usually explains cases of outliers, and their number usually decreases.
If you analyzed the above and there are still outliers, then keep in mind the following.
- If n is small, the impact of the outliers is usually significant in statistical terms; therefore, resample.
- If n is large, then the impact will depend on the number of outliers. For one or two, eliminating them will have little impact; if there are many, analyze point 2.
- All cases are different; analyze each one on its own, without assuming it behaves like similar cases.
- How to detect them depends on the analysis you want to carry out. If it is univariate, Chebyshev's theorem is one option; another is based on the interquartile range. If it is a bivariate analysis, it must be examined with a scatterplot.
- Consider analyzing the atypical value according to the distribution of the data (symmetric or skewed).
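As a small illustration of the interquartile-range option mentioned above, a sketch in Python (the sample values are invented, and the 1.5*IQR fence is Tukey's conventional choice, not something prescribed in this answer):

```python
import numpy as np

# Invented univariate sample with one extreme value
x = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 25.0])

# Tukey-style fences: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = x[(x < lower) | (x > upper)]
```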
In fact, samples that are far from the median of the whole data are considered as unwanted samples or outliers. A common rule for flagging outlier points is:
|X - median(X)| > constant * STD
where X is the value in question (in the original image-processing context, the pixel value), "median(X)" is the calculated median of all data, "STD" is the standard deviation, and "constant" is a value between 0 and 1.
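A sketch of this rule in Python (the data and the chosen constant are assumptions for illustration):

```python
import numpy as np

# Invented data with one planted outlier
x = np.array([10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 42.0])

constant = 0.8   # a value between 0 and 1, per the rule above
is_outlier = np.abs(x - np.median(x)) > constant * np.std(x)
kept = x[~is_outlier]
```

Keep in mind that the standard deviation itself is inflated by the outlier, which is why robust scales such as the MAD are often preferred.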
The answers given here to the original question fall into two categories: how to detect outliers (usually by interquartile range or standard deviation) and what to do when outliers are found, based on the answerer's personal experience. At this point I would like to repeat the link to the article recommended by the user whose profile is deleted (it's a pity):
Article Best-Practice Recommendations for Defining, Identifying, and...
It includes all the answers given here so far and much more in the nice categorized way.
In the literature you can find, as a reference, Tukey (1977), who discusses the analysis of dispersion and asymmetry, as well as the conceptualization of Tukey's hinges. Personally, I had it as a reference when I studied (the statistics degree) at university in the 80s. The other point is that it is not a method to eliminate outliers, but to detect and analyze them.
This is highly subjective. Another facet of the same question is the dimensionality of the underlying data. While most of the methods listed here are valid for univariate outliers, the problem becomes more complex for multivariate ones. A nice start would be to compute the Mahalanobis distance for each row of your data and then find the extremes (compare with the chi-square test statistic). It gives much better results, especially if the data is approximately multivariate normal.
I do not know if I misunderstood the question. Outliers are not eliminated in the first instance, since depending on the context they can give a lot of information (Fréchet's distribution, for example).
They are identified and analyzed first; then one decides what to do with the outliers.
The Hampel test is resistant, which means that it is not sensitive to outliers; it has no restrictions on the size of the data set, and it does not require statistical tables.
In general, a Hampel method that includes handling the outliers could be a better strategy.
The use of Least Absolute Deviations (the L1-norm method) for fitting data with possible outliers is much more effective in dealing with outliers than methods based on Least Squares, particularly when the data follow a heavy-tailed distribution.
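A tiny illustration of why the L1 criterion is more robust: fitting a constant by least squares gives the mean, while fitting it by least absolute deviations gives the median (the numbers are made up):

```python
import numpy as np

# Five well-behaved values plus one gross outlier
x = np.array([5.0, 5.1, 4.9, 5.2, 4.8, 100.0])

ls_estimate = x.mean()       # minimizes the sum of squared residuals
lad_estimate = np.median(x)  # minimizes the sum of absolute residuals
```

The single outlier drags the least-squares estimate far from the bulk of the data, while the L1 estimate barely moves.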
The classical approach to screen outliers is to use the standard deviation SD: For normally distributed data, all values should fall into the range of mean +/- 2SD. Observations that are outside 2SD may be considered outliers, and some may even use 3SD to rule out outliers.
However, I have been reading about some articles that critique the use of the SD method because mean and SD are greatly influenced by the outlier and thus are unreliable.
Some articles suggest the use of the MAD (median absolute deviation) instead: it is similar to the SD method, but uses the median and MADe in place of the mean and SD. It assumes all values should fall into the range median +/- 2MADe, where MADe = 1.483*MAD and MAD = median(|xi - median(x)|), i = 1, 2, ..., n. Values outside median +/- 2MADe are considered outliers.
Here is one reference, A note on detecting statistical outliers in psychophysical data by Pete R Jones. You should be able to find more.
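A minimal sketch of the MADe rule above in Python (the sample is invented for illustration):

```python
import numpy as np

# Invented sample with one planted outlier
x = np.array([7.1, 6.9, 7.0, 7.2, 6.8, 7.3, 30.0])

med = np.median(x)
mad = np.median(np.abs(x - med))  # median absolute deviation
made = 1.483 * mad                # scaled to be consistent with SD for normal data
is_outlier = np.abs(x - med) > 2 * made
```

Because the median and MAD are barely affected by the outlier itself, this screen avoids the masking problem of the mean/SD rule.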
When you decide to remove outliers, document the excluded data points and explain your reasoning. You must be able to attribute a specific cause for removing outliers. Another approach is to perform the analysis with and without these observations and discuss the differences.
For univariate data: examine the distribution of standardized observations. For small data sets (n < 80), the guidelines indicate that cases with standardized values > 2.5 are outliers; for larger data sets, the threshold is a standardized value between 3 and 4.
For bivariate data: examine a scatter plot on which you overlay a specified confidence ellipse (varying between 50 and 90 percent of the distribution) for a bivariate normal distribution. This provides a graphical display of the confidence limits, and you can identify outliers.
For matrix (multivariate) data: the Mahalanobis measure D2 is a measure of the distance of each observation in a multidimensional space from the mean center of the observations. Because it provides a common measure of multidimensional centrality, it also has statistical properties that allow significance tests. Given the nature of these tests, a conservative level (0.001) is suggested as the threshold value for designating an outlier.
You can use Excel, R , SAS or Python, but these are the basics to understand.
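A sketch of the univariate screening step in Python (the sample is made up; the thresholds follow the guidelines above):

```python
import numpy as np

# Invented small sample (n < 80) with one suspicious value
x = np.array([3.1, 2.9, 3.0, 3.2, 2.8, 3.3, 3.1, 2.9, 15.0])

z = (x - x.mean()) / x.std()            # standardized observations
threshold = 2.5 if len(x) < 80 else 4.0  # small- vs large-sample guideline
flagged_idx = np.where(np.abs(z) > threshold)[0]
```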
Best wishes
Hair J, Anderson R, Tatham R, Black W (Eds). Multivariate Data Analysis. 5th edition. Pearson, Prentice Hall.
Jorge Mauricio González Who says so??? Outliers are the most interesting points to test your models! Removing them because you think they are "too far from the model" is just plain dangerous. Outliers should only be removed when they are really erroneous measurements or typing mistakes.
Koen Van de Moortel I totally agree; in practice it is the researcher, and his knowledge of the study population, that will determine whether the data should be removed.