Outliers should be rare. If they are not rare, the method (and hence the entire data set) is bad and/or not trustworthy.
If outliers are rare, they have no statistical impact. In small samples they will be extremely rare (which is not a statistical problem, although they may have a considerable impact in the particular cases where they do occur); in large samples they won't have any considerable leverage or impact - so why care?
In small samples there is another "problem": values may be outlying - not because these outlying values are "wrong", but because the rest of the values clump together more tightly than they should. So the "outlier" is actually the only datum "putting things right". Removing it would unnecessarily introduce bias, which is a rather bad thing.
Clearly, outliers with considerable leverage can indicate a problem with the measurement, the data recording, communication or whatever. In *such* cases it is absolutely recommended to remove these values. But the judgement about this is based on reasons external to the data. Either the values are known to be "impossible" (e.g. a recorded body weight of 87653 kg [the mistake could be that the weight was wrongly given in grams], or a hospital stay of more than 153 years, and such). Other outlying but not-impossible values might be caused by special circumstances, like a diseased study subject, a change of operator (because the original operator was sick on the day the suspect measurement was recorded), a power failure, something like this. But these reasons have to be identified to know whether the removal of the outlying value would improve the results or possibly introduce bias. It may not always be so easy to find the reasons, especially when looking at multivariate outliers.
The example given in Fig. 1 of the paper linked by Witold is a good example for all this, although it is unfortunately not well addressed there. The example shows about 30% "outliers". Given you had no idea what caused this strange pattern, it would be risky to decide to ignore either these 30% or the other 60%. If you had the information that more than 100 million phone calls per year in Belgium is factually out of range, then you do have an external reason to decide for the remaining 60%.
If there was only one year with a mistaken recording the impact on the regression line would have been really small (negligible, as I would say).
I assume that you already have a method to detect the outliers. For removal, a simple and good approach is to delete them. If you try to replace them with the mean, median or something else, you can easily create a bias. And if you are into KDD, there is not much use in discovering a bias you created yourself.
If the data set after deleting outliers still forms a representative sample you can carry out your planned analysis.
"Please understand my question. What if I want to ask, what is the procedure (techniques) for removing outliers in a data set."
Well, possibly by removing them - by not using them in your analysis?
Do you want technical examples?
In Excel, select the cell containing the "outlier". Press the Delete key on the keyboard.
In R, given the data.frame containing the data is named "df" and row i contains the "outlier", you get the data.frame with this line removed by df[-i,]. If you identified the "outliers" by a comparison that gives you a logical vector b, the removal of the lines fulfilling your comparison criteria is achieved with df[!b,].
Other software may have some menu item to mark values or lines as not to be used for subsequent analyses. Reading the manuals will help.
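For completeness, the same kind of removal in Python with pandas (assuming a DataFrame named df and a logical mask b, analogous to the R example above; the weight column and its values are made up for illustration):

```python
import pandas as pd

# Made-up data with one "impossible" value (a weight recorded in grams by mistake)
df = pd.DataFrame({"weight_kg": [72.4, 65.1, 87653.0, 80.2]})

# Drop row i (here the third row, index 2) - analogous to df[-i,] in R
i = 2
df_without_row = df.drop(index=i)

# Or build a logical mask b and keep only the rows NOT fulfilling the
# outlier criterion - analogous to df[!b,] in R
b = df["weight_kg"] > 1000   # the threshold is an assumption for illustration
df_clean = df[~b]
```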
In Excel you can make a scatter plot of the X and Y values to check the variation, and then delete the values (outliers) that lie far from the central value. The plot lets you identify the row and column of each such point, so you can find them.
Outlier detection is highly correlated with the analysis you want to do afterwards. For example, in variance-based algorithms like PCA, a small number of outliers won't have a huge impact. In distance-based models or classifiers, outliers will have an impact on the robustness of the algorithm, so removing outliers can be important. In other algorithms like Archetypal Analysis (a.k.a. Principal Convex Hull), outliers will have a huge impact.
So if you are sure the outliers don't add any valuable information to the dataset, you can remove them based on the optimisation criteria of your analysis. If you decide to use a distance-based analysis like the clustering algorithms k-means or k-medoids, you can use the Mahalanobis distance to detect outliers (see the 'mvoutlier' package in R) [1].
If the analysis has another optimization criterion, then you can use M-estimators to detect outliers within the objective function [3]. This is done, e.g., for archetypal analysis [2].
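A minimal sketch of the Mahalanobis-distance screening mentioned above, in plain Python rather than with the 'mvoutlier' package (the simulated data, the two planted outliers and the 0.999 chi-square cutoff are all assumptions for illustration):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Simulated correlated bivariate data plus two planted outliers
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=200)
X = np.vstack([X, [[8.0, -8.0], [10.0, 10.0]]])

# Squared Mahalanobis distance of each row to the sample mean
mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d = X - mu
d2 = np.einsum("ij,jk,ik->i", d, cov_inv, d)

# Under multivariate normality, d2 is roughly chi-square distributed with
# df = number of variables; flag observations beyond a high quantile
cutoff = chi2.ppf(0.999, df=X.shape[1])
outliers = np.where(d2 > cutoff)[0]
```

Note that the sample mean and covariance are themselves influenced by the outliers; robust estimates (as provided by packages like 'mvoutlier') are preferable when the contamination is heavy.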
regards,
David
Book Outlier detection
Article Weighted and robust archetypal analysis
Article Optimization Transfer Using Surrogate Objective Functions
An outlier is supposed to be a datum which is included by mistake; a point outside of the population of interest, or perhaps with so much measurement error that it cannot reliably be used. You should not remove a datum lightly, but there are times when including an outlier would invalidate results. ("Data editing" can be an area of importance.)
An outlier may or may not look like your "legitimate" data points. If one also happens to fall in an extreme tail of your legitimate population, it may cause a lot of trouble. On the other hand, the tails of that distribution contain data you do not want to throw out if you collect them.
So, you find data in the tails of your sample, and keep them if you have no good reason not to keep them. Those points may need special attention, to be fairly certain that they are supportable by subject matter theory and a review of the data collection methodology. If you do remove a datum or data, your report should explain why, and perhaps show in an appendix the overall results that would have been found had you kept those 'potential outliers.'
.....
You might pick a reasonable confidence limit, and examine the data which fall outside of it to see if there are any legitimate problems. But don't just throw out "inconvenient" data. (I assume you are talking about continuous data. Similar ideas may apply elsewhere.)
.....
Cheers - Jim
PS - For example: In official statistics, one may find data that were reported in one set of units at one point, say thousands of gallons of oil, and then the wrong units in a few cases at another time, say barrels of oil, and that shows up in a graph. Data editing is important for official statistics, comparing a population's previous period data to current values, for example, but if you overedit, removing or imputing for suspicious data that were actually correct, then you will likely indicate less change in a market than there actually was. You should assume data are correct unless you have good reason to think otherwise. Thus, as one expert in data quality I remember once saying, there is no substitute for good data collection in the first place.
I would like to answer Manoj's question: the simplest way to remove outliers is to erase them, but that does not always respond to what is needed.
Humbly, the question should be: what to do with the atypical values?
But before doing anything with them:
1. Study their nature; 99% of the time it is human error.
2. Analyze the possibility of incorporating new independent variables; this usually explains cases of outliers, and their number usually decreases.
If you analyzed the above and there are still outliers, then keep in mind the following.
- If n is small, the impact of the outliers is usually significant in statistical terms; therefore, resample.
- If n is large, then the impact will depend on the number of outliers. For one or two, eliminating them will have little impact; if there are many, analyze point 2.
- All cases are different; analyze each one on its own, without assuming it behaves like similar cases.
- How to detect them depends on the analysis you want to carry out. If it is univariate, Chebyshev's theorem is one option; another is based on the interquartile range. If it is a bivariate analysis, it must be examined with a scatterplot.
- Consider analyzing the atypical value according to the distribution of the data (symmetric or skewed).
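As a small illustration of the interquartile-range option mentioned above, a sketch in Python (the sample values are invented, and the 1.5*IQR fence is Tukey's conventional choice, not something prescribed in this answer):

```python
import numpy as np

# Invented univariate sample with one extreme value
x = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 25.0])

# Tukey-style fences: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = x[(x < lower) | (x > upper)]
```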
In fact, samples that are far from the median of the whole data are considered as unwanted samples or outliers. A common rule for flagging outlier points is:
|X - median(X)| > constant * STD
where X is the value in question (in the original image-processing context, the pixel value), "median(X)" is the calculated median of all data, "STD" is the standard deviation, and "constant" is a value between 0 and 1.
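A sketch of this rule in Python (the data and the chosen constant are assumptions for illustration):

```python
import numpy as np

# Invented data with one planted outlier
x = np.array([10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 42.0])

constant = 0.8   # a value between 0 and 1, per the rule above
is_outlier = np.abs(x - np.median(x)) > constant * np.std(x)
kept = x[~is_outlier]
```

Keep in mind that the standard deviation itself is inflated by the outlier, which is why robust scales such as the MAD are often preferred.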
The answers given here to the original question fall into two categories: how to detect outliers (usually by interquartile range or standard deviation) and what to do when outliers are found, based on the answerer's personal experience. At this point I would like to repeat the link to the article recommended by the user whose profile is deleted (it's a pity):
Article Best-Practice Recommendations for Defining, Identifying, and...
It includes all the answers given here so far and much more in the nice categorized way.
In the literature you can find, as a reference, Tukey (1977), who discusses the analysis of dispersion and asymmetry, as well as the conceptualization of Tukey's hinges. Personally, I had it as a reference when I studied (the statistics degree) at university in the 80s. The other point is that it is not a method to eliminate outliers, but to detect and analyze them.
This is highly subjective. Another facet of the same question is the dimensionality of the underlying data. While most of the methods listed here are valid for univariate outliers, the problem becomes more complex for multivariate ones. A nice start would be to compute the Mahalanobis distance for each row of your data and then find the extremes (compare with the chi-square test statistic). It gives much better results, especially if the data is approximately multivariate normal.
I do not know if I misunderstood the question. Outliers are not eliminated in the first instance, since depending on the context they can give a lot of information (Fréchet's distribution, for example).
They are identified and analyzed first; then one decides what to do with the outliers.
The Hampel test is resistant, which means that it is not sensitive to outliers; it has no restrictions on the size of the data set, and it does not require statistical tables.
In general, a Hampel method that includes handling the outliers could be a better strategy.
The use of Least Absolute Deviations (the L1-norm method) for fitting data with possible outliers is much more effective in dealing with outliers than methods based on Least Squares, particularly when the data follow a heavy-tailed distribution.
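A tiny illustration of why the L1 criterion is more robust: fitting a constant by least squares gives the mean, while fitting it by least absolute deviations gives the median (the numbers are made up):

```python
import numpy as np

# Five well-behaved values plus one gross outlier
x = np.array([5.0, 5.1, 4.9, 5.2, 4.8, 100.0])

ls_estimate = x.mean()       # minimizes the sum of squared residuals
lad_estimate = np.median(x)  # minimizes the sum of absolute residuals
```

The single outlier drags the least-squares estimate far from the bulk of the data, while the L1 estimate barely moves.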
The classical approach to screen outliers is to use the standard deviation SD: For normally distributed data, all values should fall into the range of mean +/- 2SD. Observations that are outside 2SD may be considered outliers, and some may even use 3SD to rule out outliers.
However, I have been reading about some articles that critique the use of the SD method because mean and SD are greatly influenced by the outlier and thus are unreliable.
Some articles suggest the use of the MAD (median absolute deviation) instead: it is similar to the SD method, but uses the median and MADe in place of the mean and SD. It assumes all values should fall into the range median +/- 2MADe, where MADe = 1.483*MAD and MAD = median(|xi - median(x)|), i = 1, 2, ..., n. Values outside median +/- 2MADe are considered outliers.
Here is one reference, A note on detecting statistical outliers in psychophysical data by Pete R Jones. You should be able to find more.
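A minimal sketch of the MADe rule above in Python (the sample is invented for illustration):

```python
import numpy as np

# Invented sample with one planted outlier
x = np.array([7.1, 6.9, 7.0, 7.2, 6.8, 7.3, 30.0])

med = np.median(x)
mad = np.median(np.abs(x - med))  # median absolute deviation
made = 1.483 * mad                # scaled to be consistent with SD for normal data
is_outlier = np.abs(x - med) > 2 * made
```

Because the median and MAD are barely affected by the outlier itself, this screen avoids the masking problem of the mean/SD rule.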
When you decide to remove outliers, document the excluded data points and explain your reasoning. You must be able to attribute a specific cause for removing outliers. Another approach is to perform the analysis with and without these observations and discuss the differences.
For univariate data: examine the distribution of standardized observations. For small data sets (n < 80), the guidelines indicate that cases with standardized values > 2.5 are outliers; for larger data sets, the threshold is a standardized value between 3 and 4.
For bivariate data: examine a scatter plot on which you overlay a specified confidence ellipse (varying between 50 and 90 percent of the distribution) for a bivariate normal distribution. This provides a graphical display of the confidence limits, and you can identify outliers.
For matrix (multivariate) data: the Mahalanobis measure D2 is a measure of the distance of each observation in a multidimensional space from the mean center of the observations. Because it provides a common measure of multidimensional centrality, it also has statistical properties that allow significance tests. Given the nature of these tests, a conservative level (0.001) is suggested as the threshold value for designating an outlier.
You can use Excel, R , SAS or Python, but these are the basics to understand.
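A sketch of the univariate screening step in Python (the sample is made up; the thresholds follow the guidelines above):

```python
import numpy as np

# Invented small sample (n < 80) with one suspicious value
x = np.array([3.1, 2.9, 3.0, 3.2, 2.8, 3.3, 3.1, 2.9, 15.0])

z = (x - x.mean()) / x.std()            # standardized observations
threshold = 2.5 if len(x) < 80 else 4.0  # small- vs large-sample guideline
flagged_idx = np.where(np.abs(z) > threshold)[0]
```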
Best wishes
Hair J, Anderson R, Tatham R, Black W (Eds). Multivariate Data Analysis. 5th edition. Pearson, Prentice Hall.
Jorge Mauricio González Who says so??? Outliers are the most interesting points to test your models! Removing them because you think they are "too far from the model" is just plain dangerous. Outliers should only be removed when they are really erroneous measurements or typing mistakes.
Koen Van de Moortel I totally agree; in practice it is the researcher, and his knowledge of the study population, that will determine whether the data should be removed.