I am looking for well-tested methods to exclude outliers statistically. Could someone suggest suitable methods? Links to instructions or clarifications would be appreciated.
Certainly, there are several well-established statistical methods for identifying and excluding outliers from a dataset. Here are some common approaches; a short, illustrative Python sketch for each follows the list:
Z-Score or Standard Deviation Method: Calculate the z-score for each data point, i.e. how far it lies from the mean in units of standard deviations. Data points with absolute z-scores beyond a chosen threshold (commonly 2 or 3) are flagged as outliers and can be excluded.
Modified Z-Score: A variation of the z-score method that uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation. It is less affected by extreme values and therefore better suited to non-normally distributed data.
Interquartile Range (IQR) Method: Calculate the IQR (the range between the first and third quartiles) and flag data points falling more than a chosen multiple of the IQR (commonly 1.5) below the first quartile or above the third quartile as outliers.
Tukey's Fences: Essentially the same idea as the IQR method: thresholds ("fences") are set at 1.5 × IQR for possible outliers and 3 × IQR for far-out points, and values beyond them are flagged as potential outliers.
Grubbs' Test: A formal statistical test for whether a single data point is an outlier. It calculates a test statistic and compares it to a critical value derived from the t-distribution; it assumes the underlying data are approximately normal.
Dixon's Test (Q Test): A method for identifying a single outlier in a small dataset. It uses the ratio of the gap between the suspect value and its nearest neighbour to the overall range of the data.
Mahalanobis Distance: A multivariate method that measures the distance of each data point from the multivariate mean while taking the covariance structure into account. Points with unusually large Mahalanobis distances are flagged as outliers.
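To make the z-score method concrete, here is a minimal sketch in Python (the data values and the threshold of 3 are illustrative assumptions, not recommendations):

```python
import numpy as np
from scipy import stats

data = np.array([9.5, 10.1, 9.8, 10.3, 10.0, 9.9, 10.2,
                 9.7, 10.4, 9.6, 10.0, 9.9, 10.1, 25.0])

# z-score: distance from the mean in units of standard deviation
z = np.abs(stats.zscore(data))

threshold = 3.0  # 2.0 or 3.0 are common conventions
outliers = data[z > threshold]
cleaned = data[z <= threshold]
print("outliers:", outliers)  # flags only 25.0
```

Note that in very small samples a single extreme value inflates the standard deviation enough that its z-score may never reach 3; that masking effect is one motivation for the modified z-score below.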
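A minimal sketch of the modified z-score, computing the MAD directly with NumPy; the 0.6745 scaling constant and the 3.5 cutoff follow the convention usually attributed to Iglewicz and Hoaglin (the data are again made up):

```python
import numpy as np

data = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 15.7, 10.2])

median = np.median(data)
mad = np.median(np.abs(data - median))  # median absolute deviation

# 0.6745 scales the MAD to be comparable to the standard deviation
# under normality; 3.5 is the commonly cited cutoff
modified_z = 0.6745 * (data - median) / mad
outliers = data[np.abs(modified_z) > 3.5]
print("outliers:", outliers)  # flags 15.7
```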
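The IQR method and Tukey's fences can be illustrated together; this sketch uses the conventional 1.5 × IQR inner fences (outer fences at 3 × IQR would be built the same way):

```python
import numpy as np

data = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 15.7, 10.2])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Tukey's inner fences: values beyond them are "possible" outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
cleaned = data[(data >= lower) & (data <= upper)]
print("outliers:", outliers)  # flags 15.7
```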
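A sketch of Grubbs' test; the function name and example data are mine, but the test statistic and the t-distribution-based critical value follow the standard two-sided formulation:

```python
import numpy as np
from scipy import stats

def grubbs_test(data, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier.

    Assumes the data (apart from the suspect point) are
    approximately normal. Returns the suspect value and
    whether it is a significant outlier at level alpha.
    """
    x = np.asarray(data, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)

    # Test statistic: largest absolute deviation from the mean,
    # in sample-standard-deviation units
    idx = np.argmax(np.abs(x - mean))
    g = abs(x[idx] - mean) / sd

    # Critical value derived from the t-distribution
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

    return x[idx], g > g_crit

suspect, is_outlier = grubbs_test([10.1, 9.8, 10.3, 10.0, 9.9, 15.7, 10.2])
print(suspect, is_outlier)  # 15.7 True
```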
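A sketch of Dixon's Q test; the hard-coded critical values are the commonly tabulated 95% values for n = 3 to 10, so please verify them against a published table before relying on them:

```python
import numpy as np

# Critical Q values at the 95% confidence level (standard Dixon
# table, r10 statistic); the test is intended for small samples
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

def dixon_q_test(data):
    """Dixon's Q test for one outlier at either end of a small sample."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    rng = x[-1] - x[0]
    q_low = (x[1] - x[0]) / rng     # gap at the low end over the range
    q_high = (x[-1] - x[-2]) / rng  # gap at the high end over the range
    q, suspect = max((q_low, x[0]), (q_high, x[-1]))
    return suspect, q > Q_CRIT_95[n]

suspect, is_outlier = dixon_q_test([10.1, 9.8, 10.3, 10.0, 9.9, 15.7, 10.2])
print(suspect, is_outlier)  # 15.7 True
```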
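Finally, a sketch of Mahalanobis-distance screening for multivariate data; the chi-squared cutoff assumes approximate multivariate normality, and in practice robust estimates of the mean and covariance (e.g. MCD) are often preferred to the plain sample estimates used here:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
# Two correlated variables plus one planted multivariate outlier:
# (4, -4) is extreme *given the positive correlation*, even though
# neither coordinate is extreme on its own
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, [4.0, -4.0]])

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Squared Mahalanobis distance of every row from the mean
diff = X - mean
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Under multivariate normality, d2 is roughly chi-squared with
# p degrees of freedom; use its 97.5% quantile as a cutoff
cutoff = chi2.ppf(0.975, df=X.shape[1])
outliers = X[d2 > cutoff]
print(len(outliers), "flagged points")
```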
Remember that the choice of method depends on the nature of your data, your specific goals, and any assumptions you're making about your data distribution. It's also important to consider the potential impact of excluding outliers on your analysis and results. Always document and justify your outlier removal process.
Whether a measurement is an "outlier" depends entirely on the model you use. A point may be far away from the "best-fitting" line through your data, yet lie close to an exponential function through the same data.
The only reason I know of to remove outliers is that they are genuinely erroneous measurements. A visual inspection of a scatterplot may suggest which points could be bad, although such points can also sit in the middle of your data cloud...
Statistical methods can help you identify points that are "far off from the rest". Whether these values should or should not be excluded is not a statistical question but a subject-matter question. More often than not in research, such "outliers" are desperately trying to tell you that your assumptions are bad. Excluding such values just makes your wrong assumptions look better and your conclusions more wrong.
Generally, outliers should be rare. If they are, then they are usually not a problem in your analysis. If they are not rare, your whole experiment/assay is in doubt.
There is no statistical method that excludes outliers for you. It is a matter of judgment based on the background of the respective data. The decision is easy if you can identify a single aberrant case or a measurement error, but if the data are real, the decision is difficult.
My advice: change the question, and determine exclusion criteria in advance.
For example, in Chile, a land of earthquakes, what counts as an atypical value? According to the measurements, anything over magnitude 8.0, up to the 9.5 that has been recorded. These values do occur, but a 9.5 appears perhaps once every 60 or 70 years, or even less often, whereas magnitude-8 events are more frequent. So what decision do you make?