If we take a data set, what are the primary steps for checking the normality assumption and for handling outliers? Suppose a huge number of outliers occur in the data set. How do we determine the limits for identifying outliers?
Well, based on common statistical practice, for samples of more than 30 observations a formal normality test is not required; instead, normality can be assessed graphically, for example with a normal probability plot. Also, in cases similar to your question, the most extreme outliers can be removed from the dataset; data more than 3-4 times the average of the dataset can be ignored.
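For what it's worth, here is a minimal Python sketch of that rule of thumb. It assumes positive-valued data (a multiply-the-average cutoff makes little sense otherwise), and the factor k = 3 and the example values are arbitrary illustrations; deviation-based rules are more common in practice:

```python
import numpy as np

def drop_beyond_mean_multiple(x, k=3.0):
    """Drop values larger than k times the dataset average.

    Mirrors the rule of thumb above; assumes positive-valued data,
    and k = 3 to 4 is a convention, not a universal standard.
    """
    x = np.asarray(x, dtype=float)
    cutoff = k * x.mean()
    return x[x <= cutoff], x[x > cutoff]

# Hypothetical example: one wild value among otherwise moderate data
data = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 60.0])
kept, dropped = drop_beyond_mean_multiple(data, k=3.0)
print("kept:", kept)        # [10.  11.   9.  10.5  9.5]
print("dropped:", dropped)  # [60.]
```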
This depends upon your purpose, and how you determine what you consider to be an outlier. An outlier is either a datum that includes an unacceptable amount of measurement error, or one that was out of scope, that is, it did not belong to the data population with which you intended to work. If you look for cases in the tails of your distribution, you may incorrectly assume good data are outliers, and leave in 'true' outliers that are not so far out in the tails, but that may be unavoidable to some extent. There is no substitute for careful data collection, but if, say, a number were obviously recorded in the wrong units, then that one 'bad' data point may overwhelm any results, and one could hardly let it stand.
Note that normality most commonly arises when the central limit theorem comes into play, or when looking at estimated residuals from regression, but even then it is probably not often crucial. Many different distributions occur naturally; it depends upon your application.
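As a concrete illustration of checking residuals rather than raw data, here is a small Python sketch using simulated data; the model, seed, and sample size are all arbitrary assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated regression: it is the residuals, not the raw y, that we
# usually examine for normality (hypothetical data for illustration).
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Shapiro-Wilk on the residuals; a normal probability (Q-Q) plot is
# often more informative than the p-value alone.
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")

(osm, osr), (qq_slope, qq_int, r) = stats.probplot(residuals, dist="norm")
print(f"Q-Q plot correlation r = {r:.3f}")  # near 1 suggests normality
```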
In the finite-population energy establishment surveys on which I worked, if data editing, using various methods including scatterplots to compare data sets, turned up very suspicious data, the establishment was contacted and asked to confirm. If "good" data could not be obtained, then imputation or reweighting was important; otherwise one would effectively be substituting a zero, which would bias estimated totals downward. I suppose that in a laboratory experiment drawing from an effectively infinite population, removing an outlier would be like doing one less experiment, but it may have been run under a unique condition. In any case, the real sample size is reduced by one, standard errors of parameter estimates grow, and population standard deviations, though constant, are estimated less accurately. Also, if a data point considered an outlier was collected badly under conditions such that you cannot consider it missing at random, then discarding it will bias results; perhaps there was a reason that datum was hard to observe. Stratifying or categorizing your data by characteristics that are common within each group, but different from others, can dampen the impact of such bias.
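To make the zero-substitution point concrete, here is a small Python sketch with hypothetical numbers; the sample, population size, and the simple weight adjustment are illustrative assumptions, not the survey's actual method:

```python
import numpy as np

# Hypothetical establishment sample: n = 5 drawn from N = 50 units,
# equal weights N/n = 10. One response (value 400) is judged unusable.
N = 50
sample = np.array([120.0, 95.0, 130.0, 110.0, 400.0])
base_weight = N / sample.size

est_full = base_weight * sample.sum()

# Dropping the unit but keeping the old weights effectively substitutes
# a zero for it, biasing the estimated total downward.
kept = sample[:-1]
est_zero_substitute = base_weight * kept.sum()

# Reweighting: spread the dropped unit's weight over the remaining
# respondents (one simple adjustment; not the only possibility).
adjusted_weight = N / kept.size
est_reweighted = adjusted_weight * kept.sum()

print(est_full, est_zero_substitute, est_reweighted)
# 8550.0  4550.0  5687.5  -- the zero-substitute total is most deflated
```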
Basically, if outliers occur at random, and can be effectively recognized, the only problem remaining is your smaller sample size. So you are "limited" in that your sample size is smaller, and you have to assume outliers occur at random, or else group data more homogeneously. You may want to research "response propensity" groups.
You could try an experiment: Identify outliers at three levels based upon how suspicious they are, even if all you can do is to look in the tails at three different cuts. See how much difference it makes to whatever you do with your data. (Report this information in an appendix, perhaps.) This may give you some basis upon which to judge.
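A rough Python sketch of that experiment might look like the following; the simulated data and the three percentile cuts are arbitrary assumptions, chosen only to show how the summary statistics move as the cut tightens:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data with a contaminated right tail
data = np.concatenate([rng.normal(50, 5, 200), rng.normal(120, 10, 6)])

# Three increasingly strict tail cuts (percentile levels are arbitrary
# choices for the experiment, not recommended defaults).
for pct in (99.5, 97.5, 95.0):
    lo, hi = np.percentile(data, [100 - pct, pct])
    trimmed = data[(data >= lo) & (data <= hi)]
    print(f"cut at {pct:5.1f}th pct: n={trimmed.size:3d}, "
          f"mean={trimmed.mean():6.2f}, sd={trimmed.std(ddof=1):5.2f}")
```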
Quite a difficult question. I think even a small number of outliers can mislead you into fitting the wrong probability distribution, so the question is not only about normality but also about correctly determining how the data are distributed. If you are suspicious about data quality, and you expect a particular type of probability distribution but suddenly find extreme-value distributions as the best fit, I recommend running an outlier detection test first, e.g., a lag plot, the Median Absolute Deviation test, or Grubbs's test. We developed one that uses the Kullback-Leibler distance as well.
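For example, here is a minimal Python sketch of the Median Absolute Deviation approach, using the common modified z-score; the 0.6745 scaling and the 3.5 cutoff follow the usual Iglewicz-Hoaglin recommendation, which is an assumption you may wish to tune:

```python
import numpy as np

def mad_outliers(x, threshold=3.5):
    """Flag points whose modified z-score exceeds `threshold`.

    Modified z-score: 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation. The 3.5 cutoff is the conventional
    Iglewicz-Hoaglin choice, not a universal constant.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        return np.zeros(x.size, dtype=bool)  # degenerate case: no spread
    modified_z = 0.6745 * (x - med) / mad
    return np.abs(modified_z) > threshold

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0])
print(data[mad_outliers(data)])  # -> [25.]
```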