Dear all, I have a total of 285 respondents. After data screening, I found my data to contain 33 outliers at the univariate level and 8 at the multivariate level. Is this figure acceptable? My data are normally distributed according to the normality test.
What really matters is the permissible error you set for your research, that is, the range of accuracy you want to achieve. That relates to the issue of z values, which are standard deviation values.
The range of +/-3 is very approximate. According to Grubbs' test, in your case the range is +/-3.709 standard deviations for N = 285 and a significance level of 0.05/(2N). If you apply this test to your data, the number of outliers will be reduced.
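For anyone who wants to reproduce that figure, here is a minimal sketch of the two-sided Grubbs critical value, assuming SciPy is available; the exact number depends on how the significance level is split.

```python
# Sketch: two-sided Grubbs critical value for N = 285, alpha = 0.05 (SciPy assumed)
import numpy as np
from scipy import stats

def grubbs_critical(n, alpha=0.05):
    # t critical value at significance alpha/(2N) with N - 2 degrees of freedom
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

print(grubbs_critical(285))  # approximately 3.7 standard deviations
```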
By the way, see my answer to your question at the link.
Outliers tell you something - they are not out-and-out liars - they're part of the data set.
Consider an instrument that takes one measurement per second. If you try to deal with such a situation using a +/-3 sigma criterion (or 6 sigma QC protocols), then about 3 out of every 1000 measurements are 'out of specification'. This means that in one hour (3600 measurements) you would expect roughly ten 'failures' on average. You need to decide (as Sergiy and Harold have indicated above) what accuracy you desire in your final result. In the search for the Higgs boson, they looked for events whose chance of being a random fluctuation was below the 5 sigma level. On first appearance (33 and 8 outliers from 285), it would seem that these 'outliers' are part of a wider data set. Have you forgotten some factor here? Are you convinced that the full data set is normally distributed?
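As a back-of-envelope check of those numbers, here is a small sketch assuming a normal distribution and one measurement per second (SciPy assumed):

```python
# Expected out-of-specification rate for a +/- 3 sigma criterion
from scipy.stats import norm

p_out = 2 * (1 - norm.cdf(3))   # probability of falling outside +/- 3 sigma
print(p_out * 1000)             # about 2.7 'failures' per 1000 measurements
print(p_out * 3600)             # about 10 'failures' per hour
```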
Outliers can occur by chance in any distribution, but they often indicate either measurement error or that the population has a heavy-tailed distribution. In the former case one wishes to discard them or use statistics that are robust to outliers, while in the latter case they indicate that the distribution has high skewness and that one should be very cautious in using tools or intuitions that assume a normal distribution. A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations, or may indicate 'correct trial' versus 'measurement error'; this is modeled by a mixture model.
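As an illustration of that last point, here is a minimal sketch (with made-up data) of separating a 'correct trial' component from a 'measurement error' component using a two-component Gaussian mixture; it assumes scikit-learn is available.

```python
# Two-component Gaussian mixture as a simple contamination model (toy data)
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
clean = rng.normal(loc=100.0, scale=5.0, size=270)    # main population
errors = rng.normal(loc=140.0, scale=15.0, size=15)   # contaminating process
x = np.concatenate([clean, errors]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(x)
labels = gm.predict(x)                 # which component each point is assigned to
print(gm.means_.ravel(), gm.weights_)  # estimated component means and mixing weights
```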
In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable. This can be due to incidental systematic error or flaws in the theory that generated an assumed family of probability distributions, or it may be that some observations are far from the center of the data. Outlier points can therefore indicate faulty data, erroneous procedures, or areas where a certain theory might not be valid. However, in large samples, a small number of outliers is to be expected (and not due to any anomalous condition).
Outliers, being the most extreme observations, may include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low. However, the sample maximum and minimum are not always outliers because they may not be unusually far from other observations.
An outlier is a datum that is out of tune with the rest of the data. It is too distant or different from the rest of the data and is often unexpected and unjustifiable. For instance, Mathematics is often seen as a hard subject in my country, Nigeria. A student in our University failed other courses but excelled in Mathematics. This Mathematics performance was seen as an outlier, not just because it was too different from the rest of the data but also because of its suspicious nature. The student was not majoring in Mathematics but in Biology. The lecturer was queried. Many outliers arise from measurement errors or intentional falsification of data. However, we cannot rule out the possibility of occurrence by chance. Their occurrence often calls for intervention. As Chalamalla has contributed above, there are formal methods for detecting and treating outliers when they occur.
This business of handling outliers is problematic at best. I wonder whether we agree on what an outlier is. When we rely on a computer program's algorithm, we become more and more removed from the data itself, and it may lead us down a very dark path. Ignoring suspect points can be problematic as well. In one critical application I am aware of, a regression equation changed direction because of one data entry error. What would otherwise have been a very good predictor of a future event actually predicted the opposite result. There are some outliers that matter and have a great influence on calculated results. There are also outliers that have low influence because they do not produce results that change our conclusions. One simple example would be extreme data points that fall on the regression line of one variable regressed on another. Another situation might involve data close to the multivariate center of a swarm of data points. Those data points will not likely disturb calculations very much. We might call those points in-liers rather than outliers, but they all perpetrate a lie.
Let's adopt a clear definition of what an outlier is. An outlier is a data point that is sampled from some distribution other than the one you think it is sampled from. Are you really dealing with outliers or with extreme, but legitimate, observations? If the data point really is an outlier, then the only proper thing is to throw it out. If the data point is just extreme, then you have to make some determination about whether it is over-represented (oversampled) in your sample. On the other hand, you may have outliers (or in-liers) that don't matter much because they have a low impact on computational results. The best solution is to stay close to the data, and not let abstractions from computer programs determine whether some point is an outlier or not.
A very practical approach for handling data, when we cannot make good decisions about where a data point comes from, may be found in the methods of robust statistics. The idea is to summarize the main body of data in a way that ignores overly influential observations. There are several principles that guide this field. Two of the simple ideas are trimming and Winsorizing. Trimming involves ignoring some percentage of extreme observations by simply removing them from the distribution of data points. A 10% trimmed mean, for example, would remove the top 5% and the bottom 5% of the data points and calculate the mean of what is left. Similar strategies exist for dealing with multivariate data.
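A minimal sketch of a 10% trimmed mean on made-up data, assuming SciPy; note that trim_mean's proportiontocut is the fraction removed from each tail, so 0.05 corresponds to the 10% trimmed mean described above.

```python
# Trimmed mean versus ordinary mean on data with a few gross outliers (toy data)
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(1)
x = rng.normal(size=100)
x[:3] += 20.0                 # plant a few gross outliers
print(np.mean(x))             # pulled upward by the outliers
print(trim_mean(x, 0.05))     # 10% trimmed mean: 5% cut from each tail
```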
Winsorizing is a little different, in that new data values are imputed for extreme observations. In the case of a mean, one simple Winsorizing strategy involves replacing the extreme data value with the next most extreme data value. This is expected to remove some of the overly influential impact that a particular data point has on the results. Other possibilities also exist using the concepts of nearest neighbor, smoothing, reweighting, applying loss functions other than least squares, etc.
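A corresponding sketch of Winsorizing, again on made-up data and assuming SciPy: the most extreme 5% on each side are replaced by the nearest retained values rather than removed.

```python
# Winsorized mean versus ordinary mean on the same kind of contaminated data (toy data)
import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(2)
x = rng.normal(size=100)
x[:3] += 20.0                            # plant a few gross outliers
xw = winsorize(x, limits=[0.05, 0.05])   # clip the lowest and highest 5% of values
print(np.mean(x), np.mean(xw))           # the Winsorized mean is far less affected
```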
Each robust strategy involves changing the data to something that is less impactful if it is an outlier, but it raises the question of what kind of data you really have. In a sense, a median is just a degenerate case of a maximally trimmed mean or a maximally Winsorized mean. In both cases the data are being clipped at the extremes. Similar univariate and multivariate strategies exist for many different situations.
How about making a determination of the influence each of your suspected outliers has on the overall result? John Tukey used to recommend that you always run a robust method alongside any conventional statistic based on least squares. If they agree, then don't worry about the estimation method. If they don't agree, then go find out why before proceeding. In other words, the issue is more about auditing data quality than about the computational method. If you cannot decide what kind of data you have, then it is probably prudent to use a robust technique. Good statistical analysis is often not about how much data we have, but about the quality of the data we are working with.
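One way to act on that advice (not Tukey's own prescription, just an illustration on made-up data) is to fit the same data by least squares and by a robust Theil-Sen line and compare the results; SciPy is assumed.

```python
# Compare a least-squares fit with a robust Theil-Sen fit (toy data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[-1] += 30.0                                   # one wild point

ols_slope, ols_intercept = np.polyfit(x, y, 1)            # least-squares estimates
ts_slope, ts_intercept, lo, hi = stats.theilslopes(y, x)  # robust estimates
print(ols_slope, ts_slope)   # if these disagree noticeably, audit the data first
```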
Whether your data are normally distributed is usually irrelevant; the issue is most often whether the population from which you sampled is normal. I have no theory about your data or where they came from, so I could not reasonably answer the question you have about outliers or extreme data points. This needs to be answered on substantive grounds, not on statistical grounds.
As others have indicated, it's the required level of confidence in your statistical interpretation or conclusion that may be used to flag outliers. You may also want to look up articles on the 'fat tail' - for example, among hurricanes, Hurricane Sandy was very far removed from the 'standard' hurricane. See: https://gwagner.com/wsj-was-hurricane-sandy-the-fat-tail-of-climate-change/
Personally I do not throw away any acquired data to make the remainder 'look good'. Here's something on six-sigma (which I am not a fan of, BTW) and outliers:
I have a data set of 400 data points from an engineering study and have developed an expression to describe them. On a plot of predicted versus measured values, there are 22 points that lie well away from the line of equality. I consider them outliers, ten coming from one of the sub-sets. From this discussion I have calculated z values for them using the expression
z = (outlier value - predicted value) / standard error of estimate. The z values calculated in this way range from 17 to -3.3, 16 being positive and 6 negative.
I have adapted the equation for z based on the discussion. Is this faulty reasoning? Is there a better way of justifying (by quantifying) the omission of these 22 data points?
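For concreteness, a minimal sketch of the z calculation described above, with purely illustrative (made-up) arrays; 'see' stands for the standard error of estimate of the fitted expression.

```python
# Standardized residuals from a fitted expression (hypothetical numbers)
import numpy as np

measured = np.array([1.10, 1.25, 1.62, 0.98, 1.40])    # made-up measured values
predicted = np.array([1.08, 1.27, 1.20, 0.97, 1.39])   # made-up predictions
see = 0.025                                            # hypothetical standard error of estimate

z = (measured - predicted) / see
print(z)   # points with |z| well beyond about 3 look suspect
```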
In response to a question from John via PM (with a couple of typos corrected):
As you'll perhaps easily discern, I am somewhat (understatement!) cynical about manipulating data - in particular, removing data points - to get it to conform to some sort of simple (e.g. 2-parameter) model. We normally use such models to allow predictions of what we may expect in the future. Z-values are a standard way of expressing how far a data point lies from the mean in units of standard deviation and, if the population were truly normal, then 17 standard deviations (a z-score of 17) from the norm would be far outside statistical expectation. Means of repeated samples from the whole population are routinely treated as normally distributed in most statistical analyses.
If the data points were legitimately acquired, then it's clear that the normal-model approximation doesn't fit and some other factors need to be taken into account. Data that lie in these extremes do need careful examination, and there are plenty of examples of the so-called 'fat tail' where the deviation of a single event (hurricanes and tides are good examples to research) is so far from the norm that one might have considered it impossible to occur, such as Katrina. But Katrina happened...
You raise an interesting point about the permitted number of 'outliers' in a set of data. Clearly it depends... It depends on your understanding of the experimental set-up, the expected form of the distribution, and the desired 'accuracy'. If you consider a data set of the digits 0 to 9, then each digit has equal probability in a random set of numbers (e.g. the first 100,000 digits of pi). One could not (or should not) approximate this with a normal (or log-normal, or Weibull) distribution, even though the data possess a calculable mean and standard deviation. Now if you had a set of digits 0 to 9 and you suspected that the data were not random, then a statistical test such as chi-squared could be used to pick up the deviation, on the basis of whether the differences were significant at some prescribed level.
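A small sketch of that digit-frequency idea, assuming SciPy: test the observed counts of the digits 0 to 9 against the uniform expectation with a chi-squared goodness-of-fit test (the digits here are simulated stand-ins, not the digits of pi).

```python
# Chi-squared test of digit frequencies against a uniform expectation (simulated digits)
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(4)
digits = rng.integers(0, 10, size=100_000)     # stand-in for 'random' digits
observed = np.bincount(digits, minlength=10)   # counts of the digits 0..9
stat, p = chisquare(observed)                  # expected counts default to equal frequencies
print(stat, p)                                 # a small p-value would flag non-randomness
```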
Not sure this helps, but I appreciate the discussion and your further thoughts.
The data are not whole numbers; the variable ranges between 0.950 and 1.628. From an internet search, several references suggest the chi-squared test is not appropriate for such data. Is this correct? If so, can you recommend an alternative test?
Hi John. Not sure I fully understand your response. Chi-squared will simply look at the differences between actual results and those predicted by whatever model you use. The test does not require whole numbers (are integers pointless? Sorry for the bad joke...). What am I missing? If the number of sample sets is small, then Student's t-test comes into play...
I agree with your point about exercising caution over eliminating data and want to be able to justify any such action. I have removed 8 of the 22 data points, as they were well away from the line of equality and were all in one of the 24 data sub-sets. I believe I can identify the likely source of the error from physical laws applicable to the type of data being analysed. This leaves 14 data points scattered across the other data sets, which possibly suffer from the same source of error.
Using the 14 data points (potential outliers), I calculated chi-squared using the measured values and those predicted by the relationship developed. Chi-squared is 0.586 for 13 DF, which indicates that the measured and predicted values are independent. As a cross-check, I took 14 data points close to the line of equality and calculated chi-squared as 0.0256 for 13 DF. This indicates that the "good" measured and predicted values are also independent, despite being well fitted by the relationship.
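For clarity, here is a sketch of the statistic as described, with hypothetical numbers in place of the real measurements; whether a chi-squared goodness-of-fit on continuous measured and predicted values is appropriate here is exactly the question raised next.

```python
# Chi-squared-style statistic from measured versus predicted values (hypothetical numbers)
import numpy as np

measured = np.array([1.10, 1.25, 1.62, 0.98, 1.40, 1.05, 1.33])    # made-up
predicted = np.array([1.08, 1.27, 1.20, 0.97, 1.39, 1.06, 1.30])   # made-up

chi2 = np.sum((measured - predicted) ** 2 / predicted)
dof = measured.size - 1
print(chi2, dof)   # compared against the chi-squared distribution with 'dof' degrees of freedom
```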
When scouring the internet for inspiration, I noted that several sites said the measured values should be whole numbers, and thought that this might be responsible for the apparent anomaly. As it is, I am at a loss.
It may already be clear, but for me, statistics is a struggle, so I appreciate the time taken to provide your answers.