I have calculated the p-values as 1 minus the Beta(6, 1) CDF evaluated at each rescaled MSD, (15/14^2)*MSD. The MSDs were calculated from the estimated mean and covariance. I have obtained 9 outliers with p
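The calculation described above can be sketched as follows. This is a minimal sketch with made-up placeholder data, not the original dataset; the 15/14^2 rescaling and the Beta(6, 1) reference are consistent with n = 15 observations and p = 12 variables, under the standard result that (n/(n-1)^2)*MSD follows Beta(p/2, (n-p-1)/2) when the mean and covariance are estimated from the same sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 15, 12                  # implied by the 15/14^2 rescaling and Beta(6, 1)
X = rng.normal(size=(n, p))    # placeholder data, NOT the poster's dataset

# Mahalanobis squared distances from the estimated mean and covariance
mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
diff = X - mean
msd = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)

# With estimated parameters, (n/(n-1)^2)*MSD ~ Beta(p/2, (n-p-1)/2)
scaled = (n / (n - 1) ** 2) * msd
pvals = 1.0 - stats.beta.cdf(scaled, p / 2, (n - p - 1) / 2)
```

Note that with n = 15 and p = 12 the estimated covariance is nearly singular, which by itself can make many points look extreme.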
Alessandro, Using your method the answer is yes. However, that is not what I would do. I would try the methods advocated in the MultRobBank papers and the paper On Robust Partial Discriminant Analysis... in the zip file. I apologize for the size but I was having a computer glitch today. If you have questions, please contact me. Best wishes, David
Computationally, Professor Booth has verified your concern. However, I think you may want to rethink the issue. Relying on a computer program's algorithm may lead you down a very dark path. An outlier is a datapoint sampled from some distribution other than the one you think it was sampled from. Are you really dealing with outliers, or with extreme but legitimate observations? If a datapoint really is an outlier, then the only proper thing is to throw it out. If it is merely extreme, then you have to make some determination about whether it is over-represented (oversampled) in your sample. On the other hand, you may have outliers that don't matter much because they have low impact on the computational results.

Some people rely on a compromise method akin to trimming or Winsorizing. Both involve changing the data to something less impactful if it is an outlier, but that begs the question of what kind of data you really have. In a sense, a median is just a degenerate case of a maximally trimmed mean or a maximally Winsorized mean; in both cases the data are clipped at the extremes. Similar univariate and multivariate strategies exist for many different situations.
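The trimming/Winsorizing compromise mentioned above can be illustrated briefly. This is a sketch on made-up numbers, using scipy's `trim_mean` and `winsorize`; the point is how each estimator responds to a single gross value:

```python
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

# Illustrative sample with one extreme value (made-up numbers)
x = np.array([2.1, 2.4, 2.2, 2.5, 2.3, 2.6, 2.2, 9.8])

plain_mean = x.mean()                            # dragged upward by 9.8
trimmed = stats.trim_mean(x, 0.125)              # drop one value from each tail
winsored = winsorize(x, limits=0.125).mean()     # clip tails to nearest kept value
median = np.median(x)                            # the maximally trimmed case
```

Trimming discards the extremes outright, while Winsorizing replaces them with the nearest retained values, so the Winsorized mean still reflects that something large was observed.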
How about making a determination of the influence each of your suspected outliers has on the overall result? John Tukey used to recommend that you always run a robust method alongside any conventional statistic based on least squares. If they agree, then don't worry about the estimation method. If they don't agree, then go find out why before proceeding. In other words, the issue is more about auditing data quality than about the computational method. If you cannot decide what kind of data you have, then it is probably prudent to use a robust technique. If you argue in terms of the cost of obtaining a data point, I would counter by asking what it costs to knowingly be wrong. Good statistical behavior is often not about how much data we have, but about the quality of the data we are working with.
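Tukey's rule of thumb above can be sketched in code. The robust estimator here (Theil-Sen) is just one illustrative choice, not something the post prescribes, and the data are synthetic with one planted gross error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=20)
y[-1] += 40.0                            # plant one gross outlier

ols_slope, ols_intercept = np.polyfit(x, y, 1)   # conventional least squares
ts_slope = stats.theilslopes(y, x)[0]            # robust Theil-Sen slope

# Tukey's advice: if the two disagree, audit the data before proceeding
disagree = abs(ols_slope - ts_slope) > 0.1
```

Here the single bad point drags the least-squares slope well away from the truth, while the median-of-pairwise-slopes estimator barely moves, so the disagreement itself flags the data-quality problem.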
I would suggest that any statistical method used to find extreme values and then delete those values is invalid.
At times mistakes are made. Numbers are entered incorrectly, machines break down, people get tired. If you can identify a mistake, then you should delete that value.
I am very concerned that 9 outliers are over 50% of the data. I hope I am making a mistake interpreting the question.
A couple of general rules to guide thinking about outliers.
1) If your data analysis problem is situated in a broader theoretical context, you should use that information to guide what is 'normal' for outliers, either in the sense of how many are typical in datasets or what values constitute extreme observations.
2) If you don't have substantive theory to guide you, and you consider your situation more exploratory, then you can probably expect about 5-10% of your data to represent some type of outlier. So from that point of view, 50% seems extreme. Ronald Pearson (among others) has a very good book on understanding exploratory data analysis (EDA) generally and outlier detection specifically within the EDA framework. His book also has a companion blog (link below).
Dear all, after a long time this topic still seems to be of interest, so I would like to share the dataset as an Excel(TM) spreadsheet. It contains my MSD calculations and p-values. Please let me know what you think about it.