I am working on validating a questionnaire and I need to ensure that there are few (or no) outliers that might affect the factor analysis process. Is the outlier labeling technique (Hoaglin, Iglewicz) applicable to non-normal data?
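For readers unfamiliar with the rule being asked about, here is a minimal sketch of the outlier labeling technique (Hoaglin, Iglewicz), assuming the commonly cited multiplier g = 2.2 applied to the interquartile range; the data are made up purely for illustration.

```python
# Minimal sketch of the outlier labeling rule (Hoaglin & Iglewicz),
# assuming the commonly cited multiplier g = 2.2; data are illustrative only.
import numpy as np

x = np.array([2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0, 6.0, 20.0])

q1, q3 = np.quantile(x, [0.25, 0.75])
g = 2.2
lower, upper = q1 - g * (q3 - q1), q3 + g * (q3 - q1)

print(x[(x < lower) | (x > upper)])  # values labeled as outliers (here: 20.0)
```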
Outliers can be identified through many techniques (a short sketch of these checks in code follows this list):
- you can run frequencies for each variable and examine the observed range by eye
- you can draw a box plot, or exclude data beyond 3 SD from the mean
- you can check skewness and keep it within an acceptable level (textbooks vary on the upper limit; some recommend that skewness not exceed 0.2, others allow up to 1.0)
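If it helps, here is a minimal sketch of these simple screens in Python (pandas/scipy); the column name and the 3 SD / skewness cutoffs are placeholders, not part of the advice above.

```python
# Minimal sketch of the simple univariate screens described above.
# The column name "item1" and the cutoffs are placeholders.
import pandas as pd
from scipy.stats import zscore, skew

df = pd.DataFrame({"item1": [1, 2, 2, 3, 3, 3, 4, 4, 5, 25]})

# Frequencies: eyeball the observed range of each item
print(df["item1"].value_counts().sort_index())

# Flag values more than 3 SD from the mean
z = zscore(df["item1"])
print(df.loc[abs(z) > 3, "item1"])

# Skewness of the item (compare against your chosen cutoff, e.g. |skew| < 1)
print(skew(df["item1"]))
```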
I would advise against removing outliers in this way unless you have reason to believe that they are invalid. Perhaps there is a robust version of the questionnaire validation technique you are using which will handle them within the data.
It sounds like you've come up with a good solution. If you're still interested in the question of detecting outliers with a non-normal distribution, I found this article helpful: "Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median," by Leys et al. (2013) in the Journal of Experimental Social Psychology, vol 49. Using 3 SD around the mean is technically inappropriate if the distribution is non-normal. I would be interested if anyone has used the absolute deviation around the median and found it helpful.
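For anyone who wants to try it, below is a minimal sketch of the median absolute deviation (MAD) rule as described by Leys et al. (2013); the cutoff of 2.5 and the 1.4826 consistency constant follow that paper, and the data are made up.

```python
# Minimal sketch of the MAD rule from Leys et al. (2013):
# flag points farther than k * MAD from the median.
import numpy as np

x = np.array([2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0, 6.0, 30.0])

median = np.median(x)
mad = 1.4826 * np.median(np.abs(x - median))  # scaled to estimate SD under normality

k = 2.5  # Leys et al. suggest 2.5 as a moderately conservative choice
outliers = x[np.abs(x - median) > k * mad]
print(outliers)  # -> [30.]
```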
I understand your position about keeping the outliers to maintain generalizability, but I'm not sure that I agree with it. In particular, is your original theory designed to be generalizable to the entire population, or was it stated in ways that would apply to nearly everyone nearly all of the time?
In other words, most theories are generated, explicitly or implicitly, without any thought given to outliers. That means that when we observe samples that truly represent entire populations, we will encounter cases that fall outside the boundaries of our theory.
So, for tasks such as estimating a mean, it is probably more important to retain outliers than it is when your goal is theory testing.
Although the Mahalanobis distance was originally intended for multivariate normal distributions, there have been theoretical efforts to clarify its application in non-normal contexts.
See Ekström's (2011) article "Mahalanobis' Distance Beyond Normal Distributions".
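As a rough illustration (not taken from Ekström's paper), here is a minimal sketch of Mahalanobis-distance screening in Python; the chi-square cutoff assumes approximate multivariate normality, which is exactly the assumption under discussion, so treat it as a rough screen otherwise.

```python
# Minimal sketch of multivariate outlier screening with Mahalanobis distance.
# The chi-square cutoff assumes approximate multivariate normality.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[0] = [6.0, -6.0, 6.0]  # planted outlier for illustration

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distances

cutoff = chi2.ppf(0.999, df=X.shape[1])  # conventional p < .001 criterion
print(np.where(d2 > cutoff)[0])  # indices of flagged rows
```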
It has been a while since I visited this thread of discussion. I am grateful for all the answers given. I managed to get a few papers published after taking into consideration all the comments here. My infinite thanks to all.
You can just use upper and lower quantiles. We use nonparametric statistical methods to analyze data that are not normally distributed; in the same way, instead of using the standard deviation, you can use quantiles. For example, you could assign NaN to values above the 95th percentile or below the 5th percentile of the data set.
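A minimal sketch of that quantile rule, assuming the 5th/95th percentile cutoffs mentioned above (adjust to taste):

```python
# Minimal sketch of the quantile rule described above: values beyond the
# 5th/95th percentiles are set to NaN. Cutoffs are only an example.
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0, 50.0])

lo, hi = np.quantile(x, [0.05, 0.95])
x_trimmed = np.where((x < lo) | (x > hi), np.nan, x)
print(x_trimmed)
```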