It is easy to find methods to detect outliers and influential points in regression models, but once they are detected, what should be done with them? We would welcome some references.
You have not provided much information so I can only provide somewhat vague assistance. But, I would suggest you look at robust statistical methods. The following reference may prove useful:
@book{Maronna2006,
  author    = {Maronna, Ricardo A. and Martin, R. Douglas and Yohai, Victor J.},
  title     = {Robust Statistics: Theory and Methods},
  publisher = {Wiley},
  year      = {2006}
}
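To make the suggestion concrete, here is a minimal sketch of a robust fit. Python with statsmodels is my own choice of tool (not something the book prescribes), and the data are simulated purely for illustration; the Huber M-estimator down-weights observations with large residuals instead of deleting them.

```python
# Minimal sketch of robust regression (Huber M-estimation) in Python.
# Assumes statsmodels and numpy are installed; the data are simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 50)
y[:3] += 15                                   # inject a few gross outliers

X = sm.add_constant(x)                        # design matrix with intercept
ols_fit = sm.OLS(y, X).fit()                  # ordinary least squares, pulled by outliers
rlm_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()  # Huber M-estimator

print("OLS coefficients:  ", ols_fit.params)
print("Huber coefficients:", rlm_fit.params)
print("weights given to the contaminated points:", rlm_fit.weights[:3])
```

The printed weights show how much influence the contaminated points received; values well below 1 mean they were down-weighted rather than discarded.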
If there is a reason for the outliers that is grounded in a real situation, your model may need modification. If the reason is error, you may be able to correct the data, giving justification. Otherwise or in addition, try sensitivity analysis and see if your conclusions are the same, with or without these data points - if yes, that strengthens your conclusions.
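As a rough sketch of what such a sensitivity analysis can look like in code (Python/statsmodels is just one possible tool here, and all variable names are hypothetical): refit the model with and without the flagged points and compare the estimates and their confidence intervals.

```python
# Sketch of a sensitivity analysis: refit with and without flagged points
# and compare the estimates. Data and flagged indices are made up for illustration.
import numpy as np
import statsmodels.api as sm

def fit_with_and_without(y, X, flagged_idx):
    """Return OLS results for the full data and with the flagged rows removed."""
    full = sm.OLS(y, X).fit()
    keep = np.setdiff1d(np.arange(len(y)), flagged_idx)
    reduced = sm.OLS(y[keep], X[keep]).fit()
    return full, reduced

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 1.0 + 0.8 * x + rng.normal(0, 1, 30)
X = sm.add_constant(x)

full, reduced = fit_with_and_without(y, X, flagged_idx=[0, 7])
print("slope with all points:  %.3f  CI %s" % (full.params[1], full.conf_int()[1]))
print("slope without flagged:  %.3f  CI %s" % (reduced.params[1], reduced.conf_int()[1]))
```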
Thank you A.K. Singh, Shane McGee McMahon and Miland Joshi. I am reading the sandwich vignette.
Sensitivity analysis is a good way to say something about outliers and their influence on the model, but which model: with or without the outliers, if they are genuine data?
Robust methods penalize data points that show different behavior in order to obtain a well-behaved model. Is the objective to obtain a beautiful (well-fitted) model even if it hides potentially important aspects revealed by the outliers?
Miland suggests modifying the model in some cases. I would like to know more about this option: what kind of modification? Thank you for your opinions and references.
This has been a very intriguing question for me, and the answers I have found have not been fully satisfactory. I appreciate the readings you have recommended; I will start reading them now. Thank you.
Viewing this from the standpoint of practical application rather than statistical theory, you should try to find the reasons. As M. Joshi said, if the outliers result from errors, you may be able to correct the data, giving justification. If they are real data, you may need some modification, or you may simply delete them when you have more than enough data. Under the assumption of a normal distribution, the amount of deleted data or outliers should be strictly controlled, for example to less than 3% or even 1%. When you do not have much data and you do not want to remove the outliers from the modeling data, you can try to correct or adjust them according to your experience or the normal-distribution ranges of the data.
How to handle high-influence points and outliers is complex. It is not a good idea simply to delete them, and there is no unique answer; we need to do a lot of work studying the data before making a decision. The following is an example that was posted on ResearchGate several weeks ago. I joined the discussion, but I forget who posted the question. The suggestions I gave are one of many ways to handle the outliers; the main idea is to check the assumptions of linear regression.
Here I can't insert the plots and nicely formatted tables, so please see the attachment.
Consider a linear regression using a famous data set from Freedman et al. (1991), Statistics, shown here in Table 1: the per-capita consumption of cigarettes in various countries in 1930 and the death rates (number of deaths per million people) from lung cancer in 1950.
Table 1: Death rate data from Freedman et al. (1991)

Obs  Country        Cigarettes per capita (1930)  Deaths per million (1950)
1    Australia      480                           180
2    Canada         500                           150
3    Denmark        380                           170
4    Finland        1100                          350
5    Great Britain  1100                          460
6    Iceland        230                           60
7    Netherlands    490                           240
8    Norway         250                           90
9    Sweden         300                           110
10   Switzerland    510                           250
11   USA            1300                          200
The question as posted on ResearchGate was:
“A confidence bound on the slopes that could occur with OLS would be wide at each end. However, it appears that it might go through the origin, which would make confidence on a slope appear as two straight lines, forming a v-shaped wedge. But, there are people who die of lung cancer without smoking, so one would expect to see a positive intercept. Regardless, larger x would generally mean larger residuals for y. What are your thoughts on this graph? What would you tell people who might ask you to evaluate what may be (or have been) happening?”
Fig. 1 (in the attachment) shows the regressions of deaths on cigarettes with and without the USA.
We will use this data set to discuss some common issues that are often ignored when performing linear regression. First, there are only 11 points, so the sample size is too small to support a meaningful conclusion about the association between the death rate and cigarette consumption. Second, there are only two variables, and it is probably improper to use cigarette consumption as the only predictor of the death rate from lung cancer. Hence we use these data only as a teaching tool to discuss regression diagnostics. There is no problem in assuming that the death rates from the different countries are independent and normally distributed, as can be seen from P-P plots. When we perform the linear regression using all 11 records, the USA is clearly an outlier: its Cook's D = 2.56, leverage = 0.43 and DFFITS = -4.32. Should we delete it? The USA has the largest population among the 11 countries in the data set, so the conclusion without the USA is obviously improper, even though the fit is better and the R-square higher (0.94 without the USA vs 0.54 with the USA).
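The diagnostics quoted above (Cook's distance, leverage, DFFITS) come straight out of standard software. The following Python/statsmodels sketch is my own rendering rather than anything from the original post; it computes those diagnostics for the Table 1 data.

```python
# Sketch: influence diagnostics for the Freedman et al. (1991) data in Table 1.
import numpy as np
import statsmodels.api as sm

cigarettes = np.array([480, 500, 380, 1100, 1100, 230, 490, 250, 300, 510, 1300])
deaths     = np.array([180, 150, 170, 350, 460, 60, 240, 90, 110, 250, 200])

X = sm.add_constant(cigarettes)
fit = sm.OLS(deaths, X).fit()
infl = fit.get_influence()

print("Cook's distance:", np.round(infl.cooks_distance[0], 2))
print("Leverage:       ", np.round(infl.hat_matrix_diag, 2))
print("DFFITS:         ", np.round(infl.dffits[0], 2))   # the USA (last row) stands out
```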
The scatter plot including the USA clearly shows that homogeneity of variance does not hold, so it is improper to fit a simple linear model that assumes it. In general, data do not always come in a form that is immediately suitable for analysis. Transformations should be chosen with a specific objective, such as ensuring linearity between the dependent and independent variables. The logarithm, square root, square, exponential, etc. are commonly used to transform variables according to the characteristics of the data before performing the analysis. Transformation also applies to multiple regression, but there it requires more effort and care. When one or more of the standard assumptions are violated, a transformation is needed. As these data show, linearity and homogeneity of variance are the most commonly violated assumptions. In general, if linearity holds but the variance is not constant, we usually transform the dependent variable; if linearity is violated, we may transform either variable or both.
From Figure 1 it can be seen that larger death rates have larger variance, so two transformations of deaths could be considered: the logarithm and the square root. It can also be seen that the variance of deaths is a function of cigarette consumption. Writing Y for deaths and X for cigarettes, we may consider the following transformations:
1. Y' = log(Y)
2. Y' = sqrt(Y)
3. Var(Y) proportional to X^2, that is, SD(Y) = c * X
To verify transformation 3, we group the countries by cigarette consumption into five groups of sizes (2, 2, 2, 2, 3); the scatter plot of cigarette consumption against the standard deviation of deaths within the five groups (Figure 2) shows an almost perfectly linear relationship through the origin.
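A short sketch of that check, again in Python and again my own rendering: sort the countries by cigarette consumption, form groups of sizes 2, 2, 2, 2 and 3, and compare the group standard deviations of the death rates with the group mean cigarette consumption.

```python
# Sketch: group the countries by cigarette consumption (sizes 2,2,2,2,3)
# and compare group standard deviations of deaths with group mean cigarettes.
import numpy as np

cigarettes = np.array([480, 500, 380, 1100, 1100, 230, 490, 250, 300, 510, 1300])
deaths     = np.array([180, 150, 170, 350, 460, 60, 240, 90, 110, 250, 200])

order = np.argsort(cigarettes)
groups = [order[:2], order[2:4], order[4:6], order[6:8], order[8:]]
for g in groups:
    print("mean cigarettes %7.1f   SD of deaths %7.1f"
          % (cigarettes[g].mean(), deaths[g].std(ddof=1)))
# If the SD grows roughly linearly with cigarettes (through the origin),
# Var(deaths) proportional to cigarettes^2 is a reasonable working assumption.
```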
Five regression models are considered. Model 1 is the classic linear regression. The systematic component in Models 1 and 5 is the same, but Model 5 has a non-constant error term: its variance is proportional to the square of cigarette consumption. The other three models are non-linear regressions. The corresponding regression parameter estimates are listed in Table 2 in the attachment.
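Since the model equations and Table 2 are only in the attachment, the sketch below covers just the two models that are fully described above: Model 1, ordinary least squares with constant variance, and Model 5, the same linear mean function fitted by weighted least squares with weights 1/X^2 so that the error variance is proportional to the square of cigarette consumption.

```python
# Sketch: Model 1 (OLS, constant variance) versus Model 5 (same mean function,
# variance proportional to cigarettes^2, i.e. WLS with weights 1/X^2).
import numpy as np
import statsmodels.api as sm

cigarettes = np.array([480, 500, 380, 1100, 1100, 230, 490, 250, 300, 510, 1300])
deaths     = np.array([180, 150, 170, 350, 460, 60, 240, 90, 110, 250, 200])

X = sm.add_constant(cigarettes)
model1 = sm.OLS(deaths, X).fit()                                # classic regression
model5 = sm.WLS(deaths, X, weights=1.0 / cigarettes**2).fit()   # Var(e) ~ X^2

print("Model 1 (OLS) intercept and slope:", np.round(model1.params, 4))
print("Model 5 (WLS) intercept and slope:", np.round(model5.params, 4))
```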
I was once told that in "modern quality control," it is far better to expend effort to collect data well from the beginning than to try to 'fix' it later. That would certainly seem to apply to 'outliers' caused by unacceptable nonsampling error, such as measurement error. I too would not like throwing out data just because I assumed it was of unacceptable quality/reliability, as that is a slippery slope. We should also be careful to define an outlier as a point that does not really belong to the population of interest, not just that it is inconvenient.
In work I did for many years, in establishment survey sampling for continuous data, linear regression through the origin was usually most appropriate. Substantial heteroscedasticity is naturally present, and thus weighted least squares regression is appropriate. In such a case, rather than the usual influence points that arise when it makes sense to have an intercept term, the most 'influential' points, in their own way, are points near the origin, which carry the larger regression weights. The estimated variances of the prediction errors are particularly at risk of being greatly overestimated by the relatively inflated residuals that commonly occur for smaller establishments that are not as sophisticated at reporting data (i.e., the 'ma and pa' shops). This can influence slopes as well, but variance is particularly vulnerable. Thus it is generally better to use the classical ratio estimator (CRE), even though heteroscedasticity in such a situation is generally greater than that inherent to the CRE.
Thus the CRE is robust in a reasonable way. It allows for the fact that y predictions in general have greater relative measurement error for smaller x values, which partially counters the natural heteroscedasticity. (I say 'natural' heteroscedasticity, because one would not expect a number a hundred times larger than another to have the same variance.)
Weighted least squares (WLS) regression, then, is what I consider to be an interesting and often useful corollary to the usual considerations of influence and outliers in OLS regression, particularly with an intercept term.
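To make the connection concrete, here is a small sketch (my own, in Python) of WLS regression through the origin. With weights proportional to 1/x, the WLS slope reduces algebraically to the classical ratio estimator, sum(y)/sum(x); weights of 1/x^2 would instead give the mean of the individual ratios y/x. Which weighting is appropriate depends on the heteroscedasticity one assumes.

```python
# Sketch: WLS regression through the origin and its link to the ratio estimator.
# The data are simulated to mimic heteroscedastic establishment-survey responses.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(10, 1000, 40)                        # establishment sizes
y = 0.6 * x + rng.normal(0, 0.5 * np.sqrt(x), 40)    # variance grows with x

# No intercept: the design matrix is just x itself.
wls_fit = sm.WLS(y, x, weights=1.0 / x).fit()        # weights proportional to 1/x
print("WLS slope (weights 1/x):   %.4f" % wls_fit.params[0])
print("Classical ratio estimator: %.4f" % (y.sum() / x.sum()))  # equal up to rounding
```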
The paper at the attached link provides more information.
Cheers - Jim
Article Ken Brewer and the coefficient of heteroscedasticity as used...
A good reference is Montgomery and Peck's Introduction to Linear Regression Analysis (ISBN 0470542810). An alternative might be N.R. Draper's Applied Regression Analysis (ISBN 8126531738).
As for what you should do: assuming that they aren't the result of errors in collection or recording, so that they are present in the cleaned data that you are satisfied is authentic, I suggest reporting them, and doing a sensitivity analysis - show that their omission makes no difference to the findings.