The LS method aims to minimize the sum of squares: the squared distances from the regression line to the data points. The regression line obtained via LS will therefore be attracted to outliers, because it tries to make the distance to each point small: a large distance squared is even larger, and LS remedies that by pulling the line closer to the outlier. To avoid this, you may want to use another loss function to define 'the best line'. You may, for instance, use the absolute value as a loss function rather than the quadratic one. (However, then you are targeting the conditional median rather than the conditional mean.) The absolute value puts less weight on outliers than the squared loss does.
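To make the contrast concrete, here is a small sketch (plain numpy/scipy, with made-up data and illustrative names) comparing the squared loss with the absolute-value loss on a sample containing one outlier:

```python
# Contrast the squared-error loss (OLS) with the absolute-error loss (LAD)
# on data containing one large outlier in y. Data values are illustrative only.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
y[-1] += 40.0                                   # one large outlier in y

ols_slope, ols_intercept = np.polyfit(x, y, 1)  # minimizes the sum of squares

def abs_loss(params):
    a, b = params
    return np.sum(np.abs(y - (a * x + b)))      # sum of absolute residuals

lad = minimize(abs_loss, x0=[1.0, 0.0], method="Nelder-Mead")
print("OLS slope:", round(ols_slope, 2), " LAD slope:", round(lad.x[0], 2))
# The LAD (median-type) slope stays near the true value 2, while the OLS slope
# is dragged upward by the single contaminated point.
```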
The trouble with outliers in the least-squares method is that least squares only "knows about" the data points in a sample through their mean and their squared differences from the mean. The ideal case for LS methods is thus a symmetric, preferably normal, distribution with narrow tails. Outliers distort the mean (either inflating it or radically diminishing it) in the first place. Then, in the second place, a distorted mean distorts the distances between the individual data points and the mean, and squaring these differences only accentuates the distortion. LS methods cannot anticipate interesting patterns; they can only provide the best solution to the set of squared differences, under a premise of a roughly normal distribution.
The main problem of least squares with respect to outliers is that one single value can have an arbitrarily large impact on the estimates. If you, for example, consider the estimation of a location parameter $\mu$ by least squares, you minimize
$\sum_{i=1}^n (x_i - \mu)^2$ over $\mu$.
If the data are uncontaminated, everything is fine. But if you introduce one outlier $x_0$ into the data set, you can bias the result to an arbitrary extent by letting $x_0 \rightarrow \infty$. You can see this, for example, by looking at the explicit solution for the estimator.
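Concretely, setting the derivative with respect to $\mu$ to zero gives
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x},$$
so the estimator is simply the sample mean, and replacing one observation by $x_0$ shifts $\hat{\mu}$ by $(x_0 - x_i)/n$, which grows without bound as $x_0 \rightarrow \infty$. A median, by contrast, would not move at all.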
There is another type of outlier: outliers in the explanatory variable. Such points have high leverage, which means that the responses at these points have a greater effect on the slope of the regression line than points in the middle of the data. Therefore, even if the responses at these points are not themselves outliers, they have too much effect on the line.
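A small numpy sketch (with made-up x values) makes this visible: leverage is the diagonal of the hat matrix, and a point far out in x gets a hat value near 1, so the fitted line is forced to pass close to its response.

```python
# Leverage as the diagonal of the hat matrix H, where y_hat = H @ y.
import numpy as np

x = np.array([1., 2., 3., 4., 5., 20.])      # the last x is far from the rest
X = np.column_stack([np.ones_like(x), x])    # design matrix with an intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
print(np.round(np.diag(H), 2))
# The point at x = 20 has leverage close to 1 (about 0.97 here), so the line
# must pass near its response almost exactly, whatever that response is.
```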
As noted above, least squares can be perturbed by outliers. A more robust method may leave them in, yet the results can still be made less accurate. Measurement and other nonsampling error can cause us to use 'bad' data. The best way to define an outlier might be as a point that does not even belong to the population frame you want. Perhaps it is best to let least-squares analyses find the data points that would most reduce the accuracy of the results, however you define those points.
Often, especially in regression through the origin (RTO), there is substantial heteroscedasticity, so we should use weighted least squares (WLS) regression rather than OLS. If we plot the points on a scatterplot and put confidence bounds about the regression line through them, above and below each predicted y value, we can use those bounds to investigate the quality of the data.
We must be careful not to just throw out 'inconvenient' data. I have always been very slow to ignore any collected data, but using such scatterplots, we can see which data need confirmation.
In WLS RTO, the points nearest the origin get the largest weights. Unfortunately, they are also the ones that may be most susceptible to (relatively speaking) large measurement error - say in an establishment survey where the larger companies may have more expertise at supplying information for official purposes than the small ones. Thus a point that is an 'outlier' near the origin may throw the confidence bounds out very widely. Such a point would need to be investigated first. Then the bounds on a scatterplot would be more reasonable for looking at the remainder of the data. Thus OLS bounds might be good for a first look at 'outliers' near the origin; after that, more reasonable and robust WLS bounds can be used, such as with a weight like 1/x for the classical ratio estimator (CRE). That can be done using software such as SAS. The first attached link will show an example using a spreadsheet.
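As an illustration only (assuming statsmodels, with simulated data), here is what the 1/x weighting looks like in code; with weights 1/x, the WLS slope through the origin reduces to sum(y)/sum(x), i.e., the classical ratio estimator.

```python
# Regression through the origin with heteroscedastic data: OLS vs WLS with 1/x weights.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1, 100, 40)
y = 1.8 * x + rng.normal(scale=0.5 * np.sqrt(x), size=x.size)  # spread grows with x

ols = sm.OLS(y, x).fit()                     # regression through the origin
wls = sm.WLS(y, x, weights=1.0 / x).fit()    # CRE weighting: weight 1/x
print("OLS slope:", ols.params[0])
print("WLS (1/x) slope:", wls.params[0])
print("ratio sum(y)/sum(x):", y.sum() / x.sum())  # equals the WLS (1/x) slope
# Because 1/x weights count the small-x points most, a bad measurement near the
# origin can swing this slope, which is why those points deserve a first look.
```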
So, summarizing, maybe the fact that 'outliers' can be a problem for least squares estimates is a good thing. We can use that to identify and investigate potential data quality issues. We then need to use the most appropriate regression weights in the end.
Cheers - Jim
Data CRE Prediction 'Bounds' and Graphs Example for Section 4 of ...
An outlier can be interpreted as an un-modeled (unknown) parameter. Least-squares results from an under-parameterized model are not optimal, e.g., fitting a circle to points that actually lie on an ellipse. From the perspective of the circle model, most of these points are "extreme": their residuals exceed the confidence interval given by the covariance matrix and will be interpreted as outliers. Because of the unknown character of outliers, it is difficult, if not impossible, to choose the right model - outlier detection is a model-selection problem.
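A rough sketch of that circle/ellipse example (scipy, with simulated points; the axis lengths and noise level are arbitrary choices): fitting the under-parameterized circle model makes ordinary ellipse points look like outliers.

```python
# Fit a circle to noisy points that actually lie on an ellipse; the residuals of
# the circle fit greatly exceed the measurement noise, so good points get flagged.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 100)
x = 4 * np.cos(t) + rng.normal(scale=0.05, size=t.size)  # ellipse, semi-axis 4
y = 1 * np.sin(t) + rng.normal(scale=0.05, size=t.size)  # ellipse, semi-axis 1

def circle_residuals(p):
    cx, cy, r = p
    return np.hypot(x - cx, y - cy) - r      # geometric distance to the circle

fit = least_squares(circle_residuals, x0=[0.0, 0.0, 2.0])
res = circle_residuals(fit.x)
print("fitted radius:", fit.x[2])
# Many residuals are far larger than the noise level (0.05): the circle model
# declares perfectly good ellipse points to be "outliers".
print("share of |residual| > 3*0.05:", np.mean(np.abs(res) > 0.15))
```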
To follow up on Terry's relevant point: it is not really outliers that are problematic for LS, it is influential observations (a concept introduced by Dennis Cook, if my memory is not letting me down). Leverage is a way of measuring influence. Outliers may often also be influential observations, but not always. To manage the problem with influential observations, resistant methods have been developed (see e.g. the works of Paul F. Velleman). Sometimes the word robust replaces resistant, but robustness normally refers to a statistical method's ability to deal with deviations from the assumed error distribution, and it is therefore preferable to use resistance.
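For what it is worth, here is a minimal sketch (assuming statsmodels, with simulated data) of measuring influence directly, via Cook's distance and leverage, rather than hunting for outliers in the responses alone.

```python
# Cook's distance combines residual size and leverage into a single influence measure.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 2 + 0.5 * x + rng.normal(scale=1.0, size=30)
x[0], y[0] = 25.0, 5.0        # a high-leverage point; its y value is not extreme

X = sm.add_constant(x)
influence = sm.OLS(y, X).fit().get_influence()
cooks_d, _ = influence.cooks_distance
leverage = influence.hat_matrix_diag
print("Cook's D of point 0:", cooks_d[0], " leverage:", leverage[0])
# Point 0 dominates both measures even though its response alone would not
# flag it as an outlier.
```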