How to identify outliers on a plot of measured values versus values predicted from an expression determined by non-linear regression?

John -

That might be OK, but for the standard error you mention, which I guess is the square root of the estimated variance of the prediction error, have you considered heteroscedasticity? For different predicted values, you should have different estimated variances of the prediction errors. See https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity. (Increased variance for y with larger predicted y should also apply to nonlinear regression.) The estimated variance of the prediction error, by the way, uses an estimated sigma which is impacted by bias, so the estimated variance of the prediction error can be a good overall measure of the accuracy of a prediction.

Perhaps you should provide prediction intervals to form curved, flaring out 'bands' about the predicted curve from the "expression" noted in your question here. If the prediction intervals are done for each of the 24 subsets, and say several percent of the points fall outside of the, say, 99% prediction interval 'bands' in a given subset, especially if it is asymmetric, then you may have some suspicious points which may be outliers. But remember that, ideally, one percent of points SHOULD fall outside of 99% prediction intervals, and 10% should fall outside of 90% prediction intervals. Also, a point that does not belong could fall inside a prediction interval. But if you have around 20 data points which fall far outside of reasonable prediction intervals, then I think that you have good reason to consider them likely outliers. You could at least then document graphically why you considered them to be outliers.
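To make the flaring 'bands' idea concrete, here is a minimal Python sketch. It assumes the standard deviation of the error grows as the predicted value raised to a power gamma (gamma = 0.5 is purely an illustrative choice), and it ignores the uncertainty in the fitted parameters, so the bands come out somewhat too narrow. The function name `flag_outside_band`, the parameter count of 3, and the simulated data are all made up for illustration; they are not John's data or his package's method.

```python
import numpy as np

def flag_outside_band(y, y_pred, n_params, gamma=0.5, z=2.576):
    """Flag points outside an approximate 99% band whose width grows with y_pred.

    Assumption: sd(error) = sigma * y_pred**gamma. Parameter-estimation
    uncertainty is ignored, so this is only a rough screening band.
    """
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scale = y_pred ** gamma                      # assumed relative size of the error sd
    resid = y - y_pred
    # Estimate sigma from the scaled residuals
    sigma = np.sqrt(np.sum((resid / scale) ** 2) / (len(y) - n_params))
    half_width = z * sigma * scale               # wider band for larger predictions
    return np.abs(resid) > half_width, half_width

# Simulated example (hypothetical data): 380 points, error sd grows with prediction
rng = np.random.default_rng(0)
y_pred = np.linspace(1.0, 100.0, 380)
y = y_pred + rng.normal(0.0, 0.3 * np.sqrt(y_pred))
y[[50, 300]] += [15.0, -40.0]                    # two planted gross errors

flags, half_width = flag_outside_band(y, y_pred, n_params=3)
print("points outside the band:", np.flatnonzero(flags))
```

Plotting `y_pred - half_width` and `y_pred + half_width` against `y_pred` would give the curved bands, and the flagged points are the candidates to investigate, not automatically the ones to delete.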

Anyway, I think that's one way you might look at this.

Cheers - Jim

Jochen Wilhelm

If the values are just "far away" but still plausible, then check whether something strange happened during the generation of these values. If you find something strange, remove them. If not, keep them.

If the values are implausible/impossible (so something must obviously have gone wrong), remove them.

John M Wheeldon

Jochen, Thank you for your answer. All the "outliers" have z-values beyond ±2.58, the 99 percent confidence limits. Having 22 points outside these limits seems unreasonably high, so I am inclined to remove them. I believe that I can identify likely sources of error for 16 of the "outliers" and three of the others are those furthest from the line of equality. However, my understanding of statistical analysis is such as to make me dangerous and I am reluctant to delete the points without some quantifiable criterion. Do you see flaws with the equation proposed in my question? Would you recommend an alternative approach that is readily understood by an elderly novice?

John M Wheeldon

Jim, Thank you for taking the time to answer my question. Statistics is always a struggle for me. I use a package routine for analysis and it does not include the approaches you recommend. In any case, I assume that you consider the equation in my question to produce erroneous answers. Are the errors likely to be large? Could I assume that the equation will give a good indication of whether or not a data point is an outlier?

Considering my declared shortcomings in statistics, I hope that you do not regard the following question as impudent. You wrote, "....one percent of points SHOULD fall outside of 99% prediction intervals, and 10% should fall outside of 90% prediction intervals." An internet search disclosed the following. "A 95% confidence interval is a range of values that you can be 95% certain contains the true mean of the population. This is not the same as a range that contains 95% of the values."

Are the two views expressed at variance? I look forward to your reply, Sincerely, John Wheeldon

Jochen Wilhelm

The confidence interval (CI) refers to a statistic (a parameter estimate). It can happen that even all of the observed values are outside the CI. That is, by itself, no indication of any problem. As Jim suggested, I would check the prediction interval instead. Especially when the variance is not independent of the predicted value (but wrongly assumed to be), you are likely to find values rather far away from the model fit in regions with large variance (that's actually quite obvious).
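A small simulated example may help to see the difference. This is only a sketch with made-up numbers (380 normally distributed values, mean 10, standard deviation 2) and the normal multiplier 1.96 rather than a t quantile: the 95% CI for the mean is narrow and contains few of the observations, while the 95% prediction interval for a single new value contains roughly 95% of them.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=380)    # hypothetical measurements

n = len(x)
mean = x.mean()
s = x.std(ddof=1)                                # sample standard deviation
z = 1.96                                         # approximate 95% normal multiplier

ci_half = z * s / np.sqrt(n)                     # CI for the mean: shrinks with n
pi_half = z * s * np.sqrt(1.0 + 1.0 / n)         # prediction interval for one new value

frac_in_ci = np.mean(np.abs(x - mean) <= ci_half)
frac_in_pi = np.mean(np.abs(x - mean) <= pi_half)

print(f"CI half-width {ci_half:.2f}, fraction of observations inside: {frac_in_ci:.2f}")
print(f"PI half-width {pi_half:.2f}, fraction of observations inside: {frac_in_pi:.2f}")
```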

You can also check other measures like leverage and Cook's distance to get an impression of how much impact your suspected outliers actually have. If this is not much, it might not be worth thinking about removing them at all.
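For reference, both measures can be computed directly from the design matrix. The sketch below uses an ordinary linear least-squares fit on simulated data; for a nonlinear regression one would have to assume that the Jacobian of the model at the fitted parameters can stand in for `X`, which is only an approximation. The function name `leverage_and_cooks` and the data are hypothetical.

```python
import numpy as np

def leverage_and_cooks(X, y):
    """Leverage (hat-matrix diagonal) and Cook's distance for a least-squares fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # With X = QR (reduced form), the hat-matrix diagonal is the row sum of Q**2
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q ** 2, axis=1)
    s2 = resid @ resid / (n - p)                 # residual variance estimate
    cooks = resid ** 2 * h / (p * s2 * (1.0 - h) ** 2)
    return h, cooks

# Simulated straight-line data with one gross error at the largest x
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=x.size)
y[-1] += 20.0

X = np.column_stack([np.ones_like(x), x])        # intercept and slope columns
h, cooks = leverage_and_cooks(X, y)
print("most influential point:", int(np.argmax(cooks)))
```

A point with a large residual but a small Cook's distance barely moves the fit, which is the situation where removing it changes little.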

James R Knaub

John, you wrote "This is not the same as a range that contains [Exactly] 95% of the values," and that's true. It is not how either a prediction interval or a confidence interval is technically defined. But it will generally be close.

If you assume all prediction intervals are the same, no matter what the predicted y, that is, if you assume homoscedasticity, then you will be liable to think you have more outliers among the larger cases, and fewer among the smaller cases than you actually do. That doesn't mean that will happen, just that it is more likely. The greater the actual heteroscedasticity, the worse that possibility will become.
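Here is a quick simulation of that effect, as a sketch only. The data are made up: there are no true outliers at all, but the error standard deviation grows with the square root of the predicted value (an assumed form). A single constant limit of ±2.576 standard deviations is then applied to all residuals, as one would under homoscedasticity.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
y_pred = np.linspace(1.0, 100.0, n)
# Heteroscedastic errors with no real outliers (assumed sd proportional to sqrt(y_pred))
resid = rng.normal(0.0, 0.3 * np.sqrt(y_pred))

# Homoscedastic treatment: one pooled standard deviation, one constant limit
s = resid.std(ddof=1)
flagged = np.abs(resid) > 2.576 * s

small = y_pred < np.median(y_pred)               # smaller half of the predictions
n_flag_small = int(flagged[small].sum())
n_flag_large = int(flagged[~small].sum())

print("flagged among smaller predictions:", n_flag_small)
print("flagged among larger predictions: ", n_flag_large)
```

Nearly all of the flagged points land among the larger predicted values, even though every point was generated by the same process.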

Jochen noted to beware of "...something strange happening during the generation of these values." Having been involved for many years in the production of official statistics at a statistical agency, I definitely endorse that approach. Just as model selection should be informed by subject matter expertise, data quality needs to be informed by process and methodology. What happened?

John M Wheeldon

Jim and Jochen, Thank you for your guidance on this question. Based on a recommendation from Alan Rawle, I used the 6-sigma approach to determine that 20 of the 22 data points lay outside these limits and so I eliminated them. The remaining two “outliers” fell outside the 99% confidence limits, where 3 or 4 of the 380 data points might be expected to lie. I retained these two data points. Of the 20 data points eliminated, I can identify the likely source of error for 18 of them, the source being the same for all 18. The source of error for the other two could be for two additional possible reasons, but I eliminated them anyway as they were amongst the worst offenders.
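The "3 or 4 of the 380" figure can be checked with a short calculation. This sketch assumes the points are independent and that each has exactly a 1% chance of falling outside the 99% limits, which is an idealization; the function name `binom_tail` is just for illustration.

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), summed directly over the upper tail."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))

n_points = 380
p_outside = 0.01                                 # chance of falling outside 99% limits

expected = n_points * p_outside                  # expected number outside by chance
tail_22 = binom_tail(n_points, p_outside, 22)    # chance of 22 or more outside

print(f"expected number outside: {expected:.1f}")
print(f"P(22 or more outside by chance): {tail_22:.2e}")
```

Under these assumptions, seeing 22 points outside the limits purely by chance is extremely unlikely, which fits with most of them sharing an identifiable source of error.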

Does this seem reasonable to you?

Jochen Wilhelm

Yes, as long as you clearly indicate how many values you excluded and for what reason. It is important to report that you excluded 2 values without having any reason other than "they looked bad". However, these 2 values are only about half a percent of the values in your study, so keeping them in should not have any considerable effect on your analysis or conclusions. To me, it looks better to note that there are a few outlying values, but that they were not excluded because they were not per se implausible/impossible and no other (external) reason could be identified why they might not be trustworthy.
