If there is evidence of heteroskedasticity from plots, should you plot the squared residuals against the explanatory variables to understand the functional form of the relationship and choose the most appropriate test?
Yes, that is possible. Based on the fitted linear regression model, plotting the squared residuals against the explanatory variables is a reasonable way to verify the regression conditions and to see the functional form of any variance pattern.
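For instance, a minimal R sketch (the data frame mydata and the variables x and y are hypothetical placeholders for your own data):

    # Fit a simple linear model; y and x stand in for your own variables.
    fit <- lm(y ~ x, data = mydata)

    # Plot squared residuals against the explanatory variable to see
    # the functional form of any variance pattern.
    plot(mydata$x, resid(fit)^2, xlab = "x", ylab = "Squared residuals")

    # A lowess smooth helps reveal the trend.
    lines(lowess(mydata$x, resid(fit)^2), col = "red")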
How to detect heteroscedasticity and rectify it?
One of the important assumptions of linear regression is that there should be no heteroscedasticity of residuals. In simpler terms, this means that the variance of the residuals should not increase with the fitted values of the response variable. In this post, I am going to explain why it is important to check for heteroscedasticity, how to detect it in your model, and, if it is present, how to rectify the problem, with example R code. This process is sometimes referred to as residual analysis.
Why is it important to check for heteroscedasticity?
It is customary to check for heteroscedasticity of residuals once you build the linear regression model. The reason is that we want to check whether the model thus built is failing to explain some pattern in the response variable Y that eventually shows up in the residuals. If it is, the result is an inefficient and unstable regression model that could yield bizarre predictions later on.
How to detect heteroscedasticity?
I am going to illustrate this with an actual regression model based on the cars dataset, which comes built-in with R. Let's first build the model using the lm() function.
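A minimal sketch of such a model, assuming we regress stopping distance (dist) on speed as is conventional for this dataset; the formal test at the end assumes the lmtest package is installed:

    # Build the linear model: stopping distance as a function of speed
    lmMod <- lm(dist ~ speed, data = cars)

    # Base-R diagnostic plots; the scale-location plot is the usual
    # visual check for non-constant residual variance.
    par(mfrow = c(2, 2))
    plot(lmMod)

    # A formal check: the Breusch-Pagan test from the lmtest package.
    # A small p-value suggests heteroscedasticity is present.
    library(lmtest)
    bptest(lmMod)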
I think you have the right idea to consider residual analysis graphics. However, I would normally keep the error structure intact.
With regard to the various regressors you mention, based on my experience and that of others, as noted in Särndal, C.E., Swensson, B. and Wretman, J. (1992), Model Assisted Survey Sampling, Springer-Verlag, you can use any regressor or combination of regressors as a size measure in the regression weights, but the best one may be the same combination and format of regressors that predicts y. (This is for prediction more than explanation.) In practice, at the US Energy Information Administration, where I led a group of statisticians developing electric power survey data applications, we used a preliminary prediction of y for the size measure, as part of the official data estimation procedure.
Econometrics texts and others often treat heteroscedasticity as an anomaly that must be removed in order to do hypothesis tests. However, it is often a natural part of the error structure, and hypothesis testing in practice is at best not very clearly interpretable and at worst may be misleading. Standard errors are often all that is needed. The estimated variance of the prediction error is designed to estimate variance, but it is also impacted by bias due to the way sigma is estimated, and it can be very useful. Also, test data can be useful. You could research the terms "model selection" and "model validation."
There are times that heteroscedasticity can be a symptom of a problem. If data that should be modeled by two separate models are mixed, then the (basically compromise) regression will show heteroscedasticity.
[By the way, caution: not knowing the nature of your data, etc., some things here - above and below - may or may not be of various levels of appropriateness for your application.]
Heteroscedasticity may show up in time series applications, but especially in finite population sampling with regression through the origin, you should expect substantial, naturally occurring heteroscedasticity. After all, for example, does it make sense to expect 1,000,000 +/- 100, 1,000 +/- 100, and 5 +/- 100 in many applications?
There are various ways to consider heteroscedasticity, but I think it most straightforward, especially for prediction, to consider a regression weight based on a size measure and a coefficient of heteroscedasticity.
Attached are some files on understanding heteroscedasticity and how to consider it in applications, mostly from the point of view of establishment survey applications, where there is one regressor for a finite population with regression through the origin. However, consider this: heteroscedasticity is on the predicted (i.e., dependent, y) variable, which (using G.S. Maddala's notation of * for a WLS estimate/prediction) is y*. So multiple regression can be written as y = y* + e = y* + (e_0)y*^gamma, where e_0i is the estimated random factor of the ith estimated residual, and gamma is the coefficient of heteroscedasticity. Using that, you can apply analyses that apply to y = bx + (e_0)x^gamma when assessing heteroscedasticity and its impact on predictions.
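As a rough R illustration of that idea (the data frame mydata, the regressor x, and the value of gamma are hypothetical assumptions here, not a prescription):

    # WLS through the origin with a size measure z and a coefficient of
    # heteroscedasticity gamma: the residual standard deviation is taken
    # to grow as z^gamma, so the regression weight is 1/z^(2*gamma).
    gamma <- 0.5                  # assumed working value for gamma
    z     <- mydata$x             # here the regressor itself is the size measure
    wlsMod <- lm(y ~ x - 1, data = mydata, weights = 1 / z^(2 * gamma))

    # The preliminary-prediction variant: use fitted values from an
    # unweighted fit through the origin as the size measure instead.
    olsMod  <- lm(y ~ x - 1, data = mydata)
    wlsMod2 <- lm(y ~ x - 1, data = mydata,
                  weights = 1 / fitted(olsMod)^(2 * gamma))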
Cheers - Jim
PS - Note that sometimes you may be looking for an outlier, and David's document is a way of addressing that, though in a simple example you may have found that quickly from graphical residual analyses.
Regarding the answers above, are they suitable to apply to a dataset that has more than one Gaussian component? If so, how do you check heteroscedasticity and homoscedasticity for each component? For instance, the Old Faithful Geyser data has two Gaussian components.
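One crude way to look at this in R, using the built-in faithful data and a simple threshold on eruption duration to assign observations to the two components (a proper mixture model would estimate this assignment instead):

    # Crude two-component split of the Old Faithful data: short vs. long eruptions.
    data(faithful)
    grp <- ifelse(faithful$eruptions < 3, "short", "long")

    # Fit a separate regression within each component and inspect its residuals.
    par(mfrow = c(1, 2))
    for (g in c("short", "long")) {
      sub <- faithful[grp == g, ]
      fit <- lm(waiting ~ eruptions, data = sub)
      plot(fitted(fit), resid(fit)^2, main = paste("Component:", g),
           xlab = "Fitted values", ylab = "Squared residuals")
    }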
Thank you for your answers. I invite you all to read the article "An Empirical Model for River Ecological Management with Uncertainty Evaluation". I used an econometric analysis to develop the model described in the article.