I need a data set of a practical example about a simple linear regression with heteroscedasticity to do my M.S. thesis. Could you please suggest such a data set?
There is a small data set at the end of https://www.researchgate.net/publication/261947825_Projected_Variance_for_the_Model-based_Classical_Ratio_Estimator_Estimating_Sample_Size_Requirements, which should give you a place to start.
Most papers available from my RG pages (see contributions starting at https://www.researchgate.net/profile/James_Knaub) show methodology I developed for handling heteroscedastic data from establishment surveys, used extensively at the US Energy Information Administration (EIA). A great deal of such data may be obtained from http://www.eia.gov, and/or by writing a request to [email protected].
Many of the papers I uploaded to ResearchGate were developed and used at the EIA for estimating missing (nonresponse and out-of-sample) data in electric power, natural gas, and other energy establishment surveys.
The setting is finite population sample surveys - often monthly sample surveys - with regressor data from less frequently gathered, often annual, census surveys of energy establishments. The same methodology may be used with any such highly skewed finite population.
Data requests may be made to [email protected], but you might first want to look at the EIA website and explore the data collection survey forms and aggregate data reports available. There are thousands of aggregate level values reported each month, and a great deal more microdata are collected from surveys and used to obtain them.
Cheers - Jim
Conference Paper Projected Variance for the Model-based Classical Ratio Estim...
Note that the paper attached to my previous post, and a number of others at https://www.researchgate.net/profile/James_Knaub/contributions, use the level of heteroscedasticity that Brewer, K.R.W. (2002), Combined survey sampling inference: Weighing Basu's elephants, Arnold: London and Oxford University Press, would associate with a cluster, as in Cochran, W.G. (1977), Sampling Techniques, 3rd ed., John Wiley & Sons, with independence between the elements in the cluster. The resulting estimator, the classical ratio estimator (CRE), often appears robust against data quality issues for prediction of y when x is small.
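To make the CRE concrete, here is a minimal Python sketch, assuming the usual model-based form: a regression through the origin, y_i = b*x_i + e_i, with the variance of e_i proportional to x_i, so that the regression weights are 1/x_i. Under that assumption the weighted least squares slope reduces to sum(y)/sum(x). The function names, the synthetic data, and the cutoff rule used to pick the sample are all made up for illustration and are not taken from any of the papers.

```python
import numpy as np

def cre_slope(x, y):
    """WLS slope through the origin with weights 1/x (assumes Var(e_i) proportional to x_i)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / x                                   # regression weights
    return np.sum(w * x * y) / np.sum(w * x * x)  # algebraically equal to sum(y) / sum(x)

def predicted_total(x_sample, y_sample, x_unobserved):
    """Observed y values plus predictions b*x for the cases with no y reported."""
    b = cre_slope(x_sample, y_sample)
    return np.sum(y_sample) + b * np.sum(x_unobserved)

# Synthetic, skewed population (illustration only, not EIA data).
rng = np.random.default_rng(0)
x_all = rng.lognormal(mean=3.0, sigma=1.0, size=200)                # regressor known for every unit
y_all = 1.2 * x_all + rng.normal(size=200) * np.sqrt(x_all)         # error sd grows like sqrt(x)

# Hypothetical cutoff sample: y is observed only for the largest 30% of units.
in_sample = x_all >= np.quantile(x_all, 0.7)

b = cre_slope(x_all[in_sample], y_all[in_sample])
total = predicted_total(x_all[in_sample], y_all[in_sample], x_all[~in_sample])
print("slope:", b)
print("predicted total:", total, "true total:", y_all.sum())
```

The same two functions apply to any sample, not only a cutoff sample; the regressor x just has to be available for the unobserved cases, for example from an earlier census.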
Note also that although heteroscedasticity also occurs in time series regression, the work you see on my ResearchGate pages will be for predictions involving finite populations, not time series.
If you are looking to estimate the level of heteroscedasticity in a given data set, rather than defaulting to the CRE, there are multiple methods. The Iterated Reweighted Least Squares Method is a common one, and is explained well in Carroll and Ruppert (1988), Transformation and Weighting in Regression, Chapman & Hall, Ltd., London, UK. Here are some other ideas:
And here is a paper showing the usefulness of weighted least squares regression, and as with my other papers, not just for predicting/estimating individual cases, but also for predicting/estimating totals for categories or groups or whole populations in finite population statistics:
PS - For multiple linear regression, or even more general multiple regression, one can find regression weights involving a coefficient of heteroscedasticity by using a preliminary prediction-of-y as the size measure in place of x.
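As a rough illustration of that idea, here is a Python sketch, assuming the error standard deviation is proportional to z_i**gamma, where z_i is a preliminary (ordinary least squares) prediction of y_i and gamma is the coefficient of heteroscedasticity. Gamma is estimated here by regressing log|residual| on log(z) and then refitting with weights z**(-2*gamma), repeated a few times. This residual-based estimate is just one simple option that I am assuming for the example; it is not necessarily the procedure in Carroll and Ruppert or in the attached papers, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares via rescaling rows by sqrt(w)."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

def estimate_gamma(X, y, n_iter=5):
    """Estimate gamma in sd(e_i) ~ z_i**gamma, with z_i a preliminary prediction of y_i."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = wls(X, y, np.ones(len(y)))       # preliminary OLS fit
    z = X @ beta                            # preliminary prediction of y, used as the size measure
    if np.any(z <= 0):
        raise ValueError("size measure must be positive to take logs")
    gamma = 0.0
    for _ in range(n_iter):
        resid = y - X @ beta
        # Slope of log|residual| on log(z) is a simple estimate of gamma.
        gamma, _ = np.polyfit(np.log(z), np.log(np.abs(resid) + 1e-12), 1)
        beta = wls(X, y, z ** (-2.0 * gamma))   # refit with the implied regression weights
    return gamma, beta

# Synthetic example with two regressors and no intercept (illustration only).
rng = np.random.default_rng(1)
X = rng.lognormal(mean=2.0, sigma=1.0, size=(2000, 2))
mean_y = X @ np.array([2.0, 3.0])
y = mean_y + rng.normal(size=2000) * mean_y ** 0.5   # true gamma = 0.5

gamma_hat, beta_hat = estimate_gamma(X, y)
print("estimated gamma:", gamma_hat)
print("estimated coefficients:", beta_hat)
```

With a single regressor and no intercept, the size measure z is just a multiple of x, so this reduces to the simple linear regression case discussed above.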
PPS - As noted, a great deal of data suited to these methods are available from http://www.eia.gov, and by contacting the US EIA using the email address supplied previously.
Article HETEROSCEDASTICITY AND HOMOSCEDASTICITY
Article Weighting in Regression for Use in Survey Methodology
Article Properties of Weighted Least Squares Regression for Cutoff S...
Conference Paper Alternative to the Iterated Reweighted Least Squares Method ...