How can we correct for autocorrelation and/or serial correlation post estimation?

02 February 2019

I am using nonlinear regression models (SVM, Random Forest) on a pooled regression problem. I have high autocorrelation and some cross-sectional correlation, and I want to report a performance metric such as R^2 that is corrected for these issues, or some variant of Newey-West that I can apply post estimation (directly to my errors, or to the predicted and realized values).

I also have an issue in reporting my R^2 cross-sectionally. My dependent variable varies between 0 and 1, and its cross-sectional average changes drastically in some periods. As a result the models get a negative R^2, suggesting they are very bad, but the scatter plot of predicted vs. realized values suggests a relationship between the two (for example, the correlation is 0.5). What would be the best way to deal with this problem?

Thanks

Stefano Nembrini

Do you mean that you are interested in predicting an outcome variable? What is the size of your sample and number of regressors?

There are various ways to look at an R squared; one is the squared correlation between observed and predicted values. Autocorrelation is an issue for inference on the parameters in a linear model (the Newey–West approach gives you corrected standard errors), but it does not change how the R squared is computed.

R squared could also be computed as 1 - MSE / Var(y); if it is computed out-of-sample, it can go to -Inf if the model is really bad. I would prefer this definition over the squared correlation.

I personally prefer (out-of-sample) Lin's Concordance Correlation Coefficient, because if the predicted values are merely a linear transformation of the true values (shifted or rescaled, say), you can get a large squared correlation even though the prediction is not good.
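To make the difference between these three measures concrete, here is a minimal sketch in plain Python. The function names and the toy data are made up for illustration; the toy predictions are perfectly correlated with the outcome but shifted by a constant, which is roughly the situation described in the question (decent correlation, negative R squared).

```python
def mean(xs):
    return sum(xs) / len(xs)

def r2_squared_corr(y, yhat):
    """Squared Pearson correlation between observed and predicted values."""
    my, mp = mean(y), mean(yhat)
    cov = sum((a - my) * (b - mp) for a, b in zip(y, yhat))
    vy = sum((a - my) ** 2 for a in y)
    vp = sum((b - mp) ** 2 for b in yhat)
    return cov ** 2 / (vy * vp)

def r2_mse(y, yhat):
    """1 - MSE / Var(y); negative when the model does worse than mean(y)."""
    my = mean(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    sst = sum((a - my) ** 2 for a in y)
    return 1.0 - sse / sst

def lin_ccc(y, yhat):
    """Lin's concordance correlation coefficient (1/n moments)."""
    n = len(y)
    my, mp = mean(y), mean(yhat)
    cov = sum((a - my) * (b - mp) for a, b in zip(y, yhat)) / n
    vy = sum((a - my) ** 2 for a in y) / n
    vp = sum((b - mp) ** 2 for b in yhat) / n
    # The (my - mp)^2 term penalizes a shift in level; vy + vp penalizes rescaling.
    return 2.0 * cov / (vy + vp + (my - mp) ** 2)

# Toy data (illustrative only): predictions are the truth plus a constant 0.3.
y = [0.1, 0.2, 0.3, 0.4, 0.5]
yhat = [v + 0.3 for v in y]

print(r2_squared_corr(y, yhat))  # close to 1: perfect linear relationship
print(r2_mse(y, yhat))           # strongly negative: the level is wrong
print(lin_ccc(y, yhat))          # well below 1: the shift is penalized
```

The squared correlation ignores the shift entirely, while 1 - MSE / Var(y) and the CCC both penalize it, which is why they can disagree so sharply on the same predictions.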

Saad Mouti

Stefano Nembrini Thank you for your answer. Yes, I am interested in predicting an outcome variable using 100 regressors on an unbalanced panel with 1.7M rows (451 dates and a changing number of individuals, 3800 on average).

I like Lin's CCC, but again I am confronted with heteroscedasticity and autocorrelation in the outcome (and errors). I would want to correct the bias in case some highly correlated individuals are driving my measure up.

Thank you again!

Stefano Nembrini

I understand. You could do repeated cross-validation or bootstrap the training data so that the data you use for testing are less correlated with your training samples. Let's say you have 50 states with 100 observations each: you can assume that correlation within a state is high but that between states you're fine.

Then you can create a training set with 40 states and use the remaining 10 for testing.
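The state-based hold-out described above can be sketched as follows. This is a minimal illustration in plain Python; `group_holdout_split` is a hypothetical helper name, and the 50 x 100 layout simply mirrors the example in the text.

```python
import random

def group_holdout_split(groups, n_test_groups, seed=0):
    """Return (train_idx, test_idx) so that no group appears on both sides."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    test_groups = set(rng.sample(unique, n_test_groups))
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# 50 hypothetical states with 100 observations each.
groups = [state for state in range(50) for _ in range(100)]
train_idx, test_idx = group_holdout_split(groups, n_test_groups=10)

print(len(train_idx), len(test_idx))
```

Repeating the split with different seeds gives the repeated cross-validation variant: every repetition holds out a different set of whole states.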

How does that sound?

Saad Mouti

I already make sure that my test data is not correlated with my training data (I leave a lag between them for the autocorrelation to fade). Also, even though I am doing a cross-sectional analysis, I want to make sure my data respects chronology (I don't want to train my model on recent events and test it on earlier ones, which is what prevents me from trying some variant of k-fold cross-validation).

Now my test data (after lagging it so that my estimators have no information about it) still has heteroscedasticity and autocorrelation.
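For readers who want to see what such a chronological split with a lag looks like, here is a rough sketch. The helper name `walk_forward_splits` and all parameter values (12 test dates, a gap of 12 dates, 120 initial training dates) are assumptions chosen for illustration, not the settings actually used here.

```python
def walk_forward_splits(dates, n_test_dates, embargo, min_train_dates):
    """Yield (train_dates, test_dates) pairs that respect chronology.

    Training always stops `embargo` dates before the test block begins,
    so the skipped dates are used neither for training nor for testing.
    """
    ordered = sorted(set(dates))
    start = min_train_dates + embargo  # index of the first test date
    while start + n_test_dates <= len(ordered):
        train = ordered[: start - embargo]
        test = ordered[start : start + n_test_dates]
        yield train, test
        start += n_test_dates

# 451 dates, represented here by integer indices.
dates = list(range(451))
splits = list(
    walk_forward_splits(dates, n_test_dates=12, embargo=12, min_train_dates=120)
)

first_train, first_test = splits[0]
print(first_train[-1], first_test[0])  # last training date, first test date
```

Each test block lies strictly after its training block, and the training window only grows forward in time, so the model is never fitted on events later than the ones it is tested on.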

To be more precise, my dependent variable is a function of one-month stock returns over one year, which I observe on a window that moves by one month. So consecutive observations of the measure coincide in 11 values, and some stocks are also highly correlated. My main concern is: can I correct my measure so that it is the least biased by these effects?
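One possible way to apply a Newey-West style correction after estimation, offered only as a sketch and not as something proposed in this thread, is to first collapse the test errors to one number per date (for instance the cross-sectional mean squared error on that date) and then compute a Newey-West standard error for the time-series mean of that per-date series. Averaging within each date first means cross-sectional correlation shows up in the variability of the per-date values rather than being ignored, and the Bartlett-weighted autocovariances account for the overlap across dates. The function name, the choice of 11 lags (matching the 11 shared values) and the simulated series below are all assumptions for illustration.

```python
import math
import random

def newey_west_se_of_mean(x, lags):
    """Newey-West (Bartlett kernel) standard error of the sample mean of x."""
    n = len(x)
    m = sum(x) / n
    d = [v - m for v in x]
    # Start with the lag-0 autocovariance, then add weighted higher lags.
    long_run_var = sum(v * v for v in d) / n
    for k in range(1, lags + 1):
        gamma_k = sum(d[t] * d[t - k] for t in range(k, n)) / n
        weight = 1.0 - k / (lags + 1)  # Bartlett weights decline linearly
        long_run_var += 2.0 * weight * gamma_k
    return math.sqrt(long_run_var / n)

# Simulated stand-in for a per-date metric: a 12-term moving sum of shocks,
# so that consecutive values share 11 terms (illustrative data only).
rng = random.Random(0)
shocks = [rng.gauss(0.0, 1.0) for _ in range(451 + 11)]
per_date_metric = [sum(shocks[t : t + 12]) for t in range(451)]

naive_se = newey_west_se_of_mean(per_date_metric, lags=0)
hac_se = newey_west_se_of_mean(per_date_metric, lags=11)
print(naive_se, hac_se)
```

With `lags=0` the function reduces to the usual standard error that treats dates as independent; with overlapping windows the corrected value is typically much larger, which gives a more honest picture of how precisely the average performance is measured. It does not change the point estimate of the metric itself.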

Thank you again Stefano!
