I have a dataset with a binary response variable and 22 predictor variables that are multicollinear. My goal is to build a predictive model. The data were sampled every 30 seconds over three days, and there is strong autocorrelation between y(t) and y(t-1). I fitted a PCA regression with time and y(t-1) included as predictor variables. It seems to be working: my misclassification error is about 3%. Are there any statistical issues with what I have done? I standardized my predictor variables, including time.

I also tried using stepwise logistic regression to select variables before doing PCA. It dropped time and kept y(t-1), and I again got a misclassification rate of about 3%.

I am using R.
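
A minimal sketch of this kind of workflow in R, on simulated data rather than the real dataset. Everything specific here is an illustrative assumption: the variable names, the simulated data-generating process, the five retained components, the 70/30 chronological split, and the 0.5 classification cutoff. It only shows the general shape of a principal-components logistic regression with time and y(t-1) among the standardized predictors.

```r
set.seed(1)

# Simulated stand-in for the real data (assumption): 3 days at 30-second
# intervals gives 3 * 24 * 120 = 8640 rows.
n <- 8640
p <- 22
time <- seq_len(n)

# Collinear predictors: 22 noisy mixtures of 3 latent factors.
latent <- matrix(rnorm(n * 3), n, 3)
X <- latent %*% matrix(rnorm(3 * p), 3, p) +
  matrix(rnorm(n * p, sd = 0.3), n, p)
colnames(X) <- paste0("x", seq_len(p))

# Autocorrelated binary response: y(t) depends on y(t-1) and one latent factor.
y <- integer(n)
for (t in 2:n) {
  eta <- -2 + 4 * y[t - 1] + 0.5 * latent[t, 1]
  y[t] <- rbinom(1, 1, plogis(eta))
}

# Add the lagged response; the first row has no lag, so it is dropped.
dat <- data.frame(y = y[-1], y_lag = y[-n], time = time[-1], X[-1, ])

# Chronological split (assumption): first 70% for fitting, last 30% held out.
n_train <- floor(0.7 * nrow(dat))
train <- dat[seq_len(n_train), ]
test  <- dat[(n_train + 1):nrow(dat), ]

pred_cols <- setdiff(names(dat), "y")

# PCA on centered and scaled predictors (including time and y_lag),
# estimated on the training rows only.
pca <- prcomp(train[, pred_cols], center = TRUE, scale. = TRUE)
k <- 5  # number of components kept; an arbitrary illustrative choice

train_pc <- data.frame(y = train$y, pca$x[, 1:k])
test_pc  <- data.frame(predict(pca, newdata = test[, pred_cols])[, 1:k])

# Logistic regression on the retained component scores.
fit  <- glm(y ~ ., data = train_pc, family = binomial)
prob <- predict(fit, newdata = test_pc, type = "response")

# Held-out misclassification rate at a 0.5 cutoff.
misclass <- mean((prob > 0.5) != test$y)
misclass
```

The split is chronological rather than random because the observations are autocorrelated; a random split would put near-duplicate neighboring time points on both sides of the split.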
