I want to impute missing data in several variables and am currently specifying my imputation model. Which auxiliary variables, beyond the variables of the analysis model, should I include in it?

I have read that auxiliary variables should be correlated with the variables that have missing values; the recommendation is r > 0.4 (https://stats.idre.ucla.edu/stata/seminars/mi_in_stata_pt1_new/). But if I understand Stef van Buuren correctly, multicollinearity could also be a problem:

"For datasets containing hundreds or thousands of variables, using all predictors may not be feasible (because of multicollinearity and computational problems) to include all these variables. It is also not necessary. In my experience, the increase in explained variance in linear regression is typically negligible after the best, say, 15 variables have been included."(https://stefvanbuuren.name/fimd/sec-modelform.html)

What is the highest correlation that a variable in the imputation model should have with the variables being imputed and with the other auxiliary variables? And how do I find suitable auxiliary variables for my imputation model in general?
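
To make the question concrete, this is roughly the kind of screening I have in mind, as a minimal Python sketch using pandas and NumPy. The function name `screen_auxiliaries`, the 0.9 collinearity cutoff, and the toy data are my own assumptions for illustration; only the 0.4 threshold and the cap of about 15 predictors come from the sources quoted above.

```python
import numpy as np
import pandas as pd


def screen_auxiliaries(df, target, min_corr=0.4, max_collinear=0.9, max_vars=15):
    """Pick auxiliary variables for imputing `target` (hypothetical helper)."""
    # Absolute Pearson correlations; pandas uses pairwise-complete observations.
    corr = df.corr().abs()
    # Keep candidates that correlate at least `min_corr` with the target,
    # strongest first.
    strength = corr[target].drop(target)
    strength = strength[strength >= min_corr].sort_values(ascending=False)

    selected = []
    for name in strength.index:
        if len(selected) == max_vars:
            break
        # Skip a candidate that is nearly redundant with one already kept.
        # The 0.9 cutoff is an assumed value, not a published recommendation.
        if all(corr.loc[name, kept] < max_collinear for kept in selected):
            selected.append(name)
    return selected


# Toy data (made up): x2 nearly duplicates x1, x3 is unrelated noise.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = rng.normal(size=n)
y = x1 + x4 + 0.5 * rng.normal(size=n)
y[rng.random(n) < 0.3] = np.nan  # roughly 30% missing completely at random

df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3, "x4": x4})
selected = screen_auxiliaries(df, "y")
print(selected)
```

The intent is that the sketch keeps `x4`, keeps only one of the near-duplicates `x1` and `x2`, and drops `x3`. It looks only at correlations with the incomplete variable itself, not with its missingness indicator, so I am unsure whether a rule like this is sufficient or even sensible.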
