Consider the common case of an imbalanced dataset that also has missing values and categorical features. We need to impute the missing values, apply some form of balancing, and encode the categorical variables. With a simple train-test split, we set aside a test set and never perform any learning on it (for example, we call sklearn's fit or fit_transform only on the training set and use only transform on the test set). Generally, what is the appropriate order of these steps with K-fold cross-validation, so that no estimator learns from the test fold during the process?
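To make the question concrete, here is a minimal sketch of the setup I have in mind (the data is synthetic and the column names `num` and `cat` are just placeholders). Wrapping imputation and encoding in a sklearn `Pipeline` and passing it to `cross_val_score` makes each fold fit the transformers on the training folds only; here imbalance is handled via `class_weight` rather than resampling, since an actual resampler like SMOTE would need imbalanced-learn's own `Pipeline` to run only on the training folds:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: one numeric column with missing values,
# one categorical column, and an imbalanced binary target.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "num": rng.normal(size=n),
    "cat": rng.choice(["a", "b", "c"], size=n),
})
X.loc[::10, "num"] = np.nan          # inject missing values
y = (rng.random(n) < 0.2).astype(int)  # ~20% positive class

numeric = Pipeline([("impute", SimpleImputer(strategy="median"))])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer([
    ("num", numeric, ["num"]),
    ("cat", categorical, ["cat"]),
])

# class_weight="balanced" addresses imbalance without resampling,
# so everything fits inside a plain sklearn Pipeline.
model = Pipeline([
    ("pre", pre),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# cross_val_score refits the whole pipeline (imputers, encoder,
# classifier) on the training folds of each split, so the test fold
# is only ever passed through transform/predict.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores)
```

This is the pattern I understand to be leakage-free for the imputation and encoding steps; my question is where balancing fits into this ordering when it is done by resampling rather than by class weights.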
