In the case of imbalanced classification data, oversampling is a standard technique to prevent the learner from being biased toward the majority class. When combined with cross-validation, there are two choices: 1) perform oversampling once, before running cross-validation; or 2) perform oversampling within cross-validation, i.e. for each fold, oversample only the training portion before fitting, and repeat this for every fold. Approach 1) is computationally cheaper, but model selection is then based on average performance measured on data that includes artificial samples; approach 2) is more expensive, but it makes oversampling part of the model selection process and keeps the validation folds free of synthetic data. Which approach is more suitable? A sketch of approach 2) is included below for reference.
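
For reference, here is a minimal sketch of approach 2), assuming scikit-learn and imbalanced-learn are available; the dataset, oversampler, and estimator are illustrative choices, not part of the original question.

```python
# Approach 2: oversampling applied inside each cross-validation fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

# Synthetic imbalanced dataset: roughly 10% positive class.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# The imblearn Pipeline applies the sampler only when fitting, so within
# cross-validation the oversampling affects only the training part of each
# fold; the validation fold is scored on real, untouched data.
model = Pipeline([
    ("oversample", RandomOverSampler(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print("Per-fold F1:", scores, "mean:", scores.mean())
```

Approach 1), by contrast, would call the oversampler once on the full dataset before splitting into folds.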
