In the case of imbalanced classification data, oversampling is a standard technique to prevent the learner from being biased toward the majority class. When combined with cross-validation, there are two choices: 1) perform oversampling once, before running cross-validation; or 2) perform oversampling within cross-validation, i.e. for each fold, oversample only the training portion before fitting, and repeat this for every fold. Approach 1) is computationally cheaper, but model selection is then based on average performance measured on data that includes artificial samples; approach 2) is more expensive, but it makes oversampling part of the model selection process and keeps the validation folds free of synthetic data. Which approach is more suitable? A sketch of approach 2) is included below for reference.
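
For reference, here is a minimal sketch of approach 2), assuming scikit-learn and imbalanced-learn are available; the dataset, oversampler, and estimator are illustrative choices, not part of the original question.

```python
# Approach 2: oversampling applied inside each cross-validation fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

# Synthetic imbalanced dataset: roughly 10% positive class.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# The imblearn Pipeline applies the sampler only when fitting, so within
# cross-validation the oversampling affects only the training part of each
# fold; the validation fold is scored on real, untouched data.
model = Pipeline([
    ("oversample", RandomOverSampler(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print("Per-fold F1:", scores, "mean:", scores.mean())
```

Approach 1), by contrast, would call the oversampler once on the full dataset before splitting into folds.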
