I am working with an extremely imbalanced dataset of 44 samples in total for my research project. It is a binary classification problem with 3/44 samples in the minority class, and I am using leave-one-out cross-validation (LOOCV). If I apply SMOTE oversampling to the entire dataset before the LOOCV loop, prediction accuracy and ROC AUC are close to 90% and 0.9 respectively. However, if I oversample only the training set inside the LOOCV loop, which seems like the more logical approach, the ROC AUC falls as low as 0.3.
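
To make the second setup concrete, here is a minimal sketch of what I mean by oversampling inside the loop. The data, classifier, and `k_neighbors=1` setting are placeholders for my actual pipeline (with only 2–3 minority samples left in each training fold, SMOTE's default `k_neighbors=5` would fail), and I'm assuming imbalanced-learn's `SMOTE.fit_resample`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut
from imblearn.over_sampling import SMOTE

# Toy data mimicking my setting: 44 samples, 3 of them in the minority class
X, y = make_classification(n_samples=44, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[41 / 44],
                           flip_y=0, random_state=0)

probs = []  # predicted minority-class probability for each held-out sample
for train_idx, test_idx in LeaveOneOut().split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    # Resample only the training fold; k_neighbors must stay below the number
    # of minority samples left in the fold (at most 3 here)
    X_res, y_res = SMOTE(k_neighbors=1, random_state=0).fit_resample(X_train, y_train)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    probs.append(clf.predict_proba(X[test_idx])[0, 1])

print("LOOCV ROC AUC:", roc_auc_score(y, probs))
```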

I also tried precision-recall curves and stratified k-fold cross-validation (sketched below), but saw a similar discrepancy between oversampling outside and inside the loop. Where is the right place to oversample, and what explains the discrepancy?
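
For completeness, this is roughly how I kept SMOTE inside the folds for the stratified k-fold runs, continuing the sketch above (same `X`, `y`, `SMOTE`, and `LogisticRegression`). Using imbalanced-learn's `Pipeline` and `average_precision` scoring are my choices here, not necessarily the only way; the pipeline resamples each training fold during fitting and leaves the validation fold untouched:

```python
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

# SMOTE is fit and applied on each training fold only; the validation fold
# is never resampled, so the scores reflect the original class balance
pipe = Pipeline([("smote", SMOTE(k_neighbors=1, random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
print(cross_val_score(pipe, X, y, cv=cv, scoring="average_precision"))
```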
