What is a correct way to do cross-validation on an imbalanced data set? The question has three parts.

1. Oversample the minority class (using SMOTE, ADASYN, etc.), then split the data into 10 folds, train the classifier on nine folds and test on the tenth, repeat this 10 times, and average the metric. Doesn't this risk overfitting, since synthetic copies of minority examples can end up in both the training and test folds?

2. Alternatively, first split the data set into 10 folds, oversample the minority class only in the nine training folds, train the classifier, and test it on the original (not oversampled) tenth fold; repeat 10 times and average. Does this violate the basic assumption that the training and test sets follow the same distribution? (A code sketch of this setup is included after the list.)

3. If the minority class is oversampled until it matches the size of the majority class, is it still necessary to report F-measure, G-mean, and AUC, or is accuracy sufficient?
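For reference, here is a minimal sketch of the setup described in part 2, assuming Python with scikit-learn and imbalanced-learn; the synthetic data set and the random forest classifier are placeholders. Because the SMOTE step sits inside an imblearn Pipeline, the oversampling is fitted and applied only to the nine training folds, and each held-out fold keeps its original class distribution.

# Sketch of part 2: oversample only inside the training folds of a 10-fold CV.
# Assumes scikit-learn and imbalanced-learn; data and classifier are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.metrics import geometric_mean_score

# Placeholder imbalanced data (roughly 10% minority class).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# SMOTE inside an imblearn Pipeline is applied only when the pipeline is fitted,
# i.e. only to the nine training folds; the tenth (test) fold is left as-is.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Report several metrics, as in part 3 of the question (G-mean via imblearn).
scoring = {
    "f1": "f1",
    "roc_auc": "roc_auc",
    "g_mean": make_scorer(geometric_mean_score),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = cross_validate(pipe, X, y, cv=cv, scoring=scoring)

for name in scoring:
    print(name, results[f"test_{name}"].mean())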
