There are several remedies for handling class imbalance in the literature. Here I will try to summarize some of them.
Model Tuning
The simplest approach to counteracting the negative effects of class imbalance is to tune the model's parameters to maximize the accuracy of the minority class(es), for example by using sensitivity rather than overall accuracy as the tuning criterion.
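As a rough illustration, here is a minimal sketch of this idea; scikit-learn, the random forest, and the synthetic dataset are my own choices for illustration, not part of the original suggestion:

```python
# A rough sketch of model tuning aimed at the minority class: instead of
# overall accuracy, the grid search scores candidates by recall on the
# minority class (its "accuracy").
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# synthetic imbalanced data: ~90% class 0, ~10% class 1 (the minority)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

minority_recall = make_scorer(recall_score, pos_label=1)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None], "n_estimators": [100, 300]},
    scoring=minority_recall,  # optimize sensitivity, not overall accuracy
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```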
Alternate Cutoffs
When there are two possible outcome categories, another method for increasing the prediction accuracy of the minority class samples is to determine alternative cutoffs for the predicted probabilities, which effectively changes the definition of a predicted event. The most straightforward approach is to use the ROC curve, since it calculates the sensitivity and specificity across a continuum of cutoffs. Using this curve, an appropriate balance between sensitivity and specificity can be determined.
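For a concrete picture, here is a minimal sketch of picking a cutoff from the ROC curve. Maximizing Youden's J statistic (sensitivity + specificity - 1) is one common criterion I've chosen for illustration; any point on the curve that suits your trade-off could be used instead:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)
cutoff = thresholds[np.argmax(tpr - fpr)]  # Youden's J maximum

# redefine "predicted event" using the new cutoff instead of 0.5
preds = (probs >= cutoff).astype(int)
print(f"chosen cutoff: {cutoff:.3f}, events predicted: {preds.sum()}")
```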
Adjusting Prior Probabilities
Some models, such as naive Bayes and discriminant analysis classifiers, use prior probabilities. Unless specified manually, these models typically derive the values of the priors from the training data. Some researchers suggest that priors reflecting the natural class imbalance will materially bias predictions toward the majority class. Using more balanced priors or a balanced training set may help deal with a class imbalance.
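A minimal sketch of overriding the estimated priors, assuming scikit-learn: GaussianNB's `priors` argument (and likewise the `priors` argument of its discriminant analysis classifiers) accepts manually specified values.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

nb_natural = GaussianNB().fit(X, y)                    # priors ~ [0.9, 0.1], from data
nb_balanced = GaussianNB(priors=[0.5, 0.5]).fit(X, y)  # forced balanced priors

print(nb_natural.class_prior_, nb_balanced.class_prior_)
```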
Unequal Case Weights
Many predictive models for classification can use case weights, where each individual data point is given more or less emphasis during model training. One approach to rebalancing the training set is to increase the weights of the samples in the minority classes. For many models, this can be interpreted as having duplicate data points with the exact same predictor values. Logistic regression, for example, can utilize case weights in this way.
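A minimal sketch with scikit-learn's logistic regression; the 9x weight below is just a placeholder (roughly the inverse of the class frequencies in the synthetic data), not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# give each minority-class sample 9x the weight of a majority-class sample
weights = np.where(y == 1, 9.0, 1.0)
model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)

# equivalently, scikit-learn can derive such weights from class frequencies
model_auto = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```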
Sampling Methods
Basically, instead of having the model deal with the imbalance, we can attempt to balance the class frequencies. Taking this approach eliminates the fundamental imbalance issue that plagues model training. However, if the training set is sampled to be balanced, the test set should be sampled to be more consistent with the state of nature and should reflect the imbalance so that honest estimates of future performance can be computed. If an a priori sampling approach is not possible, then there are post hoc sampling approaches that can help attenuate the effects of the imbalance during model training. Two general post hoc approaches are down-sampling and up-sampling the data. Up-sampling is any technique that simulates or imputes additional data points to improve balance across classes, while down-sampling refers to any technique that reduces the number of samples to improve the balance across classes.
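A minimal sketch of both post hoc approaches using sklearn.utils.resample; the synthetic data and 50/50 target balance are my own choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_min, X_maj = X[y == 1], X[y == 0]

# up-sampling: draw minority samples with replacement until sizes match
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

# down-sampling: draw a majority subset without replacement
X_maj_dn = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)

print(len(X_min_up), len(X_maj), "|", len(X_maj_dn), len(X_min))
```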
Cost-Sensitive Training
Instead of optimizing the typical performance measure, such as accuracy or impurity, some models can alternatively optimize a cost or loss function that differentially weights specific types of errors. For example, it may be appropriate to believe that misclassifying true events (false negatives) is X times as costly as incorrectly predicting nonevents (false positives).
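One way to express such asymmetric costs, assuming scikit-learn, is through a model's class_weight argument; the 10x ratio below stands in for the "X times as costly" factor above.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# missing a true event (class 1) is penalized 10x as heavily during training
model = SVC(class_weight={0: 1, 1: 10}).fit(X, y)
```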
Both Andreas and Gino have offered you good techniques for dealing with an imbalanced class distribution, and I would like to add another common technique, which may be useful for you.
Generally, it's quite common to have an imbalanced class distribution within your dataset. To deal with this problem, there are two common methods: oversampling and undersampling.
With the oversampling method, you duplicate observations of the minority class to obtain a balanced dataset. With the undersampling method, you drop observations of the majority class to obtain an equal class distribution. You can try several different techniques and compare the results; that will give you a general idea of what is going on with your dataset and which method works better.
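If it helps, here is a minimal sketch using the imbalanced-learn package (pip install imbalanced-learn), one tool that implements both methods so the resampling does not have to be done manually:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# oversampling: duplicate minority-class rows until classes are balanced
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled:", Counter(y_over))

# undersampling: drop majority-class rows until classes are balanced
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_under))
```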
@Andreas Theissler Sir, thanks for your reply. I will try your suggestions. One problem with my dataset is that it consists of some 23 classes, and I want to consider them all. Thanks for the links you shared; I will go through these papers.
@Gino Tesei Sir, thanks for your reply. The weighted approaches seem better; do you have any idea which classifiers take class weights into consideration?
@Ahmed Aljaaf Sir, thanks for your reply. Can you suggest some tools for oversampling or undersampling of data, or do I need to do it manually?
I'm studying imbalanced classification at the moment. The answers you've gotten above are already very good. I'll add a few comments.
There are literally hundreds of papers written on this problem because, in addition to rebalancing example sets (a general technique), nearly every specific model type has one or more adaptations described in the research literature to let it deal with imbalanced classes.
Rebalancing classes is the easiest and most general approach, though it is not necessarily optimal. You can do it manually, but most packages (e.g., Python's scikit-learn) have class-balancing sampling code.
But since you have 23 classes, you may have a different problem. In addition to, or instead of, having skewed example sets, you have a large multi-class problem. You might be better off reading about multi-class classification solutions instead of imbalanced sets. Personally, I'd start with a decision tree because it can handle multiple classes naturally in a single model, rather than your having to organize one-vs-rest or one-vs-one approaches.
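A minimal sketch of that suggestion, assuming scikit-learn; the 23-class synthetic data is just a stand-in for the real dataset, and class_weight="balanced" is my own addition to offset skewed class frequencies:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for a 23-class problem (your data would go here)
X, y = make_classification(n_samples=5000, n_classes=23, n_informative=10,
                           random_state=0)

# one tree handles all classes directly; no one-vs-rest wrapper needed
tree = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)
print(len(tree.classes_), "classes handled in a single model")
```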