There are several remedies for handling class imbalance in the literature. Here I will try to summarize some of them.
Model Tuning
The simplest approach to counteracting the negative effects of class imbalance is to tune the model's parameters to maximize the accuracy of the minority class(es), for example by using sensitivity rather than overall accuracy as the tuning criterion.
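As a rough illustration, here is a minimal sketch of this idea; scikit-learn, the random forest, and the synthetic dataset are my own choices for illustration, not part of the original suggestion:

```python
# A rough sketch of model tuning aimed at the minority class: instead of
# overall accuracy, the grid search scores candidates by recall on the
# minority class (its "accuracy").
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# synthetic imbalanced data: ~90% class 0, ~10% class 1 (the minority)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

minority_recall = make_scorer(recall_score, pos_label=1)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None], "n_estimators": [100, 300]},
    scoring=minority_recall,  # optimize sensitivity, not overall accuracy
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```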
Alternate Cutoffs
When there are two possible outcome categories, another method for increasing the prediction accuracy of the minority class samples is to determine alternative cutoffs for the predicted probabilities, which effectively changes the definition of a predicted event. The most straightforward approach is to use the ROC curve, since it calculates the sensitivity and specificity across a continuum of cutoffs. Using this curve, an appropriate balance between sensitivity and specificity can be determined.
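For a concrete picture, here is a minimal sketch of picking a cutoff from the ROC curve. Maximizing Youden's J statistic (sensitivity + specificity - 1) is one common criterion I've chosen for illustration; any point on the curve that suits your trade-off could be used instead:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)
cutoff = thresholds[np.argmax(tpr - fpr)]  # Youden's J maximum

# redefine "predicted event" using the new cutoff instead of 0.5
preds = (probs >= cutoff).astype(int)
print(f"chosen cutoff: {cutoff:.3f}, events predicted: {preds.sum()}")
```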
Adjusting Prior Probabilities
Some models, such as naive Bayes and discriminant analysis classifiers, use prior probabilities. Unless specified manually, these models typically derive the values of the priors from the training data. Some researchers suggest that priors reflecting the natural class imbalance will materially bias predictions toward the majority class. Using more balanced priors or a balanced training set may help deal with a class imbalance.
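A minimal sketch of overriding the estimated priors, assuming scikit-learn: GaussianNB's `priors` argument (and likewise the `priors` argument of its discriminant analysis classifiers) accepts manually specified values.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

nb_natural = GaussianNB().fit(X, y)                    # priors ~ [0.9, 0.1], from data
nb_balanced = GaussianNB(priors=[0.5, 0.5]).fit(X, y)  # forced balanced priors

print(nb_natural.class_prior_, nb_balanced.class_prior_)
```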
Unequal Case Weights
Many predictive models for classification can use case weights, where each individual data point is given more or less emphasis during model training. One approach to rebalancing the training set is to increase the weights of the samples in the minority classes. For many models, this can be interpreted as having duplicate data points with the exact same predictor values. Logistic regression, for example, can utilize case weights in this way.
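A minimal sketch with scikit-learn's logistic regression; the 9x weight below is just a placeholder (roughly the inverse of the class frequencies in the synthetic data), not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# give each minority-class sample 9x the weight of a majority-class sample
weights = np.where(y == 1, 9.0, 1.0)
model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)

# equivalently, scikit-learn can derive such weights from class frequencies
model_auto = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```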
Sampling Methods
Basically, instead of having the model deal with the imbalance, we can attempt to balance the class frequencies. Taking this approach eliminates the fundamental imbalance issue that plagues model training. However, if the training set is sampled to be balanced, the test set should be sampled to be more consistent with the state of nature and should reflect the imbalance so that honest estimates of future performance can be computed. If an a priori sampling approach is not possible, then there are post hoc sampling approaches that can help attenuate the effects of the imbalance during model training. Two general post hoc approaches are down-sampling and up-sampling the data. Up-sampling is any technique that simulates or imputes additional data points to improve balance across classes, while down-sampling refers to any technique that reduces the number of samples to improve the balance across classes.
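A minimal sketch of both post hoc approaches using sklearn.utils.resample; the synthetic data and 50/50 target balance are my own choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_min, X_maj = X[y == 1], X[y == 0]

# up-sampling: draw minority samples with replacement until sizes match
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

# down-sampling: draw a majority subset without replacement
X_maj_dn = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)

print(len(X_min_up), len(X_maj), "|", len(X_maj_dn), len(X_min))
```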
Cost-Sensitive Training
Instead of optimizing the typical performance measure, such as accuracy or impurity, some models can alternatively optimize a cost or loss function that differentially weights specific types of errors. For example, it may be appropriate to believe that misclassifying true events (false negatives) is X times as costly as incorrectly predicting nonevents (false positives).
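One way to express such asymmetric costs, assuming scikit-learn, is through a model's class_weight argument; the 10x ratio below stands in for the "X times as costly" factor above.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# missing a true event (class 1) is penalized 10x as heavily during training
model = SVC(class_weight={0: 1, 1: 10}).fit(X, y)
```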
Both Andreas and Gino have offered you good techniques for dealing with an imbalanced class distribution, and I would like to add another common technique, which may be useful for you.
Generally, it's quite common to have an imbalanced class distribution within your dataset. To deal with this problem, there are two common methods: oversampling and undersampling.
With the oversampling method, you duplicate observations of the minority class to obtain a balanced dataset. With the undersampling method, you drop observations of the majority class to obtain an equal class distribution. You can try several different techniques and compare the results; that will give you a general idea of what is going on with your dataset and which method works better.
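If it helps, here is a minimal sketch using the imbalanced-learn package (pip install imbalanced-learn), one tool that implements both methods so the resampling does not have to be done manually:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# oversampling: duplicate minority-class rows until classes are balanced
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled:", Counter(y_over))

# undersampling: drop majority-class rows until classes are balanced
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_under))
```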
@Andreas Theissler Sir, thanks for your reply. I will try your suggestions. One problem with my dataset is that it consists of some 23 classes, and I want to consider them all. Thanks for the links you shared; I will go through these papers.
@Gino Tesei Sir, thanks for your reply. The weighted approaches seem better; do you have any idea which classifiers take class weights into consideration?
@Ahmed Aljaaf Sir, thanks for your reply. Can you suggest some tools for oversampling or undersampling of data, or do I need to do it manually?
I'm studying imbalanced classification at the moment. The answers you've gotten above are already very good. I'll add a few comments.
There are literally hundreds of papers written on this problem because, in addition to rebalancing example sets (a general technique), nearly every specific model type has one or more adaptations described in the research literature to let it deal with imbalanced classes.
Rebalancing classes is the easiest and most general approach, though it is not necessarily optimal. You can do it manually, but most packages (e.g., Python's scikit-learn) have class-balancing sampling code.
But since you have 23 classes, you may have a different problem. In addition to, or instead of, having skewed example sets, you have a large multi-class problem. You might be better off reading about multi-class classification solutions instead of imbalanced sets. Personally, I'd start with a decision tree because it can handle multiple classes naturally in a single model, rather than your having to organize one-vs-rest or one-vs-one approaches.
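A minimal sketch of that suggestion, assuming scikit-learn; the 23-class synthetic data is just a stand-in for the real dataset, and class_weight="balanced" is my own addition to offset skewed class frequencies:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for a 23-class problem (your data would go here)
X, y = make_classification(n_samples=5000, n_classes=23, n_informative=10,
                           random_state=0)

# one tree handles all classes directly; no one-vs-rest wrapper needed
tree = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)
print(len(tree.classes_), "classes handled in a single model")
```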