Effect of imbalanced data on machine learning

Sobhan Sarkar

Dear Rizwan Sir,

First of all, you have raised a very interesting and challenging issue in ML. An imbalanced dataset affects the accuracy of your classifiers, so handling the imbalance is itself an important aspect of ML. You could adopt the following methods to handle this issue:

1. Collect more data that could balance your classes.

2. Change your performance metric. Use precision, recall, F1-score, Kappa, ROC curve or others.

3. Resample your data set.

4. Generate synthetic samples, for example with the Synthetic Minority Over-sampling Technique (SMOTE).

5. Try different algorithms.

6. Use penalized models such as penalized-LDA or penalized-SVM.
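To illustrate point 2, here is a minimal sketch in plain Python of why accuracy alone can mislead on imbalanced data. The toy labels (90 "happy", 10 "disgust") and the always-"happy" classifier are made up for illustration and are not from the dataset discussed in this thread.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions divided by true members of the class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:  # degenerate case: only one label on both sides
        return 0.0
    return (observed - expected) / (1 - expected)

# Hypothetical toy data: a classifier that always answers "happy".
y_true = ["happy"] * 90 + ["disgust"] * 10
y_pred = ["happy"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)

print("accuracy:", accuracy)  # high, although the minority class is never found
print("recall:", recall)
print("kappa:", kappa)
```

In this toy case the accuracy looks fine while the recall for "disgust" and the kappa value both show that the classifier has learned nothing about the minority class.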

Ahmed J. Aljaaf

Hi Rizwan,

It's quite common to have an imbalanced class distribution in a dataset. To deal with this problem, there are two common methods: oversampling and undersampling.

What you did is oversampling: you duplicated the observations of the minority class (expression of disgust) to obtain a balanced dataset. It seems that this works for you and gives good accuracy. However, I suggest that you keep this result and also apply undersampling, in which you drop observations of the majority class (expression of happiness) until the class distribution is equal, then apply libsvm for learning and compare the results. You will then have a general idea of what is going on with your dataset and which method is better.
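The two resampling methods can be sketched in plain Python as follows. This is only an illustration under assumed toy data (8 "happy" and 2 "disgust" samples); the function names `random_oversample` and `random_undersample` are hypothetical helpers, not part of libsvm.

```python
import random
from collections import Counter

def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen items of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(items) for items in by_class.values())
    pairs = []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((x, y) for x in items + extra)
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]

def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(items) for items in by_class.values())
    pairs = []
    for y, items in by_class.items():
        pairs.extend((x, y) for x in rng.sample(items, target))
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]

# Hypothetical toy data: sample ids stand in for feature vectors.
samples = list(range(10))
labels = ["happy"] * 8 + ["disgust"] * 2

over_x, over_y = random_oversample(samples, labels)
under_x, under_y = random_undersample(samples, labels)
print(Counter(over_y))   # both classes at the majority size
print(Counter(under_y))  # both classes at the minority size
```

Note that undersampling throws away majority-class items, while oversampling adds no new information, only repeated copies.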

All the best

Ahmed

Rizwan Ahmed Khan

Thanks a lot Ahmed. I will try it.

Aniello Raffaele Patrone

Duplicating the data does not help SVM because you are not really producing useful data. Usually it is better to have balanced classes. If you have two classes I would also try one-class SVMs. In your paper you should simply explain the procedure used.

Rizwan Ahmed Khan

Thanks a lot Aniello and Andreas for your guidance and opinions. As suggested by Ahmed, I oversampled the minority class and recorded results in the range of 80%. Now I am running the experiment with under-sampled data. With under-sampling of the majority class, the result is around 55%. Is this normal behavior?

Yubing Tong

It is a good idea to try one-class SVMs. The difference between over-sampling and under-sampling might be because under-sampling leads to some loss of information.

Rizwan Ahmed Khan

I am not sure how to apply a one-class SVM. For the problem at hand, I have six classes, i.e. six expressions. Should I train a separate classifier for each class?

Jakob Nikolas Kather

How do you assess accuracy (by cross-validation, holdout, etc.)? How many items are in your training dataset and how many do you use for testing? You might get a serious problem if the duplicates from the training set are introduced into the test set.

Anastasia Pampouchidou

If you are using MATLAB the following is useful for downsampling

http://www.mathworks.com/help/stats/cvpartition.html

Rizwan Ahmed Khan

Thank you Jakob and Anastasia. I am using the K-fold cross-validation method. The results I quoted above were obtained using 10-fold cross-validation.

Jakob Nikolas Kather

Dear Rizwan,

Thank you for the additional information! As I see it, this way of balancing the data is not acceptable. Here's why: to balance the classes, you duplicate items in your dataset. Then you perform 10-fold cross-validation, i.e. you perform 10 rounds of training and testing. It is almost certain that, for some duplicated items, one copy will end up in the testing set while another copy is in the training set. This means that the accuracy is assessed on the same items the classifier was trained on; in other words, the testing set is not "unknown" to the classifier. Consequently, the accuracy is biased. In your case, this explains why the accuracy rises from 45% to 80%.
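The leakage described here can be checked with a small sketch. It assumes a made-up dataset of 33 item ids (30 "happy", 3 "disgust") and hypothetical helpers `kfold`, `oversample`, and `overlap_count`. It compares balancing before the split with balancing only the training part of each fold.

```python
import random

def kfold(n, k, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def oversample(ids, label_of):
    """Repeat ids of smaller classes (cycling through them) up to the largest class size."""
    by_class = {}
    for i in ids:
        by_class.setdefault(label_of[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    rows = []
    for members in by_class.values():
        rows.extend(members[j % len(members)] for j in range(target))
    return rows

def overlap_count(rows, k):
    """Sum over folds of the distinct items that appear in both test and train."""
    total = 0
    for fold in kfold(len(rows), k):
        held_out = set(fold)
        test_items = {rows[j] for j in held_out}
        train_items = {rows[j] for j in range(len(rows)) if j not in held_out}
        total += len(test_items & train_items)
    return total

# Hypothetical toy data: ids 0..29 are "happy", ids 30..32 are "disgust".
ids = list(range(33))
label_of = {i: "disgust" if i >= 30 else "happy" for i in ids}

# Leaky protocol: duplicate first, then cross-validate on the balanced rows.
leaky_overlap = overlap_count(oversample(ids, label_of), k=10)

# Cleaner protocol: split the original items, then balance the training part only.
clean_overlap = 0
for fold in kfold(len(ids), 10):
    test_items = {ids[j] for j in fold}
    train_rows = oversample([i for i in ids if i not in test_items], label_of)
    clean_overlap += len(test_items & set(train_rows))

print("items shared between train and test (balance, then split):", leaky_overlap)
print("items shared between train and test (split, then balance):", clean_overlap)
```

With the second protocol the test folds contain only original items that the classifier has not seen, so the accuracy estimate is no longer inflated by duplicates.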

I suggest three alternatives:

a) To balance the classes, you acquire more raw data (This may be difficult, but is the most straightforward solution).

b) To balance the classes, you discard items from the larger classes (This is painful, but is a good alternative).

c) You accept class imbalances and try to use a different classification approach. For example, you can use RUSboost, which has been shown to be insensitive to class imbalances [1]. Plus, this method is available in Matlab [2]

Best regards,

Jakob

[1] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, "RUSBoost: Improving classification performance when training data is skewed," 2008 19th Int. Conf. Pattern Recognit., pp. 8–11, 2008.

[2] http://de.mathworks.com/help/stats/ensemble-methods.html

Csaba Kertész

Yes, duplicating data in the data set is not acceptable, but I experienced a similar phenomenon with SVM when I "balanced" a dataset in a similar way years ago. In some frameworks, you can give a prior importance weight to the classes before training, to put more emphasis on labels with less data. If you have this option, try it.
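As a rough sketch of such class weights, one common heuristic is inverse class frequency, n_samples / (n_classes * class_count). As far as I know this is what scikit-learn's `class_weight="balanced"` option computes, and libsvm's `svm-train` accepts per-class weights through its `-wi` option. The label counts below are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count of the class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

# Hypothetical toy labels: 90 "happy", 10 "disgust".
labels = ["happy"] * 90 + ["disgust"] * 10
weights = balanced_class_weights(labels)
print(weights)

# With these weights every class contributes the same total weight to the loss.
counts = Counter(labels)
total_weight = {c: weights[c] * counts[c] for c in counts}
print(total_weight)
```

The rare class gets a proportionally larger weight, so a mistake on a "disgust" sample costs more during training than a mistake on a "happy" sample.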

On the other hand, if you use a machine learning framework in which you can easily try out other classifiers, I would suggest trying decision trees or random forests. They don't need standardized (preprocessed) input or well-balanced data. If your features have good predictive power, in my experience these classifiers will do the job far better than SVM on this kind of "problematic dataset". (The small footprint of SVMs is very good, but they are slower than trees, and you need a pile of well-balanced data and more hand-tuning of hyperparameters.)

Edit: If you have small dataset (e.g

Ramon López de Mántaras

Did you try "under-sampling"? It consists of deleting instances of the over-represented classes. "Over-sampling" (adding copies of instances of the under-represented classes) is also done, but then, instead of computing the accuracy, it is better to use other performance measures such as the confusion matrix, ROC curves, or the Kappa measure (accuracy normalized by the imbalance of the classes in the data). You can also generate synthetic samples. One algorithm to do so is SMOTE, the Synthetic Minority Over-sampling Technique. Another possibility is to use "penalization" approaches, which impose an additional cost for making classification mistakes on the minority class during training. These penalties bias the model to pay more attention to the minority class. There are penalized versions of algorithms such as penalized-SVM and penalized-LDA.
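The core idea of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its k nearest minority neighbours, and create a new point somewhere on the line segment between them. This is a simplified illustration with made-up 2-D points, not the full published algorithm or any library's implementation; `smote_sketch` is a hypothetical name.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority-class feature vectors (2-D for readability).
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9), (1.8, 1.6)]
new_points = smote_sketch(minority, n_new=10)
for point in new_points:
    print(point)
```

As with plain over-sampling, the synthetic points should be generated from the training folds only, otherwise the test folds are no longer independent of the training data.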

Rizwan Ahmed Khan

Thank you everyone. I have tried the SMOTE algorithm. Now the results are in the range of 65%. In conclusion, the results are as follows:

1. Over sampling, result = 80 % (over fitting)

2. Under sampling, result = 55%

3. After applying SMOTE algo, result = 65%

Ignacio Arroyo-Fernández

Depending on the implementation you are using, some implementations compensate for class imbalance through the regularization parameter, e.g. the Shogun machine learning toolbox has such compensation for binary problems. You can also try one-class classification. Since this compensation is directly related to the capacity of the machine, I think that if compensation in the regularization parameter (or one-class classification) does not work then, in the sense of statistical learning theory, your data is severely undersampled, i.e. the number of samples for the "disgust" expression is definitely not sufficient [see link]. In that case you need to try the alternatives others have suggested in order to compensate for the undersampling in reliable ways.

http://www.csc.kth.se/~omida/BMVC_2013_oa.pdf
