Take the example of predicting a disease. Say that only 10% of the instances in your dataset actually have the disease. This means you could get 90% accuracy simply by predicting the negative class all the time. But how useful is that? Not very, since you wouldn't have predicted a single instance of the actual disease. This is where the F1-score is very helpful: in this example, the recall for the positive class would be 0, and hence the F1-score would also be 0.
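For illustration, here is a minimal sketch of that situation, assuming scikit-learn and a made-up 100-sample dataset (the 90/10 split and the always-negative "classifier" are just placeholders for the argument above):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical dataset: 10% positive (disease), 90% negative
y_true = np.array([1] * 10 + [0] * 90)

# A "classifier" that always predicts the negative class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))             # 0.9 -> looks great
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0 -> recall for the positive class is 0
```

Accuracy rewards the trivial predictor, while the F1-score exposes that it never finds a positive case.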
When building and optimizing any supervised learning model, measuring how accurately it classifies data is crucial, especially when the developer must choose between two or more algorithms. It is an easy question to ask but a difficult one to answer which algorithm should be chosen when one performs better on one class and the other on the other class. In most cases, a high classification accuracy gives a misleading indication of the model's classification ability, especially on the imbalanced datasets common in real life. Overcoming this problem is particularly important in applications where misclassifying instances from the minority class is more costly. Hence, the ability to evaluate classification models independently of the size of the dataset and the distribution of data across its classes is pivotal to selecting the most appropriate model. Note that the choice of evaluation metric depends on many factors, such as the size of the dataset, the distribution of data across classes, and which class matters most to the end user and thus to the developer.

The Matthews Correlation Coefficient (MCC) is regarded as the most informative single score for judging the quality of a binary classifier's predictions in a confusion-matrix context. It is a correlation coefficient between the observed and predicted binary classifications and returns a value between −1 and +1: +1 represents a perfect prediction, 0 a random prediction, and −1 total disagreement between prediction and observation. Despite its advantages over other metrics, MCC has the limitation of requiring arbitrary assumptions to overcome the divide-by-zero problem, and it cannot assess the class-based performance of classification algorithms.

We introduced a performance evaluation framework based on a new evaluation metric we name the "Multidimensional Classification Assessment Score (MCAS)". MCAS evaluates the performance of learning algorithms by measuring how good a classification algorithm is in the presence of errors. It overcomes the limitations of existing metrics because it works independently of the size of the dataset and the distribution of samples across its classes, and it scores how efficiently a classification algorithm handles binary classification problems in the presence of errors. For further details, see https://dspace.library.uvic.ca/bitstream/handle/1828/13219/Elhaddad_Mohamed_PhD_2021.pdf?sequence=3&isAllowed=y
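As a small, hedged sketch of the MCC behaviour described above (not of MCAS, which is defined in the linked dissertation), scikit-learn's `matthews_corrcoef` can be run on the same hypothetical 100-sample dataset; the two predictors here are made up for illustration:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Same hypothetical imbalanced labels: 10 positives, 90 negatives
y_true = np.array([1] * 10 + [0] * 90)

# The all-negative predictor from the disease example
y_all_negative = np.zeros_like(y_true)

# A predictor with 8 true positives, 2 false negatives, 5 false positives, 85 true negatives
y_better = y_true.copy()
y_better[:2] = 0       # 2 false negatives
y_better[10:15] = 1    # 5 false positives

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
# For the all-negative predictor the denominator is 0; scikit-learn returns 0.0 in that case,
# which is the kind of divide-by-zero convention mentioned above.
print(matthews_corrcoef(y_true, y_all_negative))  # 0.0
print(matthews_corrcoef(y_true, y_better))        # ~0.66
```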
Article: One-class ensemble classifier for data imbalance problems
Here are some articles you can read for a good overview of the performance metrics. Relying on any single metric is not a healthy practice when dealing with imbalanced datasets. You should also keep in mind the degree of imbalance in your dataset, i.e., the imbalance ratio.
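As a quick aside, one common way to quantify that degree of imbalance is the ratio of the majority-class count to the minority-class count; a tiny sketch with made-up labels:

```python
from collections import Counter

# Hypothetical label column: 10 positives, 90 negatives
y = [1] * 10 + [0] * 90

counts = Counter(y)
# Imbalance ratio: size of the majority class over the size of the minority class
print(max(counts.values()) / min(counts.values()))  # 9.0 for this 90/10 split
```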