Best preprocessing methods for imbalanced data in classification algorithms?

Siti Mariyah Popular answer

An imbalanced data set is a serious problem in classification. It is caused by a skewed distribution of data between classes. Most standard algorithms assume or expect a balanced class distribution or equal misclassification costs. Therefore, when presented with large imbalanced data sets, these algorithms fail to properly represent the distributive characteristics of the data. To the best of my knowledge, there are two approaches: one at the data level and one at the algorithm level. At the data level, most studies apply resampling techniques to obtain a balanced distribution. You can undersample the majority class, oversample the minority class, or do both at the same time. The synthetic minority over-sampling technique (SMOTE) is a well-known technique you can use. At the algorithm level, you can apply a boosting algorithm or adjust the misclassification costs. Boosting constructs a strong classifier from several weak classifiers (base classifiers). You can use SVM and KNN as base classifiers.
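To make the SMOTE idea concrete, here is a minimal sketch in plain Python. It is not a reference implementation: the function name `smote_sketch`, the parameter `k`, and the toy points are assumptions made for illustration. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.
    Hypothetical helper for illustration, not a library function."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy minority class with two numeric features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
```

Because every new point is an interpolation between two existing minority samples, the synthetic data stays inside the region already covered by the minority class. In practice a maintained implementation would normally be preferred over a hand-written one.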

Sergey Kuzmin

Check out http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf

Matthias Kohl

There are many different approaches. An interesting option is SMOTE.

Best, Matthias

Lov Kumar

Before applying any classification technique, you should perform outlier analysis.

Box-plot analysis is one of the best ways to identify outliers and remove the influential ones.
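As a rough illustration of the box-plot rule, the sketch below flags values that fall outside the whiskers, taken here as 1.5 times the interquartile range beyond the quartiles. The 1.5 factor is the common box-plot convention and is an assumption, as are the function name `iqr_outliers` and the sample values.

```python
import statistics

def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside the box-plot whiskers.
    Hypothetical helper; `whisker` is the usual 1.5 * IQR convention."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Made-up feature values with one obvious extreme point.
data = [10, 12, 11, 13, 12, 11, 14, 95]
outliers = iqr_outliers(data)
```

With these sample values, only 95 lies beyond the upper whisker.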

Nicolás Vila Blanco

You can try any of the well-known resampling techniques (ROS, RUS, SMOTE or ROSE). You can also use the 'unbalanced' R library, which implements many functions for dealing with imbalanced data. For example, the ubRacing method automatically selects the best technique to re-balance your specific data.

Evidently, general-purpose preprocessing techniques (outlier analysis, noise detection, feature selection, etc.) should also be applied.
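For readers who want to see what random oversampling (ROS) and random undersampling (RUS) do, here is a small standard-library Python sketch. The function names and the toy data are assumptions for illustration; they are not the API of the 'unbalanced' package.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """ROS: duplicate randomly chosen samples until every class
    is as large as the biggest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        X_out.extend(rng.sample(pool, target))
        y_out.extend([label] * target)
    return X_out, y_out

# Toy data: 8 majority samples (label 0) and 2 minority samples (label 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
```

Resampling is usually applied to the training split only, so that the test data keeps the original class distribution.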

Wolfgang Konen

Random Forests (R package randomForest) are known to deal quite robustly with imbalanced data sets. Consider them as an alternative to SVM and KNN. You should also think about the right error measure for your task. Plain error rate is probably not the best choice. You might consider the mean per-class error rate or another weighted error rate with specific weights for each element of the confusion matrix. randomForest offers the options classwt and cutoff, which reflect the importance given to each class during training and prediction. Both options, however, need careful tuning to get the best results.
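The difference between the plain error rate, the mean per-class error rate, and a weighted error rate can be shown with a small Python sketch. All function names, the 95/5 toy labels, and the cost weights are assumptions chosen for illustration; the code assumes every class occurs at least once in `y_true`.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def plain_error(matrix):
    """Share of all samples that were misclassified."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 1 - correct / total

def mean_per_class_error(matrix):
    """Average of the error rates computed separately for each true class."""
    errors = [1 - row[i] / sum(row) for i, row in enumerate(matrix)]
    return sum(errors) / len(errors)

def weighted_error(matrix, weights):
    """Average cost, with one weight per confusion-matrix cell."""
    total = sum(sum(row) for row in matrix)
    cost = sum(w * c
               for w_row, c_row in zip(weights, matrix)
               for w, c in zip(w_row, c_row))
    return cost / total

# Toy case: 95 majority (0) and 5 minority (1) samples,
# and a classifier that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

err_plain = plain_error(cm)               # looks good: only 5% wrong
err_per_class = mean_per_class_error(cm)  # exposes the missed minority class
# Illustrative costs: a missed minority sample counts 10 times as much
# as a false alarm on a majority sample.
err_weighted = weighted_error(cm, weights=[[0, 1], [10, 0]])
```

In this toy case the plain error rate is 0.05 while the mean per-class error rate is 0.5, because the minority class is misclassified entirely.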

Hawraz A. Ahmad

The SMOTE technique gives the desired solution for imbalanced data. I used it in my M.Sc. thesis.

Hassiba Nemmour

I think that SVM is adequate for unbalanced data and doesn't need preprocessing steps to achieve such classification.

Cristian Popa

How imbalanced is the data? Considering how they work, both SVM and KNN should do well with imbalanced data, to some extent.

It may be better to test first. Keep in mind, however, not to use accuracy as the performance metric, since a baseline model that always predicts the majority class may already give very high percentages. F1 may be a better choice.
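The point about accuracy can be checked with a few lines of Python. This is a hand-written sketch with made-up labels (95 negatives, 5 positives); the helper names `accuracy` and `f1` are assumptions for illustration, and the minority class is treated as the positive class.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no positive sample was found
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 95 majority samples (0) followed by 5 minority samples (1).
y_true = [0] * 95 + [1] * 5

# Baseline: always predict the majority class.
baseline = [0] * 100

# A model that finds 4 of the 5 minority samples at the cost of
# 6 false alarms on majority samples.
model = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1

acc_baseline, f1_baseline = accuracy(y_true, baseline), f1(y_true, baseline)
acc_model, f1_model = accuracy(y_true, model), f1(y_true, model)
```

Here the baseline reaches an accuracy of 0.95 with an F1 of 0, while the second model has a lower accuracy of 0.93 but an F1 of about 0.53, so accuracy alone would rank the two the wrong way round.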

If some data preparation is needed, you may consider over-sampling the minority class(es) or under-sampling the majority class or, as suggested above, finding cost-sensitive classification algorithms (or cost-sensitive implementations of those you want to use).

Regards,

Bhagyashree S R

Using SMOTE is a better option...

Oyebayo Olaniran

You can try my new R package "BayesRandomForest". It handles imbalanced data via a bootstrap prior technique.

Miriam Seoane Santos

Among preprocessing (data resampling) strategies, researchers often invest in oversampling methods since they are capable of balancing class distributions without ruling out potentially important examples. Over an extensive comparison of oversampling algorithms, the best seem to possess 3 key characteristics: cluster-based oversampling, adaptive weighting of minority examples and cleaning procedures (e.g. SMOTE, SMOTE-ENN, SMOTE-TL, MWMOTE). The following paper may be useful, as it performs a thorough empirical comparison of well-established oversampling algorithms, focusing on their behaviour and intrinsic characteristics:

Article Cross-Validation for Imbalanced Datasets: Avoiding Overoptim...