Theoretically, can an unbalanced data distribution affect the performance of Decision Trees and Random Forests?

02 February 2019

Generally, in a classification task, when the class distribution is unbalanced, most classifiers are biased towards the majority class.

Similarly, for Regression, regressors (especially linear regression) usually work better when the target variable is close to a normal distribution (i.e. a bell curve).

My question is: Do the above two general statements hold when a decision tree or random forest based classifier/regressor is used? Can the distribution affect or bias the calculation of information gain?
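To make the last part of the question concrete, here is a minimal sketch in plain Python of how the class proportions enter an entropy-based information gain. The function names (`entropy`, `information_gain`) and the toy label lists are illustrative assumptions, not taken from any particular library. It only shows that the gain of a split can never exceed the parent node's entropy, and that this entropy is smaller when the classes are unbalanced; it does not by itself settle how much a full tree or forest is affected.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy label sets (assumed for illustration): 50/50 versus 95/5.
balanced = [0] * 50 + [1] * 50
imbalanced = [0] * 95 + [1] * 5

# The same "perfect" split, which separates the two classes completely.
gain_balanced = information_gain(balanced, [0] * 50, [1] * 50)
gain_imbalanced = information_gain(imbalanced, [0] * 95, [1] * 5)

print(f"parent entropy, balanced:   {entropy(balanced):.3f} bits")
print(f"parent entropy, imbalanced: {entropy(imbalanced):.3f} bits")
print(f"gain of perfect split, balanced:   {gain_balanced:.3f}")
print(f"gain of perfect split, imbalanced: {gain_imbalanced:.3f}")
```

Even a split that isolates the minority class perfectly yields a much smaller gain in the 95/5 case, because there is less entropy to remove in the first place.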

Syed Furqan Qadri

The problem of unbalanced classes can be dealt with by deleting data from the background class or the foreground class (undersampling), or by using the SMOTE algorithm (synthetic minority oversampling).
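The two remedies mentioned above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: `random_undersample` and `smote_like` are hypothetical helper names, the toy data is made up, and `smote_like` only mimics the core idea of SMOTE (interpolating between a minority sample and one of its nearest minority neighbours). For real work, a maintained implementation such as `SMOTE` in the imbalanced-learn package is the safer choice.

```python
import math
import random

def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for row in rng.sample(rows, n_min):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + t * (q - b) for b, q in zip(base, neighbour)))
    return synthetic

# Toy data (assumed): 16 majority rows (label 0) and 4 minority rows (label 1).
majority = [(float(i), float(i % 4)) for i in range(16)]
minority = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.0), (11.5, 11.5)]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)

X_under, y_under = random_undersample(X, y)
new_points = smote_like(minority, n_new=12)
X_over = X + new_points
y_over = y + [1] * len(new_points)

print("undersampled class counts:", y_under.count(0), y_under.count(1))
print("oversampled class counts: ", y_over.count(0), y_over.count(1))
```

Undersampling discards information from the larger class, while the interpolation approach adds artificial minority points, so which one is preferable depends on how much data is available.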

Paul Yarnold

Hi, Shin,

Not all "decision tree" (DT) methods are the same.

The maximum-accuracy machine-learning DT algorithm employs operations-research (mathematical programming) methods and requires NO distributional assumptions--for raw data or for model residuals. Here are four introductory articles:

Linden A, Yarnold PR (2016). Using data mining techniques to characterize participation in observational studies. Journal of Evaluation in Clinical Practice, 22, 839-847.

https://odajournal.com/2016/09/19/novometric-analysis-with-ordered-class-variables-the-optimal-alternative-to-linear-regression-analysis

https://odajournal.com/2017/04/18/what-is-optimal-data-analysis

https://odajournal.com/2018/11/26/visualizing-application-and-summarizing-accuracy-of-oda-models

ALSO, maximum-accuracy methods may be used to explicitly maximize the accuracy of legacy statistical methods:

https://odajournal.com/2018/10/01/comparative-accuracy-of-a-diagnostic-index-modeled-using-optimized-regression-vs-novometrics

https://odajournal.com/2013/09/20/maximizing-the-accuracy-of-multiple-regression-models-using-unioda-regression-away-from-the-mean
