Overfitting in Decission Tree: What is the best practice to check the overfit indication?

Hello Yosafat,

as recommended by Mohammed Deeb you should split your data into a training set and a test set. Train your decision tree on the training data and compute the accuracy of your model for the training data (training accuracy). Next compute the accuracy of your model for the test data (test accuracy). If your model is subject to overfitting, the training accuracy will be larger than the test accuracy.

Best,

Michael

Utkarsh Singh

Decision trees are prone to overfitting. You already have a lot of samples and considerable amount of features.

1) Divide your dataset into train and test sets (80:20). Use 20% of your training data for validation (holdout method). You could also go for k-fold cross validation.

2) Train your DT by varying model parameters (especially the number of splits, high number of splits or the depth could be causing the overfit in your case). For each model plot training vs. validation error. The best fit is the case where validation error is closest to train error. Example: if on increasing the number of splits, validation error increases significantly as compared to training error, then it is an overfit case.

3) Although validation set should be used to tune parameters of algorithm. However, if you just use train and test sets, overfitting is said to happen when test error is significantly low compared to training error. Conversely, overfitting is evident if test accuracy is significantly high compared to training accuracy.

4) To be completely sure of overfitting, check the variance. Overfitting model has high variance.

5) If despite everything, your model still overfits, following are your options: (a) Reduce the number of input features (Feature selection), (b) Try other classifiers like Random Forest or SVM.

Useful references for you:

https://datascience.stackexchange.com/questions/48166/why-can-decision-trees-have-a-high-amount-of-variance
https://www.kaggle.com/getting-started/183376
https://machinelearningmastery.com/how-to-reduce-model-variance/
https://datascience.stackexchange.com/questions/60423/validation-curve-interpretations-for-decision-tree
https://scikit-learn.org/stable/modules/learning_curve.html

Ritika Lohiya

Pruning a decision tree might help in handling over-fitting. Over-fitting in decision tree is mainly due to the lack of representative instances.

For applying pruning you can fine tune the following parameters and re-observe the performance of the classifier: (1) max_depth: represents depth of the tree. More you increase the number, more will be the number of splits and the possibility of overfitting. (2) min_samples_split: represents minimum number of samples required to split an internal node. If you use 100% of the samples at each node, it will show underfitting, if you use 10%, it might show overfitting. Tune it accordingly. (3) min_samples_leaf: represents min. no. of samples required to be in the leaf node. The more you increase the number, more is the possibility of overfitting.

Apart from this you can also apply cross validation techniques such as stratified k-fold cross validation for re-sampling the dataset in training, test, and validation sets to avoid overfitting.

Moreover, decision trees are prone to overfit the data. Therefore, ensemble techniques such as random forest can be used that have the capability to reduce variance without increasing the bias.

Sumaia M. Al-Ghuribi

In Addition to the previous useful answers, I suggest to use k cross validation, it will give accurate results than dividing the dataset into only one part for training and testing dataset.

Muhammad Ali

Have a look at Article An Overview of Overfitting and its Solutions

https://statisticsbyjim.com/regression/overfitting-regression-models/

https://towardsdatascience.com/dont-overfit-how-to-prevent-overfitting-in-your-deep-learning-models-63274e552323

Dasaradharami Reddy Kandati

Overfitting a model is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex.

Overfit regression models have too many terms for the number of observations.

Is there any Daubechies 44 Wavelet Packet Transform Implementation for Python?

Adaptation of oxidative biomarkers due to physical activity or exercise

Method to remove Baseline Wander on Sudden Death Holter Database (SDDB) ECG record

Has anyone applied Python in the field of textile engineering for data analysis, automation, or smart textiles?

Request Python code?

Why does everyone use vs code?

How can i do multivariate Time Series forecast using MLP, ANFIS and LSTM?

Need help with my research project on open source SIEM and machine learning?

A Question about Phd thesis?

Where to find a gene list for CRISPRa/i library screening of regulatory factors that affect pathogenic Th17 differentiation in PBMC?

How to add Cr parameter in autodock?

Best Practices for Constructing and Maintaining Gravel Roads in Rural Areas?

Are these cassettes suitable for expressing PETase mutant in E. coli?