Hi there!

I have a sample of 2,500 data points, each with 9 attributes. I split the set 75% / 25% for training and testing (the test rows were selected at random). In the SGB model I used a learning rate (shrinkage factor) of 0.05 and a sub-sample fraction of 0.5 for bagging. Each tree has 15 terminal nodes, and all feature (attribute) interactions are allowed. Growing 20,000 trees sequentially (i.e., iterating 20,000 times), I get an R2 (R-square) of 99.7 on the training data and 98.8 on the testing data. With 10-fold cross-validation I get an R2 of 97.6.
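
For reference, here is a minimal sketch of the setup I describe above, assuming scikit-learn's GradientBoostingRegressor (the software I actually used is not important here; `X` and `y` below are just placeholders standing in for my 2,500 x 9 attribute matrix and the response):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

# Placeholders standing in for the real 2,500 x 9 data set and response.
X = np.random.rand(2500, 9)
y = np.random.rand(2500)

# 75% training / 25% testing, with the test rows chosen at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = GradientBoostingRegressor(
    n_estimators=20000,   # 20,000 sequential trees
    learning_rate=0.05,   # shrinkage factor
    subsample=0.5,        # bagging (sub-sample) fraction
    max_leaf_nodes=15,    # 15 terminal nodes per tree
    max_depth=14,         # deep enough that only the 15-leaf limit binds
    random_state=42)
model.fit(X_train, y_train)

print("Training R2:", model.score(X_train, y_train))
print("Testing  R2:", model.score(X_test, y_test))
print("10-fold CV R2:", cross_val_score(model, X, y, cv=10).mean())
```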

Since I used a very low learning rate together with bagging (sub-sampling), I assume the accuracy I am getting is not due to over-fitting; the MSE vs. number-of-trees curve also decreases gradually without any spikes (see the sketch below for how I check it).
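
This is a sketch (under the same scikit-learn assumption as above) of how the MSE vs. number-of-trees curve can be traced with staged predictions, to see whether the test error ever turns back upward, which would be the usual sign of over-fitting:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# Test-set MSE after each additional tree in the ensemble.
test_mse = [mean_squared_error(y_test, y_pred)
            for y_pred in model.staged_predict(X_test)]

plt.plot(range(1, len(test_mse) + 1), test_mse)
plt.xlabel("Number of trees")
plt.ylabel("Test MSE")
plt.show()
```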

But since I am iterating 20,000 times and still getting this level of accuracy, I am a little confused about the over-fitting concept. Please let me know whether my approach and understanding are correct.

Thank you. 
