Can anyone help me better understand random forests?

Andrew J. Fairbairn

I’m going to make this response detailed so that it is self-contained, so please be patient and read through the entire thing.

a) First of all, let’s consider what the Out Of Bag (OOB) metrics are. When training a Random Forest, we create a bootstrapped dataset for each Classification And Regression Tree and train the trees separately, so as to make them as decorrelated as possible. Each bootstrapped dataset is as “big” as the original, but is created by sampling with replacement from the original dataset. Thus, some samples are repeated (sampled multiple times) in every bootstrapped dataset. Conversely, some samples in the original dataset are never drawn into a particular bootstrapped dataset. On average, about 37% of the samples in the original dataset (roughly 1/e, for a reasonably large dataset) do not end up in a specific bootstrapped dataset. These samples that don’t end up “in the bag” are referred to as out-of-bag samples for that tree.
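To make the out-of-bag fraction concrete, here is a small simulation using only the Python standard library. The function name `oob_fraction` and the dataset size are my own choices for illustration; the point is just that drawing n indices with replacement leaves out a share of the data close to (1 - 1/n)^n, which tends to 1/e.

```python
import random

def oob_fraction(n_samples, seed=0):
    """Fraction of the original samples never drawn into one bootstrap sample."""
    rng = random.Random(seed)
    # Draw n_samples indices with replacement and keep the distinct ones.
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples

n = 10_000
fractions = [oob_fraction(n, seed=s) for s in range(20)]
mean_fraction = sum(fractions) / len(fractions)

# Probability that one given sample is missed by all n draws.
theory = (1 - 1 / n) ** n

print(f"simulated OOB fraction:   {mean_fraction:.3f}")
print(f"theoretical OOB fraction: {theory:.3f}")
```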

As these OOB samples weren’t used to train that specific decision tree, we can use them as a surrogate validation set. We select a particular sample and find all the trees in the random forest for which it was an OOB sample. We let only those trees vote on the sample and check whether their aggregated prediction is correct. We repeat this procedure for all the samples and average the results to get the OOB score.
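If you happen to work in Python, scikit-learn can compute this score for you. The sketch below is only an illustration on synthetic data (the dataset, the number of trees and the split are assumptions of mine, not something from your problem). It fits a forest with `oob_score=True` and prints the OOB accuracy next to the accuracy on a held-out test set so that the two can be compared.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data, used purely for illustration.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# oob_score=True asks the forest to score every training sample using
# only the trees whose bootstrap sample did not contain it.
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

oob_accuracy = rf.oob_score_
test_accuracy = rf.score(X_test, y_test)

print(f"OOB accuracy:  {oob_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```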

Now, as you can see, the OOB score is not quite the same as a “proper” validation score: for any one sample, we only consider the trees for which it was an OOB sample (roughly a third of the forest), so each prediction comes from a smaller ensemble than the full model. It may therefore not always be as good as a score on a held-out set. However, in cases where the training dataset is small, it is a good idea and an established practice to use the OOB score as a validation metric.

b) First of all, I assume that you have a classification problem, or else your question may not be perfectly stated.

Neither the mean decrease in accuracy nor the mean decrease in Gini node impurity is better than the other; they simply measure different things.

The mean decrease in accuracy measures how much accuracy the model loses, on average, when the values of a specific feature are randomly permuted in the OOB samples. If the feature matters, scrambling it hurts the predictions.

On the other hand, the mean decrease in Gini measures how much purer the child nodes become, on average, when a specific feature is used to split a node. If a feature is informative, it tends to split nodes with mixed labels into purer, ideally single-class, nodes.

The Gini importance is a local measure (it is accumulated over every split in every tree during training), whereas the accuracy-based importance is more global (it permutes a feature and re-evaluates the predictions of the finished forest). In that sense, I think the mean decrease in accuracy is the measure you are looking for.
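For a side-by-side look at the two kinds of importance, here is a hedged scikit-learn sketch on synthetic data. Two caveats: `feature_importances_` is scikit-learn's impurity-based (Gini) importance, and `permutation_importance` permutes features on whatever data you pass in (a held-out set here), which is close in spirit to, but not identical with, the OOB-based mean decrease in accuracy. The dataset is constructed so that only the first three features are informative; that is an assumption of the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0, 1 and 2;
# the remaining 5 columns are pure noise.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importance: accumulated over all splits during training.
gini_importance = rf.feature_importances_

# Permutation importance: drop in accuracy when one column is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
perm_importance = perm.importances_mean

for i in range(X.shape[1]):
    print(f"feature {i}: gini={gini_importance[i]:.3f}  "
          f"permutation={perm_importance[i]:.3f}")
```

The two rankings often agree on the strongest features but can differ further down, which is one reason to look at both.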

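c) You should always vary this hyperparameter and select the best value. There is no way to know in advance whether ntree=500 is enough or too few; a practical check is to see whether the OOB error has stopped changing as trees are added.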
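As a rough illustration of varying the number of trees, the sketch below refits a forest for several sizes and records the OOB error each time. It uses scikit-learn, where `n_estimators` plays the role of `ntree`; the data and the grid of tree counts are arbitrary choices of mine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

tree_counts = [25, 50, 100, 200, 400]
oob_errors = []
for n_trees in tree_counts:
    rf = RandomForestClassifier(
        n_estimators=n_trees, oob_score=True, random_state=0
    )
    rf.fit(X, y)
    # OOB error is one minus the OOB accuracy.
    oob_errors.append(1.0 - rf.oob_score_)

for n_trees, err in zip(tree_counts, oob_errors):
    print(f"n_estimators={n_trees:4d}  OOB error={err:.3f}")
```

Once the error curve flattens out, adding more trees mostly costs computation time.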

d) In most cases, that would be a good idea if you were using a gradient-based learning approach or making assumptions about the distributions of the features.

However, decision trees (and thus random forests) do not use gradient-based learning and don’t make assumptions about the distributions of the features. Each split only compares a feature’s value against a threshold, so only the ordering of the values matters.

Thus, there is no explicit need to do such transformations.
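Here is a small check of that claim, assuming the transformation in question is a monotonic one such as a logarithm (I am guessing at what you had in mind). It fits one scikit-learn decision tree on raw, skewed features and another on their logs, then measures how often the two trees agree on the training data. The data-generating rule is made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Strictly positive, heavily skewed features so that the log is defined.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 4))
# Hypothetical labelling rule, chosen only for this illustration.
y = (X[:, 0] * X[:, 1] > 1.0).astype(int)

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_log = DecisionTreeClassifier(max_depth=4, random_state=0).fit(np.log(X), y)

# The log preserves the ordering of each feature, so both trees can
# carve the training data into the same groups.
agreement = np.mean(tree_raw.predict(X) == tree_log.predict(np.log(X)))
print(f"agreement on training data: {agreement:.3f}")
```

Predictions on new data can differ slightly, because a threshold placed midway between two raw values does not map exactly to the midpoint of their logs.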

Good luck!
