We want to use random forests to assess the importance of environmental variables in predicting species abundance and richness. I'm unsure about several points, for example why one would choose out-of-bag (OOB) error estimation over random cross-validation (CV). The specific questions are listed below.

1. With a relatively small dataset (several hundred points) and “many” predictors (20), which is better: random cross-validation or OOB? The selected mtry always ends up the same (2) with both.

2. When looking at variable importance scores, which is “better”: mean decrease in accuracy or mean decrease in Gini node impurity?

3. Should we be tuning ntree, or is the default of 500 OK?

4. Should we be transforming the response variables to have a more normal distribution?

Thanks!

Any resources for a layperson would also be appreciated.

As a note, I'm using the R package caret to create the models and calculate the variable importance.
