19 June 2015

My data consist of the presence/absence (PA) of an allele in 354 plants collected from 127 collection sites as the response, and a set of 25 continuous climatic variables for each site as predictors. The sample size for each collection site ranges from 1 to 4 plants. My goal is to find the climatic variables that are the best predictors of the PA pattern. As expected, the climate variables are highly correlated. I would like to use the random forest approach to find these variables. I have 3 questions:

1. In logistic regression one can use a mixed model with the collection site as a random effect. Is there a way, or a need, to define such a random effect in a random forest analysis, or can I just add the collection site as a factor among the predictors?

2. Alternatively, I can calculate the frequency of the allele for each collection site and use it as the response (each case is then a collection site). I would then not need the collection site as a random effect. However, the sample size per site is very small (1-4 plants). Is this a better approach than running the analysis as a classification problem, as in logistic regression, where each case is an individual plant?

3. Do I need to select the least correlated variables, or can random forest handle many correlated variables?
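To make the second option concrete, here is a minimal sketch of the per-site aggregation I have in mind. The function name, the record layout, and the toy data are all hypothetical and only illustrate the idea; they are not my real data.

```python
from collections import defaultdict


def site_allele_frequency(records):
    """Aggregate per-plant presence/absence into per-site allele frequency.

    records: iterable of (site_id, presence) pairs, presence being 0 or 1.
    Returns a dict mapping site_id -> (frequency, n_plants).
    """
    counts = defaultdict(lambda: [0, 0])  # site -> [n_present, n_plants]
    for site, presence in records:
        counts[site][0] += presence
        counts[site][1] += 1
    return {site: (present / n, n) for site, (present, n) in counts.items()}


# Hypothetical toy data: three sites with 3, 1 and 4 plants.
plants = [
    ("A", 1), ("A", 0), ("A", 1),
    ("B", 0),
    ("C", 1), ("C", 1), ("C", 0), ("C", 0),
]
freq = site_allele_frequency(plants)
```

Keeping the number of plants next to each frequency matters here: a site with a single plant can only have a frequency of 0 or 1, so the per-site response is very coarse for the smallest sites.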

Thank you.
