01 October 2020

Clustered data (random effects) are very common in data analysis. For example, if I want to model the occurrence (presence/absence) of a species across multiple countries, I could treat country as a cluster, i.e. a random effect. In theory I could use a binomial GLMM. However, given the structure of my datasets, the performance of such a model and the information it returns are not satisfactory and mostly do not fit my questions. Non-linear responses, high variability in the data, (randomly) missing predictor values, categorical predictors, and unbalanced classes make it more challenging. Because of this I often use Random Forest (RF) models. Although RF models are sometimes described as black-box models, this is hardly the case. The variable importance measures, the partial dependence plots, and the extraction of split points at the root node and of the depth and number of splits per predictor make it a complex white-box model. These results also fit most of the questions I ask. One could suggest using a GAMM, but these models have so many settings to tweak that I do not feel confident or comfortable using them.

To handle the missing values, categorical data, and unbalanced datasets I use the randomForest package for R (Liaw and Wiener, 2002). The randomForest package can impute the median for missing values and can stratify the sampling (downsampling the majority class) in unbalanced datasets, which makes it well suited to the data I work with. Both the stratification and the median imputation are key for me. However, a drawback is that the randomForest package cannot take clustered/random effects into account. This ends up as a discussion point in basically every analysis.
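For reference, a minimal sketch of the kind of setup described above. The data are hypothetical toy data (the column names `presence`, `temp`, and `habitat` are made up for illustration), and the settings are only one plausible way to combine imputation and stratified downsampling in randomForest. Note that it ignores any clustering, which is exactly the limitation in question.

```r
library(randomForest)

set.seed(1)

# Hypothetical toy data: presence/absence with roughly 10% presences,
# one numeric and one categorical predictor, both with missing values.
n <- 500
d <- data.frame(
  presence = factor(rbinom(n, 1, 0.1), levels = c(0, 1)),
  temp     = rnorm(n),
  habitat  = factor(sample(c("forest", "grass", "urban"), n, replace = TRUE))
)
d$temp[sample(n, 25)]    <- NA
d$habitat[sample(n, 25)] <- NA

# Size of the minority class; each tree draws this many rows per class.
n_min <- min(table(d$presence))

fit <- randomForest(
  presence ~ temp + habitat,
  data       = d,
  na.action  = na.roughfix,      # median (numeric) / mode (factor) imputation
  strata     = d$presence,       # stratify the per-tree sample by class
  sampsize   = c(n_min, n_min),  # downsample the majority class in each tree
  importance = TRUE,
  ntree      = 200
)

# Variable importance and a partial dependence plot for one predictor.
importance(fit)
partialPlot(fit, na.roughfix(d), x.var = "temp", which.class = "1")
```

Because `na.roughfix` imputes rather than drops rows, and `sampsize` downsamples per tree rather than once up front, no samples are discarded before the analysis.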

There are some scientific publications on mixed-effects random forests (MERFs; e.g. Hajjem et al., 2014) and R packages are available (e.g. MixRF). However, judging from the manuals of these packages, they do not seem to be able to impute the median or stratify the data. I do not want to lose a lot of my data by balancing my datasets before the analysis, and I do not want to lose information by removing incomplete samples.

Is there any news on an R package that implements RF models and can handle all of these things? Or is there a suggestion for other types of models in R that return similar information to RF models and are (so to speak) as user-friendly as the randomForest package?

Thank you in advance,

Liaw, A., Wiener, M., 2002. Classification and Regression by randomForest. R News 2, 18–22.

Hajjem, A., Bellavance, F., Larocque, D., 2014. Mixed-effects random forest for clustered data. Journal of Statistical Computation and Simulation 84, 1313–1328. DOI: 10.1080/00949655.2012.741599
