Multicollinearity can affect the accuracy of prediction models; regression models, in particular, are usually affected by multicollinearity among the variables considered. I want to know whether random forests are also affected by multicollinearity between features.
In such cases, the partial least squares (PLS) method can be adopted. When many variables are interrelated, principal component analysis will filter the predictors. Otherwise, the variables can be grouped by clustering first and the prediction methods applied afterwards.
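As a rough illustration of the PLS idea, here is a minimal sketch using scikit-learn's PLSRegression on synthetic, nearly collinear predictors; the data and the single-component choice are assumptions made for the example, not part of the answer above:

```python
# Minimal sketch: compressing collinear predictors with partial least squares.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.1, size=n)

# PLS projects the correlated predictors onto a few latent components
# chosen to covary with the response, sidestepping the collinearity.
pls = PLSRegression(n_components=1)
pls.fit(X, y)
print("R^2 on training data:", pls.score(X, y))
```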
The short answer is no. It does not affect prediction accuracy.
Multicollinearity does not affect the accuracy of predictive models, including regression models. Take the attached image as an example. The features on the x and y axes are clearly correlated; however, you need both of them to create an accurate classifier. If you discard one of them for being highly correlated with the other, the performance of your model will decrease.
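To make this concrete, here is a small self-contained sketch (my own synthetic data, not the attached image) in which two strongly correlated features are jointly informative but individually weak, so dropping either one hurts accuracy:

```python
# Two correlated features that are jointly, but not individually, informative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 1000
common = 2.0 * rng.normal(size=n)            # shared signal -> high correlation
labels = rng.integers(0, 2, size=n)
offset = np.where(labels == 1, 0.5, -0.5)    # classes separated along x - y
x = common + offset + 0.1 * rng.normal(size=n)
y = common - offset + 0.1 * rng.normal(size=n)
X = np.column_stack([x, y])

print("corr(x, y)    :", np.corrcoef(x, y)[0, 1])
clf = LogisticRegression()
print("both features :", cross_val_score(clf, X, labels, cv=5).mean())
print("x only        :", cross_val_score(clf, X[:, [0]], labels, cv=5).mean())
```

With both features the classifier can learn the direction x - y and separate the classes almost perfectly; with either feature alone, the class signal is buried in the shared variance and accuracy drops toward chance.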
If you want to remove the collinearity, you can always use PCA to project the data into a new space where the 'new features' will be orthogonal to each other. You can then train your model with the new features, but you will find that the performance is the same; you have simply rotated your original decision boundary.
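A quick sketch of the rotation argument, again on assumed synthetic data: projecting onto all principal components decorrelates the features, yet the cross-validated accuracy is essentially unchanged, because an orthogonal rotation does not add or remove information:

```python
# PCA onto all components is an orthogonal rotation: accuracy is preserved.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
common = 2.0 * rng.normal(size=n)
labels = rng.integers(0, 2, size=n)
offset = np.where(labels == 1, 0.5, -0.5)
X = np.column_stack([common + offset, common - offset])
X = X + 0.1 * rng.normal(size=(n, 2))

X_rot = PCA(n_components=2).fit_transform(X)   # decorrelated 'new features'
clf = LogisticRegression()
print("original:", cross_val_score(clf, X, labels, cv=5).mean())
print("rotated :", cross_val_score(clf, X_rot, labels, cv=5).mean())
# The two scores match up to CV noise: the decision boundary was rotated.
```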
Now, where multicollinearity becomes 'an issue' is when you want to 'interpret' the parameters learned by your model. In other words, you cannot say that the feature with the 'biggest weight' is 'the most important' when the features are correlated. Note that this is independent of the accuracy of the model; it concerns only the interpretation part, which, in my opinion, you should not be doing anyway. To see why, you can read: https://robertoivega.com/association-prediction-studies/#more-188
Toloşi, Laura, and Thomas Lengauer. "Classification with correlated features: unreliability of feature ranking and solutions." Bioinformatics 27.14 (2011): 1986-1994.
Strobl, Carolin, et al. "Conditional variable importance for random forests." BMC Bioinformatics 9.1 (2008): 307.
It certainly has an effect on the interpretability of the variable importance measures.
Multicollinearity is the rule rather than the exception when we deal with ecosystems. For example, the often-used WorldClim and meteorological data generally show multicollinearity. In mountains, meteorological data correlate with elevation, which in turn correlates with vegetation cover categories, and so on.
There are several issues to consider; some have been addressed above. Not all correlated geodata have the same spatial resolution and accuracy: WorldClim and meteorological data are not primary data but crude interpolations at a coarse spatial resolution (5 km), compared with a DEM (e.g. 90 m) or NDVI as a cover proxy. Furthermore, multicollinearity may have an impact on model transferability.
You are welcome to have a look at our pertinent articles. Keywords: Majella bear, Majella krummholz, transferability Australia/Spain.
Although the predictive power and reliability of machine learning algorithms are generally not affected by multicollinearity among the variables, the importance of highly collinear variables is divided among them, thereby affecting the overall interpretability of the predictor variables.
Therefore, if you only care about the prediction or classification performance of the random forest classifier, the multicollinearity between variables can be ignored; if the relative importance of these variables needs to be calculated and interpreted, the multicollinearity between the variables should be eliminated as much as possible.
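Here is an illustrative sketch of the importance-splitting effect on made-up data: duplicating an informative feature barely changes the accuracy of scikit-learn's RandomForestClassifier, but the impurity-based importance of that feature is shared between the two copies:

```python
# Duplicating a feature leaves accuracy alone but splits its importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)                       # informative feature
noise = rng.normal(size=n)                        # irrelevant feature
labels = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)

X_plain = np.column_stack([signal, noise])
X_dup = np.column_stack([signal, signal + 0.01 * rng.normal(size=n), noise])

for name, X in [("no duplicate", X_plain), ("near-duplicate", X_dup)]:
    rf = RandomForestClassifier(n_estimators=300, random_state=0)
    acc = cross_val_score(rf, X, labels, cv=5).mean()
    rf.fit(X, labels)
    print(name, "| accuracy:", round(acc, 3),
          "| importances:", np.round(rf.feature_importances_, 2))
```

With the near-duplicate added, the signal's importance drops to roughly half its former value on each copy while accuracy stays the same, which is exactly the interpretability problem described above.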
In my opinion, random forests are not affected much by multicollinearity because they use bootstrap sampling (row sampling) and feature sampling (column sampling): each tree is built from a different subset of features and, of course, sees a different set of data points.
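For reference, the two sampling mechanisms mentioned above map onto explicit parameters in scikit-learn's RandomForestClassifier; the specific values below are illustrative choices, not recommendations:

```python
# The row- and column-sampling knobs behind the argument above.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,     # many trees, each built on its own rows/features
    bootstrap=True,       # row sampling: each tree gets a bootstrap resample
    max_samples=0.8,      # ...drawn from 80% of the rows (needs bootstrap=True)
    max_features="sqrt",  # column sampling: sqrt(n_features) tried per split
    random_state=0,
)
# With correlated features, different trees end up splitting on different
# members of a correlated group, so the ensemble's predictions stay stable.
```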