I try to predict the occurrence of individual aquatic plants (48 species) with Random Forest (RF) models. For this I use six explanatory variables. The datasets are highly unbalanced. Lets say minimal 2.5% have presences, but can also go up to 25% (of 2000 observations). Not surprisingly, the accuracy (~70%) and Cohen's kappa (~0.2) are not very satisfactory. Moreover, the True Negative (TN) rate is high (~80%) while True Positive (TP) rate is low (~15%). I tried multiple things from changing the cut-off to 40-45%, which works somehow (still not satisfactory). Additionally, I subsampled my dataset (also down-sampling), build an RF model with 50 trees and repeat this 20 times and combine these 20 RF models in a one RF model (somehow circle reasoning as this is what down sampling does), but results in similar performance. Changing the mtry, node size (85-100% of the lowest class) or maximum number of observations ending in the terminal node (0-15% of the lowest class) also does not improve the performance. However, the latter two "smooth" the patterns, but does not improve performance or distinction between TN and TP. The best option seems to set the cut-off to 45%, node size to 90% and maximum obs to 10%.
First, my guess resulting to the low performance is of course due to the unbalanced dataset, where simply the pattern of absences is better captured than that of the presences. However, I cannot resolve this with the data I currently have (am I sure that I cannot resolve this? not really). This would mean I need more data (anyhow I want this). Second, TN are easier to predict in general. For example, fish need water, if there is no water the model predicts no fish (easy peasy). However, if there is water the model predicts fish, but because there is water, this does not necessarily mean there is fish. For aquatic plants, if flow velocity is > 0.5 m/s species of vascular plants are often absent and mosses are present. Yet, if flow velocity < 0.5 m/s this does not mean vascular are present or mosses are absent. Third, the predictor variables are not suitable and in general the species seem to distributed widely along the gradient of these predictors (you do not need an ML model to tell you this if you look at the boxplots). Moreover, correlations between predictors also present (while not an issue for prediction it is an issue for inference), for some species this is more apparent than others; and some species occur everywhere along these gradient. Although, this idea somehow seems to float around, actually relative little articles discuss this (excluding articles addressing the high misclassification rates of Trophic Macrophyte Indices in general):
Article Spatial and environmental effects on hydrophytic macrophyte ...
Article Distribution of aquatic macrophytes in contrasting river sys...
Article River macrophyte indices: Not the Holy Grail!
Article Water Framework Directive: Ecological classification of Danish lakes
Article Long-term dynamics of macrophyte dominance and growth-form t...
Even using different model types does not really work (SVM, KNN, GLM, [binomial]). Naive Bayes seem to work, but the prior ends up extremely low for some species thus the model hardly predicts presence. However I turn or twist (organize) the data, I cannot obtain a satisfactory prediction. Are there any statistic or machine learning experts who have any tips or tricks to improve model performance, besides increasing the datasets?
P.S. Perhaps I should start a contest on Kaggle.