How to improve my ML models?

03 May 2021 17 1K Report

I am trying to predict the occurrence of individual aquatic plant species (48 species) with Random Forest (RF) models, using six explanatory variables. The datasets are highly unbalanced: for some species presences make up as little as 2.5% of the 2000 observations, for others up to 25%. Not surprisingly, the accuracy (~70%) and Cohen's kappa (~0.2) are not very satisfactory. Moreover, the True Negative (TN) rate is high (~80%) while the True Positive (TP) rate is low (~15%). I have tried several things. Changing the cut-off to 40-45% helps somewhat, but the result is still not satisfactory. I also subsampled (down-sampled) my dataset, built an RF model with 50 trees, repeated this 20 times and combined the 20 RF models into one RF model (somewhat circular reasoning, as this is what down-sampling already does), but this results in similar performance. Changing mtry, the node size (85-100% of the smallest class) or the maximum number of observations ending in a terminal node (0-15% of the smallest class) does not improve performance either. The latter two do "smooth" the patterns, but they do not improve performance or the distinction between TN and TP. The best option seems to be setting the cut-off to 45%, the node size to 90% and the maximum number of observations to 10%.
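The metrics quoted above (accuracy, Cohen's kappa, TP and TN rates) and the effect of moving the cut-off can be computed from predicted presence probabilities in a few lines of plain Python. This is only a minimal sketch: the function names (`confusion_counts`, `summarize`, `best_cutoff`) are hypothetical, and choosing the cut-off by maximizing the true skill statistic (TSS = sensitivity + specificity - 1) is an assumption for illustration, not the procedure used for the models described here.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], p_hat: Sequence[float], cutoff: float
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), predicting presence when p_hat >= cutoff."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_hat):
        pred = 1 if p >= cutoff else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity (TP rate), specificity (TN rate), kappa and TSS."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Agreement expected by chance, from the row and column totals.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
        "tss": sensitivity + specificity - 1,
    }


def best_cutoff(
    y_true: Sequence[int], p_hat: Sequence[float], grid: Sequence[float]
) -> float:
    """Cut-off from `grid` with the highest TSS (first one wins on ties)."""
    return max(
        grid,
        key=lambda c: summarize(*confusion_counts(y_true, p_hat, c))["tss"],
    )


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    y = [0, 0, 0, 1, 1]
    p = [0.1, 0.2, 0.4, 0.35, 0.8]
    cutoff = best_cutoff(y, p, [0.3, 0.5])
    print(cutoff, summarize(*confusion_counts(y, p, cutoff)))
```

Because accuracy is dominated by the majority class at low prevalence, comparing cut-offs on kappa or TSS rather than on accuracy is usually more informative; the cut-off should be chosen on held-out or out-of-bag predictions, not on the training fit.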

First, my guess is that the low performance is of course due to the unbalanced dataset, where the pattern of absences is simply captured better than that of the presences. However, I cannot resolve this with the data I currently have (am I sure that I cannot? Not really). This would mean I need more data (which I want anyhow). Second, TN are easier to predict in general. For example, fish need water: if there is no water, the model predicts no fish (easy peasy). If there is water, the model predicts fish, but the presence of water does not necessarily mean there are fish. For aquatic plants, if flow velocity is > 0.5 m/s, vascular plant species are often absent and mosses are present. Yet a flow velocity < 0.5 m/s does not mean that vascular plants are present or that mosses are absent. Third, the predictor variables are not suitable, and in general the species seem to be distributed widely along the gradients of these predictors (you do not need an ML model to tell you this if you look at the boxplots). Moreover, correlations between predictors are also present (not an issue for prediction, but an issue for inference); for some species this is more apparent than for others, and some species occur everywhere along these gradients. Although this idea seems to float around, relatively few articles actually discuss it (excluding articles addressing the high misclassification rates of Trophic Macrophyte Indices in general):

Article Spatial and environmental effects on hydrophytic macrophyte ...

Article Distribution of aquatic macrophytes in contrasting river sys...

Article River macrophyte indices: Not the Holy Grail!

Article Water Framework Directive: Ecological classification of Danish lakes

Article Long-term dynamics of macrophyte dominance and growth-form t...

Even using different model types (SVM, KNN, binomial GLM) does not really help. Naive Bayes seems to work, but the prior ends up extremely low for some species, so the model hardly ever predicts presence. However I twist and turn (organize) the data, I cannot obtain a satisfactory prediction. Are there any statistics or machine learning experts who have tips or tricks to improve model performance, besides increasing the size of the datasets?
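To make the Naive Bayes point concrete: by Bayes' rule, posterior odds equal prior odds times the likelihood ratio, so a very low prior demands very strong evidence before presence is predicted. The sketch below is only an illustration of that arithmetic, using the 2.5% and 25% prevalences mentioned above; the function names are hypothetical and no particular Naive Bayes library is assumed.

```python
def posterior_presence(prior: float, likelihood_ratio: float) -> float:
    """P(presence | x) from the prior P(presence) and the likelihood ratio
    LR = P(x | presence) / P(x | absence)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def min_likelihood_ratio(prior: float, cutoff: float = 0.5) -> float:
    """Smallest likelihood ratio at which the posterior reaches `cutoff`."""
    cutoff_odds = cutoff / (1.0 - cutoff)
    prior_odds = prior / (1.0 - prior)
    return cutoff_odds / prior_odds


if __name__ == "__main__":
    # Prevalences taken from the question: 2.5% (rarest) and 25% (most common).
    for prior in (0.025, 0.25):
        print(prior, min_likelihood_ratio(prior))
```

With a prior of 2.5%, the predictors must favor presence by a factor of 39 before the posterior reaches 0.5; with a prior of 25%, a factor of 3 suffices. Lowering the cut-off or fitting with a more balanced prior shifts this bar, at the cost of more false positives.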

P.S. Perhaps I should start a contest on Kaggle.
