I have pH data for 1,851 soil samples covering a study area of 7,482 sq. km in northern Ghana, and I am using 52 long-term average environmental variables (relief, climate, MODIS reflectances and derived products) to fit a model that explains the variability in pH for prediction. So far, every model I have tested shows low explained variance, and in some cases the value is even negative.
How can I improve the explained variance reported below?
Please see the attached map showing the spatial distribution of the sample points, and the Excel metadata file with details about the covariates used.
Below is a summary of the explained variance for each model.
Multiple Linear Regression: 0.03
Stepwise Multiple Linear Regression: 0.04
Random Forest: -7.02
Extreme Gradient Boosting: 0.03
Support Vector Machines with Polynomial Kernel: 0.03
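For readers wondering how a negative value can occur at all, here is a minimal sketch. It assumes that "explained variance" means 1 - SS_res / SS_tot evaluated on observations the model did not see during fitting; the exact definition differs between software packages, so this may not match how the figures above were produced. The function name `explained_variance` and the three pH values are hypothetical and purely illustrative, not taken from my data.

```python
from statistics import fmean


def explained_variance(observed, predicted):
    """Return 1 - SS_res / SS_tot for paired observed and predicted values.

    Assumed definition (hypothetical helper, not a library function).
    """
    mean_obs = fmean(observed)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the observed mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up pH values, for illustration only
observed = [5.0, 6.0, 7.0]
mean_only = [6.0, 6.0, 6.0]   # always predicts the mean
worse = [7.0, 6.0, 5.0]       # predicts the opposite trend

print(explained_variance(observed, observed))   # perfect prediction
print(explained_variance(observed, mean_only))  # no better than the mean
print(explained_variance(observed, worse))      # worse than the mean
```

Under this definition, a value near zero means the model predicts about as well as the sample mean, and a negative value means it predicts worse than the mean.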
Additional information about the sample data
My assessment so far
I would be most grateful for insights into any techniques that could help improve the explained variance of these models.