I have pH data for 1,851 soil samples covering a study area of 7,482 sq. km in northern Ghana, and I am using 52 long-term average environmental covariates (relief, climate, MODIS reflectances and derived products) to fit a model that explains the variability in pH for prediction. So far, all the models I have tested show low explained variance, and sometimes the value is even negative.

How can I improve the explained variance values reported below?

Please see the attached spatial distribution of the sample points and the metadata Excel file with details of the covariates used.

Below is a summary of my models' explained variance.

Multiple Linear Regression: 0.03

Stepwise Multiple Linear Regression: 0.04

Random Forest: -7.02

Extreme Gradient Boosting: 0.03

Support Vector Machine with Polynomial Kernel: 0.03
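
For reference, here is a minimal sketch of how an explained variance score can come out negative. It assumes the score is defined as 1 minus the ratio of residual variance to observed variance; the software behind the numbers above may use a different definition or report a percentage. The function name and the pH values are hypothetical and are not taken from the study data.

```python
def explained_variance(y_true, y_pred):
    """Return 1 - Var(residuals) / Var(observed).

    Assumed definition for illustration; the score is at most 1,
    has no lower bound, and so can be negative.
    """
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_res = sum(residuals) / n
    mean_true = sum(y_true) / n
    var_res = sum((r - mean_res) ** 2 for r in residuals) / n
    var_true = sum((t - mean_true) ** 2 for t in y_true) / n
    return 1.0 - var_res / var_true


# Hypothetical pH values, for illustration only.
observed = [5.0, 6.0, 7.0]
good = [5.1, 6.0, 6.9]        # close to the observations
mean_only = [6.0, 6.0, 6.0]   # always predicts the observed mean
bad = [7.0, 6.0, 5.0]         # anti-correlated with the observations

print(explained_variance(observed, good))       # close to 1
print(explained_variance(observed, mean_only))  # zero
print(explained_variance(observed, bad))        # negative
```

Under this definition, a negative score means the predictions are worse than always predicting the mean pH of the samples.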

Additional information about the sample data

  • Average distance between two sample points (nearest-neighbor analysis): 2551.29 m
  • Data source: Student research data
  • Sampling method: hybrid grid/management-zone soil sampling. The grid size is 2-4 sq. km; management zones are subdivisions within the grid.

My assessment so far

  • Removed spatial outliers
  • Removed value outliers
  • Normality check was ok (please see attached)
  • Variography shows a spatial structure (please see attached)
  • Tried Recursive Feature Elimination to reduce the dimensionality, but it did not show any improvement
  • Tried reducing the dimensionality by removing highly correlated covariates (correlation threshold of 0.75)

I would be most grateful for insights into any techniques that could help improve the models' explained variance.
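
To make the correlation filter mentioned in the list above concrete, here is a minimal sketch of one simple way to do it. It assumes a greedy rule: a covariate is kept only if its absolute Pearson correlation with every covariate already kept is at most 0.75. This is not necessarily the exact procedure I ran, and the function names, covariate names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.75):
    """Greedy filter: keep a covariate only if |r| with every
    covariate kept so far is at most the threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical covariates, for illustration only.
covariates = {
    "elevation": [100, 120, 140, 160, 180],
    "temperature": [30.0, 29.1, 28.0, 27.2, 26.0],  # falls almost linearly with elevation
    "ndvi": [0.5, 0.2, 0.6, 0.1, 0.4],
}

kept = drop_correlated(covariates)
print(kept)
```

With this greedy rule the result depends on the order of the covariates, so a different ordering can keep a different subset.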
