I am currently building species distribution models from presence-only data, and I would like to spatially filter the presence points in order to reduce sampling bias and to make sure that the data used for building and evaluating the models (I'm using partition-based evaluation) are independent.
I have an R function that keeps one occurrence point and removes all other points within a specified nearest-neighbour distance. However, the choice of distance at which to rarefy the data is often arbitrary.
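For reference, a minimal sketch of this kind of distance-based thinning in base R. It assumes the coordinates are projected (in metres) and uses a simple greedy rule; the function name `thin_points` and the simulated points are hypothetical and only for illustration, not the actual function in use.

```r
# Greedy distance-based thinning: keep a point, drop every other point
# closer than `min_dist` to it, then move on to the next surviving point.
# Assumption: `xy` is a two-column matrix of projected coordinates in metres.
thin_points <- function(xy, min_dist) {
  xy <- as.matrix(xy)
  d <- as.matrix(dist(xy))            # pairwise Euclidean distances
  keep <- rep(TRUE, nrow(xy))
  for (i in seq_len(nrow(xy))) {
    if (!keep[i]) next                # already removed by an earlier point
    too_close <- d[i, ] < min_dist
    too_close[i] <- FALSE             # never drop the point itself
    keep[too_close] <- FALSE
  }
  xy[keep, , drop = FALSE]
}

# Simulated presence points in a 250 x 50 km area (illustrative data only)
set.seed(1)
pts <- cbind(x = runif(200, 0, 250000), y = runif(200, 0, 50000))
thinned <- thin_points(pts, min_dist = 1000)
```

The result depends on the order of the points, so a greedy rule like this does not necessarily retain the maximum possible number of points.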
My concern now is how to select and justify the distance threshold. At the moment I am using a 1 km threshold (duplicate points within a 1 km buffer are removed). My study area is ~250 x 50 km and quite spatially heterogeneous, and the native resolution of my coarsest environmental variable is 1 km (resampled to 30 m).
I feel that a 1 km threshold is reasonably adequate, but I cannot properly justify this choice. Does anyone have tips on this issue, or articles to direct me towards? One method that I've come across in the literature so far is to use the variogram range, i.e. the distance beyond which points become spatially independent, but I am still not clear on how to use my environmental variables to build the semivariogram, so any tips here would be much appreciated.
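For concreteness, a minimal base-R sketch of what an empirical semivariogram of one environmental variable might look like, assuming the variable has already been extracted at the point locations and the coordinates are projected in metres. The function name `empirical_variogram`, the bin width, the cutoff and the simulated data are all hypothetical choices for illustration, not a recommended setup.

```r
# Empirical semivariogram: for each distance bin, average (z_i - z_j)^2 / 2
# over all point pairs whose separation falls in that bin.
# Assumptions: `xy` is a two-column matrix of projected coordinates (metres),
# `z` is the environmental variable extracted at those points.
empirical_variogram <- function(xy, z, width, cutoff) {
  d <- as.vector(dist(xy))             # pairwise distances
  g <- as.vector(dist(z))^2 / 2        # semivariance of each pair
  ok <- d <= cutoff
  bins <- cut(d[ok], breaks = seq(0, cutoff, by = width),
              include.lowest = TRUE)
  data.frame(
    dist  = as.vector(tapply(d[ok], bins, mean)),   # mean pair distance per bin
    gamma = as.vector(tapply(g[ok], bins, mean)),   # mean semivariance per bin
    n     = as.vector(table(bins))                  # number of pairs per bin
  )
}

# Simulated, spatially structured variable (illustrative data only)
set.seed(42)
xy <- cbind(runif(300, 0, 50000), runif(300, 0, 50000))
z  <- sin(xy[, 1] / 5000) + rnorm(300, sd = 0.2)
vg <- empirical_variogram(xy, z, width = 1000, cutoff = 20000)
```

Plotting `vg$gamma` against `vg$dist` shows where the curve levels off; that distance is the candidate range. The `gstat` package provides `variogram()` and `fit.variogram()` for estimating the range from a fitted model rather than by eye.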