Variogram for clustered data?

Hi Asad,

I should start by saying that I disagree with people who state that kriging is unaffected by clustered data. It is as much affected as any other weighted-average interpolation method. What does help is preparing a continuity model (typically a variogram model) that can give less weight to closer samples relative to more distant ones, thus reducing this "cluster influence".

A nested model (or models) is just a way of saying that you are using more than one function at the same time to produce a working continuity model. In practice we could, for example, say we have a spherical model for 40 % of the sill and an exponential for the remaining 60 %. This does not mean those models will be clipped by sill or range values; it is just a useful way of parameterizing, because you often find horizontal structures in experimental variograms. Nested models also make it easier to obtain sharper curves, and so they are better suited to modeling a range of different phenomena (like zonation).
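As a minimal sketch of the 40 % / 60 % example above, the nested model is simply the sum of the two structures. The ranges (50 and 200), the unit total sill, and the practical-range convention for the exponential (the factor 3) are assumptions chosen for illustration only.

```python
import numpy as np

def spherical(h, sill, rng):
    """Spherical structure: reaches its sill exactly at h = rng."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / rng, 1.0)
    return sill * (1.5 * r - 0.5 * r**3)

def exponential(h, sill, rng):
    """Exponential structure: approaches its sill asymptotically.
    Assumes the practical-range convention (about 95 % of the sill at h = rng)."""
    h = np.asarray(h, dtype=float)
    return sill * (1.0 - np.exp(-3.0 * h / rng))

def nested(h, total_sill=1.0, range_sph=50.0, range_exp=200.0):
    """Hypothetical nested model: 40 % spherical + 60 % exponential."""
    return (spherical(h, 0.4 * total_sill, range_sph)
            + exponential(h, 0.6 * total_sill, range_exp))

lags = np.array([0.0, 25.0, 50.0, 200.0, 1000.0])
gamma = nested(lags)
print(gamma)
```

Beyond a lag of 50 the spherical part is flat, so the curve changes slope there while the exponential part keeps rising toward the total sill.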

I would not like to risk a general recipe for every clustered-data case study. Every problem has its own quirks. But here are some questions you have to ask yourself in order to make better choices, since clustered data can be a source of bias in several components of interpolation:

1) Does the histogram of my data reflect reality? (This is particularly important for sequential simulation methods such as SGS, since they typically enforce honoring the probability distribution.)

2) Are there any visible outliers in the clustered data?

3) What kind of variability do I have in the clusters? (Well-behaved clusters should have low variability over small distances; badly behaved clusters might force some kind of nugget effect, thus decreasing the power of the continuity model to weight down the cluster "influence".)

4) Can a trend be modeled in the clustered data and, if so, what is the variability of the residuals? (Kriging with external drift might make your case study much easier to model.)

Donald Myers

I believe your question really relates to clustered data locations, not to clustered data "values". E.g., clustering in data locations would not have any effect on a histogram. The real question is how the spatial pattern of data locations affects the computation of an empirical variogram. A variogram is a function, whereas an empirical variogram provides estimates of the values of a variogram for a number of lag distances; each estimate is actually an average of half squared differences. So ideally one wants to meet two conflicting goals: (1) estimates for a large number of lag distances, and in particular many short lag distances; (2) a large number of half squared differences in each average. Unfortunately the total number of half squared differences is completely fixed by the total number of data locations, so the result is a compromise. In particular, the empirical variogram is not uniquely determined by the set of data locations (nor by the data values); it depends on choices made by the researcher, e.g. the number of lag distances and the lengths of the lags. See A. Warrick and D.E. Myers (1987), Optimization of Sampling Locations for Variogram Calculations, Water Resources Research 23, 496-500.
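The description above (each estimate is an average of half squared differences within a lag class) can be sketched as follows. This is an omnidirectional estimator with no angular tolerance; the function name and the `lag_width` and `n_lags` parameters are hypothetical and stand for the researcher's choices mentioned above.

```python
import numpy as np

def empirical_variogram(coords, values, lag_width, n_lags):
    """Average half squared differences of all pairs, binned by distance.
    Bin k collects pairs with k*lag_width <= distance < (k+1)*lag_width."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)       # every pair exactly once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    half_sq = 0.5 * (values[i] - values[j]) ** 2
    bins = np.floor(dist / lag_width).astype(int)
    gamma = np.full(n_lags, np.nan)                # NaN marks an empty lag class
    counts = np.zeros(n_lags, dtype=int)
    for k in range(n_lags):
        in_bin = bins == k
        counts[k] = in_bin.sum()
        if counts[k] > 0:
            gamma[k] = half_sq[in_bin].mean()
    return gamma, counts

# Toy data: four points on a line, values equal to the x coordinate.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
z = np.array([0.0, 1.0, 2.0, 3.0])

gamma_fine, counts_fine = empirical_variogram(pts, z, lag_width=1.0, n_lags=4)
gamma_coarse, counts_coarse = empirical_variogram(pts, z, lag_width=2.0, n_lags=2)
print(gamma_fine, counts_fine)
print(gamma_coarse, counts_coarse)
```

The same six pairs give different empirical variograms for the two lag widths, and the pair counts per lag show the trade-off between many lags and many pairs per lag.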

Both before and after you compute an empirical variogram, be sure to do some exploratory statistics: e.g. a histogram of the data values, a coded plot of the data locations (each location coded by the data value), and a trend surface fitted to the data. Most important, ask yourself what you know about the phenomenon that supposedly generated the data. You are not looking for an absolute or fixed set of answers. To compute an empirical variogram you have to make choices; try changing these choices to see how the empirical variogram changes.

As pointed out by another responder, clustering of the data locations affects the subsequent kriging differently from the estimation/modeling of the variogram. The kriging equations (unlike Inverse Distance Weighting or Nearest Neighbor) essentially "de-cluster" the data locations.

Because the shape of the variogram is most important for short lag distances, you want some pairs of data locations that are close together but on the other hand you don't want the "centers" of the clusters to be too far apart.

The comments about nested models are completely irrelevant. A nested variogram model is one that is a sum of several variogram models, which may be of different types (e.g. Gaussian, exponential, spherical) and/or have different model parameters (range, sill). This is a way to obtain variograms with more complex shapes than are allowed by the individual types (Gaussian, exponential, spherical) or by changing their parameters.

Once you have computed an empirical variogram and fitted a model, you still need to ask whether it is a good fit, and you may want to explore how changing parameters or model types will affect the kriging results. For that one usually uses cross-validation. Any good geostatistical software should include these various tools.
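A minimal sketch of leave-one-out cross-validation with ordinary kriging is given below, assuming an exponential model with no nugget and a global search neighborhood. The function names, the synthetic data, and the two candidate ranges are hypothetical; real software would add search neighborhoods and standardized errors.

```python
import numpy as np

def exp_model(h, sill=1.0, practical_range=30.0):
    """Exponential variogram model (assumed parameters, no nugget)."""
    h = np.asarray(h, dtype=float)
    return sill * (1.0 - np.exp(-3.0 * h / practical_range))

def ok_predict(coords, values, target, variogram):
    """Ordinary kriging at one target, written in variogram form."""
    n = len(values)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(d)
    A[n, n] = 0.0                                  # Lagrange multiplier row/column
    b = np.ones(n + 1)
    b[:n] = variogram(np.linalg.norm(coords - target, axis=1))
    sol = np.linalg.solve(A, b)
    weights, mu = sol[:n], sol[n]
    return weights @ values, weights @ b[:n] + mu  # prediction, kriging variance

def loo_cv(coords, values, variogram):
    """Re-predict each datum from all the others; return prediction errors."""
    idx = np.arange(len(values))
    errors = np.empty(len(values))
    for k in idx:
        keep = idx != k
        pred, _ = ok_predict(coords[keep], values[keep], coords[k], variogram)
        errors[k] = pred - values[k]
    return errors

gen = np.random.default_rng(0)
coords = gen.uniform(0.0, 100.0, size=(25, 2))
values = np.sin(coords[:, 0] / 20.0) + 0.1 * gen.normal(size=25)

for a in (10.0, 40.0):                             # two candidate ranges
    err = loo_cv(coords, values, lambda h: exp_model(h, practical_range=a))
    print(a, err.mean(), np.sqrt((err ** 2).mean()))
```

Comparing the mean error and the root-mean-square error across candidate models is one simple way to see how a change in model parameters affects the kriging results.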

Michael Edward Hohn

As Don Myers says, some clustering of data locations can be a good thing because it does provide the closely-spaced pairs so important in modeling the variogram.

Where clustered data can be a problem is if there has been disproportionate sampling of favorable areas, especially in the presence of a proportional effect, where variability is proportional to the local average. Papers by Journel in the 1980s covered some of these issues. Also, the book by Isaaks and Srivastava, "Applied Geostatistics", shows examples. Some packages have a declustering routine that can be used to explore the data and perhaps mitigate the problems.
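One common declustering approach is cell declustering, sketched below as an assumption about what such a routine might do (the thread does not say which method any particular package uses). Each occupied grid cell receives the same total weight, shared equally among the samples it contains. The function name and the cell size are hypothetical, and in practice one would try several cell sizes.

```python
import numpy as np

def cell_decluster_weights(coords, cell_size):
    """Weight each sample by 1 / (number of samples in its grid cell),
    then rescale so that the weights average to 1."""
    coords = np.asarray(coords, dtype=float)
    cells = np.floor((coords - coords.min(axis=0)) / cell_size).astype(int)
    _, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()                      # shape differs across NumPy versions
    w = 1.0 / counts[inverse]
    return w * len(w) / w.sum()

# Toy data: three high values sampled close together, one low value far away.
pts = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0], [5.0, 5.0]])
z = np.array([10.0, 10.0, 10.0, 2.0])

w = cell_decluster_weights(pts, cell_size=2.0)
naive_mean = z.mean()
declustered_mean = np.average(z, weights=w)
print(w, naive_mean, declustered_mean)
```

In this toy case the naive mean is pulled toward the oversampled high values, while the declustered mean gives the two occupied cells equal influence.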

Asad Ali

Don Myers says "I believe your question really relates to clustered data locations, not to clustered data "values"". Exactly. The data essentially consist of eight clusters: five are large, with ~30+ values each; one has 18 values; and two are rather small, with only five and six values. Just four points are located alone (see image 1 below). Thus there are multiple clusters in the data.

I have done some simple summary statistics and ANOVA with LSD and other multiple-comparison tests, and found that almost all clusters show no differences, except the two smallest ones. Later, I tried a pairwise relative variogram (Deutsch and Journel, 1997), which resulted in a reduction of the sill by some constant multiple; otherwise the shapes of the variograms remained exactly the same. Further, the kriging predictions are again exactly the same, but the kriging variances were reduced by some constant multiple. So overall, the only thing one can see from using a pairwise relative empirical variogram is a reduction in the magnitudes of gamma and of the kriging variances. Finally, I should mention that the data are almost free of trend (either first or second order).

I am attaching a few images for your viewing. The first image shows the data, with clusters identified by different colors. The second image shows the simple and pairwise relative empirical variograms, and the third shows the variogram models fitted to those empirical variograms. The fourth image shows the kriging predictions for both the simple and pairwise relative variograms, and the last one shows the kriging variances for both cases. One can easily see from their color legends that the kriging prediction maps are almost exactly the same. The kriging variance maps are also the same, differing only in their scales.
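The behavior described here is consistent with a general property of ordinary kriging: if two variogram models differ only by a constant factor, the kriging weights (and so the predictions) are identical, and the kriging variances differ by that same factor. The sketch below checks this numerically; the exponential model, the range of 30, the sills of 4 and 1, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def exp_variogram(h, sill, practical_range=30.0):
    """Exponential model; only the sill differs between the two runs."""
    h = np.asarray(h, dtype=float)
    return sill * (1.0 - np.exp(-3.0 * h / practical_range))

def ordinary_kriging(coords, values, target, sill):
    """Solve the ordinary kriging system in variogram form for one target."""
    n = len(values)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = exp_variogram(d, sill)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = exp_variogram(np.linalg.norm(coords - target, axis=1), sill)
    sol = np.linalg.solve(A, b)
    prediction = sol[:n] @ values
    variance = sol[:n] @ b[:n] + sol[n]
    return prediction, variance

gen = np.random.default_rng(1)
coords = gen.uniform(0.0, 100.0, size=(15, 2))
values = gen.normal(10.0, 2.0, size=15)
target = np.array([50.0, 50.0])

pred_a, var_a = ordinary_kriging(coords, values, target, sill=4.0)
pred_b, var_b = ordinary_kriging(coords, values, target, sill=1.0)
print(pred_a, pred_b)        # same prediction
print(var_a, var_b)          # variances differ by the sill ratio
```

So a change of sill alone cannot change a kriged map; only a change in the shape of the fitted model (range, nugget proportion, model type) can.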

Donald Myers

The pairwise relative variogram is a variation on the simple empirical variogram; it is not a new or different variogram model. It is quite possible that using the relative empirical variogram will lead you to use a different variogram model and/or different parameters. Using the relative variogram does not directly affect the kriging results; it can only produce different estimates of the values of the theoretical variogram. It is not possible to use an empirical variogram in the kriging equations.
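To make the "variation on the simple empirical variogram" concrete, here is a minimal sketch of the pairwise relative estimator in its usual form: each squared difference is divided by the squared mean of the two values in the pair before averaging. It assumes strictly positive data; the function name and the binning parameters are hypothetical.

```python
import numpy as np

def pairwise_relative_variogram(coords, values, lag_width, n_lags):
    """Like the simple empirical variogram, but each pair's difference is
    first scaled by the pair mean. Assumes strictly positive values."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    pair_mean = 0.5 * (values[i] + values[j])
    term = 0.5 * ((values[i] - values[j]) / pair_mean) ** 2
    bins = np.floor(dist / lag_width).astype(int)
    gamma = np.full(n_lags, np.nan)                # NaN marks an empty lag class
    for k in range(n_lags):
        in_bin = bins == k
        if in_bin.any():
            gamma[k] = term[in_bin].mean()
    return gamma

pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
z = np.array([1.0, 3.0, 1.0])
gamma_pr = pairwise_relative_variogram(pts, z, lag_width=1.0, n_lags=3)
print(gamma_pr)
```

The output is still only a set of estimates per lag class, so a variogram model has to be fitted to it before kriging, exactly as with the simple empirical variogram.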

If using the relative variogram seems to give a variogram model that produces better kriging results, that is fine. I would still suggest that you consider using cross-validation.
