How to deal with missing microsatellite markers (SSR) when calculating distances (Jaccard) between genotypes?

If you are calculating the values for individuals, one approach is to calculate the distance for each locus separately, omit any loci that have missing or null genotypes, and average the Jaccard distances over the loci that remain. I use this approach for calculating Kosman and Leonard (2005) individual pairwise genetic distances in my R package PopGenReport. David Winter calculates the same genetic distances in the R package mmod, but his approach is to omit any individuals who have missing or null genotypes from the calculation. My approach is a bit more in line with the one taken by Smouse and Peakall (1999) for their individual pairwise genetic distances. David's approach is better in that every pair of individuals is compared over the same number of loci, so it is closer to comparing apples to apples. With my approach, the distances between some pairs of individuals can be exaggerated a bit (although in most cases it is still fairly close to comparing apples to apples), but the benefit is that you get to use more of your data, particularly if a fair amount of your data set is missing or null.
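To make the "drop the missing loci, then average" idea concrete, here is a minimal Python sketch. It is not the code used in PopGenReport or mmod, and it computes a plain per-locus Jaccard distance rather than the Kosman and Leonard (2005) distance. The function names and the data layout (one tuple of alleles per locus, `None` for a missing or null genotype) are assumptions made only for illustration.

```python
from statistics import mean


def jaccard_distance(genotype_a, genotype_b):
    """Jaccard distance between the allele sets of two genotypes at one locus."""
    alleles_a, alleles_b = set(genotype_a), set(genotype_b)
    union = alleles_a | alleles_b
    if not union:
        raise ValueError("both genotypes are empty")
    return 1 - len(alleles_a & alleles_b) / len(union)


def mean_locus_distance(individual_a, individual_b):
    """Average the per-locus Jaccard distance over the loci scored in both individuals.

    Each individual is a list with one genotype per locus (hypothetical layout):
    a tuple of alleles, or None when the genotype is missing or null.
    Returns None if the pair has no locus scored in both individuals.
    """
    distances = [
        jaccard_distance(g_a, g_b)
        for g_a, g_b in zip(individual_a, individual_b)
        if g_a is not None and g_b is not None  # omit loci with missing data
    ]
    return mean(distances) if distances else None


# Three SSR loci; the second locus is missing in individual b (made-up allele sizes).
ind_a = [(150, 154), (200, 200), (310, 312)]
ind_b = [(150, 156), None, (310, 312)]
print(mean_locus_distance(ind_a, ind_b))  # averaged over loci 1 and 3 only
```

The alternative approach described above would instead remove `ind_b` from the data set before any distances are calculated, so that every remaining pair is compared over all three loci.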

An example of how the approach I've taken affects genetic distances...

Suppose you have 2 pairs of diploid individuals and you look at 2 loci for each pair. If there is a lot of genetic diversity (the effect is similar with little diversity), so that each pair of individuals has only 1 allele in common out of 4 potential alleles, then the genetic similarity of each pair is 1/4. However, if data are missing for one of the loci in one of the pairs, we drop that locus, and that pair now has a genetic similarity of 1/2 while the other pair still has 1/4. This effect is dampened by looking at more loci. Dropping one locus shifts the similarity by 1/4 if you started with 2 loci (1/2 - 1/4) and by 1/12 if you started with 3 loci (1/4 - 1/6). Skipping a few cases, dropping from 8 loci to 7 shifts it by 1/112 (1/14 - 1/16).
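The arithmetic in this example can be checked with exact fractions. The sketch below assumes, as the example does, diploid individuals that share a single allele across all loci, and it further assumes that the dropped locus is not the one carrying the shared allele. The function names are hypothetical.

```python
from fractions import Fraction


def similarity(n_loci, shared_alleles=1, ploidy=2):
    """Shared alleles divided by the number of potential alleles (ploidy * loci)."""
    return Fraction(shared_alleles, ploidy * n_loci)


def shift_from_dropping_one_locus(n_loci, shared_alleles=1, ploidy=2):
    """Increase in similarity when one locus without shared alleles is dropped."""
    before = similarity(n_loci, shared_alleles, ploidy)
    after = similarity(n_loci - 1, shared_alleles, ploidy)
    return after - before


for n_loci in (2, 3):
    print(n_loci, similarity(n_loci), shift_from_dropping_one_locus(n_loci))
# 2 1/4 1/4
# 3 1/6 1/12
```

Calling `shift_from_dropping_one_locus` with larger locus counts shows how quickly the effect shrinks as more loci are scored.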

Aaron Thomas Adamack
