Hi everyone. I need to calculate Jaccard distances for different bean genotypes using microsatellite (SSR) data. I don't know how to handle missing or null genotypes.
If you are calculating the values for individuals, one approach is to calculate the distance for each locus separately, omit any locus that has a missing or null genotype, and average the Jaccard distance over the loci that you didn't omit. I use this approach for calculating Kosman and Leonard (2005) individual pairwise genetic distances in my R package PopGenReport. David Winter calculates the same genetic distances in the R package mmod, but his approach is to omit any individual with a missing or null genotype from the calculation of genetic distances. My approach is a bit more in line with the one taken by Smouse and Peakall (1999) for their individual pairwise genetic distances.
David's approach is better in that you calculate the same number of per-locus Jaccard distances for each pair of individuals, so it's a bit closer to comparing apples to apples. With the approach I've taken, the distances between pairs of individuals can be skewed a bit because different pairs may be compared over different numbers of loci (in most cases it's still fairly close to comparing apples to apples), but the benefit is that you get to use more of your data, particularly if there is a fair amount of missing/null data in your data set.
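To make the per-locus approach concrete, here is a minimal Python sketch. It is not the PopGenReport or mmod code; the function names and the data layout (each individual as a list of per-locus allele tuples, with `None` marking a missing or null genotype) are assumptions for illustration. It treats each genotype as a set of alleles, computes a Jaccard distance per locus, skips loci where either individual is missing, and averages over what is left.

```python
from statistics import mean

def is_missing(genotype):
    """True if the genotype is absent or contains a missing (None) allele."""
    return genotype is None or len(genotype) == 0 or any(a is None for a in genotype)

def locus_jaccard(g1, g2):
    """Jaccard distance between the allele sets of two genotypes at one locus."""
    s1, s2 = set(g1), set(g2)
    return 1 - len(s1 & s2) / len(s1 | s2)

def pairwise_jaccard(ind1, ind2):
    """Mean per-locus Jaccard distance, omitting loci missing in either individual.

    Returns (distance, n_loci_used); distance is None if no locus is usable.
    """
    distances = [
        locus_jaccard(g1, g2)
        for g1, g2 in zip(ind1, ind2)
        if not is_missing(g1) and not is_missing(g2)
    ]
    if not distances:
        return None, 0
    return mean(distances), len(distances)

# Hypothetical diploid SSR genotypes (allele sizes in bp) at three loci.
ind_a = [(120, 124), (200, 204), None]        # third locus missing
ind_b = [(120, 126), (204, 204), (310, 312)]

distance, n_used = pairwise_jaccard(ind_a, ind_b)
print(distance, n_used)
```

Returning the number of loci used alongside the distance makes it easy to see which pairs were compared over fewer loci, which is exactly where the apples-to-apples concern arises.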
An example of how the approach I've taken affects genetic distances...
Suppose you have 2 pairs of diploid individuals and you look at 2 loci for each pair. If there is a lot of genetic diversity (the effect is similar with little diversity), so that each pair of individuals has only 1 allele in common out of 4 potential alleles, then the genetic similarity of each pair is 1/4. However, if data are missing for one of the loci in one of the pairs, we drop that locus, so one pair keeps a genetic similarity of 1/4 while the other has 1/2 (assuming the shared allele is at the locus that is kept). This effect is dampened by looking at more loci: losing one locus gives a difference of 1/4 if you started with 2 loci (1/2 - 1/4) and 1/12 if you started with 3 loci (1/4 - 1/6); skipping a few cases, it is a 1/112 difference when dropping from 8 loci to 7 (1/14 - 1/16).
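The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the counting used above (shared alleles divided by potential alleles, with the one shared allele assumed to sit at a locus that is kept); the function name `similarity_shift` is made up for illustration. Under this counting, dropping from 8 loci to 7 gives 1/14 - 1/16 = 1/112.

```python
from fractions import Fraction

def similarity_shift(n_loci, ploidy=2, shared=1):
    """Change in similarity when one locus is dropped from n_loci.

    Similarity is counted as shared alleles / potential alleles, and the
    shared allele is assumed to be at a locus that is not dropped.
    """
    if n_loci < 2:
        raise ValueError("need at least 2 loci to drop one")
    full = Fraction(shared, ploidy * n_loci)
    reduced = Fraction(shared, ploidy * (n_loci - 1))
    return reduced - full

for n in (2, 3, 8):
    print(n, similarity_shift(n))
# 2 1/4
# 3 1/12
# 8 1/112
```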