12 December 2015 3 2K Report

Our lab uses ddRAD (a variant from the family of RAD-seq methods) for simultaneous SNP detection and genotyping. To find homologous loci across multiple individuals, we first cluster reads within an individual, based on some identity threshold (e.g. 90% similarity = same locus), and then do the same among all individuals in order to build a catalog of homologous sites across all individuals, again using some identity threshold. We do this using the program Vsearch (a centroid-clustering algorithm) as implemented by the pipeline pyRAD, but other popular RAD-seq assembling pipelines use a similar methodology (e.g. STACKS). 

My question is: How do we determine the "optimum" identity threshold for clustering? The issue is that if we have a threshold which is too strict, we erroneously split loci, but if our threshold is too relaxed we over-merge loci (for example clumping paralogs together). 

We have put some thought into this in our lab by looking at the relationship between various descriptors of our dataset and varying clustering thresholds (e.g. within-cluster genetic distance, number of parsimonious informative sites, number of loci failing paralog detection filtering, etc) but I am still having difficulty coming up with a logic for choosing a threshold. 

And of course this could also be done by considering the phylogenetic context- e.g. using timescale and mutation rate to predict average genetic distance within a locus, but this involves making a priori assumptions. 

Similar questions and discussions