I sent a test plate to the sequencing facility we work with for ddRAD sequencing of a non-model organism without a reference genome. The 50 samples came from 3 field sites in 1 region, and the facility ran all 50 in one lane of an Illumina HiSeq. When the data came back, they looked fine in preliminary population genetic analyses after de novo assembly, and I recovered around 8,000 SNPs after filtering.

I then sent a larger batch of 300 samples to be sequenced following the same protocol. These came from ~15 sites across 2 regions (5 in one region, 10 in the other). The facility pooled all 300 samples in a single Illumina HiSeq lane and ran that lane twice, rather than running 100 samples in each of 3 lanes once, as I had expected. They said that 300 samples in one lane run twice would give better quality than 100 samples in each of 3 lanes run once.

When I received the 300 back, I combined them with the 50 from the test plate and filtered and called SNPs on all 350 together. I recovered substantially fewer SNPs, but I expected that given the reduced depth of coverage from running so many samples per lane.
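To put a rough number on that depth difference, here is a minimal sketch of the share of a lane each sample would receive under each layout. It assumes equal pooling across samples and equal yield per lane run, neither of which I have confirmed with the facility, and the function name is just illustrative.

```python
from fractions import Fraction


def lane_share_per_sample(n_samples, n_lane_runs):
    """Share of one lane's output per sample, assuming equal pooling
    and equal yield per lane run (both are assumptions)."""
    return Fraction(n_lane_runs, n_samples)


# Test plate: 50 samples in 1 lane, run once.
test_plate = lane_share_per_sample(50, 1)

# Large batch as actually run: 300 samples in 1 lane, run twice.
batch_actual = lane_share_per_sample(300, 2)

# Large batch as I expected: 100 samples in each of 3 lanes, run once.
batch_expected = lane_share_per_sample(300, 3)

print("test plate:    ", test_plate)
print("batch actual:  ", batch_actual)
print("batch expected:", batch_expected)
print("test plate vs. batch actual:", test_plate / batch_actual)
```

Under those assumptions, each test-plate sample would get about three times the sequencing effort of each sample in the 300 batch.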

The main issue, however, is that the combined dataset of 350 samples produces 2 genetic clusters, both in the program Structure and in a Principal Coordinates Analysis (PCoA) in GenAlEx. One cluster is the 300 and the other is the 50. Neither cluster can be biologically real, because some field sites in the 50 group are very close to field sites in the 300 group. What could be causing the genetic "clusters" to align with sequencing run?
