Hello everyone,
I am doing a phylogeographic/population genetic study on an island species with limited dispersal capabilities, and I'm having difficulty distinguishing real biological signal from bioinformatic artifacts. I have ddRADseq data that came in two separate sequencing batches: the first consists of 96 samples and the second of 24 samples.
I ran a PCA on the first 96 samples and it showed a certain clustering pattern. When I added the remaining 24 samples, the pattern remained largely the same, except that the second-batch samples formed 3 distinct, tight clusters corresponding to the populations those samples come from.
When I ran a PCA on the second batch alone (24 samples), the clustering was almost exactly the same as in the combined analysis.
I want to note that my pipeline follows the STACKS protocol, with additional filters from dDocent that reduced the dataset from roughly 160,000 to 16,000 variant sites.
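For reference, here is a minimal sketch of one extra check that could be run on the filtered sites: comparing per-locus missingness between the two batches, since loci that drop out mostly in one batch are a common source of batch-driven clustering. This is not part of STACKS or dDocent. The function names, the input layout (one list of genotype calls per sample, with `None` for a missing call), and the `max_diff` threshold of 0.2 are all assumptions made for illustration.

```python
def missing_rate_by_batch(genotypes, batches):
    """Return {batch: [missing rate per locus]}.

    genotypes: list of per-sample lists, one call per locus (None = missing).
    batches:   list of batch labels, one per sample, in the same order.
    """
    n_loci = len(genotypes[0])
    missing = {}  # batch -> count of missing calls per locus
    totals = {}   # batch -> number of samples
    for calls, batch in zip(genotypes, batches):
        counts = missing.setdefault(batch, [0] * n_loci)
        totals[batch] = totals.get(batch, 0) + 1
        for i, call in enumerate(calls):
            if call is None:
                counts[i] += 1
    return {b: [m / totals[b] for m in missing[b]] for b in totals}


def flag_batch_biased_loci(genotypes, batches, batch_a, batch_b, max_diff=0.2):
    """Return indices of loci whose missing rate differs between the two
    batches by more than max_diff (a hypothetical, arbitrary threshold)."""
    rates = missing_rate_by_batch(genotypes, batches)
    return [
        i
        for i, (a, b) in enumerate(zip(rates[batch_a], rates[batch_b]))
        if abs(a - b) > max_diff
    ]


# Toy example: 4 samples, 3 loci, two batches of 2 samples each.
toy_genotypes = [
    [0, 1, None],
    [1, 1, None],
    [0, None, 2],
    [2, 0, 1],
]
toy_batches = ["batch1", "batch1", "batch2", "batch2"]
flagged = flag_batch_biased_loci(toy_genotypes, toy_batches, "batch1", "batch2")
print(flagged)
```

If the PCA pattern holds after dropping the flagged loci, that would be one more piece of evidence that the clustering is biological rather than a batch artifact.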
So my question is whether I should still be concerned about batch effects, given the "investigation" I've conducted thus far.
Thank you all in advance,
Andre