Dear all,

I am quite new to the analysis of RADseq data. I will try to explain the question I have, but if I forget essential information, please let me know.

I am working on paired-end ddRADseq data that is reference-aligned. The RADtags are about 250 bp long.

All the QC reports look very good, so I decided not to trim the data.

After variant calling, the data were filtered for minDP (5) and minGQ (20; pretty standard values, as far as I can judge). Next, sites were filtered for max-missingness (0.5), and sites falling in repeat regions were removed. Filters for bi-allelic SNPs, minor allele count (3), allele balance, and max-meanDP were applied, and an iterative filtering for missingness in genotypes and individuals was done. All this essentially follows a recent paper by O'Leary et al. (10.1111/mec.14792).
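For context, the site-level part of these steps can be sketched as a single vcftools call. This is only an illustration of the filters listed above, not the exact pipeline I ran: the file names, the repeats BED, and the max-meanDP cutoff are placeholders (the cutoff should come from your own depth distribution), and the allele-balance and iterative missingness steps are not vcftools built-ins.

```shell
#!/bin/sh
# Sketch of the site-level filters described above, using vcftools.
# variants.vcf, repeats.bed, and the --max-meanDP value are placeholders.
vcftools --vcf variants.vcf \
  --minDP 5 \
  --minGQ 20 \
  --max-missing 0.5 \
  --exclude-bed repeats.bed \
  --min-alleles 2 --max-alleles 2 \
  --mac 3 \
  --max-meanDP 100 \
  --recode --recode-INFO-all \
  --out filtered
# Allele balance and the iterative genotype/individual missingness
# filtering are done separately (e.g. with vcflib's vcffilter and
# repeated vcftools passes), as in the O'Leary et al. pipeline.
```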

After all this filtering I still have a substantial number of loci with >10 variant sites, even with the most stringent filtering. But 10 is just an arbitrary number that came up in a discussion with a colleague.

Has anyone had the same issue? How did you deal with it, and what was the rationale behind it?

Is there any reasonable biological number of SNPs per locus (with a locus size of 250 bp) that can/should be used as a threshold? Preferably a published one...

For now I am just excluding loci with >10 SNPs, but I would like to have a good reasoning for any threshold applied.
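In case it is useful, this is roughly how I compute the per-locus counts and apply the cutoff, directly from the VCF's CHROM/POS columns. A minimal sketch: the 250 bp fixed-window approximation and the function names are my own assumptions (with reference-aligned ddRAD data you could instead group by the actual RADtag intervals).

```python
from collections import Counter

def snps_per_locus(sites, locus_size=250):
    """Count variant sites per RAD locus.

    `sites` is an iterable of (chrom, pos) tuples, e.g. parsed from the
    CHROM and POS columns of a VCF.  Loci are approximated here as
    non-overlapping windows of `locus_size` bp on each chromosome.
    """
    counts = Counter()
    for chrom, pos in sites:
        # 1-based VCF positions -> 0-based window index
        counts[(chrom, (pos - 1) // locus_size)] += 1
    return counts

def loci_to_exclude(sites, max_snps=10, locus_size=250):
    """Return the set of loci carrying more than `max_snps` variants."""
    counts = snps_per_locus(sites, locus_size)
    return {locus for locus, n in counts.items() if n > max_snps}

# Toy example: 12 SNPs packed into the first 250 bp window of chr1,
# one SNP in the next window, one SNP on chr2.
sites = [("chr1", p) for p in range(1, 13)] + [("chr1", 300), ("chr2", 5)]
print(snps_per_locus(sites)[("chr1", 0)])  # 12
print(loci_to_exclude(sites))              # {('chr1', 0)}
```

With the cutoff at 10, only the overloaded first window of chr1 is flagged for exclusion; the singleton windows pass.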

Maybe just to add some info: I am not working on human data. The reference genome is not bad for a non-model species, but many things are still unknown.

Any hints and ideas are highly appreciated!

Best wishes, Ann-Christin
