I have created a denovo reference "genome" of RAD contigs and have mapped back my reads using BWA. Using the the samtools flagstat option I queried my read alignments. An example:

3232117 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

3226784 + 0 mapped (99.84% : N/A)

3232117 + 0 paired in sequencing

1595145 + 0 read1

1636972 + 0 read2

2662532 + 0 properly paired (82.38% : N/A)

2871947 + 0 with itself and mate mapped

354837 + 0 singletons (10.98% : N/A)

213393 + 0 with mate mapped to a different chr

213393 + 0 with mate mapped to a different chr (mapQ>=5)

Two things are obvious from the alignment: 1) singletons must arise because a mate fails the quality check during the mapping procedure, and 2) in some cases mates map to different "chromosomes" (RAD contigs).

So the questions.

Firstly, I feel like eliminating singletons would essentially be throwing out information. Should these be kept or is there too much uncertainty in their origin to be reliable because their mate hasn't mapped?

Secondly (and more importantly), how do I remove reads in which mate pairs have mapped to different chromosomes? I have played with the samtools view -f 3 option (read paired + read mapped in proper pair) and this reduces the split mates but they are still present; it also removes singletons. From what I gather, there doesn't seem to be a specific option combination to do this (https://broadinstitute.github.io/picard/explain-flags.html). NOTE: I have also tried using -f 11 (read paired + read mapped in proper pair + mate unmapped), but this removes all reads (I think because only reads that are properly paired are allowed make it through, and thus cannot have an unmapped mate).

A follow on question is should such pairs be removed, given that they might be indicative of repetitive sequences?

Thoughts and comments?

More Joshua A. Thia's questions See All
Similar questions and discussions