How do you remove mate reads mapped to different "chromosomes"?

04 April 2016

I have created a de novo reference "genome" of RAD contigs and have mapped my reads back to it using BWA. I then queried my read alignments with samtools flagstat. An example:

3232117 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

3226784 + 0 mapped (99.84% : N/A)

3232117 + 0 paired in sequencing

1595145 + 0 read1

1636972 + 0 read2

2662532 + 0 properly paired (82.38% : N/A)

2871947 + 0 with itself and mate mapped

354837 + 0 singletons (10.98% : N/A)

213393 + 0 with mate mapped to a different chr

213393 + 0 with mate mapped to a different chr (mapQ>=5)

Two things are obvious from the alignment: 1) singletons must arise because a mate fails the quality check during the mapping procedure, and 2) in some cases mates map to different "chromosomes" (RAD contigs).
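As a quick sanity check on the numbers above, the short Python sketch below reproduces the arithmetic. It relies on flagstat counting a "singleton" as a mapped read whose mate is unmapped, so mapped reads should equal reads with both mates mapped plus singletons. The variable names are only illustrative; the counts are copied from the output above.

```python
# Counts copied from the flagstat output above.
total = 3232117
mapped = 3226784
both_mapped = 2871947   # "with itself and mate mapped"
singletons = 354837     # read mapped, mate unmapped
diff_chr = 213393       # "with mate mapped to a different chr"

# Every mapped read either has a mapped mate or is a singleton.
assert both_mapped + singletons == mapped

# Share of reads whose mate landed on a different contig.
frac_of_total = diff_chr / total
frac_of_pairs = diff_chr / both_mapped

print(f"split mates: {frac_of_total:.2%} of all reads")
print(f"split mates: {frac_of_pairs:.2%} of reads with both mates mapped")
```

So the split mates amount to a few percent of the data, whichever denominator is used.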

So the questions.

Firstly, I feel like eliminating singletons would essentially be throwing out information. Should these be kept or is there too much uncertainty in their origin to be reliable because their mate hasn't mapped?

Secondly (and more importantly), how do I remove reads whose mates have mapped to different chromosomes? I have tried samtools view -f 3 (read paired + read mapped in proper pair); this reduces the number of split mates, but some are still present, and it also removes singletons. From what I gather, there doesn't seem to be a specific flag combination for this (https://broadinstitute.github.io/picard/explain-flags.html). NOTE: I have also tried -f 11 (read paired + read mapped in proper pair + mate unmapped), but this removes all reads (I think because only properly paired reads are allowed to make it through, and these cannot have an unmapped mate).
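Since the flag bits alone do not say which contig the mate landed on, one possible workaround is to compare the RNAME and RNEXT columns of each SAM record directly. The Python sketch below is only an illustration under that assumption: it reads SAM text lines (for example the output of samtools view -h) and drops records where both mates are mapped but to different contigs. The function names and the toy records are made up for this example.

```python
# SAM flag bits used here (from the SAM specification).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def mate_on_other_contig(fields):
    """True if read and mate are both mapped, but to different contigs."""
    flag = int(fields[1])
    if not flag & PAIRED or flag & (UNMAPPED | MATE_UNMAPPED):
        return False
    rname, rnext = fields[2], fields[6]
    # RNEXT is "=" when the mate is on the same reference as the read.
    return rnext not in ("=", rname)


def drop_split_pairs(sam_lines):
    """Yield header lines and every record that is not a split pair."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if not mate_on_other_contig(fields):
            yield line


# Toy records (hypothetical contig and read names), not real data.
demo = [
    "@SQ\tSN:ctgA\tLN:500\n",
    "r1\t99\tctgA\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n",   # proper pair
    "r2\t65\tctgA\t100\t60\t4M\tctgB\t50\t0\tACGT\tIIII\n",   # mate on ctgB
    "r3\t73\tctgA\t100\t60\t4M\t=\t100\t0\tACGT\tIIII\n",     # singleton
]
kept = list(drop_split_pairs(demo))
print(len(kept), "of", len(demo), "lines kept")
```

Singletons pass through this filter untouched, so the two questions can be handled separately. To use it on real data, the same generator could be fed from sys.stdin in a pipe such as samtools view -h in.bam | python filter.py | samtools view -b -o out.bam - (file names here are placeholders).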

A follow-on question: should such pairs be removed at all, given that they might be indicative of repetitive sequences?

Thoughts and comments?

David Roy Nelson

These pairs might help you identify mis-assemblies [or even sites of recent transposon mobilization], so there is no reason to remove them.

Singletons should be retained as single reads only.

It might help if you specify 'why' you want to remove the reads.

Joshua A. Thia

@David,

Thanks for the answer. :)

Do you have any tips as to how to treat the singletons as unpaired reads in analysis that also includes paired reads in the same sample?
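One way to think about this, sketched below in Python, is to split the mapped records into two sets by flag bits: reads whose mate is also mapped, and singletons. Each set can then be passed to whatever downstream step expects paired or unpaired input. This is only a sketch of the idea, not an established pipeline; the function name and the toy records are hypothetical. With samtools itself, view -f 8 -F 4 should select the same singleton set (mate unmapped, read mapped).

```python
# SAM flag bits used here (from the SAM specification).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def split_by_mate_status(sam_lines):
    """Return (paired, single) lists of mapped SAM records.

    paired: read and mate both mapped.
    single: read mapped, mate unmapped (or read not paired at all).
    Header lines and unmapped reads are skipped.
    """
    paired, single = [], []
    for line in sam_lines:
        if line.startswith("@"):
            continue
        flag = int(line.split("\t", 2)[1])
        if flag & UNMAPPED:
            continue
        if flag & PAIRED and not flag & MATE_UNMAPPED:
            paired.append(line)
        else:
            single.append(line)
    return paired, single


# Toy records (hypothetical names), not real data.
demo = [
    "@SQ\tSN:ctgA\tLN:500\n",
    "r1\t99\tctgA\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n",    # both mapped
    "r1\t147\tctgA\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII\n",  # both mapped
    "r3\t73\tctgA\t100\t60\t4M\t=\t100\t0\tACGT\tIIII\n",      # singleton
    "r3\t133\tctgA\t100\t0\t*\t=\t100\t0\tACGT\tIIII\n",       # unmapped mate
]
pairs, singles = split_by_mate_status(demo)
print(len(pairs), "paired records;", len(singles), "singleton records")
```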

I guess reads whose mates map to different chromosomes make me nervous because they don't have an undisputed origin. Also, because they are relatively low in frequency, I reasoned that I wouldn't lose too much data by throwing them out.

Still learning about "best practises" for genomic data, so any insights are really useful.

First of all, BWA is antiquated. Use this software, it is literally a step ahead of the rest: https://ccb.jhu.edu/software/hisat2/index.shtml

But to answer your question, most read alignment software has this option (to map reads as singletons). For BWA, apparently you have to use something like this: bwa samse

Again, I would recommend using HISAT2. See the help menu for options to do all the things you asked about and more.

@David.

Thanks for the suggestion. I will definitely have a play with this program.
