Best way to get strand specific gene expression information using TopHat

01 January 1970 2 9K Report

I just realized that when giving the --library-type option to tophat, the resulting accepted_hits.bam still contains reads mapping in both orientation wrt the transcripts provided in the gff file. I am not mapping on the transcripts, but on the genome, and I provide the gene coordinates in a gff file. I was pretty convinced that the accepted_hits.bam file would contain ONLY the reads mapping in the right orientation wrt the transcript but this is actually not the case. you can simply check this by running TopHat with strand-specific RNA-seq paired end reads using firststrand and secondstrand library type, and you will get exactly the same output. SO, if I use the bam file to count the mapping reads with some tool like bedtools coverage, I will consider reads mapping in both orientation, therefore neglecting the strand specific information. The only thing that TopHat is doing when you specify the library type is adding a tag XS:A that can take + or - value, which refer to the orientation of the transcript that is overlapped to the read. So, it seems to me the only way to get the right counts is by splitting the bam file in first and second mate files. for a firststrand library (like e.g. the ones obtained with TruSeq) I expect that the second mate is in the same orientation as the transcript, while the first mate is in the reverse orientation. Therefore, for the first mates, I get only the reads that are in the opposite direction with respect to the transcript, while for the second mates, I get those that map in the same orientation wrt the transcript. Then I have to get the coverage for the two files separately and then I put the counts together (see below). Is it possible that no tool does this procedure in automatic? Am I missing something?

example. paired reads are in R1.fastq and R2.fastq

#mapping

tophat -G genome.gff -o R1R2_map --no-mixed --no-discordant --library-type fr-firststrand genome.bt2 R1.fastq R2.fastq

#this gives accepted_hits.bam

#now I separate first and second mate

#get first mate mapped on plus

samtools view -f 64 -F 16 accepted_hits.bam > accepted_hits_FIRST_PLUS.sam

#get first mate mapped on rev

samtools view -f 64 -f 16 accepted_hits.bam > accepted_hits_FIRST_REV.sam

#get second mate on plus

samtools view -f 128 -F 16 accepted_hits.bam > accepted_hits_SECOND_PLUS.sam

#get second mate on plus

samtools view -f 128 -f 16 accepted_hits.bam > accepted_hits_SECOND_REV.sam

#now I grep to take only reads mapping in the right orientation wrt the transcript, whose orientation is encoded in the XS:A flag and knowing that the first mate is rev wrt the transcript, while the second one is in the same orientation

grep "XS:A:-" accepted_hits_FIRST_PLUS.sam > accepted_hits_FIRST_OPPOSITE.sam

grep "XS:A:+" accepted_hits_FIRST_REV.sam >> accepted_hits_FIRST_OPPOSITE.sam

grep "XS:A:+" accepted_hits_SECOND_PLUS.sam > accepted_hits_SECOND_SAME.sam

grep "XS:A:-" accepted_hits_SECOND_REV.sam >> accepted_hits_SECOND_SAME.sam

#get the bams for bedtools coverage

samtools view -b -o accepted_hits_FIRST_OPPOSITE.bam accepted_hits_FIRST_OPPOSITE.sam

samtools view -b -o accepted_hits_SECOND_SAME.bam accepted_hits_SECOND_SAME.sam

#now I calculate the coverage for the two resulting files separately

bedtools coverage -a genome.gff -b accepted_hits_FIRST_OPPOSITE.bam > accepted_hits_FIRST_OPPOSITE_cov.bed

bedtools coverage -a genome.gff -b accepted_hits_SECOND_SAME.bam > accepted_hits_SECOND_SAME_cov.bed

#and at the end, we need to put together the counts

#please consider my question

https://www.researchgate.net/post/coverage_for_a_genome_assembly

which also holds for this case. Indeed, I am selecting only pairs where both mates map but this concerns the mapping on the genome, and it may be that some pair has one of the two mates outside the coding region (e.g. in UTRs) such that there will still be some bias when calculating the coverage...

if anyone knows a better way to do this...please write it here!

Luca Lenzi

Ciao Matteo,

The following are a couple of tools you may find useful for your task:

HTseqCount

https://htseq.readthedocs.io/en/release_0.10.0/

FeatureCounts within the subReads package:

http://subread.sourceforge.net

Both have an option to select and count only the alignments in your desired orientations.

Luca

Matteo Brilli

Thank you a lot Luca!

How do you refer to a specific researchgate user while writing an answer?

When performing GSEA, is it better to take GO annotations as they are or propagate them to parents?

Is it right to calculate the probability of the valid peak using the binomial probability density function in SeqTar?

Are researcher aware that putting in their expertise as Bioinformatics and Computer Biology means that they are able to program in some language?

How do I replace a file with a more recent version of a paper that was uploaded to ResearchGate?

How to solve g_mmpbsa error?

How do you delete a duplicate pdf for the same paper on ResearchGate?

AUX gas reading problem on QE with full MS and PRM method in one run?

JCPDS 65-7246 file please?

Dirty and clean?

Where to find a gene list for CRISPRa/i library screening of regulatory factors that affect pathogenic Th17 differentiation in PBMC?

How to change the version of the article full-text pdf file?

Difficulty with permittivitt and Magnetic Permeability Calculations?

How to do pca analysis of c-alpha atom of the protein?