I just realized that when giving the --library-type option to tophat, the resulting accepted_hits.bam still contains reads mapping in both orientation wrt the transcripts provided in the gff file. I am not mapping on the transcripts, but on the genome, and I provide the gene coordinates in a gff file. I was pretty convinced that the accepted_hits.bam file would contain ONLY the reads mapping in the right orientation wrt the transcript but this is actually not the case. you can simply check this by running TopHat with strand-specific RNA-seq paired end reads using firststrand and secondstrand library type, and you will get exactly the same output. SO, if I use the bam file to count the mapping reads with some tool like bedtools coverage, I will consider reads mapping in both orientation, therefore neglecting the strand specific information. The only thing that TopHat is doing when you specify the library type is adding a tag XS:A that can take + or - value, which refer to the orientation of the transcript that is overlapped to the read. So, it seems to me the only way to get the right counts is by splitting the bam file in first and second mate files. for a firststrand library (like e.g. the ones obtained with TruSeq) I expect that the second mate is in the same orientation as the transcript, while the first mate is in the reverse orientation. Therefore, for the first mates, I get only the reads that are in the opposite direction with respect to the transcript, while for the second mates, I get those that map in the same orientation wrt the transcript. Then I have to get the coverage for the two files separately and then I put the counts together (see below). Is it possible that no tool does this procedure in automatic? Am I missing something?

example. paired reads are in R1.fastq and R2.fastq

#mapping

tophat -G genome.gff -o R1R2_map --no-mixed --no-discordant --library-type fr-firststrand genome.bt2 R1.fastq R2.fastq

#this gives accepted_hits.bam

#now I separate first and second mate

#get first mate mapped on plus

samtools view -f 64 -F 16 accepted_hits.bam > accepted_hits_FIRST_PLUS.sam

#get first mate mapped on rev

samtools view -f 64 -f 16 accepted_hits.bam > accepted_hits_FIRST_REV.sam

#get second mate on plus

samtools view -f 128 -F 16 accepted_hits.bam > accepted_hits_SECOND_PLUS.sam

#get second mate on plus

samtools view -f 128 -f 16 accepted_hits.bam > accepted_hits_SECOND_REV.sam

#now I grep to take only reads mapping in the right orientation wrt the transcript, whose orientation is encoded in the XS:A flag and knowing that the first mate is rev wrt the transcript, while the second one is in the same orientation

grep "XS:A:-" accepted_hits_FIRST_PLUS.sam > accepted_hits_FIRST_OPPOSITE.sam

grep "XS:A:+" accepted_hits_FIRST_REV.sam >> accepted_hits_FIRST_OPPOSITE.sam

grep "XS:A:+" accepted_hits_SECOND_PLUS.sam > accepted_hits_SECOND_SAME.sam

grep "XS:A:-" accepted_hits_SECOND_REV.sam >> accepted_hits_SECOND_SAME.sam

#get the bams for bedtools coverage

samtools view -b -o accepted_hits_FIRST_OPPOSITE.bam accepted_hits_FIRST_OPPOSITE.sam

samtools view -b -o accepted_hits_SECOND_SAME.bam accepted_hits_SECOND_SAME.sam

#now I calculate the coverage for the two resulting files separately

bedtools coverage -a genome.gff -b accepted_hits_FIRST_OPPOSITE.bam > accepted_hits_FIRST_OPPOSITE_cov.bed

bedtools coverage -a genome.gff -b accepted_hits_SECOND_SAME.bam > accepted_hits_SECOND_SAME_cov.bed

#and at the end, we need to put together the counts

#please consider my question

https://www.researchgate.net/post/coverage_for_a_genome_assembly

which also holds for this case. Indeed, I am selecting only pairs where both mates map but this concerns the mapping on the genome, and it may be that some pair has one of the two mates outside the coding region (e.g. in UTRs) such that there will still be some bias when calculating the coverage...

if anyone knows a better way to do this...please write it here!

Similar questions and discussions