Suppose that you obtained a genome assembly and now you want to calculate the coverage for each contig. Also, suppose that you have both paired and single end reads. I checked a lot of tools (coverageBed for instance) and they all treat paired end reads as single end, I mean, each mate count as a fragment (therefore if the two map on the same contig, they are counted as two fragments). Now, this can be appropriate if you only have paired end reads and if you only focus on pairs that are mapped properly (i.e. both mates map on the same contig). In such a situation you will count twice the coverage for all contigs, and once you normalize by the number of total reads, you will get the right numbers. However, in most genome assemblies you'll have a certain amount of mates mapping on different contigs. The more the repeats that you are not able to resolve with your short reads, the more fragmented the assembly will be, and the regions surrounding these repeats in the genome will likely end in different contigs. It follows that focusing only on properly mapped mates will introduce a bias in coverage calculation, as less reads will map properly in repeats and their surrounding regions.

Moreover, if for instance your insert was shorter than 2X the length of the reads such that the mates are often overlapped, and you used some tool like flash to merge the overlapping mates in single end reads...than you have a problem. Similarly if you also have single end reads (rare today, I know - while overlapping reads are rather common). In this case, single end (or "flashed") reads will count one, while mates will count two, while basically both represent a single DNA fragment. How you deal with this? one way would be to calculate the coverage separately for single-end and paired end, and then put the value together after dividing by 2 the counts for paired reads, but this requires focusing only on pairs both mapping on the same contig but in doing so, pairs mapping on different contigs will contribute only half count, is there any tool dealing with this correctly?

More Matteo Brilli's questions See All
Similar questions and discussions