I do quite a lot of analysis of NGS data to assess viral evolution in clinical samples. Up to now, I have used a pre-existing pipeline to process the data, but I want to modify it to improve the output.

Briefly, the pipeline maps the raw reads to a reference sequence using bwa, then sorts and indexes the alignments and produces a VCF file using samtools/bcftools. Finally, it uses the VCF to produce a consensus sequence using vcf2consensus.pl.

This works fine for a lot of the samples, but in some samples, particularly those with poorer quality data, the consensus file contains lots of degenerate nucleotides that then need to be processed manually. Also, in these samples, if there are no reads covering a region of the reference, the pipeline simply inserts the reference sequence into the consensus for that region.

I think the issue is mostly around using vcf2consensus.pl, and looking on the Internet there are a lot of different tools for variant and consensus calling. Does anyone know of a tool that would allow me to do both of the following?

1. Define when to call a degenerate nucleotide in the consensus, based on the proportions of the different nucleotides at that position.
2. Return dashes for positions where no reads are mapped to the reference, rather than inserting the reference sequence.
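To make the behaviour I am after concrete, here is a minimal Python sketch of the calling rule. It assumes the per-position base counts have already been extracted from the alignment (for example from a pileup); the function names `call_base` and `consensus` and the parameters `min_depth` and `min_freq` are hypothetical and only for illustration, not taken from any existing tool.

```python
# IUPAC ambiguity codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def call_base(counts, min_depth=1, min_freq=0.2):
    """Call one consensus character from a dict of base -> read count.

    Returns '-' when coverage is zero or below min_depth (hypothetical
    threshold), otherwise the IUPAC code for every base whose proportion
    of the depth is at least min_freq (also hypothetical).
    """
    depth = sum(counts.values())
    if depth == 0 or depth < min_depth:
        return "-"
    kept = frozenset(b for b, n in counts.items() if n / depth >= min_freq)
    # Fall back to 'N' if no base passes the threshold or the set is unknown.
    return IUPAC.get(kept, "N")


def consensus(per_position_counts, min_depth=1, min_freq=0.2):
    """Build a consensus string from a list of per-position count dicts."""
    return "".join(
        call_base(c, min_depth, min_freq) for c in per_position_counts
    )


if __name__ == "__main__":
    example = [
        {"A": 95, "G": 5},            # minor G is below 20%, so plain A
        {"A": 60, "G": 40},           # both bases pass the threshold, so R
        {},                           # no reads, so a dash
        {"C": 10, "G": 10, "T": 10},  # three-way mix, so B
    ]
    print(consensus(example))
```

With these thresholds, a minor variant at 5% is ignored, a 60/40 mix is reported as a degenerate base, and an uncovered position becomes a dash instead of the reference base.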
