Hi,

I'm creating a FASTA alignment of concatenated SNPs from a combined VCF file, but I'm having some trouble with the gaps.

I'm using the useful script vcf_tab_to_fasta_alignment.pl (Christina Bergey, 2012, url = "http://code.google.com/p/vcf-tab-to-fasta"). From an input.vcf.gz I obtain all SNPs aligned to the reference in FASTA format (see below). However, the SNP coordinates differ between isolates, so any locus where an isolate has no SNP information is filled with a gap (-).

I get:

>REF ATCCTTGCA

>ID1 -CT-A-CT-

>ID2 CG---CAA-

Is it possible to fill in those gaps using the nucleotide in the reference genome? I imagine that it can be achieved by using a simple script but I'm not a programmer.

What I would like to get:

>REF ATCCTTGCA

>ID1 ACTCATCTA

>ID2 CGCCTCAAA
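
To make the request concrete, the gap filling could be sketched in Python roughly as follows. This is only a minimal sketch under a few assumptions: every sequence has the same length as the reference, the reference record is named REF, and a gap really means "same as reference" rather than missing data or no coverage. The function names fill_gaps and fill_alignment are hypothetical and are not part of vcf_tab_to_fasta_alignment.pl.

```python
def fill_gaps(reference: str, sequence: str, gap: str = "-") -> str:
    """Replace each gap character in `sequence` with the reference base at that position."""
    if len(reference) != len(sequence):
        raise ValueError("sequence and reference must have the same length")
    return "".join(ref_base if base == gap else base
                   for ref_base, base in zip(reference, sequence))


def fill_alignment(records: dict, ref_id: str = "REF") -> dict:
    """Fill the gaps of every record using the record named `ref_id` (assumed name)."""
    reference = records[ref_id]
    return {name: fill_gaps(reference, seq) for name, seq in records.items()}


# The toy alignment from this post
records = {
    "REF": "ATCCTTGCA",
    "ID1": "-CT-A-CT-",
    "ID2": "CG---CAA-",
}

filled = fill_alignment(records)
for name, seq in filled.items():
    # Write each record as a FASTA header line followed by its sequence
    print(f">{name}\n{seq}")
```

For a real alignment file, the dictionary would have to be read from the FASTA output first. Note that the assumption "gap = reference base" may not hold if an isolate simply has no read coverage at that position.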

Any help would be very much appreciated! :D

Many thanks,

A
