I wish to create a consensus sequence for viruses at a high taxonomic level (family).
I have several thousand sequences of variable length (300-20,000 nt), representing partial or whole genome sequences of viruses. The viruses all belong to the same taxonomic family, but to different genera and species, which means they share some similarity but also show quite a lot of diversity.
I have different numbers of sequences for each species, so I cannot just throw all the sequences into the same alignment, because that would bias the consensus sequence to over-represent the species with the highest representation in the alignment. So I am looking for strategies to curate the sequences before the final alignment to make sure that the alignment best represents the diversity in the family.
I am considering creating separate alignments for each species. Then I might align the species-level consensus sequences to create genus-level consensus sequences. Perhaps I will even be able to align the genus-level consensus sequences into a family-level consensus.
But I worry that the species and genus consensus sequences will lose the information on the original ratios of nucleotides at ambiguous positions, which would mean that the family-level consensus would not contain any true information on these ratios either.
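To make concrete what I mean by retaining the ratios, here is a minimal sketch in plain Python. Instead of collapsing each species alignment to a consensus string, it keeps a per-column nucleotide frequency profile and then averages the species profiles with equal weight. The function names and toy sequences are made up for illustration, and the sketch assumes that all alignments already share the same column coordinates (for example, because everything was aligned to a common reference), which may well not hold for my real data.

```python
from collections import Counter

BASES = "ACGT"

def column_profile(aligned_seqs):
    """Per-column base frequencies for equal-length aligned sequences.

    Gaps and non-ACGT symbols are ignored. A column with no ACGT at all
    gets all-zero frequencies (an assumption of this sketch).
    """
    length = len(aligned_seqs[0])
    profile = []
    for i in range(length):
        counts = Counter(
            s[i].upper() for s in aligned_seqs if s[i].upper() in BASES
        )
        total = sum(counts.values())
        profile.append({b: (counts[b] / total if total else 0.0) for b in BASES})
    return profile

def average_profiles(profiles):
    """Average profiles column by column, giving each profile equal weight."""
    n = len(profiles)
    return [
        {b: sum(p[i][b] for p in profiles) / n for b in BASES}
        for i in range(len(profiles[0]))
    ]

# Hypothetical toy data: species A has four sequences, species B only one.
species_a = ["ACGT", "ACGT", "ACGA", "ACGT"]
species_b = ["ATGT"]

# Each species contributes equally, regardless of how many sequences it has.
genus = average_profiles([column_profile(species_a), column_profile(species_b)])
print(genus[1])  # second column: C and T each at 0.5
```

With this kind of representation, species B counts as much as species A at the genus level even though it has only one sequence, and the same averaging could in principle be repeated from genus to family. What I do not know is how to get the profiles into a shared coordinate system in the first place.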
So my question is:
Is there any way to align multiple consensus nucleotide sequences while retaining the information on the correct ratios of ambiguous nucleotides?
Thanks in advance.