Is there a tool for annotation-assisted multiple sequence alignment?

04 April 2013

At the HIV Sequence and Immunology Databases where I work, we have used a bit of creativity to solve some difficult problems in multiple sequence alignment. Often we want to produce an alignment of gene sequences from more than 20,000 different isolates of HIV-1 in less than a few minutes. We are very good at "deep" multiple alignment: thousands of copies of the same small genome.

My problem comes when I want to align the genomes of other viruses, or gene regions of similar size (for example, the complete mitochondrial genomes of vertebrates, which are roughly 16 to 17 kb in size): they don't always have the same gene order. A good example is the mitochondrial genomes of birds and mammals, which are mostly co-linear, but with the NADH6 gene moved to a different location (see attached mitochondrial genome maps).

In other cases (I think it is the primate mitochondrial genomes), the authors each used a different site as "base #1" of the circular genome. So, although the primate mitochondrial genomes are 100% co-linear with those of other vertebrates, we have to chop several thousand bases off the right end and paste them onto the left end (5' end, beginning) to make them align with the mt-genomes of other mammals.
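The "chop and paste" step described above is a rotation of a circular sequence, and it can be sketched in a few lines. This is only a minimal illustration: the function names `rotate_circular` and `rotate_to_anchor` are made up for this example, and the idea of locating the new "base #1" by searching for a short anchor string is an assumption, not something any existing tool is known to do this way.

```python
def rotate_circular(seq: str, new_start: int) -> str:
    """Rotate a circular sequence so that 0-based index new_start becomes position 0."""
    if not seq:
        return seq
    k = new_start % len(seq)
    # Bases before the new start are moved from the left end to the right end.
    return seq[k:] + seq[:k]


def rotate_to_anchor(seq: str, anchor: str) -> str:
    """Rotate seq so that it begins at the first occurrence of anchor.

    The search wraps around the origin, because on a circular genome the
    anchor may span the arbitrary cut point chosen by the submitter.
    """
    if not anchor:
        raise ValueError("anchor must not be empty")
    # Append the first len(anchor)-1 bases so matches across the junction are found.
    wrapped = seq + seq[: len(anchor) - 1]
    idx = wrapped.find(anchor)
    if idx == -1:
        raise ValueError("anchor not found in sequence")
    return rotate_circular(seq, idx)
```

For example, `rotate_to_anchor(genome, first_bases_of_reference)` would bring a differently numbered genome to the same origin as a chosen reference, provided the anchor occurs exactly (hypothetical variable names; a real pipeline would need an approximate search for divergent sequences).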

So, it seems to me that there ought to be a multiple sequence alignment tool that can read GenBank files with their annotation and use the annotation to help with the alignment process. One tool that I am aware of, which can help a lot, is the "Artemis Genome Comparison Tool" (ACT) and its associated DOUBLE-ACT server. The DOUBLE-ACT server uses BLAST to find regions on a pair of genomes which are homologous/similar and creates a table of these matched regions. The Artemis Comparison Tool then loads both genomes into an ARTEMIS Genome Browser tool and uses the BLAST hit table to help the browser get both genomes "in synch" with each other as you browse the genomes. Although the DOUBLE-ACT BLAST step here is not dependent on annotations at all, the annotations are visible when browsing the genomes in ACT.
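A table of matched regions like this can also be checked by a program before anyone opens a browser. The sketch below assumes the hits are in the standard 12-column BLAST tabular layout (query start and end in columns 7 and 8, subject start and end in columns 9 and 10); whether a given DOUBLE-ACT output matches that layout exactly is an assumption to verify. The names `Hit`, `parse_blast_tabular` and `is_colinear` are invented for this example.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    qstart: int
    qend: int
    sstart: int
    send: int


def parse_blast_tabular(text: str) -> List[Hit]:
    """Read coordinates from tab-separated BLAST tabular output (12 standard columns)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(int(f[6]), int(f[7]), int(f[8]), int(f[9])))
    return hits


def is_colinear(hits: List[Hit]) -> bool:
    """True if hits ordered along the query are also in increasing order on the subject.

    Strand is ignored (min of sstart/send is used), so inversions are not
    detected here; only changes in the order of matched blocks are.
    """
    ordered = sorted(hits, key=lambda h: h.qstart)
    starts = [min(h.sstart, h.send) for h in ordered]
    return all(a < b for a, b in zip(starts, starts[1:]))
```

A pair of genomes for which `is_colinear` returns `False` is a candidate for having a moved block (such as a relocated gene) or a different "base #1", and could be set aside for closer inspection.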

I am quite sure that I am not the only one in the world who needs this type of tool. I am increasingly seeing large multiple sequence alignments being done for classification of organisms, where the authors could have used such a tool.

Please let me know if you have any ideas about where to look for such a tool, or which groups of bioinformatics workers might be able to develop one.

Blake Stamps

This might help: http://gel.ahabs.wisc.edu/mauve/ . It allows multiple alignment of samples and shows rearrangements of similar regions, with annotations present. http://gel.ahabs.wisc.edu/barphlye/ would then in turn take this alignment and produce a Bayesian tree of the samples, as well as an animation showing the progression of changes among them.

Vladimir Leontiev

As a biologist I can see how such a tool may be useful. As a programmer I foresee an obstacle in developing such a tool: one will have to match sequence to sequence (easy) and also feature to feature. The latter may be tricky because biologists use different names to describe the same feature. If you need to match two regions both labeled, say, "NADH6", no problem. But most likely you will get "NANH-6", "Nadh6", "ND-6" and many, many other names for the same gene. So if you are willing to manually label features, I am sure someone may be willing to develop a tool for you. Otherwise - sorry...
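To make the naming problem concrete, here is a minimal sketch of how labels might be normalized before matching features. The synonym table is purely illustrative and covers only the spellings mentioned above; a usable table would have to be curated by hand, which is exactly the manual labeling effort described. The names `SYNONYMS` and `normalize_label` are invented for this example.

```python
import re

# Hypothetical, hand-curated table: cleaned label -> canonical gene name.
SYNONYMS = {
    "NADH6": "ND6",
    "NANH6": "ND6",  # a misspelling, included because such labels do occur
    "ND6": "ND6",
}


def normalize_label(label: str) -> str:
    """Strip punctuation and case, then map known synonyms to one canonical name."""
    key = re.sub(r"[^A-Za-z0-9]", "", label).upper()
    # Unknown labels fall through unchanged (apart from the cleanup),
    # so they can be listed for a human to review.
    return SYNONYMS.get(key, key)
```

With this table, "NANH-6", "Nadh6" and "ND-6" all map to "ND6", while a label the table does not know is returned in cleaned form for a person to review.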

Brian Thomas Foley

Yes, the annotation of genomes is a weak link for sure, which is why I mentioned the BLAST-based DOUBLE-ACT tool. But the annotations can be of help to humans if not so much to programs, so I would be interested in retaining an ability to "see" which genes were rearranged during the evolution of the genomes. A human can usually spot things like ND6 = NADH6.

I will have a look at MAUVE and the http://gel.ahabs.wisc.edu/barphlye/ tools.

THANKS!!!

Vladimir Leontiev

I was thinking about the task of aligning thousands of sequences where genes are shuffled around. Perhaps one can first build a dot matrix for each pair of sequences (HIV sequences are not so long, so perhaps it will all fit in RAM?). One can see the alignment on a dot matrix even if the genes are in a different order. From there you can proceed with your ultra-fast alignment tool for each gene separately. This way you wouldn't need to mess around with annotations. Just a thought....
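A rough sketch of the dot-matrix idea follows. Instead of filling a full matrix, it records only the "dots" where two sequences share an exact k-mer, which is a substitution made here to keep memory use small, not part of the suggestion itself. Runs of dots on the same diagonal (same offset j - i) correspond to matching blocks; more than one well-populated diagonal suggests that blocks occur in a different order. The function names and the default `k` are arbitrary choices for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def kmer_dots(a: str, b: str, k: int = 6) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where a[i:i+k] == b[j:j+k], i.e. the dots of a dot plot."""
    index: Dict[str, List[int]] = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    dots = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], ()):
            dots.append((i, j))
    return dots


def diagonal_counts(dots: List[Tuple[int, int]]) -> Counter:
    """Count dots per diagonal; the key is the offset j - i."""
    return Counter(j - i for i, j in dots)


# Toy example: two 8-base blocks that appear in swapped order in b.
x, y = "ACGTTGCA", "GATCCTAG"
dots = kmer_dots(x + y, y + x, k=6)
print(diagonal_counts(dots))
```

In the toy example the dots fall on two diagonals (offsets 8 and -8), one per block, which is the signature of two blocks that have traded places. Exact k-mer matching is only a stand-in for real similarity search and would miss diverged regions.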

Brian Thomas Foley

Thanks, Vladimir. This is essentially what DOUBLE-ACT does already. And it does so in an automated way, with computer-readable output. My thought on the annotation part of the project would be mainly to check what the annotators had noted about such things as gene duplications and pseudogenes.
