Is there a model to estimate genetic distance that uses both point mutations and indels?

06 June 2014

There are numerous models that estimate genetic distance between nucleotide sequences on the basis of point mutations. As far as I understand the standard phylogenetic pipeline, all positions with indels in at least one of the sequences are thrown away.
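To make the "thrown away" step concrete, here is a minimal sketch of complete deletion followed by a plain p-distance. The function names `drop_gap_columns` and `p_distance`, and the toy alignment, are illustrative assumptions and do not come from any particular package.

```python
def drop_gap_columns(alignment: list[str]) -> list[str]:
    """Remove every column that contains a gap ('-') in at least one sequence."""
    if not alignment:
        return []
    if len({len(s) for s in alignment}) > 1:
        raise ValueError("all sequences must be aligned to the same length")
    keep = [i for i in range(len(alignment[0]))
            if all(s[i] != "-" for s in alignment)]
    return ["".join(s[i] for i in keep) for s in alignment]


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two gap-free, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical toy alignment: the alleles differ only in and around indels.
toy = ["ACGT--AC", "ACGTTTAC", "AC-TTTAC"]
stripped = drop_gap_columns(toy)
```

In this toy case all three sequences collapse to the same string after gap columns are removed, so every pairwise p-distance is zero.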

Besides that, there are methods that take indels into account (so-called "indel coding"), but these, as I understand them, effectively ignore point mutations and measure only the number of insertions/deletions that separate the given sequences.

My practical problem is to estimate distances between alleles of a hypervariable gene that are very indel-rich, with some alleles differing in the contents of an indel region. Discarding either kind of information (indels or point mutations) loses too much, leaving too many alleles indistinguishable.

So the question arises: is there a method/model that can effectively "digest" both indels and point mutations?

The central obstacle here appears to be the scoring of a single point mutation vs. a single indel event.

David Enard

Hi Konstantin,

I am not aware of a method that uses indels in addition to substitutions. However, I would recommend against using indels for your purpose, because aligners do a very poor job of inferring them properly. Aligners that do better, such as Nick Goldman's PRANK, rely on a previously known phylogeny to infer indels correctly. So if you already know the tree topology and are only interested in estimating genetic distances at this point, this could be useful.

But maybe it is possible for you to edit your alignment manually?

I recommend that you look at the following link; people who have actually worked on the issue give interesting insights:

http://evol.mcmaster.ca/~brian/evoldir/Answers/GeneticDistance.with.indels.answers

Konstantin K. Avilov

Thank you for your answer, David...

1) Aligners: yes, this is a problem too. I already use PRANK (with, obviously, its default approach of inferring the "previously known phylogeny" from a point-mutation metric). But PRANK does not perform well: it attributes clearly identical insertions of significant length (6-9+ nucleotides) to independent insertion events.

So the alignment that I currently use is PRANK output with substantial manual correction (which may bring the result fairly close to what MUSCLE or ClustalW produce).

1.5) No, I do not know the tree/topology beforehand. Furthermore, since I deal with bacterial genes, there is a high chance that evolution is not tree-like and that there is a lot of horizontal gene transfer.

2) The link: yes, I have already googled that text and read it. It is pretty much about the same thing as my question here: indel coding and the problem of arbitrariness of indel vs. point mutation weighting.

========================

By the way, my question may be transformed into another one:

Is there a model that mechanistically describes the processes of deletion and insertion (with some explanation of where the inserted sequence comes from)?

Brian Thomas Foley

There are too many different mechanisms for insertions and deletions to give one answer that describes them. Many regions of genes can have inverted repeats, which can form stem-loops in the DNA that are prone to deletion. RNA viruses are even more susceptible because their genome exists as RNA with extensive secondary structure. Repetitive DNA such as GAAGAAGAAGAA is very susceptible to "stuttering", which causes variable numbers of the tandem repeat. In the HIV envelope gene we often observe inserts that are identical to short regions of sequence near the insert, as if the insert was copied from nearby, but this is probably due to the mechanism of replication of retroviruses (two genomes packaged, with template switching during reverse transcription) and is less likely to be found in other organisms.
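As a small illustration of the tandem-repeat case, the sketch below counts the longest run of back-to-back copies of a repeat unit, so that two alleles can be compared by copy number rather than column by column. The function name `max_tandem_copies` and the example sequences are hypothetical.

```python
import re


def max_tandem_copies(seq: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` found in `seq`."""
    if not unit:
        raise ValueError("repeat unit must be non-empty")
    # Each match is one uninterrupted run of whole copies of the unit.
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)


# Hypothetical alleles that differ by one copy of the GAA unit.
copies_long = max_tandem_copies("TTGAAGAAGAAGAACC", "GAA")
copies_short = max_tandem_copies("TTGAAGAAGAACC", "GAA")
```

Under this view the two example alleles differ by a single stutter step, whereas a column-wise comparison would see a three-nucleotide gap.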
