There are numerous models that estimate the genetic distance between nucleotide sequences on the basis of point mutations. As far as I understand the standard phylogenetic pipeline, every alignment position that contains an indel (gap) in at least one of the sequences is discarded.

Besides that, there are methods that take indels into account (so-called "indel coding"), but as I understand them, these effectively set point mutations aside and measure the number of insertions/deletions that separate the given sequences.

My practical problem is to estimate distances between alleles of a hypervariable gene that are very indel-rich, with some alleles differing in the contents of an indel region. Discarding either kind of information (indels or point mutations) loses too much: too many alleles become indistinguishable.

So the question arises: is there a method/model that can effectively "digest" both indels and point mutations?

The central obstacle here appears to be the scoring of a single point mutation vs. a single indel event.
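
To make the scoring problem concrete, here is a minimal sketch of the kind of combined measure I have in mind. It is not an established model: the function names, the `indel_weight` parameter, and the choice to count each contiguous gap run as a single indel event are all my own assumptions for illustration. It takes two already-aligned sequences and returns a weighted event count per compared column.

```python
def count_events(a, b, gap="-"):
    """Count substitutions and indel events between two aligned sequences.

    Assumption (illustrative only): a contiguous run of gaps in the same
    sequence is treated as one indel event, regardless of its length.
    Returns (substitutions, indel_events, compared_columns).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    substitutions = 0
    indel_events = 0
    compared = 0
    open_gap_side = None  # 0 if the open gap run is in a, 1 if in b
    for x, y in zip(a, b):
        if x == gap and y == gap:
            # Gap in both sequences: uninformative for this pair, so skip it
            # without closing any gap run that is currently open.
            continue
        compared += 1
        if x == gap or y == gap:
            side = 0 if x == gap else 1
            if open_gap_side != side:
                indel_events += 1  # a new gap run starts here
                open_gap_side = side
        else:
            open_gap_side = None
            if x != y:
                substitutions += 1
    return substitutions, indel_events, compared


def mixed_distance(a, b, indel_weight=1.0, gap="-"):
    """Weighted events per compared column (hypothetical scoring scheme)."""
    substitutions, indel_events, compared = count_events(a, b, gap)
    if compared == 0:
        return 0.0
    return (substitutions + indel_weight * indel_events) / compared


if __name__ == "__main__":
    s1 = "ACGT--ACGT"
    s2 = "ACCTGGAC-T"
    print(count_events(s1, s2))
    print(mixed_distance(s1, s2, indel_weight=2.0))
```

Everything hinges on `indel_weight`: at 0 the measure reduces to a substitutions-only proportion over the compared columns, and at large values the indel events dominate. How to choose that weight in a principled way is exactly what I do not know.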
