Background

Phylogenies are usually estimated from a sample of specific genomic regions, typically a few to a few hundred loci and occasionally more. Putatively orthologous regions are aligned and the phylogeny is estimated under a substitution model. Often large portions of the genome are never sequenced (it's costly), but even worse, researchers frequently ignore much of the data that next-generation sequencing does generate.

An alignment-free method described by Fan et al. (2015) uses k-mer frequencies to reconstruct a phylogeny from whole-genome raw short-read sequence (SRS) data. A basic evolutionary model of sequence divergence is used to construct phylogenies with branch lengths. Various adjustments are made to account for differences in genome size, homoplasy, sequencing errors, and a range of sequencing coverage. Arguably it is also more or less immune to issues of recombination?
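To make the k-mer idea concrete, here is a rough Python sketch of how a pairwise distance could be computed from raw reads without any alignment. It is my own simplified illustration, not the authors' implementation: the function names are made up, the distance -(1/k) ln(shared / total) with `total` taken as the smaller of the two k-mer sets is an assumption on my part, and it skips reverse complements, coverage filtering, and the corrections for sequencing error mentioned above.

```python
import math


def kmers_from_reads(reads, k):
    """Collect the set of distinct k-mers seen in a list of read strings.

    Hypothetical helper: k-mers containing anything other than A/C/G/T
    (for example N) are skipped.
    """
    valid = set("ACGT")
    kmers = set()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= valid:
                kmers.add(kmer)
    return kmers


def shared_kmer_distance(kmers_a, kmers_b, k):
    """Distance from the proportion of shared k-mers (assumed form).

    If each site mutates independently, a k-mer survives unchanged only
    when all k of its sites are unchanged, which motivates dividing the
    log of the shared proportion by k.
    """
    shared = len(kmers_a & kmers_b)
    total = min(len(kmers_a), len(kmers_b))
    if shared == 0 or total == 0:
        return math.inf  # no shared k-mers: distance is undefined/saturated
    return max(0.0, -math.log(shared / total) / k)


if __name__ == "__main__":
    # Toy reads for two samples; real data would come from FASTQ files.
    sample_a = ["ACGTTGCA"]
    sample_b = ["ACGTTGCC"]
    k = 3
    d = shared_kmer_distance(
        kmers_from_reads(sample_a, k), kmers_from_reads(sample_b, k), k
    )
    print(f"distance = {d:.4f}")
```

A matrix of such pairwise distances could then be handed to any distance-based tree method (neighbor joining, for instance) to get a topology with branch lengths.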

I think this method could be used to take fuller advantage of next-generation sequencing data. It gets at "total evidence" and seems like a defensible model, at the very least worth comparing with phylogenies generated from aligned samples of targeted loci. Do you agree?

What are the biggest limitations of this approach? Would you consider including a tree generated in this way as a potentially valid alternative to traditional alignment-based methods? Are there smarter people out there who can critique this?

Literature cited:

Fan, Huan, Anthony R. Ives, Yann Surget-Groba, and Charles H. Cannon. 2015. “An Assembly and Alignment-Free Method of Phylogeny Reconstruction from Next-Generation Sequencing Data.” BMC Genomics 16 (1): 522. doi:10.1186/s12864-015-1647-5.

http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1647-5
