I am revisiting a phylogeny I did years ago, but RefSeq in GenBank now has more than 15,000 relevant protein sequences! I want to filter those down to a more manageable set. I am using T-Coffee (like I used to), but it's taking a long time, and I am wondering what people do these days. Are there other tools people use to automatically remove sequences that are very similar to each other? Since I am interested in the deep branches, I don't need all the sequences, just the few hundred most divergent ones (a lot of these are different strains of E. coli, for example). Any suggestions?
