I work on an ancient protein family that exists in both eukaryotes and prokaryotes, and my interest is in elucidating its deeper branches. When I pull the data down from NCBI I get about 6,000 sequences. If I cluster similar proteins (e.g., at 80% similarity) the set shrinks considerably, and clustering at 50% similarity leaves about 550 sequences.
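
For reference, the clustering step can be scripted like this. This is only a minimal sketch: CD-HIT is one common choice of clustering tool (not necessarily what you would use), and the file names are placeholders:

```python
import subprocess

def cluster(fasta_in: str, fasta_out: str, identity: float, word_size: int) -> None:
    """Cluster sequences at a given identity threshold with CD-HIT."""
    subprocess.run(
        ["cd-hit", "-i", fasta_in, "-o", fasta_out,
         "-c", str(identity), "-n", str(word_size)],
        check=True,  # raise if cd-hit exits with an error
    )

# ~80% identity uses word size 5; ~50% identity uses word size 3,
# following CD-HIT's documented -c/-n pairings.
cluster("family.fasta", "family_c80.fasta", 0.8, 5)
cluster("family.fasta", "family_c50.fasta", 0.5, 3)
```

Each output FASTA keeps one representative per cluster, which is where the reduced sequence counts come from.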

The question is: in your experience, which approach would you choose?

- Use the smaller dataset (or an even smaller one) and run the most sophisticated, computationally intensive analyses you can, hoping that if there is a signal these analyses will pick it up and the deeper branches will get better support.

- Use a lot of data and hope that the intermediate evolutionary steps can then be inferred more easily, giving the deeper branches more support? (A rough sketch of both options follows.)
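
To make the two options concrete, here is a hedged sketch of each. The tool choices are illustrative, not prescriptive: PhyloBayes' CAT-GTR stands in for an intensive, site-heterogeneous Bayesian analysis on the small set, and IQ-TREE for a faster maximum-likelihood analysis on the full set; the alignment file names are placeholders:

```python
import subprocess

# Option 1: small (50%-clustered) dataset, computationally intensive
# Bayesian inference under CAT-GTR. "aln550.phy" is a placeholder
# alignment; "chain1" names the MCMC chain.
subprocess.run(["pb", "-d", "aln550.phy", "-cat", "-gtr", "chain1"], check=True)

# Option 2: full ~6,000-sequence dataset, faster maximum likelihood with
# automatic model selection (-m MFP) and 1000 ultrafast bootstrap
# replicates (-B 1000). "aln6000.fasta" is a placeholder alignment.
subprocess.run(
    ["iqtree2", "-s", "aln6000.fasta", "-m", "MFP", "-B", "1000"],
    check=True,
)
```

Either way the alignment step comes first; the sketch only contrasts the tree-inference cost of the two routes.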
