I work on an ancient protein family that exists in both eukaryotes and prokaryotes, and my interest is in resolving its deeper branches. When I pull the data down from NCBI I get about 6,000 sequences. If I cluster similar proteins (e.g. at 80% identity) the dataset shrinks, and if I cluster at 50% identity I am left with about 550 representatives.
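For reference, the clustering step I mean is the usual representative-sequence reduction. A minimal sketch of how it could be run with MMseqs2's `easy-cluster` (the filename `family.fasta` and the output prefixes are just placeholders; CD-HIT would work equally well):

```python
import subprocess
from pathlib import Path

def cluster(fasta: str, min_id: float, out_prefix: str) -> int:
    """Cluster with MMseqs2 at the given identity threshold and
    return the number of representative sequences kept."""
    # Assumes mmseqs is on PATH; "tmp" is a scratch directory.
    subprocess.run(
        ["mmseqs", "easy-cluster", fasta, out_prefix, "tmp",
         "--min-seq-id", str(min_id)],
        check=True,
    )
    # easy-cluster writes representatives to <prefix>_rep_seq.fasta
    reps = Path(f"{out_prefix}_rep_seq.fasta").read_text()
    return reps.count(">")

for min_id in (0.8, 0.5):
    n = cluster("family.fasta", min_id, f"clust{int(min_id * 100)}")
    print(f"min-seq-id {min_id:.0%}: {n} representatives")
```

The exact representative counts will of course depend on the tool and its coverage settings, but this reproduces the 80% vs. 50% comparison above.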
The question is: in your experience, which approach would you choose?
- Use the smaller dataset (or an even smaller one) and run the most sophisticated, computationally intensive analyses you can, hoping that if there is a signal, these analyses will pick it up and the deeper branches will get better support.
- Use a lot of data and hope that the intermediate evolutionary steps can then be inferred more easily, giving the deeper branches more support.