Hi all!
I want to construct a phylogenetic tree of all earthworm (Lumbricidae) COI sequences available on GenBank.
My search of the NCBI Nucleotide database retrieved about 10,000 sequences. Building a tree from all of them is clearly impractical, and I expect the set to contain redundant records, exact duplicates, and unverified sequences.
What should my next step be with these sequences? Should I clean the dataset of 10,000 sequences first? If so, how would I do that, and which tools or software are commonly recommended?
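To make the question concrete, here is a rough sketch of the kind of cleaning I have in mind, assuming the search results are downloaded as a single FASTA file. The function names (`parse_fasta`, `clean_records`) and the thresholds (`min_len`, `max_ambiguous`) are placeholders I made up, not recommended values, and I am assuming unverified records can be recognised by the word "UNVERIFIED" in the header line.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks).upper()
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks).upper()


def clean_records(
    records: Iterable[Tuple[str, str]],
    min_len: int = 500,           # placeholder threshold, not a recommendation
    max_ambiguous: float = 0.01,  # placeholder: max fraction of non-ACGT bases
) -> List[Tuple[str, str]]:
    """Drop unverified, short, ambiguous, and exactly duplicated sequences."""
    seen = set()
    kept = []
    for header, seq in records:
        # Assumption: unverified records carry "UNVERIFIED" in the header.
        if "UNVERIFIED" in header.upper():
            continue
        if not seq or len(seq) < min_len:
            continue
        ambiguous = sum(1 for base in seq if base not in "ACGT")
        if ambiguous / len(seq) > max_ambiguous:
            continue
        # Keep only the first record for each identical sequence string.
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept


if __name__ == "__main__":
    # "lumbricidae_coi.fasta" is a hypothetical file name.
    with open("lumbricidae_coi.fasta") as handle:
        cleaned = clean_records(parse_fasta(handle))
    print(len(cleaned), "sequences kept")
```

This only removes exact duplicates and obviously poor records; it does not trim sequences to a common COI region or collapse near-identical haplotypes, so I am unsure whether it is anywhere near sufficient.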
Any insight on the logical next steps would be immensely appreciated.