I am working with over 2,600 genomes and wish to study genome, gene, and intergenic features across various taxonomic groups. For taxonomic groups with very few representatives, there is no issue. For groups with many genomes, on what basis should I remove similar genomes so that only a few representatives remain per group? Should I use length, GC%, or some other feature? For example, if two genomes differ in GC% by less than 1%, I would remove one of them. Please suggest accepted approaches and kindly explain the reasoning behind them.
Example:
I have around 60 genomes of Mycobacterium spp., more than 20 of which are M. tuberculosis alone. These M. tuberculosis genomes have a
GC% range of 65.48 to 65.7 and a
length range of 4.27 to 4.41 Mb.
How should I screen and remove similar genomes?
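
To make the thresholding idea concrete, here is a minimal sketch of the kind of filter I have in mind. It is only an illustration of the GC%/length rule described above, not an accepted method. The names `Genome` and `pick_representatives`, the tolerance values, and the genome records are all hypothetical placeholders. The numbers are made up to fall near the ranges quoted above and are not real measurements.

```python
from typing import List, NamedTuple


class Genome(NamedTuple):
    name: str
    gc_percent: float  # GC content in percent
    length_mb: float   # genome length in megabases


def pick_representatives(
    genomes: List[Genome],
    gc_tol: float = 1.0,      # assumed threshold: GC% difference below this counts as similar
    len_tol_mb: float = 0.2,  # assumed threshold: length difference below this counts as similar
) -> List[Genome]:
    """Greedy filter: keep a genome only if no already-kept genome
    is within both the GC% and the length tolerance of it."""
    kept: List[Genome] = []
    for g in genomes:
        is_similar = any(
            abs(g.gc_percent - k.gc_percent) < gc_tol
            and abs(g.length_mb - k.length_mb) < len_tol_mb
            for k in kept
        )
        if not is_similar:
            kept.append(g)
    return kept


# Made-up records for illustration only (not real genome statistics).
genomes = [
    Genome("Mtb_A", 65.48, 4.27),
    Genome("Mtb_B", 65.70, 4.41),
    Genome("Mtb_C", 65.60, 4.30),
    Genome("Myco_X", 68.90, 5.50),
]

reps = pick_representatives(genomes)
print([g.name for g in reps])  # → ['Mtb_A', 'Myco_X']
```

With this rule the result depends on the order of the input list, because the first genome seen in each cluster of similar genomes is the one that is kept. That is part of why I am unsure whether such a simple threshold is a sound basis for choosing representatives.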