When we download thousands of sequences from databases, we often pick up multiple copies of the same sequence, submitted by different groups or sometimes from different strains. How can we curate the data set so that it contains only one representative of each sequence for quantification and comparative studies?
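One possible starting point, as a minimal sketch rather than a complete curation workflow: the Python below collapses records whose sequences are exactly identical (ignoring case) and keeps the first header seen for each. The function names `read_fasta` and `dedupe_exact` and the example records are hypothetical, and the input is assumed to be in FASTA format. Near-identical copies, such as strain variants that differ by a few bases, are not merged by this approach; those would need similarity-based clustering with a tool such as CD-HIT.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_exact(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()  # treat 'acgt' and 'ACGT' as the same sequence
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Made-up example records, for illustration only.
example = """>seq1 lab A
ACGTACGT
>seq2 lab B
acgt
ACGT
>seq3 strain X
ACGTTTTT
""".splitlines()

unique = dedupe_exact(read_fasta(example))
for header, seq in unique:
    print(f">{header}\n{seq}")
```

Here `seq2` is dropped because its sequence matches `seq1` once the wrapped lines are joined and the case is normalised. Which copy to keep (first seen, longest header, preferred submitter) is a design choice; this sketch simply keeps the first.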
