Hi fellow researchers, I have a beginner question about the topic in the title of this post. I am trying to build a data set of enzymes (from the UniProt database) that contains protein sequences and their corresponding parent enzyme class (say hydrolase or kinase or similar). Now, I need to remove proteins that have similar sequences (using the 30% threshold that bioinformaticians commonly use) and keep only proteins with dissimilar sequences in my data set. Which measure would be more appropriate for this filtering: sequence similarity or sequence identity?
I am not a bioinformatics researcher, but my project requires this task. I don’t yet have a clear understanding of these concepts, though I am willing to learn. I would also love to learn which tools are used to assess the sequence similarity and identity of protein sequences in a data set prepared in-house (not an online database).
Feel free to refer me to resources that contain relevant and comprehensive information. I would greatly appreciate any suggestions. Thanks a million in advance.