I have several protein structure datasets, each of which is non-redundant and has low homology (each contains roughly 7k to 14k proteins).

My first strategy was to analyze each dataset separately and estimate the possible overlaps afterwards.

However, I realized that the PDB ID alone is not enough to screen the datasets: two entries with different IDs can share the same sequence, so the real overlap depends on actual sequence identity.
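
To make the point concrete, here is a minimal sketch of the simplest check I can think of: counting sequences that are exactly identical in two datasets, regardless of their IDs. It assumes each dataset can be exported as FASTA text; the function names `read_fasta` and `exact_overlap` and the toy entries are hypothetical, made up for illustration. It only catches 100% identical sequences, so it would give a lower bound on the overlap rather than the redundancy at a given identity threshold.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {id: sequence}; sequences are upper-cased."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if header is not None:
                records[header] = "".join(chunks).upper()
            parts = line[1:].split()
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks).upper()
    return records


def exact_overlap(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, float]:
    """Count distinct sequences present in both datasets, ignoring IDs."""
    seqs_a = set(a.values())
    seqs_b = set(b.values())
    shared = seqs_a & seqs_b
    return {
        "shared": len(shared),
        "fraction_of_a": len(shared) / len(seqs_a) if seqs_a else 0.0,
        "fraction_of_b": len(shared) / len(seqs_b) if seqs_b else 0.0,
    }


# Toy data (invented): the first entries differ in ID but not in sequence.
dataset_a = ">1abc_A\nMKT\nAYI\n>2xyz_A\nGGSG\n"
dataset_b = ">9zzz_B\nmktayi\n>3def_A\nPPPP\n"

result = exact_overlap(
    read_fasta(dataset_a.splitlines()),
    read_fasta(dataset_b.splitlines()),
)
print(result)
```

In the toy data, a comparison by ID alone would report no overlap, while the comparison by sequence finds the shared entry.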

What would be an acceptable method to estimate the redundancy between, for example, two of these datasets? For now I just need to know how much overlap there is between them.
