I need some help. Suppose we have a collection of similar protein sequences and want to build a hidden Markov model (HMM) for it. What is the lower limit on the number of protein sequences in the collection needed to produce a reliable MSA?
The answer depends, for one thing, on how variable the protein is. If your sequences are 99% identical to each other, then alignment is trivial, and most of the variants are probably just random mutations that have not yet undergone any selective pressure. If you have a protein critical to all life forms, such as a DNA polymerase, sampling a few sequences each from tetrapods, fish, insects and crustaceans would be better than taking 100 from mammals.
Unfortunately, I have no information about the function of these proteins. All of these potential families were recognized de novo, by clustering. I know that some of these clusters span many bacterial phyla, while others come from just a single species. I would like to know what the “trust limits” of similarity are for such clusters to yield a reliable HMM. In my opinion, 100 nearly identical sequences (80-90% similarity) from metagenome samples cannot be a reliable basis for building an HMM, because there is no information about which organisms they come from. It may be that “most of the variants are probably just random mutations that have not yet undergone any selective pressure”, but often they may simply be reads sequenced from an abundant species in the corresponding sample.
On the other side are clusters with low similarity (at least 0%). Many of these look like proteins from many distinct organisms, but such a cluster may also contain non-homologs.
I am afraid that a truly reliable MSA is not possible for this kind of data.
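To make the two problem cases above concrete, here is a minimal sketch of how a cluster could be screened before building an HMM from it. It assumes the cluster has already been aligned (equal-length strings, `-` for gaps) and it measures percent identity, which is stricter than similarity. The function names and the two cutoffs (`redundant_above`, `divergent_below`) are hypothetical placeholders for illustration, not recommended trust limits.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def cluster_identity_summary(aligned):
    """Min, mean and max pairwise identity over all sequence pairs in a cluster."""
    values = [pairwise_identity(a, b) for a, b in combinations(aligned, 2)]
    if not values:
        raise ValueError("need at least two sequences")
    return {
        "min": min(values),
        "mean": sum(values) / len(values),
        "max": max(values),
    }


def flag_cluster(summary, redundant_above=90.0, divergent_below=30.0):
    """Label a cluster; both cutoffs are illustrative assumptions only."""
    if summary["mean"] >= redundant_above:
        return "near-redundant"   # possibly one abundant species sampled many times
    if summary["min"] < divergent_below:
        return "check-homology"   # may contain non-homologous members
    return "usable"


if __name__ == "__main__":
    toy_cluster = ["MKTAYIAK", "MKTAYIAK", "MKTAYIAR"]  # made-up sequences
    summary = cluster_identity_summary(toy_cluster)
    print(summary, flag_cluster(summary))
```

A summary like this does not answer the question of how many sequences are enough, but it separates clusters that are effectively one sequence repeated many times from clusters that may mix unrelated proteins.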
It seems that your paper: https://www.researchgate.net/publication/283492932_A_combined_bioinformatics_and_functional_metagenomics_approach_to_discovering_lipolytic_biocatalysts shows that alignments and analyses of alignments work well in at least some cases. You were able to predict that a set of unknown proteins could be lipases or esterases and then proved that this protein could cleave esters. And that was done while working with a most difficult data set: proteins that were not already assigned to known protein families.