I'm working on an "alternative" amino acid substitution matrix in the style of BLOSUM (i.e., the focus is on functional and structural alignment between remote homologs).
I know the original BLOSUM matrices were built from blocks of conserved residues extracted from pairwise (or multiple?) local alignments (i.e., "local, ungapped alignments").
I don't understand what would be the best way to work with this in the following context:
1) I have MSAs for the various sequences and families that I wish to analyze in order to get the statistics for building my new AA substitution matrix.
Should I be looking at pairwise alignments instead? What does "no gaps" mean in practice? What is the minimal length of such an alignment?
2) What are the best, modern tools for extracting these ungapped blocks?
(The BLOCKS+ database and the tools the Henikoffs used there aren't maintained anymore, and I'm unsure whether just extracting local alignments is enough. I've seen something similar in Gblocks, but that seems aimed at phylogenetic analysis.)
3) To reduce "sequence similarity" (i.e., clustering at 62% identity gives BLOSUM62, etc.), should I simply cluster/filter all the sequences in my database using CD-HIT/UniRef? Or should the clustering be applied at the level of the individual protein blocks? Or somehow applied post-alignment?
4) I recall that in the original paper, Henikoff & Henikoff didn't use any existing matrices (e.g., PAM) to get their blocks/aligned motifs. I'm confused about how to do that using existing methods, and whether I even should (i.e., motif extraction vs. pairwise alignment vs. MSA for block extraction).
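To make the question concrete, here is a minimal Python sketch of what I currently understand by "ungapped blocks": runs of consecutive MSA columns in which no sequence has a gap, kept only if they reach some minimum length. This is my own naive reading, not the procedure used to build the original BLOCKS database, and the function names (`ungapped_blocks`, `pair_counts`), the toy alignment, and the `min_len` value are all made up for illustration. It also does no identity clustering or sequence weighting, which is exactly the part I'm unsure about in 3).

```python
from collections import Counter
from itertools import combinations

GAP_CHARS = set("-.")  # assumption: gaps are written as '-' or '.'


def ungapped_blocks(msa, min_len=5):
    """Return (start, end) column ranges, end exclusive, where no sequence has a gap."""
    if not msa:
        return []
    ncol = len(msa[0])
    if any(len(seq) != ncol for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    blocks, start = [], None
    # Iterate one position past the last column so a trailing run gets closed.
    for col in range(ncol + 1):
        gap_free = col < ncol and all(seq[col] not in GAP_CHARS for seq in msa)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_len:
                blocks.append((start, col))
            start = None
    return blocks


def pair_counts(msa, blocks):
    """Count unordered residue pairs per column over all sequence pairs (unweighted)."""
    counts = Counter()
    for start, end in blocks:
        for col in range(start, end):
            residues = [seq[col] for seq in msa]
            for a, b in combinations(residues, 2):
                counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical toy alignment, only to show the input and output shapes.
toy_msa = [
    "MKV-LAGWERT",
    "MKVQLAG-ERT",
    "MRV-LSGWDRT",
]
blocks = ungapped_blocks(toy_msa, min_len=3)
counts = pair_counts(toy_msa, blocks)
print(blocks)
print(counts.most_common(5))
```

If this is roughly the right idea, my questions above are about where the identity clustering fits relative to these two steps, and whether a gap-free column run from an existing MSA is an acceptable stand-in for a proper block.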
[Disclaimer: I lack any previous background in MSA work & the like].
Thank you very much!!