Maximum parsimony (MP) has been used extensively for phylogenetic analysis of nucleic acids. Can it also be used for protein sequences? If so, how effective is it compared with ML or NJ?
MP, NJ and ML methods were actually first applied to proteins before shifting to nucleic acid analysis. The first application of MP to protein sequences goes back to the 1960s (Dayhoff, 1966), when all amino acid substitutions were treated as equivalent. Fitch later adapted the algorithm to nucleotide sequences. MP inference is reliable and accurate for estimating topologies as long as the alignment is not affected by aberrant behavior such as homoplasy or saturation.
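To make the "all substitutions are equivalent" idea concrete, here is a minimal sketch of equal-cost parsimony scoring on a fixed tree, using the set-based counting usually attributed to Fitch. It is only an illustration, not the procedure from the papers mentioned above: the function names, the nested-tuple tree format and the four toy protein sequences are all made up for this example.

```python
def fitch_score(tree, leaf_states):
    """Minimum number of state changes for one alignment column.

    tree: nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D")).
    leaf_states: dict mapping leaf name -> observed state (one character).
    Every change costs 1, whatever the two states are.
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):              # leaf: its observed state
            return {leaf_states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:                             # children agree: no change needed
            return common
        changes += 1                           # children disagree: one change
        return left | right

    state_set(tree)
    return changes


def parsimony_length(tree, alignment):
    """Sum the per-column scores over a whole alignment (dict name -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical toy alignment of four short protein sequences.
alignment = {"A": "MKV", "B": "MKL", "C": "MRL", "D": "MRL"}
tree_ab_cd = (("A", "B"), ("C", "D"))
tree_ac_bd = (("A", "C"), ("B", "D"))

print(parsimony_length(tree_ab_cd, alignment))  # 2
print(parsimony_length(tree_ac_bd, alignment))  # 3
```

MP prefers the tree with the smaller length, here the one grouping A with B. The same code works unchanged for nucleotides, since it only compares characters.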
Compared with nucleotide-based inference, protein data have the advantage of 20 states instead of 4, which makes homoplasy less likely. That matters because homoplasy is the main source of bias in MP phylogenetic inference. The disadvantage is that the mechanisms underlying amino acid substitutions are more complicated than those of nucleotide substitutions, because a given amino acid substitution can arise from several different combinations of single, double or triple nucleotide substitutions.
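The point about single, double or triple nucleotide substitutions can be checked directly against the standard genetic code. The sketch below is only an illustration (the helper name `min_nt_changes` is my own): it computes the smallest number of nucleotide differences between any codon of one amino acid and any codon of another.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def min_nt_changes(aa_from, aa_to):
    """Smallest number of nucleotide differences between any pair of codons."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa_from)
        for c2 in codons_for(aa_to)
    )


print(min_nt_changes("D", "E"))  # 1  (e.g. GAT -> GAA)
print(min_nt_changes("M", "W"))  # 2  (ATG -> TGG)
print(min_nt_changes("M", "Y"))  # 3  (ATG -> TAT or TAC)
```

Equal-cost parsimony on amino acids counts each of these three replacements as a single step, although they differ in the minimum number of underlying nucleotide changes.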
As for topology estimation, MP has, despite sturdy defense by its supporters, been shown to be inconsistent under many circumstances (i.e. increasing the amount of data does not necessarily lead to the right topology). While supporters tended to publish improvements and claims about its accuracy, researchers with a statistical or mathematical background tended to report results favoring distance and likelihood methods. They pointed out that ML or Bayesian estimation of a topology can only be inconsistent if the wrong model is chosen. The disputes between supporters of distance methods and supporters of parsimony on this point are well known.
The other part of a phylogenetic inference, branch length estimation, was considered one of the weakest aspects of parsimony, so much so that branch lengths were usually not reported. The tendency was to underestimate them, and some specific types of bias even got a name, such as long-branch attraction, in which minimizing the number of mutation events makes lineages that have been separated for a long time appear to share a more recent common ancestor than they really do.
For protein inference, modern methods try to cope with several sources of bias. I recently learned about Markov codon models, which are alternative substitution models that incorporate the structure of the genetic code into amino acid substitutions. In my opinion they could be more interesting than classical methods for protein sequence phylogeny.
Have a nice day.
Article CodonPhyML: Fast Maximum Likelihood Phylogeny Estimation und...
As in any analysis, MP (like ML or NJ) depends on your data fitting the requirements of the method. All methods are actually good; the problem is that they are often used with data that do not fit their requirements or assumptions.
The main requirement of MP is that transformations among states are rare. Your null hypothesis is actually that each state has a single evolutionary origin, and characters assumed to be independent are tested by a congruence "test" against the other characters analysed together with them. Deviations from this single-origin hypothesis should be randomly distributed; otherwise you introduce a bias into your analysis.
So in most cases amino acid data fit these requirements even better than nucleic acid data, since (i) as already mentioned by Edson, the number of states is higher, so the probability of homoplasy is lower per se, and (ii) the evolutionary rate is lower, because many of the nucleotide changes usually present in data sets are synonymous and do not alter the protein.
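Point (ii) is easy to see on a toy case. The sketch below is only an illustration: the two coding sequences are invented, and the helper names `translate` and `p_distance` are my own. It compares the proportion of differing sites at the nucleotide level and at the amino acid level.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence codon by codon (trailing partial codon ignored)."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences that differ only at third codon positions.
seq1 = "ATGAAACTGGGT"
seq2 = "ATGAAGCTAGGC"

print(p_distance(seq1, seq2))                        # 0.25
print(translate(seq1), translate(seq2))              # MKLG MKLG
print(p_distance(translate(seq1), translate(seq2)))  # 0.0
```

Here a quarter of the nucleotide sites differ, yet every change is synonymous and the two proteins are identical. Real data are of course less extreme than this constructed example.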
My 2 cents would be that you should play around with your data in various phylogeny inference tools, including one for MP, to get a feeling for how your data behave under different optimality criteria.
My recommendation for an MP program would be TNT: http://www.lillo.org.ar/phylogeny/tnt/
Some related reading: https://www.researchgate.net/publication/235601668_Incorporating_molecular_data_in_fungal_systematics_a_guide_for_aspiringresearchers
Article Incorporating molecular data in fungal systematics: A guide ...