After searching for a suitable phylogenetic tree using RAxML, how can I compute the Bayesian posterior probability for the resulting tree? What programs should I use? Thanks.
RAxML uses maximum likelihood to estimate the "best" tree. For node support using RAxML you would need to bootstrap. This seems like a handy tutorial, and it alludes to differences in the metric of support at nodes: http://bodegaphylo.wikispot.org/RAxML_Tutorial. To estimate support at nodes in terms of Bayesian posterior probabilities, you would have to run a different program that uses Bayesian statistics, such as BEAST or MrBayes.
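To make the bootstrap idea concrete, here is a minimal sketch of what one bootstrap replicate is: the alignment columns are resampled with replacement, and a tree is then re-estimated from each resampled alignment. This is only an illustration, not how RAxML implements it internally; the function name `bootstrap_alignment`, the toy sequences, and the seed are all made up for the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    if any(len(alignment[name]) != n_sites for name in names):
        raise ValueError("all sequences must have the same aligned length")
    # Draw as many column indices as there are sites, with replacement.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Apply the same column choice to every taxon so columns stay intact.
    return {name: "".join(alignment[name][i] for i in columns) for name in names}


# Toy alignment (hypothetical data, for illustration only).
alignment = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "ACGTTCGTAC",
    "taxonC": "ACGAACGTTC",
}
rng = random.Random(42)  # fixed seed so the replicate is reproducible
replicate = bootstrap_alignment(alignment, rng)
for name, seq in replicate.items():
    print(name, seq)
```

Bootstrap support for a clade is then the fraction of replicate trees that contain it. That is a different quantity from a posterior probability, which is why the two numbers are reported separately.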
I am not clear whether BEAST or MrBayes can test posterior support for a given tree. I looked at BEAST and could not find where to set this. Do you know how to set this?
BTW, according to its manual PhyloBayes can do this, but when I run the pb command it always says there is an error in the command. Does anyone have experience with PhyloBayes?
It looks to me as if some of the answers above are not quite hitting it. It looks like you don't want to build a new tree with Bayesian methods; you want to evaluate a tree you already built with RAxML. I am not a whiz with MrBayes, BEAST and other Bayesian phylogeny tools, but I suspect you should be able to do what you want in one or more of those tools by loading in your tree and your data and asking the software to evaluate your input tree rather than search for "the best tree".
Thanks all! Brian got what I meant, and I have figured out how to do this. The Tree Annotator program in BEAST allows me to load my target tree and calculate the corresponding posterior support. Now I am able to present both the bootstrap and Bayesian supports on my ML tree. I saw some people do this, but they did not state in detail how. Another question: I always get very weak maximum likelihood bootstrap support for one of my target clades. Some people say this may be a problem when your target sequences are too similar, and that weak ML bootstrap support does not necessarily mean the clustering is wrong. Does anyone have any comment? I wonder what you do, or how you justify it, when there is weak bootstrap support. Thanks a lot!
I was a little confused by the answers here relative to the question. Bayesian posterior probabilities are based on the results of a Bayesian phylogenetic analysis. The most used phylogenetic methods (RAxML, MrBayes) evaluate how well a given phylogenetic tree fits your molecular data. A likelihood score is calculated at each iteration of the analysis. For each iteration, the tree is tweaked and its new score is compared to the scores of earlier trees. Bayesian analyses differ from a simple ML analysis in that they run a Markov chain Monte Carlo (MCMC) sampler, often for millions of iterations, and poorer-scoring phylogenies are still retained some of the time rather than always being discarded in favor of better ones. In the end, all of the retained trees make up your "posterior distribution". The proportion of these trees that contain your clade of interest gives you its "posterior probability". That is, it estimates the probability of a specific clade/branch as the frequency with which it is found in the distribution of Bayesian trees.
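The "proportion of trees that have your clade" calculation can be sketched in a few lines. This is a toy illustration under stated assumptions, not what MrBayes or BEAST run internally: each sampled tree is represented simply as a set of clades (each clade a frozenset of taxon names), the function name `clade_posterior` is made up, and the burn-in fraction of 0.25 is just a placeholder value for discarding early samples.

```python
def clade_posterior(sampled_trees, clade, burnin_fraction=0.25):
    """Fraction of post-burn-in sampled trees that contain the given clade."""
    if not 0 <= burnin_fraction < 1:
        raise ValueError("burnin_fraction must be in [0, 1)")
    # Discard the first part of the chain as burn-in (assumed fraction).
    start = int(len(sampled_trees) * burnin_fraction)
    kept = sampled_trees[start:]
    if not kept:
        raise ValueError("no trees left after discarding burn-in")
    target = frozenset(clade)
    hits = sum(1 for tree in kept if target in tree)
    return hits / len(kept)


# Two toy topologies for taxa A, B, C (plus an implied outgroup).
tree_ab = {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}
tree_ac = {frozenset({"A", "C"}), frozenset({"A", "B", "C"})}

# A pretend MCMC sample of eight trees, in sampling order.
samples = [tree_ac, tree_ac, tree_ab, tree_ab, tree_ab, tree_ac, tree_ab, tree_ab]

pp_ab = clade_posterior(samples, {"A", "B"})
pp_abc = clade_posterior(samples, {"A", "B", "C"})
print("PP(A,B)   =", round(pp_ab, 3))
print("PP(A,B,C) =", round(pp_abc, 3))
```

Scoring the clades of a fixed target tree (for example an ML tree) against the sample is the same calculation repeated for each clade in that tree, which is why a clade can get low posterior support even though it appears in the ML topology.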
In essence, you have to perform a Bayesian analysis to get posterior probabilities. I imagine that with Brian's suggestion you're using your RAxML tree as a starting tree to begin a Bayesian analysis. This should work just fine, but don't be surprised if the Bayesian consensus tree isn't exactly identical to your RAxML one.
It's important to realize that every phylogeny is simply a hypothesis: an estimate of relationships based on the data we are using. If you are trying to find evidence for a particular systematic relationship, but the phylogenetic analysis isn't giving the support you need, then you need to ask yourself a few more questions. Are you using the right data? That is, if you're using molecular data, perhaps the genes you're using aren't sufficient by themselves, and more or different molecular characters are needed. Also, what if the phylogeny is correct? What were the preliminary observations that caused you to hypothesize these organisms were closely related? Perhaps those observations were off, and there is really something else that is driving the evolution and diversification of the organisms you are studying. Phylogenies can be powerful in showing us how things evolved and motivating us to develop new hypotheses.
Hi Andrew, thanks for the detailed elaboration! Sorry about the late reply.
Yes indeed, I have a single gene family whose phylogeny I want to look at. The topologies always varied between methods, including NJ, ML, and Bayesian analysis. The NJ and ML trees are fairly close and are consistent with the commonly recognized tree of life, but the ML support was weak for one of my target clades. Therefore, I want to test the Bayesian posterior support for my ML tree in the hope that there might be strong Bayesian support for that clade.
The Bayesian analysis generates a pool of trees. I can calculate the best Bayesian tree from this pool, and I can also calculate the posterior support for a previously determined phylogeny (not as an initial search tree). The thing is that the best Bayesian tree topology is very different from my NJ and ML trees.
The Bayesian support calculation for my ML tree also doesn't give support to my target clade. So the problem is that I prefer to believe the NJ and ML topology, but I cannot get support for my target clade. (I found it hard to interpret the best Bayesian tree although it has strong Bayesian support. It seems to me that you could always get a strongly supported Bayesian tree if you let it run long enough, right?)
You may be right: the weak support for my target clade may result from my dataset, which may not be sufficient. Then my question is why the ML method gives weak support for its "best tree". Bootstrapping 1000 times for ML is already very computationally expensive.
Another question: when constructing a phylogeny for a protein-encoding gene family, do you prefer to use the protein sequence or the CDS sequence? I found they could lead to very different results.
Sorry. This is probably waayy too late to reply as I'm sure you've probably resolved the issue. Regardless, I feel compelled to respond to your questions. First, yes, Bayesian analyses that have 'plateaued' will just increase support for the topology with longer runs. This is assuming that there isn't a more optimal topology hiding in treespace that the Bayesian analysis will somehow magically find. This is rare but not impossible.
As far as protein sequence vs CDS, am I correct in assuming you mean amino acid sequence vs nucleotide sequence for a particular gene? Assuming yes, I think you can analyze it both ways and reflect on the similarities more than the differences. I tend to use nucleotide sequence exclusively as there is more information there. However, if the relationships between the genes are evolutionarily distant, then with nucleotides you might not be able to align the dataset sufficiently for analysis. The one time I encountered this problem I went to amino acid sequence and everything worked well. Ultimately, between the two datasets you would choose the phylogeny ("answer") that makes the most sense biologically to focus your discussion around.
Thanks Andrew! Always glad to learn more. I like the idea of a "hypothesis" in phylogenetic analysis from your previous answer. Whatever results people get from different phylogenetic methods, they have to make biological sense.
As you mentioned, simply letting the Bayesian MCMC chain run longer after convergence would increase topology support. Could these results be biased if someone intentionally let it run longer? Is there a norm that should be followed?
Nucleotide sequences definitely contain more information than amino acid sequences. From my understanding, however, amino acid sequences do not necessarily produce worse results when people try to answer a specific biological question.
I wouldn't call it bias. It's a statistical sampling process. More sampling = more statistical rigor. What you're essentially saying is that there is a high probability that the data you are using are going to recover a particular topology (give high posterior probability for some clades in your tree). Given that a phylogeny represents a 'hypothesis', any particular topology can be challenged or shut down given different or better data and analysis. As a result, someone can come along later and replicate any of my research, using different data and analysis, and ultimately refute my conclusions. This is the nature, and the strength, of the scientific method. Each new study generates greater understanding.
Beyond this philosophizing, Bayesian analyses are typically run with 10 thousand replicates. I also think it is good to perform a 10K replicate analysis multiple independent times (3x in general). Each time you can replicate a result, you are demonstrating statistical rigor in your analysis, outcome, and ultimate conclusions.
EDIT: Upon review I said 10K replicates when I should have said 10 million. My bad.
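One rough way to check whether independent runs "replicate" each other is to compare clade frequencies between them. The sketch below averages, over clades, the standard deviation of each clade's frequency across runs. It is similar in spirit to the average standard deviation of split frequencies that MrBayes reports, but it is not MrBayes's exact calculation: the function names, the `min_frequency` cutoff of 0.1, and the use of the sample standard deviation are assumptions made for this example, and trees are again represented as sets of frozensets.

```python
from statistics import stdev


def clade_frequencies(trees):
    """Map each clade to the fraction of trees in one run that contain it."""
    if not trees:
        raise ValueError("a run must contain at least one tree")
    counts = {}
    for tree in trees:
        for clade in tree:
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: n / len(trees) for clade, n in counts.items()}


def average_split_sd(runs, min_frequency=0.1):
    """Average, over clades, of the SD of clade frequency across runs."""
    if len(runs) < 2:
        raise ValueError("need at least two independent runs")
    per_run = [clade_frequencies(trees) for trees in runs]
    all_clades = set().union(*per_run)
    sds = []
    for clade in all_clades:
        freqs = [freq.get(clade, 0.0) for freq in per_run]
        # Ignore clades that are rare in every run (assumed cutoff).
        if max(freqs) >= min_frequency:
            sds.append(stdev(freqs))
    return sum(sds) / len(sds) if sds else 0.0


# Toy post-burn-in samples from three hypothetical runs.
AB = frozenset({"A", "B"})
AC = frozenset({"A", "C"})
run1 = [{AB}] * 9 + [{AC}] * 1
run2 = [{AB}] * 8 + [{AC}] * 2
run3 = [{AB}] * 9 + [{AC}] * 1

print("agreement between runs:", round(average_split_sd([run1, run2, run3]), 4))
```

Values near zero mean the runs are sampling clades at similar frequencies; a large value suggests the runs have not converged on the same posterior distribution and should run longer before clade support is reported.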
Phylogenetic analyses were performed by Bayesian inference (BI) implemented in MrBayes 3.2.5 and by maximum likelihood (ML) using RAxML v8.0. ML bootstrap values (BPs) and posterior probabilities (PPs) were plotted on Bayesian 50% majority-rule consensus trees using FigTree v1.4.2.