I need to run codeml on ortholog clusters from different related genera. Do I need to generate a species tree that I can use for all the clusters, or do I need to generate a gene tree for each cluster individually?
Hi. Considering that codeml depends on internal nodes to calculate the dN/dS ratio for the different branches of the tree, I would suggest that you first use either a maximum likelihood or a Bayesian approach to generate an individual tree for each cluster and then pass those trees to codeml. This way you have the freedom to choose different models and approaches while building the tree for each cluster.
You can build the phylogenetic trees in whichever way you prefer. As far as I remember, the input branch lengths are not used but recalculated by codeml. It is mandatory to align the proteins first and then align the coding sequences following the protein alignment, as done by RevTrans (http://www.cbs.dtu.dk/services/RevTrans/) or alternatives such as MACSE (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0022594). Only in this way can you be sure that the alignment algorithm does not introduce frameshifts, which would create artifacts in the ensuing selection analysis. Also, as codeml basically counts and models mutations, it is important that the sequences do not differ too much; if they do, many sites may have undergone multiple mutations that you cannot observe, and the estimated level of selection will be affected. You can check for this by looking at the branch lengths of the input tree, usually given in substitutions per site: they should be much less than one for every sequence in the multiple alignment. Clearly, the tree has to be built from the nucleotide sequences. If there are highly variable regions, you can remove the problematic regions from the alignment, but you need to be very careful in doing so: you have to reason in coding-sequence terms, so the unit you edit is always three nucleotides, following the reading frame.
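To illustrate the protein-guided step and the codon-wise trimming described above, here is a minimal Python sketch. It is not RevTrans or MACSE, only the same idea in a few lines. The function names `backtranslate` and `drop_codon_columns` and the toy sequences are made up for illustration, and the code assumes each coding sequence is in frame, has no trailing stop codon, and matches its protein residue for residue.

```python
def backtranslate(aligned_protein, cds, gap="-"):
    """Thread an in-frame CDS onto its aligned protein, one codon per residue."""
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length must be 3 x the number of protein residues")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            # A protein gap becomes a whole-codon gap, so the frame is preserved.
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


def drop_codon_columns(codon_alignment, bad_codons):
    """Remove whole codon columns (0-based codon indices) from every sequence."""
    bad = set(bad_codons)
    trimmed = {}
    for name, seq in codon_alignment.items():
        if len(seq) % 3 != 0:
            raise ValueError(f"{name}: length is not a multiple of three")
        kept = [seq[i:i + 3] for i in range(0, len(seq), 3) if i // 3 not in bad]
        trimmed[name] = "".join(kept)
    return trimmed


# Toy input (hypothetical): protein alignment plus the unaligned CDS of each sequence.
proteins = {"sp1": "MK-L", "sp2": "MKAL"}
cds = {"sp1": "ATGAAACTG", "sp2": "ATGAAGGCTCTT"}

codon_aln = {name: backtranslate(proteins[name], cds[name]) for name in proteins}
trimmed = drop_codon_columns(codon_aln, bad_codons=[2])  # drop the gappy third codon

for name in codon_aln:
    print(name, codon_aln[name], trimmed[name])
```

Because gaps and deletions are always applied three nucleotides at a time, every sequence stays in frame both before and after trimming.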
It is also important to compare the likelihood of the model with selection against that of the model without selection. If the two do not differ significantly, then even if you observed a few sites under positive selection, you cannot reject the hypothesis of no selection.
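The comparison is usually done as a likelihood ratio test on the log-likelihoods (lnL) that codeml reports for the two models. Below is a minimal sketch using only the Python standard library. The lnL values are invented for illustration, and the 2 degrees of freedom are an assumption that fits common site-model pairs such as M1a vs M2a or M7 vs M8; use the difference in free parameters of the models you actually compare.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting points: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        q, k = math.exp(-z), 2
    else:
        q, k = math.erfc(math.sqrt(z)), 1
    # Recurrence Q(k + 2) = Q(k) + z^(k/2) * exp(-z) / Gamma(k/2 + 1).
    while k < df:
        q += z ** (k / 2.0) * math.exp(-z) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (2 * delta lnL, p-value) for nested models."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, chi2_sf(stat, df)


# Hypothetical lnL values read from two codeml output files.
stat, p_value = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-995.0, df=2)
print(f"2*delta_lnL = {stat:.2f}, p = {p_value:.4f}")
```

With these made-up numbers the statistic is 10 and the p-value is about 0.0067, so the model without selection would be rejected at the 5% level.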
Ideally, you should use the species tree and several gene trees generated using different nucleotide substitution models. If the null is rejected with every one of these trees, then you have strong evidence for positive selection.
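One way to organise this is to write one codeml control file per candidate tree for the same codon alignment and run them all. The sketch below only builds the control-file text. The file names and the helper `make_codeml_ctl` are hypothetical, and the option values (F3x4 codon frequencies, site models 1 and 2, starting omega of 0.5) are illustrative assumptions, not recommendations from this thread.

```python
def make_codeml_ctl(seqfile, treefile, outfile, nssites=(1, 2)):
    """Return the text of a codeml control file for one alignment/tree pair."""
    options = [
        ("seqfile", seqfile),
        ("treefile", treefile),
        ("outfile", outfile),
        ("noisy", 0),
        ("verbose", 0),
        ("runmode", 0),      # 0: use the tree given in treefile
        ("seqtype", 1),      # 1: codon sequences
        ("CodonFreq", 2),    # 2: F3x4 (assumed choice)
        ("model", 0),        # 0: one omega across branches
        ("NSsites", " ".join(str(m) for m in nssites)),
        ("icode", 0),        # 0: universal genetic code
        ("fix_kappa", 0),
        ("kappa", 2),
        ("fix_omega", 0),
        ("omega", 0.5),
        ("cleandata", 0),
    ]
    return "\n".join(f"{key} = {value}" for key, value in options) + "\n"


# Hypothetical tree files: the species tree plus gene trees from different models.
trees = ["species_tree.nwk", "gene_tree_GTR.nwk", "gene_tree_HKY.nwk"]
ctl_files = {
    tree: make_codeml_ctl("cluster1.phy", tree, tree.replace(".nwk", ".out"))
    for tree in trees
}

print(ctl_files["species_tree.nwk"])
```

Each control file can then be written to disk and passed to codeml, and the likelihood ratio test repeated on every output; the conclusion is robust only if the null model is rejected for all of the trees.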