How do I calculate dN and dS? What kind of input should I prepare? And which method is better: (A) the Nei-Gojobori (1986) method, (B) the Yang & Nielsen (2000) method, or (C) the LWL85, LPB93 & LWLm methods?
While Nei-Gojobori and the other "counting" methods (all of those you name) made a major contribution to the field of computational molecular evolution, much better alternatives now exist. Do you only have pairs of sequences? If so, use maximum likelihood to compute dN and dS. This method was shown to perform best in the paper by Yang & Nielsen (2000). The method itself has nice statistical properties, unlike the "counting" approaches (which are basically ad hoc).
If you can, it is always better to use multiple sequence alignments instead of pairwise ones. Your inferences will be so much more powerful! Then you can study selection with more sophisticated codon models, as was already suggested above. With my collaborators I have written a few reviews on the topic of codon models and tests for positive selection: Anisimova & Liberles 2007 (Heredity), Anisimova & Kosiol 2009 (MBE), Kosiol & Anisimova 2012 (Meth Mol Biol), and book chapters in the book "Codon evolution: mechanisms and methods" - Anisimova & Liberles 2012 and Anisimova 2012. The pdfs of these papers are available on my website:
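For readers who have not seen what "counting" means in practice, here is a minimal Python sketch of the first step of a Nei-Gojobori-style calculation: counting the synonymous (S) and nonsynonymous (N) sites of a coding sequence under the standard genetic code. It is a simplification, not the full published method: changes to stop codons are simply counted as nonsynonymous, and the later steps (counting observed differences and applying a multiple-hit correction) are omitted. The function names are made up for this illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_sites(codon):
    """Number of synonymous sites in one codon (between 0 and 3).

    For each position, take the fraction of the three possible base
    changes that leave the amino acid unchanged, then sum over positions.
    Assumption of this sketch: changes to a stop codon count as
    nonsynonymous.
    """
    amino_acid = CODON_TABLE[codon]
    synonymous_changes = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == amino_acid:
                synonymous_changes += 1
    return synonymous_changes / 3


def count_sites(cds):
    """Return (S, N) for an in-frame coding sequence without stop codons."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    s = sum(synonymous_sites(c) for c in codons)
    n = 3 * len(codons) - s
    return s, n
```

The per-codon bookkeeping above is exactly the kind of fixed rule that makes these methods "counting" approaches, in contrast to fitting an explicit codon substitution model by maximum likelihood.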
My favorite implementation (for robustness, accuracy and ease of use) is PAML (codeml program) of Ziheng Yang.
We have also recently developed an application for inferring phylogenies using codon models - an extension of PhyML, called CodonPhyML. It is available on SourceForge (both Windows and Mac executables, and a compressed file with all the source code for compilation on Linux/Unix, including the user manual and examples):
http://sourceforge.net/projects/codonphyml
If you'd like to try it, please have a go. We'd appreciate any user feedback (both positive and negative :-))
We also used JCoDa, which is user-friendly and easy to apply; moreover, it provides nice plots illustrating the results. You can freely download it at http://www.tcnj.edu/~nayaklab/jcoda.
Thanks, all of you! I need more detailed instructions. Say I want to know the dS and dN of some regions between Human and Chimpanzee. First, I fetched the corresponding sequences from the hg18 and panTro2 reference genomes, and then aligned them to use as the input. Is this OK? Do I need sequences from outgroup species? Or should I have complete sequences from multiple individuals of Human and Chimpanzee? I'm new to molecular evolution and would appreciate your guidance.
Well, dN/dS is a pairwise characteristic, so you don't need any outgroups. (However, you will need them if you also decide to construct a phylogenetic tree, but that's a different story.) Of course, you can use whatever other sequences you like for comparison, but it doesn't make sense to take very distant sequences, as dN/dS becomes uninformative in that case.
As far as I remember, in JCoDa you don't even need to align the sequences; you can just input plain sequences (e.g. in FASTA format). In this case the tool will make the alignment automatically, which is, in my view, not the best option, because it's always advisable to have a look at your alignment before you go on with the analysis. You can also input an alignment you made yourself, and the tool will take over from there.
What is "complete sequences from multi-individuals of Human and Chimpanzee"?
Ekaterina, thank you for your comment, and sorry for my vague description. As you mentioned, dN/dS is a pairwise characteristic. But what if this pair of sequences is not representative? I mean, these two sequences come from reference genomes, and a reference genome is in fact the genome of one individual. Let's say this individual (species A) has a rare mutation, which differs from the corresponding position in species B, and this mutation is actually not fixed in the species A population. Could I call it a divergent site? That's why I asked in my last post whether I need sequences from more than one individual of both species A and species B, so that we can distinguish rare mutations from fixed ones. I hope it is clear now. And by the way, can you recommend any book or paper addressing these kinds of basic concepts of population genetics?
The PAML package, as suggested by Serena, is one of the most used. There are different tools inside. For example, there is the yn00 program, which computes dN/dS in a pairwise manner between two sequences:
http://abacus.gene.ucl.ac.uk/software/paml.html
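To make the yn00 step a bit more concrete, here is a small Python sketch that writes a yn00 control file for a codon alignment. The option names (seqfile, outfile, verbose, icode, weighting, commonf3x4) follow the example control file shipped with PAML as far as I know; the file names and the helper function are made up for illustration, and the values shown are just one plausible choice, so please check them against the PAML manual for your version.

```python
def make_yn00_ctl(seqfile, outfile="yn00_results.txt", icode=0):
    """Return the text of a yn00 control file.

    seqfile : codon-aligned nucleotide alignment (hypothetical path)
    outfile : where yn00 should write its results (hypothetical path)
    icode   : genetic code, 0 = universal (assumed from the PAML example file)
    """
    options = [
        ("seqfile", seqfile),
        ("outfile", outfile),
        ("verbose", 0),      # concise output
        ("icode", icode),
        ("weighting", 0),    # no weighting of pathways between codons
        ("commonf3x4", 0),   # separate codon frequencies for each pair
    ]
    return "".join(f"{name} = {value}\n" for name, value in options)


def write_yn00_ctl(path, seqfile, **kwargs):
    """Write the control file to `path` and return the text written."""
    text = make_yn00_ctl(seqfile, **kwargs)
    with open(path, "w") as handle:
        handle.write(text)
    return text
```

With such a file saved as yn00.ctl in the working directory, the program is normally run by calling yn00 from that directory; the results are then read from the file named under outfile.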
There is also CodeML, which is used to detect positive selection using a phylogenetic tree. If you are interested, there is a tutorial on using the branch-site model on my blog:
Having an outgroup is useful when you want to see on which branch a mutation occurred. In that case, you might want to study dN and dS for the two branches separately.
Regarding sample size, if you only have one sequence from each species you will mainly be able to see the effect of ancient mutations. In order to study more recent evolution you will typically need larger samples.
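To illustrate why larger samples matter, here is a minimal Python sketch that takes aligned sequences from several individuals of each species and labels each variable column as a fixed difference or as polymorphic. The function name and the toy sequences are made up for illustration; a real analysis would also need to handle gaps, missing data and sequencing errors.

```python
from typing import Dict, List


def classify_sites(sample_a: List[str], sample_b: List[str]) -> Dict[int, str]:
    """Classify variable alignment columns (0-based positions).

    "fixed"       : each species carries a single allele, and they differ
    "polymorphic" : at least one species carries more than one allele
    Columns where every sequence is identical are left out of the result.
    """
    sequences = sample_a + sample_b
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must be aligned to the same length")

    result = {}
    for i in range(length):
        alleles_a = {seq[i] for seq in sample_a}
        alleles_b = {seq[i] for seq in sample_b}
        if len(alleles_a) == 1 and len(alleles_b) == 1:
            if alleles_a != alleles_b:
                result[i] = "fixed"
        else:
            result[i] = "polymorphic"
    return result


# Toy example: three individuals per species.
human = ["ACGT", "ACGT", "ACTT"]
chimp = ["AAGT", "AAGT", "AAGT"]
print(classify_sites(human, chimp))
```

In the toy data, position 1 differs consistently between the two samples, whereas position 2 varies within the human sample. With only one sequence per species (for example, the first human and the first chimp sequence), position 2 would not show up as variable at all, and with the third human sequence it would be mistaken for a divergent site.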
I do not necessarily recommend PAML over HyPhy/Datamonkey: the latter is more user-friendly and flexible. More importantly, it allows synonymous rates to vary along the sequence; ignoring this variation may lead to false positives (for selection) in PAML.
As an alternative to ClustalW, you may use MAFFT for alignments, especially for big datasets.
Beware of recombination issues when dealing with several intraspecific sequences (see the Datamonkey site for more details and associated references).
Use Pal2Nal. Briefly: say you have a FASTA file with 10 aa sequences and the corresponding nucleotide sequences. You can align the aa sequences and get the output in FASTA format. The aligned aa sequences can then be used to get codon-aligned nucleotide sequences in FASTA format. However, the sequence IDs in the aligned aa FASTA file must match the sequence IDs in the nucleotide FASTA file you pass to Pal2Nal. You can also use Pal2Nal to remove ambiguous positions, stop codons, gaps, etc., basically to help your codon-aligned nucleotides conform to what PAML expects.
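The idea behind this step can be sketched in a few lines of Python. This is not Pal2Nal itself, just an illustration of threading unaligned coding sequences onto an amino-acid alignment, with sequences matched by ID as described above. The function names are made up, and the sketch assumes each nucleotide sequence starts in frame with the first aligned residue and does not check that the codons actually translate to the aligned residues.

```python
from typing import Dict


def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Insert three gap characters into `cds` for every gap in the protein."""
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("nucleotide sequence is too short for the protein")

    pieces = []
    pos = 0  # current position in the unaligned nucleotide sequence
    for aa in aligned_protein:
        if aa == gap:
            pieces.append(gap * 3)
        else:
            pieces.append(cds[pos:pos + 3])
            pos += 3
    # Anything left over (e.g. a trailing stop codon) is dropped.
    return "".join(pieces)


def codon_align(protein_aln: Dict[str, str], cds: Dict[str, str]) -> Dict[str, str]:
    """Codon-align every sequence; IDs must match between the two inputs."""
    missing = sorted(set(protein_aln) - set(cds))
    if missing:
        raise KeyError(f"no nucleotide sequence for IDs: {missing}")
    return {name: thread_codons(aln, cds[name]) for name, aln in protein_aln.items()}


# Toy example with two sequences; the second has a one-residue gap.
protein_aln = {"seq1": "MKV", "seq2": "M-V"}
cds = {"seq1": "ATGAAAGTT", "seq2": "ATGGTTTAA"}
print(codon_align(protein_aln, cds))
```

Because the gaps are placed on the amino-acid alignment, they always come in multiples of three in the nucleotide output, so the reading frame is preserved for codon-based programs.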