IN the DAMBE phylogenetic package under the pulldow menu for Graphics, there is an option to plot transitions and transversions for each pairwise comparison in your data set. This is very useful for studying the degree of saturation in your data set. I am attaching a plot made using Flaviviiruses complete genomes. In this plot you see that within recent isolates for each "species" (Zika Virus, Dengue serotype 1, Dengeue seqotype 2, etc) there is no evidence of saturations as the transitions outnumber transversions and the plots are linear. At greater distances, between "species" the transversions begin to outnumber transitions because many sites have been mutated hundreds of times over the many eons of time separating the species, even while many other sites remain invariant due to selection pressure.
Hi Brian, thanks very much for your kind answer. I have tried this software in windows, but I wonder to know if it can batch my data because of a huge number of orthologs needed to test. What do you think?
I doubt you need to test each ortholog. Once you look at a few data sets you will notice that nearly all comparisons are the same, not matter which organism (virus, bacteria, eukaryote) or which gene (structural RNA, positively selected or negatively selected protein-coding, etc) is studied. At low distances, less than 20% to 25% DNA distance (75% to 80% identity) saturation does not influence estimates of true distance very much. In the range from 25% to 50% distance, the multiple hits per site can be reasonably well corrected by phylogenetic software to get a reasonable estimate of true distance. Beyond there, we can only guess at true distance, but general branching order and clades can still be accurately reconstructed. The attached tree here is made with the same data set used for the TsTv plot above.
Hi Brian, thanks again, but I'm sorry I don't really understand you. The 10000 orthologous genes of six species were picked out by Blast. And I would use them to analyze and identify positive selection genes. I consider I should filter some saturated genes firstly. There are TsTv plots of two orthologs in the attachment. It seems the first one is saturated and another is not. Maybe different orthologs have different conditions. So I wonder if I need to test each ortholog to filter. What do you think?
I am doing the same kind of analysis with codeML over hundreds of orthologous genes coming from the comparison of different RNAseq datasets between closely-related species. First, I do not understand very well what you mean when telling us that nearly all genes are giving you N.dN and S.dS equal to zero for your tree branches. If both N.dN and S.dS are null this means that there is no substitution on your given branch so there is no need to check selection on it. When N.dN is null but S.dS is high, this just mean that there is a strong purifying selection acting on the branch and, in this context, there is no need to filter the gene as dN/dS=0 is a 'true' value. In fact, genes that should be filtered mostly correspond to those where S.dS=0 or when S.dS is very high (because of saturation). In the case of S.dS=0 but not N.dN, this may simply due to a possible bias in the codon sampling along the gene which is often due to the length of the alignment. I recommend to filter out orthologous genes with a CDS shorter than 150 codons to avoid this bias. You can also test if your dN/dS ratio is still greater than one by recalculating your dS with the value of 1/S instead of zero (i.e. got one synonymous substitution by chance). For saturation, dS>1 is good but drastic option. I think the best option is indeed to compute the tv/ts on your pairs of species to see if you should remove either a species or a set of genes before doing your codeML analyses.
Many thanks to your reply. I mean N.dN is null, S.dS is null or the both are null in some branch of each ortholog. But N and S in each ortholog are both not null, I am very confused about this situation. And this ortholog will be filtered, if N*dN
N et S represent the total number of putative non-synonymous and synonymous sites, respectively in the sequence. They are constant all over the branches of the tree and only depend on the gene under scrutiny. For example, If your first set of orthologs alignment is characterized by 100 codons (so 300 bases), then depending on the codon composition you may find that you have 100 possible synonymous sites (S) and 200 non-synonymous sites (N) with S+N = 300. If you look then at the observed substitutions between your species, let assume that you have 3 synonymous substitutions and 6 non-synonymous substitutions between your pair of species, then dS=3/S so 3/100 and dN=6/N so 6/200=3/100. Your dN/dS will be therefore equal to 1. In the tree, your observed non-synonymous and synonymous mutations will be distributed among the branches of the tree depending on whether they are shared or not between species. For example, if a non-synonymous mutation is only observed in one species, this mutation will be assigned to the terminal branch of the given species. As a consequence, if you look at the N.dN and the S.dS of a given branch, it simply indicates you how many non-synonymous (N.dN) and synonymous (S.dS) there are on the branch you look at. If both of them are null, it simply means that you have no mutation associated with the so-called branch and thus no reason to test it for selection. dN/dS (0/0) is infinite but must be filtered. Based on the examples you sent me, it seems that you have 6 species with a first ortholog of 720 bp (513.4+206.6) and a second long ortholog of 4323 bp (3126.6+1196.4). From the dataset, it seems that there are very few mutations between your species and thus, no saturation at all over the sequences as both dN and dS are very low. Your problem is thus mainly to filter the zero values from the dataset. In both cases, the number of codons is sufficiently high to discard a problem of codon sampling bias. So, you should exclude definitively all branches with both N.dN=0 and S.dS=0. You can keep branches with N.dN=0 if S.dS is not zero. However, if the S.dS is not very high: say lesser than 3, the number of mutations in the branch may be too low to get a safe estimate of dN/dS. It is then a matter of sequence length. Filtering all S.dS=0 is recommended but could be a problem in the specific case of species with weak divergence. As you have a much lower number of synonymous sites in the sequence, if the average number of mutations between species is low (your probable case), there is a good chance of having no observed synonymous mutations in the sequence despite some in the non-synonymous compartment. if sequences are long enough (example 4323 bp) and contain a non-negligible number of non-synonymous mutations (i.e. N.dN= 10) but no synonymous mutations (S.dS=0), it may be still indicative of positive selection, so it could be an option to test if the dN/dS is still greater than one, by recalculating it under the hypothesis that you may have missed one synonymous change with a dS=1/1196.4 (with the 4323 bp gene).
Considering the Tv/Ts analysis, I don't think that it is a good option given the two examples you gave me. It seems that your 6 species are very closely-related or still under speciation unless the two examples represent unusual highly conserved genes in your dataset. But yes, the idea was to compute Tv/Ts from all your species pairs for each ortholog and then to perform a biplot Tv vs Ts in order to test whether they are linearly regressed. You can then estimate a R2 (and/or its associated p-value) over your genes and remove the outliers using an R2 (or a p-value) threshold.
Thank you very much for your kind and clear explanation and suggestion. Indeed, my six species belong to two very closely-related families, 2 are in one family and 4 are another. Further, 2 of 4 are in the same genus. So, as you can see, a lot of dN, dS are null in almost all orthologs. I understand what you said except that I am not sure that if it needs to be filtered when both N*dN and S*dS are null in only one branch of the ortholog.
You don't need to filter out a gene if at least one branch of the tree has a N.dN and a S.dS equal to zero. You have perfectly the right to keep all the non null branch values for this specific gene and remove only remove the branches for which the tree has null values. What you need to do is to look at the dNdS variation for each branch separately and see which genes should be under positive selection for a given branch.
Let assume that you have a four branch tree with 3 species
gene 1 Br1 = N.dN=0 and S.dS=0, Br2 is OK (non null), Br3 is OK, and Br4 is OK
gene 2 Br1 is OK, Br2 is OK, Br3 = N.dN=0 and S.dS=0, Br4 is OK
gene 3 Br1 is OK, Br2= N.dN=0 and S.dS=0, Br3 is OK, Br4 = N.dN=0 and S.dS=0
and so on until you have all your 10000 orthologous genes
you can do a table with your 4 branches and keep in