Recently I have been very interested in calculating the dN/dS ratio to estimate rates of nonsynonymous and synonymous substitutions. Since I've just started learning the theory, I have a few basic questions:
One of the applications I'm interested in is estimating selection pressure on various viral genomes. Does it make sense to use dN/dS across coding sequences from different viruses; e.g. an analysis considering HCV, HIV, and influenza all together?
Relatedly, what is the best way to estimate dN/dS within species, and is there an existing implementation? I ask because of this paper, which explains why we shouldn't rely on dN/dS within species: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000304
Is it possible to use dN/dS on whole-genome coding sequences within species? I obtained CDSs for 4 strains of HCV from NCBI (NC_009823, NC_009824, NC_009825, NC_009826) and ran them through FEL in Datamonkey (http://www.datamonkey.org/fel), but the results said that there were no regions under positive or negative selection. This seems wrong to me, so it is likely that I'm doing the analysis incorrectly.
I am not an expert in virus evolution; they just mutate so fast. Since they are also so variable within a single infection, it always seems tricky to find a consensus.
Anyhow, you could try a window approach: rather than computing dN/dS over a whole gene, compute it in a sliding window along the gene.
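To make the window idea concrete, here is a rough Python sketch. It assumes two in-frame, already-aligned coding sequences of equal length, and it only tallies raw synonymous and nonsynonymous differences per window rather than a properly normalized dN/dS. The function names, window size, and step are my own choices for illustration, and codons that differ at more than one position are simply skipped.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codon_pair(c1, c2):
    """Return 'S' (synonymous), 'N' (nonsynonymous), or None if not scored."""
    if c1 == c2 or c1 not in CODON_TABLE or c2 not in CODON_TABLE:
        return None  # identical codons, gaps, or ambiguous bases
    if sum(a != b for a, b in zip(c1, c2)) != 1:
        return None  # simplification: multi-difference codons are skipped
    return "S" if CODON_TABLE[c1] == CODON_TABLE[c2] else "N"


def sliding_window_counts(seq1, seq2, window=30, step=10):
    """Tally S and N differences in windows measured in codons.

    Returns a list of (start_codon, syn_count, nonsyn_count) tuples.
    """
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned, in frame, and equal length")
    n_codons = len(seq1) // 3
    labels = [
        classify_codon_pair(seq1[3 * i:3 * i + 3].upper(),
                            seq2[3 * i:3 * i + 3].upper())
        for i in range(n_codons)
    ]
    results = []
    for start in range(0, max(n_codons - window, 0) + 1, step):
        chunk = labels[start:start + window]
        results.append((start, chunk.count("S"), chunk.count("N")))
    return results


if __name__ == "__main__":
    # Toy sequences, not real data.
    for row in sliding_window_counts("ATGAAATTT", "ATGAAGTTA", window=2, step=1):
        print(row)
```

Windows with few or no synonymous differences give unstable ratios, so in practice the window needs to be wide enough to contain a reasonable number of changes.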
The simple dN/dS is normally pairwise. You might want to compare against an outgroup and then use the HKA or McDonald-Kreitman test. This paper might be of use:
https://rdcu.be/OOlm or https://www.biorxiv.org/content/early/2016/12/20/095679
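For reference, the McDonald-Kreitman test boils down to a 2x2 table: fixed differences to the outgroup (Dn, Ds) versus polymorphisms within the species (Pn, Ps). Below is a minimal sketch, assuming you have already counted those four numbers. The function and argument names are mine, the counts in the example are made up, and the Fisher's exact test is a small hand-rolled version so the snippet has no dependencies.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def pmf(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = pmf(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    return sum(p for p in map(pmf, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """McDonald-Kreitman summary from divergence and polymorphism counts.

    dn, ds: fixed nonsynonymous / synonymous differences to the outgroup
    pn, ps: nonsynonymous / synonymous polymorphisms within the species
    """
    if min(dn, ds, ps) == 0:
        raise ValueError("dn, ds and ps must be non-zero for the ratios")
    ni = (pn / ps) / (dn / ds)  # neutrality index
    return {
        "NI": ni,
        "alpha": 1 - ni,  # rough estimate of the adaptive fraction
        "p_value": fisher_exact_two_sided(dn, ds, pn, ps),
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(mk_test(dn=10, ds=10, pn=5, ps=20))
```

A neutrality index below 1 points to an excess of nonsynonymous divergence (consistent with positive selection), while a value above 1 points to an excess of nonsynonymous polymorphism.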
- Estimating dN/dS and detecting positive selection across such highly diverged viruses, due to their high mutation rate and recombination, will be difficult. I guess this depends on what your question is - is this about molecular evolution across RNA viruses as a whole? Additionally, detecting positive selection can be done site-wise, gene-wise, or for a proportion of sites on specific branches of a phylogeny. My gut reaction is that alignment across all of these is going to be messy and lead to a lot of false-positive signatures of positive selection. But, if the question is compelling enough, there might be reasonable ways to deal with these sources of uncertainty.
- For within-species data, there is usually not enough information to reliably estimate dN/dS. Population-level work typically uses nonsynonymous and synonymous nucleotide polymorphism (pN/pS) instead, which is just counting and does not rely on an underlying substitution model and phylogeny. This depends on the evolutionary distance among individuals within the species, though: viruses like HIV usually have enough divergence to estimate dN/dS between strains and between samples within those strains (see work from Sergei Kosakovsky Pond's lab and colleagues, the HyPhy developers).
- Your results sound right. There were only 4 taxa, correct? There will be almost no power to detect selection with a site-wise method such as fixed effects likelihood (FEL). You want at least 10 sequences, and more is better. I forget where that number comes from; it might actually be specific to some simulations for MEME (another type of test), but the point is that you should not be surprised by your result.
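To illustrate the "just counting" approach mentioned above for within-species data, here is a simplified sketch in the spirit of Nei-Gojobori site counting for one pair of aligned, in-frame sequences. It is an assumption-laden toy, not a replacement for an established tool: the function names are mine, changes to stop codons are counted as nonsynonymous, codons with more than one difference are skipped (which undercounts differences), and no multiple-hit correction is applied.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Expected number of synonymous sites in one codon (between 0 and 3)."""
    s = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if TABLE[mutant] == TABLE[codon]:
                    s += 1 / 3  # each of the 3 possible changes weighs 1/3
    return s


def pn_ps(seq1, seq2):
    """Return (pN, pS): differences per nonsynonymous / synonymous site."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned, in frame, and equal length")
    syn_sites = syn_diffs = nonsyn_diffs = 0.0
    n_codons = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if c1 not in TABLE or c2 not in TABLE:
            continue  # skip gaps and ambiguous bases
        n_codons += 1
        # Average the site counts of the two sequences.
        syn_sites += (synonymous_sites(c1) + synonymous_sites(c2)) / 2
        if sum(a != b for a, b in zip(c1, c2)) == 1:
            if TABLE[c1] == TABLE[c2]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    nonsyn_sites = 3 * n_codons - syn_sites
    p_s = syn_diffs / syn_sites if syn_sites else float("nan")
    p_n = nonsyn_diffs / nonsyn_sites if nonsyn_sites else float("nan")
    return p_n, p_s


if __name__ == "__main__":
    # Toy sequences, not real data.
    print(pn_ps("GGGGGGATG", "GGAGGGATG"))
```

With only a handful of closely related sequences the counts will be tiny, which is the same lack of power described above, just made visible in the raw numbers.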