I'm using TREE-PUZZLE to identify the gene with the strongest phylogenetic signal. I would like a tool that can identify the region within a gene that has the strongest phylogenetic signal. Can someone help me?
The SimPlot program, by Stuart Ray, is invaluable for this type of question.
http://sray.med.som.jhmi.edu/SCRoftware/simplot/
It may take you a couple of hours to figure out how best to use it. If you run into any trouble, write to me with a sample data set and I can send you back some screenshots and tips.
I'm very grateful for your offer, and I would be very happy to have your help with this analysis. I have a data set of the large segment of Oropouche virus, and I'd like to make a comparison between these genes. My data set is composed only of Oropouche sequences. I'm sending a FASTA alignment that contains my data set.
The sequences in that alignment are from the L segment of the Oropouche virus genome, which encodes a polymerase. This virus seems to be rather slowly evolving, judging by the fact that GenBank holds sequences from several other isolates that are 99% identical to these. Oropouche virus is an RNA virus in the Orthobunyavirus group. Its next closest neighbor deposited in GenBank so far is Jatobal virus at 77% identity, followed by several other viruses such as Shuni virus at 67% nucleotide identity over the 6.8 kb sequenced. Surely there must be other viruses in the gap between 99% identity and 80% identity, but they have not yet been sampled, sequenced and entered into GenBank. The differences between the 99% identical sequences are spread rather evenly over the 6.8 kb, as we would expect. So looking at the variability in the more distantly related sequences (those with something between 60 and 80% identity) will give a better idea of which regions of the polymerase are variable and which are conserved. I have attached the quickest and dirtiest tree possible, which is simply the BLAST result tree provided by NCBI/GenBank.
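As a rough illustration of how identity figures like the ones above can be computed from an alignment, here is a minimal Python sketch. It assumes the sequences are already aligned to the same length and simply skips gapped columns. The function names and the toy sequences are made up for the example (they are not real Oropouche data), and BLAST reports identity over its own local alignments, so its numbers can differ somewhat.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # ignore gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

def identity_table(alignment):
    """alignment: dict of name -> aligned sequence. Returns {(n1, n2): pct}."""
    return {
        (n1, n2): percent_identity(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }

# Toy alignment with hypothetical names, for illustration only.
toy = {
    "isolate_A": "ATGGCTAAAC",
    "isolate_B": "ATGGCTAAAT",
    "outgroup": "ATGACCAA-C",
}
for pair, pct in identity_table(toy).items():
    print(pair, round(pct, 1))
```

Running this on the full alignment gives every pairwise value at once, which makes it easy to see whether the data set contains only near-identical sequences or also some more distant ones.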
Following a BLAST search result link to the NCBI taxonomy of these viruses brings us to: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=11572&lvl=3&lin=f&keep=1&srchmode=1&unlock
And then clicking on the POPSET (set of sequences from population diversity studies) takes us to: http://www.ncbi.nlm.nih.gov/popset/?term=txid11572[Organism:exp]
This is very helpful because, although you have sequenced the L segment encoding the polymerase, the most phylogenetic signal in a virus infecting mammals is likely to be found in an envelope or coat gene segment that is being driven to diversify by host immune responses. Also, regardless of where the best signal is in the genome, it is often most informative to be on the same bandwagon, sequencing the same genomic region that other researchers have sequenced in other isolates of your virus or related viruses. When studying bacteria, for example, the 16S rRNA is not the best molecule for phylogenetic signal, but it is the best for having the most isolates of the most species entered into the databases.
Anyway, at this point, there is a lot more information, and several potentially fruitful paths to follow. But which path is "best" depends on what exactly your biological questions are. Are you interested in one local outbreak of Oropouche virus, or are you interested in how this virus compares to related Orthobunyavirus species?
The aim of this study is to find which gene, and which region within it, holds the strongest phylogenetic signal. We will then test by phylogenetic analysis whether such regions give the same results as the complete sequences of the segments. We are interested in reducing the cost of molecular diagnosis of Oropouche virus, because it is known to cause outbreaks in our region.
After this, we will do the same analysis for the other segments, the small and medium segments of OROV. Ms. Brian, my question is: can we predict the region that holds the best phylogenetic signal by comparing only the OROV sequences among themselves? Or will we need to include other viruses, such as Jatobal virus?
Ms. Brian, I am very grateful for your great help; you have shown yourself to be a very competent person. I apologize if I wrote something wrong. I'm still a beginner in this field of evolutionary study.
The question of detecting infections is a bit different from finding the region with the most or strongest phylogenetic signal. In retroviruses, for example, the entire genome is prone to mutation, as the error-prone reverse transcriptase copies the RNA viral genome to DNA in the cytoplasm of the host cell. But the envelope gene of retroviruses evolves very much faster than the pol gene, because env is under much more positive selection pressure, driven by antibody as well as T-cell host immune responses, while pol gene evolution is essentially free from antibody selection pressure. Also, many regions of the envelope protein are free to change (less negative selection pressure), while more regions of the Protease, Reverse Transcriptase, and Integrase proteins encoded by the pol gene are constrained to preserve the functions of the proteins (more negative selection pressure).
So the best region of the HIV-1 genome for detecting infections is the more conserved pol gene region. Tests which detect the HIV-1 M group subtype B pol gene will also detect HIV-1 M group viruses of the other subtypes, because the pol gene is so similar between subtypes. The best region of the HIV-1 genome for determining the evolution pattern within a single patient, or for detecting who infected whom in a transmission cluster, is the env gene, which is more diverse.
A plot of silent vs nonsilent mutations in the gene region you have here shows, as expected, that the polymerase of the Oropouche virus is under negative selection, with almost all of the observed differences between your sequences being silent site substitutions which do not alter the encoded protein sequence. The attached plot was made with the HIV Databases Highlighter tool.
From this plot alone, you might be tempted to think that there are sites near codons 200, 500 and 1250 which are mutation "hot spots". But that is not the case. What you have there are 3 sites where the "reference sequence" differs from the other viruses. In this case the reference was AM01_1980, because it was the first sequence in your file, perhaps sorted alphabetically.
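The effect of the reference choice can be checked directly on an alignment. The sketch below is only an illustration under simple assumptions: `reference_private_sites` is a made-up helper name, the sequences are toy data with hypothetical names, and gaps are treated like any other character. It lists the columns where every other sequence agrees and only the reference differs, which are exactly the columns that show up as a mark on every line of a highlighter-style plot.

```python
def reference_private_sites(alignment, ref_name):
    """Return 1-based columns where all non-reference sequences share one
    character and the reference carries a different one."""
    ref = alignment[ref_name]
    others = [seq for name, seq in alignment.items() if name != ref_name]
    if any(len(seq) != len(ref) for seq in others):
        raise ValueError("sequences must be aligned to the same length")
    private = []
    for pos in range(len(ref)):
        column = {seq[pos] for seq in others}
        if len(column) == 1 and ref[pos] not in column:
            private.append(pos + 1)
    return private

# Toy alignment with hypothetical names, for illustration only.
toy = {
    "seqA": "ATGACA",
    "seqB": "ATGGCA",
    "seqC": "ATGGCT",
    "seqD": "ATGGCA",
}
for ref in ("seqA", "seqB", "seqC"):
    print(ref, reference_private_sites(toy, ref))
```

With `seqA` as the reference, column 4 looks like a site where "everything" has mutated; with `seqB` as the reference, the same column shows a single change on one line. The data are the same, only the point of comparison has moved.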
A tree built from your sequences shows that there are 2 clades of Oropouche virus in your sequences, and that the viruses are not sorted into clades by year of sample.
Choosing a different virus to which the others are compared in a highlighter plot again illustrates the lack of amino acid changing mutations (red) among very many synonymous mutations (green).
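For anyone who wants to reproduce the silent/nonsilent distinction by hand, here is a minimal Python sketch. It assumes the two sequences are aligned in frame, uses the standard genetic code, and classifies each differing codon as a whole. That is a simplification of my own for illustration; it is not necessarily how the Highlighter tool does it, and it is not a dN/dS estimate. The function name and toy sequences are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_codon_differences(ref, query):
    """Return (silent, nonsilent): lists of 1-based codon positions where
    the query codon differs from the reference codon."""
    if len(ref) != len(query) or len(ref) % 3:
        raise ValueError("need in-frame sequences of equal length")
    silent, nonsilent = [], []
    for i in range(0, len(ref), 3):
        c1, c2 = ref[i:i + 3].upper(), query[i:i + 3].upper()
        if c1 == c2:
            continue
        if c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue  # skip codons with gaps or ambiguity codes
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            silent.append(i // 3 + 1)
        else:
            nonsilent.append(i // 3 + 1)
    return silent, nonsilent

# Toy in-frame sequences, for illustration only.
ref = "ATGGCTAAACTG"    # M A K L
query = "ATGGCCAGACTA"  # M A R L
print(classify_codon_differences(ref, query))
```

A gene under negative selection, like the polymerase here, should give a long silent list and a short nonsilent one for most pairs of sequences.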
In the second highlighter plot, the region between codons 600 and 820 stands out as apparently lacking any DNA changes, silent or nonsilent. Is this region protected from mutation? Or is there an RNA secondary structure or some other biological function which selects against mutations in this region? There can be many explanations for a highly conserved region that are more biologically plausible than a region being protected from mutation. In general, mutations happen at random (with respect to position in a gene; transitions happen more often than transversions for a number of reasons), and unequal rates of evolution at different sites in a gene are a function of selection pressures.
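A stretch like that can also be located without a plot, by counting variable columns in a sliding window along the alignment. The sketch below is a simplified illustration, not a reimplementation of SimPlot or Highlighter: the function names, the window and step sizes, and the toy sequences are all assumptions made for the example, and columns are counted in nucleotides rather than codons.

```python
def variable_sites(seqs):
    """One boolean per column: True if more than one base is present,
    ignoring gaps and N."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    flags = []
    for pos in range(length):
        bases = {s[pos].upper() for s in seqs} - {"-", "N"}
        flags.append(len(bases) > 1)
    return flags

def window_scan(seqs, window, step):
    """Return (start, end, n_variable) for each window, 1-based inclusive."""
    flags = variable_sites(seqs)
    results = []
    for start in range(0, len(flags) - window + 1, step):
        count = sum(flags[start:start + window])
        results.append((start + 1, start + window, count))
    return results

# Toy alignment, for illustration only.
toy = [
    "ACGTACGTACGT",
    "ACGAACGTACGT",  # differs at column 4
    "ACGTACGTACCT",  # differs at column 11
]
for start, end, count in window_scan(toy, window=4, step=4):
    print(f"{start}-{end}: {count} variable site(s)")
```

Windows with a count of zero are the conserved stretches; windows with the highest counts are the first candidates for a short, informative region, keeping in mind that with near-identical sequences the counts will be low everywhere.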
I am very grateful for the explanation. I'll have to talk to my employer, as we need to change our methodology. In fact, everything is much clearer now.