Software for predicting phylogenetic markers in a viral genome data set?

Hello Brian,

I'm very grateful for your offer, and I would be very happy to have your help with this analysis. I have a data set of the large (L) segment of Oropouche virus, and I'd like to compare these genes. My data set consists only of Oropouche sequences. I'm sending a FASTA alignment that contains my data set.

I'll be waiting for your reply.

Cordially

Brian Thomas Foley

The sequences in that alignment are the L segment of the genome of Oropouche virus, which encodes a polymerase. This virus seems to be rather slowly evolving, based on the fact that there are sequences from several other isolates in GenBank that are 99% identical to these. Oropouche virus is an RNA virus in the Orthobunyavirus group. The next closest neighbor that has been deposited in GenBank so far is Jatobal virus at 77% identity, followed by several other viruses, such as Shuni virus, at 67% nucleotide identity over the 6.8 kb sequenced. Surely there must be other viruses in the gap between 99% identity and 80% identity, but they have not yet been sampled, sequenced and entered into GenBank. The differences between the 99% identical sequences are spread rather evenly over the 6.8 kb, as we would expect. So looking at the variability in the more distantly related sequences (those with something between 60 and 80% identity) will give a better idea of which regions of the polymerase are variable and which are conserved. I have attached the most "quick and dirty" tree possible, which is simply the BLAST result tree provided by NCBI/GenBank.
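For readers who want to check identity figures like these on their own alignment, here is a minimal sketch in Python. It is not how BLAST computes identity; it simply counts matching columns in an already aligned FASTA file and skips any column where either sequence has a gap. The sequence names and the toy alignment are made up for illustration.

```python
from itertools import combinations

def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}

def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

# Hypothetical toy alignment; replace with the contents of a real aligned FASTA file.
toy = """>seqA
ATGGCTAAAC
>seqB
ATGGCTAAAT
>seqC
ATGACT-AGT
"""

aln = parse_fasta(toy)
for n1, n2 in combinations(aln, 2):
    print(n1, n2, round(percent_identity(aln[n1], aln[n2]), 1))
```

On a real data set the same loop gives the full pairwise table, which makes it easy to see whether the sequences fall into a "99%" cluster with nothing in between, as described above.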

Brian Thomas Foley

Following a BLAST search result link to the NCBI taxonomy of these viruses brings us to: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=11572&lvl=3&lin=f&keep=1&srchmode=1&unlock

And then clicking on the POPSET (set of sequences from population diversity studies) takes us to: http://www.ncbi.nlm.nih.gov/popset/?term=txid11572[Organism:exp]

This is very helpful because although you have sequenced the L segment encoding polymerase, the most phylogenetic signal in a virus infecting mammals is likely to be found in an envelope or coat gene segment that is being driven to diversity by the host immune responses. Also, regardless of where the best signal is in the genome, it is often most informative to be on the same bandwagon, sequencing the same genomic region, that other researchers have sequenced in other isolates of your virus or related viruses. When studying bacteria, for example, the 16S rRNA is not the best molecule for phylogenetic signal, but it is the best for having the most isolates of the most species entered into the databases.

Anyway, at this point, there is a lot more information, and several potentially fruitful paths to follow. But which path is "best" depends on what exactly your biological questions are. Are you interested in one local outbreak of Oropouche virus, or are you interested in how this virus compares to related Orthobunyavirus species?

Davi Toshio Inada

The aim of this study is to find which gene, and which region within it, holds the strongest phylogenetic signal. We will then test by phylogenetic analysis whether such regions give the same results as the complete sequences of the segments. We are interested in reducing the cost of molecular diagnosis of Oropouche virus, because it is known to cause outbreaks in our region.
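One rough way to start looking for such a region is to count parsimony-informative sites in sliding windows along the alignment. The sketch below is only a crude proxy for phylogenetic signal, not a validated method, and the window size, step and toy sequences are arbitrary assumptions chosen for illustration.

```python
from collections import Counter

def is_parsimony_informative(column):
    """True if at least two nucleotide states each occur in at least two sequences."""
    counts = Counter(c for c in column if c in "ACGT")
    return sum(1 for n in counts.values() if n >= 2) >= 2

def window_scores(seqs, window, step):
    """Return (start, end, informative_sites) per window, 1-based inclusive coordinates."""
    length = min(len(s) for s in seqs)
    informative = [
        is_parsimony_informative([s[i] for s in seqs]) for i in range(length)
    ]
    scores = []
    for start in range(0, length - window + 1, step):
        scores.append((start + 1, start + window, sum(informative[start:start + window])))
    return scores

# Hypothetical toy alignment of four sequences (already aligned, upper case).
seqs = [
    "AAAAAAACGTAC",
    "AAAAAAACGTAC",
    "AAAAAAGTGAAC",
    "AAAAAAGTGACC",
]

for start, end, n in window_scores(seqs, window=6, step=6):
    print(f"{start}-{end}: {n} informative sites")
```

A window with many informative sites is a candidate region, but whether a tree built from that region alone agrees with the full-segment tree still has to be tested by the phylogenetic analysis itself.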

Davi Toshio Inada

After this, we will do the same analysis for the other segments, the small and medium segments of OROV. Brian, my question is: can we predict the region that holds the best phylogenetic signal by comparing only the OROV sequences among themselves? Or will we need to include other viruses, such as Jatobal virus?

Davi Toshio Inada

Brian, I am very grateful for your great help; you have shown yourself to be a very competent person. I apologize if I wrote something wrong. I'm still a beginner in this field of evolutionary study.

Brian Thomas Foley

The question of detecting infections is a bit different from finding the region with the most or strongest phylogenetic signal. In retroviruses, for example, the entire genome is prone to mutation, as the error-prone reverse transcriptase copies the RNA viral genome to DNA in the cytoplasm of the host cell. But the evolution rate of the envelope gene of retroviruses is very much faster than the evolution rate of the pol gene, because env is under much more positive selection pressure, driven by antibody as well as T-cell host immune responses, while pol gene evolution is essentially free from antibody selection pressure. Also, many regions of the envelope protein are free to change (less negative selection pressure), while more regions of the Protease, Reverse Transcriptase, and Integrase proteins encoded by the pol gene are constrained to preserve the functions of the proteins (more negative selection pressure).

So the best region of the genome of HIV-1 for detecting infections is the more conserved pol gene region. Tests which detect the pol gene of HIV-1 M group subtype B will also detect HIV-1 M group viruses of the other subtypes, because the pol gene is so similar between subtypes. The best region of the genome of HIV-1 for determining the evolution pattern within a single patient, or for detecting who infected whom in a transmission cluster, is the env gene, which is more diverse.

A plot of silent vs nonsilent mutations in the gene region you have here shows, as expected, that the polymerase of the Oropouche virus is under negative selection, with almost all of the observed differences between your sequences being silent site substitutions which do not alter the encoded protein sequence. The attached plot was made with the HIV Databases Highlighter tool:

http://www.hiv.lanl.gov/content/sequence/HIGHLIGHT/HIGHLIGHT_XYPLOT/highlighter.html

From this plot alone, you might be tempted to think that there are sites near codons 200, 500 and 1250 which are mutation "hot spots". But that is not the case. What you have there are 3 sites where the "reference sequence" differs from the other viruses. In this case the reference was AM01_1980, because it was the first sequence in your file (perhaps sorted alphabetically).
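The reference effect can be reproduced with a small sketch. The code below is not the Highlighter tool's implementation; it is a simplified stand-in that compares each sequence to a chosen reference codon by codon, using the standard genetic code, and labels each differing codon as silent or nonsilent. The three toy sequences and their names are invented for illustration.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_codons(reference, query):
    """Return [(codon_number, 'silent' | 'nonsilent')] for codons that differ from the reference."""
    marks = []
    for i in range(0, min(len(reference), len(query)) - 2, 3):
        ref_codon, qry_codon = reference[i:i + 3], query[i:i + 3]
        if ref_codon == qry_codon:
            continue
        ref_aa = CODON_TABLE.get(ref_codon)
        qry_aa = CODON_TABLE.get(qry_codon)
        if ref_aa is None or qry_aa is None:
            continue  # skip codons containing gaps or ambiguity codes
        marks.append((i // 3 + 1, "silent" if ref_aa == qry_aa else "nonsilent"))
    return marks

# Hypothetical in-frame toy sequences: "first" is unique at codon 2.
seqs = {
    "first": "ATGCTAGGAAAA",
    "second": "ATGCTGGGAAAA",
    "third": "ATGCTGGGAAGA",
}

for ref_name in ("first", "second"):
    print("reference:", ref_name)
    for name, seq in seqs.items():
        if name != ref_name:
            print("  ", name, classify_codons(seqs[ref_name], seq))
```

With "first" as the reference, every other sequence is marked at codon 2, which looks like a hot spot. With "second" as the reference, the mark at codon 2 appears only on "first", showing that the change is private to that one sequence.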

Brian Thomas Foley

A tree built from your sequences shows that there are 2 clades of Oropouche virus in your sequences, and that the viruses are not sorted into clades by year of sample.

Brian Thomas Foley

Choosing a different virus to which the others are compared in a highlighter plot again illustrates the lack of amino acid changing mutations (red) among very many synonymous mutations (green).

Brian Thomas Foley

In the second highlighter plot, the region between codons 600 and 820 stands out as apparently lacking any nucleotide changes, silent or nonsilent. Is this region protected from mutation? Or is there an RNA secondary structure or some other biological function which selects against mutations in this region? There can be many explanations for a highly conserved region that are more biologically plausible than a region being protected from mutation. In general, mutations happen at random with respect to position in a gene (although transitions happen more often than transversions, for a number of reasons), and unequal rates of evolution at different sites in a gene are a function of selection pressures.
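To check whether an apparently unchanged stretch is real across the whole alignment, rather than an impression from one plot, one can look for the longest run of invariant alignment columns. The following is a minimal sketch under simple assumptions: sequences are aligned and upper case, coordinates are nucleotide positions (divide by 3 for codons), and a column of all gaps would also count as invariant. The toy sequences are invented.

```python
def longest_invariant_run(seqs):
    """Return (start, end, length) of the longest run of identical columns, 1-based inclusive."""
    length = min(len(s) for s in seqs)
    best = (0, 0, 0)
    run_start = None
    # Iterate one position past the end so a run reaching the last column is closed.
    for i in range(length + 1):
        invariant = i < length and len({s[i] for s in seqs}) == 1
        if invariant:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            run_len = i - run_start
            if run_len > best[2]:
                best = (run_start + 1, i, run_len)
            run_start = None
    return best

# Hypothetical toy alignment.
seqs = [
    "ACGTTTTTAC",
    "ACATTTTTAG",
    "ATGTTTTTAC",
]

print(longest_invariant_run(seqs))
```

Finding such a run only locates the conserved stretch. It says nothing about why it is conserved, which is the biological question raised above.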

Brian Thomas Foley

Molecular epidemiology of Oropouche virus, Brazil.

Vasconcelos HB, Nunes MR, Casseb LM, Carvalho VL, Pinto da Silva EV, Silva M, Casseb SM, Vasconcelos PF.

Emerg Infect Dis. 2011 May;17(5):800-6. doi: 10.3201/eid1705.101333.

PMID: 21529387

http://wwwnc.cdc.gov/eid/article/17/5/10-1333-f2.htm

This paper analyzed roughly 690 bases of the nucleocapsid gene.
