I have SNVs and indels from my high throughput seq runs, how should I proceed?

11 November 2012

Let's say I ran forty cancer samples: twenty responders and twenty non-responders to some treatment. I performed alignment and annotation of SNVs and indels in each sample. Now I want to know whether any genes are differentially affected between the two groups in a way that might explain the different responses I observed in those tumors. The problem is that simple chi-square tests won't do, because a single gene can be affected by multiple potentially deleterious SNVs and indels, and I may have to bin multiple alterations to have enough power. I could therefore tag each gene as affected or unaffected by a potentially deleterious variation and proceed with my analyses that way. But how do I know whether a "potentially" deleterious variation really is deleterious? It's impossible to validate all of them biologically. Alternatively, I could restrict my analysis to variations that recur in at least two different samples, but then again, some genes can be altered at many positions, each position hit only once across several samples, and I would overlook them.

Finally, I could consider pathways rather than genes, and see whether some pathways are hit by deleterious alterations in responder tumors but not in non-responders, or vice versa. But pathways are not as well defined as one might like to think, and there is a risk of including deleterious variations that have nothing to do with the underlying biology of my samples. It's more philosophy than statistics here. What would you do? What do you recommend?
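To make the gene-level binning idea concrete, here is a minimal Python sketch, not a recommended pipeline. It assumes each sample has already been reduced to the set of genes carrying at least one candidate deleterious variant, and it compares responders with non-responders using a two-sided Fisher's exact test per gene. The function names, the group labels and the gene names in the usage note are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def gene_burden(calls, groups):
    """Per-gene p-values for affected/unaffected status versus response.

    calls:  sample -> set of genes with >= 1 candidate deleterious variant
    groups: sample -> "responder" or "non_responder" (assumed labels)
    """
    n_resp = sum(1 for g in groups.values() if g == "responder")
    n_non = len(groups) - n_resp
    genes = set().union(*calls.values()) if calls else set()
    results = {}
    for gene in sorted(genes):
        hit_resp = sum(1 for s, gs in calls.items()
                       if gene in gs and groups[s] == "responder")
        hit_non = sum(1 for s, gs in calls.items()
                      if gene in gs and groups[s] == "non_responder")
        results[gene] = fisher_exact_two_sided(
            hit_resp, n_resp - hit_resp, hit_non, n_non - hit_non)
    return results
```

For example, `gene_burden({"s1": {"GENE_A"}, "s2": set()}, {"s1": "responder", "s2": "non_responder"})` returns one p-value per gene. With twenty samples per group and thousands of genes, these p-values would still need a multiple-testing correction before any of them could be interpreted.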

Hannah Carter

You can get a lot of information about your mutations from the CRAVAT web server (www.cravat.us). The server hosts some machine learning-based tools for the analysis of missense mutations detected in tumor sequencing. This includes functional analysis (is the protein affected?) as well as a cancer-specific method (does the mutation look more like a driver or a passenger?). These methods return a continuous score between 0 and 1 that can be used to prioritize mutations, as well as p-value and FDR estimates to help you select a cutoff for deciding which missense mutations to keep. CRAVAT also annotates all of your variants with dbSNP IDs, allele frequencies for the 1000 Genomes and ESP6500 populations, and overlap with the COSMIC database, and it provides gene functional annotations from the GeneCards database and the results of a PubMed search for each mutated gene.

If you want to look for more fundamental differences between the groups, you might want to read "Mutational processes molding the genomes of 21 breast cancers" from Mike Stratton's group (Cell 2012). They look at differences in the processes underlying mutation among 21 individuals with breast cancer.
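As a generic illustration of how an FDR estimate can be used to pick a cutoff, the sketch below applies the Benjamini-Hochberg adjustment to a list of per-mutation p-values. This is an assumption made for illustration only: it is not a description of how CRAVAT computes its FDR, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def keep_below_fdr(p_values, fdr=0.1):
    """Indices of mutations whose adjusted p-value is at or below the cutoff."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(adjusted) if q <= fdr]
```

Calling `keep_below_fdr(p_values, fdr=0.1)` returns the positions of the mutations that would be kept at a 10% FDR; the cutoff itself is a judgment call.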

Benedetta Izzi

To be honest, there's a much bigger problem in your setting. If you do not have normal tissue samples to compare with, how are you going to correct your data for possible population biases? Without them, it will be impossible to identify which variations are truly significantly different.

Michal Okoniewski

Ingenuity has just launched a tool for genotyping and pathway-related analysis of variants. I played with it today and it seems to work nicely. They plan to charge quite a bit of money for it, but there is a chance of a limited open trial version.

Zoppoli Gabriele

Benedetta, why should I have population biases if all my patients come from the same region and are of the same ethnicity? I assume you mean that having matched healthy tissue matters for filtering germline variants from somatic ones.

Zoppoli Gabriele

Michal, will it come with the same license as IPA does?

Jonathan C Strefford

The first question is: are you investigating germ-line variation or somatically acquired variation? Either or both is fine, but somatic variation will be easier, as you can exclude germ-line variation using your matched non-cancer tissue. Analysis of variation in your germ-line material will require alignment to some kind of reference genome, and you will get a huge number of potential variants from this analysis.

This is a massive field, but briefly:

Assuming you have suitable read depth, you can filter your variants with some type of depth filter and by requiring that the variant occurs on both DNA strands. It is worth eyeballing your variants in software such as IGV to see how they look in the context of your read depth. If you are looking at somatic variants, you can calculate a somatic p-value based on the read depth in your germ-line and somatic tissue; this can be useful for prioritising candidate genes.
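A rough Python sketch of these two steps follows. The thresholds (`min_depth=20`, `min_alt_per_strand=2`) and the `VariantCall` fields are illustrative assumptions, not recommended values, and the somatic p-value here is one possible choice: a one-sided Fisher's exact test asking whether variant-supporting reads are enriched in the tumour relative to the matched normal.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class VariantCall:
    depth: int        # total reads covering the site
    alt_forward: int  # variant-supporting reads on the forward strand
    alt_reverse: int  # variant-supporting reads on the reverse strand


def passes_filters(call, min_depth=20, min_alt_per_strand=2):
    """Keep a call only if it is deep enough and seen on both strands."""
    return (call.depth >= min_depth
            and call.alt_forward >= min_alt_per_strand
            and call.alt_reverse >= min_alt_per_strand)


def somatic_p_value(tumour_ref, tumour_alt, normal_ref, normal_alt):
    """One-sided Fisher's exact test on read counts.

    Small values suggest the variant allele is enriched in the tumour
    reads compared with the matched normal reads.
    """
    tumour_total = tumour_ref + tumour_alt
    normal_total = normal_ref + normal_alt
    alt_total = tumour_alt + normal_alt
    denom = comb(tumour_total + normal_total, alt_total)
    upper = min(tumour_total, alt_total)
    # Probability of at least tumour_alt variant reads landing in the tumour
    tail = sum(comb(tumour_total, k) * comb(normal_total, alt_total - k)
               for k in range(tumour_alt, upper + 1))
    return tail / denom
```

For instance, `somatic_p_value(10, 10, 20, 0)` is small because all ten variant reads come from the tumour, whereas equal variant fractions in tumour and normal give a large p-value, which is what a germ-line variant would look like.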

Then you should establish whether the variants fall within genes and whether they are non-synonymous, which gives an insight into whether they are affecting gene function. You should also check whether the variants result in a frame-shift or the creation of a stop codon; again, this gives insight into functional consequences. SAMtools can do this kind of stuff, but there are plenty of other packages.

You can filter your data against the 1000 Genomes Project (or dbSNP); if you're looking at cancer genes, you can also look for ones in COSMIC.
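As a sketch of what such a filter might look like once the variants have been annotated, the Python below drops variants that are common in a population panel and moves COSMIC hits to the top. The dictionary keys (`pop_af`, `in_cosmic`), the 1% frequency threshold, and the choice to retain COSMIC variants regardless of frequency are all assumptions made for illustration.

```python
def filter_against_population(variants, max_pop_af=0.01):
    """Drop common population variants and rank the remainder.

    variants: list of dicts with hypothetical keys
      "pop_af"    - population allele frequency, or None if never observed
      "in_cosmic" - True if the variant overlaps a COSMIC entry
    """
    kept = []
    for v in variants:
        af = v.get("pop_af")
        in_cosmic = v.get("in_cosmic", False)
        # Keep COSMIC hits, variants absent from the panel, and rare variants
        if in_cosmic or af is None or af <= max_pop_af:
            kept.append(v)
    # COSMIC hits first, then from rarest to most common
    return sorted(kept, key=lambda v: (not v.get("in_cosmic", False),
                                       v.get("pop_af") or 0.0))
```

Without matched normal tissue, a frequency filter like this only removes common germ-line polymorphisms; rare germ-line variants will still pass.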

There are also methods that predict the potential impact of a non-synonymous variant, such as SIFT, PolyPhen and the Grantham score. Certain genes are also highly mutable in normal individuals, and you could consider removing them; there is a paper by Fuentes (I cannot remember the reference) that defined these genes.

Then you can look at recurrently targeted genes and pathways.

This might help you prepare a list of genes that are likely to be real based on your NGS data, and then prioritise them based on potential effects. There are many, many things that you can do, but hopefully this helps.

Benedetta Izzi

Yes, I did mean that, and I think Jon Strefford gave you a complete explanation. I'm not a cancer person, but there's been a lot of discussion in my lab, and in the cancer community here at Harvard in general, about the need to include healthy tissues in cancer profiling studies. I'm an epigeneticist, and in this field this is a must for each of our cancer studies in the lab. I suppose it is not exactly the same in genetics, but I would try to get as much as I can, and Jon suggested some interesting approaches to do that. Thank you for replying; I'm always interested in learning new stuff, especially from people working in different fields.

Good luck!

Jonathan C Strefford

Hannah, CRAVAT looks great and really powerful, thank you for the introduction. I have passed the details on to my group.

Zoppoli Gabriele

Thank you Hannah, really interesting tool!

Michele Rubini

I would also suggest having a look at PriVar, a toolkit for prioritizing SNVs and indels from next-generation sequencing data. An executable jar package is available at http://paed.hku.hk/uploadarea/yangwl/html/software.html

Julian Gough

You might find FATHMM useful (http://supfam2.cs.bris.ac.uk/FATHMM/); it even includes cancer-specific options. It is also connected to dcGO (http://supfam.org/SUPERFAMILY/dcGO/), which can give you pathway and other useful ontology information.
