Pre-processing reads from Ion Torrent technology

06 June 2013

We have recently been involved in the analysis of sequencing data coming from Ion Torrent technology. We had previous experience in biological data analysis such as that coming from microarrays or mass spectrometry, but no experience with sequencing data.

After searching for information and "playing" a little with the data, we have seen that pre-processing the generated reads is a critical step before any downstream analysis. Although there are many tools and approaches, it is still not clear to me which pre-processing steps are the most appropriate. I guess the 'strictness' of the trimming and filtering should depend on the downstream analysis, but what are the criteria to trim/filter the data? What other steps (error correction, maybe) are worth considering? To what extent are these steps technology-independent (or, more interestingly, which ones depend on the technology, and how)?

Issues that I know or guess are relevant (in the particular case of Ion Torrent data, though they most likely appear in other technologies too) are:

Quality of the reads - Quality (obviously) decreases along the length of the read. Trimming the 3' tail is a good way to increase the average quality of the read, but I'm not sure it is enough. Are there other approaches beyond the simplistic one of cutting at the first position whose quality falls below a given threshold? I'm also concerned about the distribution of quality values along the reads, which seems highly variable (high quality values followed by low ones). This may have to do with the next issue.

Influence of homopolymers - I assume that all technologies have problems with homopolymers, but to what extent? What are good criteria for filtering on this basis (total number of homopolymers? maximum length?)?

Contamination with adapters/barcodes - In my particular data (although I have also observed it in public data), I would say that the first 10-15 bases of some reads show evidence of contamination from incorrectly removed adapters and/or barcodes. I have seen tools that perform this 'cleansing' step for data from other technologies, but what about Ion Torrent data? If there is no such tool, any suggestions on how we could use the adapter sequences to identify and remove this contamination?
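Two of the issues listed above, quality trimming and leftover adapter bases at the 5' end, can be illustrated with a minimal Python sketch. It is not taken from any Ion Torrent tool: the function names, the window size, the quality cutoff, the minimum overlap and the adapter sequence in the usage note are all hypothetical values chosen for illustration. The adapter check only counts mismatches, so it would miss leftovers that contain insertions or deletions.

```python
def trim_3prime_sliding_window(quals, window=10, min_mean_q=20):
    """Return how many bases to keep, counted from the 5' end.

    Instead of cutting at the first single low-quality base, slide a
    window along the read and cut at the start of the first window whose
    mean Phred quality is below min_mean_q. Averaging makes the cut less
    sensitive to isolated dips in an otherwise good region.
    All defaults are illustrative assumptions, not recommended values.
    """
    n = len(quals)
    if n < window:
        # Read shorter than one window: keep or drop it as a whole.
        return n if n and sum(quals) / n >= min_mean_q else 0
    for start in range(n - window + 1):
        if sum(quals[start:start + window]) / window < min_mean_q:
            return start
    return n


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def strip_5prime_adapter(seq, adapter, min_overlap=5, max_mismatches=1):
    """Remove a leftover 3' end of `adapter` found at the start of `seq`.

    Tries the longest possible overlap first and stops at min_overlap.
    Short overlaps combined with allowed mismatches can match by chance,
    so both parameters (hypothetical defaults here) need tuning.
    """
    for k in range(min(len(adapter), len(seq)), min_overlap - 1, -1):
        if hamming(seq[:k], adapter[-k:]) <= max_mismatches:
            return seq[k:]
    return seq
```

For example, with a made-up adapter `ACGTACGTGAT`, `strip_5prime_adapter("GTGATTTTTCCCCGGGG", "ACGTACGTGAT")` removes the leading `GTGAT`, and `trim_3prime_sliding_window([30] * 20 + [10] * 10, window=5)` keeps the first 18 bases. In practice the trimmed length would be applied to both the sequence and its quality string.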

To sum up, I need feedback about pre-processing steps (trimming, filtering and any other method) for sequencing data, particularly for data coming from Ion Torrent technology.

Tamás Kovács

Our institute uses Roche 454. We usually do de novo sequencing, so for us there are no general rules on how much weight to give to, e.g., homopolymers in quality control. Trimming and filtering depend on the particular run and are influenced by many factors (e.g. number of DNA molecules per bead, quality of the beads, quality of the template DNA and so on). I can suggest a pipeline; unfortunately it is specific to Roche, but after the first several steps, which are performed by the Roche software, CLC Genomics Workbench could also be used for Ion Torrent. This software can import reads (including Ion Torrent reads) and lets you perform quality control. I am uploading part of my latest NGS bioinformatics presentation as a PDF; I hope it will be useful. The homopolymer problem is platform-dependent: with Roche we can trust homopolymers of up to 6-8 bases.
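A filter on maximum homopolymer length, one of the criteria raised in the question, could be sketched as follows in Python. The function names are hypothetical, and the default cutoff of 8 simply reuses the upper end of the 6-8 range quoted above for Roche; whether that value suits Ion Torrent data is an assumption that would need checking per run.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Length of the longest run of identical bases in seq (0 if empty)."""
    return max((sum(1 for _ in run) for _, run in groupby(seq)), default=0)


def passes_homopolymer_filter(seq, max_run=8):
    """True if no homopolymer run in seq is longer than max_run.

    max_run=8 is an illustrative default, not a validated threshold.
    """
    return longest_homopolymer(seq) <= max_run
```

Flagging reads rather than discarding them may be preferable when the long runs are real features of the genome and not sequencing errors.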

Krishnanand Padmanahan

Along similar lines, I would like to know what software could be used to understand the probability distribution of genetic variants in next-generation sequencing data from an Ion Torrent machine. The software that Ion Torrent provides can only list the variants; I didn't find a feature for quantifying the mutations. I wish to understand how the mutation spectrum changes during cancer progression (i.e. predominant versus rare mutations at different stages of cancer) in a single individual.
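One simple way to quantify a listed variant is its allele frequency: the number of reads supporting the variant divided by the read depth at that position. The Python sketch below assumes such counts have already been extracted per disease stage by some other tool. The function name, the input layout and the variant labels are hypothetical, and the sketch does not call any Ion Torrent software.

```python
def allele_frequencies_by_stage(counts):
    """Convert read counts into variant allele frequencies.

    counts: {stage: {variant: (alt_reads, depth)}}
    returns: {variant: {stage: frequency}}, with None where depth is 0.
    The input layout is an assumption made for this example.
    """
    result = {}
    for stage, variants in counts.items():
        for name, (alt_reads, depth) in variants.items():
            freq = alt_reads / depth if depth > 0 else None
            result.setdefault(name, {})[stage] = freq
    return result
```

With made-up counts such as `{"early": {"variant_A": (10, 200)}, "late": {"variant_A": (120, 200)}}`, the function reports `variant_A` at 0.05 in the early sample and 0.6 in the late one. Frequencies close to the platform's error rate cannot be told apart from sequencing errors without further modelling.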

Joana Séneca

Hello,

I started my NGS data analysis using the QIIME pipeline. The website has very easy-to-follow tutorials, and you end up understanding why you do what you do in each pre-processing step. With 454 technology in particular, you have to account for the known sequencing errors (homopolymers) and, depending on your dataset, decide whether or not to denoise it.

I recently started to analyse data produced by Ion Torrent, and I didn't have a clue about the pre-processing steps or whether I should pay attention to particular sequencing errors. I found that the Brazilian Microbiome Project has a nice pipeline for 16S rRNA profiling with Ion Torrent technology, which combines QIIME and UPARSE.

Here's the link: http://www.brmicrobiome.org/#!16sprofilingpipeline/cuhd

I'm still struggling with my own data, but let me know if it is helpful for your data.
