We have recently started analysing sequencing data from Ion Torrent technology. We have previous experience with other kinds of biological data, such as microarrays or mass spectrometry, but none with sequencing data.
After searching for information and "playing" a little with the data, we have seen that pre-processing the generated reads is a critical step before any downstream analysis. Although there are many tools and approaches, it is still not clear to me which pre-processing steps are the most appropriate. I guess the 'strictness' of the trimming and filtering should depend on the downstream analysis, but what are the criteria for trimming and filtering the data? What other steps (error correction, maybe) are worth considering? To what extent are these steps technology-independent (or, more interestingly, which ones depend on the technology, and how)?
Issues that I know or suspect are relevant (for Ion Torrent data in particular, though they most likely appear in other technologies too) are:
Quality of the reads - Quality (obviously) decreases along the length of the read. Trimming the 3' tail is a good way to increase the average quality of the read, but I'm not sure it is enough. Are there other approaches beyond the simplistic one of cutting at the first position whose quality falls below a given threshold? I'm also concerned about the distribution of quality values along the reads, which seems highly variable (high quality values followed by low ones). This may be related to the next issue.
Influence of homopolymers - I assume that all technologies have problems with homopolymers, but to what extent? What are good criteria for filtering reads on this basis (total number of homopolymers? maximum homopolymer length?)
Contamination with adapters/barcodes - In my own data (although I have also observed it in public data) I would say that the first 10-15 bases of some reads show evidence of contamination with incorrectly removed adapters and/or barcodes. I have seen tools that perform this 'cleansing' step for data from other technologies, but what about Ion Torrent data? If there is no such tool, any suggestions on how we could use the adapter sequences to identify and remove this contamination?
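To make the three issues above concrete, here is a rough sketch of the kind of per-read checks I have in mind. It is only an illustration: every function name, the thresholds (quality 20, window of 5, homopolymer length 4, one mismatch) and the adapter sequence are placeholders I made up, not values taken from any Ion Torrent tool or kit, and it assumes quality values have already been decoded to integers.

```python
import re


def trim_first_low(quals, threshold=20):
    """Naive 3' trim: keep bases up to the first one below `threshold`."""
    for i, q in enumerate(quals):
        if q < threshold:
            return i
    return len(quals)


def trim_sliding_window(quals, threshold=20, window=5):
    """Keep bases up to the first window whose MEAN quality is below
    `threshold`, so a single isolated low-quality base is tolerated."""
    if len(quals) < window:
        # Read shorter than one window: judge it by its overall mean.
        ok = bool(quals) and sum(quals) / len(quals) >= threshold
        return len(quals) if ok else 0
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < threshold:
            return i
    return len(quals)


def homopolymer_stats(seq, min_len=4):
    """Return (number of runs with length >= min_len, longest run length)."""
    runs = [m.group(0) for m in re.finditer(r"(.)\1*", seq)]
    long_runs = [r for r in runs if len(r) >= min_len]
    return len(long_runs), max((len(r) for r in runs), default=0)


def strip_5prime_adapter(seq, adapter, min_overlap=5, max_mismatch=1):
    """Remove an adapter remnant from the start of a read.

    Assumes an incompletely removed adapter leaves its 3' end (a suffix of
    `adapter`) at the 5' end of the read; tries the longest suffix first.
    """
    for k in range(len(adapter), min_overlap - 1, -1):
        head = seq[:k]
        if len(head) < k:
            continue  # read is shorter than this candidate remnant
        mismatches = sum(a != b for a, b in zip(adapter[-k:], head))
        if mismatches <= max_mismatch:
            return seq[k:]
    return seq


if __name__ == "__main__":
    quals = [30, 30, 10, 30, 30, 30, 30, 8, 7, 6, 5]
    print(trim_first_low(quals), trim_sliding_window(quals))
    print(homopolymer_stats("ACGTTTTTAAAAGG"))
    # "GATTCCAGTC" is a made-up adapter, used only for this example.
    print(strip_5prime_adapter("CAGTCTTGGAACCTT", "GATTCCAGTC"))
```

On the toy quality vector, the naive cut stops at the isolated low-quality base at position 2, while the windowed version keeps going until the quality really collapses. That difference is exactly what I am unsure about: is a windowed criterion like this reasonable, and are hard cut-offs on homopolymer count or length sensible at all?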
To sum up, I would appreciate feedback on pre-processing steps (trimming, filtering and any other method) for sequencing data, particularly data from Ion Torrent technology.