I have adapter contamination inside my reads. Parts of the adapter sequences are found a few bases upstream or downstream of the 3' or 5' ends of the reads, respectively. How should I deal with such reads? I'm looking for your suggestions.
There are tools to process these reads, such as FastX, SeqQC, and SeqPrep. There are two kinds of adapter contamination. 1. When the fragment is shorter than the sequencing length, a few bases of adapter are read at the end (this should happen at the 3' end of the read). 2. Adapter/primer dimers can form during PCR, which results in reads that are full of adapter and contain no actual genomic sequence. How much either of these affects your results depends on the type of project. When contamination is at the 3' end, it can be just a few bases of the adapter, whereas if it is at the 5' end it should be the full adapter.
PS: If your data is paired end, it is easy to find the adapters more accurately using SeqPrep.
For example, suppose the sequencing length is 100 bp but the library contains a few fragments of 95 bp. When the sequencer reads those fragments, it reads 95 bp of genomic sequence and then continues into 5 bp of adapter at the end, because the sequencing length is 100 bp. The same applies if the fragment is 70 bp: you will end up with 30 bp of adapter in your read. If instead the adapter starts at the 5' end (mostly the case for adapter/primer dimers), you read only adapter: the whole adapter (~32 bp), followed by another adapter and poly-A until the read reaches 100 bp.
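The arithmetic in this example can be written down as a small Python sketch. It is only an illustration: the function names and the `ADAPTER` string are assumptions made for the example (substitute the adapter from your own library prep), and the trimming step uses exact matching only, whereas real trimmers such as cutadapt also tolerate mismatches.

```python
# Hypothetical adapter prefix used only for illustration; replace it with
# the adapter sequence from your own library preparation kit.
ADAPTER = "AGATCGGAAGAGC"


def adapter_bases_read(fragment_len: int, read_len: int) -> int:
    """Adapter bases expected at the 3' end when the fragment is shorter than the read."""
    return max(0, read_len - fragment_len)


def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Clip a full adapter anywhere in the read, or an adapter prefix at the 3' end.

    Exact matches only. A small min_overlap removes more true adapter but
    also clips more genomic bases that match the adapter by chance.
    """
    # Case 1: the complete adapter is present; drop it and everything after it.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the first k bases of the adapter fit before the read ended.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


if __name__ == "__main__":
    print(adapter_bases_read(95, 100))  # 95 bp fragment, 100 bp read
    print(adapter_bases_read(70, 100))  # 70 bp fragment, 100 bp read
    print(trim_3prime_adapter("ACGTACGTAC" + ADAPTER[:5], ADAPTER))
```

The `min_overlap` parameter is where the trade-off between lost reads and accuracy shows up for single-end data.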
Some level of adapter-adapter dimer contamination is common in some procedures (especially prep procedures with low or variable input levels), so it is important to run quality control prior to sequencing, such as Bioanalyzer or gel analysis, which can rapidly tell you the level of contamination in your libraries (you will see a discrete set of bands at the dimer sizes). If it is more than 1% of the library, it may be better to re-prep the library.
Once the data has been generated, contaminants can be filtered out with awk or something similar, and in any case most of the contaminant reads will not map to the mouse or human genomes. Also, most mapping programs have an option to use only part of the read if this is an issue (e.g., --trim5 in Bowtie). However, depending on the level of contamination, excess contaminants can bias the base composition, which can throw off basecalling for all the reads in a run. There are various approaches to solving this, such as re-running basecalling using the metrics from another lane within the same run.
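As a rough illustration of the filtering step, here is a minimal Python sketch that drops reads which look like adapter dimers, meaning the full adapter starts at or very near the 5' end. The function names, the `ADAPTER` string and the `max_offset` threshold are assumptions made for this example, not part of any existing tool, and the parser assumes plain four-line FASTQ records.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical adapter used only for illustration.
ADAPTER = "AGATCGGAAGAGC"

# (header, sequence, separator, qualities)
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into four-line FASTQ records (assumes no wrapped or blank lines)."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over four references to the same iterator yields consecutive 4-tuples.
    return zip(*[iter(stripped)] * 4)


def is_adapter_dimer(seq: str, adapter: str, max_offset: int = 3) -> bool:
    """True if the full adapter begins within max_offset bases of the 5' end."""
    pos = seq.find(adapter)
    return 0 <= pos <= max_offset


def filter_dimers(records: Iterable[Record], adapter: str,
                  max_offset: int = 3) -> Iterator[Record]:
    """Yield only the records that do not look like adapter dimers."""
    for record in records:
        if not is_adapter_dimer(record[1], adapter, max_offset):
            yield record


if __name__ == "__main__":
    example = [
        "@read1", "ACGTACGTACGTACGTACGT", "+", "IIIIIIIIIIIIIIIIIIII",
        "@read2", ADAPTER + "AAAAAAA", "+", "IIIIIIIIIIIIIIIIIIII",
    ]
    for header, seq, _, _ in filter_dimers(read_fastq(example), ADAPTER):
        print(header, seq)
```

For a real file, `read_fastq` could be fed an open file handle instead of the in-memory list.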
What I'm interested in is: how can you be confident that every subsequence consistent with the adapter is really contamination, especially in the "inside the read" case?
In my opinion, target sequences are likely to contain subsequences similar to the adapter.
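To put rough numbers on this concern, the sketch below estimates how often a genomic read would match the first k bases of an adapter purely by chance. It assumes uniformly random, independent bases (each with probability 1/4), which real genomes do not satisfy, so treat the results as order-of-magnitude guidance only. The function names are made up for this example.

```python
import math


def p_end_match(k: int) -> float:
    """Chance that the last k bases of a random read equal a given k-mer."""
    return 0.25 ** k


def p_match_anywhere(k: int, read_len: int) -> float:
    """Approximate chance that a given k-mer occurs at least once in a random read.

    Treats the read_len - k + 1 start positions as independent, which is
    an approximation. log1p/expm1 keep precision when the probability is tiny.
    """
    positions = read_len - k + 1
    if positions <= 0:
        return 0.0
    return -math.expm1(positions * math.log1p(-p_end_match(k)))


if __name__ == "__main__":
    for k in (1, 5, 10, 32):
        print(k, p_end_match(k), p_match_anywhere(k, 100))
```

Under these assumptions, a 1 to 5 base match at the 3' end is common by chance, while a full-length adapter in the middle of a read is very unlikely to be genomic.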
As I said before, if the data is paired-end (Illumina), you can do this accurately, because adapter contamination in paired-end data follows a pattern: both reads have adapter sequence of the same length at the end. If the data is single-end, it is a bit messier to do accurately, and you will have a trade-off between number of reads and accuracy. There is no possibility to have adapter contaminations like this . Also, I have heard that Illumina designed the adapters to be distinct from genome sequences (correct me if I am wrong). Still, a subset of the adapter sequence could have homology to genome sequences, as you said. But when we check for a subset of the adapter (e.g. 10 bp), we expect it at the 3' end only, which increases the probability that it is actual adapter. When it occurs in the middle of a read, we expect the full adapter (32 bp), which is distinct from genome sequence. Very short adapter contaminations (1-5 bp) are very tricky. They cannot easily be captured (in single-end data), and they may align without trouble, leading to false-positive variants.
To give some useful answers, it would help to know what data you're actually trying to analyze and what problems you're facing. What library are your reads based on, and what technology was used to sequence them? Is the adapter contamination only related to sequencing adapters (e.g. Illumina adapters), or do you also have issues with e.g. amplification primers/adapters? And what is your definition of contamination 'inside a read'? Could you give an example?
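The paired-end pattern described above can be sketched in Python as follows. When the fragment is shorter than the read length, the first L bases of read 1 are the reverse complement of the first L bases of read 2 (L being the fragment length), and everything after position L in both reads is adapter. This is a simplified, exact-match illustration with made-up function names, not how SeqPrep is implemented; real tools allow mismatches and use base qualities.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def infer_fragment_length(read1: str, read2: str, min_len: int = 10) -> Optional[int]:
    """Return the fragment length L implied by the mate overlap, or None.

    Only lengths shorter than the reads are tested, since those are the
    cases in which the reads run into adapter. Exact matching only.
    """
    max_len = min(len(read1), len(read2))
    for length in range(max_len - 1, min_len - 1, -1):
        if read1[:length] == reverse_complement(read2[:length]):
            return length
    return None


def trim_pair(read1: str, read2: str, min_len: int = 10) -> tuple:
    """Clip both mates to the inferred fragment length, if one is found."""
    length = infer_fragment_length(read1, read2, min_len)
    if length is None:
        return read1, read2
    return read1[:length], read2[:length]


if __name__ == "__main__":
    # Hypothetical 22 bp fragment and adapter, sequenced with 30 bp reads.
    fragment = "ACGTTGCAAGGCTTAACCGGTA"
    adapter = "AGATCGGAAGAGC"
    r1 = (fragment + adapter)[:30]
    r2 = (reverse_complement(fragment) + adapter)[:30]
    print(infer_fragment_length(r1, r2))
    print(trim_pair(r1, r2))
```

Because the evidence comes from the two mates agreeing with each other rather than from the adapter sequence itself, this approach can find even very short adapter tails that single-end matching cannot call reliably.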
I use cutadapt - it finds the adaptor or any part of it, then clips it and also discards anything downstream. It seems really flexible and powerful and it is easy to use and also generates a nice report. http://code.google.com/p/cutadapt/
Thanks to all for your suggestions. By 'inside a read' I mean that I have found traces of adapter either a few bases downstream of the 5' end or a few bases upstream of the 3' end. I would really like to know whether it is possible, due to the reaction chemistry, for traces of adapter sequence to exist within the reads but not at the ends.
We don't see cases where traces of adapter sequence exist in the middle of the reads rather than at the ends. More likely there are poorly size-selected fragments or adapter-adapter or adapter-primer contamination, for which it's better to do a library clean-up or size selection first. Refer to https://www.nvigen.com/dna-clean-up/ or https://www.nvigen.com/dna-size-selection-tutorial/.