What is the strategy to deal with adapter contamination inside the reads but not at either of the ends (3' or 5')?

Mohamed Ashick

There are tools to process these reads, such as FastX, SeqQC, and SeqPrep. There are two kinds of adapter contamination:

1. When the fragment is shorter than the sequencing length, a few bases of adapter are read at the end (this should happen at the 3' end of the read).
2. Adapter/primer dimers can form during PCR, which results in reads that are full of adapter and contain no actual genomic sequence.

Both of these affect your results, depending on the type of project. When contamination occurs at the 3' end, it can be only a few bases of the adapter, whereas when it occurs at the 5' end it should be the full adapter.

PS: If your data is paired-end, it is easy to find the adapters more accurately using SeqPrep.

Sourav Nayak

Thanks a lot. Could you elaborate on the last point, i.e. why we expect different contamination lengths at the 3' end and the 5' end?

For example, suppose the sequencing length is 100 bp but a few fragments in the library are only 95 bp. When the sequencer reads those fragments, it reads 95 bp of genomic sequence and then continues into 5 bp of adapter at the end, because the sequencing length is 100 bp. The same applies to a 70 bp fragment: you end up with 30 bp of adapter in your read. If the adapter starts at the 5' end instead (mostly the case with adapter/primer dimers), you read only adapter: the whole adapter (~32 bp), followed by another adapter and poly 'A's until the read reaches 100 bp.
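
The arithmetic in this example can be written down as a small sketch. The function name `adapter_bases_in_read` is hypothetical and only illustrates the read-length versus fragment-length reasoning above; it is not part of any tool mentioned in this thread.

```python
def adapter_bases_in_read(fragment_len: int, read_len: int) -> int:
    """Return how many non-genomic bases end up at the 3' end of a read.

    If the fragment (insert) is shorter than the read length, the
    sequencer runs past the insert and into the adapter.
    """
    return max(0, read_len - fragment_len)


# The two cases from the example above, plus a fragment longer than the read.
for fragment_len in (95, 70, 150):
    n = adapter_bases_in_read(fragment_len, read_len=100)
    print(f"fragment {fragment_len} bp -> {n} adapter bases in a 100 bp read")
```

A fragment length of 0 corresponds to the adapter/primer dimer case, where the whole read is non-genomic.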

Hamid Ashrafi

If you use CLC Genomics Workbench, you can discard reads in which an adapter is found.

Stephen Ayers

Some level of adapter-adapter dimer contamination is common in some procedures (especially prep procedures with low or variable input levels), so it is important to run quality control before sequencing, such as Bioanalyzer or gel analysis, which can rapidly tell you the level of contamination in your libraries (you will see a discrete set of bands at the dimer sizes). If it is more than 1% of the library, it may be better to re-prep the library.

Once the data has been generated, contaminants can be filtered out with awk or something similar, and in any case most of the contaminant reads will not map to the mouse or human genomes. Also, most mapping programs have an option to use only part of the read if this is an issue (e.g., --trim5 in Bowtie). However, depending on the level of contamination, excess contaminants can bias the base composition, which can throw off basecalling for all the reads in a run. Various approaches can address this, such as re-running basecalling using the metrics from another lane within the same run.
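
As a rough illustration of the "awk or something similar" filtering step, here is a minimal Python sketch that drops reads which begin with the adapter, which is the typical signature of adapter dimers. The adapter string, the function names, and the `window` parameter are assumptions for illustration; substitute the adapter from your own library prep. It assumes well-formed four-line FASTQ records and exact matching, so reads with sequencing errors in the adapter would slip through.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: placeholder adapter prefix; use the sequence from your own kit.
ADAPTER = "AGATCGGAAGAGC"

Record = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def drop_dimer_reads(records: Iterable[Record],
                     adapter: str = ADAPTER,
                     window: int = 5) -> Iterator[Record]:
    """Skip reads where the full adapter starts within `window` bases of the 5' end."""
    for header, seq, qual in records:
        pos = seq.find(adapter)
        if 0 <= pos <= window:
            continue  # read is essentially all adapter: discard it
        yield header, seq, qual
```

Usage would look like `kept = drop_dimer_reads(parse_fastq(open("reads.fastq")))`, where the file name is hypothetical.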

Chongzhi Wang

What I'm interested in is: how can you be sure that every subsequence matching the adapter is actually contamination, especially in the "inside the read" case?

In my opinion, target sequences are likely to contain subsequences similar to the adapter.

As I said before, if the data is paired-end (Illumina), you can do this accurately, because adapter contamination follows a pattern in paired-end data: both reads will have the same adapter sequence of the same length at the end. With single-end data it is a bit messier to do accurately, and you will have a trade-off between the number of reads kept and accuracy. There is no possibility to have adapter contaminations like this. Also, I have heard that Illumina designed its adapters to be unique relative to genome sequences (correct me if I am wrong). Still, a subset of the adapter sequence could have homology to genome sequences, as you said. But when we check for a subset of the adapter (e.g. 10 bp), we expect it at the 3' end only, which increases the probability that it is actual adapter. When it occurs in the middle of the read, we expect the full adapter (32 bp), which is unique relative to the genome sequence. Adapter contamination of only a few bases (1-5 bp) is very tricky. It cannot easily be captured (in single-end data), and the read may still align well, leading to false-positive variants.
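
The paired-end pattern described here can be sketched as follows. When the insert is shorter than the read, the first `L` bases of read 1 are the reverse complement of the first `L` bases of read 2, and everything after position `L` in both reads is adapter. This is a simplified illustration under the assumption of exact matches with no sequencing errors; it is not the algorithm used by SeqPrep or any other tool, and the function names and `min_insert` parameter are hypothetical.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string made of A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def infer_insert_length(read1: str, read2: str,
                        min_insert: int = 10) -> Optional[int]:
    """Return the insert length L if the pair shows read-through, else None.

    Read-through means read1[:L] equals the reverse complement of
    read2[:L] for some L shorter than the read length. The remaining
    bases in both reads are then adapter, and they have the same length.
    """
    n = min(len(read1), len(read2))
    for length in range(n - 1, min_insert - 1, -1):
        if read1[:length] == revcomp(read2[:length]):
            return length
    return None


# Toy pair: a 30 bp insert sequenced with 40 bp reads.
insert = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
adapter_part = "AGATCGGAAG"  # assumption: placeholder adapter bases
r1 = insert + adapter_part
r2 = revcomp(insert) + adapter_part
length = infer_insert_length(r1, r2)
print(length, len(r1) - length)
```

Because both mates must agree on the same insert length, a chance match of a short adapter-like subsequence in one read alone does not trigger trimming, which is why the paired-end case is more reliable than the single-end case.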

Sebastian Jünemann

To give some useful answers, it would help to know what data you're actually trying to analyze and what problems you're facing. What library are your reads based on, and what technology was used to sequence them? Is the adapter contamination related only to sequencing adapters (e.g. Illumina adapters), or do you also have issues with, e.g., amplification primers/adapters? And what is your definition of contamination 'inside a read'? Could you give an example?

Christopher J Cowled

I use cutadapt. It finds the adapter or any part of it, then clips it and also discards anything downstream. It seems really flexible and powerful, it is easy to use, and it generates a nice report. http://code.google.com/p/cutadapt/

Thanks to all for your suggestions. By "inside a read" I mean that I have found traces of adapter sequence either a few bases downstream of the 5' end or a few bases upstream of the 3' end. I would really like to know whether the reaction chemistry could cause traces of adapter sequence to exist within the reads but not at the ends.
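
To make the "find the adapter or any part of it, then clip it and anything downstream" behaviour concrete, here is a simplified Python sketch. It is not cutadapt's implementation (cutadapt uses error-tolerant alignment); this version assumes exact matches only, and the function name, the adapter string, and the `min_overlap` parameter are assumptions for illustration.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Clip a 3' adapter, full or partial, and everything after it.

    Exact matching only: a simplified sketch, not an error-tolerant aligner.
    """
    # Case 1: the full adapter occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]

    # Case 2: the read ends partway through the adapter, so a prefix of
    # the adapter matches the suffix of the read. Try longest prefix first.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]

    # No adapter found: return the read unchanged.
    return read


adapter = "AGATCGGAAGAGC"  # assumption: placeholder adapter sequence
print(trim_3prime_adapter("ACGTACGTAGATCGGAAGAGCTTTT", adapter))
print(trim_3prime_adapter("ACGTACGTAGATC", adapter))
```

A lower `min_overlap` removes more true adapter remnants but also clips more genomic bases that match the adapter start by chance, which is the trade-off discussed earlier in the thread for very short (1-5 bp) remnants.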

Qiuyuan Liu

We don't see cases where traces of adapter sequence exist in the middle of the reads instead of at the ends. More likely there are poorly size-selected fragments or adapter-adapter or adapter-primer contamination, for which it is better to do a library clean-up or size selection first. See https://www.nvigen.com/dna-clean-up/ or https://www.nvigen.com/dna-size-selection-tutorial/.