How do I remove primers from FASTQ files?

23 April 2015

I received my FASTQ files (amplicon sequencing) back from the sequencing centre. To process the data, we use the UPARSE pipeline. First we merge the forward and reverse reads.

However, all the sequences still have the amplification primers attached. Not all sequences carry the same portion of the primer sequence, though. For example, for gene 1 I used a forward primer 20 nucleotides long. In some sequences I can find all 20 nucleotides; in others I only find the last 14 nucleotides, etc.

Each sequence also has a quality string (each base gets a score). When a primer is removed, the part of the quality string that corresponds to the primer has to be removed as well. How can this be done easily?
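To make the requirement concrete, here is a minimal Python sketch of trimming a forward primer and the matching quality characters together, including the case where only the last part of the primer is present. The primer sequence, the `min_overlap` default and the function name are hypothetical, and the matching is exact only, so it does not handle degenerate bases or sequencing errors.

```python
def trim_forward_primer(seq, qual, primer, min_overlap=10):
    """Remove a full or 5'-truncated primer from the start of a read.

    Tries the full primer first, then progressively shorter suffixes of it
    (down to min_overlap bases). The same number of characters is cut from
    the sequence and from the quality string, so they stay in sync.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    for start in range(0, len(primer) - min_overlap + 1):
        suffix = primer[start:]
        if seq.startswith(suffix):
            cut = len(suffix)
            return seq[cut:], qual[cut:]
    return seq, qual  # no primer found: leave the read untouched


# Hypothetical 20-nt primer, used only for illustration.
PRIMER = "ATGGCTCAAGTTCCGTACGA"

full = trim_forward_primer(PRIMER + "TTTTGGGG", "I" * 20 + "#" * 8, PRIMER)
partial = trim_forward_primer(PRIMER[6:] + "TTTTGGGG", "I" * 14 + "#" * 8, PRIMER)
```

Both calls return the insert `TTTTGGGG` with its eight quality characters, whether the read started with all 20 primer bases or only the last 14.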

Afterwards, we have to do dereplication and OTU clustering. This is why I want the primers removed. The primers were degenerate, so they would influence these processes, and I want to avoid this.
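Because the primers are degenerate, an exact string match will miss many reads. As a rough sketch, the IUPAC ambiguity codes can be expanded into a regular expression before matching. The primer below and the function names are hypothetical, and this version only matches a complete primer at the very start of the read.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def primer_to_regex(primer):
    """Compile a degenerate primer into a regex anchored at the read start."""
    return re.compile("^" + "".join(IUPAC[base] for base in primer.upper()))


def trim_degenerate_primer(seq, qual, primer):
    """Cut the primer and the same number of quality characters, if found."""
    match = primer_to_regex(primer).match(seq)
    if match is None:
        return seq, qual
    cut = match.end()
    return seq[cut:], qual[cut:]


# Hypothetical degenerate primer: Y = C or T, M = A or C.
DEGENERATE_PRIMER = "GTGYCAGCMGCC"
variant_1 = trim_degenerate_primer("GTGCCAGCAGCCTTTT", "IIIIIIIIIIII####", DEGENERATE_PRIMER)
variant_2 = trim_degenerate_primer("GTGTCAGCCGCCAAAA", "IIIIIIIIIIII####", DEGENERATE_PRIMER)
```

Both primer variants are recognised and removed, leaving only the four insert bases and their quality characters.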

Also, the pipeline is designed to do the OTU clustering at 97%. However, I want to cluster at a higher percentage (as high as possible). The reason is that the DNA sequences will be translated to amino acids afterwards, and a single nucleotide difference could easily result in a different amino acid sequence. If I cluster at a lower percentage, OTUs would be formed from several sequences, and those sequences could have different amino acid sequences. Since only one representative of each OTU is chosen, I would lose amino acid sequence data.
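A small Python illustration of the concern, assuming the standard genetic code and that the reads are already in reading frame 1 (both are assumptions; the three sequences are made up):

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate frame 1; unknown codons (e.g. containing N) become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


reference = "ATGGCTGAA"
silent = "ATGGCCGAA"    # GCT -> GCC, both code for alanine
missense = "ATGGCTGAT"  # GAA -> GAT, glutamate -> aspartate

print(translate(reference), translate(silent), translate(missense))
# prints: MAE MAE MAD
```

So a single nucleotide change can alter the protein (third sequence) or leave it unchanged (second sequence), which is why sequences merged into one 97% OTU may or may not share an amino acid sequence.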

We tried to cluster at higher percentages, so >97%. The process ran fine, and we got the final FASTA file with all unique sequences. However, the OTU table that is produced isn't correct: a high percentage of OTUs are missing from it. One explanation we were given is that the pipeline is designed for 97% and problems can arise if you go higher. For example, if you choose 99.9%, a sequence might not be found in the original files, so it is thrown out. This seems weird to me, because the sequence originates from those original files, so how can it not be found?

So in the end, what I'd like help with is:

1) How to remove the primer sequences from my data, as well as the same region in the quality score of the sequence? Seqtk doesn't work btw

2) How can I cluster at a higher percentage than 97% and what is the maximum percentage that can be used without the OTU table being wrong?

3) What suggestions would you have to process our data, what programs could we use?

Many thanks in advance!

Francisco J. Enguita

Hi there

I have used a program called Trimmomatic to remove primers and also to remove low-quality reads. Give it a try:

http://www.usadellab.org/cms/?page=trimmomatic

Best of luck. Cheers

Paco

Fiona Thorburn

Are you sequencing amplicons, i.e. are your primers at the ends of the reads? If so, you could just trim the ends of the reads, or you could try Trim Galore! using --adapter to remove the primer sequence. Both approaches should remove the corresponding quality scores too.
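As a sketch of the "just trim the ends" option, the following Python generator cuts a fixed number of bases, and the same number of quality characters, from both ends of every record. It assumes plain four-line FASTQ records (no wrapped sequence lines); the function name and the trim lengths are hypothetical. Note that a fixed-length cut will remove too much from reads that carry only part of a primer.

```python
def trim_fastq_ends(lines, left, right):
    """Yield (header, sequence, separator, quality) with both ends trimmed.

    `lines` is any iterable of FASTQ lines, e.g. an open file handle.
    An incomplete trailing record (fewer than four lines) is silently dropped.
    """
    it = iter(lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        seq = seq.rstrip("\n")
        qual = qual.rstrip("\n")
        end = max(left, len(seq) - right)  # never let the slice go negative
        yield header.rstrip("\n"), seq[left:end], plus.rstrip("\n"), qual[left:end]


example = ["@read1\n", "AAACCCGGGTTT\n", "+\n", "IIIHHHGGGFFF\n"]
trimmed = list(trim_fastq_ends(example, left=3, right=3))
```

Here three bases are removed from each end, so the record keeps `CCCGGG` with quality `HHHGGG`.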

Leavy Zhang

Trimming primers is not trivial. First, you need to know your primer sequences. Then you should evaluate the trimmed result to make sure it strikes a balance between retaining read numbers and diversity and reducing the impact of the primers on your later analysis.

Cutadapt, as far as I know, is a tool for removing unwanted subsequences.

Guillaume Tahon

Thank you both for answering.

I'm indeed working with amplicon sequences. Forgot to mention that in the question. So the primer sequences are at the ends.

Jacob Israel Cervantes Luevano

Hi,

I used the FASTX-Toolkit, it works well!

http://hannonlab.cshl.edu/fastx_toolkit/

Best

Jacob

Guillaume Tahon

Thank you all for your replies. I ended up using cutadapt. It removes most primers after several rounds, although it still misses a few. Luckily, I could easily remove those after my alignment. Again, thank you all!
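For readers who want to try the same route, here is a hedged Python sketch that builds a cutadapt command line for merged amplicon reads. The primer sequences, file names and parameter values are placeholders, not the ones used in this thread. The options used (`-g` for a 5' adapter, `-a` for a 3' adapter, `-n` for the number of removal rounds per read, `-e` for the allowed error rate, `-O` for the minimum overlap, `-o` for the output file) are standard cutadapt options, but check `cutadapt --help` for your installed version.

```python
import shlex

# Complement table covering the IUPAC ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a (possibly degenerate) primer."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def build_cutadapt_command(fwd_primer, rev_primer, in_fastq, out_fastq,
                           rounds=2, error_rate=0.1, min_overlap=10):
    """Return the cutadapt argument list; nothing is executed here."""
    return [
        "cutadapt",
        "-g", fwd_primer,                      # forward primer at the 5' end
        "-a", reverse_complement(rev_primer),  # reverse primer, as seen at the 3' end
        "-n", str(rounds),                     # one round per primer
        "-e", str(error_rate),
        "-O", str(min_overlap),
        "-o", out_fastq,
        in_fastq,
    ]


# Placeholder primers and file names, for illustration only.
cmd = build_cutadapt_command("ATGGCNCARGTNCC", "TCRTANGGNACYTG",
                             "merged.fastq", "trimmed.fastq")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)`. Setting `-n 2` lets cutadapt remove both primers in one pass, which may reduce the need for repeated runs.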

Pol Cuscó

I recommend that you use Seeq for trimming those primers. It will work with incomplete or degenerate sequences, so give it a try.

You can find instructions on how to do it here:

https://github.com/ezorita/seeq#v-using-seeq-as-a-sequence-trimmer

https://github.com/ezorita/seeq

Andrey Kechin

For removing primer sequences from NGS reads you can use cutPrimers (https://github.com/aakechin/cutPrimers). It has been created exactly for that!
