How do I remove primers from FASTQ files?

04 April 2015

I received my fastq files (amplicon sequencing) back from the sequencing centre. To process data, we use the uparse pipeline. First we merge the forward and reverse reads.

However, all the sequences still have the amplification primers I used attached to them. Not all sequences carry the same stretch of primer sequence, though. For example, for gene 1 I used a forward primer 20 nucleotides long. In some sequences I can find all 20 nucleotides, in others I only find the last 14 nucleotides, etc.

Each sequence also has a quality score (each base gets a score). When primers are removed, the region in the quality score that stood for the primer also has to be removed. How can this be easily done?

Afterwards, we have to do dereplication and OTU clustering. This is why I want the primers removed. The primers were degenerate, so they would influence these processes, and I want to avoid this.

Also, the pipeline is designed to do the OTU clustering at 97%. However, I want to cluster at a higher percentage (as high as possible). The reason is that the DNA sequences will be translated to amino acids afterwards, and a single nucleotide difference could easily result in a different amino acid sequence. If I cluster at a lower percentage, OTUs would be formed from several sequences, and those sequences could have different amino acid sequences. Since only one representative of each OTU is chosen, I would lose amino acid sequence data.

We tried to cluster at higher percentages, so >97%. The process worked fine, so we get the final fasta file with all unique sequences. However, the OTU table that is made isn't correct: a high percentage of OTUs are missing from this file. One explanation we were given was that the pipeline is designed for 97% and that problems can arise if you go higher. For example, if you choose 99.9%, it could be that the sequence isn't found in the original files, so it is thrown out. This seems weird to me, because the sequence originates from this original file, so how can it not be found?
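For reference, the upper limit of "as high as possible" is 100% identity, where clustering reduces to collapsing exactly identical sequences (dereplication). The sketch below only illustrates that idea with standard Unix tools; it is not the UPARSE implementation. The function name derep_fastq, the uniqN labels and the toy reads are made up for illustration, the ;size=N; annotation is an assumption modelled on the usearch label convention, and the input is assumed to be FASTQ with exactly four lines per record.

```shell
# derep_fastq < in.fastq > uniques.fasta
# Collapses identical reads and writes one FASTA record per unique
# sequence, most abundant first, with its abundance in the label.
derep_fastq() {
  awk 'NR % 4 == 2' |                       # keep only the sequence lines
    LC_ALL=C sort | uniq -c |               # count identical sequences
    LC_ALL=C sort -k1,1nr -k2,2 |           # most abundant first
    awk '{ printf ">uniq%d;size=%d;\n%s\n", NR, $1, $2 }'
}

# Toy input: two identical reads and one different read.
printf '@a\nACGT\n+\nIIII\n@b\nACGT\n+\nIIII\n@c\nTTTT\n+\nIIII\n' | derep_fastq
```

Sequences that differ by a single nucleotide stay separate here, which is the behaviour wanted before translation to amino acids, at the cost of keeping sequencing errors as their own entries.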

So in the end, what I'd like to have help with is

1) How to remove the primer sequences from my data, as well as the same region in the quality score of the sequence? Seqtk doesn't work btw

2) How can I cluster at a higher percentage than 97% and what is the maximum percentage that can be used without the OTU table being wrong?

3) What suggestions would you have to process our data, what programs could we use?

Many thanks in advance!

Francisco J. Enguita Popular answer

Hi there

I have used a program called Trimmomatic to remove primers and also to remove bad-quality reads. Give it a try:

http://www.usadellab.org/cms/?page=trimmomatic

Best of luck. Cheers

Paco

Fiona Thorburn

Are you sequencing amplicons, i.e. are your primers at the ends of the reads? If so, you could just trim the ends of the reads, or you could try Trim Galore! using --adapter to remove the primer sequence. Both approaches should remove the corresponding quality scores too.
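As a minimal sketch of the "just trim the ends" option: assuming every read starts with a complete primer of known length and each FASTQ record spans exactly four lines, the same number of characters can be cut from the sequence line and the quality line with awk. The function name trim_5prime and the toy record are made up for illustration.

```shell
# trim_5prime N < in.fastq > out.fastq
# Removes the first N characters from the sequence line (2nd line of each
# record) and from the quality line (4th line), so both keep equal length.
trim_5prime() {
  awk -v n="$1" '
    NR % 4 == 2 || NR % 4 == 0 { print substr($0, n + 1); next }
    { print }
  '
}

# Toy record: a 4-base "primer" (ACGT) followed by the insert TTGCA.
printf '@read1\nACGTTTGCA\n+\nIIIIHHHHH\n' | trim_5prime 4
# prints:
# @read1
# TTGCA
# +
# HHHHH
```

This only works when the primer is present in full in every read. If some reads carry a truncated primer, as described in the question, a fixed cut removes real sequence from those reads, so a primer-aware tool is the safer choice.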

Leavy Zhang

Trimming primers is not trivial. First, you need to know your primer sequences. Then the trimmed result should be evaluated to make sure it strikes a balance between retaining read number and diversity and reducing the impact of the primers on your later analysis.

Cutadapt, I think, is a tool for removing this kind of unwanted subsequence.

Guillaume Tahon

Thank you both for answering.

I'm indeed working with amplicon sequences. Forgot to mention that in the question. So the primer sequences are at the ends.

Jacob Israel Cervantes Luevano

Hi,

I used the FASTX Toolkit, it works well!

http://hannonlab.cshl.edu/fastx_toolkit/

Best

Jacob

Guillaume Tahon

Thank you all for your reply. I ended up using cutadapt. It removes most primers after several rounds, although it still misses a few. Luckily I could easily remove those after my alignment. Again, thank you all!

Pol Cuscó

I recommend that you use Seeq for trimming those primers. It will work with incomplete or degenerate sequences, so give it a try.

You can find instructions on how to do it here:

https://github.com/ezorita/seeq#v-using-seeq-as-a-sequence-trimmer

https://github.com/ezorita/seeq
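To illustrate what matching "incomplete or degenerate" primers involves, here is a rough sketch with plain awk. It is not how Seeq works internally; it only shows the idea. It expands IUPAC codes into character classes, accepts the full primer or any 3' suffix of at least a chosen minimum length at the start of the read, and cuts the same number of characters from the quality line. The function name trim_primer, the primer ACGYTM and the toy reads are hypothetical, exact matching (no sequencing errors) is assumed, and each FASTQ record is assumed to span four lines.

```shell
# trim_primer PRIMER MIN_LEN < in.fastq > out.fastq
trim_primer() {
  awk -v primer="$1" -v min_len="$2" '
    # Turn a primer with IUPAC codes into a regular expression.
    function expand(p,    i, c, out) {
      out = ""
      for (i = 1; i <= length(p); i++) {
        c = substr(p, i, 1)
        out = out ((c in iupac) ? iupac[c] : c)
      }
      return out
    }
    BEGIN {
      iupac["R"] = "[AG]";  iupac["Y"] = "[CT]";  iupac["S"] = "[CG]"
      iupac["W"] = "[AT]";  iupac["K"] = "[GT]";  iupac["M"] = "[AC]"
      iupac["B"] = "[CGT]"; iupac["D"] = "[AGT]"; iupac["H"] = "[ACT]"
      iupac["V"] = "[ACG]"; iupac["N"] = "[ACGT]"
      # Alternation of the full primer and its suffixes down to min_len.
      re = ""
      for (s = 1; length(primer) - s + 1 >= min_len; s++)
        re = re (re == "" ? "" : "|") expand(substr(primer, s))
      re = "^(" re ")"
    }
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0; cut = match(seq, re) ? RLENGTH : 0 }
    NR % 4 == 0 {
      print header
      print substr(seq, cut + 1)   # sequence without the primer
      print "+"
      print substr($0, cut + 1)    # quality string cut by the same length
    }
  '
}

# Toy reads: full primer, truncated primer (last 4 bases), no primer.
printf '@r1\nACGCTATTTT\n+\nIIIIIIHHHH\n@r2\nGTTCGGGG\n+\nIIIIHHHH\n@r3\nTTTTTTTT\n+\nHHHHHHHH\n' |
  trim_primer ACGYTM 4
```

A small minimum length increases the chance of removing bases that only look like a primer fragment, so the value needs checking against the real data. Dedicated tools additionally tolerate mismatches and indels, which this sketch does not.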

Andrey Kechin

For removing primer sequences from NGS reads you can use cutPrimers (https://github.com/aakechin/cutPrimers). It has been created exactly for that!

Oscar Montoya

The following code uses cutadapt (a command-line tool) to remove three pairs of primers (three forward and three reverse) and their reverse complements (the primer sequence read backwards, with each base replaced by its complement):

for i in *_R1_001.fastq.gz; do

SAMPLE=$(echo "${i}" | sed "s/_R1_001\.fastq\.gz//")

echo "${SAMPLE}_R1_001.fastq.gz" "${SAMPLE}_R2_001.fastq.gz"

cutadapt -m 10 -O 17 -e 0 -q 20,20 -g "forwardPrimer1xxx" -g "forwardPrimer2xxx" -g "forwardPrimer3xxx" -a "forwardPrimer1InverseSequencexxx" -a "forwardPrimer2InverseSequencexxx" -a "forwardPrimer3InverseSequencexxx" -G "reversePrimer1xxx" -G "reversePrimer2xxx" -G "reversePrimer3xxx" -A "reversePrimer1InverseSequencexxx" -A "reversePrimer2InverseSequencexxx" -A "reversePrimer3InverseSequencexxx" -o "/path/to/write/output/${SAMPLE}_R1_001.fastq.gz" -p "/path/to/write/output/${SAMPLE}_R2_001.fastq.gz" "${SAMPLE}_R1_001.fastq.gz" "${SAMPLE}_R2_001.fastq.gz"

done

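The "xxx" is meant to avoid matching inner parts of the reads (http://cutadapt.readthedocs.io/en/stable/recipes.html#avoid-internal-adapter-matches). As far as I understand the cutadapt documentation, the X characters go after the sequence for 3' adapters (-a, -A) and before it for 5' adapters (-g, -G), so check the placement for your version.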

Get the reverse complements of your primers here http://reverse-complement.com/.
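If you prefer to stay on the command line, a reverse complement can also be computed with rev and tr. This is a small sketch, assuming the rev utility is installed; the function name revcomp and the example primer ACGYTM are made up for illustration. The mapping also covers the IUPAC degenerate codes (R/Y, K/M, B/V, D/H), while S, W and N are their own complements and pass through unchanged.

```shell
# revcomp SEQ: print the reverse complement of a primer,
# including IUPAC degenerate bases, in upper or lower case.
revcomp() {
  printf '%s\n' "$1" | rev |
    tr 'ACGTRYKMBVDHacgtrykmbvdh' 'TGCAYRMKVBHDtgcayrmkvbhd'
}

revcomp ACGYTM   # prints KARCGT
```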

If you only need to remove one set of primers (one forward and one reverse), remove the extra -g, -G, -a, and -A from the script as required.

To see how the sed command works, go to http://www.grymoire.com/Unix/Sed.html.
