I received my FASTQ files (amplicon sequencing) back from the sequencing centre. To process the data, we use the UPARSE pipeline. First we merge the forward and reverse reads.
However, all the sequences still have the amplification primers I used attached to them, and not every sequence carries the same portion of the primer. For example, for gene 1 I used a forward primer that is 20 nucleotides long. In some sequences I can find all 20 nucleotides; in others I only find the last 14 nucleotides, and so on.
Each sequence also has a quality string (every base gets a score). When a primer is removed, the part of the quality string that corresponds to the primer has to be removed as well. How can this be done easily?
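To make the question concrete, here is a rough Python sketch of the trimming I have in mind. It is only an illustration under my own assumptions: the names `matches`, `trim_forward_primer` and `min_overlap` are made up, it only handles the forward primer at the start of a read, it allows IUPAC degenerate codes in the primer but no sequencing errors, and it expects the sequence and quality string to have already been read from the FASTQ record.

```python
# IUPAC degenerate nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(primer_part, read_part):
    """True if every read base is allowed by the primer code at that position."""
    return len(primer_part) == len(read_part) and all(
        base in IUPAC.get(code, "") for code, base in zip(primer_part, read_part)
    )


def trim_forward_primer(seq, qual, primer, min_overlap=10):
    """Remove a full or 5'-truncated primer from the start of a read.

    Tries the full primer first, then progressively shorter suffixes of it
    (down to min_overlap bases, an arbitrary cut-off chosen for this sketch).
    Sequence and quality string are cut by the same number of characters.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    for k in range(len(primer), min_overlap - 1, -1):
        if matches(primer[-k:], seq[:k]):
            return seq[k:], qual[k:]
    return seq, qual  # no primer found: leave the read unchanged
```

For a read that starts with only the last 14 bases of a 20-base primer, `trim_forward_primer(seq, qual, primer)` would cut 14 characters from both the sequence and the quality string, which is the behaviour I am after.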
Afterwards, we have to do dereplication and OTU clustering. This is why I want the primers removed. The primers were degenerate, so they would influence these processes, and I want to avoid this.
Also, the pipeline is designed to do the OTU clustering at 97%. However, I want to cluster at a higher percentage (as high as possible). The reason is that the DNA sequences will be translated to amino acids afterwards, and a single nucleotide difference can easily result in a different amino acid sequence. If I cluster at a lower percentage, each OTU would contain several sequences, and those sequences could have different amino acid sequences. Since only one representative of each OTU is chosen, I would lose amino acid sequence data.
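A small Python sketch of why this worries me. It uses the standard genetic code and two made-up 300-nucleotide sequences that differ at one position; the helper names `translate` and `identity` are just for illustration, and I assume the reads are in frame and of equal length.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate an in-frame DNA sequence; unknown codons become 'X'."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two hypothetical amplicons that differ at a single nucleotide (position 5).
seq_a = "ATGGCCGAAACC" * 25
seq_b = seq_a[:4] + "A" + seq_a[5:]

print(identity(seq_a, seq_b))                       # well above the 0.97 threshold
print(translate(seq_a)[:4], translate(seq_b)[:4])   # the second amino acid differs
```

Here the two sequences are more than 99% identical, so at 97% they would end up in the same OTU, yet their translations are not the same.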
We tried to cluster at higher percentages (above 97%). The process ran without errors, so we get the final FASTA file with all unique sequences. However, the OTU table that is produced isn't correct: a high percentage of the OTUs are missing from it. An explanation I was given is that the pipeline is designed for 97%, and if you go higher, problems can arise. For example, if you choose 99.9%, it could be that a sequence isn't found in the original files, so it is thrown out. This seems odd to me, because the sequence originates from those original files, so how can it not be found?
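For comparison, this is roughly what I would expect a table at 100% identity to contain: every unique sequence counted per sample by exact match. This is only my own sketch of the idea, not how the pipeline builds its OTU table; the function name `exact_match_table` and the input layout (a dict from sample name to a list of primer-trimmed reads) are assumptions.

```python
from collections import Counter, defaultdict


def exact_match_table(reads_by_sample):
    """Count every unique full-length sequence per sample by exact match."""
    table = defaultdict(Counter)
    for sample, reads in reads_by_sample.items():
        for seq in reads:
            table[seq.upper()][sample] += 1
    return table


# Hypothetical toy input: two samples with a few short reads.
reads_by_sample = {
    "sample1": ["ACGT", "ACGT", "ACGA"],
    "sample2": ["acgt"],
}
table = exact_match_table(reads_by_sample)
for seq, counts in table.items():
    print(seq, dict(counts))
```

With this approach every sequence in the table is found in the input by construction, which is why the missing OTUs confuse me.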
So in the end, what I'd like help with is:
1) How can I remove the primer sequences from my data, together with the corresponding region of each sequence's quality string? By the way, seqtk didn't work for this.
2) How can I cluster at a higher percentage than 97% and what is the maximum percentage that can be used without the OTU table being wrong?
3) What suggestions would you have for processing our data? Which programs could we use?
Many thanks in advance!