Suppose I have a FASTQ sample file with adaptor content in it. I believe adaptors in the sequence will not get mapped to the reference genome. If you think otherwise, what is the chance of the adaptors getting mapped?
What tool(s) are you planning on using? You should refer directly to their documentation.
All of the assembly or mapping tools I've used in the past require adaptors and barcodes to be removed; this is often carried out during the demultiplexing stage.
Using BWA and Bowtie, I have always removed the adaptors and indexes/barcodes before assembly/alignment. This can be carried out using any number of tools that demultiplex raw .fastq files.
I'm not sure what you mean about the adaptors having unique features, other than that they are used to fix DNA to the flow cell (assuming you're using Illumina technology). The alignment protocols do not identify adaptors as such and will treat them as any other part of your sequences. Because the adaptor portions are identical from read to read, those portions of your reads will be aligned to each other and you'll have a very poor assembly.
1. I believe you are mapping reads either to perform sequence assembly or to call variants. One thing to remember is that aligners tolerate a certain number of mismatches and gaps. This means adapter sequences, although not exactly identical to any genomic sequence, might still be mapped within that tolerance. So if you run something like FastQC and find lots of adapters in your reads, consider trimming.
2. However, trimming is not as straightforward as one might think, because different adapter removal tools employ slightly different algorithms for adapter detection and removal, e.g. how they deal with adapters at the 5' vs 3' end, how far into the read to trim, matching error thresholds, paired-end read handling, etc. So use a tool whose behaviour you clearly understand.
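To make the parameters in point 2 concrete, here is a minimal Python sketch of 3' adapter trimming with an error threshold and a minimum overlap. It is not how any particular tool works; the function name `trim_3prime_adapter` and the defaults `max_error_rate=0.1` and `min_overlap=3` are assumptions chosen for illustration, and only substitutions (no indels) are considered.

```python
def trim_3prime_adapter(read, adapter, max_error_rate=0.1, min_overlap=3):
    """Remove a full or partial adapter from the 3' end of a read.

    Hypothetical helper for illustration only. Substitutions are
    tolerated up to max_error_rate; insertions/deletions are not handled.
    """
    # Try every position where the adapter could start, leftmost first.
    for start in range(len(read) - min_overlap + 1):
        # The adapter may run off the end of the read (partial adapter).
        overlap = min(len(adapter), len(read) - start)
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches <= int(max_error_rate * overlap):
            return read[:start]  # keep only the bases before the adapter
    return read  # no adapter found


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example adapter prefix; use your kit's sequence
    read = "TTCCTTCCTTCC" + adapter[:8]  # insert followed by a partial adapter
    print(trim_3prime_adapter(read, adapter))
```

Changing `max_error_rate` or `min_overlap` changes which reads get trimmed and by how much, which is exactly why two tools can give different output on the same FASTQ.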
@Jacob R Price
I would use the term "demultiplexing" more specifically when referring to separating a multitude of reads from a sequencing platform into specific samples, rather than including adapter removal as part of that process.
To be honest, I have never really bothered myself with overrepresented sequences.
My hunch is that once adapter trimming is done, most of these should disappear. If that does not happen, then I would attribute the overrepresentation to one of the three reasons given in the FastQC documentation:
"Finding that a single sequence is very overrepresented in the set either means that it is highly biologically significant, or indicates that the library is contaminated, or not as diverse as you expected."
If you are sure about your library then it could just mean that the sequence is quite common in your genome of interest.
Remember that these overrepresented sequences are 25-75bp (according to FastQC), and across a genome it is fairly reasonable to expect some similarity at that length!
The question is not whether the adaptor sequence will map to the reference genome. The question is whether your read with the adaptor still attached to it (adaptor+read) will map where it really belongs.
The answer is absolutely not. So go ahead and remove the adaptors.
That gets at what I really want to know. If the adaptors do not map to the reference genome, they remain unmapped, right?
Rather than removing adaptors as an initial step, could those unmapped reads be removed after aligning to the reference genome? What is your suggestion?
No, the adaptors in your fastq are part of the read sequence. The aligner will not set them aside as unmapped; rather, it will try to map the whole sequence including the adaptor. So no.
Adapter trimming is the best practice, and this step is routinely included during demultiplexing. If you choose not to trim adapters, the read will usually still map to the genome, depending upon your settings in BWA or Bowtie2. In most cases, the aligner will clip off the adapter because it is discordant with the genomic DNA upstream or downstream from the mapped read.
Small sections of adapters may also map to the genome, but again this is highly dependent upon the settings that you use for BWA or Bowtie. If the adapter fragments map then they will likely lead to over-represented genomic sequences. If you are in doubt, you can go ahead and align the FASTQ then use IGV motif finder to search through the resulting pile-up for your adapter sequences. If you are doing WES or WGS then you should see high read depth in genomic regions matching your adapters (sequences available online). If you have only run a few MB of DNA for your assay then your adapters will be diluted across the genome and you will likely not see much overrepresentation of adapters on pile-up.
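If you would rather check before aligning, a rough alternative to searching the pile-up is to count how many raw reads contain the start of your adapter. The sketch below assumes a plain 4-line-per-record FASTQ and exact matching only; `adapter_read_fraction` is a hypothetical helper name, and the adapter string must be the one for your own library kit.

```python
def adapter_read_fraction(lines, adapter):
    """Fraction of reads whose sequence contains `adapter` exactly.

    `lines` is any iterable of FASTQ lines (e.g. an open file).
    Assumes 4 lines per record with unwrapped sequences.
    """
    total = 0
    hits = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of each record is the sequence
            total += 1
            if adapter in line.strip().upper():
                hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python check_adapter.py sample.fastq AGATCGGAAGAGC
    with open(sys.argv[1]) as handle:
        print(adapter_read_fraction(handle, sys.argv[2]))
```

Because it only finds exact, complete copies of the string you give it, this undercounts reads with partial or error-containing adapters; treat it as a quick sanity check, not a replacement for FastQC.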
If nothing here is satisfactory, why don't you just align it yourself? Try a couple of scenarios, like aligning only the adaptors, aligning reads with untrimmed adaptors, and comparing alignments before and after adaptor trimming.
I have done it and the result is almost the same. That is why I have asked this question here. I just want to understand why adaptor trimming is necessary.
Well, if you get your fastq files from an Illumina instrument, for example, they are usually trimmed already. Second, most mappers will perform clipping during the alignment, so the read will still get mapped correctly, albeit with a lower alignment score.
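One way to see whether your aligner actually clipped anything is to look at the CIGAR strings in the SAM output: soft-clipped bases appear as `S` operations at the ends. Whether clipping happens depends on the aligner and its mode (for example, Bowtie2's default end-to-end mode does not soft-clip, while its `--local` mode does). The helper name `soft_clips` below is an assumption for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar):
    """Return (left, right) soft-clipped base counts for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside soft clips; ignore them here.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


if __name__ == "__main__":
    # A 100 bp read whose last 30 bases were soft-clipped.
    print(soft_clips("70M30S"))
```

If untrimmed reads show many clips at one end while trimmed reads show few, that suggests the aligner was clipping adapter sequence, which would explain near-identical results with and without trimming.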
@Artur: totally agree with you, Trimmomatic would remove most commonly used Illumina adaptors, and the barcodes should be removed in the demultiplexing step. I personally run the data through Trimmomatic after AdapterRemoval, and perform quality trimming with seqtk. After that, I filter out all the reads that fall below a given length; the length then depends on what type of data you're working with: DNA/RNA, the length of the insert, ...
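The final length filter is simple enough to sketch in a few lines of Python. This assumes a 4-line-per-record FASTQ with unwrapped sequences; `filter_by_length` and the threshold in the example are hypothetical, and the right cutoff depends on your data as described above.

```python
def filter_by_length(lines, min_length):
    """Yield FASTQ records (header, seq, plus, qual) with len(seq) >= min_length.

    `lines` is any iterable of FASTQ lines (e.g. an open file).
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:  # one complete record collected
            if len(record[1]) >= min_length:
                yield tuple(record)
            record = []


if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python length_filter.py trimmed.fastq 30 > kept.fastq
    with open(sys.argv[1]) as handle:
        for rec in filter_by_length(handle, int(sys.argv[2])):
            print("\n".join(rec))
```

For paired-end data you would need to drop both mates together to keep the files in sync, which this sketch does not do.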