Hi everyone !

I have exported a small number (~500) of 145 pb sequences from a Stacks analyse.

I then used it as my "reference genome" (actually a very fragmented and unordered one, consequently !) to align a .fastq.gz file containing several millions of reads 1. My objective is to filter out, from my "reference", every single 145 pb sequence from which an aligned read actually has multiple hits.

e.g if a given read has a primary alignment on Seq A from reference, but also a secondary alignment on Seq B from reference, I want to filter out both Seq A and B. In the end, I hope to obtain a smaller "reference" with very specific sequences that I can amplify unambiguously by PCR.

I used BWA MEM to do the alignment, with parameters -a and -M turned on, but I would like to know if there would be another way (than making a script that works directly on the SAM file) to detect all sequences that I want to get rid of.

The enclosed document and SAMtools manual are both very helpful but I couldn't find any solution for my issue. Would somebody have any suggestion :-) ?

Thanks a lot !

Chrys

http://www.htslib.org/doc/samtools-1.2.html

Similar questions and discussions