I would like some suggestions and clarification about dealing with singletons in NGS of PCR amplicons. In the mothur SOP, the QC steps include the following:
initial screening
align.seqs (which will remove some sequences)
pre.cluster (to remove sequences due to sequencing errors)
chimera check (to remove chimeras)
And my questions are:
do singletons refer to singleton unique sequences, or to singleton OTUs?
what are the effects of singleton and rare sequences on sequencing analysis (e.g. alpha and beta diversity)?
at which step should we remove singletons / rare sequences?
Hello. A singleton is an OTU that has a single representative sequence in your data set. So you should remove singletons once you've calculated your OTUs (which is not yet the case when you are at the QC steps). Your pre.cluster step is only a way to help get rid of your chimeras. Check for chimeras, remove them, and then cluster into OTUs. Once this is done, you can get rid of singletons. I hope this helps. Cheers.
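To make the last step concrete, here is a minimal Python sketch of dropping singleton OTUs from an OTU table after clustering. It is not a mothur command: the table, the sample names, and the helper `remove_singleton_otus` are all made up for illustration, and "singleton" is assumed to mean one read in total across all samples.

```python
# Toy OTU table: OTU name -> read counts per sample (hypothetical data).
otu_table = {
    "Otu001": {"sampleA": 120, "sampleB": 95},
    "Otu002": {"sampleA": 1, "sampleB": 0},  # one read in total: a singleton OTU
    "Otu003": {"sampleA": 0, "sampleB": 7},
    "Otu004": {"sampleA": 1, "sampleB": 1},  # two reads in total: kept
}


def remove_singleton_otus(table, min_total=2):
    """Keep only OTUs whose total read count across all samples is >= min_total."""
    return {
        otu: counts
        for otu, counts in table.items()
        if sum(counts.values()) >= min_total
    }


filtered = remove_singleton_otus(otu_table)
print(sorted(filtered))
```

Raising `min_total` would also drop doubletons and other rare OTUs; which threshold is appropriate depends on the data set and the analysis.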
Thanks for your answer. I think pre.cluster is a mothur SOP step to remove sequencing errors rather than chimeras (though it may also help to get rid of chimeras). My question still stands when we consider the following situation:
Suppose that after all the QC steps in the mothur SOP, we have a fasta file of unique sequences and the corresponding count table. According to the count table, say 99.9% of the unique sequences each occur as a single copy in the entire data set. We then run the cluster or cluster.split command and the subsequent mothur commands to obtain OTUs at a specific cutoff. What if these 99.9% of the unique sequences form one single OTU, so that this OTU contains more than one, or even many, sequences?
As I understand it, the QC steps on any platform serve to produce a reliable set of representative sequences (i.e. OTUs) and a reliable OTU table. Thus, if one questions the reliability of a singleton OTU, should we not also be more careful about the singleton unique sequences that we use to make the OTUs, and should we remove those singleton unique sequences first?
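The distinction I am asking about can be illustrated with a small Python sketch. The count table and the sequence-to-OTU assignment below are invented toy data, not mothur output; the point is only that unique sequences with one copy each can add up to an OTU that is not a singleton.

```python
from collections import Counter

# Hypothetical count table: unique sequence -> number of reads in the data set.
unique_counts = {"seq1": 1, "seq2": 1, "seq3": 1, "seq4": 1, "seq5": 50}

# Hypothetical clustering result: unique sequence -> OTU it was assigned to.
otu_of = {
    "seq1": "Otu1",
    "seq2": "Otu1",
    "seq3": "Otu1",
    "seq4": "Otu2",
    "seq5": "Otu3",
}

# OTU abundance is the sum of the read counts of its member sequences.
otu_sizes = Counter()
for seq, n_reads in unique_counts.items():
    otu_sizes[otu_of[seq]] += n_reads

singleton_uniques = [s for s, n in unique_counts.items() if n == 1]
singleton_otus = [o for o, n in otu_sizes.items() if n == 1]

print("singleton unique sequences:", singleton_uniques)
print("singleton OTUs:", singleton_otus)
```

In this toy case four unique sequences are singletons, but only Otu2 is a singleton OTU: seq1, seq2 and seq3 together give Otu1 three reads. Removing singleton unique sequences before clustering would therefore discard Otu1 as well, which removing singleton OTUs afterwards would not.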