If you were a researcher investigating a particular group of prokaryotes (e.g. Cyanobacteria) in an environmental sequencing project, which of the following approaches to taxonomic classification would you use with k-mer-based tools such as CLARK, Kraken, Centrifuge, or Kaiju?

  • Approach 1: Index the whole GenBank database to build the classifier database, classify the contigs against everything, then collect the cyanobacterial hits for downstream analyses.
  • Approach 2: Index only the cyanobacterial entries in GenBank, then classify the contigs against that cyanobacteria-only database.
A colleague and I are working independently on the sample. My colleague used Approach 2 and got only one species hit, while I used Approach 1 and got a much more diverse result (more than 20 species according to my classification).

My colleague stands firmly by their result. I, however, am not convinced that the sample contains only one species, because we are also doing classical isolation and have already obtained several unialgal isolates. I would like to know which of us is correct.

Would constraining the database from the start bias the classification? Why are our results so different?
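To make the question about bias concrete, here is a toy Python sketch of what I have in mind. It is only an illustration under strong assumptions: the sequences and taxon names are made up, the `classify` helper is hypothetical, and the "assign to the reference sharing the most k-mers" rule is a simplification, not how CLARK, Kraken, Centrifuge, or Kaiju are actually implemented.

```python
from collections import Counter

def kmers(seq, k=4):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references, k=4):
    """Map each taxon name to the k-mer set of its reference sequence."""
    return {taxon: kmers(seq, k) for taxon, seq in references.items()}

def classify(read, index, k=4, min_hits=1):
    """Toy rule: pick the taxon sharing the most k-mers with the read.

    Returns 'unclassified' if the best taxon shares fewer than min_hits k-mers.
    """
    read_kmers = kmers(read, k)
    hits = Counter({taxon: len(read_kmers & ref) for taxon, ref in index.items()})
    if not hits:
        return "unclassified"
    taxon, count = hits.most_common(1)[0]
    return taxon if count >= min_hits else "unclassified"

# Made-up reference sequences (hypothetical, for illustration only).
all_refs = {
    "cyano_A": "ATGCGATCGTTAGC",
    "cyano_B": "TTGACCGGTAACGT",
    "proteo_X": "GGCATCGATCGGAT",
}
cyano_refs = {t: s for t, s in all_refs.items() if t.startswith("cyano")}

full_index = build_index(all_refs)     # analogous to Approach 1
cyano_index = build_index(cyano_refs)  # analogous to Approach 2

# A read taken from proteo_X that happens to share a few k-mers with cyano_A.
read = "CATCGATCGGAT"

print(classify(read, full_index))               # proteo_X
print(classify(read, cyano_index))              # cyano_A
print(classify(read, cyano_index, min_hits=4))  # unclassified
```

In this toy case the same read gets a different label depending on what is in the index and on the hit threshold. It does not by itself explain why our two results differ in the direction they do, which is what I am hoping someone can clarify.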
