Since RDP and Greengenes no longer seem to be fully active, how can I compare my 16S sequences against ribosomal databases (other than NCBI and the Linux-based ARB-SILVA)?
If you're just interested in classification and want to avoid ARB, then mothur or QIIME are the way to go. Both provide several classification algorithms that you can just drop any 16S database into. mothur currently provides Greengenes (May 2013), SILVA SSU v119 and RDP version 10. From a quick glance at their website, QIIME has an older version of SILVA (111) but a newer version of Greengenes (August 2013). I'm not aware of any other authorities on 16S taxonomy, so even though the databases are each about a year old they still represent the best current effort.
If I were to guess, I'd say these projects are slowing down a bit now because comparatively fewer full-length 16S sequences are being generated, with the various NGS platforms becoming the established method of choice. There's probably also a lot of focus being redirected into the comparative genomics/metagenomics space.
Thank you David, I agree that NGS and metagenomics are largely displacing the "good old-fashioned" 16S databases. Sometimes I need to do very simple things (identify a strain or a DGGE band), and I find that the tools I used 2-3 years ago are not there anymore!
@dwaite David, thanks for the valuable contributions on the Internets, including the mothur forum. Someone told me GG is biased towards soil taxa. Have you heard that?
Also, I have a specific GG vs SILVA case: in freshwater 16S V4 data, GG classified every sequence, while SILVA failed to classify 3% even to domain level. I can go ahead with GG, but then 3% is not much. Does SILVA offer any explicit advantages over GG, such as perhaps better classification to species level?
I'll start with a disclaimer that I'm not involved with either of those projects, but based on the people who are involved in Greengenes I wouldn't be surprised to see a bit of a bias towards human and soil sequences. Our lab has noted differences in a few taxonomic groupings when comparing Greengenes, RDP and SILVA, but for the most part these are extremely minor and tend to occur in the candidate phyla, which is where these databases all diverge anyway. The attached study concluded that Greengenes produces the fewest unclassified sequences, which is consistent with our own work.
There are obviously technical differences between how the two projects obtain their taxonomies, but the only difference between the two that I think is relevant to end-users is that SILVA includes eukaryotic sequences in their database. This makes it useful for identifying eukaryotic mitochondrial sequences in amplicon data, although whether or not this matters depends on the environment you're sampling.
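If you want to put a number on the "unclassified at domain level" difference for your own data, a minimal Python sketch along these lines might help. It assumes each classifier run produced a two-column, tab-separated taxonomy file (sequence ID, then a semicolon-separated lineage with optional confidence scores in parentheses), which is modelled on mothur-style output. The function name and the labels treated as unclassified ("unknown", "unclassified") are my own assumptions, so adjust them to whatever your classifier actually writes.

```python
import io


def domain_unclassified_fraction(handle, unclassified_labels=("unknown", "unclassified")):
    """Return the fraction of sequences with no assignment at the first (domain) rank.

    Assumed input: one sequence per line, "<seq_id>\t<rank1>(conf);<rank2>(conf);..."
    The label names in `unclassified_labels` are assumptions, not a fixed standard.
    """
    total = 0
    unclassified = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _seq_id, _, lineage = line.partition("\t")
        ranks = [r for r in lineage.split(";") if r]
        # Take the first rank and drop any "(confidence)" suffix
        domain = ranks[0].split("(")[0].strip().lower() if ranks else ""
        total += 1
        if domain == "" or domain in unclassified_labels:
            unclassified += 1
    return unclassified / total if total else 0.0


# Tiny made-up example (not real results): two sequences per database
gg_demo = io.StringIO(
    "seq1\tBacteria(100);Proteobacteria(98);\n"
    "seq2\tBacteria(100);Firmicutes(90);\n"
)
silva_demo = io.StringIO(
    "seq1\tBacteria(100);Proteobacteria(99);\n"
    "seq2\tunknown(100);unknown_unclassified(100);\n"
)
print("GG:", domain_unclassified_fraction(gg_demo))
print("SILVA:", domain_unclassified_fraction(silva_demo))
```

With real files you would pass open file handles instead of the in-memory examples. It is also worth pulling out the IDs of the unclassified sequences and checking them by hand (for example with BLAST) to see whether they are non-target amplicons rather than genuine gaps in the database.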
Article Impact of training sets on classification of high-throughput...