Hi all,

I've always wanted to try harvesting and analysing data from NCBI, but I've never really known the best way forward. I know that I can easily download a whole bunch of data from NCBI by searching for a group on the website, in Geneious, or with any number of other programs/scripts. But I end up with a big mess of data that is difficult to sort through and actually use.

So, for those who routinely download NCBI sequences, particularly using R, can you point me towards packages or tutorials that make it easier to:

  • Summarise the data available in the online databases (say, by species within a particular genus), perhaps with an output table showing the number of sequences available for genes X, Y, and Z for taxa A, B, and C.
  • Choose the genes (i.e., several commonly used phylogenetic genes) and the species you want, and download them into an alignment.
  • Follow-up question: it doesn't look like these databases store specimen identifiers. So, do I need to trawl through each paper's tables to link different genes from a single specimen together? Or do people usually just link genes together from the same species to get a full complement (this seems dodgy and prone to author error in identification)?
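
To make the first point concrete, the table I have in mind would look something like the output of this minimal sketch. It is written in Python only for illustration (I'd ultimately want the R equivalent). The records are made-up placeholders rather than real NCBI data, and `summarise` is just a hypothetical helper name, not a function from any package.

```python
from collections import Counter

# Made-up placeholder records: one (taxon, gene) pair per downloaded sequence.
# These are NOT real NCBI data; they only illustrate the shape of the table.
records = [
    ("Taxon A", "Gene X"),
    ("Taxon A", "Gene X"),
    ("Taxon A", "Gene Y"),
    ("Taxon B", "Gene X"),
    ("Taxon B", "Gene Z"),
    ("Taxon C", "Gene Y"),
]


def summarise(records):
    """Return (genes, table), where table[taxon][gene] is a sequence count."""
    counts = Counter(records)  # a Counter gives 0 for pairs that never occur
    taxa = sorted({taxon for taxon, _ in records})
    genes = sorted({gene for _, gene in records})
    table = {t: {g: counts[(t, g)] for g in genes} for t in taxa}
    return genes, table


genes, table = summarise(records)

# Print a simple fixed-width table: one row per taxon, one column per gene.
print("Taxon".ljust(10) + "".join(g.ljust(8) for g in genes))
for taxon, row in table.items():
    print(taxon.ljust(10) + "".join(str(row[g]).ljust(8) for g in genes))
```

The part I can't work out is how to get reliable (taxon, gene) pairs like these out of the downloaded records in the first place.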

Perhaps there is already a document or paper that summarises these issues and best practice nicely, but my searches haven't found it yet.

Many thanks in advance!

James
