taxizedb - Tools for Working with Taxonomic Databases on your Machine

taxize is a heavily used taxonomic toolbelt package in R. However, it makes web requests for nearly all methods. That is fine in most cases, but when you have a very large number of names to look up, it is much more efficient to query a local SQL database.

Not all taxonomic databases are publicly available, and not all of them can be converted into a SQL database. Taxonomic databases supported so far:

  • NCBI: text files are provided by NCBI, which we stitch into a SQLite database
  • ITIS: they provide a SQLite dump, which we use here
  • The Plant List: created by stitching together CSV files. This source is no longer updated as far as we can tell; they say they’ve moved focus to the World Flora Online
  • Catalogue of Life: created from a Darwin Core Archive dump
  • GBIF: created from a Darwin Core Archive dump. Right now we only have the taxonomy table (called gbif), but we will add the other tables in the Darwin Core Archive later
  • Wikidata: aggregated taxonomy of Open Tree of Life, GloBI and Wikidata. Hosted on Zenodo, created by Jorrit Poelen of GloBI

Update schedule for databases:

  • NCBI: db_download_ncbi creates the database when the function is called, so it’s updated whenever you run the function
  • ITIS: ITIS provides the SQLite database as a download, so you can delete the old file and run db_download_itis to get a new dump; we believe they update the dumps roughly once a month
  • The Plant List: no longer updated, so you shouldn’t need to download it again after the first download
  • Catalogue of Life: a script on our server stitches the SQLite database together from the Darwin Core Archive (DCA) once per month, so our copy is updated monthly; we’re not sure how frequently COL updates its DCA dumps
  • GBIF: a script on our server stitches the SQLite database together from the DCA once per month, so our copy is updated monthly; we’re not sure how frequently GBIF updates its DCA dumps
  • Wikidata: last updated April 6, 2018. Scripts are available if you prefer to update the data yourself

Get in touch in the issues with any ideas on new data sources.

For each data source, this package performs the following tasks:

  • Download database - db_download_*
  • Create dplyr SQL backend - src_*

All databases are SQLite.

With the src connection, you can use dplyr and related tools for downstream operations, or you can create your own database connection to the SQLite file.
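As a rough sketch of both routes, assuming the ITIS database as the example source and that the DBI, RSQLite and dbplyr packages are installed: the table to query is picked from whatever the database reports, so no table name is assumed here.

```r
library("taxizedb")
library("dplyr")

# Download (or reuse the cached copy of) the ITIS SQLite file and keep its path
itis_path <- db_download_itis()

# Route 1: open the SQLite file yourself with DBI and list its tables
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = itis_path)
tables <- DBI::dbListTables(con)
tables

# Route 2: use the src connection from taxizedb with dplyr verbs
itis <- src_itis(itis_path)
first_rows <- tbl(itis, tables[1]) %>%
  head(5) %>%
  collect()
first_rows

# Close the hand-made connection when done
DBI::dbDisconnect(con)
```

The dplyr query is lazy until collect() is called, so only the five requested rows are pulled into memory.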

install

CRAN version

install.packages("taxizedb")

dev version

devtools::install_github("ropensci/taxizedb")
library("taxizedb")
library("dplyr")

Download DBs

ITIS
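A minimal call, assuming the default cache location is acceptable; the variable name itis_path is only for illustration.

```r
library("taxizedb")

# Download the ITIS SQLite dump; the function returns the path to the local file
itis_path <- db_download_itis()
itis_path
```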

The Plant List (TPL)
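The same pattern should apply to The Plant List; tpl_path is an illustrative name.

```r
library("taxizedb")

# Download the Plant List database and keep the path to the local file
tpl_path <- db_download_tpl()
tpl_path
```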

Catalogue of Life (COL)
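For the Catalogue of Life, a sketch along the same lines; col_path is an illustrative name, and the download can be large.

```r
library("taxizedb")

# Download the Catalogue of Life database and keep the path to the local file
col_path <- db_download_col()
col_path
```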

connect to the DBs

By default src_* functions use a path to the cached database file. You can alternatively pass in your own path if you’ve put it somewhere else.

ITIS
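A short sketch, assuming the ITIS database has already been downloaded with db_download_itis and that the path argument is named path, as in recent package versions.

```r
library("taxizedb")

# Connect using the default cached file
itis <- src_itis()

# Or pass the path explicitly, here the one returned by the download function
itis_custom <- src_itis(path = db_download_itis())
```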

TPL
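Likewise for The Plant List, assuming db_download_tpl has been run at least once.

```r
library("taxizedb")

# Connect to the cached Plant List database
tpl <- src_tpl()
```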

COL
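And for the Catalogue of Life, assuming db_download_col has been run at least once.

```r
library("taxizedb")

# Connect to the cached Catalogue of Life database
col <- src_col()
```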

Local versions of taxize functions

A few of the key functions from taxize have been ported to taxizedb. Support is currently limited to the NCBI taxonomy database.

children accesses the nodes immediately descending from a given taxon
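For example, assuming the NCBI database is available locally (db_download_ncbi fetches it if not) and using NCBI taxon ID 3701, the genus Arabidopsis, as an illustrative input:

```r
library("taxizedb")

# Immediate children of the genus Arabidopsis (NCBI taxon ID 3701)
kids <- children(3701, db = "ncbi")
kids
```

The result is a list with one element per input taxon.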

classification finds the lineage of a taxon
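For example, again assuming a local NCBI database, with NCBI taxon ID 3702 (Arabidopsis thaliana) chosen for illustration:

```r
library("taxizedb")

# Full lineage of Arabidopsis thaliana (NCBI taxon ID 3702), from root to species
lineage <- classification(3702, db = "ncbi")
lineage
```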

downstream finds all taxa descending from a taxon
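A sketch under the same assumptions (local NCBI database, illustrative taxon ID). Unlike the taxize version, the call here is assumed to work without a downto argument and to return every descendant.

```r
library("taxizedb")

# All taxa below the genus Arabidopsis (NCBI taxon ID 3701), at any depth
descendants <- downstream(3701, db = "ncbi")
descendants
```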

All of these functions run very fast. It only takes a few seconds to find all bacterial taxa and count them:
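One way to check this yourself, assuming a local NCBI database and that taxon ID 2 (Bacteria) is passed to downstream without a downto argument; timings will vary by machine.

```r
library("taxizedb")

# Time the retrieval of every taxon below Bacteria (NCBI taxon ID 2)
timing <- system.time(
  bacteria <- downstream(2, db = "ncbi")
)
timing

# Count the descendant taxa returned for that one input
n_bacteria <- nrow(bacteria[[1]])
n_bacteria
```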