Backends for taxadb
Carl Boettiger, Kari Norman
taxadb is designed to work with a variety of different “backends” – software that works under the hood to store and retrieve the requested data.
taxadb has an intelligent default method selector which will attempt to use the best method available on your system, which means you can use
taxadb without having to worry about these details. However, to improve performance of
taxadb, becoming familiar with these backends can yield significant improvements in performance.
RSQLite is the default database backend if no suggested backend is detected.
RSQLite has no external software dependencies and will be automatically installed with
taxadb (it is a hard dependency as an imported rather than suggested package). The term
Lite indicates that SQLite does not require the separate “server” and “client” software model found on traditional databases such as MySQL, and SQLite is widely used in consumer software everywhere. RSQLite packages SQLite for R. It enables persistent local storage for R applications but will be slower than the alternatives. For certain operations it can be significantly slower.
MonetDBLite & duckdb
MonetDBLite is a modern alternative to
MonetDBLite is both more powerful than SQLite (in supporting a greater array of operations), and can run much faster. Filtering joins in particular can be much faster even than the in-memory operations of
dplyr. Because filtering joins lie at the heart of many
taxadb functions this can yield substantial improvements in performance. Unfortunately, the R interface,
MonetDBLite was removed from CRAN in April 2019. The package can still be installed from GitHub by running
devtools::install_github("hannesmuehleisen/MonetDBLite-R"), though this requires the appropriate compilers. The developer plans to replace MonetDBLite with
duckdb, (see https://github.com/duckdb/duckdb), but this is not yet feature complete and thus not yet fully compatible for
taxadb use. Because installation is more difficult,
MonetDBLite is not a required dependency, but will be used by default if
taxadb detects an existing installation.
duckdb support will be switched on as the first priority in the method waterfall.
taxadb can also be set to use in-memory only, without a backend. (Note that this is distinct from using
MonetDBLite with over
in-memory mode, because it uses only native R
data.frames to store data). This will tend to be faster that
RSQLite but slower than
duckdb. In this mode, data will persist over a single session but not between sessions (since memory is cleared when the user quits out of R). Note that many taxonomic tables are quite large when uncompressed, and users with less than 8-16GB of free RAM may find their machine becomes slow or unresponsive in this mode.
Manual control of the backend engine
Users can override the automatic preferences of
taxadb by setting the environmental variable
TAXADB_DRIVER. For example, running
Sys.setenv(TAXADB_DRIVER="RSQLite") will make
RSQLite the default driver, even if
MonetDBLite is installed.
The first time
taxadb accesses a data source, it will download and store the full dataset from that provider. Users can trigger a download ahead of time by running
td_create("fb") will create a local copy of the FishBase taxonomy. If a user does not call
taxadb simply downloads the data the first time that provider is queried – e.g.
filter_name("Homo sapiens", "gibf") will first download and install GBIF if that has not been done already. These download and install operations may be slow depending on your internet connection, but need be performed only once. Downloaded data is stored on your local harddisk and will persist between R sessions. The default location depends on the default set by your operating system (see the
rappdirs package). Users can configure this location by setting the environmental variable
TAXADB_HOME. For example, all unit tests in the package use temporary storage by setting
Sys.setenv(TAXADB_HOME=tempdir()), which is cleared out after the R session ends.
A user can install all available name providers up front with
td_create("all"). An overview of the available scientific name providers is found in the providers vignette.
taxadb will work just as well with any
DBI-compatible database backend (Postgres, MariaDB, etc). All
taxadb functions take an argument
taxadb_db, which is just a
DBI connection used by
dplyr. For example, we can create an in-memory RSQLite connection and use that to store data for a single session:
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") taxadb::get_ids("Homo sapiens", taxadb_db = con)
Users can also call the
td_connect() function to connect to
taxadb’s default databases. Running
td_connect() with no arguments will return the current default connection. This is a convenient way to confirm that your system is using the database engine you intended it to use. You can also use that connection to interact directly with the
taxadb databases (e.g. using