Backends for taxadb
Carl Boettiger, Kari Norman
2024-10-28
Source:vignettes/backends.Rmd
taxadb is designed to work with a variety of different “backends”:
software that works under the hood to store and retrieve the requested
data. taxadb has an intelligent default method selector which will
attempt to use the best method available on your system, which means you
can use taxadb without having to worry about these details. However,
becoming familiar with these backends can yield significant improvements
in performance.
RSQLite
RSQLite is the default database backend if no suggested backend is
detected. RSQLite has no external software dependencies and will be
automatically installed with taxadb (it is a hard dependency as an
imported rather than suggested package). The term Lite indicates that
SQLite does not require the separate “server” and “client” software
model found on traditional databases such as MySQL, and SQLite is widely
used in consumer software everywhere. RSQLite packages SQLite for R. It
enables persistent local storage for R applications but will be slower
than the alternatives. For certain operations it can be significantly
slower.
MonetDBLite & duckdb
MonetDBLite is a modern alternative to RSQLite. MonetDBLite is both more
powerful than SQLite (in supporting a greater array of operations) and
can run much faster. Filtering joins in particular can be much faster
even than the in-memory operations of dplyr. Because filtering joins lie
at the heart of many taxadb functions, this can yield substantial
improvements in performance. Unfortunately, the R interface,
MonetDBLite, was removed from CRAN in April 2019. The package can still
be installed from GitHub by running
devtools::install_github("hannesmuehleisen/MonetDBLite-R"), though this
requires the appropriate compilers. The developer plans to replace
MonetDBLite with duckdb (see https://github.com/duckdb/duckdb), but this
is not yet feature complete and thus not yet fully compatible with
taxadb. Because installation is more difficult, MonetDBLite is not a
required dependency, but will be used by default if taxadb detects an
existing installation. Once it is compatible, duckdb support will be
switched on as the first priority in the method waterfall (the order in
which the default selector tries backends).
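To make the term "filtering join" concrete, here is a minimal sketch using dplyr on ordinary data frames. The table contents and identifiers are invented for illustration, and this is not taxadb's internal code; it only shows the kind of operation whose speed depends on the backend.

```r
library(dplyr)

# Hypothetical miniature taxonomy table; real provider tables are far larger.
taxa <- data.frame(
  taxonID = c("EX:1", "EX:2", "EX:3"),
  scientificName = c("Homo sapiens", "Pan troglodytes", "Gadus morhua"),
  stringsAsFactors = FALSE
)

# Names we want to look up, including one that has no match.
query <- data.frame(
  scientificName = c("Homo sapiens", "Gadus morhua", "Not a taxon"),
  stringsAsFactors = FALSE
)

# A filtering join keeps or drops rows of the first table based on matches
# in the second, without adding any columns from the second table.
matched   <- semi_join(taxa, query, by = "scientificName")  # rows of taxa with a match
unmatched <- anti_join(query, taxa, by = "scientificName")  # queried names with no match

print(matched)
print(unmatched)
```

When the same dplyr verbs are applied to a table held in a database connection, dplyr translates them to SQL and the database engine does the work, which is where the choice of backend matters.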
in-memory
taxadb can also be set to use in-memory storage only, without a database
backend. (Note that this is distinct from using RSQLite or MonetDBLite
in their own in-memory modes, because it uses only native R data.frames
to store data.) This will tend to be faster than RSQLite but slower than
MonetDBLite or duckdb. In this mode, data will persist over a single
session but not between sessions (since memory is cleared when the user
quits R). Note that many taxonomic tables are quite large when
uncompressed, and users with less than 8-16GB of free RAM may find their
machine becomes slow or unresponsive in this mode.
Manual control of the backend engine
Users can override the automatic preferences of taxadb by setting the
environment variable TAXADB_DRIVER. For example, running
Sys.setenv(TAXADB_DRIVER="RSQLite") will make RSQLite the default
driver, even if MonetDBLite is installed.
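As a small sketch of how this override might be applied for a limited time, the snippet below sets TAXADB_DRIVER, records the value that is in effect, and then restores the previous setting. It uses only base R; the variable names old_driver and active_driver are illustrative.

```r
# Remember the previous setting (NA if the variable was not set at all).
old_driver <- Sys.getenv("TAXADB_DRIVER", unset = NA)

# Ask taxadb to prefer RSQLite, regardless of what else is installed.
Sys.setenv(TAXADB_DRIVER = "RSQLite")
active_driver <- Sys.getenv("TAXADB_DRIVER")
print(active_driver)

# ... taxadb calls made here would see the override ...

# Restore the earlier state so later code gets the automatic selection back.
if (is.na(old_driver)) {
  Sys.unsetenv("TAXADB_DRIVER")
} else {
  Sys.setenv(TAXADB_DRIVER = old_driver)
}
```

Setting the variable in a startup file such as .Renviron is one way to make the choice persist across sessions.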
Local storage
The first time taxadb accesses a data source, it will download and store
the full dataset from that provider. Users can trigger a download ahead
of time by running td_create(), e.g. td_create("fb") will create a local
copy of the FishBase taxonomy. If a user does not call td_create()
first, taxadb simply downloads the data the first time that provider is
queried, e.g. filter_name("Homo sapiens", "gbif") will first download
and install GBIF if that has not been done already. These download and
install operations may be slow depending on your internet connection,
but need be performed only once. Downloaded data is stored on your local
hard disk and will persist between R sessions. The default location is
the standard application data directory for your operating system (see
the rappdirs package). Users can configure this location by setting the
environment variable TAXADB_HOME. For example, all unit tests in the
package use temporary storage by setting
Sys.setenv(TAXADB_HOME=tempdir()), which is cleared out after the R
session ends.
A user can install all available name providers up front with
td_create("all"). An overview of the available scientific name providers
is found in the providers vignette.
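The storage settings above can be sketched as follows. The snippet points TAXADB_HOME at a temporary directory, as the package's unit tests are described as doing, and keeps the download itself behind a switch because it needs a network connection. The run_download flag is a hypothetical convenience for this sketch, not a taxadb option.

```r
# Store taxadb data in a throwaway directory for this session only.
Sys.setenv(TAXADB_HOME = tempdir())
storage_dir <- Sys.getenv("TAXADB_HOME")
print(storage_dir)

# Hypothetical switch: set to TRUE to actually download the FishBase taxonomy.
run_download <- FALSE

if (run_download && requireNamespace("taxadb", quietly = TRUE)) {
  # Fetch the provider ahead of time rather than on first query.
  taxadb::td_create("fb")
}
```

Because tempdir() is removed when R exits, data downloaded this way would have to be fetched again in the next session; leave TAXADB_HOME unset to keep the persistent default location.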
Other backends
taxadb will work just as well with any DBI-compatible database backend
(Postgres, MariaDB, etc). All taxadb functions take an argument
taxadb_db, which is just a DBI connection used by dplyr. For example, we
can create an in-memory RSQLite connection and use that to store data
for a single session:
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
taxadb::get_ids("Homo sapiens", taxadb_db = con)
Users can also call the td_connect() function to connect to taxadb’s
default databases. Running td_connect() with no arguments will return
the current default connection. This is a convenient way to confirm that
your system is using the database engine you intended it to use. You can
also use that connection to interact directly with the taxadb databases
(e.g. using dplyr or DBI functions).
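One way to inspect a connection is sketched below. To stay runnable without downloading anything, it uses an in-memory RSQLite connection as a stand-in; with taxadb installed, the object returned by taxadb::td_connect() could be passed to the same helper. The helper describe_connection() and the toy_taxa table are hypothetical and not part of taxadb.

```r
library(DBI)

# Hypothetical helper: summarise any DBI connection.
describe_connection <- function(con) {
  list(
    engine = class(con)[1],          # class of the connection object identifies the driver
    tables = DBI::dbListTables(con)  # tables currently stored in the database
  )
}

# Stand-in connection for this sketch.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Toy table with invented identifiers, so there is something to list and query.
DBI::dbWriteTable(con, "toy_taxa", data.frame(
  taxonID = c("EX:1", "EX:2"),
  scientificName = c("Homo sapiens", "Gadus morhua"),
  stringsAsFactors = FALSE
))

info <- describe_connection(con)
print(info)

# Query the database directly with a DBI function.
hits <- DBI::dbGetQuery(
  con,
  "SELECT taxonID FROM toy_taxa WHERE scientificName = 'Homo sapiens'"
)
print(hits)

DBI::dbDisconnect(con)
```

The engine field is the quickest check that the backend in use is the one you intended.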