Connect directly to the remote GBIF snapshot. This can be much faster than downloading the data for one-off use, or when running the package from a server in the same region as the data. See Details.
Usage
gbif_remote(
version = gbif_version(),
bucket = gbif_default_bucket(),
to_duckdb = FALSE,
safe = TRUE,
unset_aws = getOption("gbif_unset_aws", TRUE),
endpoint_override = Sys.getenv("AWS_S3_ENDPOINT", "s3.amazonaws.com"),
...
)
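A minimal sketch of a typical session, assuming network access to the public bucket (the arrow and dplyr packages must be installed; the pipeline shown is illustrative):

library(dplyr)
library(gbifdb)

# Open the remote snapshot (an arrow Dataset by default)
gbif <- gbif_remote()

# Queries evaluate lazily; head() + collect() previews a few rows
gbif |>
  head() |>
  collect()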
Arguments
- version
GBIF snapshot date.
- bucket
GBIF bucket name (including region). A default can also be set using the option gbif_default_bucket; see options.
- to_duckdb
Return a remote duckdb connection instead of an arrow connection? (See the sketch after this list.)
- safe
logical, default TRUE. Should the columns mediatype and issue be excluded? The varchar datatype on these columns substantially slows down queries.
- unset_aws
Unset AWS credentials? GBIF is provided in a public bucket, so credentials are not needed, but having AWS_ACCESS_KEY_ID or other AWS environment variables set can cause the connection to fail. By default, any such environment variables are unset for the duration of the R session. This behavior can also be turned off globally by setting the option gbif_unset_aws to FALSE (e.g. to use an alternative network endpoint).
- endpoint_override
optional parameter passed to arrow::s3_bucket().
- ...
additional parameters passed to arrow::s3_bucket().
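As referenced above, a brief sketch of how these arguments can be combined; the alternative endpoint shown is a hypothetical placeholder, not a real GBIF mirror:

library(gbifdb)

# Return a duckdb-backed tibble instead of an arrow Dataset
gbif_duck <- gbif_remote(to_duckdb = TRUE)

# Keep the slow 'mediatype' and 'issue' columns
gbif_full <- gbif_remote(safe = FALSE)

# Use an S3-compatible mirror without unsetting AWS credentials
# ("minio.example.org" is hypothetical)
options(gbif_unset_aws = FALSE)
gbif_alt <- gbif_remote(endpoint_override = "minio.example.org")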
Value
an arrow Dataset query object (by default), or a remote duckdb tibble (tbl_sql class) if to_duckdb is TRUE. In either case, users should call dplyr::collect() on the final result to force evaluation and bring the resulting data into memory in R.
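For example, assuming the default arrow connection (the filter values are illustrative; column names follow the snapshot schema):

library(dplyr)
library(gbifdb)

gbif <- gbif_remote()

# Build the query lazily, then collect() to bring results into R
gbif |>
  filter(phylum == "Chordata", year > 1990) |>
  count(class, year) |>
  collect()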
Details
Query performance improves dramatically when only a subset of columns is returned. Consider using an explicit select() call to request only the columns you need.
A summary of this GBIF data, along with column meanings, can be found at https://github.com/gbif/occurrence/blob/master/aws-public-data.md
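A short sketch of this pattern (the species and column names are illustrative; see the schema summary linked above):

library(dplyr)
library(gbifdb)

gbif <- gbif_remote()

# Selecting columns up front keeps the remote scan small
gbif |>
  select(species, decimallatitude, decimallongitude, year) |>
  filter(species == "Sterna paradisaea") |>
  collect()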