This function is used to define a data source, which can then be added to a bowerbird data repository configuration. Passing the configuration object to bb_sync
will trigger a download of all of the data sources in that configuration.
Usage
bb_source(
  id,
  name,
  description = NA_character_,
  doc_url,
  source_url,
  citation,
  license,
  comment = NA_character_,
  method,
  postprocess,
  authentication_note = NA_character_,
  user = NA_character_,
  password = NA_character_,
  access_function = NA_character_,
  data_group = NA_character_,
  collection_size = NA,
  warn_empty_auth = TRUE
)
Arguments
- id
string: (required) a unique identifier of the data source. If the data source has a DOI, use that. Otherwise, if the original data provider has an identifier for this dataset, that is probably a good choice (include the data version number if there is one). The ID should change whenever the data set is updated, which is why a DOI is ideal
- name
string: (required) a unique name for the data source. This should be a human-readable but still concise name
- description
string: a plain-language description of the data source, provided so that users can get an idea of what the data source contains (for full details they can consult the doc_url link)
- doc_url
string: (required) URL to the metadata record or other documentation of the data source
- source_url
character vector: one or more source URLs. Required for bb_handler_rget, although some method functions might not require one
- citation
string: (required) details of the citation for the data source
- license
string: (required) description of the license. For standard licenses (e.g. Creative Commons) include the license descriptor ("CC-BY", etc.)
- comment
string: comments about the data source. If only part of the original data collection is mirrored, mention that here
- method
list: (required) a list object that defines the function used to synchronize this data source. The first element of the list is the function name (as a string or function). Additional list elements can be used to specify additional parameters to pass to that function. Note that bb_sync automatically passes the data repository configuration object as the first parameter to the method handler function. If the handler function uses bb_rget (e.g. bb_handler_rget), these extra parameters are passed through to the bb_rget function
- postprocess
list: each element of postprocess defines a postprocessing step to be run after the main synchronization has happened. Each element of this list can be a function or string function name, or a list in the style of list(fun, arg1 = val1, arg2 = val2) where fun is the function to be called and arg1 and arg2 are additional parameters to pass to that function
- authentication_note
string: if authentication is required in order to access this data source, make a note of the process (include a URL to the registration page, if possible)
- user
string: username, if required
- password
string: password, if required
- access_function
string: can be used to suggest to users an appropriate function to read these data files. Provide the name of an R function or even a code snippet
- data_group
string: the name of the group to which this data source belongs. Useful for arranging sources in terms of thematic areas
- collection_size
numeric: approximate disk space (in GB) used by the data collection, if known. If the data are supplied as compressed files, this size should reflect the disk space used after decompression. If the data source definition contains multiple source_url entries, this size should reflect the total disk space used by all of them combined
- warn_empty_auth
logical: if TRUE, issue a warning if the data source requires authentication (authentication_note is not NA) but user and password have not been provided. Set this to FALSE if you are defining a data source for others to use with their own credentials: they will typically call your data source constructor and then modify the user and password components
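The list style shared by method and postprocess (a function or function name first, then named extra parameters) can be illustrated in plain R. The sketch below is only an illustration of that convention, not bowerbird's own dispatch code: run_step and demo_handler are hypothetical names invented here, and the string "my_config" merely stands in for the configuration object that bb_sync passes as the first parameter.

```r
# Hypothetical helper (not part of bowerbird): call a step given as a function,
# a function name, or a list in the style list(fun, arg1 = val1, ...)
run_step <- function(step, ...) {
  if (!is.list(step)) step <- list(step)
  fun <- match.fun(step[[1]])   # accepts a function or its name as a string
  extra <- step[-1]             # remaining elements are the extra parameters
  do.call(fun, c(list(...), extra))
}

# Stand-in for a handler function: it just reports what it received
demo_handler <- function(config, level = 0, accept_download = NULL) {
  list(config = config, level = level, accept_download = accept_download)
}

# The first argument plays the role of the configuration object
res <- run_step(
  list("demo_handler", level = 1, accept_download = "\\.zip$"),
  "my_config"
)
str(res)
```

The same helper accepts a bare function or function name, which mirrors the shorter forms allowed for postprocess elements.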
Details
The method parameter defines the handler function used to synchronize this data source, and any extra parameters that need to be passed to it.
Parameters marked as "required" are the minimal set needed to define a data source. Other parameters are either not relevant to all data sources (e.g. postprocess, user, password) or provide metadata to users that is not strictly necessary to allow the data source to be synchronized (e.g. description, access_function, data_group). Note that three of the "required" parameters (namely citation, license, and doc_url) are not strictly needed by the synchronization code, but are treated as "required" because of their fundamental importance to reproducible science.
See vignette("bowerbird") for more examples and discussion of defining data sources.
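A data source that needs a login, defined for other people to use with their own credentials, might look like the following sketch. It supplies only the required parameters plus authentication_note, and sets warn_empty_auth = FALSE so that leaving user and password empty does not raise a warning. The constructor name my_protected_source, the id, the example.org URLs, the citation, and the credentials are all placeholders invented for illustration; it also assumes that the object returned by bb_source is data-frame-like, so that its user and password components can be assigned with `$`.

```r
library(bowerbird)

# Hypothetical constructor: every value below is a placeholder, not a real data set
my_protected_source <- function() {
  bb_source(
    id = "example-protected-v1",
    name = "Example protected data",
    doc_url = "https://example.org/docs",
    source_url = "https://example.org/data/",
    citation = "Example Data Provider (2020). Example protected data, version 1",
    license = "CC-BY",
    authentication_note = "Requires a free account, see https://example.org/register",
    method = list("bb_handler_rget", level = 1),
    warn_empty_auth = FALSE  # credentials are deliberately left empty here
  )
}

# An end user calls the constructor, then fills in their own credentials
src <- my_protected_source()
src$user <- "my_username"
src$password <- "my_password"
```

Keeping credentials out of the constructor means the definition can be shared without exposing anyone's login details.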
Examples
## a minimal definition for the GSHHG coastline data set:
my_source <- bb_source(
id = "gshhg_coastline",
name = "GSHHG coastline data",
doc_url = "http://www.soest.hawaii.edu/pwessel/gshhg",
citation = "Wessel, P., and W. H. F. Smith, A Global Self-consistent, Hierarchical,
High-resolution Shoreline Database, J. Geophys. Res., 101, 8741-8743, 1996",
source_url = "ftp://ftp.soest.hawaii.edu/gshhg/",
license = "LGPL",
method = list("bb_handler_rget", level = 1, accept_download = "README|bin.*\\.zip$"))
## a more complete definition, which unzips the files after downloading and also
## provides an indication of the size of the dataset
my_source <- bb_source(
id = "gshhg_coastline",
name = "GSHHG coastline data",
description = "A Global Self-consistent, Hierarchical, High-resolution Geography Database",
doc_url = "http://www.soest.hawaii.edu/pwessel/gshhg",
citation = "Wessel, P., and W. H. F. Smith, A Global Self-consistent, Hierarchical,
High-resolution Shoreline Database, J. Geophys. Res., 101, 8741-8743, 1996",
source_url = "ftp://ftp.soest.hawaii.edu/gshhg/*",
license = "LGPL",
method = list("bb_handler_rget", level = 1, accept_download = "README|bin.*\\.zip$"),
postprocess = list("bb_unzip"),
collection_size = 0.6)
## define a data repository configuration
cf <- bb_config("/my/repo/root")
## add this source to the repository
cf <- bb_add(cf, my_source)
## sync the repo (not run)
if (FALSE) {
  bb_sync(cf)
}