Define a data source

This function is used to define a data source, which can then be added to a bowerbird data repository configuration. Passing the configuration object to bb_sync will trigger a download of all of the data sources in that configuration.

Usage

bb_source(
  id,
  name,
  description = NA_character_,
  doc_url,
  source_url,
  citation,
  license,
  comment = NA_character_,
  method,
  postprocess,
  authentication_note = NA_character_,
  user = NA_character_,
  password = NA_character_,
  access_function = NA_character_,
  data_group = NA_character_,
  collection_size = NA,
  warn_empty_auth = TRUE
)

Arguments

id: string: (required) a unique identifier of the data source. If the data source has a DOI, use that. Otherwise, if the original data provider has an identifier for this dataset, that is probably a good choice here (include the data version number if there is one). The ID should be something that changes when the data set changes (is updated). A DOI is ideal for this
name: string: (required) a unique name for the data source. This should be a human-readable but still concise name
description: string: a plain-language description of the data source, provided so that users can get an idea of what the data source contains (for full details they can consult the doc_url link)
doc_url: string: (required) URL to the metadata record or other documentation of the data source
source_url: character vector: one or more source URLs. Required for bb_handler_rget, although some method functions might not require one
citation: string: (required) details of the citation for the data source
license: string: (required) description of the license. For standard licenses (e.g. creative commons) include the license descriptor ("CC-BY", etc)
comment: string: comments about the data source. If only part of the original data collection is mirrored, mention that here
method: list (required): a list object that defines the function used to synchronize this data source. The first element of the list is the function name (as a string or function). Additional list elements can be used to specify additional parameters to pass to that function. Note that bb_sync automatically passes the data repository configuration object as the first parameter to the method handler function. If the handler function uses bb_rget (e.g. bb_handler_rget), these extra parameters are passed through to the bb_rget function
postprocess: list: each element of postprocess defines a postprocessing step to be run after the main synchronization has happened. Each element of this list can be a function or string function name, or a list in the style of list(fun,arg1=val1,arg2=val2) where fun is the function to be called and arg1 and arg2 are additional parameters to pass to that function
authentication_note: string: if authentication is required in order to access this data source, make a note of the process (include a URL to the registration page, if possible)
user: string: username, if required
password: string: password, if required
access_function: string: can be used to suggest to users an appropriate function to read these data files. Provide the name of an R function or even a code snippet
data_group: string: the name of the group to which this data source belongs. Useful for arranging sources in terms of thematic areas
collection_size: numeric: approximate disk space (in GB) used by the data collection, if known. If the data are supplied as compressed files, this size should reflect the disk space used after decompression. If the data_source definition contains multiple source_url entries, this size should reflect the overall disk space used by all combined
warn_empty_auth: logical: if TRUE, issue a warning if the data source requires authentication (authentication_note is not NA) but user and password have not been provided. Set this to FALSE if you are defining a data source for others to use with their own credentials: they will typically call your data source constructor and then modify the user and password components

Value

a tibble with columns as per the function arguments (excluding warn_empty_auth)

Details

The method parameter defines the handler function used to synchronize this data source, and any extra parameters that need to be passed to it.

Parameters marked as "required" are the minimal set needed to define a data source. Other parameters are either not relevant to all data sources (e.g. postprocess, user, password) or provide metadata to users that is not strictly necessary to allow the data source to be synchronized (e.g. description, access_function, data_group). Note that three of the "required" parameters (namely citation, license, and doc_url) are not strictly needed by the synchronization code, but are treated as "required" because of their fundamental importance to reproducible science.

See vignette("bowerbird") for more examples and discussion of defining data sources.

Examples


## a minimal definition for the GSHHG coastline data set:

my_source <- bb_source(
   id = "gshhg_coastline",
   name = "GSHHG coastline data",
   doc_url = "http://www.soest.hawaii.edu/pwessel/gshhg",
   citation = "Wessel, P., and W. H. F. Smith, A Global Self-consistent, Hierarchical,
     High-resolution Shoreline Database, J. Geophys. Res., 101, 8741-8743, 1996",
   source_url = "ftp://ftp.soest.hawaii.edu/gshhg/",
   license = "LGPL",
   method = list("bb_handler_rget",level = 1, accept_download = "README|bin.*\\.zip$"))

## a more complete definition, which unzips the files after downloading and also
##  provides an indication of the size of the dataset

my_source <- bb_source(
   id = "gshhg_coastline",
   name = "GSHHG coastline data",
   description = "A Global Self-consistent, Hierarchical, High-resolution Geography Database",
   doc_url = "http://www.soest.hawaii.edu/pwessel/gshhg",
   citation = "Wessel, P., and W. H. F. Smith, A Global Self-consistent, Hierarchical,
     High-resolution Shoreline Database, J. Geophys. Res., 101, 8741-8743, 1996",
   source_url = "ftp://ftp.soest.hawaii.edu/gshhg/*",
   license = "LGPL",
   method = list("bb_handler_rget", level = 1, accept_download = "README|bin.*\\.zip$"),
   postprocess = list("bb_unzip"),
   collection_size = 0.6)

## define a data repository configuration
cf <- bb_config("/my/repo/root")

## add this source to the repository
cf <- bb_add(cf, my_source)

if (FALSE) { # \dontrun{
## sync the repo
bb_sync(cf)
} # }

Usage

Arguments

Value

Details

See also

Examples

About

Community

Resources