Skip to contents

This function is used to define a data source, which can then be added to a bowerbird data repository configuration. Passing the configuration object to bb_sync will trigger a download of all of the data sources in that configuration.


  description = NA_character_,
  comment = NA_character_,
  authentication_note = NA_character_,
  user = NA_character_,
  password = NA_character_,
  access_function = NA_character_,
  data_group = NA_character_,
  collection_size = NA,
  warn_empty_auth = TRUE



string: (required) a unique identifier of the data source. If the data source has a DOI, use that. Otherwise, if the original data provider has an identifier for this dataset, that is probably a good choice here (include the data version number if there is one). The ID should be something that changes when the data set changes (is updated). A DOI is ideal for this


string: (required) a unique name for the data source. This should be a human-readable but still concise name


string: a plain-language description of the data source, provided so that users can get an idea of what the data source contains (for full details they can consult the doc_url link)


string: (required) URL to the metadata record or other documentation of the data source


character vector: one or more source URLs. Required for bb_handler_rget, although some method functions might not require one


string: (required) details of the citation for the data source


string: (required) description of the license. For standard licenses (e.g. creative commons) include the license descriptor ("CC-BY", etc)


string: comments about the data source. If only part of the original data collection is mirrored, mention that here


list (required): a list object that defines the function used to synchronize this data source. The first element of the list is the function name (as a string or function). Additional list elements can be used to specify additional parameters to pass to that function. Note that bb_sync automatically passes the data repository configuration object as the first parameter to the method handler function. If the handler function uses bb_rget (e.g. bb_handler_rget), these extra parameters are passed through to the bb_rget function


list: each element of postprocess defines a postprocessing step to be run after the main synchronization has happened. Each element of this list can be a function or string function name, or a list in the style of list(fun,arg1=val1,arg2=val2) where fun is the function to be called and arg1 and arg2 are additional parameters to pass to that function


string: if authentication is required in order to access this data source, make a note of the process (include a URL to the registration page, if possible)


string: username, if required


string: password, if required


string: can be used to suggest to users an appropriate function to read these data files. Provide the name of an R function or even a code snippet


string: the name of the group to which this data source belongs. Useful for arranging sources in terms of thematic areas


numeric: approximate disk space (in GB) used by the data collection, if known. If the data are supplied as compressed files, this size should reflect the disk space used after decompression. If the data_source definition contains multiple source_url entries, this size should reflect the overall disk space used by all combined


logical: if TRUE, issue a warning if the data source requires authentication (authentication_note is not NA) but user and password have not been provided. Set this to FALSE if you are defining a data source for others to use with their own credentials: they will typically call your data source constructor and then modify the user and password components


a tibble with columns as per the function arguments (excluding warn_empty_auth)


The method parameter defines the handler function used to synchronize this data source, and any extra parameters that need to be passed to it.

Parameters marked as "required" are the minimal set needed to define a data source. Other parameters are either not relevant to all data sources (e.g. postprocess, user, password) or provide metadata to users that is not strictly necessary to allow the data source to be synchronized (e.g. description, access_function, data_group). Note that three of the "required" parameters (namely citation, license, and doc_url) are not strictly needed by the synchronization code, but are treated as "required" because of their fundamental importance to reproducible science.

See vignette("bowerbird") for more examples and discussion of defining data sources.


## a minimal definition for the GSHHG coastline data set:

my_source <- bb_source(
   id = "gshhg_coastline",
   name = "GSHHG coastline data",
   doc_url = "",
   citation = "Wessel, P., and W. H. F. Smith, A Global Self-consistent, Hierarchical,
     High-resolution Shoreline Database, J. Geophys. Res., 101, 8741-8743, 1996",
   source_url = "",
   license = "LGPL",
   method = list("bb_handler_rget",level = 1, accept_download = "README|bin.*\\.zip$"))

## a more complete definition, which unzips the files after downloading and also
##  provides an indication of the size of the dataset

my_source <- bb_source(
   id = "gshhg_coastline",
   name = "GSHHG coastline data",
   description = "A Global Self-consistent, Hierarchical, High-resolution Geography Database",
   doc_url = "",
   citation = "Wessel, P., and W. H. F. Smith, A Global Self-consistent, Hierarchical,
     High-resolution Shoreline Database, J. Geophys. Res., 101, 8741-8743, 1996",
   source_url = "*",
   license = "LGPL",
   method = list("bb_handler_rget", level = 1, accept_download = "README|bin.*\\.zip$"),
   postprocess = list("bb_unzip"),
   collection_size = 0.6)

## define a data repository configuration
cf <- bb_config("/my/repo/root")

## add this source to the repository
cf <- bb_add(cf, my_source)

if (FALSE) {
## sync the repo