
This function provides similar, but simplified, functionality to the command-line wget utility. It is based on the rvest package.


  level = 0,
  wait = 0,
  accept_follow = c("(/|\\.html?)$"),
  reject_follow = character(),
  accept_download = bb_rget_default_downloads(),
  accept_download_extra = character(),
  reject_download = character(),
  clobber = 1,
  no_parent = TRUE,
  no_parent_download = no_parent,
  no_check_certificate = FALSE,
  relative = FALSE,
  remote_time = TRUE,
  verbose = FALSE,
  show_progress = verbose,
  debug = FALSE,
  dry_run = FALSE,
  stop_on_download_error = FALSE,
  use_url_directory = TRUE,
  no_host = FALSE,
  cut_dirs = 0L,
  link_css = "a",




string: the URL to retrieve


integer >=0: recursively download to this maximum depth level. Specify 0 for no recursion


numeric >=0: wait this number of seconds between successive retrievals. This option may help with servers that block users making too many requests in a short period of time


character: character vector with one or more entries. Each entry specifies a regular expression that is applied to the complete URL. URLs matching all entries will be followed during the spidering process. Note that the first URL (provided via the url parameter) will always be visited, unless it matches the download criteria
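As a rough illustration of the "match all entries" rule, the sketch below applies the default accept_follow pattern from the usage above to a few made-up URLs using base R only. The helper matches_all and the example URLs are hypothetical and are not part of the package; the function's actual matching code may differ.

```r
# Default follow pattern from the usage section: URLs ending in "/", ".htm" or ".html"
accept_follow <- c("(/|\\.html?)$")

# Hypothetical URLs, for illustration only
urls <- c(
  "http://some.where/place/",
  "http://some.where/place/index.html",
  "http://some.where/place/data.csv"
)

# Hypothetical helper: TRUE when `url` matches every pattern in `patterns`
matches_all <- function(url, patterns) {
  all(vapply(patterns, function(p) grepl(p, url), logical(1)))
}

follow <- vapply(urls, matches_all, logical(1),
                 patterns = accept_follow, USE.NAMES = FALSE)

# The directory and the HTML page match; the CSV file does not
print(data.frame(url = urls, follow = follow))
```

Adding a second entry to accept_follow narrows the set of followed URLs, because a URL has to match every entry.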


character: as for accept_follow, but specifying URL regular expressions to reject


character: character vector with one or more entries. Each entry specifies a regular expression that is applied to the complete URL. URLs that match all entries will be accepted for download. By default the accept_download parameter is that returned by bb_rget_default_downloads: use bb_rget_default_downloads() to see what that is


character: character vector with one or more entries. If provided, URLs will be accepted for download if they match all entries in accept_download OR all entries in accept_download_extra. This is a convenient method to add one or more extra download types, without needing to re-specify the defaults in accept_download
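The OR relationship between accept_download and accept_download_extra can be sketched in base R as follows. The patterns here are invented for the example (the real defaults are whatever bb_rget_default_downloads() returns), and would_download is a hypothetical helper, not the package's implementation.

```r
# Hypothetical patterns; the real defaults come from bb_rget_default_downloads()
accept_download <- c("\\.(zip|nc)$")
accept_download_extra <- c("\\.csv$")

# Hypothetical helper: TRUE when `url` matches every pattern in `patterns`
matches_all <- function(url, patterns) {
  all(vapply(patterns, function(p) grepl(p, url), logical(1)))
}

# Accept when all accept_download entries match, OR when
# accept_download_extra is non-empty and all of its entries match
would_download <- function(url, accept, extra = character()) {
  matches_all(url, accept) ||
    (length(extra) > 0 && matches_all(url, extra))
}

urls <- c(
  "http://some.where/place/a.zip",
  "http://some.where/place/b.csv",
  "http://some.where/place/c.txt"
)

download <- vapply(urls, would_download, logical(1),
                   accept = accept_download, extra = accept_download_extra,
                   USE.NAMES = FALSE)
print(data.frame(url = urls, download = download))
```

The length check matters in this sketch: an empty pattern vector trivially "matches all entries", so without it an empty accept_download_extra would accept every URL.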


character: as for accept_download, but specifying URL regular expressions to reject


string: username used to authenticate to the remote server


string: password used to authenticate to the remote server


numeric: 0=do not overwrite existing files, 1=overwrite if the remote file is newer than the local copy, 2=always overwrite existing files


logical: if TRUE, do not ever ascend to the parent directory when retrieving recursively. This is TRUE by default, because it guarantees that only the files below a certain hierarchy will be downloaded. Note that this check only applies to links on the same host as the starting URL. If that URL links to files on another host, those links will be followed (unless relative = TRUE)


logical: similar to no_parent, but applies only to download links. A typical use case is to set no_parent to TRUE and no_parent_download to FALSE, in which case the spidering process (following links to find downloadable files) will not ascend to the parent directory, but files can be downloaded from a directory that is not within the parent


logical: if TRUE, don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate. This option might be useful if trying to download files from a server with an expired certificate, but it is clearly a security risk and so should be used with caution


logical: if TRUE, only follow relative links. This can be useful for restricting what is downloaded in recursive mode


logical: if TRUE, attempt to set the local file's time to that of the remote file


logical: print trace output?


logical: if TRUE, show download progress


logical: if TRUE, will print additional debugging information. If bb_rget is not behaving as expected, try setting this to TRUE


logical: if TRUE, spider the remote site and work out which files would be downloaded, but don't download them


logical: if TRUE, the download process will stop if any file download fails. If FALSE, the process will issue a warning and continue to the next file to download


character: if provided, then each url will be treated as a single URL (no recursion will be conducted). It will be downloaded to a file with the name given by force_local_filename, in a local directory determined by the url. force_local_filename should be a character vector of the same length as the url vector


logical: if TRUE, files will be saved into a local directory that follows the URL structure (e.g. files from http://some.where/place will be saved into directory some.where/place). If FALSE, files will be saved into the current directory


logical: if use_url_directory = TRUE, specifying no_host = TRUE will remove the host name from the directory (e.g. files from http://some.where/place will be saved into directory place)


integer: if use_url_directory = TRUE, specifying cut_dirs will remove this many directory levels from the path of the local directory where files will be saved (e.g. if cut_dirs = 2, files from http://some.where/place/baa/haa will be saved into directory some.where/haa. If cut_dirs = 1 and no_host = TRUE, files from http://some.where/place/baa/haa will be saved into directory baa/haa)
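
The combined effect of use_url_directory, no_host, and cut_dirs can be checked against the documented examples with a small base R sketch. The function local_dir below is hypothetical: it only reproduces the examples given above for the use_url_directory = TRUE case and is not the package's own path logic.

```r
# Hypothetical helper that reproduces the documented examples
# (assumes use_url_directory = TRUE and that the URL path is a directory)
local_dir <- function(url, no_host = FALSE, cut_dirs = 0L) {
  path <- sub("^[a-zA-Z]+://", "", url)            # drop the scheme
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]                    # drop empty components
  host <- parts[1]
  dirs <- parts[-1]
  if (cut_dirs > 0 && length(dirs) > 0) {
    # remove the leading `cut_dirs` directory levels
    dirs <- dirs[-seq_len(min(cut_dirs, length(dirs)))]
  }
  paste(c(if (!no_host) host, dirs), collapse = "/")
}

local_dir("http://some.where/place")                      # "some.where/place"
local_dir("http://some.where/place", no_host = TRUE)      # "place"
local_dir("http://some.where/place/baa/haa", cut_dirs = 2) # "some.where/haa"
local_dir("http://some.where/place/baa/haa",
          cut_dirs = 1, no_host = TRUE)                   # "baa/haa"
```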

string: css selector that identifies links (passed as the css parameter to html_elements). Note that link elements must have an href attribute


named list: options to use with curl downloads, passed to the .list parameter of curl::new_handle


list: named list of arguments to provide to get_bucket_df and put_object. Files will be uploaded into that bucket instead of the local filesystem


a list with components 'ok' (TRUE/FALSE), 'files', and 'message' (error or other messages)
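
A minimal usage sketch, assuming bb_rget is available from the bowerbird package (the package name is not stated on this page) and using a placeholder URL. It performs a dry run, so the site is spidered but nothing is downloaded, and then inspects the 'ok', 'files', and 'message' components of the result. The call is guarded because it makes network requests.

```r
# Arguments taken from the parameter list above; the URL is a placeholder
args <- list(
  url = "http://some.where/place/",
  level = 1,                          # follow links one level deep
  wait = 1,                           # pause 1 second between retrievals
  accept_download_extra = "\\.csv$",  # also accept CSV files
  dry_run = TRUE,                     # work out what would be downloaded, but don't download
  verbose = TRUE
)

# Only run when the session is interactive and the (assumed) package is installed
if (interactive() && requireNamespace("bowerbird", quietly = TRUE)) {
  res <- do.call(bowerbird::bb_rget, args)
  if (!res$ok) message(res$message)
  print(res$files)
}
```

Once the dry run lists the expected files, setting dry_run = FALSE in the same argument list performs the actual download.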


NOTE: this is still somewhat experimental.