This function provides similar, but simplified, functionality to the command-line wget utility. It is based on the rvest package.
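
As a minimal sketch (the URL here is hypothetical), a recursive retrieval one level below a starting page might look like:

## hypothetical starting URL; bb_rget will spider one level deep and
## download files matching the default download extensions
res <- bb_rget("http://some.where/place/", level = 1, verbose = TRUE)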

Usage

bb_rget(
  url,
  level = 0,
  wait = 0,
  accept_follow = c("(/|\\.html?)$"),
  reject_follow = character(),
  accept_download = bb_rget_default_downloads(),
  accept_download_extra = character(),
  reject_download = character(),
  user,
  password,
  clobber = 1,
  no_parent = TRUE,
  no_parent_download = no_parent,
  no_check_certificate = FALSE,
  relative = FALSE,
  remote_time = TRUE,
  verbose = FALSE,
  show_progress = verbose,
  debug = FALSE,
  dry_run = FALSE,
  stop_on_download_error = FALSE,
  retries = 0,
  force_local_filename,
  use_url_directory = TRUE,
  no_host = FALSE,
  cut_dirs = 0L,
  link_css = "a",
  curl_opts,
  target_s3_args
)

bb_rget_default_downloads()

Arguments

url

string: the URL to retrieve

level

integer >=0: recursively download to this maximum depth level. Specify 0 for no recursion

wait

numeric >=0: wait this number of seconds between successive retrievals. This option may help with servers that block users making too many requests in a short period of time

accept_follow

character: character vector with one or more entries. Each entry specifies a regular expression that is applied to the complete URL. URLs matching all entries will be followed during the spidering process. Note that the first URL (provided via the url parameter) will always be visited, unless it matches the download criteria
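
For example (a sketch with a hypothetical URL), to follow only links into a data/ subdirectory while skipping server-generated index-sorting links:

res <- bb_rget("http://some.where/place/", level = 2,
               accept_follow = "/data/",  ## only follow URLs containing /data/
               reject_follow = "\\?C=")   ## skip e.g. Apache index sort links (?C=N;O=D)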

reject_follow

character: as for accept_follow, but specifying URL regular expressions to reject

accept_download

character: character vector with one or more entries. Each entry specifies a regular expression that is applied to the complete URL. URLs that match all entries will be accepted for download. By default the accept_download parameter is that returned by bb_rget_default_downloads: use bb_rget_default_downloads() to see what that is

accept_download_extra

character: character vector with one or more entries. If provided, URLs will be accepted for download if they match all entries in accept_download OR all entries in accept_download_extra. This is a convenient method to add one or more extra download types, without needing to re-specify the defaults in accept_download
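
As a sketch (hypothetical URL), to download NetCDF files in addition to the default download types:

res <- bb_rget("http://some.where/place/", level = 1,
               accept_download_extra = "\\.nc$")  ## also accept URLs ending in .nc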

reject_download

character: as for accept_download, but specifying URL regular expressions to reject

user

string: username used to authenticate to the remote server

password

string: password used to authenticate to the remote server

clobber

numeric: 0=do not overwrite existing files, 1=overwrite if the remote file is newer than the local copy, 2=always overwrite existing files

no_parent

logical: if TRUE, do not ever ascend to the parent directory when retrieving recursively. This is TRUE by default, because it guarantees that only the files below a certain hierarchy will be downloaded. Note that this check only applies to links on the same host as the starting URL. If that URL links to files on another host, those links will be followed (unless relative = TRUE)

no_parent_download

logical: similar to no_parent, but applies only to download links. A typical use case is to set no_parent to TRUE and no_parent_download to FALSE, in which case the spidering process (following links to find downloadable files) will not ascend to the parent directory, but files can be downloaded from a directory that is not within the parent
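
A sketch of that typical use case (hypothetical URL): spidering stays below the starting directory, but downloads from a sibling directory are still allowed:

res <- bb_rget("http://some.where/place/catalog/", level = 2,
               no_parent = TRUE,            ## don't follow links above /place/catalog/
               no_parent_download = FALSE)  ## but allow downloads from e.g. /place/files/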

no_check_certificate

logical: if TRUE, don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate. This option might be useful if trying to download files from a server with an expired certificate, but it is clearly a security risk and so should be used with caution

relative

logical: if TRUE, only follow relative links. This can be useful for restricting what is downloaded in recursive mode

remote_time

logical: if TRUE, attempt to set the local file's time to that of the remote file

verbose

logical: print trace output?

show_progress

logical: if TRUE, show download progress

debug

logical: if TRUE, will print additional debugging information. If bb_rget is not behaving as expected, try setting this to TRUE

dry_run

logical: if TRUE, spider the remote site and work out which files would be downloaded, but don't download them

stop_on_download_error

logical: if TRUE, the download process will stop if any file download fails. If FALSE, the process will issue a warning and continue to the next file to download

retries

integer: number of times to retry a request if it fails with a transient error (similar to curl, a transient error means a timeout, an FTP 4xx response code, or an HTTP 5xx response code)

force_local_filename

character: if provided, then each url will be treated as a single URL (no recursion will be conducted). It will be downloaded to a file with name given by force_local_filename, in a local directory determined by the url. force_local_filename should be a character vector of the same length as the url vector
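
For illustration (the URLs and filenames are hypothetical), this can be useful when URLs do not end in a usable filename:

res <- bb_rget(c("http://some.where/getfile?id=1",
                 "http://some.where/getfile?id=2"),
               force_local_filename = c("file1.csv", "file2.csv"))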

use_url_directory

logical: if TRUE, files will be saved into a local directory that follows the URL structure (e.g. files from http://some.where/place will be saved into directory some.where/place). If FALSE, files will be saved into the current directory

no_host

logical: if use_url_directory = TRUE, specifying no_host = TRUE will remove the host name from the directory (e.g. files from http://some.where/place will be saved into directory place)

cut_dirs

integer: if use_url_directory = TRUE, specifying cut_dirs will remove this many directory levels from the path of the local directory where files will be saved (e.g. if cut_dirs = 2, files from http://some.where/place/baa/haa will be saved into directory some.where/haa; if cut_dirs = 1 and no_host = TRUE, files from http://some.where/place/baa/haa will be saved into directory baa/haa)
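
As a sketch of how these options combine (hypothetical URL, mirroring the example above):

## no_host = TRUE drops "some.where", cut_dirs = 1 then drops "place",
## so files are saved under baa/haa/
res <- bb_rget("http://some.where/place/baa/haa/",
               no_host = TRUE, cut_dirs = 1)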

link_css

string: CSS selector that identifies links (passed as the css parameter to html_elements). Note that link elements must have an href attribute

curl_opts

named list: options to use with curl downloads, passed to the .list parameter of curl::new_handle
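
For example (a sketch; these are standard libcurl option names accepted by curl::new_handle):

res <- bb_rget("http://some.where/place/",  ## hypothetical URL
               curl_opts = list(connecttimeout = 120,   ## allow slow connections
                                low_speed_time = 60,    ## give up if stalled for 60s...
                                low_speed_limit = 100)) ## ...below 100 bytes/s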

target_s3_args

list: named list of arguments to provide to get_bucket_df and put_object. Files will be uploaded into that bucket instead of the local filesystem
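
A hedged sketch (the bucket name and region are hypothetical, and credentials are assumed to be supplied via the usual aws.s3 environment variables):

res <- bb_rget("http://some.where/place/",
               target_s3_args = list(bucket = "my-data-bucket", ## hypothetical bucket
                                     region = "us-east-1"))     ## passed to get_bucket_df/put_object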

Value

a list with components 'ok' (TRUE/FALSE), 'files', and 'message' (error or other messages)
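
A sketch of inspecting the result, here using dry_run = TRUE so nothing is actually downloaded (hypothetical URL):

res <- bb_rget("http://some.where/place/", level = 1, dry_run = TRUE)
if (res$ok) {
  res$files    ## the files that would have been downloaded
} else {
  res$message  ## error or other messages
}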

Details

NOTE: this is still somewhat experimental.