This function provides similar, but simplified, functionality to the command-line wget utility. It is based on the rvest package.
Usage
bb_rget(
url,
level = 0,
wait = 0,
accept_follow = c("(/|\\.html?)$"),
reject_follow = character(),
accept_download = bb_rget_default_downloads(),
accept_download_extra = character(),
reject_download = character(),
user,
password,
clobber = 1,
no_parent = TRUE,
no_parent_download = no_parent,
no_check_certificate = FALSE,
relative = FALSE,
remote_time = TRUE,
verbose = FALSE,
show_progress = verbose,
debug = FALSE,
dry_run = FALSE,
stop_on_download_error = FALSE,
force_local_filename,
use_url_directory = TRUE,
no_host = FALSE,
cut_dirs = 0L,
link_css = "a",
curl_opts,
target_s3_args
)
bb_rget_default_downloads()
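A minimal sketch of a call (the URL below is a placeholder rather than a real data source; with dry_run = TRUE the site is spidered but nothing is downloaded):

## spider one level of links below the starting URL and report what would be downloaded
res <- bb_rget(url = "https://data.example.org/archive/", level = 1,
               dry_run = TRUE, verbose = TRUE)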
Arguments
- url
string: the URL to retrieve
- level
integer >=0: recursively download to this maximum depth level. Specify 0 for no recursion
- wait
numeric >=0: wait this number of seconds between successive retrievals. This option may help with servers that block users making too many requests in a short period of time
- accept_follow
character: character vector with one or more entries. Each entry specifies a regular expression that is applied to the complete URL. URLs matching all entries will be followed during the spidering process. Note that the first URL (provided via the url parameter) will always be visited, unless it matches the download criteria
- reject_follow
character: as for accept_follow, but specifying URL regular expressions to reject
- accept_download
character: character vector with one or more entries. Each entry specifies a regular expression that is applied to the complete URL. URLs that match all entries will be accepted for download. By default the accept_download parameter is that returned by bb_rget_default_downloads: use bb_rget_default_downloads() to see what that is
- accept_download_extra
character: character vector with one or more entries. If provided, URLs will be accepted for download if they match all entries in accept_download OR all entries in accept_download_extra. This is a convenient method to add one or more extra download types, without needing to re-specify the defaults in accept_download (see the example call at the end of this list)
- reject_download
character: as for accept_download, but specifying URL regular expressions to reject
- user
string: username used to authenticate to the remote server
- password
string: password used to authenticate to the remote server
- clobber
numeric: 0=do not overwrite existing files, 1=overwrite if the remote file is newer than the local copy, 2=always overwrite existing files
- no_parent
logical: if TRUE, do not ever ascend to the parent directory when retrieving recursively. This is TRUE by default, because it guarantees that only the files below a certain hierarchy will be downloaded. Note that this check only applies to links on the same host as the starting url. If that URL links to files on another host, those links will be followed (unless relative = TRUE)
- no_parent_download
logical: similar to no_parent, but applies only to download links. A typical use case is to set no_parent to TRUE and no_parent_download to FALSE, in which case the spidering process (following links to find downloadable files) will not ascend to the parent directory, but files can be downloaded from a directory that is not within the parent
- no_check_certificate
logical: if TRUE, don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate. This option might be useful if trying to download files from a server with an expired certificate, but it is clearly a security risk and so should be used with caution
- relative
logical: if TRUE, only follow relative links. This can be useful for restricting what is downloaded in recursive mode
- remote_time
logical: if TRUE, attempt to set the local file's time to that of the remote file
- verbose
logical: print trace output?
- show_progress
logical: if TRUE, show download progress
- debug
logical: if TRUE, will print additional debugging information. If bb_rget is not behaving as expected, try setting this to TRUE
- dry_run
logical: if TRUE, spider the remote site and work out which files would be downloaded, but don't download them
- stop_on_download_error
logical: if TRUE, the download process will stop if any file download fails. If FALSE, the process will issue a warning and continue to the next file to download
- force_local_filename
character: if provided, then each url will be treated as a single URL (no recursion will be conducted). It will be downloaded to a file with the name given by force_local_filename, in a local directory determined by the url. force_local_filename should be a character vector of the same length as the url vector
- use_url_directory
logical: if TRUE, files will be saved into a local directory that follows the URL structure (e.g. files from http://some.where/place will be saved into directory some.where/place). If FALSE, files will be saved into the current directory
- no_host
logical: if use_url_directory = TRUE, specifying no_host = TRUE will remove the host name from the directory (e.g. files from http://some.where/place will be saved into directory place)
- cut_dirs
integer: if use_url_directory = TRUE, specifying cut_dirs will remove this many directory levels from the path of the local directory where files will be saved (e.g. if cut_dirs = 2, files from http://some.where/place/baa/haa will be saved into directory some.where/haa; if cut_dirs = 1 and no_host = TRUE, files from http://some.where/place/baa/haa will be saved into directory baa/haa)
- link_css
string: CSS selector that identifies links (passed as the css parameter to html_elements). Note that link elements must have an href attribute
- curl_opts
named list: options to use with curl downloads, passed to the .list parameter of curl::new_handle
- target_s3_args
list: named list of arguments to provide to get_bucket_df and put_object. Files will be uploaded into that bucket instead of the local filesystem
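A fuller sketch combining several of the arguments above (again, the URL and the extra file-type regular expression are placeholders chosen purely for illustration):

## follow links one level deep without ascending above the starting directory;
## download whatever matches the default patterns plus any ".zip" files,
## waiting one second between requests and keeping remote file timestamps
res <- bb_rget(
  url = "https://data.example.org/archive/monthly/",
  level = 1,
  wait = 1,
  accept_download_extra = "\\.zip$",
  no_parent = TRUE,
  clobber = 1,                            ## only re-download if the remote copy is newer
  remote_time = TRUE,
  no_host = TRUE,
  cut_dirs = 1L,                          ## save under "monthly/..." rather than "data.example.org/archive/monthly/..."
  curl_opts = list(connecttimeout = 120), ## passed to curl::new_handle
  dry_run = TRUE,                         ## set to FALSE to actually download
  verbose = TRUE
)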