This function is an R wrapper around the command-line wget utility, which is called using either the exec_wait or the exec_internal function from the sys package. Almost all of the parameters to bb_wget are translated into command-line flags to wget. Call bb_wget("--help") for more information about wget's command-line flags. If required, command-line flags without equivalent bb_wget function parameters can be passed via the extra_flags parameter.
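
A minimal call might look like the following. This is a sketch only: the URL is a placeholder, and running it requires wget to be installed and on the system path.

bb_wget("https://example.org/data/", recursive = TRUE, level = 2,
        accept = "\"*.csv\"", robots_off = TRUE)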
Usage
bb_wget(
  url,
  recursive = TRUE,
  level = 1,
  wait = 0,
  accept,
  reject,
  accept_regex,
  reject_regex,
  exclude_directories,
  restrict_file_names,
  progress,
  user,
  password,
  output_file,
  robots_off = FALSE,
  timestamping = FALSE,
  no_if_modified_since = FALSE,
  no_clobber = FALSE,
  no_parent = TRUE,
  no_check_certificate = FALSE,
  relative = FALSE,
  adjust_extension = FALSE,
  retr_symlinks = FALSE,
  extra_flags = character(),
  verbose = FALSE,
  capture_stdout = FALSE,
  quiet = FALSE,
  debug = FALSE
)
Arguments
- url
string: the URL to retrieve
- recursive
logical: if TRUE, turn on recursive retrieving
- level
integer >=0: recursively download to this maximum depth level. Only applicable if recursive=TRUE. Specify 0 for infinite recursion. See https://www.gnu.org/software/wget/manual/wget.html#Recursive-Download for more information about wget's recursive downloading
- wait
numeric >=0: wait this number of seconds between successive retrievals. This option may help with servers that block multiple successive requests, by introducing a delay between requests
- accept
character: character vector with one or more entries. Each entry specifies a comma-separated list of filename suffixes or patterns to accept. Note that if any of the wildcard characters '*', '?', '[', or ']' appear in an element of accept, it will be treated as a filename pattern, rather than a filename suffix. In this case, you have to enclose the pattern in quotes, for example
accept="\"*.csv\""
- reject
character: as for accept, but specifying filename suffixes or patterns to reject
- accept_regex
character: character vector with one or more entries. Each entry provides a regular expression that is applied to the complete URL. Matching URLs will be accepted for download
- reject_regex
character: as for accept_regex, but specifying regular expressions to reject
- exclude_directories
character: character vector with one or more entries. Each entry specifies a comma-separated list of directories you wish to exclude from download. Elements may contain wildcards
- restrict_file_names
character: vector of one or more strings from the set "unix", "windows", "nocontrol", "ascii", "lowercase", and "uppercase". See https://www.gnu.org/software/wget/manual/wget.html#index-Windows-file-names for more information on this parameter. bb_config sets this to "windows" by default: if you are downloading files from a server with a port (http://somewhere.org:1234/), Unix will allow the ":" as part of directory/file names, but Windows will not (the ":" will be replaced by "+"). Specifying restrict_file_names="windows" causes Windows-style file naming to be used
- progress
string: the type of progress indicator you wish to use. Legal indicators are "dot" and "bar". "dot" prints progress with dots, with each dot representing a fixed amount of downloaded data. The style can be adjusted: "dot:mega" will show 64K per dot and 3M per line; "dot:giga" shows 1M per dot and 32M per line. See https://www.gnu.org/software/wget/manual/wget.html#index-dot-style for more information
- user
string: username used to authenticate to the remote server
- password
string: password used to authenticate to the remote server
- output_file
string: save wget's output messages to this file
- robots_off
logical: by default wget considers itself to be a robot, and therefore won't recurse into areas of a site that are excluded to robots. This can cause problems with servers that exclude robots (accidentally or deliberately) from parts of their sites containing data that we want to retrieve. Setting robots_off=TRUE will add a "-e robots=off" flag, which instructs wget to behave as a human user, not a robot. See https://www.gnu.org/software/wget/manual/wget.html#Robot-Exclusion for more information about robot exclusion
- timestamping
logical: if TRUE, don't re-retrieve a remote file unless it is newer than the local copy (or there is no local copy)
- no_if_modified_since
logical: applies when retrieving recursively with timestamping (i.e. only downloading files that have changed since last download, which is achieved using bb_config(...,clobber=1)). The default method for timestamping is to issue an "If-Modified-Since" header on the request, which instructs the remote server not to return the file if it has not changed since the specified date. Some servers do not support this header. In these cases, try using no_if_modified_since=TRUE, which will instead send a preliminary HEAD request to ascertain the date of the remote file
- no_clobber
logical: if TRUE, skip downloads that would overwrite existing local files
- no_parent
logical: if TRUE, do not ever ascend to the parent directory when retrieving recursively. This is TRUE by default, because it guarantees that only the files below a certain hierarchy will be downloaded
- no_check_certificate
logical: if TRUE, don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate. This option might be useful if trying to download files from a server with an expired certificate, but it is clearly a security risk and so should be used with caution
- relative
logical: if TRUE, only follow relative links. This can sometimes be useful for restricting what is downloaded in recursive mode
- adjust_extension
logical: if a file of type 'application/xhtml+xml' or 'text/html' is downloaded and the URL does not end with .htm or .html, this option will cause the suffix '.html' to be appended to the local filename. This can be useful when mirroring a remote site that has file URLs that conflict with directories (e.g. http://somewhere.org/this/page, which has further content below it, say at http://somewhere.org/this/page/more). If "somewhere.org/this/page" is saved as a file with that name, that name can't also be used as the local directory name in which to store the lower-level content. Setting adjust_extension=TRUE will cause the page to be saved as "somewhere.org/this/page.html", thus resolving the conflict
- retr_symlinks
logical: if TRUE, follow symbolic links during recursive download. Note that this will only follow symlinks to files, NOT to directories
- extra_flags
character: character vector of additional command-line flags to pass to wget
- verbose
logical: print trace output?
- capture_stdout
logical: if TRUE, return 'stdout' and 'stderr' output in the returned object (see exec_internal from the sys package). Otherwise send these outputs to the console
- quiet
logical: if TRUE, suppress wget's output
- debug
logical: if TRUE, wget will print lots of debugging information. If wget is not behaving as expected, try setting this to TRUE
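
Command-line flags without a dedicated bb_wget parameter can be supplied via extra_flags. A sketch only: the URL is a placeholder, while --tries and --no-verbose are standard wget flags.

bb_wget("https://example.org/data/", recursive = TRUE,
        restrict_file_names = "windows",
        extra_flags = c("--tries=3", "--no-verbose"))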
Value
the result of the system call (or, if bb_wget("--help") was called, a message will be issued). The returned object will have components 'status' and (if capture_stdout was TRUE) 'stdout' and 'stderr'
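
For example, the exit status and output messages can be inspected after a call with capture_stdout=TRUE. A sketch, assuming a placeholder URL; exec_internal returns 'stdout' and 'stderr' as raw vectors, so rawToChar is used to convert them to character strings.

res <- bb_wget("https://example.org/file.csv", recursive = FALSE,
               capture_stdout = TRUE)
if (res$status == 0) message("download succeeded")
cat(rawToChar(res$stderr)) ## wget writes its progress messages to stderr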