
This function is an R wrapper to the command-line wget utility, which is called using either the exec_wait or the exec_internal function from the sys package. Almost all of the parameters to bb_wget are translated into command-line flags to wget. Call bb_wget("help") to get more information about wget's command line flags. If required, command-line flags without equivalent bb_wget function parameters can be passed via the extra_flags parameter.
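As a minimal sketch of how flags without a dedicated parameter might be supplied, the snippet below passes wget's --tries and --limit-rate flags through extra_flags. The URL is a placeholder, and the call is guarded so that nothing is downloaded unless run_download is set to TRUE. It assumes that bb_wget is available from the bowerbird package and that a wget executable is installed.

```r
## placeholder URL: replace with a real data source
url <- "https://example.org/data/"

## wget flags that have no dedicated bb_wget parameter
extra <- c("--tries=3", "--limit-rate=200k")

## set to TRUE to actually run wget (needs bowerbird and a wget executable)
run_download <- FALSE
if (run_download) {
  res <- bowerbird::bb_wget(url, recursive = FALSE, extra_flags = extra)
}
```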

Usage

bb_wget(
  url,
  recursive = TRUE,
  level = 1,
  wait = 0,
  accept,
  reject,
  accept_regex,
  reject_regex,
  exclude_directories,
  restrict_file_names,
  progress,
  user,
  password,
  output_file,
  robots_off = FALSE,
  timestamping = FALSE,
  no_if_modified_since = FALSE,
  no_clobber = FALSE,
  no_parent = TRUE,
  no_check_certificate = FALSE,
  relative = FALSE,
  adjust_extension = FALSE,
  retr_symlinks = FALSE,
  extra_flags = character(),
  verbose = FALSE,
  capture_stdout = FALSE,
  quiet = FALSE,
  debug = FALSE
)

Arguments

url

string: the URL to retrieve

recursive

logical: if TRUE, turn on recursive retrieving

level

integer >=0: recursively download to this maximum depth level. Only applicable if recursive=TRUE. Specify 0 for infinite recursion. See https://www.gnu.org/software/wget/manual/wget.html#Recursive-Download for more information about wget's recursive downloading

wait

numeric >=0: wait this number of seconds between successive retrievals. This option may help with servers that block multiple successive requests, by introducing a delay between requests

accept

character: character vector with one or more entries. Each entry specifies a comma-separated list of filename suffixes or patterns to accept. Note that if any of the wildcard characters '*', '?', '[', or ']' appear in an element of accept, it will be treated as a filename pattern, rather than a filename suffix. In this case, you have to enclose the pattern in quotes, for example accept="\"*.csv\""

reject

character: as for accept, but specifying filename suffixes or patterns to reject

accept_regex

character: character vector with one or more entries. Each entry provides a regular expression that is applied to the complete URL. Matching URLs will be accepted for download

reject_regex

character: as for accept_regex, but specifying regular expressions to reject

exclude_directories

character: character vector with one or more entries. Each entry specifies a comma-separated list of directories you wish to exclude from download. Elements may contain wildcards

restrict_file_names

character: vector of one or more strings from the set "unix", "windows", "nocontrol", "ascii", "lowercase", and "uppercase". See https://www.gnu.org/software/wget/manual/wget.html#index-Windows-file-names for more information on this parameter. bb_config sets this to "windows" by default: if you are downloading files from a server with a port (http://somewhere.org:1234/), Unix will allow the ":" as part of directory/file names, but Windows will not (the ":" will be replaced by "+"). Specifying restrict_file_names="windows" causes Windows-style file naming to be used

progress

string: the type of progress indicator you wish to use. Legal indicators are "dot" and "bar". "dot" prints progress with dots, with each dot representing a fixed amount of downloaded data. The style can be adjusted: "dot:mega" will show 64K per dot and 3M per line; "dot:giga" shows 1M per dot and 32M per line. See https://www.gnu.org/software/wget/manual/wget.html#index-dot-style for more information

user

string: username used to authenticate to the remote server

password

string: password used to authenticate to the remote server

output_file

string: save wget's output messages to this file

robots_off

logical: by default wget considers itself to be a robot, and therefore won't recurse into areas of a site that are excluded to robots. This can cause problems with servers that exclude robots (accidentally or deliberately) from parts of their sites containing data that we want to retrieve. Setting robots_off=TRUE will add a "-e robots=off" flag, which instructs wget to behave as a human user, not a robot. See https://www.gnu.org/software/wget/manual/wget.html#Robot-Exclusion for more information about robot exclusion

timestamping

logical: if TRUE, don't re-retrieve a remote file unless it is newer than the local copy (or there is no local copy)

no_if_modified_since

logical: applies when retrieving recursively with timestamping (i.e. only downloading files that have changed since last download, which is achieved using bb_config(...,clobber=1)). The default method for timestamping is to issue an "If-Modified-Since" header on the request, which instructs the remote server not to return the file if it has not changed since the specified date. Some servers do not support this header. In these cases, try using no_if_modified_since=TRUE, which will instead send a preliminary HEAD request to ascertain the date of the remote file

no_clobber

logical: if TRUE, skip downloads that would overwrite existing local files

no_parent

logical: if TRUE, never ascend to the parent directory when retrieving recursively. This is TRUE by default, because it guarantees that only the files below a certain hierarchy will be downloaded

no_check_certificate

logical: if TRUE, don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate. This option might be useful if trying to download files from a server with an expired certificate, but it is clearly a security risk and so should be used with caution

relative

logical: if TRUE, only follow relative links. This can sometimes be useful for restricting what is downloaded in recursive mode

adjust_extension

logical: if a file of type 'application/xhtml+xml' or 'text/html' is downloaded and the URL does not end with .htm or .html, this option will cause the suffix '.html' to be appended to the local filename. This can be useful when mirroring a remote site that has file URLs that conflict with directories (e.g. http://somewhere.org/this/page which has further content below it, say at http://somewhere.org/this/page/more). If "somewhere.org/this/page" is saved as a file with that name, that name can't also be used as the local directory name in which to store the lower-level content. Setting adjust_extension=TRUE will cause the page to be saved as "somewhere.org/this/page.html", thus resolving the conflict

retr_symlinks

logical: if TRUE, follow symbolic links during recursive download. Note that this will only follow symlinks to files, NOT to directories

extra_flags

character: character vector of additional command-line flags to pass to wget

verbose

logical: print trace output?

capture_stdout

logical: if TRUE, return 'stdout' and 'stderr' output in the returned object (see exec_internal from the sys package). Otherwise send these outputs to the console

quiet

logical: if TRUE, suppress wget's output

debug

logical: if TRUE, wget will print lots of debugging information. If wget is not behaving as expected, try setting this to TRUE
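To illustrate how several of these arguments might be combined, here is a hedged sketch of a filtered recursive download. All values are illustrative: the URL is a placeholder, the "/archive/" pattern is hypothetical, and the call assumes bb_wget is loaded from the bowerbird package. Note the extra quotes around the accept pattern, which are needed because it contains a wildcard.

```r
## arguments for a polite, filtered recursive download (illustrative values)
args <- list(
  url = "https://example.org/data/",  # placeholder URL
  recursive = TRUE,
  level = 2,                          # descend at most two levels
  wait = 1,                           # pause one second between retrievals
  accept = "\"*.csv\"",               # wildcard pattern, so enclosed in quotes
  reject_regex = "/archive/",         # hypothetical: skip URLs matching /archive/
  no_parent = TRUE,                   # never ascend above the starting directory
  restrict_file_names = "windows"
)

## set to TRUE to actually run wget (needs bowerbird and a wget executable)
run_download <- FALSE
if (run_download) {
  res <- do.call(bowerbird::bb_wget, args)
}
```

Building the arguments as a list first makes it easy to inspect them before any request is sent.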

Value

the result of the system call (or, if bb_wget("help") was called, a message will be issued). The returned object will have components 'status' and (if capture_stdout was TRUE) 'stdout' and 'stderr'
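The returned object could be inspected along the following lines. This is a sketch under assumptions: that 'status' is wget's exit code (0 on success), and that 'stdout' and 'stderr' may be raw vectors, as they are when returned by exec_internal from the sys package. The helper as_text is hypothetical and not part of any package, and the URL is a placeholder.

```r
## hypothetical helper: convert captured output to a single string,
## whether it arrives as a raw vector or as character
as_text <- function(x) {
  if (is.raw(x)) rawToChar(x) else paste(x, collapse = "\n")
}

## set to TRUE to actually run wget (needs bowerbird and a wget executable)
run_download <- FALSE
if (run_download) {
  res <- bowerbird::bb_wget("https://example.org/data/file.csv",
                            recursive = FALSE, capture_stdout = TRUE)
  if (res$status != 0) {
    warning("wget exited with status ", res$status)
  }
  ## wget normally writes its log messages to stderr
  cat(as_text(res$stderr))
}
```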

Examples

if (FALSE) { # \dontrun{
  ## get help about wget command line parameters
  bb_wget("help")
} # }
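A further sketch, in the same spirit as the example above, shows how timestamping might be combined with no_if_modified_since for a server that does not support the "If-Modified-Since" header. The URL is a placeholder, and the call assumes bb_wget is loaded from the bowerbird package; it is guarded so that nothing runs unless run_download is set to TRUE.

```r
## re-download only files that are newer than the local copy (illustrative values)
args <- list(
  url = "https://example.org/data/",  # placeholder URL
  recursive = TRUE,
  level = 1,
  timestamping = TRUE,                # skip files that have not changed
  no_if_modified_since = TRUE         # use a preliminary HEAD request instead
)

## set to TRUE to actually run wget (needs bowerbird and a wget executable)
run_download <- FALSE
if (run_download) {
  res <- do.call(bowerbird::bb_wget, args)
}
```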