Make a wget call

This function is an R wrapper to the command-line wget utility, which is called using either the exec_wait or the exec_internal function from the sys package. Almost all of the parameters to bb_wget are translated into command-line flags to wget. Call bb_wget("help") to get more information about wget's command line flags. If required, command-line flags without equivalent bb_wget function parameters can be passed via the extra_flags parameter.

Usage

bb_wget(
  url,
  recursive = TRUE,
  level = 1,
  wait = 0,
  accept,
  reject,
  accept_regex,
  reject_regex,
  exclude_directories,
  restrict_file_names,
  progress,
  user,
  password,
  output_file,
  robots_off = FALSE,
  timestamping = FALSE,
  no_if_modified_since = FALSE,
  no_clobber = FALSE,
  no_parent = TRUE,
  no_check_certificate = FALSE,
  relative = FALSE,
  adjust_extension = FALSE,
  retr_symlinks = FALSE,
  extra_flags = character(),
  verbose = FALSE,
  capture_stdout = FALSE,
  quiet = FALSE,
  debug = FALSE
)

Arguments

url: string: the URL to retrieve
recursive: logical: if true, turn on recursive retrieving
level: integer >=0: recursively download to this maximum depth level. Only applicable if recursive=TRUE. Specify 0 for infinite recursion. See https://www.gnu.org/software/wget/manual/wget.html#Recursive-Download for more information about wget's recursive downloading
wait: numeric >=0: wait this number of seconds between successive retrievals. This option may help with servers that block multiple successive requests, by introducing a delay between requests
accept: character: character vector with one or more entries. Each entry specifies a comma-separated list of filename suffixes or patterns to accept. Note that if any of the wildcard characters '*', '?', '[', or ']' appear in an element of accept, it will be treated as a filename pattern, rather than a filename suffix. In this case, you have to enclose the pattern in quotes, for example accept="\"*.csv\""
reject: character: as for accept, but specifying filename suffixes or patterns to reject
accept_regex: character: character vector with one or more entries. Each entry provides a regular expression that is applied to the complete URL. Matching URLs will be accepted for download
reject_regex: character: as for accept_regex, but specifying regular expressions to reject
exclude_directories: character: character vector with one or more entries. Each entry specifies a comma-separated list of directories you wish to exclude from download. Elements may contain wildcards
restrict_file_names: character: vector of one of more strings from the set "unix", "windows", "nocontrol", "ascii", "lowercase", and "uppercase". See https://www.gnu.org/software/wget/manual/wget.html#index-Windows-file-names for more information on this parameter. bb_config sets this to "windows" by default: if you are downloading files from a server with a port (http://somewhere.org:1234/) Unix will allow the ":" as part of directory/file names, but Windows will not (the ":" will be replaced by "+"). Specifying restrict_file_names="windows" causes Windows-style file naming to be used
progress: string: the type of progress indicator you wish to use. Legal indicators are "dot" and "bar". "dot" prints progress with dots, with each dot representing a fixed amount of downloaded data. The style can be adjusted: "dot:mega" will show 64K per dot and 3M per line; "dot:giga" shows 1M per dot and 32M per line. See https://www.gnu.org/software/wget/manual/wget.html#index-dot-style for more information
user: string: username used to authenticate to the remote server
password: string: password used to authenticate to the remote server
output_file: string: save wget's output messages to this file
robots_off: logical: by default wget considers itself to be a robot, and therefore won't recurse into areas of a site that are excluded to robots. This can cause problems with servers that exclude robots (accidentally or deliberately) from parts of their sites containing data that we want to retrieve. Setting robots_off=TRUE will add a "-e robots=off" flag, which instructs wget to behave as a human user, not a robot. See https://www.gnu.org/software/wget/manual/wget.html#Robot-Exclusion for more information about robot exclusion
timestamping: logical: if TRUE, don't re-retrieve a remote file unless it is newer than the local copy (or there is no local copy)
no_if_modified_since: logical: applies when retrieving recursively with timestamping (i.e. only downloading files that have changed since last download, which is achieved using bb_config(...,clobber=1)). The default method for timestamping is to issue an "If-Modified-Since" header on the request, which instructs the remote server not to return the file if it has not changed since the specified date. Some servers do not support this header. In these cases, trying using no_if_modified_since=TRUE, which will instead send a preliminary HEAD request to ascertain the date of the remote file
no_clobber: logical: if TRUE, skip downloads that would overwrite existing local files
no_parent: logical: if TRUE, do not ever ascend to the parent directory when retrieving recursively. This is TRUE by default, bacause it guarantees that only the files below a certain hierarchy will be downloaded
no_check_certificate: logical: if TRUE, don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate. This option might be useful if trying to download files from a server with an expired certificate, but it is clearly a security risk and so should be used with caution
relative: logical: if TRUE, only follow relative links. This can sometimes be useful for restricting what is downloaded in recursive mode
adjust_extension: logical: if a file of type 'application/xhtml+xml' or 'text/html' is downloaded and the URL does not end with .htm or .html, this option will cause the suffix '.html' to be appended to the local filename. This can be useful when mirroring a remote site that has file URLs that conflict with directories (e.g. http://somewhere.org/this/page which has further content below it, say at http://somewhere.org/this/page/more. If "somewhere.org/this/page" is saved as a file with that name, that name can't also be used as the local directory name in which to store the lower-level content. Setting adjust_extension=TRUE will cause the page to be saved as "somewhere.org/this/page.html", thus resolving the conflict
retr_symlinks: logical: if TRUE, follow symbolic links during recursive download. Note that this will only follow symlinks to files, NOT to directories
extra_flags: character: character vector of additional command-line flags to pass to wget
verbose: logical: print trace output?
capture_stdout: logical: if TRUE, return 'stdout' and 'stderr' output in the returned object (see exec_internal from the sys package). Otherwise send these outputs to the console
quiet: logical: if TRUE, suppress wget's output
debug: logical: if TRUE, wget will print lots of debugging information. If wget is not behaving as expected, try setting this to TRUE

Value

the result of the system call (or if bb_wget("--help") was called, a message will be issued). The returned object will have components 'status' and (if capture_stdout was TRUE) 'stdout' and 'stderr'

Examples

if (FALSE) { # \dontrun{
  ## get help about wget command line parameters
  bb_wget("help")
} # }

Usage

Arguments

Value

See also

Examples

About

Community

Resources