Download multiple robots.txt files
Usage
get_robotstxts(
domain,
warn = TRUE,
force = FALSE,
user_agent = utils::sessionInfo()$R.version$version.string,
ssl_verifypeer = c(1, 0),
use_futures = FALSE,
verbose = FALSE,
rt_request_handler = robotstxt::rt_request_handler,
rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get,
on_server_error = on_server_error_default,
on_client_error = on_client_error_default,
on_not_found = on_not_found_default,
on_redirect = on_redirect_default,
on_domain_change = on_domain_change_default,
on_file_type_mismatch = on_file_type_mismatch_default,
on_suspect_content = on_suspect_content_default
)
Arguments
- domain
one or more domains from which to download robots.txt files
- warn
warn about being unable to download domain/robots.txt because of an HTTP response status 404; if this happens, the on_not_found handler (see below) determines how the response is treated
- force
if TRUE, the function re-downloads the robots.txt file instead of using possibly cached results
- user_agent
HTTP user-agent string to be used to retrieve robots.txt file from domain
- ssl_verifypeer
either 1 (default) or 0; if 0, SSL peer verification is disabled, which might help with robots.txt file retrieval in some cases
- use_futures
Should future::future_lapply be used for possible parallel/asynchronous retrieval? Note: consult the help pages and vignettes of the future package on how to set up plans for future execution, because the robotstxt package does not do this on its own (see the sketch in the Examples below).
- verbose
if TRUE, the function prints out more information
- rt_request_handler
function that processes the HTTP request according to the event handlers specified below (the Examples show how to pass a handler explicitly)
- rt_robotstxt_http_getter
function that executes the HTTP request
- on_server_error
request state handler for any 5xx HTTP status
- on_client_error
request state handler for any 4xx HTTP status that is not 404
- on_not_found
request state handler for HTTP status 404
- on_redirect
request state handler for any 3xx HTTP status
- on_domain_change
request state handler for any 3xx HTTP status where the domain changed as well
- on_file_type_mismatch
request state handler for responses with a content type other than 'text/plain'
- on_suspect_content
request state handler for content that appears to be something other than a robots.txt file (usually JSON, XML, or HTML)
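Examples

A minimal usage sketch, assuming network access; the domains below are arbitrary examples:

library(robotstxt)

# download robots.txt files for two domains in one call
rtxts <- get_robotstxts(
  domain = c("wikipedia.org", "cran.r-project.org"),
  warn   = TRUE
)

# inspect the returned object, one element per domain
str(rtxts, max.level = 1)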
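For parallel/asynchronous retrieval via use_futures, a plan for future execution has to be set up first; a hedged sketch, assuming the future package is installed:

library(robotstxt)
library(future)

# set up a parallel execution plan; the robotstxt package
# does not do this on its own (see the future package docs)
plan(multisession)

rtxts <- get_robotstxts(
  domain      = c("wikipedia.org", "cran.r-project.org", "r-project.org"),
  use_futures = TRUE
)

# return to sequential execution when done
plan(sequential)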
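The on_* defaults are exported objects, so one way to approach a custom handler is to inspect a default's structure first. The sketch below merely passes a default back in, which leaves the behaviour unchanged:

library(robotstxt)

# inspect the fields a handler is expected to provide
str(on_not_found_default)

# pass a handler explicitly; substituting a modified copy here
# would change how HTTP 404 responses are treated
rtxt <- get_robotstxts(
  domain       = "wikipedia.org",
  on_not_found = on_not_found_default
)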