Adapted in part from the blog post Curling - exploring web request options.
Most of the time, requesting data from the web works without a problem. Eventually, though, you will run into trouble. There are also more advanced things you can do by modifying requests to web resources.
Requests to web resources are served over the HTTP protocol via curl.
curl is a command line tool and library for transferring data with URL syntax, supporting lots of protocols.
curl has many options that you may not know about.
I’ll go over some of the common and less commonly used curl options, and try to explain why you may want to use some of them.
You can go to the source, that is, the curl book at https://ec.haxx.se/. In R, use curl::curl_options() to find curl options. It gives information for each curl option, including the libcurl variable name (e.g., CURLOPT_CERTINFO) and the type of variable (e.g., logical).
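As a rough sketch of how that lookup might go, the snippet below searches the option names for anything related to timeouts. It assumes the curl package is installed and that curl_options() returns a vector named by lower-case option name, which is how the versions I have used behave. The exact set of names depends on your libcurl build.

```r
# List the curl options known to the installed curl package.
all_opts <- curl::curl_options()

# Option names are the lower-case libcurl names without the CURLOPT_ prefix,
# e.g. "verbose" corresponds to CURLOPT_VERBOSE.
head(names(all_opts))

# Narrow the list with a filter string; here, anything mentioning "timeout".
timeout_opts <- curl::curl_options("timeout")
names(timeout_opts)
```

The names returned here are the ones you would use as list names in opts = list(...).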
Perhaps the canonical way to use curl is on the command line. You can get curl for your operating system at http://curl.haxx.se/download.html, though hopefully you already have it. Once you have curl, you can have lots of fun. For example, running curl https://www.google.com prints the contents of the Google landing page.
Note: if you are on Windows, you may need extra setup to use curl on the command line; OSX and Linux have it by default. On Windows 8, installing the latest version from http://curl.haxx.se/download.html#Win64 worked for me.
With crul you have to set curl options per object, not globally across all HTTP requests. We may allow global curl option setting in the future.
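A small sketch of what per-object options look like in practice. The variable names and option values are placeholders chosen for illustration, and no request is made here.

```r
library(crul)

# Two independent clients: options set on one do not leak into the other.
slow_ok <- HttpClient$new("https://httpbin.org", opts = list(timeout = 30))
chatty  <- HttpClient$new("https://httpbin.org", opts = list(verbose = TRUE))

# Each object carries its own option list.
slow_ok$opts
chatty$opts
```

Requests made from slow_ok use the 30 second timeout, while requests made from chatty only get verbose output.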
We recommend using an ellipsis (...) to allow users to pass in curl options. For example, let's say you have a function foo() in a package that makes an HTTP request. To make it easy for users to pass in curl options, give the function an ellipsis argument and forward it to the request.
Then we can pass in any combination of acceptable curl options:
foo(verbose = TRUE) #> verbose curl output
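A minimal sketch of what such a function could look like. Both foo and the httpbin.org URL are placeholders, not code from a real package. It relies on the crul request methods accepting curl options through their own ... argument.

```r
library(crul)

# Hypothetical package function. Everything passed via `...` is forwarded to
# the request method, which treats named arguments as curl options.
foo <- function(...) {
  x <- HttpClient$new("https://httpbin.org")
  x$get(...)
}

# Usage (needs network access):
# foo(verbose = TRUE)
# foo(timeout = 2, verbose = TRUE)
```

The advantage of this design is that the package author does not have to anticipate which curl options users will need.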
You can instead make users pass in a list, e.g.:
Then a user has to pass curl options like:
foo(opts = list(verbose = TRUE))
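The list-based variant might be sketched like this, again with foo and the URL as placeholders. Here the options are handed to the client when it is constructed rather than to the request method.

```r
library(crul)

# Hypothetical variant: curl options arrive as a single list argument and are
# given to the client at construction time.
foo <- function(opts = list()) {
  x <- HttpClient$new("https://httpbin.org", opts = opts)
  x$get()
}

# Usage (needs network access):
# foo(opts = list(verbose = TRUE))
```

This keeps the function signature explicit, at the cost of a little more typing for the user.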
Set a timeout for a request. If the request exceeds the timeout, it stops with an error.
Why use this? Sometimes you are working with a web resource that is somewhat unreliable. For example, if you run a script on a server that may take many hours, and the web resource could be down at some point during that time, you could set a timeout and catch the resulting error so that the script doesn't hang on a server that's not responding. Another example is calling a web resource from an R package. In your test suite, you may want to check that the web resource responds quickly, so you could set a timeout and skip the test if the request times out.
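The timeout-and-catch idea could be sketched as follows. fetch_or_null is a hypothetical helper name, and the timeout option is given in seconds; treat this as one possible approach rather than a recommended crul idiom.

```r
library(crul)

# Hypothetical helper: fetch a URL, giving up after `timeout_s` seconds.
# Returns the crul response on success and NULL if the request errors
# (timeout, refused connection, DNS failure, ...).
fetch_or_null <- function(url, timeout_s = 5) {
  x <- HttpClient$new(url, opts = list(timeout = timeout_s))
  tryCatch(
    x$get(),
    error = function(e) {
      message("request failed: ", conditionMessage(e))
      NULL
    }
  )
}

# In a long-running script, skip the resource instead of hanging on it
# (needs network access):
# res <- fetch_or_null("https://httpbin.org/delay/10", timeout_s = 2)
# if (is.null(res)) message("resource unavailable, moving on")
```

Returning NULL keeps the calling script in control of what to do next, whether that is retrying later or skipping a test.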
Print detailed info on a curl call
Just do a HEAD request so we don't have to deal with big output:
HttpClient$new("https://httpbin.org", opts = list(verbose = TRUE))$head()
#> > HEAD / HTTP/1.1
#> Host: httpbin.org
#> User-Agent: libcurl/7.54.0 r-curl/3.2 crul/0.5.4.9521
#> Accept: */*
#> Accept-Encoding: gzip, deflate
#>
#> < HTTP/1.1 200 OK
#> < Connection: keep-alive
#> < Server: gunicorn/19.8.1
#> < Date: Fri, 06 Jul 2018 17:56:50 GMT
#> < Content-Type: text/html; charset=utf-8
#> < Content-Length: 8344
#> < Access-Control-Allow-Origin: *
#> < Access-Control-Allow-Credentials: true
#> < Via: 1.1 vegur
Why use this? As you can see verbose output gives you lots of information that may be useful for debugging a request. You typically don’t need verbose output unless you want to inspect a request.
Add headers to modify requests, including authentication, setting content-type, accept type, etc.
HttpClient$new(headers = list(...))
x <- HttpClient$new("https://httpbin.org",
  headers = list(
    Accept = "application/json",
    foo = "bar"
  ),
  opts = list(verbose = TRUE)
)
x$head()
#> > HEAD / HTTP/1.1
#> Host: httpbin.org
#> User-Agent: libcurl/7.54.0 r-curl/3.2 crul/0.5.4.9521
#> Accept-Encoding: gzip, deflate
#> Accept: application/json
#> foo: bar
#>
#> < HTTP/1.1 200 OK
#> < Connection: keep-alive
#> < Server: gunicorn/19.8.1
#> < Date: Fri, 06 Jul 2018 17:59:15 GMT
#> < Content-Type: text/html; charset=utf-8
#> < Content-Length: 8344
#> < Access-Control-Allow-Origin: *
#> < Access-Control-Allow-Credentials: true
#> < Via: 1.1 vegur
Why use this? For some web resources, using headers is mandatory, and crul makes including them quite easy. Headers are also nice because, for example, passing authentication in a header instead of the URL string means your private data is not as exposed to prying eyes.
Set authentication details for a resource
auth() for basic username/password authentication
auth(user = "foo", pwd = "bar")
#> $userpwd
#> [1] "foo:bar"
#>
#> $httpauth
#> [1] 1
#>
#> attr(,"class")
#> [1] "auth"
#> attr(,"type")
#> [1] "basic"
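To show where that object goes, here is a short sketch that attaches it to a client through the auth argument. The credentials are placeholders, and the commented-out request assumes the httpbin.org basic-auth test endpoint is reachable.

```r
library(crul)

# Attach basic auth credentials to a client. "foo" / "bar" are placeholders.
x <- HttpClient$new(
  "https://httpbin.org",
  auth = auth(user = "foo", pwd = "bar")
)

# The credentials are stored on the object and sent with each request, e.g.
# x$get("basic-auth/foo/bar")   # needs network access
x$auth$userpwd
```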
How you use an API key depends on the data provider. They may ask for it in a header
HttpClient$new("https://httpbin.org/get", headers = list(Authorization = "Bearer 234kqhrlj2342"))
or as a query parameter (which is passed in the URL string)
HttpClient$new("https://httpbin.org/get")$get(query = list(api_key = "<your key>"))
Another authentication option is OAuth. OAuth is not supported in crul yet. You can always do OAuth with httr and then take your token and pass it in as a header/etc. with crul.
Print curl progress
HttpClient$new(progress = fxn)
x <- HttpClient$new("https://httpbin.org/get", progress = httr::progress())
x$get()
#> |==================================| 100%
Why use this? As you could imagine, this is increasingly useful as a request for a web resource takes longer and longer. For very long requests, this will help you know approximately when a request will finish.
When behind a proxy, give authentication details for your proxy.
HttpClient$new(proxies = proxy("http://22.214.171.124:3128", "foo", "bar"))
prox <- proxy("http://126.96.36.199:80", user = "username", pwd = "password")
HttpClient$new("http://www.google.com/search", proxies = prox)
Why use this? Most of us likely don't need to worry about this. However, if you are in a workplace, or in certain geographic locations, you may have to use a proxy. I haven't personally used a proxy in R, so any feedback on this would be great.
Some resources require a user-agent string.
HttpClient$new(headers = list(`User-Agent` = "foobar"))
HttpClient$new(opts = list(useragent = "foobar"))
Both result in the same thing.
Why use this? A user agent is set by default in an HTTP request, as you can see in the first verbose example above. Some web APIs require that you set a specific user agent. For example, the GitHub API requires that each request include a user agent string in the header, set to your username or the name of your application, so they can contact you if there is a problem.