Adapted in part from the blog post "Curling - exploring web request options".
Most of the time, requesting data from the web just works. Eventually, though, you will run into problems. There are also more advanced things you can do by modifying requests to web resources.
Requests to web resources are made over the HTTP protocol via curl. curl is a command line tool and library for transferring data with URL syntax, supporting lots of protocols. curl has many options that you may not know about.
I’ll go over some of the common and less commonly used curl options, and try to explain why you may want to use some of them.
Discover curl options
You can go to the source, that is, the curl book at https://everything.curl.dev/. In R, use
curl::curl_options()
to find curl options. It gives information for each curl option, including the libcurl variable name (e.g., CURLOPT_CERTINFO) and the type of variable (e.g., logical).
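The full list of options is long, so it can help to narrow it down. A minimal sketch, assuming the curl package is installed: curl_options() takes a filter argument that is matched as a regular expression against option names, and the names it returns are the lower-case forms you pass in opts (for example timeout_ms rather than CURLOPT_TIMEOUT_MS).

```r
# List only the options whose names contain "timeout"
timeout_opts <- curl::curl_options("timeout")

# The names are what you would use in opts = list(...)
names(timeout_opts)
```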
Other ways to use curl besides R
Perhaps the canonical way to use curl is on the command line. You can get curl for your operating system at https://curl.se/download.html, though hopefully you already have it. Once you have curl, you can have lots of fun. For example, get the contents of the Google landing page:
curl https://www.google.com
- If you like that, you may also like httpie, a Python command line tool that is a little more convenient than curl (e.g., JSON output is automatically parsed and colorized).
- A lot of data from the web is in JSON format. A great command line tool to pair with curl is jq.
Note: if you are on Windows, you may need extra setup before you can use curl on the command line. macOS and Linux have it by default. On Windows 8, installing the latest version from https://curl.se/download.html#Win64 worked for me.
general info
With crul you have to set curl options per object, so not globally across all HTTP requests. We may allow global curl option setting in the future.
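To make the per-object behaviour concrete, here is a small sketch. It assumes the client keeps the options it was given in its opts field, and it reuses the hb.opencpu.org host from the examples in this document. No request is sent.

```r
library(crul)

# Two clients for the same host, each carrying its own curl options
quiet <- HttpClient$new(url = "https://hb.opencpu.org")
chatty <- HttpClient$new(
  url = "https://hb.opencpu.org",
  opts = list(verbose = TRUE, timeout_ms = 5000)
)

# Options live on the object they were given to
quiet$opts   # no options set
chatty$opts  # verbose and timeout_ms
```

Setting verbose on one client has no effect on requests made through the other.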
using curl options in other packages
We recommend using ... to allow users to pass in curl options. For example, let's say you have a function in a package:
foo <- function() {
  z <- crul::HttpClient$new(url = yoururl)
  z$get()
}
To make it easy for users to pass in curl options, use ...:
foo <- function(...) {
  z <- crul::HttpClient$new(url = yoururl, opts = list(...))
  z$get()
}
Then we can pass in any combination of acceptable curl options:
foo(verbose = TRUE)
#> verbose curl output
You can instead make users pass in a list, e.g.:
foo <- function(opts = list()) {
  z <- crul::HttpClient$new(url = yoururl, opts = opts)
  z$get()
}
Then a user has to pass curl options like:
foo(opts = list(verbose = TRUE))
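Either pattern forwards whatever the user supplies, so a misspelled option name only surfaces once a request is made. One possible refinement, sketched below, is to check the names against curl::curl_options() before building the client. make_client is a hypothetical helper written for this example, not part of crul.

```r
library(crul)

# Hypothetical helper: build a client from user-supplied curl options,
# stopping early on names that libcurl does not know about
make_client <- function(url, ...) {
  opts <- list(...)
  known <- names(curl::curl_options())
  bad <- setdiff(names(opts), known)
  if (length(bad) > 0) {
    stop("unknown curl option(s): ", paste(bad, collapse = ", "))
  }
  HttpClient$new(url = url, opts = opts)
}

cli <- make_client("https://hb.opencpu.org", verbose = TRUE, timeout_ms = 5000)
cli$opts
```

Calling make_client("https://hb.opencpu.org", verbos = TRUE) stops with an error that names the offending option.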
timeout
Set a timeout for a request. If the request exceeds the timeout, it is stopped and an error is raised.
relevant commands:
timeout_ms=<integer>
HttpClient$new("https://www.google.com/search",
opts = list(timeout_ms = 1))$get()
#> Error in curl::curl_fetch_memory(x$url$url, handle = x$url$handle) :
#> Timeout was reached: Operation timed out after 35 milliseconds with 0 bytes received
Why use this? Sometimes you are working with a web resource that is somewhat unreliable. For example, if you run a script on a server that may take many hours, and the web resource could be down at some point during that time, you can set a timeout and catch the resulting error so that the script doesn’t hang on a server that’s not responding. Another example is calling a web resource from an R package. In your test suite, you may want to check that the web resource responds quickly, so you could set a timeout and skip the test if the request times out.
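The "set a timeout and catch the error" idea can be sketched as follows. safe_get is a hypothetical wrapper, and the 2000 ms default is an arbitrary choice for illustration. It returns NULL instead of raising an error when the request times out or otherwise fails.

```r
library(crul)

# Hypothetical wrapper: return the response, or NULL if the request errors
# (timeout, refused connection, DNS failure, ...)
safe_get <- function(url, timeout_ms = 2000) {
  cli <- HttpClient$new(url = url, opts = list(timeout_ms = timeout_ms))
  tryCatch(
    cli$get(),
    error = function(e) {
      message("request failed: ", conditionMessage(e))
      NULL
    }
  )
}

res <- safe_get("https://hb.opencpu.org/get")
if (is.null(res)) {
  message("resource not available, moving on")
}
```

In a test suite, the NULL branch is where you would skip the test rather than fail it.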
verbose
Print detailed info on a curl call
relevant commands:
verbose=<boolean>
Just do a HEAD request so we don’t have to deal with big output
HttpClient$new("https://hb.opencpu.org",
opts = list(verbose = TRUE))$head()
#> > HEAD / HTTP/1.1
#> Host: hb.opencpu.org
#> User-Agent: libcurl/7.54.0 r-curl/3.2 crul/0.5.4.9521
#> Accept: */*
#> Accept-Encoding: gzip, deflate
#>
#> < HTTP/1.1 200 OK
#> < Connection: keep-alive
#> < Server: gunicorn/19.8.1
#> < Date: Fri, 06 Jul 2018 17:56:50 GMT
#> < Content-Type: text/html; charset=utf-8
#> < Content-Length: 8344
#> < Access-Control-Allow-Origin: *
#> < Access-Control-Allow-Credentials: true
#> < Via: 1.1 vegur
Why use this? As you can see, verbose output gives you lots of information that may be useful for debugging a request. You typically don’t need verbose output unless you want to inspect a request.
headers
Add headers to modify requests, including authentication, setting content-type, accept type, etc.
relevant commands:
HttpClient$new(headers = list(...))
x <- HttpClient$new("https://hb.opencpu.org",
  headers = list(
    Accept = "application/json",
    foo = "bar"
  ),
  opts = list(verbose = TRUE)
)
x$head()
#> > HEAD / HTTP/1.1
#> Host: hb.opencpu.org
#> User-Agent: libcurl/7.54.0 r-curl/3.2 crul/0.5.4.9521
#> Accept-Encoding: gzip, deflate
#> Accept: application/json
#> foo: bar
#>
#> < HTTP/1.1 200 OK
#> < Connection: keep-alive
#> < Server: gunicorn/19.8.1
#> < Date: Fri, 06 Jul 2018 17:59:15 GMT
#> < Content-Type: text/html; charset=utf-8
#> < Content-Length: 8344
#> < Access-Control-Allow-Origin: *
#> < Access-Control-Allow-Credentials: true
#> < Via: 1.1 vegur
Why use this? For some web resources, using headers is mandatory, and crul makes including them quite easy. Headers are nice too because, e.g., passing authentication in the header instead of the URL string means your private data is not as exposed to prying eyes.
authenticate
Set authentication details for a resource
relevant commands:
auth() for basic username/password authentication
auth(user = "foo", pwd = "bar")
#> $userpwd
#> [1] "foo:bar"
#>
#> $httpauth
#> [1] 1
#>
#> attr(,"class")
#> [1] "auth"
#> attr(,"type")
#> [1] "basic"
How you supply an API key depends on the data provider. They may ask for it in a header
HttpClient$new("https://hb.opencpu.org/get", headers = list(Authorization = "Bearer 234kqhrlj2342"))
or as a query parameter (which is passed in the URL string)
HttpClient$new("https://hb.opencpu.org/get", query = list(api_key = "<your key>"))
Another authentication option is OAuth. OAuth is not supported in crul yet. You can always do OAuth with httr and then take your token and pass it in as a header/etc. with crul.
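To tie the pieces of this section together, here is a short sketch of handing credentials to a client. It assumes the auth() result is passed through the auth argument of HttpClient$new, and that hb.opencpu.org exposes an httpbin-style /basic-auth/foo/bar endpoint. MY_API_KEY is a hypothetical environment variable name, used so the key does not have to be written into a script. No request is sent.

```r
library(crul)

# Basic authentication: give the auth() object to the client
cli <- HttpClient$new(
  url = "https://hb.opencpu.org/basic-auth/foo/bar",
  auth = auth(user = "foo", pwd = "bar")
)

# API key in a header, read from an environment variable
# (MY_API_KEY is a hypothetical name; empty string if unset)
key <- Sys.getenv("MY_API_KEY", unset = "")
cli_key <- HttpClient$new(
  url = "https://hb.opencpu.org/get",
  headers = list(Authorization = paste("Bearer", key))
)
```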
cookies
Set or get cookies.
relevant commands:
Set cookies (just showing response headers)
x <- HttpClient$new(url = "https://www.google.com", opts = list(verbose = TRUE))
res <- x$get()
#> < HTTP/1.1 200 OK
#> < Date: Fri, 06 Jul 2018 23:25:49 GMT
#> < Expires: -1
#> < Cache-Control: private, max-age=0
#> < Content-Type: text/html; charset=ISO-8859-1
#> < P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
#> < Content-Encoding: gzip
#> < Server: gws
#> < X-XSS-Protection: 1; mode=block
#> < X-Frame-Options: SAMEORIGIN
#> * Added cookie 1P_JAR="2018-07-06-23" for domain google.com, path /, expire 1533511549
#> < Set-Cookie: 1P_JAR=2018-07-06-23; expires=Sun, 05-Aug-2018 23:25:49 GMT; path=/; domain=.google.com
#> * Added cookie NID="134=yt47WC-2mhTgQpkSCMz_ySTig3MCJD5Bx_lNj_aVLAwKu8SPMX-CCowKfU8zSv2cJ2OjiX2LTrYnhWMGvIDieCC419v0VHvlm4Hl9xln9-r4MZwcnqwTZQPT72VNE0uA" for domain google.com, path /, expire 1546730749
#> < Set-Cookie: NID=134=yt47WC-2mhTgQpkSCMz_ySTig3MCJD5Bx_lNj_aVLAwKu8SPMX-CCowKfU8zSv2cJ2OjiX2LTrYnhWMGvIDieCC419v0VHvlm4Hl9xln9-r4MZwcnqwTZQPT72VNE0uA; expires=Sat, 05-Jan-2019 23:25:49 GMT; path=/; domain=.google.com; HttpOnly
#> < Alt-Svc: quic=":443"; ma=2592000; v="43,42,41,39,35"
#> < Transfer-Encoding: chunked
If there are cookies in a response, you can access them with curl::handle_cookies like:
curl::handle_cookies(res$handle)
#> domain flag path secure expiration name
#> 1 .google.com TRUE / FALSE 2018-08-05 16:25:16 1P_JAR
#> 2 #HttpOnly_.google.com TRUE / FALSE 2019-01-05 15:25:16 NID
#> value
#> 1 2018-07-06-23
#> 2 134=4E_Zo-cY8hRLNSj47jRJQML0CPQ8Ip__ ...
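The example above reads cookies that a server set. Going the other way, one way to send cookies with a request is the cookie curl option (CURLOPT_COOKIE in libcurl), which takes a "name=value" string. This is a sketch rather than a crul-specific feature. The cookie names and values are made up, and the /cookies echo endpoint is an assumption based on hb.opencpu.org behaving like httpbin, so the request itself is left commented out.

```r
library(crul)

# cookie is the curl option behind CURLOPT_COOKIE; several cookies
# are separated by "; " inside one string
x <- HttpClient$new(
  url = "https://hb.opencpu.org",
  opts = list(cookie = "flavor=oatmeal; size=large")
)
x$opts$cookie

# Assumption: the server echoes request cookies back at /cookies
# res <- x$get("cookies")
# res$parse("UTF-8")
```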
progress
Print curl progress
relevant commands:
HttpClient$new(progress = fxn)
x <- HttpClient$new("https://hb.opencpu.org/get", progress = httr::progress())
x$get()
#> |==================================| 100%
Why use this? As you could imagine, this is increasingly useful as a request for a web resource takes longer and longer. For very long requests, this will help you know approximately when a request will finish.
proxies
When behind a proxy, give authentication details for your proxy.
relevant commands:
HttpClient$new(proxies = proxy("http://97.77.104.22:3128", "foo", "bar"))
prox <- proxy("http://125.39.66.66:80", user = "username", pwd = "password")
HttpClient$new("http://www.google.com/search", proxies = prox)
Why use this? Most of us likely don’t need to worry about this. However, if you are at a workplace, or in certain geographic locations, you may have to use a proxy. I haven’t personally used a proxy in R, so any feedback on this is welcome.
user agent
Some resources require a user-agent string.
relevant commands:
HttpClient$new(headers = list(`User-Agent` = "foobar"))
or
HttpClient$new(opts = list(useragent = "foobar"))
Both result in the same thing.
Why use this? A user agent is set by default in an HTTP request, as you can see in the User-Agent line of the verbose output above. Some web APIs require that you set a specific user agent. For example, the GitHub API requires that you include a user agent string in the header of each request that is your username or the name of your application so they can contact you if there is a problem.
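For the GitHub case, a client might be set up along these lines. "my-app-name" is a placeholder, to be replaced with your own username or application name. The request is left commented out so the block runs without network access.

```r
library(crul)

# "my-app-name" is a hypothetical application name
gh <- HttpClient$new(
  url = "https://api.github.com",
  opts = list(useragent = "my-app-name")
)
gh$opts$useragent

# gh$get()  # uncomment to send the request with this user agent
```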