Background

The Cold Spring Harbor Laboratory API provides a direct interface to the medRxiv and bioRxiv databases. However, the API does not support searching. Instead, it provides two endpoints: one returns all content posted between two specified dates, and the other returns all information held on a particular DOI.

medrxivr provides two convenience functions for importing the data provided by these endpoints in R: mx_api_content() and mx_api_doi(). The results of either function can then be passed to mx_search() for searching.

By date range (mx_api_content())

The format of this endpoint is https://api.biorxiv.org/details/[server]/[interval]/[cursor] where ‘interval’ must be two YYYY-MM-DD dates separated by ‘/’. Where metadata for multiple papers is returned, results are paginated, with 100 papers served per call. The ‘cursor’ value can be used to iterate through the results.
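To make the URL pattern and pagination concrete, here is a minimal base-R sketch that builds the page URLs for a date range. Both helpers (build_content_url() and page_cursors()) are hypothetical and not part of medrxivr, and the sketch assumes that the cursor is a zero-based record offset that advances in steps of 100; only the URL template and the page size of 100 come from the description above. No request is sent.

```r
# Hypothetical helper (not part of medrxivr): build page URLs for the
# date-range endpoint, following the template described above.
build_content_url <- function(server, from_date, to_date, cursor = 0) {
  paste0(
    "https://api.biorxiv.org/details/", server, "/",
    from_date, "/", to_date, "/",
    # format() avoids scientific notation for large offsets (e.g. 1e+05)
    format(cursor, scientific = FALSE, trim = TRUE)
  )
}

# Hypothetical helper: cursor values needed to cover `total` records.
# Assumption: the cursor is a zero-based offset and pages hold 100 records.
page_cursors <- function(total, page_size = 100) {
  if (total <= 0) {
    return(numeric(0))
  }
  seq(from = 0, to = total - 1, by = page_size)
}

# Example: the URLs needed to retrieve 250 records from January 2020.
urls <- build_content_url("medrxiv", "2020-01-01", "2020-01-31",
                          page_cursors(250))
print(urls)
```

Under these assumptions, 250 records would need three calls, with cursors 0, 100 and 200.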

mx_api_content() automatically moves through the pages for you, capturing all records returned by the endpoint and returning them as an R object. For instance, https://api.biorxiv.org/details/medrxiv/2020-01-01/2020-01-31/0 will output the first 100 results (or fewer, if fewer exist) within the date range of 2020-01-01 to 2020-01-31, beginning from result 1. To import all records for a given date range into R as a dataframe:

medrxiv_data <- mx_api_content(from_date = "2020-01-01", 
                               to_date = "2020-01-05")
#> Total number of records found: 33


biorxiv_data <- mx_api_content(server = "biorxiv",
                               from_date = "2020-01-01", 
                               to_date = "2020-01-05")
#> Total number of records found: 286

By DOI (mx_api_doi())

The endpoint https://api.biorxiv.org/details/[server]/[DOI] returns details for a single manuscript. For instance, https://api.biorxiv.org/details/medrxiv/10.1101/2020.02.25.20021568 will output metadata for the medRxiv paper with DOI 10.1101/2020.02.25.20021568. To import the results from this endpoint into R as a dataframe:

mx_api_doi(doi = "10.1101/2020.02.25.20021568")
#> # A tibble: 2 x 14
#>   doi   title authors author_correspo… author_correspo… date  version license
#>   <chr> <chr> <chr>   <chr>            <chr>            <chr> <chr>   <chr>  
#> 1 10.1… Deep… Chen, … Honggang Yu      Renmin Hospital… 2020… 1       cc_by_…
#> 2 10.1… Deep… Chen, … Honggang Yu      Renmin Hospital… 2020… 2       cc_by_…
#> # … with 6 more variables: category <chr>, abstract <chr>, published <lgl>,
#> #   node <int>, link_page <chr>, link_pdf <chr>
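The output above has two rows for the same DOI, with version values 1 and 2, which suggests one row per posted version of the manuscript. If only the most recent version is wanted, something like the following base-R sketch could be used. The records data frame is a stand-in that reproduces only the doi and version values visible above, and the filtering step is an illustration rather than a medrxivr feature.

```r
# Stand-in for the result printed above: only the columns whose values are
# fully known from the call and its output (doi and version) are reproduced.
records <- data.frame(
  doi = rep("10.1101/2020.02.25.20021568", 2),
  version = c("1", "2"),
  stringsAsFactors = FALSE
)

# version is a character column, so convert it before comparing.
records$version_num <- as.integer(records$version)

# For each DOI, keep the row whose version equals that DOI's maximum.
max_per_doi <- ave(records$version_num, records$doi, FUN = max)
latest <- records[records$version_num == max_per_doi, ]
print(latest)
```

Converting version to an integer first matters because comparing character values would sort "10" before "2".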

Accessing the raw API data

Both functions contain a clean argument, which is set to TRUE by default. This ensures that the datasets returned by the mx_api_*() functions can immediately be passed to mx_search(). However, there may be occasions where this is not required, and setting this argument to FALSE will return the raw data provided by the API endpoints. For example:

mx_api_content(to_date = "2019-07-01", clean = FALSE)
#> Total number of records found: 32
#> # A tibble: 32 x 13
#>    doi   title authors author_correspo… author_correspo… date  version type 
#>    <chr> <chr> <chr>   <chr>            <chr>            <chr> <chr>   <chr>
#>  1 10.1… Mole… Daniel… Robert Castelo   "Department of … 2019… 1       ""   
#>  2 10.1… Croh… Orna G… Orna G Ehrlich   "Crohn\\'s & Co… 2019… 1       ""   
#>  3 10.1… Upda… Joshua… Joshua D Wallach "Yale School of… 2019… 1       ""   
#>  4 10.1… Pred… Oliver… Olivera Stojano… "Institute of C… 2019… 1       ""   
#>  5 10.1… Pros… Nathan… Nathan Brajer    "Duke Universit… 2019… 1       ""   
#>  6 10.1… Tren… Brian … Ben Goldacre     "University of … 2019… 1       ""   
#>  7 10.1… 18F-… Nicola… Nicolas Nicastro "University of … 2019… 1       ""   
#>  8 10.1… Perc… Sistan… Lena H Ting      "Emory Universi… 2019… 1       ""   
#>  9 10.1… Prox… Tesfa … Tesfa Dejenie H… "Department of … 2019… 1       ""   
#> 10 10.1… Tran… Alexan… Alexandre Vivot  "APHP"           2019… 1       ""   
#> # … with 22 more rows, and 5 more variables: license <chr>, category <chr>,
#> #   abstract <chr>, published <chr>, server <chr>
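To see what cleaning changes, the column names in the raw output can be compared with those in the cleaned output shown for mx_api_doi(). The sketch below uses only the names that are fully visible in the two printouts (the two truncated "author_correspo…" names are left out), and it assumes that mx_api_content() and mx_api_doi() return the same columns when clean = TRUE.

```r
# Column names visible in the cleaned output (clean = TRUE) printed for
# mx_api_doi(). Names truncated in the printout are deliberately omitted.
clean_cols <- c("doi", "title", "authors", "date", "version", "license",
                "category", "abstract", "published", "node",
                "link_page", "link_pdf")

# Column names visible in the raw output (clean = FALSE) printed above.
raw_cols <- c("doi", "title", "authors", "date", "version", "type",
              "license", "category", "abstract", "published", "server")

# Columns present only in the raw data, and only in the cleaned data.
only_raw <- setdiff(raw_cols, clean_cols)
only_clean <- setdiff(clean_cols, raw_cols)
print(only_raw)
print(only_clean)
```

On these names, the raw data carries type and server, while the cleaned data carries node, link_page and link_pdf. The printouts also show published as a character column in the raw data and a logical column in the cleaned data.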