Interacting with the Cold Spring Harbor Laboratory API
Yaoxiang Li
2024-12-05
Source: vignettes/medrxiv-api.Rmd
Background
The Cold Spring Harbor Laboratory API provides a direct interface to the medRxiv and bioRxiv databases. However, the API does not support searching. Instead, it provides two endpoints: one returns all content posted between two specified dates, and the other returns all information held on a particular DOI.
medrxivr provides two convenience functions for importing the data provided by these endpoints in R: mx_api_content() and mx_api_doi(). The results of either function can then be passed to mx_search() for searching.
By date range (mx_api_content())
The format of this endpoint is https://api.biorxiv.org/details/[server]/[interval]/[cursor] where ‘interval’ must be two YYYY-MM-DD dates separated by ‘/’. Where metadata for multiple papers is returned, results are paginated with 100 papers served in a call. The ‘cursor’ value can be used to iterate through the result.
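The URL pattern and pagination described above can be sketched in base R. This is a minimal illustration only, not how medrxivr is implemented: build_details_url() and page_cursors() are hypothetical helpers, and the sketch assumes the cursor is a zero-based record offset that advances by 100 per page.

```r
# Hypothetical helper (not part of medrxivr): build the URL for one page
# of the date-range endpoint.
build_details_url <- function(server, from_date, to_date, cursor = 0) {
  # Fail early if either date is not a valid YYYY-MM-DD string
  dates <- as.Date(c(from_date, to_date), format = "%Y-%m-%d")
  if (anyNA(dates)) stop("Dates must be in YYYY-MM-DD format")
  interval <- paste(format(dates, "%Y-%m-%d"), collapse = "/")
  # %d avoids scientific notation for large cursor values
  sprintf("https://api.biorxiv.org/details/%s/%s/%d",
          server, interval, as.integer(cursor))
}

# Hypothetical helper: cursor values needed to cover `total` records,
# assuming the cursor is a zero-based offset and pages hold 100 records.
page_cursors <- function(total, page_size = 100) {
  seq(0, max(total - 1, 0), by = page_size)
}

# One URL per page for a result set of 286 records
urls <- vapply(
  page_cursors(286),
  function(cursor) {
    build_details_url("biorxiv", "2020-01-01", "2020-01-05", cursor)
  },
  character(1)
)
urls
```

With 286 records this gives three URLs, with cursors 0, 100 and 200.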
mx_api_content() automatically moves through the pages for you, capturing all records returned by the endpoint and returning them as an R object. For instance, https://api.biorxiv.org/details/medrxiv/2020-01-01/2020-01-31/0 will output the first 100 results (or fewer, if fewer exist) within the date range 2020-01-01 to 2020-01-31, beginning from result 1. To import the records for a date range into R as a data frame (here, the shorter range 2020-01-01 to 2020-01-05):
library(medrxivr)

# medRxiv records (the server used when none is specified)
medrxiv_data <- mx_api_content(from_date = "2020-01-01",
                               to_date = "2020-01-05")
#> Estimated total number of records as per API metadata: 33
#> Number of records retrieved from API: 33

# bioRxiv records for the same date range
biorxiv_data <- mx_api_content(server = "biorxiv",
                               from_date = "2020-01-01",
                               to_date = "2020-01-05")
#> Estimated total number of records as per API metadata: 286
#> Number of records retrieved from API: 286
By DOI (mx_api_doi())
https://api.biorxiv.org/details/[server]/[DOI] returns details for a single manuscript. For instance, https://api.biorxiv.org/details/medrxiv/10.1101/2020.02.25.20021568 will output metadata for the medRxiv paper with DOI 10.1101/2020.02.25.20021568. To import the results from this endpoint into R as a data frame:
mx_api_doi(doi = "10.1101/2020.02.25.20021568")
#> # A tibble: 2 × 15
#>   doi    title authors author_corresponding author_corresponding…¹ date  version
#>   <chr>  <chr> <chr>   <chr>                <chr>                  <chr> <chr>
#> 1 10.11… Deep… Chen, … Honggang Yu          Renmin Hospital of Wu… 2020… 1
#> 2 10.11… Deep… Chen, … Honggang Yu          Renmin Hospital of Wu… 2020… 2
#> # ℹ abbreviated name: ¹author_corresponding_institution
#> # ℹ 8 more variables: license <chr>, category <chr>, jatsxml <chr>,
#> #   abstract <chr>, published <chr>, node <int>, link_page <chr>,
#> #   link_pdf <chr>
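The output above contains one row per version of the manuscript. If only the most recent version is wanted, something like the following base R sketch could be applied to the returned data. It runs on a small toy data frame rather than a live API call; the second DOI is a made-up placeholder, and the only assumptions taken from the output above are the doi and version columns and that version is stored as character.

```r
# Toy data mimicking the doi and version columns shown above.
# "10.1101/placeholder" is an invented DOI used only for illustration.
records <- data.frame(
  doi = c("10.1101/2020.02.25.20021568",
          "10.1101/2020.02.25.20021568",
          "10.1101/placeholder"),
  version = c("1", "2", "1"),
  stringsAsFactors = FALSE
)

# version is a character column, so convert it before comparing
records$version_num <- as.integer(records$version)

# Sort by DOI, then by descending version, and keep the first row per DOI
ordered <- records[order(records$doi, -records$version_num), ]
latest <- ordered[!duplicated(ordered$doi), ]
latest
```

Converting version to an integer first avoids string comparison, under which "10" would sort before "2".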
Accessing the raw API data
Both functions contain a clean argument, which is set to TRUE by default. This ensures that the datasets returned by the mx_api_*() functions can immediately be passed to mx_search(). However, there may be occasions where this is not required; setting this argument to FALSE will return the raw data provided by the API endpoints. For example:
mx_api_content(to_date = "2019-07-01", clean = FALSE)
#> Estimated total number of records as per API metadata: 32
#> Number of records retrieved from API: 32
#> # A tibble: 32 × 14
#>    doi   title authors author_corresponding author_corresponding…¹ date  version
#>    <chr> <chr> <chr>   <chr>                <chr>                  <chr> <chr>
#>  1 10.1… Mole… Daniel… Robert Castelo       "Department of Experi… 2019… 1
#>  2 10.1… Croh… Orna G… Orna G Ehrlich       "Crohn\\'s & Colitis … 2019… 1
#>  3 10.1… Upda… Joshua… Joshua D Wallach     "Yale School of Publi… 2019… 1
#>  4 10.1… Pred… Oliver… Olivera Stojanovic   "Institute of Cogniti… 2019… 1
#>  5 10.1… Pros… Nathan… Nathan Brajer        "Duke University Scho… 2019… 1
#>  6 10.1… Tren… Brian … Ben Goldacre         "University of Oxford" 2019… 1
#>  7 10.1… 18F-… Nicola… Nicolas Nicastro     "University of Cambri… 2019… 1
#>  8 10.1… Perc… Sistan… Lena H Ting          "Emory University"     2019… 1
#>  9 10.1… Prox… Tesfa … Tesfa Dejenie Habte… "Department of Epidem… 2019… 1
#> 10 10.1… Tran… Alexan… Alexandre Vivot      "APHP"                 2019… 1
#> # ℹ 22 more rows
#> # ℹ abbreviated name: ¹author_corresponding_institution
#> # ℹ 7 more variables: type <chr>, license <chr>, category <chr>, jatsxml <chr>,
#> #   abstract <chr>, published <chr>, server <chr>
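To see which columns differ between the cleaned and raw outputs, the column names printed in this vignette can be compared directly. The sketch below hard-codes the names from the two printed tibbles (the cleaned mx_api_doi() result and the raw mx_api_content() result) instead of calling the API, and it assumes both functions return the same set of columns for a given value of clean.

```r
# Column names copied from the cleaned output printed earlier (15 columns)
clean_cols <- c(
  "doi", "title", "authors", "author_corresponding",
  "author_corresponding_institution", "date", "version",
  "license", "category", "jatsxml", "abstract", "published",
  "node", "link_page", "link_pdf"
)

# Column names copied from the raw output printed above (14 columns)
raw_cols <- c(
  "doi", "title", "authors", "author_corresponding",
  "author_corresponding_institution", "date", "version",
  "type", "license", "category", "jatsxml", "abstract", "published",
  "server"
)

# Columns present only in the raw data
raw_only <- setdiff(raw_cols, clean_cols)
raw_only
#> [1] "type"   "server"

# Columns present only in the cleaned data
clean_only <- setdiff(clean_cols, raw_cols)
clean_only
#> [1] "node"      "link_page" "link_pdf"
```

With live data, the same comparison could be made by calling names() on the two returned data frames.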