Allows for programmatic searching of the arXiv pre-print repository.

arxiv_search(query = NULL, id_list = NULL, start = 0, limit = 10,
  sort_by = c("submitted", "updated", "relevance"), ascending = TRUE,
  batchsize = 100, force = FALSE, output_format = c("data.frame",
  "list"), sep = "|")



query
Search pattern as a string; a vector of such strings is also allowed, in which case the elements are combined with AND.

id_list
arXiv doc IDs, as a comma-delimited string or a vector of such strings.

start
An offset for the start of the search.

limit
Maximum number of records to return.

sort_by
How to sort the results (ignored if id_list is provided).

ascending
If TRUE, sort in ascending order; else descending (ignored if id_list is provided).

batchsize
Maximum number of records to request at one time.

force
If TRUE, force the search request even if it seems extreme.

output_format
Indicates whether the output should be a data frame or a list.

sep
String to use to separate multiple authors, affiliations, DOI links, and categories, in the case that output_format="data.frame".
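The vector forms of query and id_list can be read as shorthand for a single string. The base R sketch below illustrates that correspondence, assuming the elements are joined with " AND " and "," respectively. The helpers `combine_query` and `combine_ids` are hypothetical names used only for illustration; they are not part of aRxiv, and the package's internal handling may differ.

```r
# Hypothetical helpers (not part of aRxiv): show how vector arguments
# might map onto the single-string forms described above.
combine_query <- function(query) paste(query, collapse = " AND ")
combine_ids <- function(id_list) paste(id_list, collapse = ",")

# A vector of search patterns, combined with AND
q <- combine_query(c('au:"Peter Hall"', "ti:deconvolution"))

# A vector of arXiv IDs, combined into a comma-delimited string
ids <- combine_ids(c("0710.3491v1", "0804.0713v1", "1003.0315v1"))

print(q)
print(ids)
```

Either form can be passed to arxiv_search directly, so helpers like these are not needed in practice; they only make the equivalence explicit.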


If output_format="data.frame", the result is a data frame with each row being a manuscript and columns being the various fields.

If output_format="list", the result is a list parsed from the XML output of the search, closer to the raw output from arXiv.

The data frame format has the following columns.

[,1]   id                arXiv ID
[,2]   submitted         date first submitted
[,3]   updated           date last updated
[,4]   title             manuscript title
[,6]   authors           author names
[,7]   affiliations      author affiliations
[,8]   link_abstract     hyperlink to abstract
[,9]   link_pdf          hyperlink to pdf
[,10]  link_doi          hyperlink to DOI
[,11]  comment           authors' comment
[,12]  journal_ref       journal reference
[,13]  doi               published DOI
[,14]  primary_category  primary category
[,15]  categories        all categories

The contents are all strings; missing values are empty strings ("").

The columns authors, affiliations, link_doi, and categories may have multiple entries separated by sep (by default, "|").
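Because multi-valued columns are stored as single delimited strings and missing values as "", a little post-processing is often useful. The sketch below uses a small mock data frame with placeholder values (not real arXiv records) that mimics the documented layout; only base R functions are used.

```r
# Mock result with placeholder values, mimicking the documented columns.
z <- data.frame(
  id = c("0000.0001v1", "0000.0002v1"),
  authors = c("First Author|Second Author", "Third Author"),
  doi = c("", "10.0000/placeholder"),
  stringsAsFactors = FALSE
)

sep <- "|"  # the default separator

# fixed = TRUE is needed because "|" is a regular-expression metacharacter
author_list <- strsplit(z$authors, sep, fixed = TRUE)
n_authors <- lengths(author_list)

# Convert empty strings (the documented missing-value marker) to NA
z$doi[z$doi == ""] <- NA_character_

print(author_list)
print(n_authors)
print(z$doi)
```

If a different sep was supplied in the call to arxiv_search, the same value should be used when splitting.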

The result carries an attribute "search_info" with details of the search parameters, including the time at which the search was completed. Another attribute, "total_results", gives the total number of records that match the query.

See also


Examples

old_delay <- getOption("aRxiv_delay")
options(aRxiv_delay=1)

# search for author Peter Hall with deconvolution in title
z <- arxiv_search(query = 'au:"Peter Hall" AND ti:deconvolution', limit=2)
attr(z, "total_results") # total no. records matching query
#> [1] 4
z$title # manuscript titles
#> [1] "A ridge-parameter approach to deconvolution"
#> [2] "On deconvolution with repeated measurements"

# search for a set of documents by arxiv identifiers
z <- arxiv_search(id_list = c("0710.3491v1", "0804.0713v1", "1003.0315v1"))
# can also use a comma-separated string
z <- arxiv_search(id_list = "0710.3491v1,0804.0713v1,1003.0315v1")
# Journal references, if available
z$journal_ref
#> [1] "Annals of Statistics 2007, Vol. 35, No. 4, 1535-1558"
#> [2] "Annals of Statistics 2008, Vol. 36, No. 2, 665-685"
#> [3] ""

# search for a range of dates (in this case, one day)
z <- arxiv_search("submittedDate:[199701010000 TO 199701012400]", limit=2)

options(aRxiv_delay=old_delay)
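The date-range query string can also be built from a date value instead of being typed by hand. A minimal sketch, assuming the same `YYYYMMDD0000 TO YYYYMMDD2400` pattern used in the one-day example; `one_day_query` is a hypothetical helper, not part of aRxiv.

```r
# Hypothetical helper (not part of aRxiv): build a submittedDate query
# covering a single day, following the pattern in the example above.
one_day_query <- function(date) {
  day <- format(as.Date(date), "%Y%m%d")
  paste0("submittedDate:[", day, "0000 TO ", day, "2400]")
}

q <- one_day_query("1997-01-01")
print(q)
```

The resulting string could then be passed as the query, for example `arxiv_search(q, limit=2)`, which requires the aRxiv package and network access.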