ft_get is a one-stop shop to fetch the full text of articles, as XML or PDF. We have specific support for PLOS via the rplos package, Entrez via the rentrez package, and arXiv via the aRxiv package. For other publishers, ft_get has helpers to sort out full text links based on user input. Articles are saved on disk. See Details for help on how to use this function.

ft_get(
  x,
  from = NULL,
  type = "xml",
  try_unknown = TRUE,
  bmcopts = list(),
  entrezopts = list(),
  elifeopts = list(),
  elsevieropts = list(),
  sciencedirectopts = list(),
  wileyopts = list(),
  crossrefopts = list(),
  progress = FALSE,
  ...
)

ft_get_ls()

Arguments

x

Either identifiers for papers, i.e., DOIs (or other IDs) as a character vector or a list of character strings, OR an object of class ft, as returned from ft_search()

from

Source to query. Optional.

type

(character) one of xml (default), pdf, or plain (Elsevier and ScienceDirect only). We chose xml as the default because it has structure a machine can reason about, but you are of course free to request pdf, or plain in the case of Elsevier and ScienceDirect.

try_unknown

(logical) if the publisher plugin is not already known, we try to fetch a full text link using either the ftdoi package or Crossref. If not found at ftdoi or Crossref we skip with a warning; if found, we attempt to download. Only applicable in the character and list S3 methods. Default: TRUE

bmcopts

BMC options, a named list. This parameter is DEPRECATED

entrezopts

Entrez options, a named list. See rentrez::entrez_search() and rentrez::entrez_fetch()

elifeopts

eLife options, a named list.

elsevieropts

Elsevier options, a named list. Use retain_non_ft=TRUE to retain files that do not actually have full text but likely only have an abstract. By default we set retain_non_ft=FALSE so that if we detect that you only got an abstract back, we delete it and report an error that you likely don't have access.

sciencedirectopts

Elsevier ScienceDirect options, a named list.

wileyopts

Wiley options, a named list.

crossrefopts

Crossref options, a named list.

progress

(logical) whether to show a progress bar. Default: FALSE. If TRUE, we use utils::txtProgressBar() and utils::setTxtProgressBar() to create the progress bar, and each progress bar connection is closed on function exit. A progress bar is run for each data source. Works for all S3 methods except ft_get.links. When articles are not already downloaded you see the progress bar. If articles are already downloaded/cached, we normally throw messages saying so, but if a progress bar is requested those messages are suppressed so they don't interrupt the bar.

...

curl options passed on to crul::HttpClient, see examples below

Value

An object of class ft_data (an S3 object) with a slot for each publisher. The returned object is split up by publisher because the full text format is consistent within a publisher, which should facilitate downstream text mining, as different steps may be needed for each publisher's content.

Note that we have a print method for ft_data so you see something like this:

<fulltext text>
[Docs] 4
[Source] ext - /Users/foobar/Library/Caches/R/fulltext
[IDs] 10.2307/1592482 10.2307/1119209 10.1037/11755-024 ...

Within each publisher there is a list with the elements:

  • found: number of full text articles found

  • dois: the DOIs given and searched for

  • data

    • backend: the backend. Right now only ext, for "by file extension"; we may add other backends in the future, thus we retain this

    • cache_path: the base directory path for file caching

    • path: if the file was retrieved, the full path to the file; if not retrieved, this is NULL

    • data: if text extracted (see ft_collect()) the text will be here, but until then this is NULL

  • opts: the options given like article type, dois

  • errors: data.frame of errors, with two columns for article id and error
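As a sketch of navigating this structure (using a PLOS DOI from the examples below; requires network access, and the exact fields are as documented above):

```r
library(fulltext)

res <- ft_get('10.1371/journal.pone.0086169')
res$plos$found        # number of full text articles found
res$plos$dois         # DOIs given and searched for
res$plos$data$path    # path(s) to the file(s) on disk
res$plos$errors       # data.frame of article ids and errors
# `data` is NULL until text is pulled off disk with ft_collect()
res <- ft_collect(res)
```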

Details

There are various ways to use ft_get:

  • Pass in only DOIs - leave from parameter NULL. This route will first query Crossref API for the publisher of the DOI, then we'll use the appropriate method to fetch full text from the publisher. If a publisher is not found for the DOI, then we'll throw back a message telling you a publisher was not found.

  • Pass in DOIs (or other pub IDs) and use the from parameter. This route means we don't have to make an extra API call to Crossref (thus, this route is faster) to determine the publisher for each DOI. We go straight to getting full text based on the publisher.

  • Use ft_search() to search for articles. Then pass that output to this function, which will use info in that object. This behaves the same as the previous option in that each DOI has publisher info so we know how to get full text for each DOI.
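The three routes above can be sketched as follows (DOIs and queries taken from the examples below; requires network access):

```r
library(fulltext)

# 1) DOIs only: ft_get first asks Crossref who the publisher is
ft_get('10.7554/eLife.03032')

# 2) DOIs plus `from`: skips the Crossref lookup, so it's faster
ft_get('10.7554/eLife.03032', from = "elife")

# 3) pass ft_search() output directly; publisher info travels with it
res <- ft_search(query = 'ecology', from = 'entrez')
ft_get(res)
```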

Note that some publishers are available via Entrez, but often not their recent articles, where "recent" may mean a few months to a year or so. In that case, make sure to specify the publisher, or else you'll get back no data.

Important Access Notes

See Rate Limits and Authentication in fulltext-package for rate limiting and authentication information, respectively.

In particular, take note that when fetching full text from Wiley and Elsevier, the only route (unless it's one of their OA papers) is the Crossref TDM flow, in which you need a Crossref TDM API key and your institution needs access to the exact journal you are trying to fetch a paper from. If your institution doesn't have access you may still get a result, but it's likely only the abstract. Much the same is true when fetching from ScienceDirect directly: you need an Elsevier API key that is valid for their TDM/article API. See Authentication in fulltext-package for details.

Notes on the type parameter

The type parameter is sometimes used and sometimes ignored; certain data sources accept only one type. By data source/publisher:

  • PLOS: pdf and xml

  • Entrez: only xml

  • eLife: pdf and xml

  • Pensoft: pdf and xml

  • arXiv: only pdf

  • BiorXiv: only pdf

  • Elsevier: xml and plain

  • Elsevier ScienceDirect: xml and plain

  • Wiley: pdf and xml

  • Peerj: pdf and xml

  • Informa: only pdf

  • FrontiersIn: pdf and xml

  • Copernicus: pdf and xml

  • Scientific Societies: only pdf

  • Cambridge: only pdf

  • Crossref: depends on the publisher

  • other data sources/publishers: there are too many to cover here; we may add a helper in the future describing what each publisher supports
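As a sketch of how type plays out in practice (DOIs taken from the examples below; requires network access):

```r
library(fulltext)

# eLife serves both xml and pdf; request the pdf explicitly
ft_get('10.7554/eLife.03032', from = "elife", type = "pdf")

# Entrez only serves xml, so type is effectively ignored there
ft_get('10.3389/fphar.2014.00109', from = "entrez")
```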

How data is stored

ft_get used to have many options for "backends". We have simplified this to one option. That one option is that all full text files are written to disk on your machine. You can choose where these files are stored.

In addition, files are named by their IDs (usually DOIs), and the file extension for the full text type (pdf or xml usually). This makes inspecting the files easy.

Data formats

xml full text is stored in .xml files. pdf is stored in .pdf files. And plain text is stored in .txt files.

Reusing cached articles

All files are written to disk and we check for a file matching the given DOI/ID on each request; if found we use it and throw a message saying so.

Caching

Previously, you could set caching options in each ft_get function call. We've simplified this to only setting caching options through the function cache_options_set() - and you can get your cache options using cache_options_get(). See those docs for help on caching.
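A minimal sketch of working with the cache settings (see the docs for cache_options_set() and cache_options_get() for the exact arguments; the directory name below is just an illustration):

```r
library(fulltext)

# inspect current cache settings
cache_options_get()

# store full text files under a custom directory
cache_options_set(path = "my_fulltext_cache")
```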

Notes on specific publishers

  • arXiv: The IDs passed are not actually DOIs, though they look similar. You must therefore pass the from parameter, as we can't unambiguously determine that the IDs passed in are from arXiv.org.

  • bmc: Is a hot mess since the Springer acquisition. It's been removed as an officially supported plugin; some of their DOIs may still work when passed in here, but there are no guarantees.

Warnings

You will see warnings thrown in the R shell or in the resulting object. See ft_get-warnings for more information on what warnings mean.

Examples

# List publishers included
ft_get_ls()
#>  [1] "aaas"                "aip"                 "amersocclinoncol"
#>  [4] "amersocmicrobiol"    "arxiv"               "biorxiv"
#>  [7] "bmc"                 "cambridge"           "cob"
#> [10] "copernicus"          "crossref"            "elife"
#> [13] "elsevier"            "entrez"              "frontiersin"
#> [16] "ieee"                "informa"             "instinvestfil"
#> [19] "jama"                "microbiology"        "peerj"
#> [22] "pensoft"             "plos"                "pnas"
#> [25] "royalsocchem"        "roysoc"              "sciencedirect"
#> [28] "scientificsocieties" "wiley"
if (FALSE) {
# If you just have DOIs and don't know the publisher
## PLOS
ft_get('10.1371/journal.pone.0086169')

# Collect all errors from across papers
# similarly can combine from different publishers as well
res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.aaaa'), from = "elife")
res$elife$errors

## PeerJ
ft_get('10.7717/peerj.228')
ft_get('10.7717/peerj.228', type = "pdf")

## eLife
### xml
ft_get('10.7554/eLife.03032')
res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), from = "elife")
res$elife
respdf <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'),
  from = "elife", type = "pdf")
respdf$elife

elife_xml <- ft_get('10.7554/eLife.03032', from = "elife")
library(magrittr)
elife_xml %<>% ft_collect()
elife_xml$elife
### pdf
elife_pdf <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'),
  from = "elife", type = "pdf")
elife_pdf$elife
elife_pdf %<>% ft_collect()
elife_pdf %>% ft_extract()

## some BMC DOIs will work, but some may not, who knows
ft_get(c('10.1186/2049-2618-2-7', '10.1186/2193-1801-3-7'), from = "entrez")

## FrontiersIn
res <- ft_get(c('10.3389/fphar.2014.00109', '10.3389/feart.2015.00009'))
res
res$frontiersin

## Hindawi - via Entrez
res <- ft_get(c('10.1155/2014/292109', '10.1155/2014/162024',
  '10.1155/2014/249309'))
res
res$hindawi
res$hindawi$data$path
res %>% ft_collect() %>% .$hindawi

## F1000Research - via Entrez
x <- ft_get('10.12688/f1000research.6522.1')
## Two different publishers via Entrez - retains publisher names
res <- ft_get(c('10.1155/2014/292109', '10.12688/f1000research.6522.1'))
res$hindawi
res$f1000research

## Thieme
### coverage is hit and miss, it's not great
ft_get('10.1055/s-0032-1316462')

## Pensoft
ft_get('10.3897/mycokeys.22.12528')

## Copernicus
out <- ft_get(c('10.5194/angeo-31-2157-2013', '10.5194/bg-12-4577-2015'))
out$copernicus

## arXiv - only pdf, you have to pass in the from parameter
res <- ft_get(x = 'cond-mat/9309029', from = "arxiv")
res$arxiv
res %>% ft_extract %>% .$arxiv

## bioRxiv - only pdf
res <- ft_get(x = '10.1101/012476')
res$biorxiv

## AAAS - only pdf
res <- ft_get(x = '10.1126/science.276.5312.548')
res$aaas

# The Royal Society
res <- ft_get("10.1098/rspa.2007.1849")
ft_get(c("10.1098/rspa.2007.1849", "10.1098/rstb.1970.0037",
  "10.1098/rsif.2006.0142"))

## Karger Publisher
(x <- ft_get('10.1159/000369331'))
x$karger

## MDPI Publisher
(x <- ft_get('10.3390/nu3010063'))
x$mdpi
ft_get('10.3390/nu7085279')
ft_get(c('10.3390/nu3010063', '10.3390/nu7085279'))

# Scientific Societies
## this is a paywall article, you may not have access or you may
x <- ft_get("10.1094/PHYTO-04-17-0144-R")
x$scientificsocieties

# Informa
x <- ft_get("10.1080/03088839.2014.926032")
ft_get("10.1080/03088839.2013.863435")
## CogentOA - part of Informa/Taylor & Francis now
ft_get('10.1080/23311916.2014.938430')

library(rplos)
(dois <- searchplos(q = "*:*", fl = 'id',
  fq = list('doc_type:full', "article_type:\"research article\""),
  limit = 5)$data$id)
ft_get(dois)
ft_get(c('10.7717/peerj.228', '10.7717/peerj.234'))

# elife
ft_get('10.7554/eLife.04300', from = 'elife')
ft_get(c('10.7554/eLife.04300', '10.7554/eLife.03032'), from = 'elife')
## search for elife papers via Entrez
dois <- ft_search("elife[journal]", from = "entrez")
ft_get(dois)

# Frontiers in Pharmacology (publisher: Frontiers)
doi <- '10.3389/fphar.2014.00109'
ft_get(doi, from = "entrez")

# Hindawi Journals
ft_get(c('10.1155/2014/292109', '10.1155/2014/162024',
  '10.1155/2014/249309'), from = 'entrez')

# Frontiers Publisher - Frontiers in Aging Neuroscience
res <- ft_get("10.3389/fnagi.2014.00130", from = 'entrez')
res$entrez

# Search entrez, get some DOIs
(res <- ft_search(query = 'ecology', from = 'entrez'))
res$entrez$data$doi
ft_get(res$entrez$data$doi[1], from = 'entrez')
ft_get(res$entrez$data$doi[1:3], from = 'entrez')

# Search entrez, and pass to ft_get()
(res <- ft_search(query = 'ecology', from = 'entrez'))
ft_get(res)

# elsevier, ugh
## set an environment variable like Sys.setenv(CROSSREF_TDM = "your key")
### an open access article
ft_get(x = "10.1016/j.trac.2016.01.027", from = "elsevier")
### non open access article
#### If you don't have access, by default you get abstract only, and we
#### treat it as an error as we assume you want full text
ft_get(x = "10.1016/j.trac.2016.05.027", from = "elsevier")
#### If you want to retain whatever Elsevier gives you
#### set "retain_non_ft = TRUE"
ft_get(x = "10.1016/j.trac.2016.05.027", from = "elsevier",
  elsevieropts = list(retain_non_ft = TRUE))

# sciencedirect
## set an environment variable like Sys.setenv(ELSEVIER_TDM_KEY = "your key")
ft_get(x = "10.1016/S0140-6736(13)62329-6", from = "sciencedirect")

# wiley, ugh
ft_get(x = "10.1006/asle.2001.0035", from = "wiley", type = "pdf")
## xml
ft_get(x = "10.1111/evo.13812", from = "wiley")
## highwire fiasco paper
ft_get(x = "10.3732/ajb.1300053", from = "wiley")
ft_get(x = "10.3732/ajb.1300053", from = "wiley", type = "pdf")

# IEEE, ugh
ft_get('10.1109/TCSVT.2012.2221191', type = "pdf")

# AIP Publishing
ft_get('10.1063/1.4967823', try_unknown = TRUE)

# PNAS
ft_get('10.1073/pnas.1708584115', try_unknown = TRUE)

# American Society for Microbiology
ft_get('10.1128/cvi.00178-17')

# American Society of Clinical Oncology
ft_get('10.1200/JCO.18.00454')

# American Institute of Physics
ft_get('10.1063/1.4895527')

# American Chemical Society
ft_get(c('10.1021/la903074z', '10.1021/jp048806z'))

# Royal Society of Chemistry
ft_get('10.1039/c8cc06410e')

# From ft_links output
## Crossref
(res2 <- ft_search(query = 'ecology', from = 'crossref', limit = 3,
  crossrefopts = list(filter = list(has_full_text = TRUE, member = 98))))
(out <- ft_links(res2))
(ress <- ft_get(x = out, type = "pdf"))
ress$crossref

(x <- ft_links("10.1111/2041-210X.12656", "crossref"))
(y <- ft_get(x))

## Cambridge
x <- ft_get("10.1017/s0922156598230305")
x$cambridge
z <- ft_get("10.1017/jmo.2019.20")
z$cambridge
m <- ft_get("10.1017/S0266467419000270")
m$cambridge

## No publisher plugin provided yet
ft_get('10.1037/10740-005')
### no link available for this DOI
res <- ft_get('10.1037/10740-005', try_unknown = TRUE)
res[[1]]

# Get a progress bar - off by default
library(rplos)
(dois <- searchplos(q = "*:*", fl = 'id',
  fq = list('doc_type:full', "article_type:\"research article\""),
  limit = 5)$data$id)
## when articles not already downloaded you see the progress bar
b <- ft_get(dois, progress = TRUE)
## if articles already downloaded/cached, normally we throw messages
## saying so
b <- ft_get(dois, progress = FALSE)
## but if a progress bar is requested, then the messages are suppressed
b <- ft_get(dois, progress = TRUE)

# curl options
ft_get("10.1371/journal.pcbi.1002487", verbose = TRUE)
ft_get('10.3897/mycokeys.22.12528', from = "pensoft", verbose = TRUE)
}