Get chunks of XML articles

Package API

  • pub_tabularize
  • pub_guess_publisher
  • pub_sections
  • pub_chunks
  • pub_providers

The main workhorse function is pub_chunks(). It lets you pull sections out of articles from many different publishers (see "Supported publishers/sources" below) without having to know how to parse or navigate XML. XML has a steep learning curve, and working out how to reach a particular part of an XML document can take a lot of searching.
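As a minimal sketch of what that looks like in practice, the snippet below asks for a few named sections rather than everything. It assumes the package bundles an example file at examples/elife_1.xml (that path is an assumption; substitute any XML article you have on disk) and that doi, title and abstract are valid section names in your installed version.

```r
library(pubchunks)

# Assumption: an eLife example file is bundled with the package at this path;
# swap in the path to any XML article from a supported publisher.
x <- system.file("examples/elife_1.xml", package = "pubchunks")

# Request only the sections of interest instead of the default "all"
res <- pub_chunks(x, sections = c("doi", "title", "abstract"))

# The result is expected to be a named list, one element per requested section
res$doi
res$title
```

Asking for specific sections keeps the result small and avoids the nested lists that front, body and back produce.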

The other main function is pub_tabularize(), which takes the output of pub_chunks() and coerces it into a data.frame for easier downstream processing.
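The hand-off between the two functions can be sketched as follows. The example file path is again an assumption about what ships with the package, and the chosen sections are only illustrative.

```r
library(pubchunks)

# Assumption: bundled example file; replace with your own XML article if needed
x <- system.file("examples/elife_1.xml", package = "pubchunks")

# Step 1: extract a few flat sections (nested ones such as "body" are less
# suited to a table)
res <- pub_chunks(x, sections = c("doi", "title", "keywords"))

# Step 2: coerce the extracted sections into a data.frame
df <- pub_tabularize(res)

# Inspect the column names and types before further processing
str(df)
```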

Supported publishers/sources

  • eLife
  • PLOS
  • Entrez/Pubmed
  • Elsevier
  • Hindawi
  • Pensoft
  • PeerJ
  • Copernicus
  • Frontiers
  • F1000 Research

If you know of other publishers or sources that provide XML, let us know by opening an issue.

We’ll continue adding publishers.
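Because the list of publishers changes between releases, it can be useful to ask the installed package what it supports. The sketch below uses the helper functions named under "Package API"; the pensoft_1.xml example path is an assumption, and the exact shape of each return value may differ between versions.

```r
library(pubchunks)

# Publishers/sources the installed version knows how to handle
pub_providers()

# Section names that can be passed to pub_chunks()
pub_sections()

# Assumption: a Pensoft example file is bundled at this path.
# pub_guess_publisher() tries to work out the publisher from the XML itself.
x <- system.file("examples/pensoft_1.xml", package = "pubchunks")
pub_guess_publisher(x)
```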

Installation

Stable version

install.packages("pubchunks")

Development version from GitHub

devtools::install_github("ropensci/pubchunks")

Load library

library('pubchunks')

Get a random XML article

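library(rcrossref)
library(dplyr)

# Find Crossref works that offer CC BY 4.0 licensed full text as XML
res <- cr_works(filter = list(
    full_text_type = "application/xml",
    license_url = "http://creativecommons.org/licenses/by/4.0/"))
# Keep only the links that point at XML full text
links <- bind_rows(res$data$link) %>% filter(content.type == "application/xml")
# Download one article to a temporary file and extract all of its sections.
# The output below is from one run; Crossref results change over time.
download.file(links$URL[1], (i <- tempfile(fileext = ".xml")))
pub_chunks(i)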
#> <pub chunks>
#>   from: file
#>   publisher/journal: scientific_research_publishing/Open Journal of Social Sciences
#>   sections: all
#>   showing up to first 5: 
#>    front (n=2): nested list
#>    body (n=40): Educational behaviors refer to the activities or a
#>    back (n=1): nested list
#>    title (n=1): Inspection on Reality of Kindergarten Teachers’ Ed ...
#>    doi (n=1): 10.4236/jss.2014.29048
download.file(links$URL[13], (j <- tempfile(fileext = ".xml")))
pub_chunks(j)
#> <pub chunks>
#>   from: file
#>   publisher/journal: hindawi/Case Reports in Gastrointestinal Medicine
#>   sections: all
#>   showing up to first 5: 
#>    front (n=2): nested list
#>    body (n=12): The American Association for the Study of Liver Di
#>    back (n=4): nested list
#>    title (n=1): Yogi Detox Tea: A Potential Cause of Acute Liver F ...
#>    doi (n=1): 10.1155/2017/3540756
download.file(links$URL[20], (k <- tempfile(fileext = ".xml")))
pub_chunks(k)
#> <pub chunks>
#>   from: file
#>   publisher/journal: hindawi/Advances in Materials Science and Engineering
#>   sections: all
#>   showing up to first 5: 
#>    front (n=2): nested list
#>    body (n=74): Nowadays, most of the service bridges are close or
#>    back (n=3): nested list
#>    title (n=1): Cubic Function-Based Bayesian Dynamic Linear Predi ...
#>    doi (n=1): 10.1155/2017/7460378
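Several articles can also be processed in one go and combined into a single table. The sketch below uses example files assumed to be bundled with the package so that it runs without network access; with the downloads above you would pass list(i, j, k) instead. It also assumes that pub_chunks() accepts a list of file paths and that pub_tabularize() takes a bind argument.

```r
library(pubchunks)

# Assumption: these example files ship with the package
paths <- list(
  system.file("examples/elife_1.xml", package = "pubchunks"),
  system.file("examples/frontiers_1.xml", package = "pubchunks"),
  system.file("examples/elsevier_1.xml", package = "pubchunks")
)

# Extract the same sections from every article
res <- pub_chunks(paths, sections = c("doi", "title"))

# bind = TRUE asks for one combined data.frame instead of one per article
df <- pub_tabularize(res, bind = TRUE)
df
```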
