pub_chunks makes it easy to extract sections of an article. You can extract just authors across all articles, or all references sections, or the complete text of each article. Then you can pass the output downstream for visualization and analysis.

pub_chunks(x, sections = "all", provider = NULL)

Arguments

x

one of the following:

  • file path for an XML file

  • a character string of XML

  • a list (of file paths, XML in character strings, or xml_document objects)

  • or an object of class fulltext::ft_data, the output from a call to fulltext::ft_get()

sections

(character) Which elements to get; one or more, given as a vector or list. See pub_sections() for options. Optional; the default is to get all sections. See Details.

provider

(character) a single publisher name. See pub_providers() for options. Optional. By default this is NULL and we use pub_guess_publisher() to guess the publisher; the guess may be wrong. You can override the guess by passing in a name. If you select the wrong provider for the XML you have, you may or may not get what you need.
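A minimal sketch of overriding the publisher guess, using the Elsevier example file that ships with the package. It assumes that "elsevier" is one of the names returned by pub_providers() and that pub_providers() returns a character vector; the check on the third line guards that assumption rather than taking it for granted.

```r
library(pubchunks)

x <- system.file("examples/elsevier_1.xml", package = "pubchunks")

# Assumption: "elsevier" is an accepted provider name; stop early if it is not
stopifnot("elsevier" %in% pub_providers())

# Let pub_chunks() guess the publisher (provider = NULL is the default)
guessed <- pub_chunks(x, "title")

# Name the publisher explicitly instead of relying on the guess
explicit <- pub_chunks(x, "title", provider = "elsevier")

explicit$title
```

Naming the provider is mainly useful when the guess is known to be unreliable for a given source; for the bundled example files the default is usually enough.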

Value

A list, named by the section selected. Sections that are not found, or are not in the accepted list, return NULL or a zero-length list. A ".publisher" list element is attached to each list output, even when no data is found. When fulltext::ft_get output is passed in, the list is named by publisher, and within each publisher is a list of articles named by their identifiers (e.g. DOIs).
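As a rough illustration of the return structure described above, the sketch below pulls two sections from the bundled Elsevier example file and inspects the result by name. The variable name res is only for illustration.

```r
library(pubchunks)

x <- system.file("examples/elsevier_1.xml", package = "pubchunks")
res <- pub_chunks(x, c("title", "refs"))

# The list is named by the requested sections, plus the ".publisher" element
names(res)

# Individual sections are ordinary list elements
res$title
length(res$refs)

# Publisher information travels with the result
res$.publisher
```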

Details

Options for the sections parameter:

  • front - Publisher, journal and article metadata elements

  • body - Body of the article

  • back - Back of the article, acknowledgments, author contributions, references

  • title - Article title

  • doi - Article DOI

  • categories - Publisher's categories, if any

  • authors - Authors

  • aff - Affiliation (includes author names)

  • keywords - Keywords

  • abstract - Article abstract

  • executive_summary - Article executive summary

  • refs - References

  • refs_dois - References DOIs - if available

  • publisher - Publisher name

  • journal_meta - Journal metadata

  • article_meta - Article metadata

  • acknowledgments - Acknowledgments

  • permissions - Article permissions

  • history - Dates: received, published, accepted, etc.
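Because sections that are not in the accepted list come back as NULL or a zero-length list rather than raising an error, a misspelled section name can go unnoticed. One way to catch this, sketched below, is to compare the requested names against pub_sections() before calling pub_chunks(). This assumes pub_sections() returns a character vector of accepted names; the vector wanted and its deliberate typo are made up for illustration.

```r
library(pubchunks)

# Hypothetical request containing a typo ("abstrct")
wanted <- c("title", "doi", "abstrct")

# Names in `wanted` that are not accepted section names
unknown <- setdiff(wanted, pub_sections())
unknown

# Keep only the accepted names before extracting
x <- system.file("examples/elsevier_1.xml", package = "pubchunks")
res <- pub_chunks(x, intersect(wanted, pub_sections()))
```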

Examples

# a file path to an XML file
x <- system.file("examples/elsevier_1.xml", package = "pubchunks")
pub_chunks(x, "title")
#> <pub chunks>
#> from: file
#> publisher/journal: elsevier/Nefrología (English Edition)
#> sections: title
#> showing up to first 5:
#> title (n=1): Bibliographic references in PubMed and other searc ...
pub_chunks(x, "authors")
#> <pub chunks>
#> from: file
#> publisher/journal: elsevier/Nefrología (English Edition)
#> sections: authors
#> showing up to first 5:
#> authors (n=1): Martínez-Castelao, Alberto
pub_chunks(x, "acknowledgments")
#> <pub chunks>
#> from: file
#> publisher/journal: elsevier/Nefrología (English Edition)
#> sections: acknowledgments
#> showing up to first 5:
#> acknowledgments (n=0):
pub_chunks(x, "refs")
#> <pub chunks>
#> from: file
#> publisher/journal: elsevier/Nefrología (English Edition)
#> sections: refs
#> showing up to first 5:
#> refs (n=6): Martínez-Castelao A., J.L. Górriz, J. Bover, J. Se
pub_chunks(x, c("title", "refs"))
#> <pub chunks>
#> from: file
#> publisher/journal: elsevier/Nefrología (English Edition)
#> sections: title, refs
#> showing up to first 5:
#> title (n=1): Bibliographic references in PubMed and other searc ...
#> refs (n=6): Martínez-Castelao A., J.L. Górriz, J. Bover, J. Se
if (FALSE) {
# works the same with the xml already in a string
xml <- paste0(readLines(x), collapse = "")
pub_chunks(xml, "title")

# also works if you've already read in the XML (with xml2 pkg)
xml <- paste0(readLines(x), collapse = "")
xml <- xml2::read_xml(xml)
pub_chunks(xml, "title")

# Hindawi
x <- system.file("examples/hindawi_1.xml", package = "pubchunks")
pub_chunks(x, "abstract")
pub_chunks(x, "authors")
pub_chunks(x, "aff")
pub_chunks(x, "title")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("abstract", "title", "authors", "refs"))

# Pensoft
x <- system.file("examples/pensoft_1.xml", package = "pubchunks")
pub_chunks(x, "abstract")
pub_chunks(x, "aff")
pub_chunks(x, "title")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("abstract", "title", "authors", "refs"))

# Peerj
x <- system.file("examples/peerj_1.xml", package = "pubchunks")
pub_chunks(x, "abstract")
pub_chunks(x, "authors")
pub_chunks(x, "aff")
pub_chunks(x, "title")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("abstract", "title", "authors", "refs"))

# Frontiers
x <- system.file("examples/frontiers_1.xml", package = "pubchunks")
pub_chunks(x, "authors")
pub_chunks(x, "aff")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("doi", "abstract", "title", "authors", "refs", "abstract"))

# eLife
x <- system.file("examples/elife_1.xml", package = "pubchunks")
pub_chunks(x, "authors")
pub_chunks(x, "aff")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("doi", "title", "authors", "refs"))

# f1000research
x <- system.file("examples/f1000research_3.xml", package = "pubchunks")
pub_chunks(x, "title")
pub_chunks(x, "aff")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("doi", "title", "authors", "keywords", "refs"))

# Copernicus
x <- system.file("examples/copernicus_1.xml", package = "pubchunks")
pub_chunks(x, c("doi", "abstract", "title", "authors", "refs"))
pub_chunks(x, "aff")
pub_chunks(x, "refs")$refs

# MDPI
x <- system.file("examples/mdpi_1.xml", package = "pubchunks")
x <- system.file("examples/mdpi_2.xml", package = "pubchunks")
pub_chunks(x, "title")
pub_chunks(x, "aff")
pub_chunks(x, "refs")$refs
vv <- pub_chunks(x, c("doi", "title", "authors", "keywords", "refs",
  "abstract", "categories"))
vv$doi
vv$title
vv$authors
vv$keywords
vv$refs
vv$abstract
vv$categories

# Many inputs at once
x <- system.file("examples/frontiers_1.xml", package = "pubchunks")
y <- system.file("examples/elife_1.xml", package = "pubchunks")
z <- system.file("examples/f1000research_1.xml", package = "pubchunks")
pub_chunks(list(x, y, z), c("doi", "title", "authors", "refs"))

# non-XML files/content are xxx?
# pub_chunks('foo bar')

# Pubmed brief XML files (abstract only)
x <- system.file("examples/pubmed_brief_1.xml", package = "pubchunks")
pub_chunks(x, "title")

# Pubmed full XML files
x <- system.file("examples/pubmed_full_1.xml", package = "pubchunks")
pub_chunks(x, "title")

# using output of fulltext::ft_get()
if (requireNamespace("fulltext", quietly = TRUE)) {
  library("fulltext")

  # single
  x <- fulltext::ft_get('10.7554/eLife.03032')
  pub_chunks(fulltext::ft_collect(x), sections="authors")

  # many
  dois <- c('10.1371/journal.pone.0086169', '10.1371/journal.pone.0155491',
    '10.7554/eLife.03032')
  x <- fulltext::ft_get(dois)
  pub_chunks(fulltext::ft_collect(x), sections="authors")

  # as.ft_data() function
  x <- ft_collect(as.ft_data())
  names(x)
  x$cached
  pub_chunks(x, "title")
  pub_chunks(x, "title") %>% pub_tabularize()
}
}