Extract text from a file

extract_text(file, pages = NULL, area = NULL, password = NULL,
  encoding = NULL, copy = FALSE)

Arguments

file

A character string specifying the path or URL to a PDF file.

pages

An optional integer vector specifying pages to extract from.

area

An optional list, of length equal to the number of pages specified, where each entry is a four-element numeric vector of coordinates (top, left, bottom, right) delimiting the region to extract text from on the corresponding page. As a convenience, a list of length 1 can be used to extract the same area from all (specified) pages.
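
The shape of area described above can be sketched in base R. This is only an illustration: is_valid_area() is a hypothetical helper, not part of the package, and the second set of coordinates is made up.

```r
# Pages we intend to extract from (illustrative values).
pages <- c(1, 2)

# One area, reused for every specified page: a list of length 1.
one_area <- list(c(209.4, 140.5, 304.2, 500.8))

# One area per page: a list with the same length as `pages`.
# The second vector is a made-up example, in c(top, left, bottom, right) order.
per_page <- list(
  c(209.4, 140.5, 304.2, 500.8),
  c(100, 50, 300, 550)
)

# Hypothetical helper: checks only the documented shape of `area`,
# i.e. a list of length 1 or length(pages) holding 4-element numeric vectors.
is_valid_area <- function(area, pages) {
  is.list(area) &&
    length(area) %in% c(1L, length(pages)) &&
    all(vapply(area, function(a) is.numeric(a) && length(a) == 4L, logical(1)))
}

is_valid_area(one_area, pages)
is_valid_area(per_page, pages)
```

Either list could then be passed as area together with the matching pages argument.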

password

Optionally, a character string containing a user password to access a secured PDF.

encoding

Optionally, a character string specifying an encoding for the text, to be passed to the assignment method of Encoding.
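
As a rough illustration of what the assignment method of Encoding does, the base R sketch below marks a string as Latin-1. The sample string is not from a PDF; it is the one used in the base R documentation for Encoding, and the assumption is that extracted text would be marked in the same way.

```r
# A string containing a Latin-1 byte (0xE7, c-cedilla).
x <- "fa\xE7ile"

# Declare the encoding of the existing bytes; the bytes are not converted.
Encoding(x) <- "latin1"

# Inspect the declared encoding, then convert to UTF-8 if needed.
Encoding(x)
x_utf8 <- enc2utf8(x)
```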

copy

Specifies whether the original local file(s) should be copied to tempdir() before processing. FALSE by default. The argument is ignored if file is a URL.

Value

If pages = NULL (the default), a character vector of length 1; otherwise, a character vector of length length(pages).

Details

This function converts the contents of a PDF file into a single unstructured character string.
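
Because the result is unstructured, a common first step is to split it into lines. The sketch below uses base R only; txt is a shortened stand-in for a value returned by extract_text(), not actual output.

```r
# Shortened stand-in for a string returned by extract_text().
txt <- "To cite R in publications use:\nR Core Team (2018).\n"

# Split on newline characters; strsplit() returns a list, so take element 1.
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]

# Trim surrounding whitespace and drop any empty lines.
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

lines
```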

Author

Thomas J. Leeper

Examples

# \donttest{
# simple demo file
f <- system.file("examples", "text.pdf", package = "tabulizer")

# extract all text
extract_text(f)
#> [1] "To cite R in publications use:\nR Core Team (2018). R: A language and environment for statistical computing.\nR Foundation for Statistical Computing, Vienna, Austria. URL\nhttps://www.R-project.org/.\nA BibTeX entry for LaTeX users is\n@Manual{,\ntitle = {R: A Language and Environment for Statistical Computing},\nauthor = {{R Core Team}},\norganization = {R Foundation for Statistical Computing},\naddress = {Vienna, Austria},\nyear = {2018},\nurl = {https://www.R-project.org/},\n}\nWe have invested a lot of time and effort in creating R, please cite it when using\nit for data analysis. See also ‘citation(“pkgname”)’ for citing R packages.\nTo cite R in publications use:\nR Core Team (2018). R: A language and environment for statistical computing.\nR Foundation for Statistical Computing, Vienna, Austria. URL\nhttps://www.R-project.org/.\nA BibTeX entry for LaTeX users is\n@Manual{,\ntitle = {R: A Language and Environment for Statistical Computing},\nauthor = {{R Core Team}},\norganization = {R Foundation for Statistical Computing},\naddress = {Vienna, Austria},\nyear = {2018},\nurl = {https://www.R-project.org/},\n}\nWe have invested a lot of time and effort in creating R, please cite it when using\nit for data analysis. See also ‘citation(“pkgname”)’ for citing R packages.\n"

# extract all text from page 1 only
extract_text(f, pages = 1)
#> [1] "To cite R in publications use:\nR Core Team (2018). R: A language and environment for statistical computing.\nR Foundation for Statistical Computing, Vienna, Austria. URL\nhttps://www.R-project.org/.\nA BibTeX entry for LaTeX users is\n@Manual{,\ntitle = {R: A Language and Environment for Statistical Computing},\nauthor = {{R Core Team}},\norganization = {R Foundation for Statistical Computing},\naddress = {Vienna, Austria},\nyear = {2018},\nurl = {https://www.R-project.org/},\n}\nWe have invested a lot of time and effort in creating R, please cite it when using\nit for data analysis. See also ‘citation(“pkgname”)’ for citing R packages.\n"

# extract text from selected area only
extract_text(f, area = list(c(209.4, 140.5, 304.2, 500.8)))
#> [1] "@Manual{,\ntitle = {R: A Language and Environment for Statistical Computing},\nauthor = {{R Core Team}},\norganization = {R Foundation for Statistical Computing},\naddress = {Vienna, Austria},\nyear = {2018},\nurl = {https://www.R-project.org/},\n}\n"
#> [2] "@Manual{,\ntitle = {R: A Language and Environment for Statistical Computing},\nauthor = {{R Core Team}},\norganization = {R Foundation for Statistical Computing},\naddress = {Vienna, Austria},\nyear = {2018},\nurl = {https://www.R-project.org/},\n}\n"
# }