Extract text from a file
Usage
extract_text(
file,
pages = NULL,
area = NULL,
password = NULL,
encoding = NULL,
copy = FALSE
)
Arguments
- file
A character string specifying the path or URL to a PDF file.
- pages
An optional integer vector specifying pages to extract from.
- area
An optional list, of length equal to the number of pages specified, where each entry is a four-element numeric vector of coordinates (top, left, bottom, right) delimiting the region to extract on the corresponding page. As a convenience, a list of length 1 can be used to extract the same area from all (specified) pages.
- password
Optionally, a character string containing a user password to access a secured PDF.
- encoding
Optionally, a character string specifying an encoding for the text, to be passed to the assignment method of Encoding.
- copy
Specifies whether the original local file(s) should be copied to tempdir() before processing. FALSE by default. The argument is ignored if file is a URL.
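The shape expected for area can be easier to see in code. The following is a minimal sketch, not part of the package: the page numbers and the second set of coordinates are made up for illustration, and area_ok is a hypothetical helper variable. It only builds and checks the list; the extract_text() call is left as a comment because it needs a PDF with at least as many pages as requested.

```r
# Illustrative page selection (assumption: the PDF has at least two pages).
pages <- c(1, 2)

# One four-element numeric vector per page: c(top, left, bottom, right).
# The second set of coordinates is invented for this sketch.
area <- list(
  c(209.4, 140.5, 304.2, 500.8),  # region on page 1
  c(100, 50, 400, 550)            # region on page 2
)

# area must have one entry per page, or a single entry reused for all pages.
area_ok <- (length(area) == length(pages) || length(area) == 1) &&
  all(vapply(area, function(a) is.numeric(a) && length(a) == 4, logical(1)))

area_ok
#> [1] TRUE

# With a suitable file f, the call would then look like:
# extract_text(f, pages = pages, area = area)
```

Passing list(c(209.4, 140.5, 304.2, 500.8)) alone would apply that one region to every selected page.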
Value
If pages = NULL (the default), a length 1 character vector; otherwise, a vector of length length(pages).
Details
This function converts the contents of a PDF file into a single unstructured character string.
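Because the result is an unstructured string, line structure usually has to be recovered afterwards. The sketch below uses only base R and a literal copy of the text returned for the demo file; the variable names txt and lines are illustrative, and splitting on "\n" is one possible approach rather than something the package prescribes.

```r
# Literal copy of the string returned for the demo file (two lines of text).
txt <- paste0(
  "42 is the number from which the meaning of life, the universe, ",
  "and everything can be derived.\n",
  "42 is the number from which the meaning of life, the universe, ",
  "and everything can be derived.\n"
)

# Split on newlines; strsplit() does not emit an empty element for the
# trailing "\n", so two lines come back.
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]

length(lines)
#> [1] 2
```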
Examples
# simple demo file
f <- system.file("examples", "fortytwo.pdf", package = "tabulapdf")
# extract all text
extract_text(f)
#> [1] "42 is the number from which the meaning of life, the universe, and everything can be derived.\n42 is the number from which the meaning of life, the universe, and everything can be derived.\n"
# extract all text from page 1 only
extract_text(f, pages = 1)
#> [1] "42 is the number from which the meaning of life, the universe, and everything can be derived.\n"
# extract text from selected area only
extract_text(f, area = list(c(209.4, 140.5, 304.2, 500.8)))
#> [1] "\n" "\n"