Utilities based on libpoppler for extracting text, fonts, attachments and metadata from a pdf file.
pdf_info(pdf, opw = "", upw = "")
pdf_text(pdf, opw = "", upw = "")
pdf_data(pdf, font_info = FALSE, opw = "", upw = "")
pdf_fonts(pdf, opw = "", upw = "")
pdf_attachments(pdf, opw = "", upw = "")
pdf_toc(pdf, opw = "", upw = "")
pdf_pagesize(pdf, opw = "", upw = "")
pdf: file path or raw vector with pdf data
opw: string with owner password to open pdf
upw: string with user password to open pdf
font_info: if TRUE, extract font data for each box. Be careful: this requires a very recent version of poppler and will error otherwise.
The pdf_text function renders all textboxes on a text canvas
and returns a character vector with one element per page of the
PDF file. pdf_data, on the other hand, is more low level: it
returns one data frame per page, containing one row for each textbox in the PDF.
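The difference between the two return shapes can be seen in a minimal sketch like the one below. It assumes pdftools is installed; the two-page sample PDF is generated with the base pdf() device purely for illustration, and the variable names are hypothetical.

```r
library(pdftools)

# Build a small two-page PDF to read back (illustrative sample content).
path <- tempfile(fileext = ".pdf")
grDevices::pdf(path)
plot.new(); text(0.5, 0.5, "Hello first page")
plot.new(); text(0.5, 0.5, "Hello second page")
invisible(grDevices::dev.off())

# pdf_text: one character string per page.
txt <- pdf_text(path)
length(txt)

# pdf_data: one data frame per page, one row per textbox.
# Guarded because pdf_data needs a recent libpoppler.
if (isTRUE(poppler_config()$has_pdf_data)) {
  boxes <- pdf_data(path)
  str(boxes[[1]])
}
```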
Note that pdf_data requires a recent version of libpoppler,
which might not be available on all Linux systems.
When using pdf_data in R packages, condition its use on
poppler_config()$has_pdf_data, which shows whether this function can be
used on the current system. For Ubuntu 16.04 (Xenial) and 18.04 (Bionic)
you can use the PPA
with backports of Poppler 0.74.0.
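One way a package might condition its use of pdf_data is sketched below. The helper name read_pdf_boxes and the fallback to pdf_text are assumptions made for illustration, not part of pdftools.

```r
library(pdftools)

# Hypothetical helper: return per-page textbox data frames when the
# installed poppler supports pdf_data, otherwise fall back to the
# plain per-page text from pdf_text.
read_pdf_boxes <- function(path) {
  if (isTRUE(poppler_config()$has_pdf_data)) {
    pdf_data(path)
  } else {
    warning("pdf_data is not available with this poppler build; using pdf_text")
    pdf_text(path)
  }
}
```

Callers of such a helper would need to handle both return types: a list of data frames or a character vector.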
Poppler is pretty verbose when encountering minor errors in PDF files,
especially in pdf_text. These messages are usually safe
to ignore; use
suppressMessages to hide them altogether.
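A minimal sketch of silencing those messages, assuming pdftools is installed. The sample PDF is generated here only so the example is self-contained; a well-formed file like this one may not trigger any messages in the first place.

```r
library(pdftools)

# Create a one-page sample PDF (illustrative only).
path <- tempfile(fileext = ".pdf")
grDevices::pdf(path)
plot.new(); text(0.5, 0.5, "Sample text")
invisible(grDevices::dev.off())

# Wrap the call so any messages raised during extraction are hidden.
quiet_txt <- suppressMessages(pdf_text(path))
```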