Extract text from an image. Requires that you have training data for the language you are reading. Works best for images with high contrast, little noise and horizontal text.
Roundtrip test: render PDF to image and OCR it back to text
# Download the PDF and extract its embedded text for reference
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")

# Render the first page of the PDF to a TIFF image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)

# Extract text from the TIFF image
text <- tesseract::ocr(img_file)
unlink(img_file)
cat(text)
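The roundtrip example prints the OCR output but never compares it with the text extracted directly from the PDF. A minimal sketch of such a comparison is shown here, assuming the `orig` and `text` objects from that example. `words()` and `word_recall()` are hypothetical helpers written for illustration, not part of the package, and word overlap is only a rough quality measure.

```r
# Hypothetical helper: split text into lowercase alphanumeric words.
words <- function(x) {
  w <- unlist(strsplit(tolower(x), "[^[:alnum:]]+"))
  w[nzchar(w)]
}

# Hypothetical helper: fraction of distinct reference words that also
# appear somewhere in the OCR output.
word_recall <- function(reference, ocr_text) {
  ref <- unique(words(reference))
  if (length(ref) == 0) return(NA_real_)
  mean(ref %in% words(ocr_text))
}

# Toy example: one of four words was misread.
word_recall("The quick brown fox", "The qu1ck brown fox")  # 0.75

# With the roundtrip objects (first page only):
# word_recall(orig[1], text)
```

A value close to 1 suggests the OCR recovered most words on the page. Lower values may point to noise, low contrast, or missing training data for the language.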
On Windows and macOS the binary package can be installed from CRAN:
Installation from source on Linux or macOS requires the Tesseract library (see below).
On Debian or Ubuntu:
sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng
On Ubuntu Xenial and Ubuntu Bionic you can use this PPA to get the latest version of Tesseract:
sudo add-apt-repository ppa:cran/tesseract
sudo apt-get install -y libtesseract-dev tesseract-ocr-eng
On Fedora:
sudo yum install tesseract-devel leptonica-devel
On RHEL and CentOS, enable EPEL first:
sudo yum install epel-release
sudo yum install tesseract-devel leptonica-devel
On macOS use Tesseract from Homebrew:
brew install tesseract
Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR results for other languages you can install the appropriate training data. On Windows and macOS you can do this from within R.
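A minimal sketch of installing training data from R, assuming the package's `tesseract_download()`, `tesseract_info()` and `tesseract()` functions. `ensure_language()` is a hypothetical wrapper written for illustration, and `"fra"` and `"scan.png"` are placeholder values.

```r
# Hypothetical helper (not part of the package): return an OCR engine for
# `lang`, downloading the training data first if it is not yet available.
ensure_language <- function(lang) {
  installed <- tesseract::tesseract_info()$available
  if (!lang %in% installed) {
    tesseract::tesseract_download(lang)
  }
  tesseract::tesseract(language = lang)
}

# Usage (needs the tesseract package and, on first use, network access):
# french <- ensure_language("fra")
# text <- tesseract::ocr("scan.png", engine = french)  # placeholder file name
```

Checking the list of available languages first avoids downloading the same file again on every run.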
On Linux you need to install the appropriate training data from your distribution. For example, to install the Spanish training data:
Alternatively you can manually download training data from GitHub and store it in a path on disk that you pass in the datapath parameter, or set a default path via the TESSDATA_PREFIX environment variable. Note that Tesseract 4 and Tesseract 3 use different training data formats. Make sure to download training data from the branch that matches your libtesseract version.
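A minimal sketch of using manually downloaded training data, assuming the files follow the usual `<lang>.traineddata` naming. `engine_from_dir()` is a hypothetical helper written for illustration, and the directory `~/tessdata` is a placeholder.

```r
# Hypothetical helper (not part of the package): create an engine from a
# directory of manually downloaded training data, failing early with a
# clear message if the expected file is missing.
engine_from_dir <- function(lang, dir) {
  data_file <- file.path(dir, paste0(lang, ".traineddata"))
  if (!file.exists(data_file)) {
    stop("No training data found at ", data_file)
  }
  tesseract::tesseract(language = lang, datapath = dir)
}

# Usage (paths are placeholders):
# tesseract::tesseract_info()$version   # check the libtesseract version first
# spa <- engine_from_dir("spa", "~/tessdata")
#
# Or set a default path instead of passing datapath each time
# (assumption: set it before the package is loaded):
# Sys.setenv(TESSDATA_PREFIX = path.expand("~/tessdata"))
```

Checking the version before downloading helps pick training data in the format that the installed libtesseract can read.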