Extract text or metadata from over a thousand file types. Get either plain text or structured XHTML content.
Getting Started
The tika_text function extracts plain text from many types of documents and is a good place to start. The package vignette gives a fuller introduction.
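As a minimal sketch of the starting point described above, the following writes a small sample file and extracts its text. The sample file and its contents are hypothetical, and the sketch assumes that Java is available and that the Tika jar has already been installed (for example with install_tika()).

```r
library(rtika)

# Assumption: Java is installed and the Tika jar has been downloaded,
# for example by running install_tika() once beforehand.

# Write a small HTML file to a temporary location (hypothetical sample input).
path <- tempfile(fileext = ".html")
writeLines("<html><body><p>Hello from Tika.</p></body></html>", path)

# tika_text takes a character vector of file paths and returns
# a character vector with one element per input.
text <- tika_text(path)
cat(text)
```

Because the input is a character vector, several paths can be passed in one call, and the results come back in the same order.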
Other main functions include tika_xml and tika_html, which return a structured XHTML rendition. The tika_json function returns metadata as `.json`, with XHTML content.
The tika_json_text function returns metadata as `.json`, with plain text content.
tika is the main function; the others above inherit from it.
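The structured outputs described above might be used as in the sketch below. The sample file is hypothetical, and parsing the result with jsonlite::fromJSON is an assumption about how the JSON string could be consumed; it requires the jsonlite package, Java, and an installed Tika jar.

```r
library(rtika)

# Hypothetical sample input written to a temporary file.
path <- tempfile(fileext = ".html")
writeLines(
  "<html><head><title>Sample</title></head><body><p>Hello.</p></body></html>",
  path
)

# Structured XHTML rendition of the document, returned as a string.
xhtml <- tika_xml(path)

# Metadata as a JSON string, with XHTML content included.
json <- tika_json(path)

# Assumption: jsonlite is installed. Parse the JSON string into an R object
# and list the fields Tika reported for this file.
metadata <- jsonlite::fromJSON(json[1])
names(metadata)
```

Swapping tika_json for tika_json_text in this sketch would give the same kind of metadata with plain text content instead of XHTML.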
Use tika_fetch to download files with a file extension matching the Content-Type.
Author
Maintainer: Sasha Goodman goodmansasha@gmail.com
Authors:
The Apache Software Foundation [copyright holder]
Other contributors:
Julia Silge (Reviewed the package for rOpenSci, see https://github.com/ropensci/software-review/issues/191/) [reviewer]
David Gohel (Reviewed the package for rOpenSci, see https://github.com/ropensci/software-review/issues/191/) [reviewer]
