Tika can parse and extract text from almost anything, including zip, tar, tar.bz2, and other archives that contain documents. If you have a zip file with 100 text files in it, you can get the text and metadata for each file nested inside the zip file. This recursive output is currently used for the JSON output mode. See: https://wiki.apache.org/tika/RecursiveMetadata
The document content is XHTML in the "X-TIKA:content" field.
If output_dir is specified, files will have the .json file extension.
Arguments
- input
Character vector describing the paths and/or URLs to the input documents.
- ...
Other parameters to be sent to tika().
Value
A character vector of unparsed JSON, in the same order and with the same length as input. Unprocessed files are as.character(NA).
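Because the return value is unparsed JSON, a typical next step is to parse each element and read the "X-TIKA:content" field. The following is a minimal sketch, not part of rtika itself. It assumes the jsonlite package is installed and that each element is a JSON array with one metadata object per nested document. The json vector here is a hand-written, hypothetical stand-in for real tika_json() output, so the example runs without Tika.

```r
# Hypothetical stand-in for tika_json() output: one JSON string per input,
# and NA for an input that could not be processed. Field values are invented.
json <- c(
  paste0(
    '[{"Content-Type":"application/zip",',
    '"X-TIKA:content":"<html><body></body></html>"},',
    '{"Content-Type":"text/plain",',
    '"X-TIKA:content":"<html><body><p>hello</p></body></html>"}]'
  ),
  NA_character_
)

# Parse each element into a list of metadata records.
# Unprocessed inputs (NA) are kept as NULL so positions still match `input`.
parsed <- lapply(json, function(x) {
  if (is.na(x)) {
    return(NULL)
  }
  jsonlite::fromJSON(x, simplifyVector = FALSE)
})

# Extract the XHTML content of every document found in the first input,
# including documents nested inside the archive.
content <- vapply(
  parsed[[1]],
  function(doc) doc[["X-TIKA:content"]],
  character(1)
)
```

Passing simplifyVector = FALSE keeps each document as a plain named list, which avoids surprises when nested documents carry different sets of metadata fields.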
Examples
# \donttest{
# Sample documents shipped with rtika, including a zip archive
# whose nested files are parsed recursively.
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "table.docx", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika"),
  system.file("extdata", "R-FAQ.html", package = "rtika"),
  system.file("extdata", "calculator.jpg", package = "rtika"),
  system.file("extdata", "tika.apache.org.zip", package = "rtika")
)
# One unparsed JSON string per input, in the same order as batch.
json <- tika_json(batch)
# }