Gets the path to the Tika App .jar installed by tika_install().
Details
The tika_jar() function also checks whether the .jar is actually present on the file system. By default, all of the tika() functions use this file path.
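If you use the path outside of the tika() functions, it can help to confirm it is usable first. The following is a minimal sketch, not part of rtika: the helper name jar_is_usable() is hypothetical, and it assumes that a missing .jar is signalled by an NA return value, which should be checked against the Value section of this help page.

```r
# Hypothetical helper (not part of rtika): TRUE only when `jar` is a single,
# non-missing character path that exists on the file system.
jar_is_usable <- function(jar) {
  is.character(jar) && length(jar) == 1L && !is.na(jar) && file.exists(jar)
}

# Only query rtika when the package is available.
if (requireNamespace("rtika", quietly = TRUE)) {
  jar <- rtika::tika_jar()
  if (!jar_is_usable(jar)) {
    message("Tika App .jar not found; install it before calling Tika directly.")
  }
}
```

The guard treats NA, an empty result, and a path that no longer exists in the same way, so later code only needs one check.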
Alternative Uses
You can call Apache Tika directly, as shown in the examples here. It is better to use the sys package and avoid system2(), which has caused erratic, intermittent errors with Tika.
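When the output is needed as an R value rather than printed to the console, sys::exec_internal() can be used in place of sys::exec_wait(). The sketch below is one possible wrapper, not part of rtika: the function name run_tika() and its java argument are hypothetical, and it assumes a java executable is on the PATH.

```r
# Hypothetical helper (not part of rtika): run the Tika App with `args` and
# return its standard output as a character vector, one element per line.
run_tika <- function(jar, args, java = "java") {
  # error = FALSE lets us inspect a non-zero exit status ourselves.
  res <- sys::exec_internal(java, c("-jar", jar, args), error = FALSE)
  if (res$status != 0) {
    stop(
      "Tika exited with status ", res$status, ": ",
      paste(sys::as_text(res$stderr), collapse = "\n")
    )
  }
  # stdout is returned as a raw vector; convert it to lines of text.
  sys::as_text(res$stdout)
}
```

For example, run_tika(tika_jar(), c("--language", "https://tika.apache.org/")) would return the detected language as a string instead of printing it. Capturing stderr separately keeps Tika's log messages out of the returned text.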
Examples
# \donttest{
jar <- tika_jar()
# see help
sys::exec_wait('java', c('-jar', jar, '--help'))
#> usage: java -jar tika-app.jar [option...] [file...]
#>
#> Options:
#> -? or --help Print this usage message
#> -v or --verbose Print debug level messages
#> -V or --version Print the Apache Tika version number
#>
#> -g or --gui Start the Apache Tika GUI
#> -f or --fork Use Fork Mode for out-of-process extraction
#>
#> --config=<tika-config.xml>
#> TikaConfig file. Must be specified before -g, -s, -f or the dump-x-config !
#> --dump-minimal-config Print minimal TikaConfig
#> --dump-current-config Print current TikaConfig
#> --dump-static-config Print static config
#> --dump-static-full-config Print static explicit config
#>
#> -x or --xml Output XHTML content (default)
#> -h or --html Output HTML content
#> -t or --text Output plain text content (body)
#> -T or --text-main Output plain text content (main content only via boilerpipe handler)
#> -A or --text-all Output all text content
#> -m or --metadata Output only metadata
#> -j or --json Output metadata in JSON
#> -y or --xmp Output metadata in XMP
#> -J or --jsonRecursive Output metadata and content from all
#> embedded files (choose content type
#> with -x, -h, -t or -m; default is -x)
#> -a or --async Run Tika in async mode; must specify details in a tikaConfig file
#> -l or --language Output only language
#> -d or --detect Detect document type
#> --digest=X Include digest X (md2, md5, sha1,
#> sha256, sha384, sha512
#> -eX or --encoding=X Use output encoding X
#> -pX or --password=X Use document password X
#> -z or --extract Extract all attachements into current directory
#> --extract-dir=<dir> Specify target directory for -z
#> -r or --pretty-print For JSON, XML and XHTML outputs, adds newlines and
#> whitespace, for better readability
#>
#> --list-parsers
#> List the available document parsers
#> --list-parser-details
#> List the available document parsers and their supported mime types
#> --list-parser-details-apt
#> List the available document parsers and their supported mime types in apt format.
#> --list-detectors
#> List the available document detectors
#> --list-met-models
#> List the available metadata models, and their supported keys
#> --list-supported-types
#> List all known media types and related information
#>
#>
#> --compare-file-magic=<dir>
#> Compares Tika's known media types to the File(1) tool's magic directory
#> Description:
#> Apache Tika will parse the file(s) specified on the
#> command line and output the extracted text content
#> or metadata to standard output.
#>
#> Instead of a file name you can also specify the URL
#> of a document to be parsed.
#>
#> If no file name or URL is specified (or the special
#> name "-" is used), then the standard input stream
#> is parsed. If no arguments were given and no input
#> data is available, the GUI is started instead.
#>
#> - GUI mode
#>
#> Use the "--gui" (or "-g") option to start the
#> Apache Tika GUI. You can drag and drop files from
#> a normal file explorer to the GUI window to extract
#> text content and metadata from the files.
#>
#> - Batch mode
#>
#> Simplest method.
#> Specify two directories as args with no other args:
#> java -jar tika-app.jar <inputDirectory> <outputDirectory>
#>
#> Batch Options:
#> -i or --inputDir Input directory
#> -o or --outputDir Output directory
#> -numConsumers Number of processing threads
#> -bc Batch config file
#> -maxRestarts Maximum number of times the
#> watchdog process will restart the forked process.
#> -timeoutThresholdMillis Number of milliseconds allowed to a parse
#> before the process is terminated and restarted
#> -fileList List of files to process, with
#> paths relative to the input directory
#> -includeFilePat Regular expression to determine which
#> files to process, e.g. "(?i)\.pdf"
#> -excludeFilePat Regular expression to determine which
#> files to avoid processing, e.g. "(?i)\.pdf"
#> -maxFileSizeBytes Skip files longer than this value
#>
#> Control the type of output with -x, -h, -t and/or -J.
#>
#> To modify forked process jvm args, prepend "J" as in:
#> -JXmx4g or -JDlog4j.configuration=file:log4j.xml.
#> [1] 0
# detect language of web page
sys::exec_wait('java', c('-jar', jar, '--language', 'https://tika.apache.org/'))
#> en
#> [1] 0
# }