
Gets the path to the Tika App .jar installed by install_tika().

Usage

tika_jar()

Value

A character string giving the file path to the Tika App .jar file, or NA if the file is not found.

Details

tika_jar() also checks that the .jar actually exists on the file system.

The file path is used by all of the tika() functions by default.
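
Because the return value can be NA, code that passes the path on to other tools may want to check it first. The following is a minimal sketch of such a check; resolve_tika_jar() is a hypothetical helper name used for illustration and is not part of rtika.

```r
# Hypothetical helper (not part of rtika): fail early with a clear message
# when the path is NA or the file is not on disk.
resolve_tika_jar <- function(jar) {
  if (length(jar) != 1 || is.na(jar) || !file.exists(jar)) {
    stop("Tika App .jar not found; install it before calling Tika.",
         call. = FALSE)
  }
  normalizePath(jar)
}

# Only runs when rtika is available; falls back to NA with a message.
if (requireNamespace("rtika", quietly = TRUE)) {
  jar <- tryCatch(
    resolve_tika_jar(rtika::tika_jar()),
    error = function(e) {
      message(conditionMessage(e))
      NA_character_
    }
  )
}
```

The NA test comes before file.exists() so that a missing path produces the helper's own error message rather than an error from the file system call.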

Alternative Uses

You can call Apache Tika directly, as shown in the examples here.

It is better to use the sys package and avoid system2(), which has caused erratic, intermittent errors with Tika.
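
When the output is needed as an R value rather than printed to the console, sys::exec_internal() can be used in place of sys::exec_wait(). The sketch below assumes that java is on the PATH and that the sys package is installed; run_tika() is a hypothetical wrapper written for this illustration, not an rtika function.

```r
# Hypothetical wrapper (not part of rtika): run the Tika App with the given
# arguments and return its standard output as a single character string.
run_tika <- function(jar, args, java = "java") {
  stopifnot(is.character(jar), length(jar) == 1, !is.na(jar))
  # error = FALSE so that a non-zero exit status can be reported below
  res <- sys::exec_internal(java, c("-jar", jar, args), error = FALSE)
  if (res$status != 0) {
    stop("Tika exited with status ", res$status, ": ",
         rawToChar(res$stderr), call. = FALSE)
  }
  rawToChar(res$stdout)
}

# Only runs when rtika, sys, java and the .jar are all available.
if (requireNamespace("rtika", quietly = TRUE) &&
    requireNamespace("sys", quietly = TRUE) &&
    nzchar(Sys.which("java"))) {
  jar <- rtika::tika_jar()
  if (!is.na(jar)) cat(run_tika(jar, "--version"))
}
```

Returning the captured text makes it possible to parse or test the result, while the status check surfaces Tika's own error output when a call fails.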

Examples

# \donttest{
jar <- tika_jar()
# print the Tika command-line help
sys::exec_wait("java", c("-jar", jar, "--help"))
#> usage: java -jar tika-app.jar [option...] [file...]
#> 
#> Options:
#>     -?  or --help          Print this usage message
#>     -v  or --verbose       Print debug level messages
#>     -V  or --version       Print the Apache Tika version number
#> 
#>     -g  or --gui           Start the Apache Tika GUI
#>     -f  or --fork          Use Fork Mode for out-of-process extraction
#> 
#>     --config=<tika-config.xml>
#>         TikaConfig file. Must be specified before -g, -s, -f or the dump-x-config !
#>     --dump-minimal-config  Print minimal TikaConfig
#>     --dump-current-config  Print current TikaConfig
#>     --dump-static-config   Print static config
#>     --dump-static-full-config  Print static explicit config
#> 
#>     -x  or --xml           Output XHTML content (default)
#>     -h  or --html          Output HTML content
#>     -t  or --text          Output plain text content (body)
#>     -T  or --text-main     Output plain text content (main content only via boilerpipe handler)
#>     -A  or --text-all      Output all text content
#>     -m  or --metadata      Output only metadata
#>     -j  or --json          Output metadata in JSON
#>     -y  or --xmp           Output metadata in XMP
#>     -J  or --jsonRecursive Output metadata and content from all
#>                            embedded files (choose content type
#>                            with -x, -h, -t or -m; default is -x)
#>     -a  or --async         Run Tika in async mode; must specify details in a tikaConfig file
#>     -l  or --language      Output only language
#>     -d  or --detect        Detect document type
#>            --digest=X      Include digest X (md2, md5, sha1,
#>                                sha256, sha384, sha512
#>     -eX or --encoding=X    Use output encoding X
#>     -pX or --password=X    Use document password X
#>     -z  or --extract       Extract all attachements into current directory
#>     --extract-dir=<dir>    Specify target directory for -z
#>     -r  or --pretty-print  For JSON, XML and XHTML outputs, adds newlines and
#>                            whitespace, for better readability
#> 
#>     --list-parsers
#>          List the available document parsers
#>     --list-parser-details
#>          List the available document parsers and their supported mime types
#>     --list-parser-details-apt
#>          List the available document parsers and their supported mime types in apt format.
#>     --list-detectors
#>          List the available document detectors
#>     --list-met-models
#>          List the available metadata models, and their supported keys
#>     --list-supported-types
#>          List all known media types and related information
#> 
#> 
#>     --compare-file-magic=<dir>
#>          Compares Tika's known media types to the File(1) tool's magic directory
#> Description:
#>     Apache Tika will parse the file(s) specified on the
#>     command line and output the extracted text content
#>     or metadata to standard output.
#> 
#>     Instead of a file name you can also specify the URL
#>     of a document to be parsed.
#> 
#>     If no file name or URL is specified (or the special
#>     name "-" is used), then the standard input stream
#>     is parsed. If no arguments were given and no input
#>     data is available, the GUI is started instead.
#> 
#> - GUI mode
#> 
#>     Use the "--gui" (or "-g") option to start the
#>     Apache Tika GUI. You can drag and drop files from
#>     a normal file explorer to the GUI window to extract
#>     text content and metadata from the files.
#> 
#> - Batch mode
#> 
#>     Simplest method.
#>     Specify two directories as args with no other args:
#>          java -jar tika-app.jar <inputDirectory> <outputDirectory>
#> 
#> Batch Options:
#>     -i  or --inputDir          Input directory
#>     -o  or --outputDir         Output directory
#>     -numConsumers              Number of processing threads
#>     -bc                        Batch config file
#>     -maxRestarts               Maximum number of times the 
#>                                watchdog process will restart the forked process.
#>     -timeoutThresholdMillis    Number of milliseconds allowed to a parse
#>                                before the process is terminated and restarted
#>     -fileList                  List of files to process, with
#>                                paths relative to the input directory
#>     -includeFilePat            Regular expression to determine which
#>                                files to process, e.g. "(?i)\.pdf"
#>     -excludeFilePat            Regular expression to determine which
#>                                files to avoid processing, e.g. "(?i)\.pdf"
#>     -maxFileSizeBytes          Skip files longer than this value
#> 
#>     Control the type of output with -x, -h, -t and/or -J.
#> 
#>     To modify forked process jvm args, prepend "J" as in:
#>     -JXmx4g or -JDlog4j.configuration=file:log4j.xml.
#> [1] 0
# detect the language of a web page
sys::exec_wait("java", c("-jar", jar, "--language", "https://tika.apache.org/"))
#> en
#> [1] 0
# }