Recompute the tokens for a document or corpus

Given a TextReuseTextDocument or a TextReuseCorpus, this function recomputes the tokens and hashes using the supplied tokenizer and hash function. If a minhash function is also supplied, it recomputes the minhash signatures as well.

Usage

tokenize(
  x,
  tokenizer,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE
)

Arguments

x: A TextReuseTextDocument or TextReuseCorpus.
tokenizer: A function to split the text into tokens. See tokenizers.
...: Arguments passed on to the tokenizer.
hash_func: A function to hash the tokens. See hash_string.
minhash_func: A function to create minhash signatures. See minhash_generator.
keep_tokens: Whether to keep the tokens in the returned document (TRUE) or discard them (FALSE).
keep_text: Whether to keep the text in the returned document (TRUE) or discard it (FALSE).

Value

The modified TextReuseTextDocument or TextReuseCorpus.

Examples

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)
corpus <- tokenize(corpus, tokenize_ngrams)
head(tokens(corpus[[1]]))
#> [1] "4 every action"      "every action shall"  "action shall be"    
#> [4] "shall be prosecuted" "be prosecuted in"    "prosecuted in the"
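
The following is a minimal sketch of the optional arguments rather than a canonical usage pattern. It assumes that tokenize_ngrams accepts an n argument (forwarded through ...) and that minhash_generator(n, seed) returns a function suitable for minhash_func; the values n = 5, n = 50, and seed = 1 are arbitrary choices for illustration.

```r
library(textreuse)

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)

# Build a minhash function; 50 hashes and seed = 1 are illustrative values only
minhash <- minhash_generator(n = 50, seed = 1)

# `n = 5` is not an argument of tokenize(); it is passed on to tokenize_ngrams
corpus <- tokenize(corpus, tokenize_ngrams, n = 5,
                   minhash_func = minhash, keep_tokens = TRUE)

doc <- corpus[[1]]
head(tokens(doc), 3)      # five-word shingles from the first document
length(minhashes(doc))    # one value per hash in the minhash function
```

Setting keep_tokens = TRUE here makes it explicit that the tokens should remain available for inspection with tokens() after the call.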
