Given a TextReuseTextDocument or a TextReuseCorpus, this function recomputes the tokens and hashes with the functions specified. Optionally, it can also recompute the minhash signatures.
Usage
tokenize(
  x,
  tokenizer,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE
)
Arguments
- x
  A TextReuseTextDocument or a TextReuseCorpus.
- tokenizer
  A function to split the text into tokens. See tokenizers.
- ...
  Arguments passed on to the tokenizer.
- hash_func
  A function to hash the tokens. See hash_string.
- minhash_func
  A function to create minhash signatures. See minhash_generator.
- keep_tokens
  Should the tokens be saved in the document that is returned or discarded?
- keep_text
  Should the text be saved in the document that is returned or discarded?
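Arguments supplied through ... are forwarded to the tokenizer, so options such as the n-gram size can be set directly in the call to tokenize(). A minimal sketch, assuming the ca1851-match.txt sample file that ships in the package's extdata/legal directory:

library(textreuse)

# Create a document without tokenizing, then retokenize with 5-grams,
# keeping the tokens so they can be inspected alongside the hashes.
path <- system.file("extdata/legal/ca1851-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = path, tokenizer = NULL)
doc <- tokenize(doc, tokenize_ngrams, n = 5, keep_tokens = TRUE)
head(tokens(doc))
head(hashes(doc))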
Value
The modified TextReuseTextDocument or TextReuseCorpus.
Examples
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)
corpus <- tokenize(corpus, tokenize_ngrams)
head(tokens(corpus[[1]]))
#> [1] "4 every action" "every action shall" "action shall be"
#> [4] "shall be prosecuted" "be prosecuted in" "prosecuted in the"
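To also recompute the minhash signatures mentioned in the description, pass a minhash function. The sketch below continues from the corpus created above; the number of hashes (240) and the seed given to minhash_generator() are illustrative choices, not defaults:

# Recompute tokens, hashes, and minhash signatures in one call.
minhash <- minhash_generator(n = 240, seed = 253)
corpus <- tokenize(corpus, tokenize_ngrams, minhash_func = minhash)
head(minhashes(corpus[[1]]))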