A minhash value is calculated by hashing the strings in a character vector to
integers and then selecting the minimum value. Repeated minhash values are
generated by using different hash functions: these different hash functions
are created by performing a bitwise XOR operation (bitwXor) with a vector of
random integers. Since it is vital
that the same random integers be used for each document, this function
generates another function which will always use the same integers. The
returned function is intended to be passed to the hash_func parameter of
TextReuseTextDocument.
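The closure pattern described above can be sketched in base R. This is an illustrative sketch only, not the package's implementation: toy_hash and toy_minhash_generator are hypothetical names, and the polynomial string hash is an assumed stand-in for whatever hash the package uses internally, so the values will not match those returned by minhash_generator.

```r
# Hypothetical stand-in for the package's string hash: a simple polynomial
# hash over code points, kept inside the non-negative 32-bit integer range.
toy_hash <- function(x) {
  vapply(x, function(s) {
    h <- 7
    for (code in utf8ToInt(s)) h <- (h * 31 + code) %% 2147483647
    as.integer(h)
  }, integer(1), USE.NAMES = FALSE)
}

# Hypothetical generator: draws n random integers once and returns a
# function that reuses them on every call.
toy_minhash_generator <- function(n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Captured by the closure, so every document is hashed with the same masks.
  masks <- as.integer(sample.int(.Machine$integer.max, n))
  function(x) {
    stopifnot(is.character(x), length(x) > 0)
    h <- toy_hash(x)
    # XOR-ing with each mask acts as a different hash function; the minimum
    # over all tokens is one minhash value per mask.
    vapply(masks, function(m) min(bitwXor(h, m)), integer(1))
  }
}

mh <- toy_minhash_generator(5, seed = 253)
tokens_a <- c("the", "quick", "brown", "fox")
identical(mh(tokens_a), mh(tokens_a))
```

Because the random integers are drawn when the generator is called rather than when the returned function is called, two documents hashed with the same mh are compared under identical hash functions. Calling the generator a second time without a seed would draw a different set of integers.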
References
Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3. See also Matthew Casperson, "Minhash for Dummies" (November 14, 2013).
Examples
set.seed(253)
minhash <- minhash_generator(10)
# Example with a TextReuseTextDocument
file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = file, hash_func = minhash,
                             keep_tokens = TRUE)
hashes(doc)
#> [1] -2132446047 -2134404886 -2138686164 -2143119093 -2140599954 -2145733916
#> [7] -2136140472 -2140442115 -2145758614 -2145359786
# Example with a character vector
is.character(tokens(doc))
#> [1] TRUE
minhash(tokens(doc))
#> [1] -2132446047 -2134404886 -2138686164 -2143119093 -2140599954 -2145733916
#> [7] -2136140472 -2140442115 -2145758614 -2145359786