A minhash value is calculated by hashing the strings in a character vector to integers and then selecting the minimum value. Repeated minhash values are generated by using different hash functions: these different hash functions are created by performing a bitwise XOR operation (bitwXor) between the hashed integers and each element of a vector of random integers. Since it is vital that the same random integers be used for each document, this function generates another function which will always use the same integers. The returned function is intended to be passed to the hash_func parameter of TextReuseTextDocument.
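The procedure described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `toy_hash()` and `sketch_minhash_generator()` are hypothetical names, and the polynomial string hash stands in for whatever string hash the package actually uses.

```r
# Hypothetical stand-in for a string hash: maps each string to a
# non-negative integer with a simple polynomial rolling hash.
toy_hash <- function(s) {
  vapply(s, function(str) {
    h <- 0
    for (code in utf8ToInt(str)) {
      h <- (h * 31 + code) %% 2147483647
    }
    as.integer(h)
  }, integer(1), USE.NAMES = FALSE)
}

# Sketch of a generator: the random integers are drawn once and captured
# in the closure, so every call to the returned function reuses them.
sketch_minhash_generator <- function(n = 200, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  masks <- sample.int(.Machine$integer.max, n)
  function(x) {
    stopifnot(is.character(x))
    h <- toy_hash(x)
    # One minhash per random integer: XOR every hash with it, keep the minimum.
    vapply(masks, function(m) min(bitwXor(h, m)), integer(1))
  }
}

f <- sketch_minhash_generator(5, seed = 1)
g <- sketch_minhash_generator(5, seed = 1)
x <- c("a b", "b c", "c d")
identical(f(x), g(x))
```

Because the random integers live in the closure, two generators built with the same seed produce the same minhashes for the same input, which is the property the description above relies on.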

Usage

minhash_generator(n = 200, seed = NULL)

Arguments

n

The number of minhashes that the returned function should generate.

seed

An optional parameter to set the seed used in generating the random numbers, ensuring that the same minhash function is used on repeated applications.

Value

A function which will take a character vector and return n minhashes.

References

Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3. See also Matthew Casperson, "Minhash for Dummies" (November 14, 2013).

Examples

set.seed(253)
minhash <- minhash_generator(10)

# Example with a TextReuseTextDocument
file <- system.file("extdata/legal/ny1850-match.txt", package = "textreuse")
doc <- TextReuseTextDocument(file = file, hash_func = minhash,
                             keep_tokens = TRUE)
hashes(doc)
#>  [1] -2132446047 -2134404886 -2138686164 -2143119093 -2140599954 -2145733916
#>  [7] -2136140472 -2140442115 -2145758614 -2145359786

# Example with a character vector
is.character(tokens(doc))
#> [1] TRUE
minhash(tokens(doc))
#>  [1] -2132446047 -2134404886 -2138686164 -2143119093 -2140599954 -2145733916
#>  [7] -2136140472 -2140442115 -2145758614 -2145359786
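Minhashes are typically compared position by position: the share of positions at which two documents have the same minhash gives an estimate of the Jaccard similarity of their token sets (see Leskovec, Rajaraman, and Ullman, ch. 3). A minimal sketch of that comparison follows; `signature_similarity()` is a hypothetical helper, not a function of the package, and the two signatures are made-up values chosen only for illustration.

```r
# Hypothetical helper: fraction of positions where two minhash
# signatures of equal length agree.
signature_similarity <- function(a, b) {
  stopifnot(length(a) == length(b))
  mean(a == b)
}

# Made-up signatures for illustration only.
sig_a <- c(3L, 7L, 1L, 9L)
sig_b <- c(3L, 2L, 1L, 9L)
signature_similarity(sig_a, sig_b)
#> [1] 0.75
```

The comparison is only meaningful when both signatures were produced by the same generated function, since each position must correspond to the same random integer.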