Skip to contents

Locality sensitive hashing (LSH) discovers potential matches among a corpus of documents quickly, so that only likely pairs can be compared.

Usage

lsh(x, bands, progress = interactive())

Arguments

x

A TextReuseCorpus or TextReuseTextDocument.

bands

The number of bands to use for locality sensitive hashing. The number of hashes in the documents in the corpus must be evenly divisible by the number of bands. See lsh_threshold and lsh_probability for guidance in selecting the number of bands and hashes.

progress

Display a progress bar while comparing documents.

Value

A data frame (with the additional class lsh_buckets), containing a column with the document IDs and a column with their LSH signatures, or buckets.

Details

Locality sensitive hashing is a technique for detecting document similarity that does not require pairwise comparisons. When comparing pairs of documents, the number of pairs grows rapidly, so that only the smallest corpora can be compared pairwise in a reasonable amount of computation time. Locality sensitive hashing, on the other hand, takes a document which has been tokenized and hashed using a minhash algorithm. (See minhash_generator.) Each set of minhash signatures is then broken into bands comprised of a certain number of rows. (For example, 200 minhash signatures might be broken down into 20 bands each containing 10 rows.) Each band is then hashed to a bucket. Documents with identical rows in a band will be hashed to the same bucket. The likelihood that a document will be marked as a potential duplicate is proportional to the number of bands and inversely proportional to the number of rows in each band.

This function returns a data frame with the additional class lsh_buckets. The LSH technique only requires that the signatures for each document be calculated once. So it is possible, as long as one uses the same minhash function and the same number of bands, to combine the outputs from this function at different times. The output can thus be treated as a kind of cache of LSH signatures.

To extract pairs of documents from the output of this function, see lsh_candidates.

References

Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (Cambridge University Press, 2011), ch. 3. See also Matthew Casperson, "Minhash for Dummies" (November 14, 2013).

Examples

dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 235)
corpus <- TextReuseCorpus(dir = dir,
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
#> Warning: `gather_()` was deprecated in tidyr 1.2.0.
#>  Please use `gather()` instead.
#>  The deprecated feature was likely used in the textreuse package.
#>   Please report the issue at <https://github.com/ropensci/textreuse/issues>.
buckets
#> # A tibble: 150 × 2
#>    doc          buckets                         
#>    <chr>        <chr>                           
#>  1 ca1851-match 1653a3d599f47612dc64a415c4c7580f
#>  2 ca1851-match 5d3ce291f14f7eca70d81e4838f12b6e
#>  3 ca1851-match a8a4062fc440f32bf9bf0567190a1775
#>  4 ca1851-match e25603c25a8d5cb84115ea4a6e8cff4f
#>  5 ca1851-match ef09e88edc17f41b67cb34a9a5aa9bf9
#>  6 ca1851-match ce9f95aaa696129725f0426db8839b55
#>  7 ca1851-match fe12f2a331a6c4d743ab29d458daa709
#>  8 ca1851-match f9ed8256709250c972a2f1178a57ef08
#>  9 ca1851-match a8cbc2b56af775f3ea97e55d690b8ffa
#> 10 ca1851-match b73076c1a6e01ec8b80d4c65c04343f8
#> # ℹ 140 more rows