Build an inverted index from tokens to the documents that contain them. This is useful for finding document pairs that share one or more n-grams without comparing every document pair. The corpus must be created with keep_tokens = TRUE.

Usage

token_index(corpus, min_doc_count = 2, max_doc_count = Inf)

Arguments

corpus

A TextReuseCorpus with retained tokens.

min_doc_count

Minimum number of documents a token must appear in to be retained. Increase this to remove rare tokens.

max_doc_count

Maximum number of documents a token may appear in to be retained. Decrease this to remove very common tokens.

Value

A textreuse_token_index data frame with columns token, docs, and n_docs.
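
Examples

The following base R sketch illustrates the idea behind an inverted index and the effect of min_doc_count and max_doc_count. It is not the package's implementation: the helper names build_index and candidate_pairs and the toy doc_tokens list are hypothetical, chosen only to mirror the arguments and the token, docs, and n_docs columns described above.

```r
# Toy corpus (hypothetical data): each document is a character vector of
# n-gram tokens, as would be retained when keep_tokens = TRUE.
doc_tokens <- list(
  a = c("the quick", "quick brown", "brown fox"),
  b = c("quick brown", "brown fox", "fox jumps"),
  c = c("lazy dog", "brown fox")
)

# Illustrative helper (not part of the package): map each token to the
# ids of the documents that contain it, then filter by document count.
build_index <- function(doc_tokens, min_doc_count = 2, max_doc_count = Inf) {
  # One row per (document, distinct token) pair.
  pairs <- do.call(rbind, lapply(names(doc_tokens), function(id) {
    data.frame(doc = id, token = unique(doc_tokens[[id]]),
               stringsAsFactors = FALSE)
  }))
  docs <- split(pairs$doc, pairs$token)   # token -> document ids
  n_docs <- lengths(docs)
  keep <- n_docs >= min_doc_count & n_docs <= max_doc_count
  index <- data.frame(token = names(docs)[keep],
                      n_docs = unname(n_docs[keep]),
                      stringsAsFactors = FALSE)
  index$docs <- unname(docs[keep])        # list column of document ids
  index[, c("token", "docs", "n_docs")]
}

# Illustrative helper (not part of the package): enumerate the document
# pairs that share at least one indexed token.
candidate_pairs <- function(index) {
  docs <- index$docs[lengths(index$docs) >= 2]
  if (length(docs) == 0) {
    return(data.frame(a = character(0), b = character(0),
                      stringsAsFactors = FALSE))
  }
  pairs <- unique(do.call(rbind, lapply(docs, function(d) {
    t(combn(sort(d), 2))
  })))
  data.frame(a = pairs[, 1], b = pairs[, 2], stringsAsFactors = FALSE)
}

idx <- build_index(doc_tokens)
print(idx)
print(candidate_pairs(idx))
```

With the default min_doc_count = 2, tokens that occur in only one document are dropped, since they cannot link two documents. Lowering max_doc_count additionally drops tokens shared by many documents, which tend to produce a large number of uninformative candidate pairs.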