How does pkgmatch work?

The pkgmatch package finds packages, as well as individual functions, which best match a given input. Inputs can be text descriptions, sections of code, or even entire R packages. pkgmatch finds the most closely matching packages using a token-frequency algorithm. This document describes how the pkgmatch package works.

Summary

Input chunking:
- Packages as either text or code
- Text chunks:
  - Entire package text, including README, vignettes, and all function documentation entries
  - Package text without function documentation entries
- Code chunks: None; entire package code as sole input chunk
LM sizes: Input = 8,096 tokens, embedding length = 768
Similarities assessed from “Inverse Document Frequency” weighting of terms, via “BM25” values.
Final ranking:
- “Reciprocal Rank Fusion” algorithm of Cormack, Clarke, & Büttcher (2009).
Reproducibility: Entirely deterministic and reproducible

Input forms

Text input of packages includes text from the package description (in the “DESCRIPTION” file), followed by text from the “README” file (where present), text of all vignettes, text of all function descriptions, and parameters of each function. All code is removed from these text sources, including removal of all code chunks in .Rmd documents.

A reduced version of the input text is also submitted to the token-frequency algorithms, through removing the text of function descriptions, leaving only package description, README and vignette text. Text inputs and outputs are thus considered in two forms: the full form, generally appended with _with_fns, and the reduced form appended with _wo_fns.

Inputs as text or code

All functions accept an input parameter, with an accompanying input_is_code logical flag. If not specifically set, the value of input_is_code flag is determined by first passing the input to an internal function, text_is_code(). That function defines inputs as code if the number of recognisable words in the input is equal to or greater than a default threshold of 98%. The code file in which that function is defined contains a short script at the bottom which was used to define this threshold. This automated distinction between text and code inputs can always be over-ridden by specifying a value for the input_is_code parameter.

Similarities between inputs and corpora of R packages

The algorithms described here were applied to all R packages from rOpenSci’s suite, as well as from CRAN. Most functions accept a “corpus” parameter, with values of either “ropensci” or “cran” to compare inputs to these specified corpora. The following sections describe algorithms this package uses to assess similarities between inputs and all packages within these corpora.

Token-frequency similarities

The pkgmatch package passes all input to an internal version of the “Best Match 25” (BM25) algorithm, an algorithm that is commonly used in conjunction with LM outputs. The BM25 algorithm generates vectors of numeric values quantifying the similarity between a given input and all packages (or functions) within the specified corpus.

Importantly, the version developed here separates the two steps required in the algorithm, of calculating inverse document frequencies (IDFs) of all terms in the specified corpora, and then using those IDFs to calculate final scores for input documents. The package is distributed along with pre-calculated values of IDFs for all terms (both text and code) for both corpora. These are automatically downloaded the first time any functions are called, and from that time on, always available for immediate use in all subsequent calculations.

Tokens for text input

For text input, the BM25 algorithm requires inputs to be converted to “tokens”, for which this package relies on rOpenSci’s tokenizers package. The BM25 values then assess similarities through adding similarities for all tokens shared between two documents, weighted by the inverse of token frequencies across all packages within the specified corpus. This inverse weighting ensures that similarities are more influenced by similar usage of infrequent tokens, and less by usage of common tokens.

Tokens for code input

For code input, tokens for BM25 input are taken as the names of all functions called, pre-pended with namespaces, such as base::print(). Function call are identified from the “treesitter” algorithm, using functionality of the treesitter R package. This package is currently restricted to the R language only, and thus code input in this package considers code from /R sub-directories only, and excludes any source code in other languages.

All function calls in all packages within each corpus are first converted to IDFs. Calculation of BM25 values are weighted by these so that, as for word-token frequencies, resultant similarities are most strongly affected by similarities in usage of less common functions, and relatively unaffected by similarities in usage of common functions.

Combining similarities with “rank fusion”

Finally, the best matching outputs from the various combinations of input chunks and algorithms are combined using the Reciprocal Rank Fusion (RRF) algorithm of Cormack, Clarke, & Büttcher (2009). Scores for each combination of input chunk and algorithm are converted to ranks, and all ranks submitted to the RRF algorithm. A default value of k = 60 is used at all times, as recommended by the original paper, which also suggests very low sensitivity of results to variations in this value. Where inputs are entire packages, final rankings are generated separately for code and text.

References

Cormack, G. V., Clarke, C. L. A. & Buettcher, S. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval 758–759 (ACM, Boston MA USA, 2009). doi:10.1145/1571941.1572114.