The pkgmatch package finds packages, as well as
individual functions, which best match a given input. Inputs can be text
descriptions, sections of code, or even entire R packages.
pkgmatch finds the most closely matching packages using a
token-frequency
algorithm. This document describes how the pkgmatch
package works.
Summary
- Input chunking:
- Packages as either text or code
- Text chunks:
- Entire package text, including README, vignettes, and all function documentation entries
- Package text without function documentation entries
- Code chunks: None; entire package code as sole input chunk
- LM sizes: Input = 8,096 tokens, embedding length = 768
- Similarities assessed from “Inverse Document Frequency” weighting of terms, via “BM25” values.
- Final ranking:
- “Reciprocal Rank Fusion” algorithm of Cormack, Clarke, & Büttcher (2009).
- Reproducibility: Entirely deterministic and reproducible
Input forms
Text input of packages includes text from the package description (in
the “DESCRIPTION” file), followed by text from the “README” file (where
present), text of all vignettes, text of all function descriptions, and
parameters of each function. All code is removed from these text
sources, including removal of all code chunks in .Rmd
documents.
A reduced version of the input text is also submitted to the
token-frequency algorithms, through removing the text of function
descriptions, leaving only package description, README and vignette
text. Text inputs and outputs are thus considered in two forms: the full
form, generally appended with _with_fns, and the reduced
form appended with _wo_fns.
Inputs as text or code
All functions accept an input parameter, with an
accompanying input_is_code logical flag. If not
specifically set, the value of input_is_code flag is
determined by first passing the input to an internal
function, text_is_code(). That function defines inputs as
code if the number of recognisable words in the input is equal to or
greater than a default threshold of 98%. The code
file in which that function is defined contains a short script at
the bottom which was used to define this threshold. This automated
distinction between text and code inputs can always be over-ridden by
specifying a value for the input_is_code parameter.
Similarities between inputs and corpora of R packages
The algorithms described here were applied to all R packages from rOpenSci’s suite, as well as from CRAN. Most functions accept a “corpus” parameter, with values of either “ropensci” or “cran” to compare inputs to these specified corpora. The following sections describe algorithms this package uses to assess similarities between inputs and all packages within these corpora.
Token-frequency similarities
The pkgmatch package passes all input to an internal
version of the “Best
Match 25” (BM25) algorithm, an algorithm that is commonly used in
conjunction with LM outputs. The BM25 algorithm generates vectors of
numeric values quantifying the similarity between a given input and all
packages (or functions) within the specified corpus.
Importantly, the version developed here separates the two steps required in the algorithm, of calculating inverse document frequencies (IDFs) of all terms in the specified corpora, and then using those IDFs to calculate final scores for input documents. The package is distributed along with pre-calculated values of IDFs for all terms (both text and code) for both corpora. These are automatically downloaded the first time any functions are called, and from that time on, always available for immediate use in all subsequent calculations.
Tokens for text input
For text input, the BM25 algorithm requires inputs to be converted to
“tokens”, for which this package relies on rOpenSci’s
tokenizers package. The BM25 values then assess
similarities through adding similarities for all tokens shared between
two documents, weighted by the inverse of token frequencies across all
packages within the specified corpus. This inverse weighting ensures
that similarities are more influenced by similar usage of infrequent
tokens, and less by usage of common tokens.
Tokens for code input
For code input, tokens for BM25 input are taken as the names of all
functions called, pre-pended with namespaces, such as
base::print(). Function call are identified from the
“treesitter” algorithm, using functionality of the treesitter
R package. This package is currently restricted to the R language only,
and thus code input in this package considers code from /R
sub-directories only, and excludes any source code in other
languages.
All function calls in all packages within each corpus are first converted to IDFs. Calculation of BM25 values are weighted by these so that, as for word-token frequencies, resultant similarities are most strongly affected by similarities in usage of less common functions, and relatively unaffected by similarities in usage of common functions.
Combining similarities with “rank fusion”
Finally, the best matching outputs from the various combinations of
input chunks and algorithms are combined using the Reciprocal
Rank Fusion (RRF) algorithm of Cormack, Clarke, & Büttcher
(2009). Scores for each combination of input chunk and algorithm are
converted to ranks, and all ranks submitted to the RRF algorithm. A
default value of k = 60 is used at all times, as
recommended by the original paper, which also suggests very low
sensitivity of results to variations in this value. Where inputs are
entire packages, final rankings are generated separately for code and
text.
References
Cormack, G. V., Clarke, C. L. A. & Buettcher, S. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval 758–759 (ACM, Boston MA USA, 2009). doi:10.1145/1571941.1572114.
