Find R packages matching an input of either text or another package
Source: R/similar-pkgs.R
pkgmatch_similar_pkgs.Rd
This function accepts as input either a text description or a path to a local R package, and returns information on the R packages which best match that input. Matches are found from within a specified "corpus", currently all packages from either rOpenSci's package suite or CRAN.
The returned object has a default print method which prints the best 5 matches directly to the screen, yet the object itself holds information on all packages within the specified corpus. This information is in the form of a data.frame, with one column for the package name, and one or more additional columns of integer ranks for each package. There is also a head method to print the first few entries of these full data (default n = 5). To see all data, use as.data.frame().
Ranks are obtained from scores derived from:
- Cosine similarities between Language Model (LM) embeddings for the input, and corresponding embeddings for the specified corpus.
- "Best Match 25" (BM25) scores based on document token frequencies.
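The first of these score types can be illustrated with a minimal base R sketch. The embeddings, package names, and the helper `cosine_sim()` below are hypothetical toy values chosen for illustration, not the package's internal code or data.

```r
# Hypothetical toy embeddings: one row per package, one column per dimension.
corpus_emb <- rbind(
    pkg_a = c(0.9, 0.1, 0.0),
    pkg_b = c(0.1, 0.8, 0.1),
    pkg_c = c(0.7, 0.3, 0.0)
)
# Hypothetical embedding of the input text.
input_emb <- c(1, 0, 0)

# Cosine similarity between each row of `mat` and the vector `vec`.
cosine_sim <- function(mat, vec) {
    dots <- as.vector(mat %*% vec)
    dots / (sqrt(rowSums(mat^2)) * sqrt(sum(vec^2)))
}

sims <- cosine_sim(corpus_emb, input_emb)
names(sims) <- rownames(corpus_emb)

# Convert similarities to integer ranks, with rank 1 the most similar package.
lm_rank <- rank(-sims, ties.method = "min")
lm_rank
```

Here `pkg_a` points in almost the same direction as the input and so receives rank 1, followed by `pkg_c` and then `pkg_b`.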
Ranks for text matches are generally obtained from packages both including and excluding function descriptions as part of the package text. This results in up to four scores for each input. These scores are then combined to a final ranking using the Reciprocal Rank Fusion (RRF) algorithm. The additional parameter of lm_proportion determines the extent to which the final ranking weights the LM versus BM25 components.
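The combination step can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion under stated assumptions, not the package's actual implementation: the function `rrf()`, the constant `k = 60` (a conventional choice for RRF), the linear weighting by `lm_proportion`, and the toy ranks are all assumptions made for this example.

```r
# Hypothetical ranks for three packages, each ranked with and without
# function descriptions, giving four rank columns in total.
pkgs <- c("pkg_a", "pkg_b", "pkg_c")
lm_ranks <- cbind(with_fns = c(1, 2, 3), without_fns = c(1, 2, 3))
bm25_ranks <- cbind(with_fns = c(3, 1, 2), without_fns = c(3, 1, 2))
rownames(lm_ranks) <- rownames(bm25_ranks) <- pkgs

# Reciprocal Rank Fusion: each rank r contributes 1 / (k + r) to a score.
# The LM and BM25 contributions are weighted here by `lm_proportion`
# (an assumed weighting scheme, for illustration only).
rrf <- function(lm_ranks, bm25_ranks, lm_proportion = 0.5, k = 60) {
    stopifnot(lm_proportion >= 0, lm_proportion <= 1)
    lm_score <- rowSums(1 / (k + lm_ranks))
    bm25_score <- rowSums(1 / (k + bm25_ranks))
    score <- lm_proportion * lm_score + (1 - lm_proportion) * bm25_score
    # Higher fused scores receive better (lower) final ranks.
    rank(-score, ties.method = "min")
}

rrf(lm_ranks, bm25_ranks, lm_proportion = 1) # LM ranks only
rrf(lm_ranks, bm25_ranks, lm_proportion = 0) # BM25 ranks only
rrf(lm_ranks, bm25_ranks, lm_proportion = 0.5) # equal weighting
```

In this sketch the two extreme values reproduce the LM-only and BM25-only orderings, and intermediate values blend them.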
Finally, all components of this function are locally cached for each call (by the memoise package), so additional calls to this function with the same input and corpus should be much faster than initial calls. This means the effect of changing lm_proportion can easily be examined by simply repeating calls to this function.
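The general caching pattern can be sketched with `memoise::memoise()`. The functions `slow_component()` and `combine()` below are hypothetical stand-ins invented for this example and do not correspond to the package's internal functions; the point is only that a memoised component is computed once per unique input, while a cheap combination step can be re-run with different weights.

```r
library(memoise)

# Counter to show how often the expensive step actually runs.
n_calls <- 0

# Hypothetical stand-in for an expensive component such as computing scores.
slow_component <- function(input, corpus) {
    n_calls <<- n_calls + 1
    nchar(input) + nchar(corpus)
}
cached_component <- memoise(slow_component)

# Hypothetical cheap combination step, re-run on every call.
combine <- function(input, corpus, lm_proportion = 0.5) {
    lm_proportion * cached_component(input, corpus)
}

input <- "Download open spatial data from NASA"
r1 <- combine(input, "ropensci", lm_proportion = 0.5)
r2 <- combine(input, "ropensci", lm_proportion = 0.25)

n_calls # the expensive component ran only once
```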
Usage
pkgmatch_similar_pkgs(
  input,
  corpus = "ropensci",
  embeddings = NULL,
  idfs = NULL,
  input_is_code = text_is_code(input),
  lm_proportion = 0.5,
  n = 5L,
  browse = FALSE
)
Arguments
- input
Either a path to local source code of an R package, or a text string.
- corpus
If the embeddings or idfs parameters are not specified, they are automatically downloaded for the corpus specified by this parameter. Must be one of "ropensci" or "cran". The function will then return the most similar package from the specified corpus.
- embeddings
Large Language Model embeddings for all rOpenSci packages, generated from pkgmatch_embeddings_from_pkgs. If not provided, pre-generated embeddings will be downloaded and stored in a local cache directory.
- idfs
Inverse Document Frequency tables for all rOpenSci packages, generated from pkgmatch_bm25. If not provided, pre-generated IDF tables will be downloaded and stored in a local cache directory.
- input_is_code
A binary flag indicating whether input is code or plain text. Ignored if input is a path to a local package; otherwise can be used to force appropriate interpretation of input type.
- lm_proportion
A value between 0 and 1 to control the relative contributions of results from Language Models ("LMs") versus results from traditional token-frequency models. Final rankings are generated by combining these two kinds of results, so that lm_proportion = 0 will return results from token frequency analyses only, while lm_proportion = 1 will return results from LMs only.
- n
When the result of this function is printed to screen, the top n packages will be displayed.
- browse
If TRUE, automatically open webpages of the top n matches in local browser.
Value
A data.frame with a "package" column naming packages, and one or more columns of package ranks in terms of text similarity and, if input is a local path to an entire R package, of similarity in code structure. As described above, the default print method prints package names only. To see the full result, use as.data.frame().
Note
The first time this function is run without passing either embeddings or idfs, required values will be automatically downloaded and stored in a locally persistent cache directory. Especially for the "cran" corpus, this downloading may take quite some time.
See also
input_is_code
Other main:
pkgmatch_similar_fns()
Examples
if (FALSE) { # \dontrun{
input <- "Download open spatial data from NASA"
p <- pkgmatch_similar_pkgs (input)
p # Default print method, lists 5 best matching packages
head (p) # Shows first 5 rows of full `data.frame` object
# This second call will be much faster than first call:
p2 <- pkgmatch_similar_pkgs (input, lm_proportion = 0.25)
} # }