
Find R packages matching an input of either text or another package
Source: R/similar-pkgs.R
pkgmatch_similar_pkgs.Rd
This function accepts as input either a text description or a path to a
local R package, and ranks all R packages within the specified corpus in
terms of how well they match that input. The "corpus" argument can specify
either rOpenSci's package suite or CRAN.
Ranks are obtained from scores derived from:
- Cosine similarities between Language Model (LM) embeddings for the input and corresponding embeddings for the specified corpus.
- "Best Match 25" (BM25) scores based on document token frequencies.
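The two kinds of scores can be illustrated with a minimal base-R sketch. This is not the package's internal code: the helper names `cosine_sim` and `bm25_scores`, the toy embeddings, the toy token lists, and the textbook Okapi BM25 constants (`k1 = 1.2`, `b = 0.75`) are all assumptions chosen for illustration.

```r
# Cosine similarity between one input embedding and each column of a matrix
# of corpus embeddings (one column per package). Hypothetical helper.
cosine_sim <- function (input_vec, corpus_mat) {
    num <- as.vector (crossprod (corpus_mat, input_vec))
    denom <- sqrt (sum (input_vec^2)) * sqrt (colSums (corpus_mat^2))
    stats::setNames (num / denom, colnames (corpus_mat))
}

# Textbook Okapi BM25 for a tokenised query against a named list of
# tokenised documents. Hypothetical helper with assumed constants.
bm25_scores <- function (query, docs, k1 = 1.2, b = 0.75) {
    n_docs <- length (docs)
    doc_len <- lengths (docs)
    avg_len <- mean (doc_len)
    scores <- numeric (n_docs)
    for (tok in unique (query)) {
        # Term frequency of this token in every document
        tf <- vapply (docs, function (d) as.numeric (sum (d == tok)), numeric (1))
        n_with_tok <- sum (tf > 0)
        idf <- log ((n_docs - n_with_tok + 0.5) / (n_with_tok + 0.5) + 1)
        scores <- scores + idf * tf * (k1 + 1) /
            (tf + k1 * (1 - b + b * doc_len / avg_len))
    }
    stats::setNames (scores, names (docs))
}

# Toy 3-dimensional "embeddings" for three made-up packages
corpus <- cbind (
    pkgA = c (1, 0, 0),
    pkgB = c (0.6, 0.8, 0),
    pkgC = c (0, 0, 1)
)
sims <- cosine_sim (c (1, 0, 0), corpus)
lm_rank <- rank (-sims, ties.method = "min") # higher similarity = better rank

# Toy token lists for the same made-up packages
docs <- list (
    pkgA = c ("http", "client", "request"),
    pkgB = c ("plot", "colour", "map"),
    pkgC = c ("http", "server")
)
bm25 <- bm25_scores (c ("http", "request"), docs)
bm25_rank <- rank (-bm25, ties.method = "min")
```

In both cases a higher score means a better match, so ranks are taken on the negated scores.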
For text input, ranks are generally obtained for packages both including and
excluding function descriptions as part of the package text, giving two sets
of ranks for a given input. Where input is an entire R package, separate
ranks are also calculated for package code and text, thus giving four
distinct ranks. The function ultimately returns a single rank, derived by
combining individual ranks using the Reciprocal Rank Fusion (RRF) algorithm. The
additional parameter lm_proportion determines the extent to which the
final ranking weights the LM versus BM25 components.
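As a rough illustration of how Reciprocal Rank Fusion can merge two rankings, the sketch below sums weighted reciprocal ranks. The helper `rrf_combine`, the constant `k = 60` (a commonly used RRF default), and the way `lm_proportion` enters as a linear weight are assumptions made for this example; the package's actual weighting may differ.

```r
# Weighted Reciprocal Rank Fusion of two rank vectors (hypothetical helper).
# Each ranking contributes weight / (k + rank); larger fused scores are better.
rrf_combine <- function (lm_rank, bm25_rank, lm_proportion = 0.5, k = 60) {
    stopifnot (lm_proportion >= 0, lm_proportion <= 1)
    score <- lm_proportion / (k + lm_rank) +
        (1 - lm_proportion) / (k + bm25_rank)
    rank (-score, ties.method = "min")
}

# Made-up ranks for three made-up packages
lm_rank <- c (pkgA = 1, pkgB = 2, pkgC = 3)
bm25_rank <- c (pkgA = 3, pkgB = 1, pkgC = 2)

r_lm_only <- rrf_combine (lm_rank, bm25_rank, lm_proportion = 1)
r_bm25_only <- rrf_combine (lm_rank, bm25_rank, lm_proportion = 0)
r_equal <- rrf_combine (lm_rank, bm25_rank, lm_proportion = 0.5)
```

Under these assumptions, `lm_proportion = 1` reproduces the LM ranking, `lm_proportion = 0` reproduces the BM25 ranking, and intermediate values blend the two.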
Finally, all components of this function are locally cached for each call
(by the memoise package), so additional calls to this function with
the same input and corpus should be much faster than initial calls. This
means the effect of changing lm_proportion can easily be examined by
simply repeating calls to this function.
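The caching idea can be sketched in base R as follows. This is only an illustration of memoisation in general, not the memoise package's implementation nor this function's internals; `memoise_sketch` and `slow_component` are hypothetical names.

```r
# Wrap a function so that results are stored per unique set of arguments
# (hypothetical helper illustrating the principle of memoisation).
memoise_sketch <- function (f) {
    cache <- new.env (parent = emptyenv ())
    function (...) {
        key <- paste (deparse (list (...)), collapse = "")
        if (!exists (key, envir = cache, inherits = FALSE)) {
            assign (key, f (...), envir = cache)
        }
        get (key, envir = cache, inherits = FALSE)
    }
}

# Stand-in for an expensive step; counts how often it really runs
n_calls <- 0
slow_component <- function (input, corpus) {
    n_calls <<- n_calls + 1
    nchar (input) + nchar (corpus)
}
fast_component <- memoise_sketch (slow_component)

first <- fast_component ("curl", "cran") # computed
second <- fast_component ("curl", "cran") # served from the cache
calls_after_repeat <- n_calls
third <- fast_component ("curl", "ropensci") # new corpus, computed again
calls_after_new_corpus <- n_calls
```

Only a change in the cached arguments triggers recomputation, which is why repeating a call with a different lm_proportion but the same input and corpus can re-use the expensive components.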
Usage
pkgmatch_similar_pkgs(
  input,
  corpus = NULL,
  embeddings = NULL,
  idfs = NULL,
  input_is_code = text_is_code(input),
  lm_proportion = 0.5,
  n = 5L,
  browse = FALSE
)
Arguments
- input
Either a text string, a path to local source code of an R package, or the name of any installed R package.
- corpus
Must be specified as one of "ropensci" or "cran". If the embeddings or idfs parameters are not specified, they will be automatically downloaded for the corpus specified by this parameter. The function will then return the most similar packages from the specified corpus. Note that calculations with corpus = "cran" will generally take longer, because the corpus is much larger.
- embeddings
Large Language Model embeddings for a suite of packages, generated from pkgmatch_embeddings_from_pkgs. If not provided, pre-generated embeddings will be downloaded and stored in a local cache directory.
- idfs
Inverse Document Frequency tables for a suite of packages, generated from pkgmatch_bm25. If not provided, pre-generated IDF tables will be downloaded and stored in a local cache directory.
- input_is_code
A binary flag indicating whether input is code or plain text. Ignored if input is a path to a local package; otherwise can be used to force appropriate interpretation of input type.
- lm_proportion
A value between 0 and 1 to control the relative contributions of results from Language Models ("LMs") versus results from traditional token-frequency models. Final rankings are generated by combining these two kinds of results, so that lm_proportion = 0 will return results from token frequency analyses only, while lm_proportion = 1 will return results from LMs only.
- n
When the result of this function is printed to screen, the top n packages will be displayed.
- browse
If TRUE, automatically open webpages of the top n matches in local browser.
Value
A data.frame with a "package" column naming packages, and one or
more columns of package ranks in terms of text similarity and, if input is
an R package, of similarity in code structure.
The returned object has a default print method which prints the best 5
matches directly to the screen, yet returns information on all packages
within the specified corpus. This information is in the form of a
data.frame, with one column for the package name, and one or more
additional columns of integer ranks for each package. There is also a head
method to print the first few entries of these full data (default n = 5).
To see all data, use as.data.frame(). See the example below for how to
manipulate these objects.
Note
The first time this function is run without passing either embeddings
or idfs, required values will be automatically downloaded and
stored in a locally persistent cache directory. Especially for the "cran"
corpus, this downloading may take quite some time.
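The message printed in the examples on this page mentions an environment variable named 'PKGMATCH_CACHE_DIR' for setting the cache path. A minimal sketch of setting it from within R follows; the directory used here is a hypothetical temporary location chosen only so the example is self-contained, and exactly when the package reads the variable is an assumption.

```r
# Hypothetical cache location; for real use, replace with a persistent path
cache_dir <- file.path (tempdir (), "pkgmatch_cache")
dir.create (cache_dir, recursive = TRUE, showWarnings = FALSE)

# Set the variable for the current R session only
Sys.setenv (PKGMATCH_CACHE_DIR = cache_dir)
Sys.getenv ("PKGMATCH_CACHE_DIR")
```

To make the setting persist across sessions, the same variable could instead be defined in a startup file such as `~/.Renviron`.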
See also
input_is_code
Other main:
pkgmatch_similar_fns()
Examples
# The following function simulates remote data in temporary directory, to
# enable package usage without downloading. Do not run for normal usage.
generate_pkgmatch_example_data ()
#> This function resets the cache directory used by 'pkgmatch'
#> to a temporary path. To restore functionality with full data,
#> you'll either need to restart your R session, or set an
#> environment variable named 'PKGMATCH_CACHE_DIR' to the
#> desired path. Default path is /tmp/Rtmp3FnoNI/pkgmatch_ex_data
input <- "curl" # Name of a single installed package
p <- pkgmatch_similar_pkgs (input, corpus = "cran")
p # Default print method, lists 5 best matching packages
#> $text
#> [1] "crul" "AmpGram" "CancerGram" "httr2" "mRpostman"
#>
#> $code
#> [1] "AmpGram" "mRpostman" "RCurl" "CancerGram" "crul"
#>
head (p) # Shows first 5 rows of full `data.frame` object
#>      package version text_rank code_rank
#> 1       crul   1.5.0         1         5
#> 2    AmpGram     1.0         2         1
#> 3 CancerGram   1.0.0         3         4
#> 4      httr2   1.1.2         4         6
#> 5  mRpostman   1.1.4         5         2
# This second call changes the default equal weighting of language model and
# token frequency (BM25) results. It will be much faster than the first call,
# because previously generated embeddings are re-used.
p2 <- pkgmatch_similar_pkgs (input, corpus = "cran", lm_proportion = 0.25)
# Example demonstrating how to combine results using different values of
# `lm_proportion`. Input is a package, so result has columns for "text_rank"
# and "code_rank".
lm_props <- 0:10 / 10
res <- lapply (lm_props, function (p) {
    nm_text <- sprintf ("text_rank_p%02.0f", p * 10)
    nm_code <- sprintf ("code_rank_p%02.0f", p * 10)
    res <- pkgmatch_similar_pkgs (input, corpus = "cran", lm_proportion = p) |>
        dplyr::rename ({{nm_text}} := "text_rank", {{nm_code}} := "code_rank") |>
        dplyr::arrange (package)
    if (p > 0) {
        res <- dplyr::select (res, -package, -version)
    }
    return (res)
})
res <- do.call (cbind, res)
# That then has paired columns of (text rank, code rank) for each of the
# 11 values of `lm_props`.
head (res)
#> package version text_rank_p00 code_rank_p00 text_rank_p01 code_rank_p01
#> 1 AmpGram 1.0 5 5 5 5
#> 2 CancerGram 1.0.0 4 4 4 4
#> 3 RCurl 1.98 3 3 3 3
#> 4 crul 1.5.0 2 2 2 2
#> 5 curl 6.2.2 6 6 6 6
#> 6 httr 1.4.7 7 7 7 7
#> text_rank_p02 code_rank_p02 text_rank_p03 code_rank_p03 text_rank_p04
#> 1 5 5 4 5 2
#> 2 4 4 3 4 3
#> 3 3 3 5 2 5
#> 4 1 2 1 3 1
#> 5 6 6 8 7 8
#> 6 7 7 6 9 7
#> code_rank_p04 text_rank_p05 code_rank_p05 text_rank_p06 code_rank_p06
#> 1 2 2 1 2 1
#> 2 4 3 4 4 2
#> 3 3 6 3 6 3
#> 4 5 1 5 1 6
#> 5 8 8 8 10 9
#> 6 9 7 9 5 8
#> text_rank_p07 code_rank_p07 text_rank_p08 code_rank_p08 text_rank_p09
#> 1 2 1 2 1 2
#> 2 4 2 4 2 4
#> 3 6 3 7 4 8
#> 4 1 6 3 7 3
#> 5 10 9 10 9 9
#> 6 5 8 5 8 5
#> code_rank_p09 text_rank_p10 code_rank_p10
#> 1 1 2 1
#> 2 3 4 3
#> 3 4 8 4
#> 4 7 3 7
#> 5 9 9 9
#> 6 8 5 8