This function accepts as input either a text description, or a path to a local R package, and ranks all R packages within the specified corpus in terms of how well they match that input. The "corpus" argument can specify either rOpenSci's package suite, or CRAN.

Ranks are obtained from scores derived from:

  • Cosine similarities between Language Model (LM) embeddings for the input, and corresponding embeddings for the specified corpus.

  • "Best Match 25" (BM25) scores based on document token frequencies.
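The two kinds of score listed above can be illustrated with a minimal base-R sketch. It is an illustration under stated assumptions, not the internal code of pkgmatch: the toy embeddings, the toy token vectors, the helper names `cosine_sim` and `bm25_scores`, and the BM25 constants `k1 = 1.2` and `b = 0.75` are all hypothetical choices made for this example.

```r
# Hypothetical toy embeddings: one row per package, one column per dimension.
corpus_emb <- rbind (
    pkgA = c (1, 0, 1),
    pkgB = c (0, 1, 0),
    pkgC = c (1, 1, 1)
)
input_emb <- c (1, 0, 1)

# Cosine similarity between two numeric vectors.
cosine_sim <- function (x, y) {
    sum (x * y) / (sqrt (sum (x^2)) * sqrt (sum (y^2)))
}
sims <- apply (corpus_emb, 1, cosine_sim, y = input_emb)
lm_rank <- rank (-sims, ties.method = "first") # rank 1 = most similar

# Hypothetical tokenised package texts and a tokenised input.
docs <- list (
    pkgA = c ("http", "request", "client"),
    pkgB = c ("plot", "colour", "plot"),
    pkgC = c ("http", "download", "file", "http")
)
query <- c ("http", "client")

# Standard Okapi BM25 score of each document for the query tokens.
bm25_scores <- function (docs, query, k1 = 1.2, b = 0.75) {
    n_docs <- length (docs)
    avg_len <- mean (lengths (docs))
    vapply (docs, function (d) {
        sum (vapply (query, function (q) {
            # Number of documents containing token q
            n_q <- sum (vapply (docs, function (x) q %in% x, logical (1)))
            idf <- log ((n_docs - n_q + 0.5) / (n_q + 0.5) + 1)
            f <- sum (d == q) # frequency of q in this document
            idf * f * (k1 + 1) / (f + k1 * (1 - b + b * length (d) / avg_len))
        }, numeric (1)))
    }, numeric (1))
}
bm25 <- bm25_scores (docs, query)
bm25_rank <- rank (-bm25, ties.method = "first")
```

Both calculations give one score per package, and each set of scores is then converted to ranks with rank 1 as the best match.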

For text input, ranks are generally obtained for packages both including and excluding function descriptions as part of the package text, giving two sets of ranks for a given input. Where input is an entire R package, separate ranks are also calculated for package code and text, thus giving four distinct ranks. The function ultimately returns a single rank, derived by combining individual ranks using the Reciprocal Rank Fusion (RRF) algorithm. The lm_proportion parameter determines how strongly the final ranking weights the LM component relative to the BM25 component.
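The combination step can be sketched as a weighted form of Reciprocal Rank Fusion. The sketch below is only an illustration: the helper `rrf_combine`, the toy ranks, the constant `k = 60` (the value commonly used for RRF), and the way `lm_proportion` enters as a linear weight are all assumptions, and the actual weighting used inside pkgmatch may differ.

```r
# Hypothetical helper: weighted Reciprocal Rank Fusion of two rank vectors.
# Each rank contributes 1 / (k + rank); `lm_proportion` weights the LM term
# and (1 - lm_proportion) weights the BM25 term. This weighting scheme is an
# assumption made for illustration.
rrf_combine <- function (lm_rank, bm25_rank, lm_proportion = 0.5, k = 60) {
    score <- lm_proportion / (k + lm_rank) +
        (1 - lm_proportion) / (k + bm25_rank)
    rank (-score, ties.method = "first") # rank 1 = highest fused score
}

# Toy ranks for three hypothetical packages.
lm_rank <- c (pkgA = 1, pkgB = 2, pkgC = 3)
bm25_rank <- c (pkgA = 3, pkgB = 1, pkgC = 2)

rrf_combine (lm_rank, bm25_rank, lm_proportion = 1) # follows the LM ranks
rrf_combine (lm_rank, bm25_rank, lm_proportion = 0) # follows the BM25 ranks
rrf_combine (lm_rank, bm25_rank, lm_proportion = 0.5) # pkgB first, then pkgA
```

At the two extremes the fused ranking reproduces one of the input rankings, while intermediate values favour packages that rank well in both.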

Finally, all components of this function are locally cached for each call (by the memoise package), so additional calls to this function with the same input and corpus should be much faster than initial calls. This means the effect of changing lm_proportion can easily be examined by simply repeating calls to this function.

Usage

pkgmatch_similar_pkgs(
  input,
  corpus = NULL,
  embeddings = NULL,
  idfs = NULL,
  input_is_code = text_is_code(input),
  lm_proportion = 0.5,
  n = 5L,
  browse = FALSE
)

Arguments

input

Either a text string, a path to local source code of an R package, or the name of any installed R package.

corpus

Must be specified as one of "ropensci" or "cran". If embeddings or idfs parameters are not specified, they will be automatically downloaded for the corpus specified by this parameter. The function will then return the most similar packages from the specified corpus. Note that calculations with corpus = "cran" will generally take longer, because the corpus is much larger.

embeddings

Large Language Model embeddings for a suite of packages, generated from pkgmatch_embeddings_from_pkgs. If not provided, pre-generated embeddings will be downloaded and stored in a local cache directory.

idfs

Inverse Document Frequency tables for a suite of packages, generated from pkgmatch_bm25. If not provided, pre-generated IDF tables will be downloaded and stored in a local cache directory.

input_is_code

A binary flag indicating whether input is code or plain text. Ignored if input is a path to a local package; otherwise can be used to force appropriate interpretation of input type.

lm_proportion

A value between 0 and 1 to control the relative contributions of results from Language Models ("LMs") versus results from traditional token-frequency models. Final rankings are generated by combining these two kinds of results, so that lm_proportion = 0 will return results from token frequency analyses only, while lm_proportion = 1 will return results from LMs only.

n

When the result of this function is printed to screen, the top n packages will be displayed.

browse

If TRUE, automatically open webpages of the top n matches in local browser.

Value

A data.frame with a "package" column naming packages, and one or more columns of package ranks in terms of text similarity and, if input is an R package, of similarity in code structure.

The returned object has a default print method which prints only the best 5 matches to the screen, although the object itself holds information on all packages within the specified corpus. This information is in the form of a data.frame, with one column for the package name, and one or more additional columns of integer ranks for each package. There is also a head method to print the first few rows of the full data (default n = 5). To see all data, use as.data.frame(). See the example below for how to manipulate these objects.

Note

The first time this function is run without passing either embeddings or idfs, required values will be automatically downloaded and stored in a locally persistent cache directory. Especially for the "cran" corpus, this downloading may take quite some time.

See also

input_is_code

Other main: pkgmatch_similar_fns()

Examples

# The following function simulates remote data in a temporary directory, to
# enable package usage without downloading. Do not run for normal usage.
generate_pkgmatch_example_data ()
#> This function resets the cache directory used by 'pkgmatch'
#> to a temporary path. To restore functionality with full data,
#> you'll either need to restart your R session, or set an
#> environment variable named 'PKGMATCH_CACHE_DIR' to the
#> desired path. Default path is /tmp/Rtmp3FnoNI/pkgmatch_ex_data

input <- "curl" # Name of a single installed package
p <- pkgmatch_similar_pkgs (input, corpus = "cran")
p # Default print method, lists 5 best matching packages
#> $text
#> [1] "crul"       "AmpGram"    "CancerGram" "httr2"      "mRpostman" 
#> 
#> $code
#> [1] "AmpGram"    "mRpostman"  "RCurl"      "CancerGram" "crul"      
#> 
head (p) # Shows first 5 rows of full `data.frame` object
#>      package version text_rank code_rank
#> 1       crul   1.5.0         1         5
#> 2    AmpGram     1.0         2         1
#> 3 CancerGram   1.0.0         3         4
#> 4      httr2   1.1.2         4         6
#> 5  mRpostman   1.1.4         5         2

# This second call changes the default equal weighting of language model and
# token frequency (BM25) results. It will be much faster than the first
# call, because previously generated embeddings are re-used.
p2 <- pkgmatch_similar_pkgs (input, corpus = "cran", lm_proportion = 0.25)

# Example demonstrating how to combine results using different values of
# `lm_proportion`. Input is a package, so result has columns for "text_rank"
# and "code_rank".
lm_props <- 0:10 / 10
res <- lapply (lm_props, function (p) {
    nm_text <- sprintf ("text_rank_p%02.0f", p * 10)
    nm_code <- sprintf ("code_rank_p%02.0f", p * 10)
    res <- pkgmatch_similar_pkgs (input, corpus = "cran", lm_proportion = p) |>
        dplyr::rename ({{nm_text}} := "text_rank", {{nm_code}} := "code_rank") |>
        dplyr::arrange (package)
    if (p > 0) {
        res <- dplyr::select (res, -package, -version)
    }
    return (res)
})
res <- do.call (cbind, res)

# That then has paired columns of (text rank, code rank) for each of the
# 11 values of `lm_props`.
head (res)
#>      package version text_rank_p00 code_rank_p00 text_rank_p01 code_rank_p01
#> 1    AmpGram     1.0             5             5             5             5
#> 2 CancerGram   1.0.0             4             4             4             4
#> 3      RCurl    1.98             3             3             3             3
#> 4       crul   1.5.0             2             2             2             2
#> 5       curl   6.2.2             6             6             6             6
#> 6       httr   1.4.7             7             7             7             7
#>   text_rank_p02 code_rank_p02 text_rank_p03 code_rank_p03 text_rank_p04
#> 1             5             5             4             5             2
#> 2             4             4             3             4             3
#> 3             3             3             5             2             5
#> 4             1             2             1             3             1
#> 5             6             6             8             7             8
#> 6             7             7             6             9             7
#>   code_rank_p04 text_rank_p05 code_rank_p05 text_rank_p06 code_rank_p06
#> 1             2             2             1             2             1
#> 2             4             3             4             4             2
#> 3             3             6             3             6             3
#> 4             5             1             5             1             6
#> 5             8             8             8            10             9
#> 6             9             7             9             5             8
#>   text_rank_p07 code_rank_p07 text_rank_p08 code_rank_p08 text_rank_p09
#> 1             2             1             2             1             2
#> 2             4             2             4             2             4
#> 3             6             3             7             4             8
#> 4             1             6             3             7             3
#> 5            10             9            10             9             9
#> 6             5             8             5             8             5
#>   code_rank_p09 text_rank_p10 code_rank_p10
#> 1             1             2             1
#> 2             3             4             3
#> 3             4             8             4
#> 4             7             3             7
#> 5             9             9             9
#> 6             8             5             8