This function accepts as input either a text description, or a path to a local R package, and returns information on R packages which best match that input. Matches are found from within a specified "corpus", currently all packages from either rOpenSci's package suite, or from CRAN.

The returned object has a default print method which prints only the best 5 matches to the screen, but the object itself holds information on all packages within the specified corpus. This information is in the form of a data.frame, with one column for the package name, and one or more additional columns of integer ranks for each package. There is also a head method to print the first few rows of the full data (default n = 5). To see all data, use as.data.frame().

Ranks are obtained from scores derived from:

  • Cosine similarities between Large Language Model (LLM) embeddings for the input, and corresponding embeddings for the specified corpus.

  • "Best Match 25" (BM25) scores based on document token frequencies.
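The two kinds of scores listed above can be illustrated with a minimal base R sketch. This is not the pkgmatch implementation: the package names (pkgA, pkgB, pkgC), the toy embedding values, the tokenised descriptions, and the helper names cosine_sim and bm25_scores are all hypothetical, and the BM25 constants k1 = 1.2 and b = 0.75 are merely conventional defaults, assumed here for illustration. The IDF formula is one common variant.

```r
# Cosine similarity between one input vector and each column of a matrix.
cosine_sim <- function (x, m) {
    as.vector (crossprod (m, x)) / (sqrt (sum (x^2)) * sqrt (colSums (m^2)))
}

# Toy 3-dimensional "embeddings" for three hypothetical packages
# (made-up values; real LLM embeddings have far more dimensions).
emb <- cbind (
    pkgA = c (1, 0, 0),
    pkgB = c (1, 1, 0),
    pkgC = c (0, 0, 1)
)
input_emb <- c (2, 0, 0)

sims <- cosine_sim (input_emb, emb)
names (sims) <- colnames (emb)
# Convert scores to ranks, with rank 1 the most similar package.
llm_ranks <- rank (-sims, ties.method = "min")

# BM25 score of a tokenised query against each tokenised document.
# k1 and b are assumed, conventional values.
bm25_scores <- function (query, docs, k1 = 1.2, b = 0.75) {
    n_docs <- length (docs)
    doc_len <- unname (lengths (docs))
    avg_len <- mean (doc_len)
    vapply (seq_along (docs), function (i) {
        s <- 0
        for (q in unique (query)) {
            f <- sum (docs [[i]] == q) # term frequency in this document
            # Number of documents containing the term:
            n_q <- sum (vapply (docs, function (d) q %in% d, logical (1)))
            idf <- log ((n_docs - n_q + 0.5) / (n_q + 0.5) + 1)
            s <- s + idf * f * (k1 + 1) /
                (f + k1 * (1 - b + b * doc_len [i] / avg_len))
        }
        s
    }, numeric (1))
}

# Naively tokenised, made-up package descriptions.
docs <- list (
    pkgA = c ("download", "spatial", "data"),
    pkgB = c ("plot", "data"),
    pkgC = c ("parse", "text", "files")
)
bm25 <- bm25_scores (c ("spatial", "data"), docs)
names (bm25) <- names (docs)
bm25_ranks <- rank (-bm25, ties.method = "min")
```

In both cases only the resulting ranks, not the raw scores, are carried forward, which is what allows scores on very different scales to be combined later.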

Ranks for text matches are generally obtained from package text both including and excluding function descriptions. This results in up to four scores for each input. These scores are then combined into a final ranking using the Reciprocal Rank Fusion (RRF) algorithm. The llm_proportion parameter determines the extent to which the final ranking weights the LLM versus BM25 components.
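As a rough illustration of Reciprocal Rank Fusion, the sketch below sums 1 / (k + rank) over several rankings and re-ranks the totals. The constant k = 60 is a commonly used value, and the way llm_proportion is applied here (as a weight on the LLM columns, with 1 - llm_proportion on the BM25 columns) is an assumption made for illustration; the documentation above does not specify how pkgmatch applies the weighting internally. The column names, package names, rank values, and helper names rrf and fuse are all hypothetical.

```r
# Weighted Reciprocal Rank Fusion over a matrix of ranks
# (rows = packages, columns = individual rankings).
rrf <- function (ranks, weights = rep (1, ncol (ranks)), k = 60) {
    scores <- as.vector ((1 / (k + ranks)) %*% weights)
    names (scores) <- rownames (ranks)
    scores
}

# Made-up ranks for three hypothetical packages from four rankings.
ranks <- cbind (
    llm_with_fns = c (1, 2, 3),
    llm_without_fns = c (1, 2, 3),
    bm25_with_fns = c (3, 1, 2),
    bm25_without_fns = c (3, 1, 2)
)
rownames (ranks) <- c ("pkgA", "pkgB", "pkgC")

# Assumed weighting scheme: LLM columns get llm_proportion,
# BM25 columns get the remainder.
fuse <- function (ranks, llm_proportion = 0.5) {
    w <- c (
        llm_proportion, llm_proportion,
        1 - llm_proportion, 1 - llm_proportion
    )
    rank (-rrf (ranks, w), ties.method = "min")
}

final_even <- fuse (ranks, llm_proportion = 0.5)
final_llm <- fuse (ranks, llm_proportion = 1)
final_bm25 <- fuse (ranks, llm_proportion = 0)
```

Under this scheme, llm_proportion = 1 reproduces the LLM ordering, llm_proportion = 0 reproduces the BM25 ordering, and intermediate values blend the two.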

Finally, all components of this function are locally cached for each call (by the memoise package), so additional calls to this function with the same input and corpus should be much faster than initial calls. This means the effect of changing llm_proportion can easily be examined by simply repeating calls to this function.
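The idea behind this caching can be sketched in base R. pkgmatch itself relies on the memoise package; the code below is only a simplified stand-in that shows why a repeated call with the same input avoids recomputation. The helper memoise_simple and the placeholder function slow_embed are hypothetical and not part of pkgmatch.

```r
# Minimal memoisation: results are stored in an environment keyed by input.
memoise_simple <- function (f) {
    cache <- new.env ()
    function (x) {
        key <- as.character (x)
        if (!exists (key, envir = cache, inherits = FALSE)) {
            assign (key, f (x), envir = cache)
        }
        get (key, envir = cache, inherits = FALSE)
    }
}

# Placeholder for an expensive step; counts how often it really runs.
n_calls <- 0
slow_embed <- function (txt) {
    n_calls <<- n_calls + 1
    nchar (txt)
}
fast_embed <- memoise_simple (slow_embed)

input <- "Download open spatial data from NASA"
r1 <- fast_embed (input) # computed
r2 <- fast_embed (input) # served from the cache
```

After both calls, n_calls is 1: the underlying function ran once and the second result came from the cache.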

Usage

pkgmatch_similar_pkgs(
  input,
  corpus = "ropensci",
  embeddings = NULL,
  idfs = NULL,
  input_is_code = text_is_code(input),
  llm_proportion = 0.5,
  n = 5L,
  browse = FALSE
)

Arguments

input

Either a path to local source code of an R package, or a text string.

corpus

If the embeddings or idfs parameters are not specified, they are automatically downloaded for the corpus specified by this parameter. Must be one of "ropensci" or "cran". The function will then return the most similar packages from the specified corpus.

embeddings

Large Language Model embeddings for all rOpenSci packages, generated from pkgmatch_embeddings_from_pkgs. If not provided, pre-generated embeddings will be downloaded and stored in a local cache directory.

idfs

Inverse Document Frequency tables for all rOpenSci packages, generated from pkgmatch_bm25. If not provided, pre-generated IDF tables will be downloaded and stored in a local cache directory.

input_is_code

A binary flag indicating whether input is code or plain text. Ignored if input is a path to a local package; otherwise can be used to force appropriate interpretation of input type.

llm_proportion

A value between 0 and 1 to control the relative contributions of results from Large Language Models ("LLMs") versus results from traditional token-frequency models. Final rankings are generated by combining these two kinds of results, so that llm_proportion = 0 will return results from token frequency analyses only, while llm_proportion = 1 will return results from LLMs only.

n

When the result of this function is printed to screen, the top n packages will be displayed.

browse

If TRUE, automatically open webpages of the top n matches in local browser.

Value

A data.frame with a "package" column naming packages, and one or more columns of package ranks in terms of text similarity and, if input is a local path to an entire R package, of similarity in code structure. As described above, the default print method prints package names only. To see the full result, use as.data.frame().

Note

The first time this function is run without passing either embeddings or idfs, required values will be automatically downloaded and stored in a locally persistent cache directory. Especially for the "cran" corpus, this downloading may take quite some time.

See also

input_is_code

Other main: pkgmatch_similar_fns()

Examples

if (FALSE) { # \dontrun{
input <- "Download open spatial data from NASA"
p <- pkgmatch_similar_pkgs (input)
p # Default print method, lists 5 best matching packages
head (p) # Shows first 5 rows of full `data.frame` object
# This second call will be much faster than first call:
p2 <- pkgmatch_similar_pkgs (input, llm_proportion = 0.25)
} # }