This function accepts a vector of either names of installed packages, or paths to local source code directories, and calculates language model (LM) embeddings both for text descriptions within each package (documentation, including that of functions), and for the entire code base. Embeddings may also be calculated separately for all function descriptions.

The embeddings are currently retrieved from a local 'ollama' server (https://ollama.com) running Jina AI embeddings (https://ollama.com/jina/jina-embeddings-v2-base-en for text, and https://ollama.com/ordis/jina-embeddings-v2-base-code for code).

Usage

pkgmatch_embeddings_from_pkgs(
  packages = NULL,
  n_chunks = 5L,
  functions_only = FALSE
)

Arguments

packages

A vector of either names of installed packages, or local paths to directories containing R packages.

n_chunks

Number of randomly permuted chunks of input text to use to generate average embeddings. Values should generally be > 1, because the text of many packages exceeds the context window for the language models, and so permutations ensure that all text is captured in resultant embeddings. Note, however, that computation times scale linearly with this value.

functions_only

If TRUE, calculate embeddings for function descriptions only. This is intended to generate a separate set of embeddings which can then be used to match plain-text queries of functions, rather than entire packages.
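
The role of n_chunks can be illustrated with a minimal sketch. This is not the package's implementation: toy_embed() and embed_permuted() are hypothetical helpers, and the "embedding" is a simple word-hashing stand-in for a language model that only sees the first `window` words of its input. Averaging over several random permutations lets words beyond the window contribute to the result, at a cost that grows linearly with the number of permutations.

```r
# Toy stand-in for a language-model embedding call (hypothetical; not part of
# pkgmatch). Only the first `window` words are used, mimicking a limited
# context window.
toy_embed <- function (words, window = 50L, n_dim = 8L) {
    words <- utils::head (words, window)
    emb <- numeric (n_dim)
    for (w in words) {
        # Hash each word to one of `n_dim` bins and count it there
        i <- sum (utf8ToInt (w)) %% n_dim + 1L
        emb [i] <- emb [i] + 1
    }
    emb / max (1, length (words))
}

# Average embeddings over `n_chunks` random permutations of the input, so
# that text beyond the window also contributes to the result.
embed_permuted <- function (words, n_chunks = 5L, window = 50L, n_dim = 8L) {
    embs <- vapply (seq_len (n_chunks), function (i) {
        toy_embed (sample (words), window = window, n_dim = n_dim)
    }, numeric (n_dim))
    rowMeans (embs) # one averaged value per embedding dimension
}

set.seed (1L)
words <- paste0 ("word", seq_len (200L)) # longer than the toy window
emb <- embed_permuted (words, n_chunks = 5L)
length (emb)
```

Each call to toy_embed() here sees a different random 50 of the 200 words, which is why more chunks give a more representative average.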

Value

If !functions_only, a list of three matrices of embeddings: two for the text descriptions of the specified packages, one including and one excluding the individual descriptions of all package functions, and one for the entire code base. If functions_only, a single matrix of embeddings for all function descriptions.
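
As a rough illustration of how a returned matrix might be used for matching, the sketch below ranks columns by cosine similarity to a query vector. The matrix values and the query are made up, cosine_sim() is a hypothetical helper, and cosine similarity is an assumption here; the package's own matching functions may use a different measure. The only property taken from this page is that each column corresponds to one function (or package).

```r
# Illustrative stand-in for a returned embeddings matrix: one column per
# function, one row per embedding dimension (values are made up).
emb_fns <- matrix (
    c (1, 0, 0, 1,
       0, 1, 1, 0,
       1, 1, 0, 0),
    nrow = 4L,
    dimnames = list (
        NULL,
        c ("curl::curl", "curl::curl_download", "curl::curl_fetch")
    )
)

# Hypothetical embedding of a plain-text query, in the same space
query <- c (0.1, 0.9, 1.0, 0.0)

# Cosine similarity between a query vector and every column of a matrix
cosine_sim <- function (m, q) {
    as.vector (crossprod (m, q)) / (sqrt (colSums (m^2)) * sqrt (sum (q^2)))
}

sims <- cosine_sim (emb_fns, query)
names (sims) <- colnames (emb_fns)
best <- names (sims) [which.max (sims)] # name of the closest column
sort (sims, decreasing = TRUE)
```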

Note

Although it is technically much faster to perform the extraction of text and code in parallel, doing so generates unpredictable errors in extracting tarballs, which frequently cause the whole process to crash. The only way to safely ensure that all tarballs are successfully extracted and all code parsed is to run this single-threaded.

See also

Other embeddings: pkgmatch_embeddings_from_text()

Examples

packages <- "curl"
emb_fns <- pkgmatch_embeddings_from_pkgs (packages, functions_only = TRUE)
#> Generating text embeddings for function descriptions ...
colnames (emb_fns) # All functions in the package
#>  [1] "curl::Gettinginmemory"         "curl::Downloadingtodisk"      
#>  [3] "curl::Streamingdata"           "curl::#Nonblockingconnections"
#>  [5] "curl::Asyncrequests"           "curl::Errorautomatically"     
#>  [7] "curl::Checkmanually"           "curl::Settinghandleoptions"   
#>  [9] "curl::ENUM(long)options"       "curl::DisablingHTTP/2"        
#> [11] "curl::Readingcookies"          "curl::Onreusinghandles"       
#> [13] "curl::Postingforms"            "curl::Usingpipes"             
#> [15] "curl::curl"                    "curl::curl_download"          
#> [17] "curl::curl_echo"               "curl::curl_escape"            
#> [19] "curl::curl_fetch"             
emb_pkg <- pkgmatch_embeddings_from_pkgs (packages, functions_only = FALSE)
#> Generating text embeddings [1 / 2] ...
#> Generating text embeddings [2 / 2] ...
#> Generating code embeddings ...
names (emb_pkg)
#> [1] "text_with_fns" "text_wo_fns"   "code"         
colnames (emb_pkg$text_with_fns) # "curl"
#> [1] "curl"