
Return raw embeddings from package text and function definitions.
Source: R/embeddings.R
pkgmatch_embeddings_from_pkgs.Rd
This function accepts a vector of either names of installed packages or paths to local source code directories, and calculates language model (LM) embeddings both for the text descriptions within each package (documentation, including that of functions) and for the entire code base. Embeddings may also be calculated separately for all function descriptions.
The embeddings are currently retrieved from a local 'ollama' server (https://ollama.com) running Jina AI embeddings (https://ollama.com/jina/jina-embeddings-v2-base-en for text, and https://ollama.com/ordis/jina-embeddings-v2-base-code for code).
Arguments
- packages
A vector of either names of installed packages, or local paths to directories containing R packages.
- n_chunks
Number of randomly permuted chunks of input text to use to generate average embeddings. Values should generally be > 1, because the text of many packages exceeds the context window for the language models, and so permutations ensure that all text is captured in resultant embeddings. Note, however, that computation times scale linearly with this value.
- functions_only
If TRUE, calculate embeddings for function descriptions only. This is intended to generate a separate set of embeddings which can then be used to match plain-text queries of functions, rather than entire packages.
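The effect of n_chunks can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: toy_embed() is a hypothetical stand-in for a language model that only "sees" the first few words of its input (mimicking a limited context window), and average_embedding() is a hypothetical helper that averages embeddings over random permutations of the words. How pkgmatch actually splits and permutes text is not described here and may differ.

```r
# Hypothetical toy embedding: only the first `window` words are "seen",
# mimicking a limited context window. This is NOT the Jina model.
toy_embed <- function (words, window = 5L, dims = 8L) {
    seen <- utils::head (words, window)
    vec <- numeric (dims)
    for (w in seen) {
        # Map each word to one of `dims` buckets via its character codes
        idx <- sum (utf8ToInt (w)) %% dims + 1L
        vec [idx] <- vec [idx] + 1
    }
    vec
}

# Hypothetical helper: average the toy embedding over `n_chunks` random
# permutations, so that words beyond the window also contribute.
average_embedding <- function (words, n_chunks = 5L, window = 5L, dims = 8L) {
    embs <- vapply (seq_len (n_chunks), function (i) {
        toy_embed (sample (words), window = window, dims = dims)
    }, numeric (dims))
    rowMeans (embs) # one averaged vector of length `dims`
}

set.seed (1)
txt <- "curl is a modern and flexible web client for R with support for many protocols"
words <- strsplit (txt, " ") [[1]]
emb1 <- average_embedding (words, n_chunks = 1L)
emb20 <- average_embedding (words, n_chunks = 20L)
```

With n_chunks = 1 only a single window of words contributes to the result, whereas larger values spread the average over more of the input, at the cost of one embedding call per chunk.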
Value
If !functions_only, a list of two matrices of embeddings: one for the text descriptions of the specified packages, including individual descriptions of all package functions, and one for the entire code base. For functions_only, a single matrix of embeddings for all function descriptions.
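As a rough sketch of how such a matrix might be used to match a query, the following ranks columns by cosine similarity. The matrix here is synthetic, with one column per function as suggested by the column names in the Examples; the values, the query vector, and the cosine_sim() helper are all hypothetical, and treating rows as embedding dimensions is an assumption.

```r
# Synthetic stand-in for a matrix returned with `functions_only = TRUE`:
# one column per function, one row per (assumed) embedding dimension.
emb_fns <- matrix (
    c (1, 0, 0, 2,
       0, 1, 1, 0,
       1, 1, 0, 0),
    nrow = 4,
    dimnames = list (NULL, c ("curl::curl", "curl::curl_download", "curl::curl_escape"))
)

# Hypothetical query embedding, deliberately close to the second column
query <- emb_fns [, "curl::curl_download"] + 0.1

# Hypothetical helper: cosine similarity between each column of `m` and `v`
cosine_sim <- function (m, v) {
    as.numeric (crossprod (m, v)) / (sqrt (colSums (m^2)) * sqrt (sum (v^2)))
}

sims <- cosine_sim (emb_fns, query)
names (sims) <- colnames (emb_fns)
best <- names (sims) [which.max (sims)]
```

In practice the query vector would need to come from the same text model as the stored embeddings, otherwise the similarities are not meaningful.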
Note
Although it is technically much faster to perform the extraction of text and code in parallel, doing so generates unpredictable errors in extracting tarballs, which frequently cause the whole process to crash. The only way to safely ensure that all tarballs are successfully extracted and all code is parsed is to run this single-threaded.
See also
Other embeddings:
pkgmatch_embeddings_from_text()
Examples
packages <- "curl"
emb_fns <- pkgmatch_embeddings_from_pkgs (packages, functions_only = TRUE)
#> Generating text embeddings for function descriptions ...
colnames (emb_fns) # All functions in the package
#> [1] "curl::Gettinginmemory" "curl::Downloadingtodisk"
#> [3] "curl::Streamingdata" "curl::#Nonblockingconnections"
#> [5] "curl::Asyncrequests" "curl::Errorautomatically"
#> [7] "curl::Checkmanually" "curl::Settinghandleoptions"
#> [9] "curl::ENUM(long)options" "curl::DisablingHTTP/2"
#> [11] "curl::Readingcookies" "curl::Onreusinghandles"
#> [13] "curl::Postingforms" "curl::Usingpipes"
#> [15] "curl::curl" "curl::curl_download"
#> [17] "curl::curl_echo" "curl::curl_escape"
#> [19] "curl::curl_fetch"
emb_pkg <- pkgmatch_embeddings_from_pkgs (packages, functions_only = FALSE)
#> Generating text embeddings [1 / 2] ...
#> Generating text embeddings [2 / 2] ...
#> Generating code embeddings ...
names (emb_pkg)
#> [1] "text_with_fns" "text_wo_fns" "code"
colnames (emb_pkg$text_with_fns) # "curl"
#> [1] "curl"