This vignette presumes that you’ve read the “How does pkgmatch work?” vignette, which initially explains that,

The “pkgmatch” package finds packages, as well as individual functions, which best match a given input. Inputs can be text descriptions, sections of code, or even entire R packages. “pkgmatch” finds the most closely matching packages using a combination of Language Models (LMs, or equivalently, “LLMs” for large language models), and traditional token-frequency algorithms.

This vignette digs more deeply into the question of why “pkgmatch” may sometimes fail to produce expected results. In answering that question, it is important to understand that LMs effectively rely on compressed representations of input data. That compression is in the form of vectors of “embeddings” which transform textual input to vectors of numeric values. The vectors here, and in many LM-based systems, comprise 768 individual values. No matter how long an input is, it will always be represented in the embedding space by a vector of 768 numeric elements. This representation is thus inherently “lossy”, and therefore inherently approximate. “pkgmatch” works by matching the embedding vectors of any input to pre-computed data sets of embedding vectors from the specific corpora. Because all embeddings are approximate, matching is also unavoidably approximate. Any expected match may thus not necessarily appear as the best-matched result from {pkgmatch}. Nevertheless, to the extent that the approximations are accurate, expected matches should appear somewhere within the first few matches. This vignette explores why even that approximate expectation may sometimes fail to happen.
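
The lossy nature of a fixed-length representation can be illustrated with a deliberately crude sketch. The toy_embed() function below is hypothetical, and is not how a language model or “pkgmatch” computes embeddings: it simply hashes each word into one of 768 slots by summing character codes. It nevertheless shows the two properties described above: inputs of any length map to the same number of values, and distinct inputs can become indistinguishable.

```r
# Hypothetical toy "embedding": counts of words hashed into `dim` slots.
# This is NOT a language-model embedding; it only illustrates lossiness.
toy_embed <- function (text, dim = 768L) {
    tokens <- strsplit (tolower (text), "[^a-z]+") [[1]]
    tokens <- tokens [nzchar (tokens)]
    vec <- numeric (dim)
    for (tok in tokens) {
        # Crude hash: sum of character codes, wrapped into 1..dim
        i <- sum (utf8ToInt (tok)) %% dim + 1L
        vec [i] <- vec [i] + 1
    }
    vec
}

short <- toy_embed ("dates and times")
long_text <- paste (
    rep ("a much longer description of dates and times", 100),
    collapse = " "
)
long <- toy_embed (long_text)

# Both inputs are represented by the same number of values:
length (short) == length (long)

# Information is lost: two different words collide in this toy scheme.
identical (toy_embed ("dates"), toy_embed ("sated"))
```

Real embeddings are far more sophisticated than this, but the same constraint applies: a fixed number of values cannot preserve every detail of an arbitrarily long input.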

Note that all results in this vignette are pre-generated and hard-coded, because they rely on complex and time-consuming outputs of locally-running language models. While all code within the entire vignette may be run directly as is, results may differ slightly from those shown here.

Demonstration of unexpected results

To start, we need to load the package:

library (pkgmatch)
#> ollama is not installed. Please follow installation instructions at https://ollama.com.

The entire vignette is based on results from a single prompt, for which the expected result is the lubridate package:

input <- "Package that works with dates and times in tidy format"

By default, the object returned from the pkgmatch_similar_pkgs() function prints the top five matching packages:

pkgmatch_similar_pkgs (input, corpus = "cran")
#> [1] "datefixR"        "fpp3"            "rebus.datetimes" "sweep"          
#> [5] "datetime"

And while expected results may not necessarily always be in the first position, the absence of lubridate from the top five matches is indeed unexpected. Understanding why this arises requires diving into the code used by {pkgmatch} to obtain those matches. The following sub-sections re-generate full package embeddings for the five packages listed above plus the expected result of lubridate, in order to compare results for that expected package with those of the five others.

Obtain package source

The following code downloads and extracts tarballs of source code for the five packages listed above, plus lubridate.

pkgs <- c (
    "datefixR", "fpp3", "rebus.datetimes", "sweep", "datetime", "lubridate"
)
path <- utils::download.packages (
    pkgs,
    destdir = fs::path_temp (),
    repos = "https://cloud.r-project.org"
)
chk <- lapply (path [, 2], function (p) {
    utils::untar (p, exdir = fs::path_temp (), tar = "internal")
})
pkg_paths <- fs::path (fs::path_temp (), pkgs)
stopifnot (all (fs::dir_exists (pkg_paths)))

Embeddings from package source

Embeddings used in {pkgmatch} are generated from the pkgmatch_embeddings_from_pkgs() function. This function accepts one main input parameter specifying paths to one or more local directories containing source code.

emb <- pkgmatch_embeddings_from_pkgs (pkg_paths)
str (emb)
#> Generating text embeddings [1 / 2] ...
#> Generating text embeddings [2 / 2] ...
#> Generating code embeddings ...
#> List of 3
#>  $ text_with_fns: num [1:768, 1:6] -0.838 0.669 0.202 -0.686 -0.985 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:6] "datefixR" "fpp3" "rebus.datetimes" "sweep" ...
#>  $ text_wo_fns  : num [1:768, 1:6] 0.201 0.808 -0.34 -0.186 0.858 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:6] "datefixR" "fpp3" "rebus.datetimes" "sweep" ...
#>  $ code         : num [1:768, 1:6] 0.482 -0.588 0.226 -0.889 -0.949 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:6] "datefixR" "fpp3" "rebus.datetimes" "sweep" ...

Those three list items in the returned object are embeddings from the full package text including text of all function documentation entries (text_with_fns); the equivalent text without function documentation entries (text_wo_fns), and the entire package source code represented as a single character vector (code).

Generating matches from those embeddings

The embedding vector corresponding to the input is then compared to these embedding vectors in order to find the best-matched package. Similarities between embeddings are measured here, as in the vast majority of LM applications, using cosine similarity.
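
As a reminder of what cosine similarity measures, here is a minimal base-R sketch. The cosine_sim() helper and the three-element vectors are illustrative assumptions only; they are not the package's implementation, and real embeddings here have 768 elements.

```r
# Hypothetical helper: cosine of the angle between two numeric vectors.
cosine_sim <- function (a, b) {
    sum (a * b) / (sqrt (sum (a^2)) * sqrt (sum (b^2)))
}

input_vec <- c (1, 2, 3)

# Toy "embedding matrix" with one column per package, in the same layout
# as the matrices returned above (rows = dimensions, columns = packages).
emb_matrix <- cbind (
    pkg_a = c (1, 2, 3),  # same direction as the input
    pkg_b = c (-3, 0, 1), # orthogonal to the input
    pkg_c = c (3, 2, 1)   # somewhere in between
)

sims <- apply (emb_matrix, 2, function (col) cosine_sim (input_vec, col))
sort (sims, decreasing = TRUE)
```

Values close to 1 indicate vectors pointing in nearly the same direction, regardless of their lengths. The calls that follow use the package's own cosine_similarity() function rather than this helper.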

emb_input <- get_embeddings (input, code = FALSE)
simil_with_fns <- cosine_similarity (emb_input, emb$text_with_fns)
simil_wo_fns <- cosine_similarity (emb_input, emb$text_wo_fns)
names (simil_with_fns) [2] <- "simil_with_fns"
names (simil_wo_fns) [2] <- "simil_wo_fns"
similarities <- dplyr::left_join (simil_with_fns, simil_wo_fns, by = "package")
print (similarities)
#>           package simil_with_fns simil_wo_fns
#> 1 rebus.datetimes      0.8304852    0.8186134
#> 2        datetime      0.8202274    0.8351010
#> 3        datefixR      0.8194707    0.8186624
#> 4       lubridate      0.7905832    0.7940536
#> 5           sweep      0.7714603    0.7984862
#> 6            fpp3      0.7071283    0.8093774

That generates two vectors of similarities, with values for each package generated from embeddings calculated both with and without text from function documentation. Matching in “pkgmatch” also uses additional data from “BM25” values for inverse word frequencies. These are derived by calculating word frequencies across the entire corpus, weighting each word in an input by the inverse of these frequencies, and then matching the result to word frequency vectors for each target package. The following lines load the pre-computed “inverse document frequencies” for the corpus, and then use those to calculate BM25 scores for the input. Since the “idfs” include data for the entire CRAN corpus, this code also filters final scores down to our selection of six packages only, noting that the “package” column contains full names of tarballs.

idfs <- pkgmatch_load_data (what = "idfs", corpus = "cran")
bm25 <- pkgmatch_bm25 (input = input, idfs = idfs, corpus = "cran") |>
    dplyr::mutate (package = gsub ("\\_.*$", "", package)) |>
    dplyr::filter (package %in% pkgs)
similarities <- dplyr::left_join (similarities, bm25, by = "package")
print (similarities)
#>           package simil_with_fns simil_wo_fns bm25_with_fns bm25_wo_fns
#> 1 rebus.datetimes      0.8304852    0.8186134    12.7782429  13.1704148
#> 2        datetime      0.8202274    0.8351010    12.6079534  12.6606973
#> 3        datefixR      0.8194707    0.8186624    13.5863247  16.3750351
#> 4       lubridate      0.7905832    0.7940536     0.2140736   0.6645225
#> 5           sweep      0.7714603    0.7984862    15.6956661  15.0961380
#> 6            fpp3      0.7071283    0.8093774    12.2372817  14.8849026
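
For readers unfamiliar with BM25, the following is a minimal, self-contained sketch of the standard Okapi BM25 formula applied to a toy corpus. The documents, the bm25_scores() helper, and the parameter values k1 = 1.2 and b = 0.75 are illustrative assumptions; the implementation inside pkgmatch_bm25() may differ in tokenisation, parameters, and weighting.

```r
# Toy corpus: each "document" is a vector of words (all hypothetical).
docs <- list (
    pkg_a = c ("parse", "dates", "and", "times"),
    pkg_b = c ("tidy", "tools", "for", "time", "series", "forecasting"),
    pkg_c = c ("regular", "expressions", "for", "dates", "and", "times"),
    pkg_d = c ("spatial", "mapping", "tools")
)
query <- c ("dates", "times", "tidy")

# Hypothetical helper implementing the textbook BM25 score.
bm25_scores <- function (query, docs, k1 = 1.2, b = 0.75) {
    n_docs <- length (docs)
    avg_len <- mean (lengths (docs))
    # Number of documents containing each query word:
    n_containing <- vapply (query, function (q) {
        as.numeric (sum (vapply (docs, function (d) q %in% d, logical (1L))))
    }, numeric (1L))
    # Inverse document frequency: rare words get higher weights.
    idf <- log ((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1)
    vapply (docs, function (d) {
        # Term frequency of each query word within this document:
        tf <- vapply (query, function (q) as.numeric (sum (d == q)), numeric (1L))
        # Longer-than-average documents are penalised:
        len_norm <- 1 - b + b * length (d) / avg_len
        sum (idf * tf * (k1 + 1) / (tf + k1 * len_norm))
    }, numeric (1L))
}

scores <- bm25_scores (query, docs)
sort (scores, decreasing = TRUE)
```

A document sharing no words with the query scores exactly zero, which is the toy equivalent of the very low BM25 values seen for lubridate above: the score depends on literal word overlap, not on meaning.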

Finally, these similarities are combined using a “reranking” function. For “pkgmatch”, this function by default uses only the similarities calculated without function documentation, that is, only two of the four columns shown above (simil_wo_fns and bm25_wo_fns), and combines them using an lm_proportion parameter with a default value of 0.5, which gives equal weight to similarities from embeddings and from BM25 values.

pkgmatch_rerank (similarities)
#>           package rank
#> 1        datefixR    1
#> 2        datetime    2
#> 3           sweep    3
#> 4 rebus.datetimes    4
#> 5            fpp3    5
#> 6       lubridate    6

And as expected, lubridate is the lowest-ranked of the six packages. Note also that the BM25 scores for lubridate are much lower than for any of the other packages. This indicates that the actual words used in the input prompt are only a poor match for the actual words used within the text of the lubridate package, whereas they are a much better match for the other five top-matched packages. Nevertheless, lubridate is also the worst-matched package in terms of similarities without function documentation text. Thus, regardless of how LM and BM25 values are combined, lubridate is never included in the top five matches.
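
That last claim can be checked with a short sketch using the values printed above. The combine_ranks() helper is a hypothetical stand-in which takes a weighted mean of the two rank orders; it is an assumption made for illustration and is not the algorithm used by pkgmatch_rerank().

```r
# Values copied from the "simil_wo_fns" and "bm25_wo_fns" columns above.
scores <- data.frame (
    package = c (
        "rebus.datetimes", "datetime", "datefixR",
        "lubridate", "sweep", "fpp3"
    ),
    simil_wo_fns = c (
        0.8186134, 0.8351010, 0.8186624,
        0.7940536, 0.7984862, 0.8093774
    ),
    bm25_wo_fns = c (
        13.1704148, 12.6606973, 16.3750351,
        0.6645225, 15.0961380, 14.8849026
    ),
    stringsAsFactors = FALSE
)

# Hypothetical helper: weighted mean of the two rank orders (1 = best).
combine_ranks <- function (scores, lm_proportion = 0.5) {
    rank_lm <- rank (-scores$simil_wo_fns)
    rank_bm25 <- rank (-scores$bm25_wo_fns)
    combined <- lm_proportion * rank_lm + (1 - lm_proportion) * rank_bm25
    scores$package [order (combined)]
}

# Find the last-ranked package for proportions from 0 (BM25 only)
# through to 1 (LM only).
worst <- vapply (seq (0, 1, by = 0.1), function (p) {
    res <- combine_ranks (scores, lm_proportion = p)
    res [length (res)]
}, character (1L))
unique (worst)
```

Because lubridate is ranked last on both measures, no weighting of the two can lift it out of last place under this kind of combination.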

While BM25 values are derived from empirical word frequencies, and so are effectively fixed, the LM similarity metrics are derived from embeddings, and as stated at the outset, these are inherently approximate. The next section describes how {pkgmatch} might be refined to improve the accuracy of these approximate representations.


Modifying embeddings through chunking

One common strategy to improve the accuracy of LM embeddings is “chunking”, which refers to taking different “chunks” of an input that generally extends beyond the admissible “context window” of the model. The models used here have context windows of 8096 tokens, which is often enough to enter the entire code or text of a package. Chunking effects can nevertheless be achieved through permuting components of the input in different orders. The embedding for “This plus that. Then those.” will differ from that for “Then those. This plus that.”:

head (get_embeddings ("This plus that. Then those."))
head (get_embeddings ("Then those. This plus that."))
#>            [,1]
#> [1,] -0.5418470
#> [2,]  0.1983769
#> [3,]  1.1284385
#> [4,]  0.2710975
#> [5,]  0.1558345
#> [6,]  0.0710893
#>             [,1]
#> [1,] -0.57014912
#> [2,]  0.22628722
#> [3,]  1.09075117
#> [4,]  0.24393138
#> [5,]  0.11764651
#> [6,]  0.09092502

Those two sets of embeddings are quite similar, and yet not identical. Chunking is generally used to generate several embeddings for any given input, with similarity metrics averaged over these several embeddings then providing a more accurate approximation to the underlying textual similarities. In other words, chunking reduces the noise inherent in approximate representation through embeddings.
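
The noise-reduction effect of averaging can be illustrated with a purely simulated sketch. Nothing here comes from a language model: the “true” similarity, the noise level, and the assumption that errors are independent and normally distributed are all hypothetical choices made only to show the principle.

```r
set.seed (1)

true_similarity <- 0.8 # hypothetical underlying similarity
noise_sd <- 0.02       # hypothetical noise from a single embedding
n_trials <- 1000L
n_permutations <- 5L

# Similarity estimated from one embedding per trial:
single <- true_similarity + rnorm (n_trials, sd = noise_sd)

# Similarity estimated as the mean over several permuted embeddings:
averaged <- replicate (
    n_trials,
    mean (true_similarity + rnorm (n_permutations, sd = noise_sd))
)

# Spread of the two kinds of estimate around the true value:
c (single = sd (single), averaged = sd (averaged))
```

Under these assumptions, the spread of the averaged estimates shrinks roughly in proportion to the square root of the number of permutations. Embeddings of permuted versions of the same text are unlikely to be fully independent, so the gain in practice may be smaller.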

We will now demonstrate how chunking can be applied within the general “pkgmatch” workflow shown above. The following code modifies the procedure used within the get_embeddings() function, to create differently permuted chunks of the package input. (This requires calling two non-exported functions using the triple-colon operator, :::, in contrast to the standard double colon, ::, for exported functions.)

txt_with_fns <- vapply (pkg_paths, function (p) pkgmatch:::get_pkg_text (p), character (1L))
txt_wo_fns <- pkgmatch:::rm_fns_from_pkg_txt (txt_with_fns)

The get_pkg_text() function inserts markdown-formatted section headers. These can then be used to break the text into sections which can then be randomly rearranged to form new chunks. The following code generates embeddings from differently-ordered chunks (excluding text from function documentation).

n_permutations <- 5L
permute_text <- function (text_input) {
    txt <- strsplit (text_input, "#+") [[1]]
    index <- order (runif (length (txt)))
    paste0 (txt [index], collapse = "\\n")
}

embeddings <- lapply (pkg_paths, function (p) {
    txt_with_fns <- pkgmatch:::get_pkg_text (p)
    txt_wo_fns <- pkgmatch:::rm_fns_from_pkg_txt (txt_with_fns) [[1]]

    do.call (cbind, lapply (
        seq_len (n_permutations),
        function (j) { get_embeddings (permute_text (txt_wo_fns)) }
    ))
})
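
To see what this kind of permutation does to a single text, the following sketch applies the same idea to a small, made-up string with markdown-style headers. The toy_text, split_sections(), and permute_sections() names are hypothetical, and this variant additionally drops empty sections and uses sample() for the shuffle.

```r
# Made-up package text with markdown-style section headers.
toy_text <- paste0 (
    "## Description\nWorks with dates.\n",
    "## Details\nHandles time zones.\n",
    "## Functions\nymd, hms.\n"
)

# Hypothetical helper: split on runs of "#", dropping empty pieces.
# Splitting on the header markers also removes them from the text.
split_sections <- function (text_input) {
    sections <- strsplit (text_input, "#+") [[1]]
    sections [nzchar (trimws (sections))]
}

# Hypothetical helper: re-join the sections in a random order.
permute_sections <- function (text_input) {
    sections <- split_sections (text_input)
    paste0 (sample (sections), collapse = "\n")
}

set.seed (42)
sections <- split_sections (toy_text)
permuted <- permute_sections (toy_text)
cat (permuted)
```

Each permutation contains exactly the same sections, only in a different order, so any difference between the resulting embeddings reflects the approximate nature of the embedding rather than a change in content.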

Calculate similarities with those embeddings, and generate average similarities across the chunks for each package:

input_emb <- get_embeddings (input) [, 1]
similarities <- vapply (embeddings, function (emb) {
    colnames (emb) <- letters [seq_len (ncol (emb))]
    simil <- cosine_similarity (input_emb, emb, fns = FALSE)
    mean (simil$simil)
}, numeric (1L))
similarities <- data.frame (package = pkgs, similarity = similarities) |>
    dplyr::arrange (dplyr::desc (similarity))
print (similarities)
#>           package similarity
#> 1        datetime  0.8379673
#> 2 rebus.datetimes  0.8235914
#> 3        datefixR  0.8203636
#> 4       lubridate  0.7936824
#> 5           sweep  0.7849108
#> 6            fpp3  0.7724631

And averaging similarities across random chunks has increased the similarity between the input text and the lubridate package from last place up to fourth out of six. That is sufficient for lubridate to appear in the default list of top five matches seen when printing the output of “pkgmatch” functions.

Why not use chunking in pkgmatch?

Chunking like that demonstrated immediately above is very commonly used in many LM applications, notably including by almost all commercial API providers. The previous section also demonstrates that it can and does improve the results of “pkgmatch”. And yet it is not used here, for reasons we now explain.

The current procedures generate three embedding vectors for every R package in both the rOpenSci and CRAN corpora, with these embeddings updated every day. The embeddings are stored with a GitHub release of this package. That release lists all files of pre-calculated data needed in this package, and importantly shows their sizes as well. The data currently total over 300MB, with the largest single file being the embedding vectors for all CRAN packages, with a size of 227MB. Each permutation of package text and code would require an additional embedding vector, so that using five random permutations like the code above would increase the size of that file to well over one gigabyte. Anybody using this package must wait for these data to be downloaded. Downloading 227MB of data can already be very time consuming. The additional burden of forcing users to wait to download over 1GB of data in order to generate results would be a strong disincentive to using this package.

Equally importantly, these data are updated on a daily basis, and their generation requires energy. Each additional permutation of input chunks requires the energy used for current daily updates to be used again, with five chunks requiring five times the current energy usage.
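
The scaling described in the two preceding paragraphs amounts to simple arithmetic, sketched below. The only figures taken from the text are the 227MB file size and the five permutations; the assumption that file size and energy use both scale linearly with the number of embedding vectors per package is ours.

```r
cran_embeddings_mb <- 227 # current size of the CRAN embeddings file
n_permutations <- 5

# If five permuted embeddings replaced each current embedding:
size_replaced_mb <- cran_embeddings_mb * n_permutations

# If five permuted embeddings were stored alongside each current one:
size_added_mb <- cran_embeddings_mb * (1 + n_permutations)

c (replaced_mb = size_replaced_mb, added_mb = size_added_mb)

# Relative energy for daily updates, assuming cost per embedding is fixed:
energy_factor <- n_permutations
energy_factor
```

Either way of counting gives a file of more than one gigabyte for the CRAN corpus alone.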

In short: chunking may be effective in increasing the accuracy of LM outputs, but it also increases demands on memory, on data transfer, and on energy consumption. To remain generally useable by as many people as possible, and to reduce as far as possible the energy, bandwidth, and storage requirements of this package, chunking is not used here.

Recommendations for improving results

This entire vignette has shown how the numerical representations used within LMs are approximate, and demonstrated the consequences of those approximations. While chunking is an effective approach to numerically overcoming some of the negative effects of these approximations, it can be just as effective to provide more accurately descriptive textual input, as illustrated by the following code which uses lm_proportion = 1 to display match results from LM output only:

input <- paste0 (
    "package that works with dates and times in a tidy format and ",
    "overcomes unintuitive frustrating quirks with daylight savings, ",
    "time zones, and leap years"
)
pkgmatch_similar_pkgs (input, corpus = "cran", lm_proportion = 1)
#> [1] "timechange" "neatRanges" "fpp3"       "datetime"   "lubridate"

And lubridate appears in the top five matched packages after tweaking the input to more accurately match some of the “quirks” of the text of that package (which includes that word).