Skip to contents

This vignette presumes that you’ve read the “How does pkgmatch work?” vignette, which initially explains that,

The “pkgmatch” package package finds packages, as well as individual functions, which best match a given input. Inputs can be text descriptions, sections of code, or even entire R packages. “pkgmatch” finds the most closely matching packages using a combination of Language Models (LMs, or equivalently, “LLMs” for large language models), and traditional token-frequency algorithms.

This vignette digs more deeply into the question of why “pkgmatch” may sometimes fail to produce expected results. In answering that question, it is important to understand that LMs effectively rely on compressed representations of input data. That compression is in the form of vectors of “embeddings” which transform textual input to vectors of numeric values. The vectors here, and in many LM-based systems, comprise 768 individual values. No matter how long an input is, it will always be represented in the embedding space by a vector of 768 numeric elements. This representation is thus inherently “lossy”, and therefore inherently approximate. “pkgmatch” works by matching the embedding vectors of any input to pre-computed data sets of embedding vectors from the specific corpora. Because all embeddings are approximate, matching is also unavoidably approximate. Any expected match may thus not necessarily appear as the best-matched result from {pkgmatch}. Nevertheless, to the extent that the approximations are accurate, expected matches should appear somewhere within the first few matches. This vignette explores why even that approximate expectation may sometimes fail to happen.

Note that all results in this vignette are pre-generated and hard-coded, because they rely on complex and time-consuming outputs of locally-running language models. While all code within the entire vignette may be run directly as is, results may differ slightly from those shown here.

Demonstration of unexpected results

To start, we need to load the package:

The entire vignette is based on results from a single prompt, for which the expected result is the dplyr package:

input <- "Tidy data manipulation"

By default, the object returned from the pkmgatch_similar_pkgs() function() prints the top five matching packages:

pkgmatch_similar_pkgs (input, corpus = "cran")
#> [1] "tidyfst"   "dplyr"     "tidytable" "tidygraph" "romic"

And while expected results may not necessarily always be in the first position, dplyr indeed appears in the top five matches as unexpected. In contrast, consider the following example in which we expect to retrieve the lubridate package in the results:

input <- "Package that works with dates and times in tidy format"
pkgmatch_similar_pkgs (input, corpus = "cran")
#> [1] "iso8601"    "fpp3"       "tibbletime" "cleaner"    "grates"

And lubridate doesn’t even appear in the top five results. Understanding why this arises requires diving into the code used by {pkgmatch} to obtain those matches. The following sub-sections re-generate full packages embeddings for the five packages listed above plus the expected result of lubridate, in order to compare results for that expected package with those of the five others.

Obtain package source

The following code downloads and extracts tarballs of source code for the five packages listed above in our search for lubridate, plus that package itself.

pkgs <- c (
    "iso8601", "fpp3", "tibbletime", "cleaner", "grates", "lubridate"
)
path <- utils::download.packages (
    pkgs,
    destdir = fs::path_temp (),
    repos = "https://cloud.r-project.org"
)
chk <- lapply (path [, 2], function (p) {
    utils::untar (p, exdir = fs::path_temp (), tar = "internal")
})
pkg_paths <- fs::path (fs::path_temp (), pkgs)
stopifnot (all (fs::dir_exists (pkg_paths)))

Embeddings from package source

Embeddings used in {pkgmatch} are generated from the pkgmatch_embeddings_from_pkgs() function. This function accepts one main input parameter specifying paths to one or more local directories containing source code.

emb <- pkgmatch_embeddings_from_pkgs (pkg_paths)
str (emb)
#> Generating text embeddings [1 / 2] ...
#> Generating text embeddings [2 / 2] ...
#> Generating code embeddings ...
#> List of 3
#>  $ text_with_fns: num [1:768, 1:6] -0.838 0.669 0.202 -0.686 -0.985 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:6] "iso8601" "fpp3" "tibbletime" "cleaner" ...
#>  $ text_wo_fns  : num [1:768, 1:6] 0.201 0.808 -0.34 -0.186 0.858 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:6] "iso8601" "fpp3" "tibbletime" "cleaner" ...
#>  $ code         : num [1:768, 1:6] 0.482 -0.588 0.226 -0.889 -0.949 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:6] "iso8601" "fpp3" "tibbletime" "cleaner" ...

Those three list items in the returned object are embeddings from the full package text including text of all function documentation entries (text_with_fns); the equivalent text without function documentation entries (text_wo_fns), and the entire package source code represented as a single character vector (code).

Generating matches from those embeddings

The embedding vector corresponding to the input is then compared to these embedding vectors in order to find the best-matched package. Similarities between embeddings are measured here, as in the vast majority of LM applications, using cosine similarity.

emb_input <- get_embeddings (input, code = FALSE)
simil_with_fns <- cosine_similarity (emb_input, emb$text_with_fns)
simil_wo_fns <- cosine_similarity (emb_input, emb$text_wo_fns)
names (simil_with_fns) [2] <- "simil_with_fns"
names (simil_wo_fns) [2] <- "simil_wo_fns"
similarities <- dplyr::left_join (simil_with_fns, simil_wo_fns, by = "package")
print (similarities)
#>      package simil_with_fns simil_wo_fns
#> 1     grates      0.0123306   -0.0106373
#> 2       fpp3     -0.0031622    0.0051597
#> 3    iso8601     -0.0129680   -0.0519957
#> 4 tibbletime     -0.0201389    0.0264501
#> 5  lubridate     -0.0311186    0.0112935
#> 6    cleaner     -0.0342489   -0.0004984

That generates two vectors of similarities, with values for each package generated from embeddings calculated both with and without text from function documentation. Matching in “pkgmatch” also uses additional data from “BM25” values for inverse word frequencies. These are derived from calculating word frequencies across the entire corpora, weighting each word in an input by the inverse of these frequencies, and then matching the result to word frequency vectors for each target package. The following lines load the pre-computed “inverse document frequencies” for the corpus, and then use those to calculate BM25 scores for the input. Since the “idfs” include data for the entire CRAN corpus, this code also filters final scores down to our selection of six packages only, noting that the “package” column contains full names of tarballs.

idfs <- pkgmatch_load_data (what = "idfs", corpus = "cran")
bm25 <- pkgmatch_bm25 (input = input, idfs = idfs, corpus = "cran") |>
    dplyr::mutate (package = gsub ("\\_.*$", "", package)) |>
    dplyr::filter (package %in% pkgs)
similarities <- dplyr::left_join (similarities, bm25, by = "package")
print (similarities)
#>      package simil_with_fns simil_wo_fns bm25_with_fns bm25_wo_fns
#> 1     grates      0.0123306   -0.0106373      13.90114    17.37301
#> 2       fpp3     -0.0031622    0.0051597      14.62466    17.43921
#> 3    iso8601     -0.0129680   -0.0519957      14.55312    18.50910
#> 4 tibbletime     -0.0201389    0.0264501      15.85009    17.26252
#> 5  lubridate     -0.0311186    0.0112935      13.49620    16.67262
#> 6    cleaner     -0.0342489   -0.0004984      17.64303    20.56125

Finally, these similarities are combined using a “reranking” function. For “pkgmatch”, this function by default takes only similarities excluding function definitions, so only two of the four columns shown above, and combines them using an lm_proportion parameter with a default value of 0.5 for equal contributions of similarities from embeddings and BM25 values.

pkgmatch_rerank (similarities)
#>     packages rank
#> 1    cleaner    1
#> 2 tibbletime    2
#> 3       fpp3    3
#> 4    iso8601    4
#> 5  lubridate    5
#> 6     grates    6

And as expected, lubridate is the lowest-ranked of the six packages. Note also the the BM25 scores for lubridate are much lower than for any of the other packages. This indicates that the actual words used in the input prompt are only a poor match for the actual words used within the text of the lubridate package, whereas they are a much better match for the other five top-matched packages. Nevertheless, lubridate is also the worst-matched package in terms of similarities without function documentation text. Thus, regardless of how LM and BM25 values are combined, lubridate is never included in the top five matches.

While BM25 values are derived from empirical word frequencies, and so are effectively fixed, the LM similarity metrics are derived from embeddings, and as stated at the output, these are inherently approximate. The next section describes how {pkgmatch} might be refined to improve the accuracy of these approximate representations.


Modifying embeddings through chunking

One common strategy to improve the accuracy of LM embeddings is “chunking”, which refers to taking different “chunks” of an input that generally extends beyond the admissible “context window” of the model. The models used here have context windows of 8096 tokens, which is often not enough to enter the entire code or text of a package, which includes full text of all help files and vignettes. The embeddings generated for the two corpora used here, of rOpenSci and CRAN, are generated from five “chunks” or randomly-permuted versions of pacakge texts, to ensure that full texts are represented in resultant embeddings, and to reduce the influence of specific text ordering.

More generally, the effects of chunking can be illustrated by examining how the embedding for “This plus that. Then those.” differs to that for “Then those. This plus that.”:

head (get_embeddings ("This plus that. Then those."))
head (get_embeddings ("Then those. This plus that."))
#>            [,1]
#> [1,] -0.5418470
#> [2,]  0.1983769
#> [3,]  1.1284385
#> [4,]  0.2710975
#> [5,]  0.1558345
#> [6,]  0.0710893
#>             [,1]
#> [1,] -0.57014912
#> [2,]  0.22628722
#> [3,]  1.09075117
#> [4,]  0.24393138
#> [5,]  0.11764651
#> [6,]  0.09092502

Those two sets of embeddings are quite similar, and yet not identical. Chunking is generally used to generate several embeddings for any given input, with similarity metrics averaged over these several embeddings then providing a more accurate approximation to the underlying textual similarities. In other words, chunking reduces the noise inherent in approximate representation through embeddings.

We will now demonstrate how chunking can be applied within the general “pkgmatch” workflow shown above. The pkgmatch_embeddings_from_pkgs() function generates embeddings from a list of local paths to package source code. Values are averaged over five differently-permuted chunks of input text and code for each package.

embeddings <- pkgmatch_embeddings_from_pkgs (pkg_paths)

Those embeddings are a list of three matrices, named after the source used to generate each set: text_with_fn for embeddings from full package text including function help files; text_wo_fns for the same excluding function help files; and code for embeddings from package code. For out input, we only want to match package text, so we’ll just use the first of these. We can then use those embeddings to calculate similarities against randomly-permuted versions of our input string.

num_chunks <- 10L
input_split <- strsplit (input, "\\s") [[1]]
emb <- embeddings$text_with_fns
# 'cosine similarity' returns a data.frame of (package, simil). We only want
# the "simil" values, but all ordered in the way way, so we can calculate
# average values.
similarities <- lapply (seq_len (num_chunks), function (i) {
    input_i <- input_split [order (runif (length (input_split)))]
    input_i <- paste0 (input_i, collapse = " ")
    input_emb <- get_embeddings (input_i) [, 1]
    s <- cosine_similarity (input_emb, emb, fns = FALSE)
    s [order (s$package), ]
})
pkgs <- similarities [[1]]$package
sim <- lapply (similarities, function (i) i$simil)
# Then average thoss similarity values:
similarities <- data.frame (
    package = pkgs,
    simil = colMeans (do.call (rbind, sim))
) |> dplyr::arrange (dplyr::desc (simil))
print (similarities)
#>      package similarity
#> 1    iso8601  0.8205117
#> 2 tibbletime  0.8163939
#> 3    cleaner  0.7997529
#> 4  lubridate  0.7997298
#> 5     grates  0.7697528
#> 6       fpp3  0.7359823

And averaging similarities across random chunks have increased the similarity between the input text and the lubridate package from last place up to fourth out of six. That is sufficient for lubridate to appear in the default list of top five matches seen when printing the output of “pkgmatch” functions.

A note on the use of chunking in pkgmatch?

Chunking like that demonstrated immediately above is very commonly used in many LM applications, notably including by almost all commercial API providers. The previous section also demonstrates that it can and does improve the results of “pkgmatch”. Chunking like that is nevertheless not used here, because each additional permutation requires additional calls to the ollama model used to generate the embeddings, and that is generally the most time-consuming part of generating results from this package.

The reference embeddings used to represent the two corpora are nevertheless generated using a “chunking” strategy of randomly-permuting package text and code. The current procedures generate three embedding vectors for every R package in both the rOpenSci and CRAN corpora, with these embeddings updated every day. The embeddings are stored with a GithUb release of this package.

Embeddings are themselves averaged over the randomized chunks, rather than averages of resultant similarity metrics like demonstrated in the code above. Averaging across similarity metrics (“AvgSim”) generates more accurate results than averaging across embedding vectors (“AvgEmb”), because cosine similarities are inherently non-linear, and so the final average of similarities will incorporate these non-linear effects. In contrast, AvgEmb generates less accurate results because it applies a linear average of embedding values prior to applying the non-linear similarity calculation.

This approach is neverthless the only feasible approach within this package, because AvgSim would require the full set of permuted embeddings to be stored for each package, whereas AvgEmb only stores one embedding vector for each. (Technically, three embedding vectors are stored for the three forms of input described above.)

Clicking on the link to the GithUb release of this package shows all files of pre-calculated data needed in this package, and importantly shows their sizes as well. The data currently total over 300MB, with the largest single file being the embedding vectors for all CRAN packages, with a size of 227MB. Each permutation of package text and code would require an additional embedding vector, so that storing all five random permutations would increase the size of that file to well over one gigabtye. Anybody using this package must wait for these data to be downloaded. Downloading 227MB of data can already be very time consuming. The additional burden of forcing users to wait to download over 1GB of data in order to generate results would be a strong disincentive to using this package. For that reason, this package uses AvgEmb chunking and not AvgSim.

Recommendations for improving results

This entire vignette has shown how the numerical representations used within LMs are approximate, and demonstrated the consequences of those approximations. While chunking is an effective approach to numerically overcoming some of the negative effects of these approximations, it can be just as effective to provide more accurately descriptive textual input, as illustrated by the following code which uses lm_proportion = 1 to display match results from LM output only:

input <- paste0 (
    "package that works with dates and times in a tidy format and ",
    "overcomes unintuitive frustrating quirks with daylight savings, ",
    "time zones, and leap years"
)
pkgmatch_similar_pkgs (input, corpus = "cran")
#> [1] "tsibble"    "clock"      "CFtime"     "lubridate"  "timechange"

And lubridate appears in the top five matched packages after tweaking the input to more accurately match some of the “quirks” of the text of that package (which includes that word).