
Why are the results not what I expect?
This vignette presumes that you’ve read the “How does pkgmatch work?” vignette, which initially explains that:
The “pkgmatch” package finds packages, as well as individual functions, which best match a given input. Inputs can be text descriptions, sections of code, or even entire R packages. “pkgmatch” finds the most closely matching packages using a combination of Language Models (LMs, or equivalently, “LLMs” for large language models), and traditional token-frequency algorithms.
This vignette digs more deeply into the question of why “pkgmatch” may sometimes fail to produce expected results. In answering that question, it is important to understand that LMs effectively rely on compressed representations of input data. That compression is in the form of vectors of “embeddings” which transform textual input to vectors of numeric values. The vectors here, and in many LM-based systems, comprise 768 individual values. No matter how long an input is, it will always be represented in the embedding space by a vector of 768 numeric elements. This representation is thus inherently “lossy”, and therefore inherently approximate. “pkgmatch” works by matching the embedding vectors of any input to pre-computed data sets of embedding vectors from the specific corpora. Because all embeddings are approximate, matching is also unavoidably approximate. Any expected match may thus not necessarily appear as the best-matched result from {pkgmatch}. Nevertheless, to the extent that the approximations are accurate, expected matches should appear somewhere within the first few matches. This vignette explores why even that approximate expectation may sometimes fail to happen.
Note that all results in this vignette are pre-generated and hard-coded, because they rely on complex and time-consuming outputs of locally-running language models. While all code within the entire vignette may be run directly as is, results may differ slightly from those shown here.
Demonstration of unexpected results
To start, we need to load the package:
library (pkgmatch)
#> ollama is not installed. Please follow installation instructions at https://ollama.com.
The entire vignette is based on results from a single prompt, for which the expected result is the lubridate package:
input <- "Package that works with dates and times in tidy format"
By default, the object returned from the pkgmatch_similar_pkgs() function prints the top five matching packages:
pkgmatch_similar_pkgs (input, corpus = "cran")
#> [1] "datefixR" "fpp3" "rebus.datetimes" "sweep"
#> [5] "datetime"
And while expected results may not necessarily always be in the first position, the absence of lubridate from the top five matches is indeed unexpected. Understanding why this arises requires diving into the code used by {pkgmatch} to obtain those matches. The following sub-sections re-generate full package embeddings for the five packages listed above plus the expected result of lubridate, in order to compare results for that expected package with those of the five others.
Obtain package source
The following code downloads and extracts tarballs of source code for the five packages listed above, plus lubridate.
pkgs <- c (
    "datefixR", "fpp3", "rebus.datetimes", "sweep", "datetime", "lubridate"
)
path <- utils::download.packages (
    pkgs,
    destdir = fs::path_temp (),
    repos = "https://cloud.r-project.org"
)
chk <- lapply (path [, 2], function (p) {
    utils::untar (p, exdir = fs::path_temp (), tar = "internal")
})
pkg_paths <- fs::path (fs::path_temp (), pkgs)
stopifnot (all (fs::dir_exists (pkg_paths)))
Embeddings from package source
Embeddings used in {pkgmatch} are generated from the pkgmatch_embeddings_from_pkgs() function. This function accepts one main input parameter specifying paths to one or more local directories containing source code.
emb <- pkgmatch_embeddings_from_pkgs (pkg_paths)
str (emb)
#> Generating text embeddings [1 / 2] ...
#> Generating text embeddings [2 / 2] ...
#> Generating code embeddings ...
#> List of 3
#> $ text_with_fns: num [1:768, 1:6] -0.838 0.669 0.202 -0.686 -0.985 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:6] "datefixR" "fpp3" "rebus.datetimes" "sweep" ...
#> $ text_wo_fns : num [1:768, 1:6] 0.201 0.808 -0.34 -0.186 0.858 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:6] "datefixR" "fpp3" "rebus.datetimes" "sweep" ...
#> $ code : num [1:768, 1:6] 0.482 -0.588 0.226 -0.889 -0.949 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:6] "datefixR" "fpp3" "rebus.datetimes" "sweep" ...
Those three list items in the returned object are embeddings from the full package text including text of all function documentation entries (text_with_fns); the equivalent text without function documentation entries (text_wo_fns); and the entire package source code represented as a single character vector (code).
Generating matches from those embeddings
The embedding vector corresponding to the input is then compared to these embedding vectors in order to find the best-matched package. Similarities between embeddings are measured here, as in the vast majority of LM applications, using cosine similarity.
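For readers unfamiliar with the measure, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal base-R sketch using synthetic vectors. The function name cosine_simil_sketch() and all data are illustrative assumptions, and this is not the package's own implementation.

```r
set.seed (1L)
n_dim <- 768L # same length as the embedding vectors described above

# Synthetic "package" embeddings: one named column per hypothetical package.
pkg_names <- c ("pkg_a", "pkg_b", "pkg_c")
emb_pkgs <- matrix (
    rnorm (n_dim * length (pkg_names)),
    nrow = n_dim,
    dimnames = list (NULL, pkg_names)
)
# Synthetic "input" embedding: a noisy copy of the second column.
emb_query <- emb_pkgs [, "pkg_b"] + rnorm (n_dim, sd = 0.5)

# Cosine similarity of one query vector against every column of a matrix.
cosine_simil_sketch <- function (query, emb_mat) {
    dot <- as.vector (crossprod (emb_mat, query))
    norms <- sqrt (colSums (emb_mat^2)) * sqrt (sum (query^2))
    stats::setNames (dot / norms, colnames (emb_mat))
}

simil <- cosine_simil_sketch (emb_query, emb_pkgs)
print (sort (simil, decreasing = TRUE))
```

The package's own cosine_similarity() function is applied to the real embeddings in the code that follows.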
emb_input <- get_embeddings (input, code = FALSE)
simil_with_fns <- cosine_similarity (emb_input, emb$text_with_fns)
simil_wo_fns <- cosine_similarity (emb_input, emb$text_wo_fns)
names (simil_with_fns) [2] <- "simil_with_fns"
names (simil_wo_fns) [2] <- "simil_wo_fns"
similarities <- dplyr::left_join (simil_with_fns, simil_wo_fns, by = "package")
print (similarities)
#> package simil_with_fns simil_wo_fns
#> 1 rebus.datetimes 0.8304852 0.8186134
#> 2 datetime 0.8202274 0.8351010
#> 3 datefixR 0.8194707 0.8186624
#> 4 lubridate 0.7905832 0.7940536
#> 5 sweep 0.7714603 0.7984862
#> 6 fpp3 0.7071283 0.8093774
That generates two vectors of similarities, with values for each package generated from embeddings calculated both with and without text from function documentation. Matching in “pkgmatch” also uses additional data from “BM25” values for inverse word frequencies. These are derived by calculating word frequencies across the entire corpus, weighting each word in an input by the inverse of these frequencies, and then matching the result to word frequency vectors for each target package. The following lines load the pre-computed “inverse document frequencies” for the corpus, and then use those to calculate BM25 scores for the input. Since the “idfs” include data for the entire CRAN corpus, this code also filters final scores down to our selection of six packages only, noting that the “package” column contains full names of tarballs.
idfs <- pkgmatch_load_data (what = "idfs", corpus = "cran")
bm25 <- pkgmatch_bm25 (input = input, idfs = idfs, corpus = "cran") |>
    dplyr::mutate (package = gsub ("\\_.*$", "", package)) |>
    dplyr::filter (package %in% pkgs)
similarities <- dplyr::left_join (similarities, bm25, by = "package")
print (similarities)
#> package simil_with_fns simil_wo_fns bm25_with_fns bm25_wo_fns
#> 1 rebus.datetimes 0.8304852 0.8186134 12.7782429 13.1704148
#> 2 datetime 0.8202274 0.8351010 12.6079534 12.6606973
#> 3 datefixR 0.8194707 0.8186624 13.5863247 16.3750351
#> 4 lubridate 0.7905832 0.7940536 0.2140736 0.6645225
#> 5 sweep 0.7714603 0.7984862 15.6956661 15.0961380
#> 6 fpp3 0.7071283 0.8093774 12.2372817 14.8849026
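To give a feel for what a BM25 score measures, here is a minimal sketch of the standard Okapi BM25 formula on a toy corpus. The three "package" texts, the tokenizer, and the conventional parameter values k1 = 1.2 and b = 0.75 are all assumptions made for illustration. The implementation inside “pkgmatch” may differ in its details.

```r
# Toy corpus: hypothetical package descriptions, not real package text.
docs <- c (
    pkg_a = "parse dates and times from messy text",
    pkg_b = "forecast time series with tidy tools",
    pkg_c = "plot maps of spatial data"
)
tokenize <- function (x) {
    tokens <- strsplit (tolower (x), "[^a-z]+") [[1]]
    tokens [nzchar (tokens)]
}
doc_tokens <- lapply (docs, tokenize)
doc_len <- lengths (doc_tokens)
avg_len <- mean (doc_len)
n_docs <- length (docs)

bm25_sketch <- function (query, k1 = 1.2, b = 0.75) {
    q_tokens <- unique (tokenize (query))
    vapply (names (doc_tokens), function (d) {
        score <- 0
        for (q in q_tokens) {
            # Term frequency of this query word in document d:
            tf <- sum (doc_tokens [[d]] == q)
            # Number of documents containing the word:
            n_q <- sum (vapply (
                doc_tokens, function (tk) q %in% tk, logical (1L)
            ))
            # Inverse document frequency: rarer words weigh more.
            idf <- log ((n_docs - n_q + 0.5) / (n_q + 0.5) + 1)
            len_norm <- 1 - b + b * doc_len [[d]] / avg_len
            score <- score + idf * tf * (k1 + 1) / (tf + k1 * len_norm)
        }
        score
    }, numeric (1L))
}

scores <- bm25_sketch ("dates and times in tidy format")
print (sort (scores, decreasing = TRUE))
```

A document which shares none of the words in the input receives a score of zero, regardless of how closely related its meaning might be.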
Finally, these similarities are combined using a “reranking” function. For “pkgmatch”, this function by default takes only similarities excluding function definitions, so only two of the four columns shown above, and combines them using an lm_proportion parameter with a default value of 0.5 for equal contributions of similarities from embeddings and BM25 values.
pkgmatch_rerank (similarities)
#> package rank
#> 1 datefixR 1
#> 2 datetime 2
#> 3 sweep 3
#> 4 rebus.datetimes 4
#> 5 fpp3 5
#> 6 lubridate 6
And as expected, lubridate is the lowest-ranked of the six packages. Note also that the BM25 scores for lubridate are much lower than for any of the other packages. This indicates that the actual words used in the input prompt are only a poor match for the actual words used within the text of the lubridate package, whereas they are a much better match for the other five top-matched packages. Moreover, lubridate is also the worst-matched package in terms of similarities without function documentation text. Thus, regardless of how LM and BM25 values are combined, lubridate is never included in the top five matches.
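That last claim can be checked with the values from the table above. The following sketch combines the two sets of ranks using a simple weighted mean, which is an assumption made purely for illustration. The pkgmatch_rerank() function may combine ranks differently, and worst_pkg() is a hypothetical helper.

```r
# Values copied from the "simil_wo_fns" and "bm25_wo_fns" columns above.
scores <- data.frame (
    package = c (
        "rebus.datetimes", "datetime", "datefixR",
        "lubridate", "sweep", "fpp3"
    ),
    simil_wo_fns = c (
        0.8186134, 0.8351010, 0.8186624,
        0.7940536, 0.7984862, 0.8093774
    ),
    bm25_wo_fns = c (
        13.1704148, 12.6606973, 16.3750351,
        0.6645225, 15.0961380, 14.8849026
    ),
    stringsAsFactors = FALSE
)
# Rank 1 is the best match on each measure.
rank_lm <- rank (-scores$simil_wo_fns)
rank_bm25 <- rank (-scores$bm25_wo_fns)

# Hypothetical combination rule: weighted mean of the two ranks.
worst_pkg <- function (lm_proportion) {
    combined <- lm_proportion * rank_lm + (1 - lm_proportion) * rank_bm25
    scores$package [which.max (combined)]
}
weights <- seq (0, 1, by = 0.1)
worst <- vapply (weights, worst_pkg, character (1L))
print (unique (worst))
```

Because lubridate is ranked sixth on both measures, it remains in last place for every weight between 0 and 1 under this rule.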
While BM25 values are derived from empirical word frequencies, and so are effectively fixed, the LM similarity metrics are derived from embeddings, and as stated at the outset, these are inherently approximate. The next section describes how {pkgmatch} might be refined to improve the accuracy of these approximate representations.
Modifying embeddings through chunking
One common strategy to improve the accuracy of LM embeddings is “chunking”, which refers to taking different “chunks” of an input that generally extends beyond the admissible “context window” of the model. The models used here have context windows of 8192 tokens, which is often enough to enter the entire code or text of a package. Chunking effects can nevertheless be achieved through permuting components of the input in different orders. The embedding for “This plus that. Then those.” will differ from that for “Then those. This plus that.”:
head (get_embeddings ("This plus that. Then those."))
head (get_embeddings ("Then those. This plus that."))
#> [,1]
#> [1,] -0.5418470
#> [2,] 0.1983769
#> [3,] 1.1284385
#> [4,] 0.2710975
#> [5,] 0.1558345
#> [6,] 0.0710893
#> [,1]
#> [1,] -0.57014912
#> [2,] 0.22628722
#> [3,] 1.09075117
#> [4,] 0.24393138
#> [5,] 0.11764651
#> [6,] 0.09092502
Those two sets of embeddings are quite similar, and yet not identical. Chunking is generally used to generate several embeddings for any given input, with similarity metrics averaged over these several embeddings then providing a more accurate approximation to the underlying textual similarities. In other words, chunking reduces the noise inherent in approximate representation through embeddings.
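The noise-reducing effect of averaging can be illustrated without any language model at all. The following sketch uses entirely synthetic vectors, and assumes that each permutation adds independent random noise to an underlying embedding. Both that assumption and the noise level are hypothetical, and real embeddings need not behave this way.

```r
set.seed (42L)
n_dim <- 768L

# Synthetic input embedding and a related, noise-free "package" embedding.
query <- rnorm (n_dim)
pkg_true <- query + rnorm (n_dim)

cosine <- function (a, b) sum (a * b) / sqrt (sum (a^2) * sum (b^2))

# Similarity from one noisy embedding, standing in for one permuted chunk.
simil_one_chunk <- function () {
    cosine (query, pkg_true + rnorm (n_dim, sd = 0.5))
}

n_trials <- 500L
single <- replicate (n_trials, simil_one_chunk ())
averaged <- replicate (n_trials, mean (replicate (5L, simil_one_chunk ())))

# Spread of similarity estimates from one chunk versus a mean over five.
spread <- c (single = sd (single), averaged = sd (averaged))
print (spread)
```

Under these assumptions, the similarity averaged over five chunks varies notably less from trial to trial than the similarity from a single chunk.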
We will now demonstrate how chunking can be applied within the general “pkgmatch” workflow shown above. The following code modifies the procedure used within the get_embeddings() function, to create differently permuted chunks of the package input. (This requires calling two non-exported functions using the “triple-colon” operator, :::, in contrast to the standard “double colon”, ::, for exported functions.)
txt_with_fns <- vapply (pkg_paths, function (p) pkgmatch:::get_pkg_text (p), character (1L))
txt_wo_fns <- pkgmatch:::rm_fns_from_pkg_txt (txt_with_fns)
The get_pkg_text() function inserts markdown-formatted section headers. These can then be used to break the text into sections which can then be randomly rearranged to form new chunks. The following code generates embeddings from differently-ordered chunks (excluding text from function documentation).
n_permutations <- 5L
# Split text at markdown section headers, and re-join sections in random order:
permute_text <- function (text_input) {
    txt <- strsplit (text_input, "#+") [[1]]
    index <- order (runif (length (txt)))
    # Join with actual newlines ("\n"), not a literal backslash plus "n":
    paste0 (txt [index], collapse = "\n")
}
# One matrix per package, with one column of embeddings per permutation:
embeddings <- lapply (pkg_paths, function (p) {
    txt_with_fns <- pkgmatch:::get_pkg_text (p)
    txt_wo_fns <- pkgmatch:::rm_fns_from_pkg_txt (txt_with_fns) [[1]]
    do.call (cbind, lapply (
        seq_len (n_permutations),
        function (j) get_embeddings (permute_text (txt_wo_fns))
    ))
})
Calculate similarities with those embeddings, and generate average similarities across the chunks for each package:
input_emb <- get_embeddings (input) [, 1]
similarities <- vapply (embeddings, function (emb) {
    colnames (emb) <- letters [seq_len (ncol (emb))]
    simil <- cosine_similarity (input_emb, emb, fns = FALSE)
    mean (simil$simil)
}, numeric (1L))
similarities <- data.frame (package = pkgs, similarity = similarities) |>
    dplyr::arrange (dplyr::desc (similarity))
print (similarities)
#> package similarity
#> 1 datetime 0.8379673
#> 2 rebus.datetimes 0.8235914
#> 3 datefixR 0.8203636
#> 4 lubridate 0.7936824
#> 5 sweep 0.7849108
#> 6 fpp3 0.7724631
And averaging similarities across random chunks has increased the similarity between the input text and the lubridate package from last place up to fourth out of six. That is sufficient for lubridate to appear in the default list of top five matches seen when printing the output of “pkgmatch” functions.
Why not use chunking in pkgmatch?
Chunking like that demonstrated immediately above is very commonly used in many LM applications, notably including by almost all commercial API providers. The previous section also demonstrates that it can and does improve the results of “pkgmatch”. And yet it is not used here, for reasons we now explain.
The current procedures generate three embedding vectors for every R package in both the rOpenSci and CRAN corpora, with these embeddings updated every day. The embeddings are stored with a GitHub release of this package. The release page lists all files of pre-calculated data needed in this package, and importantly shows their sizes as well. The data currently total over 300MB, with the largest single file being the embedding vectors for all CRAN packages, with a size of 227MB. Each permutation of package text and code would require an additional embedding vector, so that using five random permutations like the code above would increase the size of that file to well over one gigabyte. Anybody using this package must wait for these data to be downloaded. Downloading 227MB of data can already be very time consuming. The additional burden of forcing users to wait to download over 1GB of data in order to generate results would be a strong disincentive to using this package.
Equally importantly, these data are updated on a daily basis, and their generation requires energy. Each additional permutation of input chunks requires the energy used for current daily updates to be used again, with five chunks requiring five times the current energy usage.
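The arithmetic behind those two statements is simple enough to write out. This sketch assumes that both file size and daily energy use scale linearly with the number of embedding vectors stored per package, which is a simplification, and treats the five permutations as replacing rather than adding to the current single embedding.

```r
cran_embeddings_mb <- 227 # current size of the CRAN embeddings file, from above
n_permutations <- 5L # as in the chunking code above

# Assumed linear scaling of file size with number of permutations:
size_with_chunks_mb <- cran_embeddings_mb * n_permutations
print (size_with_chunks_mb) # 1135 MB, or over 1 GB

# Assumed linear scaling of energy used for each daily update:
relative_energy <- n_permutations / 1L
print (relative_energy) # 5 times current usage
```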
In short: chunking may be effective in increasing the accuracy of LM outputs, but it also increases demands on memory, on data transfer, and on energy consumption. To remain generally useable by as many people as possible, and to reduce as far as possible the energy, bandwidth, and storage requirements of this package, chunking is not used here.
Recommendations for improving results
This entire vignette has shown how the numerical representations used within LMs are approximate, and demonstrated the consequences of those approximations. While chunking is an effective approach to numerically overcoming some of the negative effects of these approximations, it can be just as effective to provide more accurately descriptive textual input, as illustrated by the following code which uses lm_proportion = 1 to display match results from LM output only:
input <- paste0 (
    "package that works with dates and times in a tidy format and ",
    "overcomes unintuitive frustrating quirks with daylight savings, ",
    "time zones, and leap years"
)
pkgmatch_similar_pkgs (input, corpus = "cran", lm_proportion = 1)
#> [1] "timechange" "neatRanges" "fpp3" "datetime" "lubridate"
And lubridate appears in the top five matched packages after tweaking the input to more accurately match some of the “quirks” of the text of that package (which includes that word).