
Why are the results not what I expect?
Source: vignettes/E_why-are-the-results-not-what-i-expect.Rmd
This vignette presumes that you’ve read the “How does pkgmatch work?” vignette, which opens by explaining that:
The “pkgmatch” package finds packages, as well as individual functions, which best match a given input. Inputs can be text descriptions, sections of code, or even entire R packages. “pkgmatch” finds the most closely matching packages using a combination of Language Models (LMs, or equivalently, “LLMs” for large language models), and traditional token-frequency algorithms.
This vignette digs more deeply into the question of why “pkgmatch” may sometimes fail to produce expected results. In answering that question, it is important to understand that LMs effectively rely on compressed representations of input data. That compression takes the form of “embeddings”, which transform textual input into vectors of numeric values. The vectors here, as in many LM-based systems, comprise 768 individual values: no matter how long an input is, it will always be represented in the embedding space by a vector of 768 numeric elements. This representation is thus inherently “lossy”, and therefore inherently approximate. “pkgmatch” works by matching the embedding vectors of any input against pre-computed data sets of embedding vectors from specific corpora. Because all embeddings are approximate, matching is also unavoidably approximate, and any expected match may not necessarily appear as the best-matched result from {pkgmatch}. Nevertheless, to the extent that the approximations are accurate, expected matches should appear somewhere within the first few matches. This vignette explores why even that approximate expectation may sometimes fail to hold.
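To make that concrete, the following toy sketch illustrates cosine similarity between two 768-element vectors. These are random vectors standing in for real model embeddings, purely for illustration:
set.seed (1)
emb_a <- rnorm (768) # stand-in for the embedding of one text
emb_b <- emb_a + rnorm (768, sd = 0.5) # a noisy variant of the first
# Cosine similarity: dot product divided by the product of vector norms:
sum (emb_a * emb_b) / (sqrt (sum (emb_a^2)) * sqrt (sum (emb_b^2)))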
Note that all results in this vignette are pre-generated and hard-coded, because they rely on complex and time-consuming outputs of locally-running language models. While all code within the entire vignette may be run directly as is, results may differ slightly from those shown here.
Demonstration of unexpected results
To start, we need to load the package:
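library (pkgmatch)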
The entire vignette is based on results from a single prompt, for which the expected result is the dplyr package:
input <- "Tidy data manipulation"
By default, the object returned from the pkgmatch_similar_pkgs() function prints the top five matching packages:
pkgmatch_similar_pkgs (input, corpus = "cran")
#> [1] "tidyfst" "dplyr" "tidytable" "tidygraph" "romic"
And while expected results may not necessarily always be in the first position, dplyr does indeed appear in the top five matches, as expected. In contrast, consider the following example, in which we expect to retrieve the lubridate package in the results:
input <- "Package that works with dates and times in tidy format"
pkgmatch_similar_pkgs (input, corpus = "cran")
#> [1] "iso8601" "fpp3" "tibbletime" "cleaner" "grates"
And lubridate doesn’t even appear in the top five results. Understanding why this arises requires diving into the code used by {pkgmatch} to obtain those matches. The following sub-sections re-generate full package embeddings for the five packages listed above, plus the expected result of lubridate, in order to compare results for that expected package with those of the five others.
Obtain package source
The following code downloads and extracts tarballs of source code for the five packages listed above in our search for lubridate, plus that package itself.
pkgs <- c (
"iso8601", "fpp3", "tibbletime", "cleaner", "grates", "lubridate"
)
path <- utils::download.packages (
pkgs,
destdir = fs::path_temp (),
repos = "https://cloud.r-project.org"
)
chk <- lapply (path [, 2], function (p) {
utils::untar (p, exdir = fs::path_temp (), tar = "internal")
})
pkg_paths <- fs::path (fs::path_temp (), pkgs)
stopifnot (all (fs::dir_exists (pkg_paths)))
Embeddings from package source
Embeddings used in {pkgmatch} are generated by the pkgmatch_embeddings_from_pkgs() function, which accepts one main input parameter specifying paths to one or more local directories containing package source code.
emb <- pkgmatch_embeddings_from_pkgs (pkg_paths)
str (emb)
#> Generating text embeddings [1 / 2] ...
#> Generating text embeddings [2 / 2] ...
#> Generating code embeddings ...
#> List of 3
#> $ text_with_fns: num [1:768, 1:6] -0.838 0.669 0.202 -0.686 -0.985 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:6] "iso8601" "fpp3" "tibbletime" "cleaner" ...
#> $ text_wo_fns : num [1:768, 1:6] 0.201 0.808 -0.34 -0.186 0.858 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:6] "iso8601" "fpp3" "tibbletime" "cleaner" ...
#> $ code : num [1:768, 1:6] 0.482 -0.588 0.226 -0.889 -0.949 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:6] "iso8601" "fpp3" "tibbletime" "cleaner" ...
Those three list items in the returned object are embeddings from the full package text including the text of all function documentation entries (text_with_fns); the equivalent text without function documentation entries (text_wo_fns); and the entire package source code represented as a single character vector (code).
Generating matches from those embeddings
The embedding vector corresponding to the input is then compared to these embedding vectors in order to find the best-matched package. Similarities between embeddings are measured here, as in the vast majority of LM applications, using cosine similarity.
emb_input <- get_embeddings (input, code = FALSE)
simil_with_fns <- cosine_similarity (emb_input, emb$text_with_fns)
simil_wo_fns <- cosine_similarity (emb_input, emb$text_wo_fns)
names (simil_with_fns) [2] <- "simil_with_fns"
names (simil_wo_fns) [2] <- "simil_wo_fns"
similarities <- dplyr::left_join (simil_with_fns, simil_wo_fns, by = "package")
print (similarities)
#> package simil_with_fns simil_wo_fns
#> 1 grates 0.0123306 -0.0106373
#> 2 fpp3 -0.0031622 0.0051597
#> 3 iso8601 -0.0129680 -0.0519957
#> 4 tibbletime -0.0201389 0.0264501
#> 5 lubridate -0.0311186 0.0112935
#> 6 cleaner -0.0342489 -0.0004984
That generates two vectors of similarities, with values for each package generated from embeddings calculated both with and without text from function documentation.

Matching in “pkgmatch” also uses additional data from “BM25” values for inverse word frequencies. These are derived by calculating word frequencies across an entire corpus, weighting each word in an input by the inverse of those frequencies, and then matching the result to word-frequency vectors for each target package. The following lines load the pre-computed “inverse document frequencies” for the corpus, and then use those to calculate BM25 scores for the input. Since the “idfs” include data for the entire CRAN corpus, this code also filters the final scores down to our selection of six packages only, noting that the “package” column contains full names of tarballs.
idfs <- pkgmatch_load_data (what = "idfs", corpus = "cran")
bm25 <- pkgmatch_bm25 (input = input, idfs = idfs, corpus = "cran") |>
dplyr::mutate (package = gsub ("\\_.*$", "", package)) |>
dplyr::filter (package %in% pkgs)
similarities <- dplyr::left_join (similarities, bm25, by = "package")
print (similarities)
#> package simil_with_fns simil_wo_fns bm25_with_fns bm25_wo_fns
#> 1 grates 0.0123306 -0.0106373 13.90114 17.37301
#> 2 fpp3 -0.0031622 0.0051597 14.62466 17.43921
#> 3 iso8601 -0.0129680 -0.0519957 14.55312 18.50910
#> 4 tibbletime -0.0201389 0.0264501 15.85009 17.26252
#> 5 lubridate -0.0311186 0.0112935 13.49620 16.67262
#> 6 cleaner -0.0342489 -0.0004984 17.64303 20.56125
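For intuition, the following sketch shows a standard BM25-style score for a single input term. This is the generic textbook formula, not necessarily the exact pkgmatch implementation; tf is the term’s frequency in the target package text, idf its pre-computed inverse document frequency, doc_len and avg_len the document and average corpus lengths, and k1 and b conventional tuning constants:
# Generic BM25 term score (illustrative; not pkgmatch internals):
bm25_term <- function (tf, idf, doc_len, avg_len, k1 = 1.2, b = 0.75) {
    idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
}
bm25_term (tf = 3, idf = 2.1, doc_len = 5000, avg_len = 4000)
An input’s full score is then the sum of such term scores over all of its words.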
Finally, these similarities are combined using a “reranking” function. For “pkgmatch”, this function by default takes only similarities excluding function definitions, so only two of the four columns shown above, and combines them using an lm_proportion parameter with a default value of 0.5, for equal contributions of similarities from embeddings and BM25 values.
pkgmatch_rerank (similarities)
#> packages rank
#> 1 cleaner 1
#> 2 tibbletime 2
#> 3 fpp3 3
#> 4 iso8601 4
#> 5 lubridate 5
#> 6 grates 6
And as expected, lubridate is the lowest-ranked of the six packages. Note also that the BM25 scores for lubridate are lower than for any of the other packages, indicating that the actual words used in the input prompt are only a poor match for the words used within the text of the lubridate package, whereas they are a much better match for the other five top-matched packages. Nevertheless, lubridate is also the worst-matched package in terms of similarities without function documentation text. Thus, regardless of how LM and BM25 values are combined, lubridate is never included in the top five matches.
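For intuition, one simple way such a weighted combination could be implemented is sketched below. This is illustrative only, and not the actual internals of pkgmatch_rerank(); it converts each of the two default metrics to a rank, and weights those ranks by lm_proportion:
# Illustrative weighted rank combination (not pkgmatch_rerank() internals):
lm_proportion <- 0.5
rank_lm <- rank (-similarities$simil_wo_fns) # rank 1 = most similar
rank_bm25 <- rank (-similarities$bm25_wo_fns)
combined <- lm_proportion * rank_lm + (1 - lm_proportion) * rank_bm25
similarities$package [order (combined)] # lowest combined rank = best match
A toy combination of this kind also leaves lubridate near the bottom of the six packages.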
While BM25 values are derived from empirical word frequencies, and so are effectively fixed, the LM similarity metrics are derived from embeddings which, as stated at the outset, are inherently approximate. The next section describes how {pkgmatch} might be refined to improve the accuracy of these approximate representations.
Modifying embeddings through chunking
One common strategy to improve the accuracy of LM embeddings is “chunking”, which refers to taking different “chunks” of an input that generally extends beyond the admissible “context window” of the model. The models used here have context windows of 8096 tokens, which is often not enough to contain the entire code or text of a package, including the full text of all help files and vignettes. The embeddings for the two corpora used here, of rOpenSci and CRAN, are therefore generated from five “chunks”, or randomly-permuted versions, of package texts, both to ensure that full texts are represented in the resultant embeddings, and to reduce the influence of specific text ordering.
More generally, the effects of chunking can be illustrated by examining how the embedding for "This plus that. Then those." differs from that for "Then those. This plus that.":
head (get_embeddings ("This plus that. Then those."))
head (get_embeddings ("Then those. This plus that."))
#> [,1]
#> [1,] -0.5418470
#> [2,] 0.1983769
#> [3,] 1.1284385
#> [4,] 0.2710975
#> [5,] 0.1558345
#> [6,] 0.0710893
#> [,1]
#> [1,] -0.57014912
#> [2,] 0.22628722
#> [3,] 1.09075117
#> [4,] 0.24393138
#> [5,] 0.11764651
#> [6,] 0.09092502
Those two sets of embeddings are quite similar, and yet not identical. Chunking is generally used to generate several embeddings for any given input, with similarity metrics averaged over these several embeddings then providing a more accurate approximation to the underlying textual similarities. In other words, chunking reduces the noise inherent in approximate representation through embeddings.
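That similarity can be quantified directly with the cosine measure, here assuming (as the output above shows) that get_embeddings() returns a one-column matrix:
e1 <- get_embeddings ("This plus that. Then those.") [, 1]
e2 <- get_embeddings ("Then those. This plus that.") [, 1]
# Cosine similarity between the two permuted versions:
sum (e1 * e2) / (sqrt (sum (e1^2)) * sqrt (sum (e2^2)))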
We will now demonstrate how chunking can be applied within the general “pkgmatch” workflow shown above. The pkgmatch_embeddings_from_pkgs() function generates embeddings from a list of local paths to package source code, with values averaged over five differently-permuted chunks of input text and code for each package.
embeddings <- pkgmatch_embeddings_from_pkgs (pkg_paths)
Those embeddings are a list of three matrices, named after the source used to generate each set: text_with_fns for embeddings from full package text including function help files; text_wo_fns for the same excluding function help files; and code for embeddings from package code. For our input, we only want to match package text, so we’ll just use the first of these. We can then use those embeddings to calculate similarities against randomly-permuted versions of our input string.
num_chunks <- 10L
input_split <- strsplit (input, "\\s") [[1]]
emb <- embeddings$text_with_fns
# 'cosine_similarity()' returns a data.frame of (package, simil). We only want
# the "simil" values, but all ordered in the same way, so we can calculate
# average values.
similarities <- lapply (seq_len (num_chunks), function (i) {
input_i <- input_split [order (runif (length (input_split)))]
input_i <- paste0 (input_i, collapse = " ")
input_emb <- get_embeddings (input_i) [, 1]
s <- cosine_similarity (input_emb, emb, fns = FALSE)
s [order (s$package), ]
})
pkgs <- similarities [[1]]$package
sim <- lapply (similarities, function (i) i$simil)
# Then average those similarity values:
similarities <- data.frame (
package = pkgs,
simil = colMeans (do.call (rbind, sim))
) |> dplyr::arrange (dplyr::desc (simil))
print (similarities)
#> package simil
#> 1 iso8601 0.8205117
#> 2 tibbletime 0.8163939
#> 3 cleaner 0.7997529
#> 4 lubridate 0.7997298
#> 5 grates 0.7697528
#> 6 fpp3 0.7359823
And averaging similarities across random chunks has lifted the lubridate package from last place up to fourth of the six packages. That is sufficient for lubridate to appear in the default list of top five matches seen when printing the output of “pkgmatch” functions.
A note on the use of chunking in pkgmatch
Chunking like that demonstrated immediately above is very commonly used in LM applications, notably by almost all commercial API providers, and the previous section demonstrates that it can and does improve the results of “pkgmatch”. It is nevertheless not used here, because each additional permutation requires additional calls to the ollama model used to generate the embeddings, and those calls are generally the most time-consuming part of generating results from this package.
The reference embeddings used to represent the two corpora are nevertheless generated using a “chunking” strategy of randomly permuting package text and code. The current procedures generate three embedding vectors for every R package in both the rOpenSci and CRAN corpora, with these embeddings updated every day. The embeddings are stored with a GitHub release of this package.
The embeddings themselves are averaged over the randomized chunks, rather than averaging the resultant similarity metrics as demonstrated in the code above. Averaging across similarity metrics (“AvgSim”) generates more accurate results than averaging across embedding vectors (“AvgEmb”), because cosine similarities are inherently non-linear, and so the final average of similarities incorporates these non-linear effects. In contrast, AvgEmb applies a linear average of embedding values prior to applying the non-linear similarity calculation, and so generates less accurate results.
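The difference between the two strategies can be sketched as follows, using random vectors in place of real chunk embeddings, purely for illustration:
set.seed (1)
cos_sim <- function (a, b) sum (a * b) / (sqrt (sum (a^2)) * sqrt (sum (b^2)))
chunk_embs <- matrix (rnorm (768 * 5), nrow = 768) # five permuted-chunk embeddings
input_emb <- rnorm (768)
# "AvgSim": one similarity per chunk embedding, then average those:
mean (apply (chunk_embs, 2, cos_sim, b = input_emb))
# "AvgEmb": average the embeddings first, then calculate a single similarity:
cos_sim (rowMeans (chunk_embs), input_emb)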
AvgEmb is nevertheless the only feasible approach within this package, because AvgSim would require the full set of permuted embeddings to be stored for each package, whereas AvgEmb stores only one embedding vector for each. (Technically, three embedding vectors are stored, for the three forms of input described above.)
Clicking on the link to the GitHub release of this package shows all files of pre-calculated data needed by this package, and importantly also shows their sizes. The data currently total over 300MB, with the largest single file being the embedding vectors for all CRAN packages, at 227MB. Each permutation of package text and code would require an additional embedding vector, so storing all five random permutations would increase the size of that file to well over one gigabyte. Anybody using this package must first wait for these data to download, and downloading 227MB can already be very time-consuming. Forcing users to download over 1GB of data in order to generate results would be a strong disincentive to using this package. For that reason, this package uses AvgEmb chunking rather than AvgSim.
Recommendations for improving results
This entire vignette has shown how the numerical representations used within LMs are approximate, and demonstrated the consequences of those approximations. While chunking is an effective numerical approach to overcoming some of the negative effects of these approximations, it can be just as effective to provide more accurately descriptive textual input, as illustrated by the following code which uses lm_proportion = 1 to display match results from LM output only:
input <- paste0 (
"package that works with dates and times in a tidy format and ",
"overcomes unintuitive frustrating quirks with daylight savings, ",
"time zones, and leap years"
)
pkgmatch_similar_pkgs (input, corpus = "cran", lm_proportion = 1)
#> [1] "tsibble" "clock" "CFtime" "lubridate" "timechange"
And lubridate appears in the top five matched packages after tweaking the input to more accurately match some of the “quirks” of the text of that package (which includes that word).