Data caching and updating
Source:vignettes/data-caching-and-updating.Rmd
The “pkgmatch” package relies on pre-generated Language Model (LM) embeddings. Inputs of text, code, or entire packages are converted into embeddings, which are then compared with the pre-generated embeddings to identify the best-matching results. The pre-generated embeddings are calculated for the entire package suites of both rOpenSci and CRAN.
Local caching and updating for users
The pre-generated embeddings are downloaded the first time they are needed by a package call. The download location is determined by the rappdirs package as fs::path(rappdirs::user_cache_dir(), "R", "pkgmatch").
Users should generally not need to worry about managing these data files themselves, although the files, and indeed the entire directory in which they are stored, can be safely deleted at any time.
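To inspect the cache, the location described above can be reconstructed directly. The following is a minimal sketch that assumes the fs and rappdirs packages are installed; the variable names `cache_dir` and `cached_files` are illustrative only and are not part of pkgmatch.

```r
library(fs)
library(rappdirs)

# Cache directory, built exactly as described in the text above
cache_dir <- fs::path(rappdirs::user_cache_dir(), "R", "pkgmatch")

# List any cached files, if the directory has been created yet
if (fs::dir_exists(cache_dir)) {
    cached_files <- fs::dir_ls(cache_dir)
    print(cached_files)
} else {
    message("No cached data found at ", cache_dir)
}

# To clear the cache, uncomment the following line:
# fs::dir_delete(cache_dir)
```

The deletion call is left commented out so that running the sketch never removes data unintentionally.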
The remote data are regularly updated, and so locally-cached data also require regular updating. If any one of the locally-cached embeddings files needed for functionality is more than 30 days old, a newer version will be automatically downloaded. This update frequency can be overridden by setting an option, for example to 100 days with:
options("pkgmatch_update_frequency" = 100L)
If you want to ensure your data are always up to date, set an update frequency of 1, and they’ll be updated every day.
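The age check described above can be illustrated in base R. This is only a sketch of the general idea, not the code pkgmatch actually uses: `needs_update()` is a hypothetical helper, and the fallback of 30 days is an assumption taken from the default stated in the text.

```r
# Hypothetical helper, NOT part of the pkgmatch API: returns TRUE if a file
# is missing or older than the configured update frequency (in days).
needs_update <- function(path,
                         max_age_days = getOption("pkgmatch_update_frequency", 30L)) {
    if (!file.exists(path)) {
        return(TRUE)
    }
    # File age in days, from the last-modified timestamp
    age_days <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "days"))
    age_days > max_age_days
}

# Demonstrate with a temporary file whose modification time is set 45 days back
f <- tempfile(fileext = ".Rds")
file.create(f)
Sys.setFileTime(f, Sys.time() - 45 * 24 * 3600)

needs_update(f, max_age_days = 30)   # file is older than a 30-day window
needs_update(f, max_age_days = 100)  # file is within a 100-day window
```

Reading the threshold through `getOption()` with a fallback means that a value set via `options()` takes effect without any further arguments.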
Data updating for developers
The rOpenSci and CRAN package suites are constantly changing, and therefore the embeddings also need to be regularly updated. The “pkgmatch” package includes several files in the /R directory prefixed with “data-update”, containing functions which implement this updating. These functions are intended to be used only by the developers. They are ultimately used in a GitHub workflow file which is automatically run every day to update all embedding data for both CRAN and rOpenSci. The embeddings data thus always reflect the current daily state of both repositories.