The “pkgmatch” package is a search and matching engine for R
packages. It finds the R packages that best match an input, which can be either a
text description or a local path to an R package. pkgmatch
was developed to enable rOpenSci to identify packages similar to each
new package submitted to our software peer-review
scheme. By default, matches are found from rOpenSci’s own package suite,
but it is also possible to find matches from all packages currently on CRAN.
What does the package do?
What the package does is best understood by example, starting with loading the package:
library (pkgmatch)
Then match packages to an input string:
input <- "genomics and transcriptomics sequence data"
pkgmatch_similar_pkgs (input)
#> [1] "onekp" "UCSCXenaTools" "biomartr" "restez"
#> [5] "DataPackageR"
By default, the top five matching packages are printed to the screen.
The function actually returns information on all packages, along with a
head method to display the first few rows:
p <- pkgmatch_similar_pkgs (input)
head (p)
#>         package rank
#> 1         onekp    1
#> 2 UCSCXenaTools    2
#> 3      biomartr    3
#> 4        restez    4
#> 5  DataPackageR    5
The head method also accepts an n parameter to control how many rows are
displayed. Alternatively, as.data.frame can be used to see the entire
data.frame of results.
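The relationship between the printed names, the head method, and the full data.frame can be pictured with a small base R sketch. It assumes the result is a data.frame carrying an extra S3 class; the class name "demo_matches", the constructor new_matches(), and the placeholder package names are all hypothetical and are not taken from pkgmatch itself.

```r
# Hypothetical constructor: a data.frame of all matches with an extra S3 class.
# The class name "demo_matches" is invented for this illustration.
new_matches <- function (pkgs) {
    res <- data.frame (package = pkgs, rank = seq_along (pkgs))
    class (res) <- c ("demo_matches", class (res))
    res
}

# Printing shows only the names of the top `n` matches
print.demo_matches <- function (x, n = 5L, ...) {
    print (utils::head (x$package, n))
    invisible (x)
}

# `head` returns the first `n` rows as a plain data.frame
head.demo_matches <- function (x, n = 5L, ...) {
    utils::head (as.data.frame (x), n = n)
}

p <- new_matches (paste0 ("pkg", 1:8)) # placeholder package names
p                        # prints the first five names only
head (p, n = 2)          # first two rows, with the rank column
nrow (as.data.frame (p)) # all eight rows are still there
```

The point of the sketch is only that the object holds every match, while printing and head choose how much of it to show.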
The following line finds equivalent matches against all packages currently on CRAN:
pkgmatch_similar_pkgs (input, corpus = "cran")
#> [1] "microseq" "read.gb" "seq2R" "tidysq"
#> [5] "rnaCrosslinkOO"
Using an R package as input
The package also accepts as input a path to a local R package. The
following code downloads a “tarball” (.tar.gz file) from
CRAN and finds matching packages from that corpus. We of course expect
the best matches against CRAN packages to include that package
itself:
u <- "https://cran.r-project.org/src/contrib/odbc_1.5.0.tar.gz"
destfile <- file.path (tempdir (), basename (u))
download.file (u, destfile = destfile, quiet = TRUE)
pkgmatch_similar_pkgs (destfile, corpus = "cran")
#> $text
#> [1] "odbc" "rocker" "connections"
#> [4] "DatabaseConnector" "DBI"
#>
#> $code
#> [1] "odbc" "sparklyr" "noctua" "RAthena" "pkgcache"
Both the text and code matches do indeed rank odbc first. As explained in
the documentation, the pkgmatch_similar_pkgs() function ranks final results by
combining several distinct components, primarily from Language Model
(LM) embeddings, as well as from more conventional
document token-frequency analyses. The rankings from each of these
components can be seen as above with the head method:
p <- pkgmatch_similar_pkgs (destfile, corpus = "cran")
head (p)
#>             package version text_rank code_rank
#> 1              odbc   1.5.0         1         1
#> 2            rocker   0.3.1         2      1183
#> 3       connections   0.2.0         3       256
#> 4 DatabaseConnector   6.3.2         4        64
#> 5               DBI   1.2.3         5       102
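To give a feel for what a token-frequency comparison involves, here is a toy sketch in base R. It uses plain cosine similarity between word-count vectors as a simple stand-in; it is not claimed to be the measure pkgmatch applies internally. The helper names tokenize() and cosine_sim(), and the two corpus descriptions, are made up for illustration.

```r
# Split text into lower-case alphanumeric tokens
tokenize <- function (txt) {
    tokens <- strsplit (tolower (txt), "[^a-z0-9]+") [[1]]
    tokens [nzchar (tokens)]
}

# Cosine similarity between the token-frequency vectors of two texts
cosine_sim <- function (a, b) {
    tok_a <- tokenize (a)
    tok_b <- tokenize (b)
    vocab <- union (tok_a, tok_b)
    # Count each vocabulary token in each text (zero where absent)
    va <- as.numeric (table (factor (tok_a, levels = vocab)))
    vb <- as.numeric (table (factor (tok_b, levels = vocab)))
    sum (va * vb) / sqrt (sum (va^2) * sum (vb^2))
}

input <- "genomics and transcriptomics sequence data"
# Hypothetical package descriptions, invented for this example
corpus <- c (
    pkgA = "download genomics sequence data",
    pkgB = "plot spatial maps"
)
sims <- vapply (corpus, function (d) cosine_sim (input, d), numeric (1))
rank (-sims) # pkgA shares three tokens with the input and ranks first
```

A real token-frequency analysis would typically also down-weight tokens that are common across the whole corpus, which this sketch leaves out.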
Controlling how ranks are combined
As explained in the documentation for the main pkgmatch_similar_pkgs()
function, ranks for the different
components are combined to form a single final ranking using the Reciprocal
Rank Fusion (RRF) algorithm. That function also includes an
additional lm_proportion parameter which can be used to
weight the relative contributions of these different components. Results
from the LM component alone are:
pkgmatch_similar_pkgs (destfile, corpus = "cran", lm_proportion = 1)
#> $text
#> [1] "odbc" "rocker" "FormShare" "CDMConnector" "ODB"
#>
#> $code
#> [1] "odbc" "RODBCDBI" "sjdbc" "RODBC"
#> [5] "stacomirtools"
Results from the other component, which compares relative token frequencies with all CRAN packages, including frequencies of code tokens, are:
pkgmatch_similar_pkgs (destfile, corpus = "cran", lm_proportion = 0)
#> $text
#> [1] "odbc" "implyr" "DatabaseConnector"
#> [4] "sparklyr" "gbifdb"
#>
#> $code
#> [1] "odbc" "rsconnect" "pkgload" "xfun" "pak"
There are notable differences between the two sets of results. As
also explained in the documentation for pkgmatch_similar_pkgs(),
all internal function calls are
locally cached, so that this function can be easily and quickly re-run
with different values of lm_proportion.
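The idea behind Reciprocal Rank Fusion can be sketched in a few lines of base R. In its usual form, each item scores the sum of 1 / (k + rank) over all input rankings, and the final ranking orders items by that score. The function rrf_combine(), its weights argument, and the constant k = 60 (a commonly used default) are assumptions made for this illustration; how pkgmatch applies lm_proportion internally is described in its own documentation. The input ranks are the text_rank and code_rank values from the head (p) output above.

```r
# Reciprocal Rank Fusion: each ranking contributes weight / (k + rank).
# `rrf_combine` and its arguments are hypothetical names for illustration,
# not part of the pkgmatch API.
rrf_combine <- function (ranks, weights = rep (1, ncol (ranks)), k = 60) {
    stopifnot (ncol (ranks) == length (weights))
    # One score per row (package): weighted sum of reciprocal ranks
    scores <- as.vector ((1 / (k + as.matrix (ranks))) %*% weights)
    # Higher score is better, so rank on the negated score
    rank (-scores, ties.method = "min")
}

ranks <- data.frame (
    text_rank = c (1, 2, 3, 4, 5),
    code_rank = c (1, 1183, 256, 64, 102),
    row.names = c ("odbc", "rocker", "connections", "DatabaseConnector", "DBI")
)
ranks$combined <- rrf_combine (ranks [, c ("text_rank", "code_rank")])
ranks [order (ranks$combined), ]
# Combined order: odbc, DatabaseConnector, DBI, connections, rocker
```

With equal weights, rocker drops from second to last among these five because of its poor code rank. Passing weights = c (1, 0) ignores the second ranking entirely and reproduces the text ranking, which is one way a proportion-style parameter could shift the balance between components.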