BM25 values measure how well a single input matches each document in a corpus. Terms are weighted by their inverse document frequencies, so that relatively rare words contribute more to match scores than common words. For each input, the BM25 value for a document is the sum, over the terms in the input, of each term's relative frequency multiplied by the Inverse Document Frequency (IDF) of that term in the entire corpus. See the Wikipedia page at https://en.wikipedia.org/wiki/Okapi_BM25 for further details.
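The scoring described above can be illustrated with a minimal sketch of the textbook Okapi BM25 formula from the Wikipedia page. This is not the internal 'pkgmatch' implementation: the function name bm25_sketch, the pre-tokenised docs and query inputs, and the parameter values k1 = 1.2 and b = 0.75 are all assumptions chosen for illustration.

```r
# Textbook Okapi BM25; an illustrative sketch only, not the code used
# internally by 'pkgmatch'. All names here are hypothetical.
# 'query' is a character vector of tokens; 'docs' is a list of character
# vectors of tokens; 'k1' and 'b' are the conventional free parameters.
bm25_sketch <- function (query, docs, k1 = 1.2, b = 0.75) {
    n_docs <- length (docs)
    avg_len <- mean (vapply (docs, length, integer (1)))
    terms <- unique (query)

    # Number of documents containing each query term
    n_with_term <- vapply (terms, function (term) {
        sum (vapply (docs, function (d) term %in% d, logical (1)))
    }, numeric (1))

    # Non-negative IDF variant: rare terms receive larger weights
    idf <- log ((n_docs - n_with_term + 0.5) / (n_with_term + 0.5) + 1)

    # One score per document: IDF-weighted, length-normalised term counts
    vapply (docs, function (d) {
        tf <- vapply (terms, function (term) sum (d == term), numeric (1))
        denom <- tf + k1 * (1 - b + b * length (d) / avg_len)
        sum (idf * tf * (k1 + 1) / denom)
    }, numeric (1))
}

# Toy corpus of three pre-tokenised documents
docs <- list (
    c ("http", "client", "curl", "requests"),
    c ("linear", "model", "fitting"),
    c ("curl", "wrapper", "for", "http", "and", "ftp", "curl")
)
scores <- bm25_sketch (c ("curl", "http"), docs)
```

Here scores holds one value per document, and the second document scores zero because it contains neither query term.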
Arguments
- input
A single character string to match against the second parameter of all input documents.
- txt
An optional list of input documents. If not specified, data will be loaded as specified by the corpus parameter.
- idfs
Optional list of Inverse Document Frequency weightings generated by the internal bm25_idf function. If not specified, values for the rOpenSci corpus will be automatically downloaded and used.
- corpus
If txt is not specified, data for the nominated corpus will be downloaded to the local cache directory, and BM25 values calculated against those. Must be one of "ropensci", "ropensci-fns", or "cran". Note that the "ropensci-fns" corpus contains entries for every single function of every rOpenSci package, and the resulting BM25 values can be used to determine the best-matching function. The other two corpora are package-based, and the results can be used to find the best-matching package.
Value
A data.frame of package names and 'BM25' measures against text from whole packages, both with and without function descriptions.
See also
Other bm25:
pkgmatch_bm25_fn_calls()
Examples
# The following function simulates remote data in temporary directory, to
# enable package usage without downloading. Do not run for normal usage.
generate_pkgmatch_example_data ()
#> This function resets the cache directory used by 'pkgmatch'
#> to a temporary path. To restore functionality with full data,
#> you'll either need to restart your R session, or set an
#> environment variable named 'PKGMATCH_CACHE_DIR' to the
#> desired path. Default path is /tmp/Rtmp3FnoNI/pkgmatch_ex_data
input <- "curl" # Name of a single installed package
pkgmatch_bm25 (input, corpus = "cran")
#>             package bm25_with_fns bm25_wo_fns
#> 1    pkgcache_2.2.3     17.615631   10.081095
#> 2         ssh_0.9.3     15.683603    9.850115
#> 3   mRpostman_1.1.4     15.613734    9.970348
#> 4        crul_1.5.0     13.725163    9.798910
#> 5  CancerGram_1.0.0     13.725163    9.950105
#> 6        RCurl_1.98     13.690697   13.588629
#> 7       AmpGram_1.0     10.029629   10.101874
#> 8       httr2_1.1.2      9.928162   13.469124
#> 9        httr_1.4.7      9.880184    9.925923
#> 10       curl_6.2.2      9.876207    9.921904
# Or pre-load document-frequency weightings and pass those:
idfs <- pkgmatch_load_data ("idfs", corpus = "cran", fns = FALSE)
pkgmatch_bm25 (input, corpus = "cran", idfs = idfs)
#>             package bm25_with_fns bm25_wo_fns
#> 1    pkgcache_2.2.3     17.615631   10.081095
#> 2         ssh_0.9.3     15.683603    9.850115
#> 3   mRpostman_1.1.4     15.613734    9.970348
#> 4        crul_1.5.0     13.725163    9.798910
#> 5  CancerGram_1.0.0     13.725163    9.950105
#> 6        RCurl_1.98     13.690697   13.588629
#> 7       AmpGram_1.0     10.029629   10.101874
#> 8       httr2_1.1.2      9.928162   13.469124
#> 9        httr_1.4.7      9.880184    9.925923
#> 10       curl_6.2.2      9.876207    9.921904