
BM25 values match a single input against a corpus of documents, weighting terms by their inverse document frequencies so that relatively rare words contribute more to match scores than common words. For each input, the BM25 value is the sum, over all terms in the input, of the relative frequency of each term multiplied by the Inverse Document Frequency (IDF) of that term in the entire corpus. See the Wikipedia page at https://en.wikipedia.org/wiki/Okapi_BM25 for further details.
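The weighting described above can be illustrated with a minimal base-R sketch of the standard Okapi BM25 formula as given on the Wikipedia page. This is not the pkgmatch implementation: the function names `tokenize` and `bm25_sketch`, the simple tokenizer, and the parameter values `k1 = 1.2` and `b = 0.75` are all assumptions made for illustration only.

```r
# Illustrative sketch of Okapi BM25 in base R; NOT the pkgmatch internals.
# 'tokenize' and 'bm25_sketch' are hypothetical helper names.
tokenize <- function (x) {
    tokens <- strsplit (tolower (x), "[^a-z0-9]+") [[1]]
    tokens [nzchar (tokens)]
}

bm25_sketch <- function (input, docs, k1 = 1.2, b = 0.75) {
    # k1 and b are conventional defaults, assumed here for illustration
    doc_tokens <- lapply (docs, tokenize)
    n_docs <- length (doc_tokens)
    doc_len <- vapply (doc_tokens, length, integer (1))
    avg_len <- mean (doc_len)
    terms <- unique (tokenize (input))

    # Number of documents containing each input term
    n_t <- vapply (terms, function (tt) {
        as.numeric (sum (vapply (doc_tokens, function (d) tt %in% d, logical (1))))
    }, numeric (1))
    # Inverse document frequency: rare terms receive larger weights
    idf <- log ((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)

    scores <- vapply (seq_along (doc_tokens), function (i) {
        # Term frequency of each input term within document i
        tf <- vapply (terms, function (tt) {
            as.numeric (sum (doc_tokens [[i]] == tt))
        }, numeric (1))
        sum (idf * tf * (k1 + 1) /
            (tf + k1 * (1 - b + b * doc_len [i] / avg_len)))
    }, numeric (1))
    stats::setNames (scores, names (docs))
}

# Three toy documents; only 'a' and 'c' share terms with the input
docs <- c (
    a = "curl is a client for http and ftp transfers",
    b = "plotting tools for time series data",
    c = "http client built on curl with curl handles"
)
scores <- bm25_sketch ("curl http", docs)
print (scores)
```

Document `b` shares no terms with the input, so its score is zero, while `a` and `c` receive positive scores. The sketch scores every document in memory and makes no attempt at efficiency.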

Usage

pkgmatch_bm25(input, txt = NULL, idfs = NULL, corpus = NULL)

Arguments

input

A single character string to match against all of the input documents given in the second parameter, txt.

txt

An optional list of input documents. If not specified, data will be loaded as specified by the corpus parameter.

idfs

Optional list of Inverse Document Frequency weightings generated by the internal bm25_idf function. If not specified, values for the rOpenSci corpus will be automatically downloaded and used.

corpus

If txt is not specified, data for the nominated corpus will be downloaded to a local cache directory, and BM25 values calculated against those data. Must be one of "ropensci", "ropensci-fns", or "cran". Note that the "ropensci-fns" corpus contains entries for every single function of every rOpenSci package, and the resulting BM25 values can be used to determine the best-matching function. The other two corpora are package-based, and the results can be used to find the best-matching package.

Value

A data.frame of package names and 'BM25' measures against text from whole packages both with and without function descriptions.

Examples

# The following function simulates remote data in temporary directory, to
# enable package usage without downloading. Do not run for normal usage.
generate_pkgmatch_example_data ()
#> This function resets the cache directory used by 'pkgmatch'
#> to a temporary path. To restore functionality with full data,
#> you'll either need to restart your R session, or set an
#> environment variable named 'PKGMATCH_CACHE_DIR' to the
#> desired path. Default path is /tmp/Rtmp3FnoNI/pkgmatch_ex_data

input <- "curl" # Name of a single installed package
pkgmatch_bm25 (input, corpus = "cran")
#>             package bm25_with_fns bm25_wo_fns
#> 1    pkgcache_2.2.3     17.615631   10.081095
#> 2         ssh_0.9.3     15.683603    9.850115
#> 3   mRpostman_1.1.4     15.613734    9.970348
#> 4        crul_1.5.0     13.725163    9.798910
#> 5  CancerGram_1.0.0     13.725163    9.950105
#> 6        RCurl_1.98     13.690697   13.588629
#> 7       AmpGram_1.0     10.029629   10.101874
#> 8       httr2_1.1.2      9.928162   13.469124
#> 9        httr_1.4.7      9.880184    9.925923
#> 10       curl_6.2.2      9.876207    9.921904

# Or pre-load document-frequency weightings and pass those:
idfs <- pkgmatch_load_data ("idfs", corpus = "cran", fns = FALSE)
pkgmatch_bm25 (input, corpus = "cran", idfs = idfs)
#>             package bm25_with_fns bm25_wo_fns
#> 1    pkgcache_2.2.3     17.615631   10.081095
#> 2         ssh_0.9.3     15.683603    9.850115
#> 3   mRpostman_1.1.4     15.613734    9.970348
#> 4        crul_1.5.0     13.725163    9.798910
#> 5  CancerGram_1.0.0     13.725163    9.950105
#> 6        RCurl_1.98     13.690697   13.588629
#> 7       AmpGram_1.0     10.029629   10.101874
#> 8       httr2_1.1.2      9.928162   13.469124
#> 9        httr_1.4.7      9.880184    9.925923
#> 10       curl_6.2.2      9.876207    9.921904
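
The two score columns can rank packages differently. As a sketch of how the returned data.frame might be re-ordered, the following copies a few rows by hand from the output above (rather than calling the package) and sorts them by the bm25_wo_fns column using base R. The object names `res` and `res_wo` are illustrative only.

```r
# A few rows copied by hand from the example output above
res <- data.frame (
    package = c ("pkgcache_2.2.3", "RCurl_1.98", "httr2_1.1.2", "curl_6.2.2"),
    bm25_with_fns = c (17.615631, 13.690697, 9.928162, 9.876207),
    bm25_wo_fns = c (10.081095, 13.588629, 13.469124, 9.921904),
    stringsAsFactors = FALSE
)

# Re-order by scores calculated without function descriptions
res_wo <- res [order (res$bm25_wo_fns, decreasing = TRUE), ]
print (res_wo$package)
```

For these rows, RCurl_1.98 ranks first when sorted by bm25_wo_fns, whereas pkgcache_2.2.3 ranks first by bm25_with_fns.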