Skip to contents

This function helps in defining a subset of ngram files which should be imported, since importing all ngrams at once can be very expensive (in terms of cpu and memory).

Usage

jst_subset_ngrams(zip_archives, ngram_type, selection, by = file_name)

Arguments

zip_archives

A character vector of one or multiple zip-files.

ngram_type

One of "ngram1", "ngram2" or "ngram3"

selection

A data.frame with the articles/books which are to be selected.

by

A column name for matching.

Value

A list of zip-locations which can be read via jst_get_ngram().

Examples

# create sample output
tmp <- tempdir()
jst_import_zip(jst_example("pseudo_dfr.zip"),
               import_spec = jst_define_import(book = jst_get_book),
               out_file = "test", out_path = tmp)
#> Processing files for book_chapter with functions jst_get_book

# re-import as our selection for which we would like to import ngrams
selection <- jst_re_import(file.path(tmp, 
                                     "test_book_chapter_jst_get_book-1.csv"))

# get location of file
zip_loc <- jst_subset_ngrams(jst_example("pseudo_dfr.zip"), "ngram1",
                             selection) 

# import ngram
jst_get_ngram(zip_loc[[1]])
#> # A tibble: 2 × 3
#>   file_name                  ngram        n
#>   <chr>                      <chr>    <int>
#> 1 book-chapter-standard_book Common     400
#> 2 book-chapter-standard_book Uncommon     5
unlink(tmp)