Automating File Import
Thomas Klebel
2024-12-14
Source:vignettes/automating-file-import.Rmd
Intro
The jst_get_* functions from jstor all work on a single file. Data from
DfR, however, usually consists of many individual files: up to 25,000
when using the self-service functions, and up to several hundred
thousand when requesting a large dataset via an agreement.
Currently jstor offers two implementations to import many files:
jst_import_zip() and jst_import(). The first one lets you import data
directly from zip archives; the second works with file paths, so you
need to unzip the archive first. I will first introduce
jst_import_zip() and discuss the approach with jst_import() afterwards.
Importing data directly from zip-archives
Unpacking and working with many files directly is impractical for at least three reasons:
- If you unzip the archive, the single files will occupy a lot more disk space than the single archive.
- Before you can import the files, you need to list them via list.files or system("find...") on UNIX. Depending on the size of your data, this can take some time.
- There might be different types of data in your sample (journal articles, book chapters, etc.). You need to manage matching the paths to the appropriate functions, which is extra work.
Importing directly from the zip archive simplifies all those tasks
with a single function: jst_import_zip(). For the following
demonstration, we will use a small sample archive that comes with the
package.
As a first step, we should take a look at the archive and its
content. This is made easy with jst_preview_zip():
jst_preview_zip(jst_example("pseudo_dfr.zip")) %>% knitr::kable()
type | meta_type | n |
---|---|---|
metadata | book_chapter | 1 |
metadata | journal_article | 1 |
metadata | pamphlet | 1 |
ngram1 | ngram1 | 1 |
We can see that we have a simple archive with three metadata files
and one ngram file. Before we can use jst_import_zip(), we first need
to think about what we actually want to import: all of the data, or
just parts? What kind of data do we want to extract from articles,
books and pamphlets? We can specify this via jst_define_import():
import_spec <- jst_define_import(
  article = c(jst_get_article, jst_get_authors),
  book = jst_get_book,
  ngram1 = jst_get_ngram
)
In this case, we want to import data on articles (standard metadata
plus information on the authors), general data on books and unigrams
(ngram1). This specification can then be used with jst_import_zip():
# set up a temporary folder for output files
tmp <- tempdir()
# extract the content and write output to disk
jst_import_zip(jst_example("pseudo_dfr.zip"),
               import_spec = import_spec,
               out_file = "my_test",
               out_path = tmp)
#> Processing files for book_chapter with functions jst_get_book
#> Processing files for journal_article with functions jst_get_article, jst_get_authors
#> Processing files for ngram1 with functions jst_get_ngram
We can take a look at the files within tmp with list.files():
list.files(tmp, pattern = "csv")
#> [1] "my_test_book_chapter_jst_get_book-1.csv"
#> [2] "my_test_journal_article_jst_get_article-1.csv"
#> [3] "my_test_journal_article_jst_get_authors-1.csv"
#> [4] "my_test_ngram1_jst_get_ngram-1.csv"
As you can see, jst_import_zip() automatically creates the file names
from the string you supplied to out_file, appending the type of data
and the name of the function used, so that the different types of
output can be told apart.
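The naming pattern visible in the listing above can be written down as
a small helper. This is only an illustration of the observed pattern,
not the package's internal code: the function expected_csv_name() and
its arguments are hypothetical, and the batch suffix is assumed to be a
running number starting at 1.

```r
# Hypothetical helper: rebuild the file names seen in the listing above.
# Pattern assumed from that output: <out_file>_<meta_type>_<function>-<batch>.csv
expected_csv_name <- function(out_file, meta_type, fun_name, batch = 1) {
  paste0(out_file, "_", meta_type, "_", fun_name, "-", batch, ".csv")
}

expected_csv_name("my_test", "journal_article", "jst_get_authors")
#> [1] "my_test_journal_article_jst_get_authors-1.csv"
```

Such a helper can be handy for building the path to a specific output
file instead of typing the full name by hand.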
If we want to re-import the data for further analysis, we can either
use functions like readr::read_csv(), or a small helper function from
the package which determines and sets the column types correctly:
jst_re_import(
  file.path(tmp, "my_test_journal_article_jst_get_article-1.csv")
) %>%
  knitr::kable()
file_name | journal_doi | journal_jcode | journal_pub_id | journal_title | article_doi | article_pub_id | article_jcode | article_type | article_title | volume | issue | language | pub_day | pub_month | pub_year | first_page | last_page | page_range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
journal-article-standard_case | NA | kewbulletin | NA | Kew Bulletin | 10.2307/4117222 | NA | NA | research-article | Two New Species of Ischaemum | 5 | 2 | eng | 1 | 1 | 1950 | 187 | 188 | 187-188 |
A side note on ngrams: For larger archives, importing all ngrams can
take a very long time. It is thus advisable to only import ngrams for
articles which you want to analyse, i.e. most likely a subset of the
initial request. The function jst_subset_ngrams() helps you with this
(see also the section on importing bigrams in the case study).
Parallel processing
Since the above process might take a while for larger archives (files
have to be unpacked, read and parsed), it can be beneficial to execute
the function in parallel. jst_import_zip() and jst_import() use furrr
at their core, therefore adding parallelism is very easy. Set a
parallel plan with the future package at the beginning of your script,
for example via plan(multisession), and the import will use all
available cores.
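A minimal sketch of such a setup, assuming the multisession backend of
the future package (other backends exist, and which one suits you
depends on your platform). The requireNamespace() guard is only there
so that the snippet also runs where future is not installed.

```r
# Assumption: the 'future' package is installed; furrr uses the plan set here.
if (requireNamespace("future", quietly = TRUE)) {
  # run subsequent future/furrr calls in parallel background R sessions
  future::plan(future::multisession)

  # number of workers the current plan provides
  n_workers <- future::nbrOfWorkers()
  print(n_workers)

  # ... call jst_import_zip() or jst_import() here ...

  # switch back to sequential processing when done
  future::plan(future::sequential)
}
```

Resetting the plan at the end is optional in a script, but it avoids
leaving background sessions running in an interactive session.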
You can find out more about futures by reading the package vignette:
vignette("future-1-overview", package = "future")
Working with file paths
The above approach of importing directly from zip archives is very convenient, but in some cases you might want more control over how data is imported. For example, if you run into problems because the output from any of the functions provided with the package looks corrupted, you might want to look at the original files. To do so, you can unzip the archive and work with the files directly, which I will demonstrate in the following sections.
Unzip containers
For simple purposes it might be sensible to unzip to a temporary
directory (with tempdir() and unzip()), but for my research I simply
extracted files to an external SSD, since I a) lacked disk space,
b) needed to read them fast, and c) wanted to be able to look at
specific files for debugging.
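For the simple case, unpacking into a temporary directory could look
roughly like this. The helper unzip_to_temp() is hypothetical (it is
not part of jstor) and uses base R only; the archive path in the usage
line is a placeholder.

```r
# Hypothetical helper (not part of jstor): unpack an archive into a fresh
# sub-directory of the session's temporary directory and list the XML files.
unzip_to_temp <- function(zip_path) {
  out_dir <- tempfile(pattern = "dfr_")
  dir.create(out_dir)
  utils::unzip(zip_path, exdir = out_dir)
  list.files(out_dir, pattern = "\\.xml$", recursive = TRUE, full.names = TRUE)
}

# usage, with a placeholder path:
# xml_paths <- unzip_to_temp("path/to/archive.zip")
```

Keep in mind that the temporary directory is removed when the R session
ends, so this is not suitable if you want to inspect the files later.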
List files
There are many ways to generate a list of all files: list.files() or
using system() in conjunction with find on unix-like systems are
common options.
For demonstration purposes I use files contained in jstor which can be
accessed via system.file:
example_dir <- system.file("extdata", package = "jstor")
list.files
# pattern is a regular expression, so match the literal ".xml" suffix
file_names_listed <- list.files(path = example_dir, full.names = TRUE,
                                pattern = "\\.xml$")
file_names_listed
#> [1] "/usr/local/lib/R/site-library/jstor/extdata/article_with_footnotes.xml"
#> [2] "/usr/local/lib/R/site-library/jstor/extdata/article_with_references.xml"
#> [3] "/usr/local/lib/R/site-library/jstor/extdata/book.xml"
#> [4] "/usr/local/lib/R/site-library/jstor/extdata/parsed_references.xml"
system and find
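library(stringr)
# the listing step is not shown in this text; one possible call
# (an assumption, unix-like systems only):
file_names <- system(paste0("cd ", shQuote(example_dir),
                            "; find . -name '*.xml' -type f"), intern = TRUE)
# strip the leading "./" and prepend the directory to get full paths
file_names_system <- file_names %>%
  str_replace("^\\.\\/", "") %>%
  str_c(example_dir, "/", .)
file_names_system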
#> [1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/jstor/extdata/sample_with_footnotes.xml"
#> [2] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/jstor/extdata/sample_book.xml"
#> [3] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/jstor/extdata/sample_with_references.xml"
In this case the two approaches give the same result. The main
difference seems to be, though, that list.files sorts the output,
whereas find does not. For a large number of files (200,000) this
makes list.files slower; for smaller datasets the difference shouldn’t
make an impact.
Batch import
Once the file list is generated, we can apply any of the jst_get_*
functions to the list. A good and simple way for small to moderate
amounts of files is to use purrr::map_df():
# only work with journal articles
article_paths <- file_names_listed %>%
  keep(str_detect, "with")
article_paths %>%
  map_df(jst_get_article) %>%
  knitr::kable()
file_name | journal_doi | journal_jcode | journal_pub_id | journal_title | article_doi | article_pub_id | article_jcode | article_type | article_title | volume | issue | language | pub_day | pub_month | pub_year | first_page | last_page | page_range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
article_with_footnotes | NA | washhistq | NA | The Washington Historical Quarterly | NA | NA | 40428382 | research-article | The Nisqually Journal (Continued) | 13 | 2 | eng | 1 | 4 | 1922 | 131 | 141 | NA |
article_with_references | NA | tranamermicrsoci | NA | Transactions of the American Microscopical Society | 10.2307/3221896 | NA | NA | research-article | On the Protozoa Parasitic in Frogs | 41 | 2 | eng | 1 | 4 | 1922 | 59 | 76 | 59-76 |
This works well if 1) there are no errors and 2) the number of files
is moderate. For larger numbers of files, jst_import() can streamline
the process for you. This function works very similarly to
jst_import_zip(), the main difference being that it needs file paths
as input and can only handle one type of output.
jst_import(article_paths, out_file = "my_second_test", .f = jst_get_article,
           out_path = tmp)
#> Starting to import 2 file(s).
#> Finished importing 2 file(s) in 0.03 secs.
The result is again written to disk, as can be seen below:
list.files(tmp, pattern = "csv")
#> [1] "my_second_test-1.csv"
#> [2] "my_test_book_chapter_jst_get_book-1.csv"
#> [3] "my_test_journal_article_jst_get_article-1.csv"
#> [4] "my_test_journal_article_jst_get_authors-1.csv"
#> [5] "my_test_ngram1_jst_get_ngram-1.csv"
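The first caveat mentioned above (a single unreadable file aborts a
plain map_df() call) can be illustrated with a small base-R loop. This
is a generic sketch and not how jst_import() is implemented;
import_tolerant() and the toy parser are hypothetical names.

```r
# Hypothetical helper: apply a parser to every path, keep going on errors,
# and report which files failed.
import_tolerant <- function(paths, parser) {
  results <- lapply(paths, function(p) {
    tryCatch(parser(p), error = function(e) e)
  })
  failed <- vapply(results, inherits, logical(1), what = "error")
  list(
    data = do.call(rbind, results[!failed]),
    errors = data.frame(
      path = paths[failed],
      message = vapply(results[failed], conditionMessage, character(1)),
      stringsAsFactors = FALSE
    )
  )
}

# toy parser that fails for one input, standing in for a jst_get_* function
toy_parser <- function(p) {
  if (grepl("broken", p)) stop("could not parse ", p)
  data.frame(file_name = p, stringsAsFactors = FALSE)
}

out <- import_tolerant(c("a.xml", "broken.xml", "b.xml"), toy_parser)
nrow(out$data)    # 2 files parsed
out$errors$path   # "broken.xml"
```

Collecting the failures in a separate table makes it easy to inspect
the problematic files afterwards instead of restarting the whole import.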