This function applies an import function to a list of xml-files, or to a .zip-archive in the case of jst_import_zip, and saves the output in batches of .csv-files to disk.
Usage
jst_import(
  in_paths,
  out_file,
  out_path = NULL,
  .f,
  col_names = TRUE,
  n_batches = NULL,
  files_per_batch = NULL,
  show_progress = TRUE
)
jst_import_zip(
  zip_archive,
  import_spec,
  out_file,
  out_path = NULL,
  col_names = TRUE,
  n_batches = NULL,
  files_per_batch = NULL,
  show_progress = TRUE,
  rows = NULL
)
Arguments
- in_paths
  A character vector of paths to the xml-files which should be imported.
- out_file
  Name of the files to export to. Each batch is appended with an increasing number.
- out_path
  Path to export files to (combined with filename).
- .f
  Function to use for import. Can be one of jst_get_article, jst_get_authors, jst_get_references, jst_get_footnotes, jst_get_book or jst_get_chapter.
- col_names
  Should column names be written to file? Defaults to TRUE.
- n_batches
  Number of batches, defaults to 1.
- files_per_batch
  Number of files for each batch. Can be used instead of n_batches, but not in conjunction with it.
- show_progress
  Displays a progress bar for each batch, if the session is interactive.
- zip_archive
  A path to a .zip-archive from DfR.
- import_spec
  A specification from jst_define_import for which parts of a .zip-archive should be imported via which functions.
- rows
  Mainly used for testing, to decrease the number of files which are imported (e.g. 1:100).
Details
Along the way, we wrap three functions which make the process of converting many files easier. When using one of the jst_get_* functions, there should usually be no errors. To keep the whole computation from failing in the unlikely event that an error does occur, we use safely(), which lets us continue the process and catch the error along the way.
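To illustrate the error-catching idea on its own, here is a minimal sketch using purrr::safely() (which is what safely() refers to); the parse_one function below is a made-up stand-in for an import function, not part of this package:

```r
library(purrr)

# Hypothetical import function that fails on one malformed file
parse_one <- function(path) {
  if (grepl("broken", path)) stop("malformed xml")
  data.frame(file = path, stringsAsFactors = FALSE)
}

# safely() returns a wrapped function that catches errors instead of raising them
safe_parse <- safely(parse_one)

results <- lapply(c("a.xml", "broken.xml", "b.xml"), safe_parse)

# Each element has $result and $error, so failures can be inspected afterwards
errors <- Filter(function(x) !is.null(x$error), results)
```

The failing file does not abort the loop; it simply ends up with a non-NULL $error entry that can be reviewed after the import finishes.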
If you have many files to import, you might benefit from executing the function in parallel. We use futures for this to give you maximum flexibility. By default the code is executed sequentially. If you want to run it in parallel, simply call future::plan() with future::multisession() as an argument before running jst_import or jst_import_zip.
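The plan only needs to be set once per session. A minimal sketch of switching strategies, using only the future package mentioned above:

```r
library(future)

# The default plan evaluates everything sequentially
plan(sequential)

# Switch to parallel evaluation in separate background R sessions
plan(multisession)

# Any subsequent call to jst_import() or jst_import_zip() now runs in parallel
```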
After importing all files, they are written to disk with readr::write_csv().
Since you might run out of memory when importing a large quantity of files, you can split up the files to import into batches. Each batch is treated separately, so for each batch multiple processes from future::multisession() are spawned, if you added this plan. For this reason, it is not recommended to use very small batches, as there is an overhead for starting and ending the processes. On the other hand, the batches should not be too large, so as not to exceed memory limitations. A value of 10,000 to 20,000 for files_per_batch should work fine on most machines. If the session is interactive and show_progress is TRUE, a progress bar is displayed for each batch.
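As a back-of-the-envelope check of how files_per_batch determines the number of batches (the file count here is invented for illustration):

```r
# Suppose 45000 files are to be imported in batches of 20000
n_files <- 45000
files_per_batch <- 20000

# Number of batches (and thus output files) this implies
n_batches <- ceiling(n_files / files_per_batch)  # 3
```

The last batch is simply smaller than the others; per the out_file argument, each batch's output name is appended with an increasing number.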
Examples
if (FALSE) { # \dontrun{
# read from file list --------
# find all files
meta_files <- list.files(pattern = "xml", full.names = TRUE)
# import them via `jst_get_article`
jst_import(meta_files, out_file = "imported_metadata", .f = jst_get_article,
files_per_batch = 25000)
# do the same, but in parallel
library(future)
plan(multisession)
jst_import(meta_files, out_file = "imported_metadata", .f = jst_get_article,
files_per_batch = 25000)
# read from zip archive ------
# define imports
imports <- jst_define_import(article = c(jst_get_article, jst_get_authors))
# convert the files to .csv
jst_import_zip("my_archive.zip", out_file = "my_out_file",
import_spec = imports)
} # }