
This function applies an import function to a list of xml-files (or, in the case of jst_import_zip, to a .zip-archive) and saves the output to disk in batches of .csv-files.

Usage

jst_import(
  in_paths,
  out_file,
  out_path = NULL,
  .f,
  col_names = TRUE,
  n_batches = NULL,
  files_per_batch = NULL,
  show_progress = TRUE
)

jst_import_zip(
  zip_archive,
  import_spec,
  out_file,
  out_path = NULL,
  col_names = TRUE,
  n_batches = NULL,
  files_per_batch = NULL,
  show_progress = TRUE,
  rows = NULL
)

Arguments

in_paths

A character vector of paths to the xml-files which should be imported.

out_file

Name of files to export to. An increasing number is appended to the name for each batch.

out_path

Path to export files to (combined with filename).

.f

Function to use for import. Can be one of jst_get_article, jst_get_authors, jst_get_references, jst_get_footnotes, jst_get_book or jst_get_chapter.

col_names

Should column names be written to file? Defaults to TRUE.

n_batches

Number of batches, defaults to 1.

files_per_batch

Number of files for each batch. Can be used instead of n_batches, but not in conjunction.

show_progress

Displays a progress bar for each batch, if the session is interactive.

zip_archive

A path to a .zip-archive from DfR

import_spec

A specification from jst_define_import for which parts of a .zip-archive should be imported via which functions.

rows

Mainly used for testing, to decrease the number of files which are imported (e.g. 1:100).

Value

Writes .csv-files to disk.

Details

Along the way, we wrap three functions, which make the process of converting many files easier:

When using one of the jst_get_* functions, there should usually be no errors. To prevent the whole computation from failing in the unlikely event that an error occurs, we use safely(), which lets us continue the process and catch the error along the way.
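As a rough illustration of the error-capturing behaviour described above, the following sketch wraps a stand-in parser with purrr::safely(). The function parse_file and the file names are hypothetical and only serve the example; this is not the package's internal code.

```r
library(purrr)

# Hypothetical parser standing in for an import function:
# it fails on one specific file and succeeds otherwise.
parse_file <- function(path) {
  if (path == "bad.xml") stop("could not parse ", path)
  nchar(path)
}

# safely() returns a wrapped function that never throws. Instead it
# returns a list with the elements `result` and `error`.
safe_parse <- safely(parse_file)

paths <- c("a.xml", "bad.xml", "ccc.xml")
results <- map(paths, safe_parse)

# Separate the failures from the successful results.
failed <- map_lgl(results, ~ !is.null(.x$error))
values <- map(results[!failed], "result")
```

Here the failing file does not abort the loop: its entry carries the error condition, while the remaining entries carry their results.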

If you have many files to import, you might benefit from executing the function in parallel. We use futures for this to give you maximum flexibility. By default the code is executed sequentially. If you want to run it in parallel, call future::plan(future::multisession) before running jst_import or jst_import_zip.

After importing all files, they are written to disk with readr::write_csv().

Since you might run out of memory when importing a large quantity of files, you can split up the files to import into batches. Each batch is treated separately, so if you set the future::multisession() plan, multiple processes are spawned for each batch. For this reason, very small batches are not recommended, as there is an overhead for starting and ending the processes. On the other hand, the batches should not be so large that they exceed memory limitations. A value of 10000 to 20000 for files_per_batch should work fine on most machines. If the session is interactive and show_progress is TRUE, a progress bar is displayed for each batch.
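To make the idea of batching more concrete, here is a minimal base-R sketch of how a vector of file paths could be divided by a files_per_batch value. The helper split_into_batches is hypothetical and the file names are made up; the package may form its batches differently.

```r
# Hypothetical helper (not part of the package): assign each path to a
# batch of at most `files_per_batch` files and return a list of batches.
split_into_batches <- function(paths, files_per_batch) {
  stopifnot(files_per_batch >= 1)
  batch_id <- ceiling(seq_along(paths) / files_per_batch)
  unname(split(paths, batch_id))
}

# Seven made-up file names, split into batches of three.
paths <- sprintf("file_%02d.xml", 1:7)
batches <- split_into_batches(paths, files_per_batch = 3)

# Number of files in each batch; the last batch holds the remainder.
lengths(batches)
```

With a small files_per_batch there are many batches and, under a parallel plan, many rounds of process start-up. With a large value each batch holds more data in memory at once.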

Examples

if (FALSE) { # \dontrun{
# read from file list --------
# find all files
meta_files <- list.files(pattern = "xml", full.names = TRUE)

# import them via `jst_get_article`
jst_import(meta_files, out_file = "imported_metadata", .f = jst_get_article,
           files_per_batch = 25000)
           
# do the same, but in parallel
library(future)
plan(multisession)
jst_import(meta_files, out_file = "imported_metadata", .f = jst_get_article,
           files_per_batch = 25000)

# read from zip archive ------ 
# define imports
imports <- jst_define_import(article = c(jst_get_article, jst_get_authors))

# convert the files to .csv
jst_import_zip("my_archive.zip", out_file = "my_out_file", 
                 import_spec = imports)
} # }