Result dumpers are functions that handle chunks of results from an OAI-PMH service "on the fly". Handling can include processing the chunks or writing them to files, databases, and so on.
Usage
dump_raw_to_txt(
  res,
  args,
  as,
  file_pattern = "oaidump",
  file_dir = ".",
  file_ext = ".xml"
)

dump_to_rds(
  res,
  args,
  as,
  file_pattern = "oaidump",
  file_dir = ".",
  file_ext = ".rds"
)

dump_raw_to_db(res, args, as, dbcon, table_name, field_name, ...)
Arguments
- res
results, depends on the as argument, not to be specified by the user
- args
list, query arguments, not to be specified by the user
- as
character, type of result to return, not to be specified by the user
- file_pattern, file_dir, file_ext
character, respectively: the initial part of the file name, the directory name, and the file extension used to create file names. These arguments are passed to the tempfile() arguments pattern, tmpdir, and fileext, respectively.
- dbcon
DBI-compliant database connection
- table_name
character, name of the database table to write into
- field_name
character, name of the field in database table to write into
- ...
arguments passed to/from other functions
Value
Dumpers should return NULL or a value that will be collected and returned by the function using the dumper.
dump_raw_to_txt returns the name of the created file.
dump_to_rds returns the name of the created file.
dump_raw_to_db returns NULL.
Details
Often the result of a request to an OAI-PMH service is so large that it is
split into chunks that need to be requested separately using a
resumptionToken. By default, functions like list_identifiers() or
list_records() request these chunks under the hood and return them all
concatenated in a single R object. This is convenient but insufficient when
dealing with large result sets that might not fit into RAM. A result dumper
is a function that is called on each result chunk. Dumper functions can
write chunks to files or databases, perform initial pre-processing or
extraction, and so on.
A result dumper needs to be a function that accepts at least the arguments
res, args, and as. The enclosing function supplies their values internally.
There may be additional arguments, including ... .
Dumpers should return NULL or a value that will be collected and returned
by the function calling the dumper (e.g. list_records()).
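The contract described above can be sketched with a small custom dumper. This is a minimal illustration, not code from the package: dump_chunk_size() and harvest_with_dumper() are hypothetical names, and the harness only imitates how a harvesting function might call a dumper once per chunk and collect its return values.

```r
# Hypothetical custom dumper: keeps only the size of each raw XML chunk
# instead of holding the chunk itself in memory.
dump_chunk_size <- function(res, args, as, ...) {
  # This dumper only makes sense for raw XML text
  stopifnot(as == "raw")
  nchar(res)
}

# Stand-in for a harvesting function (illustration only, not the package's
# internals): calls the dumper on every chunk and collects what it returns.
harvest_with_dumper <- function(chunks, dumper, dumper_args = list()) {
  out <- lapply(chunks, function(chunk) {
    do.call(dumper, c(list(res = chunk, args = list(), as = "raw"), dumper_args))
  })
  unlist(out)
}

# Two made-up chunks standing in for responses from a service
chunks <- c("<record>a</record>", "<record>bb</record>")
sizes <- harvest_with_dumper(chunks, dump_chunk_size)
sizes
```

A dumper that returns NULL for every chunk would leave nothing to collect, which suits dumpers whose only job is a side effect such as writing to disk.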
Currently result dumpers can be used with the functions list_identifiers(),
list_records(), and list_sets().
To use a dumper with one of these functions you need to:
- Pass it as the additional argument dumper
- Pass any optional additional arguments to the dumper function in a list as the dumper_args argument
See Examples. Below we provide more details on the dumpers currently implemented.
dump_raw_to_txt writes raw XML to text files. It requires as == "raw".
File names are created using tempfile(). By default they are written in the
current working directory and have the format oaidump*.xml, where * is a
random string in hex.
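The file naming can be tried out directly with base R. The sketch below assumes only what is stated above, namely that file_pattern, file_dir, and file_ext map onto the pattern, tmpdir, and fileext arguments of tempfile(); the XML string is made-up stand-in content, and a temporary directory is used instead of the default "." to avoid cluttering the working directory.

```r
# Build a file name from a prefix, a random part, and an extension
fname <- tempfile(pattern = "oaidump", tmpdir = tempdir(), fileext = ".xml")

# Write a stand-in raw XML chunk (made-up content) to that file
writeLines("<OAI-PMH><!-- one result chunk --></OAI-PMH>", fname)

basename(fname)     # starts with "oaidump" and ends with ".xml"
file.exists(fname)

# clean-up
unlink(fname)
```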
dump_to_rds saves results in an .rds file via saveRDS(). The type of object
being saved is determined by the as argument. File names are generated in
the same way as by dump_raw_to_txt, but with the default extension .rds.
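Once chunks have been saved as .rds files, they can be read back with readRDS() and combined as needed. The following is a sketch with made-up data frames standing in for result chunks; it imitates the one-file-per-chunk layout described above rather than calling the dumper itself.

```r
# Stand-ins for two result chunks (made-up data, not real harvest output)
chunks <- list(
  data.frame(identifier = c("oai:1", "oai:2")),
  data.frame(identifier = "oai:3")
)

# Save each chunk to its own .rds file, as a dumper would
fnames <- vapply(chunks, function(chunk) {
  f <- tempfile(pattern = "oaidump", tmpdir = tempdir(), fileext = ".rds")
  saveRDS(chunk, f)
  f
}, character(1))

# Later: read the chunks back one by one and bind them together
restored <- do.call(rbind, lapply(fnames, readRDS))
nrow(restored)

# clean-up
unlink(fnames)
```

Reading files one at a time also makes it possible to process each chunk separately when the combined result would not fit into memory.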
dump_raw_to_db writes raw XML to a single text column of a table in a
database. It requires as == "raw". The database connection dbcon should be
a connection object as created by DBI::dbConnect() from the package DBI. As
such, it can connect to any database supported by DBI. The records are
written to a field field_name in a table table_name using
DBI::dbWriteTable(). If the table does not exist, it is created. If it
does, the records are appended. Any additional arguments are passed to
DBI::dbWriteTable().
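The create-then-append behaviour can be illustrated with DBI::dbWriteTable() and append = TRUE. This is a sketch under assumptions: it requires the DBI and RSQLite packages, write_chunk() is a hypothetical helper that only imitates the behaviour described above, and the XML strings are made up.

```r
# Assumes the DBI and RSQLite packages are installed
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Hypothetical helper: write one raw XML chunk into a single text column
write_chunk <- function(con, chunk, table_name, field_name) {
  df <- data.frame(chunk, stringsAsFactors = FALSE)
  names(df) <- field_name
  DBI::dbWriteTable(con, table_name, df, append = TRUE)
  invisible(NULL)
}

write_chunk(con, "<chunk>1</chunk>", "foo", "bar")  # table "foo" is created
write_chunk(con, "<chunk>2</chunk>", "foo", "bar")  # a second row is appended

# One row per chunk written
n <- DBI::dbGetQuery(con, "SELECT count(*) AS no_records FROM foo")$no_records
n

DBI::dbDisconnect(con)
```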
References
OAI-PMH specification https://www.openarchives.org/OAI/openarchivesprotocol.html
See also
Functions supporting the dumpers: list_identifiers(), list_sets(), and list_records()
Examples
if (FALSE) { # \dontrun{
### Dumping raw XML to text files
# This will write a set of XML files to a temporary directory
fnames <- list_identifiers(from="2018-06-01T",
                           until="2018-06-14T",
                           as="raw",
                           dumper=dump_raw_to_txt,
                           dumper_args=list(file_dir=tempdir()))
# vector of file names created
str(fnames)
all( file.exists(fnames) )
# clean-up
unlink(fnames)
### Dumping raw XML to a database
# Connect to in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), dbname=":memory:")
# Harvest and dump the results into field "bar" of table "foo"
list_identifiers(from="2018-06-01T",
                 until="2018-06-14T",
                 as="raw",
                 dumper=dump_raw_to_db,
                 dumper_args=list(dbcon=con,
                                  table_name="foo",
                                  field_name="bar") )
# Count records, should be 101
DBI::dbGetQuery(con, "SELECT count(*) as no_records FROM foo")
DBI::dbDisconnect(con)
} # }