frictionless workflows with deposits
Mark Padgham
2024-12-04
Source:vignettes/frictionless.Rmd
frictionless.Rmd
As described in the
main README, the deposits package is designed to work with rOpenSci’s
frictionless
package for documentation of datasets. The
frictionless
package is an R implementation of the general “frictionless” workflow. All
deposits metadata, both for a deposit in general as described in the
accompanying metadata
vignette, and for the internal structure of datasets, are stored in
a single “frictionless” metadata file called “datapackage.json”.
This vignette describes how to generate those files, and how to use them to manage deposits metadata.
Generating “datapackage.json” files
The deposits package uses the frictionless
R package to automatically generate frictionless metadata files
named “datapackage.json” for any tabular input data. These files can
only be generated for one or more input data sets, and so can not be
generated for deposits metadata without accompanying data files or
resources.
The following code is repeated from the main vignette, and generates a local data file with some tabular data in an empty directory.
dir.create (file.path (tempdir (), "data"))
path <- file.path (tempdir (), "data", "beaver1.csv")
write.csv (datasets::beaver1, path, row.names = FALSE)
We then need to construct a deposits client, and specify metadata, in
this case those describing the
beaver1
dataset. The format of these metadata is
explained at length in the metadata
vignette.
metadata <- list (
creator = list (list (name = "P. S. Reynolds")),
created = "1994-01-01T00:00:00",
title = "Time-series analyses of beaver body temperatures",
description = "Original source of 'beaver' dataset."
)
cli <- depositsClient$new (service = "zenodo", sandbox = TRUE, metadata = metadata)
Generation by uploading files
The main vignette demonstrates how to use these metadata to initiate
a new deposit on the Zenodo sandbox service, and then to upload the
local data. Calling the
deposit_upload_file()
method automatically generates a
“datapackage.json” file (if one does not already exist), adds deposits
metadata, and also uploads that file to the deposits service.
cli$deposit_new ()
#> ID of new deposit : 1162420
cli$deposit_upload_file (path = path)
#> frictionless metadata file has been generated as '/tmp/RtmpvRre5Z/data/datapackage.json'
The client then lists two files that were uploaded, and the local directory also now has a “datapackage.json” file:
cli$hostdata$files$filename
#> [1] "beaver1.csv" "datapackage.json"
list.files (file.path (tempdir (), "data"))
#> [1] "beaver1.csv" "datapackage.json"
The deposits package inserts an additional “metadata” field into the
frictionless metadata file, containing the deposits metadata defined
above. This method of generating frictionless metadata files requires
data to first be uploaded to a service, and will automatically upload
the frictionless metadata file. Any subsequent editing then requires
repeated calls to the
deposit_upload_file()
method to update the contents of
this file.
Local generation
A frictionless metadata file can also first be generated locally, to
allow editing prior to any uploading. This is achieved with the
deposit_add_resource()
method. As mentioned, the
frictionless workflow requires a data “resource” to exist. The resource
in the above example is the locally-stored “beaver.csv” file. Presuming
this file to exist, we can again initiate a deposit, and then to call deposit_add_resource()
,
instead of deposit_upload_file()
.
cli <- depositsClient$new (service = "zenodo", sandbox = TRUE, metadata = metadata)
cli$deposit_add_resource (path = path)
That call will generate a local frictionless metadata file (if one
does not already exist), and fill it with the deposits metadata, without
initiating a new deposit or uploading any files. Whether generated and
immediately uploading through calling deposit_upload_file()
,
or locally generated only through calling deposit_add_resource()
,
the frictionless data file can then be edited and updated as described
in the following sub-section.
Note that calling either of these methods connects the client to the
local directory containing the “datapackage.json” file and other data
files. Printing the client then produces additional information
including a local_path
identifying the directory, along
with counts of both local and remote “resources” or files.
Reading and editing “datapackage.json” files
The “datapackage.json” file can be read with the frictionless::read_package()
function, returning a named list of metadata entries:
library (frictionless)
path <- file.path (tempdir (), "data", "datapackage.json")
metadata <- read_package (path)
names (metadata)
#> [1] "profile" "metadata" "resources" "directory"
The “profile”, “resources”, and “directory” items are all generated
by frictionless
,
while the “metadata” items holds the deposits metadata entered
above:
metadata$metadata
#> $created
#> [1] "1994-01-01T00:00:00"
#>
#> $creator
#> $creator[[1]]
#> $creator[[1]]$name
#> [1] "P. S. Reynolds"
#>
#> $description
#> [1] "Original source of 'beaver' dataset."
#>
#> $title
#> [1] "Time-series analyses of beaver body temperatures"
The frictionless
aspects of the metadata are by default
automatically generated by that package, and generally benefit from the
kind of editing and enhancing described in the
main frictionless
vignette, including such things as
adding descriptions of variables. The deposits-specific “metadata”
component can also readily be extended and edited as desired.
The data in these files can be edited in two primary ways, through either:
- Editing the individual list items in R, and saving the result via
frictionless::write_package()
; or, - Directly editing the “datapackage.json” with a text editor.
We recommend the second method, as it enables the simplest overview over the entire metadata structure for a given deposit. Once a frictionless “datapackage.json” file has been generated for a deposit, the recommended deposits workflow is for all editing and updating of metadata to be done by directly editing that file, as explained in the metadata vignette.
Updating “datapackage.json” on deposits service
Any changes to a local “datapackage.json” file can be imported into a
deposits client with the deposit_update()
method, which will also update the remote version of the
“datapackage.json” file held on the deposits service. Because the
initial deposit_upload_file()
and deposit_add_resource()
methods both connected the client to the local directory containing the
deposit data, the update method can be called directly without any
parameters. Specific paths can nevertheless be passed to the deposit_update()
method, to update only specified files while ignoring changes in any
other files. For example, the path
argument can specify the
path to the single “datapackage.json” file, in which case only that file
will be uploaded, regardless of any local modifications to other
files.
cli$deposit_update ()
#> Local file at [/tmp/Rtmp5QfAEc/data/beaver1.csv] is identical on host and will not be uploaded.
#> Local file at [/tmp/Rtmp5QfAEc/data/datapackage.json] has changed and will now be uploaded.
Calling that method will update both the “metadata” and “hostdata” elements of the deposits client to reflect any changes made to the “datapackage.json” file, and will also update the remote version of that file, as indicated in the messages produced by calling that method.