Data provenance
Ben Raymond, Michael Sumner
2024-12-11
Source: vignettes/data_provenance.Rmd
Bowerbird maintains a local collection of data files, sourced from external data providers. When an analysis uses such files, it is useful to know which files were used and the provenance of those files: this information helps make analyses reproducible.
Bowerbird itself actually knows very little about the data files that it maintains, particularly if it is using the `bb_handler_rget` method. In this case, for a given data source it typically knows only the URL and the flags to pass to `bb_rget`, along with some basic metadata (primarily intended to be read by the user). This is by design: the heavy lifting involved in mirroring a remote data source is generally handballed to `bb_rget`. This simplifies both the bowerbird code and the process of writing and maintaining data sources.
The downside of this is that bowerbird can’t necessarily answer questions about data provenance in specific detail. Say that we are using data from a data source that contains many separate files, one per day, spanning many years, and we run an analysis that uses only files from the year 1999. Ideally we would like a function that could tell us exactly which files from the complete data set are needed for our particular analysis, but this is impossible without specific knowledge of how the data source is structured. Analogous situations exist with data sources that split geographic space across files, or split different parameters across files.
Despite this, bowerbird can provide some general information to assist with data provenance issues.
Which files were used in an analysis
Say we have a bowerbird config:
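(A minimal sketch, assuming the example data sources bundled with bowerbird; the local path is a placeholder.)

```r
library(bowerbird)

## a configuration with a local file root, populated with the example
## data sources that ship with the package
cf <- bb_config(local_file_root = "/some/local/path") %>%
  bb_add(bb_example_sources())
```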
We can ask bowerbird where these files are stored locally:
```r
source_dirs <- bb_data_source_dir(cf)
knitr::kable(data.frame(name = cf$data_sources$name, source_dir = source_dirs))
```
| name | source_dir |
|---|---|
| NOAA OI SST V2 | /some/local/path/downloads.psl.noaa.gov/Datasets/noaa.oisst.v2/ |
| Australian Election 2016 House of Representatives data | /some/local/path/results.aec.gov.au/20499/Website/ |
| CMEMS global gridded SSH reprocessed (1993-ongoing) | /some/local/path/data.marine.copernicus.eu/SEALEVEL_GLO_PHY_L4_MY_008_047 |
| Oceandata SeaWiFS Level-3 mapped monthly 9km chl-a | /some/local/path/oceandata.sci.gsfc.nasa.gov/SeaWiFS/Mapped/Monthly/9km/chlor |
| Sea Ice Trends and Climatologies from SMMR and SSM/I-SSMIS, Version 3 | /some/local/path/daacdata.apps.nsidc.org/pub/DATASETS/nsidc0192_seaice_trends_climo_v3/ |
| Bathymetry of Lake Superior | /some/local/path/www.ngdc.noaa.gov/mgg/greatlakes/superior/data/ |
The directory for each data source will contain all of the files associated with that data source, not just the files used in a particular analysis. (In extreme cases this directory might even contain files from other data sources, although this should only happen if multiple data sources have overlapping `source_url` paths, which ought not to be a common occurrence.)
So while this directory is not the minimal set of files needed to reproduce an analysis, it does at least contain that set. It could therefore be deposited in an online repository that assigns DOIs (such as figshare, see https://cran.r-project.org/package=rfigshare, or Zenodo), or bundled into a Docker image (see e.g. https://github.com/o2r-project/containerit).
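For example, one might bundle a data source's directory into an archive for deposit (a sketch only; the archive path is illustrative, and `utils::zip` requires a zip tool to be available on the system):

```r
## bundle the files of the first data source into a zip archive for deposit
zipfile <- file.path(tempdir(), "my_data_source.zip")
utils::zip(zipfile, files = source_dirs[1])
```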
The complete list of files within a data source's directory can be generated using a standard directory listing (e.g. `list.files(path = my_source_dir, recursive = TRUE)`), but see also the `bb_fingerprint` function described below, which provides additional information about the files.
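For instance, using the `source_dirs` computed above:

```r
## full recursive listing of the files held for the first data source
all_files <- list.files(path = source_dirs[1], recursive = TRUE, full.names = TRUE)
head(all_files)
```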
Refining the list of files
Ascertaining exactly which files were used in a particular analysis is a task better handled by the code that reads the files and performs the analysis, not by the repository-management (bowerbird) code.
We are unaware of any general solutions that will keep track of the files used by an arbitrary chunk of R code.
The recordr package (https://github.com/NCEAS/recordr) comes close: it will collect information about which files were read or written. However, this only works for file types that have read functions implemented in recordr. At the time of writing, this does not cover netCDF files or other formats typically used for environmental data, so its application in that domain is likely to be of limited value.
Packages that provide specific data access functionality (e.g. https://github.com/AustralianAntarcticDivision/raadtools) might provide mechanisms for tracking which data files are needed in order to fulfil particular data queries.
The provenance of files used in an analysis
Provenance: where the files came from, when they were downloaded, their version, etc.
The `bb_fingerprint` function, given a data repository configuration, will return the timestamp of download and hashes of all files associated with its data sources. Thus, for all of these files, we have established where they came from (the data source ID), when they were downloaded, and a hash so that later versions of those files can be compared to detect changes.
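A minimal usage sketch (see the `bb_fingerprint` help for the exact format of the returned object):

```r
## file names, data source IDs, download times, and file hashes for all
## files in the repository
fp <- bb_fingerprint(cf)
fp
```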
A word on digital object identifiers
When a data source is defined, its DOI (if it has one, otherwise some other unique identifier) should be used as its data source ID. The idea of a DOI is that a data set that has one should be persistent (accessible for long-term use), and the DOI should uniquely identify the data resource to which it is assigned. When a data set changes in a substantial manner, and/or it is necessary to identify both the original and the changed material, a new DOI should be assigned. Thus, knowing the DOI gives some indication of the data provenance.
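As an illustration, a data source definition might use the data set's DOI as its `id` (a sketch only: the name, DOI, URLs, and other values here are placeholders):

```r
## hypothetical data source whose ID is the data set's DOI
src <- bb_source(
  name = "Example gridded data set",
  id = "10.1234/example.doi",  ## the DOI, used as the unique data source ID
  description = "An illustrative data source definition",
  doc_url = "https://example.org/dataset-docs",
  citation = "Example citation",
  source_url = "https://example.org/data/",
  license = "Please cite",
  method = list("bb_handler_rget", level = 2))
```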
However, it is worth noting that the DOI might not uniquely identify the state of the data source. For example, the NSIDC SMMR-SSM/I Nasateam near-real-time sea ice concentration data set (DOI http://doi.org/10.5067/U8C09DWVX9LM) is updated each day (new files are added) as new data is acquired by the satellites. At periodic intervals, these files are subjected to more rigorous quality control and post-processing, and then moved into a different data set (http://doi.org/10.5067/8GQ8LZQVL0VL). Thus, knowing the DOI of the near-real-time data set does not uniquely identify the files that were included in that data set at a given point in time.
Similar ambiguity can arise for other reasons, including data corrections. Say that one data file within a large collection was found to have errors and was corrected: it is at the discretion of the data provider whether the data set as a whole receives a new DOI because of this change.