Simple data retrieval and versioning using GitHub
Over the last several years, there has been an increasing recognition that data is a first-class scientific product, and a tremendous number of repositories and platforms have been developed to facilitate the storage, sharing, and re-use of data. However, we think there is still an important gap in this ecosystem: platforms for data sharing offer limited functions for distributing and interacting with evolving datasets, meaning those that continue to grow with time as more records are added, errors are fixed, and new data structures are created. This is particularly the case for the small- to medium-sized datasets that a typical scientific lab, or collection of labs, might produce.
In addition to enabling data creators to maintain and share a living dataset, such an infrastructure would ideally enable data users to:
This package can be used in two ways:
For both of these use-cases, datastorr will store your data using GitHub releases, which do not clog up your repository but allow files of up to 2GB to be stored (future versions may support things like figshare).
datastorr is built around a simple versioning scheme for your data. If you do not expect the version to change, that should not matter. But if you work with data that changes (and everyone does eventually), this approach should make it easy to update files.
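As a rough illustration of why an ordered versioning scheme is useful, version strings such as "0.0.1" can be compared in base R with `numeric_version`. The version strings and the `latest_version` helper below are hypothetical and are not part of datastorr.

```r
# Hypothetical helper (not part of datastorr): pick the most recent
# version from a character vector of version strings.
latest_version <- function(versions) {
  v <- numeric_version(versions)
  as.character(max(v))
}

# Made-up version strings, for illustration only.
versions <- c("0.0.1", "0.0.2", "0.0.10")

# Compared as versions, "0.0.10" is newer than "0.0.2", whereas a plain
# string sort would place it before "0.0.2".
latest <- latest_version(versions)
```

Comparing versions as versions rather than as strings is what makes "the most recent dataset" well defined once a dataset has more than a handful of releases.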
From the point of view of a user, using your data could be as simple as:
d <- datastorr::datastorr("richfitz/datastorr.example")
(see below for details, how this works, and what it is doing).
From the point of view of an end user, the aim is as follows.
They would install your package (which contains no data so is nice and light and can be uploaded to CRAN).
The user can see what versions they have locally
and can see what versions are present on GitHub:
datastorr.example::mydata_versions(local=FALSE) # remote
To download the most recent dataset:
d <- datastorr.example::mydata()
Subsequent calls (even across R sessions) are cached, so the mydata() function is fast enough that you can use it in place of the data.
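The caching behaviour described above can be sketched with base R alone. This is a minimal sketch, assuming one RDS file per version in a cache directory; `cached_fetch`, its arguments, and the file layout are hypothetical and are not datastorr's actual implementation.

```r
# Hypothetical sketch of per-version file caching (not datastorr's code).
# `fetch` is any function that takes a version string and returns the data.
cached_fetch <- function(version, fetch, cache_dir) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, paste0(version, ".rds"))
  if (!file.exists(path)) {
    # Cache miss: fetch (for example, download) once and store on disk.
    saveRDS(fetch(version), path)
  }
  # Cache hit, or freshly stored: read from disk.
  readRDS(path)
}

# Illustration with a stand-in for a slow download that counts its calls.
n_calls <- 0
slow_fetch <- function(version) {
  n_calls <<- n_calls + 1
  data.frame(version = version, value = 1:3)
}

# A fresh temporary directory is used here only for the illustration;
# a cache that survives across R sessions would need a persistent directory.
cache_dir <- tempfile("datastorr-sketch-")
d1 <- cached_fetch("0.0.1", slow_fetch, cache_dir)
d2 <- cached_fetch("0.0.1", slow_fetch, cache_dir)  # served from the cache
```

In this sketch the second call never invokes `slow_fetch`, which is the property that lets a function call stand in for the data itself.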
To get a particular version:
d <- datastorr.example::mydata("0.0.1")
Downloads are cached across sessions using
The simplest way is to run the (hidden) function
datastorr:::autogenerate(repo="richfitz/datastorr.example", read="readRDS", name="mydata")
which will print to the screen a bunch of code to add to your package. There will be a vignette explaining this more fully soon. A file generated in this way can be seen here.
Once set up, new releases can be made by running, within your package directory:
datastorr.example::mydata_release("description of release", "path/to/file")
provided you have your GITHUB_TOKEN environment variable set appropriately. See the vignette for more details.
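Before making a release, it may help to confirm that the token is visible to R. The check below is a minimal sketch using base R only; `has_github_token` is a hypothetical helper, and how datastorr itself reads the token is not shown here.

```r
# Hypothetical pre-release check (not part of datastorr): confirm that a
# GITHUB_TOKEN environment variable is set to a non-empty value.
has_github_token <- function() {
  nzchar(Sys.getenv("GITHUB_TOKEN"))
}

if (!has_github_token()) {
  message("GITHUB_TOKEN is not set; set it before making a release.")
}
```

This only checks that the variable is non-empty; it does not verify that the token is valid or has the permissions needed to create a release.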
MIT + file LICENSE © Rich FitzJohn.