Piggyback Data atop your GitHub Repository!
piggyback grew out of the needs of students both in my classroom and in my research group, who frequently need to work with data files somewhat larger than one can conveniently manage by committing directly to GitHub. As we frequently want to share and run code that depends on >50MB data files on each of our own machines, on continuous integration, and on larger computational servers, data sharing quickly becomes a bottleneck.
GitHub allows repositories to attach files of up to 2 GB each to releases as a way to distribute large files associated with the project source code. There is no limit on the number of files or on the bandwidth used to deliver them.
Install the latest release from CRAN using:
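A minimal sketch of the CRAN install, assuming the package is published there under the name piggyback:

```r
# Assumption: the CRAN package name matches the name used in this README.
install.packages("piggyback")
```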
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("ropensci/piggyback")
No authentication is required to download data from public GitHub repositories using piggyback. Even so, piggyback recommends setting a token when possible to avoid rate limits. To upload data to any repository, or to download data from private repositories, you will need to authenticate first.
To do so, add your GitHub token to an environment variable, e.g. in a .Renviron file in your home directory or project directory (any private place you won’t upload); see usethis::edit_r_environ(). For one-off use you can also set your token from the R console using Sys.setenv(). But try to avoid putting Sys.setenv() in any R scripts: remember, the goal here is to avoid writing your private token in any file that might be shared, even privately.
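To confirm that a token is visible to the current R session without printing it, a small check along these lines may help. This is only a sketch: the variable names GITHUB_PAT and GITHUB_TOKEN are assumptions about where the token is stored (common conventions, not something stated here), and has_github_token() is a hypothetical helper, not part of piggyback.

```r
# Hypothetical helper (not part of piggyback): report whether a token is set,
# without ever echoing its value to the console or a log.
# Assumption: the token lives in GITHUB_PAT or GITHUB_TOKEN.
has_github_token <- function(vars = c("GITHUB_PAT", "GITHUB_TOKEN")) {
  any(nzchar(Sys.getenv(vars)))
}

has_github_token()
```

Because the function returns only TRUE or FALSE, it is safe to leave in a shared script, unlike a Sys.setenv() call that contains the token itself.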
For more information, please see the usethis guide to GitHub credentials.
Download the latest version or a specific version of the data:
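As a sketch of such a call, assuming the example repository and release tag used elsewhere in this README, and assuming an asset named mtcars.tsv.gz is attached to that release (check with pb_list() first):

```r
library(piggyback)

# Assumption: an asset called "mtcars.tsv.gz" exists on release v0.0.1
# of the example repository.
pb_download("mtcars.tsv.gz",
            repo = "cboettig/piggyback-tests",
            tag  = "v0.0.1",
            dest = tempdir())
```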
Note: Whenever you are working from a location inside a git repository corresponding to your GitHub repo, you can simply omit the repo argument and it will be detected automatically. Likewise, if you omit the release tag, pb_download will simply pull data from the most recent release (latest). Third, you can omit tempdir() if you are using an RStudio Project (.Rproj file) in your repository, and then the download location will be relative to the project root. tempdir() is used throughout the examples only to meet CRAN policies and is unlikely to be the choice you actually want here.
Lastly, simply omit the file name to download all assets connected with a given release.
These defaults mean that in most cases, it is sufficient to simply call pb_download() without additional arguments to pull in any data associated with a project on a GitHub repo that is too large to commit to git directly.
pb_download() will skip the download of any file that already exists locally if the timestamp on the local copy is more recent than the timestamp on the GitHub copy.
pb_download() also includes arguments to control the timestamp behavior, progress bar, whether existing files should be overwritten, or if any particular files should not be downloaded. See function documentation for details.
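The timestamp rule described above can be pictured with a small stand-alone function. This is an illustrative sketch only, not piggyback's internal implementation; should_skip_download() and its arguments are hypothetical names, and the remote timestamp is assumed to be already known as a POSIXct value.

```r
# Illustrative sketch only, not piggyback's internal implementation.
# local_path:  path to the local copy of the file
# remote_time: POSIXct timestamp of the copy on GitHub (assumed known)
should_skip_download <- function(local_path, remote_time) {
  file.exists(local_path) && file.mtime(local_path) > remote_time
}

# Example: a freshly written local file is newer than a remote copy
# that was uploaded an hour ago, so the download would be skipped.
local_copy <- tempfile(fileext = ".tsv")
writeLines("x\ty", local_copy)
should_skip_download(local_copy, Sys.time() - 3600)
```

A file that does not exist locally is never skipped, because the existence check fails before the timestamps are compared.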
Sometimes it is preferable to have a URL from which the data can be read in directly, rather than downloading the data to a local file. For example, such a URL can be embedded directly into another R script, avoiding any dependence on piggyback (provided the repository is already public). To get a list of URLs rather than actually downloading the files, use pb_download_url():
pb_download_url("data/mtcars.tsv.gz", repo = "cboettig/piggyback-tests", tag = "v0.0.1")
If your GitHub repository doesn’t have any releases yet, piggyback will help you quickly create one. Create new releases to manage multiple versions of a given data file. While you can create releases as often as you like, making a new release is by no means necessary each time you upload a file. If maintaining old versions of the data is not useful, you can stick with a single release and upload all of your data there.
Once we have at least one release available, we are ready to upload. By default, pb_upload will attach data to the latest release.
By default, pb_upload() will overwrite any file of the same name already attached to the release, unless the timestamp of the previously uploaded version is more recent. You can toggle this behavior with the overwrite argument; see the function documentation for details.
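The release-then-upload workflow might look like the sketch below. It assumes write access to the example repository and a configured token, uses v0.0.2 as a purely hypothetical new tag, and assumes pb_new_release() is the release helper in the installed version (later versions of the package may offer pb_release_create() instead).

```r
library(piggyback)

# Write a small example data set to a gzip-compressed, tab-separated file.
write.table(mtcars, gzfile("mtcars.tsv.gz"), sep = "\t")

# Assumption: "v0.0.2" is a hypothetical new tag. Creating it requires
# write access to the repository and a configured GitHub token.
pb_new_release("cboettig/piggyback-tests", "v0.0.2")

# With no tag given, the file is attached to the latest release.
pb_upload("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
```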
List all files currently piggybacking on a given release. Omit the tag to see files on all releases.
pb_list(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
Delete a file from a release:
pb_delete(file = "mtcars.tsv.gz", repo = "cboettig/piggyback-tests", tag = "v0.0.1")
Note that this is irreversible unless you have a copy of the data elsewhere.
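To upload multiple files, pass a vector of file paths to pb_upload():
library(magrittr)
## upload a folder of data (full.names = TRUE keeps the "data/" prefix so the files can be found)
list.files("data", full.names = TRUE) %>%
  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
## upload certain file extensions (pattern takes a single regular expression, not a vector of globs)
list.files(pattern = "\\.(tsv\\.gz|tif|zip)$") %>%
  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")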
Similarly, you can download all current data assets of the latest or specified release by using pb_download() with no arguments.
To reduce API calls to GitHub, piggyback caches most calls with a timeout of 1 second by default. This avoids repeating identical requests to update its internal record of the repository data (releases, assets, timestamps, etc.) during programmatic use. You can increase or decrease this delay by setting the environment variable piggyback_cache_duration, in seconds, e.g. Sys.setenv("piggyback_cache_duration"=10) for a longer delay or Sys.setenv("piggyback_cache_duration"=0) to disable caching, and then restarting R.
GitHub assets attached to a release do not support file paths, and GitHub will convert most special characters (%, etc.) to . or throw an error (e.g. for file names containing /). piggyback will default to using only the base name of the file (i.e. it will use only "mtcars.csv" when given a longer file path ending in that name).
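One practical consequence of the base-name rule is that files in different directories can end up with the same asset name. The sketch below uses base R only and hypothetical local paths chosen for illustration; it shows how the names are derived and how to spot collisions before uploading.

```r
# Hypothetical local paths, chosen only for illustration.
paths <- c(file.path("data", "raw", "mtcars.csv"),
           file.path("data", "clean", "mtcars.csv"),
           file.path("data", "iris.csv"))

# The asset name is the base name: the directory part is dropped.
asset_names <- basename(paths)

# Names shared by more than one file would collide on the release.
collisions <- unique(asset_names[duplicated(asset_names)])
collisions
```

If collisions is non-empty, renaming the files so that each base name is unique avoids one upload overwriting another.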
piggyback is not intended as a data archiving solution. Importantly, bear in mind that there is nothing special about multiple “versions” in releases, as far as data assets uploaded by piggyback are concerned. The data files piggyback attaches to a release can be deleted or modified at any time: creating a new release to store data assets is the functional equivalent of just creating a new directory such as v0.2 to store your data. (GitHub Releases are always pinned to a particular git tag, so the code and git-managed contents associated with the repo are more immutable, but remember that our data assets just piggyback on top of the repo.)
Permanent, published data should always be archived in a proper data repository with a DOI, such as zenodo.org. Zenodo can freely archive public research data files up to 50 GB in size, and data is strictly versioned (once released, a DOI always refers to the same version of the data; new releases are given new DOIs).
piggyback is meant only to lower the friction of working with data during the research process (e.g. to make data accessible to collaborators or continuous integration systems while research is under way, including for private repositories).
GitHub documentation at the time of writing endorses the use of attachments to releases as a solution for distributing large files as part of your project.
Of course, it will be up to GitHub to decide if this use of release attachments is acceptable in the long term.