Piggyback Data atop your GitHub Repository!
Carl Boettiger & Tan Ho
2023-12-26
Why piggyback?
piggyback grew out of the needs of students both in my
classroom and in my research group, who frequently need to work with
data files somewhat larger than one can conveniently manage by
committing directly to GitHub. As we frequently want to share and run
code that depends on >50MB data files on each of our own machines, on
continuous integration, and on larger computational servers, data
sharing quickly becomes a bottleneck.
GitHub allows repositories to attach files of up to 2 GB each to releases as a way to distribute large files associated with the project source code. There is no limit on the number of files, nor on the bandwidth used to deliver them.
Authentication
No authentication is required to download data from public GitHub repositories using piggyback. Nevertheless, we recommend setting a token when possible to avoid rate limits. To upload data to any repository, or to download data from private repositories, you will need to authenticate first.
piggyback uses the same GitHub Personal Access Token (PAT) that devtools, usethis, and friends use (gh::gh_token()). The current best practice for managing your GitHub credentials is detailed in this usethis vignette.
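For instance, the usual interactive setup following that usethis guidance (these are usethis and gitcreds functions, not part of piggyback) looks like:

usethis::create_github_token()  # opens a browser to generate a new PAT
gitcreds::gitcreds_set()        # paste the token when prompted to store it in the git credential store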
You can also add the token as an environment variable, which may be useful in situations where you use piggyback non-interactively (e.g. automated scripts). Here are the relevant steps:
- Create a GitHub Token
- Add the environment variable:
  - via a project-specific Renviron:
    - usethis::use_git_ignore(".Renviron") to update your gitignore - this prevents accidentally committing your token to GitHub
    - usethis::edit_r_environ("project") to open the Renviron file, and then add your token, e.g. GITHUB_PAT=ghp_a1b2c3d4e5f6g7
  - via Sys.setenv(GITHUB_PAT = "ghp_a1b2c3d4e5f6g7") in your console for ad hoc usage. Avoid adding this line to your R scripts – remember, the goal here is to avoid writing your private token in any file that might be shared, even privately.
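To confirm that R can see your credentials, one quick check (a suggestion using the gh package, not a piggyback function) is:

# Reports which GitHub account, scopes, and token your R session authenticates with
gh::gh_whoami()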
Download Files
Download a file from a release:
pb_download(
file = "iris2.tsv.gz",
dest = tempdir(),
repo = "cboettig/piggyback-tests",
tag = "v0.0.1"
)
#> ℹ Downloading "iris2.tsv.gz"...
#> |======================================================| 100%
fs::dir_tree(tempdir())
#> /tmp/RtmpWxJSZj
#> └── iris2.tsv.gz
Some default behaviors to know about:
- The repo argument in most piggyback functions will default to detecting the relevant GitHub repo based on your current working directory’s git configs, so in many cases you can omit the repo argument.
- The tag argument in most functions defaults to “latest”, which typically refers to the most recently created release of the repository, unless there is a release specifically named “latest” or if you have marked a different release as “latest” via the GitHub UI.
- The dest argument defaults to your current working directory ("."). We use tempdir() to meet CRAN policies for the purposes of examples.
- The file argument in pb_download defaults to NULL, which will download all files connected to a given release:
pb_download(
repo = "cboettig/piggyback-tests",
tag = "v0.0.1",
dest = tempdir()
)
#> ℹ Downloading "diamonds.tsv.gz"...
#> |======================================================| 100%
#> ℹ Downloading "iris.tsv.gz"...
#> |======================================================| 100%
#> ℹ Downloading "iris.tsv.xz"...
#> |======================================================| 100%
fs::dir_tree(tempdir())
#> /tmp/RtmpWxJSZj
#> ├── diamonds.tsv.gz
#> ├── iris.tsv.gz
#> ├── iris.tsv.xz
#> └── iris2.tsv.gz
- The use_timestamps argument defaults to TRUE - notice that above, iris2.tsv.gz was not downloaded. If use_timestamps is TRUE, pb_download() will compare the local file timestamp against the GitHub file timestamp, and only download the file if it has changed.
pb_download() also includes arguments to control the progress bar and to skip particular files.
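For example, a quieter download that skips one of the assets might look like the following sketch - the ignore and show_progress arguments are assumptions here, so check ?pb_download for the current interface:

pb_download(
  repo = "cboettig/piggyback-tests",
  tag = "v0.0.1",
  dest = tempdir(),
  ignore = "diamonds.tsv.gz", # assumed argument: asset(s) to skip when downloading all files
  show_progress = FALSE       # assumed argument: suppress the progress bar
)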
Download URLs
Sometimes it is preferable to have a URL from which the data can be read in directly. These URLs can then be passed into another R function, which can be more elegant and performant than having to first download the files locally. Enter pb_download_url():
pb_download_url(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
#> [1] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/diamonds.tsv.gz"
#> [2] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/iris.tsv.gz"
#> [3] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/iris.tsv.xz"
#> [4] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/iris2.tsv.gz"
By default, this function returns the same download URL that you would get by visiting the release page, right-clicking on the file, and copying the link (aka the “browser_download_url”). This URL is served by GitHub’s web servers rather than its API servers, and is therefore less subject to rate limiting.
However, this URL is not accessible for private repositories, since
the auth tokens are handled by the GitHub API. You can retrieve the API
download url for private repositories by passing in "api"
to the url_type
argument:
pb_download_url(repo = "cboettig/piggyback-tests", tag = "v0.0.1", url_type = "api")
#> [1] https://api.github.com/repos/cboettig/piggyback-tests/releases/assets/44261315
#> [2] https://api.github.com/repos/cboettig/piggyback-tests/releases/assets/41841778
#> [3] https://api.github.com/repos/cboettig/piggyback-tests/releases/assets/18538636
#> [4] https://api.github.com/repos/cboettig/piggyback-tests/releases/assets/8990141
pb_download_url otherwise shares similar default behaviors with pb_download for the file, repo, and tag arguments.
Reading data for R usage
piggyback supports several general patterns for reading data into R, with increasing degrees of performance/efficiency (and complexity):

- pb_download() files to disk and then read them with a function that reads from disk into memory
- pb_download_url() a set of URLs and then pass those URLs to a function that retrieves them directly into memory
- Disk-based workflows, which require downloading all files first but can then perform queries before reading into memory
- Cloud-native workflows, which can perform queries directly on the URLs before reading into memory

We recommend the latter two approaches in cases where performance and efficiency matter, and have some vignettes with examples:
- cloud native workflows
- disk native workflows
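As a small taste of the cloud-native pattern, here is a minimal sketch assuming the DBI and duckdb packages (plus DuckDB's httpfs extension and internet access): the aggregation runs against the release URL and only the query result is read into memory.

library(DBI)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL httpfs;")
dbExecute(con, "LOAD httpfs;")

url <- pb_download_url("mtcars.csv", repo = "tanho63/piggyback-tests", tag = "v0.0.2")

# DuckDB scans the CSV directly from the URL and returns only the summary rows
dbGetQuery(con, sprintf(
  "SELECT cyl, AVG(mpg) AS avg_mpg FROM read_csv_auto('%s') GROUP BY cyl",
  url
))

dbDisconnect(con, shutdown = TRUE)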
Reading files
pb_read() is a wrapper around the first pattern - it downloads the file to a temporary file, reads that file into memory, and then deletes the temporary file. It works for both public and private repositories, handling authentication under the hood:
pb_read("mtcars.rds", repo = "tanho63/piggyback-private")
#> # A data.frame: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> # ℹ 27 more rows
#> # ℹ 1 more variable: carb <dbl>
pb_read("mtcars.parquet", repo = "tanho63/piggyback-private")
#> # A data.frame: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> # ℹ 27 more rows
#> # ℹ 1 more variable: carb <dbl>
By default, pb_read is programmed to use the following read_function for the corresponding file extensions:

- “.csv”, “.csv.gz”, “.csv.xz” are read with utils::read.csv()
- “.tsv”, “.tsv.gz”, “.tsv.xz” are read with utils::read.delim()
- “.rds” is read with readRDS()
- “.json” is read with jsonlite::fromJSON()
- “.parquet” is read with arrow::read_parquet()
- “.txt” is read with readLines()
If a file extension is not on this list, pb_read will raise an error and ask you to provide a read_function - you can also use this parameter to override the default read_function yourself:
pb_read(
file = "play_by_play_2023.qs",
repo = "nflverse/nflverse-data",
tag = "pbp",
read_function = qs::qread
)
#> # A tibble: 42,251 × 372
#> play_id game_id old_game_id home_team away_team season_type week posteam
#> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 1 2023_01_ARI_W… 2023091007 WAS ARI REG 1 NA
#> 2 39 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS
#> 3 55 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS
#> 4 77 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS
#> 5 102 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS
#> # ℹ 42,246 more rows
#> # ℹ 364 more variables: posteam_type <chr>, defteam <chr>, side_of_field <chr>,
#> # yardline_100 <dbl>, game_date <chr>, quarter_seconds_remaining <dbl>,
#> # half_seconds_remaining <dbl>, game_seconds_remaining <dbl>, game_half <chr>,
#> # quarter_end <dbl>, drive <dbl>, sp <dbl>, qtr <dbl>, down <dbl>,
#> # goal_to_go <dbl>, time <chr>, yrdln <chr>, ydstogo <dbl>, ydsnet <dbl>,
#> # desc <chr>, play_type <chr>, yards_gained <dbl>, shotgun <dbl>, …
Any read_function can be provided so long as it accepts the filename as the first argument, and you can pass any additional parameters via ...:
pb_read(
file = "play_by_play_2023.csv",
n_max = 10,
repo = "nflverse/nflverse-data",
tag = "pbp",
read_function = readr::read_csv
)
#> # A tibble: 10 × 372
#> play_id game_id old_game_id home_team away_team season_type week posteam
#> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 1 2023_01_ARI_W… 2023091007 WAS ARI REG 1 NA
#> 2 39 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS
#> 3 55 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS
#> 4 77 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS
#> 5 102 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS
#> # ℹ 5 more rows
#> # ℹ 364 more variables: posteam_type <chr>, defteam <chr>, side_of_field <chr>,
#> # yardline_100 <dbl>, game_date <chr>, quarter_seconds_remaining <dbl>,
#> # half_seconds_remaining <dbl>, game_seconds_remaining <dbl>, game_half <chr>,
#> # quarter_end <dbl>, drive <dbl>, sp <dbl>, qtr <dbl>, down <dbl>,
#> # goal_to_go <dbl>, time <chr>, yrdln <chr>, ydstogo <dbl>, ydsnet <dbl>,
#> # desc <chr>, play_type <chr>, yards_gained <dbl>, shotgun <dbl>, …
Reading from URLs
More efficiently, many read functions accept URLs, including read.csv(), arrow::read_parquet(), readr::read_csv(), data.table::fread(), and jsonlite::fromJSON(), so reading in one file can be done by passing along the output of pb_download_url():
pb_download_url("mtcars.csv", repo = "tanho63/piggyback-tests", tag = "v0.0.2") %>%
read.csv()
#> # A data.frame: 32 × 12
#> X mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
#> 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4
#> 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4
#> 3 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4
#> 4 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3
#> 5 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
#> # ℹ 27 more rows
#> # ℹ 1 more variable: carb <int>
#> # ℹ Use `print(n = ...)` to see more rows
Some functions also accept URLs when converted into a connection by wrapping them in url(), e.g. for readRDS():
pb_url <- pb_download_url("mtcars.rds", repo = "tanho63/piggyback-tests", tag = "v0.0.2") %>%
url()
readRDS(pb_url)
#> # A data.frame: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> # ℹ 27 more rows
#> # ℹ Use `print(n = ...)` to see more rows
close(pb_url)
Note that using url()
requires that we close the
connection after reading it, or else we will receive warnings about
leaving open connections.
This url()
approach allows us to pass along
authentication for private repos, e.g.
pb_url <- pb_download_url("mtcars.rds", repo = "tanho63/piggyback-private", url_type = "api") %>%
url(
headers = c(
"Accept" = "application/octet-stream",
"Authorization" = paste("Bearer", gh::gh_token())
)
)
readRDS(pb_url)
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> # ℹ 27 more rows
#> # ℹ Use `print(n = ...)` to see more rows
close(pb_url)
Note that arrow does not accept a url() connection at this time, so you should default to pb_read() if using private repositories.
Uploading data
piggyback uploads data to GitHub releases. If your repository doesn’t have a release yet, piggyback will prompt you to create one - you can create a release with:
pb_release_create(repo = "cboettig/piggyback-tests", tag = "v0.0.2")
#> ✔ Created new release "v0.0.2".
Create new releases to manage multiple versions of a given data file, or to organize sets of files under a common topic. While you can create releases as often as you like, making a new release is not necessary each time you upload a file. If maintaining old versions of the data is not useful, you can stick with a single release and upload all of your data there.
Once we have at least one release available, we are ready to upload
files. By default, pb_upload
will attach data to the latest
release.
## We'll need some example data first.
## Pro tip: compress your tabular data to save space & speed upload/downloads
readr::write_tsv(mtcars, "mtcars.tsv.gz")
pb_upload("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
#> ℹ Uploading to latest release: "v0.0.2".
#> ℹ Uploading mtcars.tsv.gz ...
#> |===================================================| 100%
Like pb_download(), pb_upload() will by default overwrite any file of the same name already attached to the release, unless the timestamp of the previously uploaded version is more recent. You can toggle this behavior with the overwrite parameter.
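For instance, to force a re-upload regardless of timestamps (a sketch; see ?pb_upload for the accepted overwrite values):

pb_upload(
  "mtcars.tsv.gz",
  repo = "cboettig/piggyback-tests",
  overwrite = TRUE # replace the existing asset even if the remote copy is newer
)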
pb_upload also accepts a vector of multiple files to upload:
library(magrittr)

## upload a folder of data (full.names = TRUE so the paths resolve from the working directory)
list.files("data", full.names = TRUE) %>%
  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")

## upload files matching certain extensions (list.files() takes a single regular expression)
list.files(pattern = "\\.(tsv\\.gz|tif|zip)$") %>%
  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
Write R object directly to release
pb_write wraps the above process, essentially allowing you to upload directly to a release by providing an object, filename, and repo/tag:
pb_write(mtcars, "mtcars.rds", repo = "cboettig/piggyback-tests")
#> ℹ Uploading to latest release: "v0.0.2".
#> ℹ Uploading mtcars.rds ...
#> |===================================================| 100%
Similar to pb_read, pb_write has some pre-programmed write_functions for the following file extensions:

- “.csv”, “.csv.gz”, “.csv.xz” are written with utils::write.csv()
- “.tsv”, “.tsv.gz”, “.tsv.xz” are written with utils::write.csv(x, filename, sep = '\t')
- “.rds” is written with saveRDS()
- “.json” is written with jsonlite::write_json()
- “.parquet” is written with arrow::write_parquet()
- “.txt” is written with writeLines()

and you can pass custom functions with the write_function parameter:
pb_write(
x = mtcars,
file = "mtcars.csv.gz",
repo = "cboettig/piggyback-tests",
write_function = data.table::fwrite
)
#> ℹ Uploading to latest release: "v0.0.2".
#> ℹ Uploading mtcars.csv.gz ...
#> |===================================================| 100%
Deleting Files
Delete a file from a release:
pb_delete(file = "mtcars.tsv.gz",
repo = "cboettig/piggyback-tests",
tag = "v0.0.1")
#> ℹ Deleted "mtcars.tsv.gz" from "v0.0.1" release on "cboettig/piggyback-tests"
Note that this is irreversible unless you have a copy of the data elsewhere.
Listing Files
List all files currently piggybacking on a given release. Omit tag to see files on all releases.
pb_list(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
#> file_name size timestamp tag owner repo
#> 1 diamonds.tsv.gz 571664 2021-09-07 23:38:31 v0.0.1 cboettig piggyback-tests
#> 2 iris.tsv.gz 846 2021-08-05 20:00:09 v0.0.1 cboettig piggyback-tests
#> 3 iris.tsv.xz 848 2020-03-07 06:18:32 v0.0.1 cboettig piggyback-tests
#> 4 iris2.tsv.gz 846 2018-10-05 17:04:33 v0.0.1 cboettig piggyback-tests
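For example, omitting tag lists the assets attached to every release in the repository:

pb_list(repo = "cboettig/piggyback-tests")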
Caching
To reduce GitHub API calls, piggyback caches the results of pb_releases and pb_list with a timeout of 10 minutes by default. This avoids repeating identical requests to update its internal record of the repository data (releases, assets, timestamps, etc.) during programmatic use. You can increase or decrease this delay by setting the piggyback_cache_duration environment variable (in seconds), e.g. Sys.setenv("piggyback_cache_duration" = 3600) for a longer cache or Sys.setenv("piggyback_cache_duration" = 0) to disable caching, and then restarting R.
Valid file names
GitHub assets attached to a release do not support file paths, and will convert most special characters (#, %, etc.) to . or throw an error (e.g. for file names containing $, @, /). piggyback will default to using the basename() of the file only (i.e. it will only use "mtcars.csv" if provided a file path like "data/mtcars.csv").
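For example, uploading a file from a nested path (a hypothetical local path, for illustration) creates an asset named after the file alone:

# "data/mtcars.csv" is a hypothetical local path; the release asset will be
# named "mtcars.csv" because piggyback keeps only the basename()
pb_upload("data/mtcars.csv", repo = "cboettig/piggyback-tests")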
A Note on GitHub Releases vs Data Archiving
piggyback is not intended as a data archiving solution. Importantly, bear in mind that there is nothing special about multiple “versions” in releases, as far as data assets uploaded by piggyback are concerned. The data files piggyback attaches to a release can be deleted or modified at any time – creating a new release to store data assets is the functional equivalent of just creating new directories v0.1, v0.2 to store your data. (GitHub releases are always pinned to a particular git tag, so the code/git-managed contents associated with the repo are more immutable, but remember that our data assets just piggyback on top of the repo.)
Permanent, published data should always be archived in a proper data repository with a DOI, such as zenodo.org. Zenodo can freely archive public research data files up to 50 GB in size, and data is strictly versioned (once released, a DOI always refers to the same version of the data; new releases are given new DOIs). piggyback is meant only to lower the friction of working with data during the research process, e.g. providing data access to collaborators or continuous integration systems, including for private repositories.
What will GitHub think of this?
GitHub documentation at the time of writing endorses the use of attachments to releases as a solution for distributing large files as part of your project: