Suggested Workflow for Storing a Variable Set of Dataframes under Version Control
Thierry Onkelinx
Source: vignettes/workflow.Rmd
Introduction
This vignette describes a suggested workflow for storing a snapshot of dataframes as git2rdata objects under version control. The workflow comes in two flavours:
- A single repository holding both the data and the analysis code. The single repository set-up is simple. A single reference (e.g. commit) points to both the data and the code.
- One repository holding the data and a second repository holding the code. The data and the code have an independent history under a two repository set-up. Documenting the analysis requires one reference to each repository. Such a set-up is useful for repeating the same analysis (stable code) on updated data.
In this vignette we use a git2r::repository() object as the root. This adds git functionality, provided by the git2r package, to write_vc() and read_vc(), allowing us to pull, stage, commit and push from within R.
Each commit in the data git repository describes a complete snapshot of the data at the time of the commit. The difference between two commits can consist of changes in existing git2rdata objects (updated observations, new observations, deleted observations or updated metadata). Besides updating existing git2rdata objects, we can also add new git2rdata objects or remove existing ones. We need to track such higher-level additions and deletions as well.
We illustrate the workflow with a mock analysis on the datasets::beaver1 and datasets::beaver2 datasets.
Setup
We start by initializing a git repository, since git2rdata assumes this has already been done. We use the git2r functions to do so. First we create a local bare repository to act as remote. In practice this will be a remote on an external server (GitHub, GitLab, Bitbucket, …). The example below creates a local git repository with an upstream git repository. Any other workflow that creates a similar structure is fine.
# initialize a bare git repo to be used as remote
remote <- tempfile("git2rdata-workflow-remote")
remote <- normalizePath(remote, winslash = "/")
#> Warning in normalizePath(remote, winslash = "/"):
#> path[1]="/tmp/Rtmp5VDmOF/git2rdata-workflow-remoteae955bfdfcc": No such file or
#> directory
dir.create(remote)
git2r::init(remote, bare = TRUE)
#> Head: nothing commited (yet)
# initialize a local git repo
path <- tempfile("git2rdata-workflow")
path <- normalizePath(path, winslash = "/")
#> Warning in normalizePath(path, winslash = "/"):
#> path[1]="/tmp/Rtmp5VDmOF/git2rdata-workflowae928267d65": No such file or
#> directory
dir.create(path)
init_repo <- git2r::clone(remote, path, progress = FALSE)
git2r::config(init_repo, user.name = "me", user.email = "me@me.com")
# add an initial commit with .gitignore file
writeLines("*extra*", file.path(path, ".gitignore"))
git2r::add(init_repo, ".gitignore", force = TRUE)
git2r::commit(init_repo, message = "Initial commit")
#> [2885f52] 2024-10-06: Initial commit
# push initial commit to remote
branch_name <- git2r::branches(init_repo)[[1]]$name
git2r::push(
init_repo, "origin", file.path("refs", "heads", branch_name, fsep = "/")
)
rm(init_repo)
Structuring Git2rdata Objects Within a Project
git2rdata imposes a minimal structure: the .tsv and the .yml file need to be in the same folder. That's it. For the sake of simplicity, in this vignette we dump all git2rdata objects at the root of the repository. This might not be a good idea for a real project. We recommend using at least a separate directory tree for each import script. This directory tree goes in the root of a data-only repository, in the data directory of a combined data and code repository, or in the inst directory of an R package. Your project might need a different directory structure. Feel free to choose the most relevant structure for your project.
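For illustration, a combined data and code repository could look like the hypothetical layout below, with one directory per import script (the weather directory and its files are made up for this example):
data/
  beaver/
    body_temperature.tsv
    body_temperature.yml
  weather/
    stations.tsv
    stations.yml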
Storing Dataframes ad Hoc into a Git Repository
First Commit
In the first commit we use datasets::beaver1. We connect to the git repository using repository(). Note that this assumes that path is an existing git repository. Now we can write the dataset as a git2rdata object in the repository. If the root argument of write_vc() is a git_repository, it gains two extra arguments: stage and force. Setting stage = TRUE will automatically stage the files written by write_vc().
library(git2rdata)
repo <- repository(path)
fn <- write_vc(beaver1, "beaver", repo, sorting = "time", stage = TRUE)
We can use status() to check that write_vc() wrote and staged the required files. Then we commit() the changes and keep the returned commit object so that we can refer back to this snapshot later.
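A minimal sketch of these calls (output omitted; the commit hash will differ from run to run):
# verify that write_vc() wrote and staged beaver.tsv and beaver.yml
status(repo)
# commit the first snapshot and keep the commit object for later reference
cm1 <- commit(repo, message = "First commit")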
Second Commit
The second commit adds beaver2.
fn <- write_vc(beaver2, "extra_beaver", repo, sorting = "time", stage = TRUE)
status(repo)
#> working directory clean
Notice that extra_beaver is not listed in the status() output, although write_vc() wrote it to the repository. The reason is that we set up a .gitignore which contains "*extra*", so git ignores any git2rdata object with a name containing "extra". We force git to stage it by setting force = TRUE.
status(repo, ignored = TRUE)
#> Ignored files:
#> Ignored: extra_beaver.tsv
#> Ignored: extra_beaver.yml
fn <- write_vc(beaver2, "extra_beaver", repo, sorting = "time", stage = TRUE,
force = TRUE)
status(repo)
#> Staged changes:
#> New: extra_beaver.tsv
#> New: extra_beaver.yml
cm2 <- commit(repo, message = "Second commit")
Third Commit
Now we decide that a single git2rdata object containing the data of both beavers is more relevant. We add an ID variable for each of the animals. This requires updating the sorting to avoid ties and setting strict = FALSE to update the metadata. The "extra_beaver" git2rdata object is no longer needed, so we remove it. We use all = TRUE to stage the removal of "extra_beaver" while committing the changes.
beaver1$beaver <- 1
beaver2$beaver <- 2
beaver <- rbind(beaver1, beaver2)
fn <- write_vc(beaver, "beaver", repo, sorting = c("beaver", "time"),
strict = FALSE, stage = TRUE)
#> Warning: Changes in the metadata may lead to unnecessarily large diffs.
#> See vignette('version_control', package = 'git2rdata') for more information.
#>
#> - The sorting variables changed.
#> - Sorting for the new data: 'beaver', 'time'.
#> - Sorting for the old data: 'time'.
#> - New data has a different number of variables.
#> - New variables: beaver.
file.remove(list.files(path, "extra", full.names = TRUE))
#> [1] TRUE TRUE
status(repo)
#> Unstaged changes:
#> Deleted: extra_beaver.tsv
#> Deleted: extra_beaver.yml
#>
#> Staged changes:
#> Modified: beaver.tsv
#> Modified: beaver.yml
cm3 <- commit(repo, message = "Third commit", all = TRUE)
status(repo)
#> working directory clean
Scripted Workflow for Storing Dataframes
We strongly recommend adding git2rdata objects through an import script instead of adding them ad hoc. Store this script in the (analysis) repository. It documents the creation of the git2rdata objects. Rerun this script whenever updated data becomes available.
Old versions of the import script and the associated git2rdata objects remain available through the version control history. Remove obsolete git2rdata objects from the import script. This keeps both the import script and the working directory tidy and minimal.
Basically, the import script should create all git2rdata objects within a given directory tree. The advantage is that we can start the import script by clearing any existing git2rdata objects in this directory. If the import script no longer creates a git2rdata object, it gets removed without the need to track which git2rdata objects existed in the previous version.
The brute-force method of removing all files, or all .tsv / .yml pairs, is not a good idea. This removes the existing metadata, which we need for efficient storage (see vignette("efficiency", package = "git2rdata")). A better solution is to use rm_data() on the directory at the start of the import script. This removes all .tsv files which have valid metadata, while the existing metadata remains untouched at this point.
Then write all git2rdata objects and stage them. Unchanged objects will not lead to a diff, even if we first deleted and then recreated them. The script won't recreate the .tsv file of obsolete git2rdata objects. Use prune_meta() to remove any leftover metadata files.
Commit and push the changes at the end of the script.
Below is an example script recreating the “beaver” git2rdata object from the third commit.
# load package
library(git2rdata)
# step 1: setup the repository and data path
repo <- repository(".")
data_path <- file.path("data", "beaver")
# step 1b: sync the repository with the remote
pull(repo = repo)
# step 2: remove all existing data files
rm_data(root = repo, path = data_path, stage = TRUE)
# step 3: write all relevant git2rdata objects to the data path
beaver1$beaver <- 1
beaver2$beaver <- 2
body_temp <- rbind(beaver1, beaver2)
fn <- write_vc(x = body_temp, file = file.path(data_path, "body_temperature"),
root = repo, sorting = c("beaver", "time"), stage = TRUE)
# step 4: remove any dangling metadata files
prune_meta(root = repo, path = data_path, stage = TRUE)
# step 5: commit the changes
cm <- commit(repo = repo, message = "import")
# step 5b: sync the repository with the remote
push(object = repo)
R Package Workflow for Storing Dataframes
We recommend a two repository set-up in case of recurring analyses. These are relatively stable analyses which have to run with some frequency on updated data (e.g. once a month). That makes it worthwhile to convert the analyses into an R package. Split long scripts into a set of shorter functions, which are much easier to document and maintain. An R package offers lots of functionality out of the box to check the quality of your code.
The example below converts the import script above into a function. We illustrate how you can use Roxygen2 tags (see vignette("roxygen2", package = "roxygen2")) to document the function and to list its dependencies. Note that we added session = TRUE to commit(). This appends the sessionInfo() at the time of the commit to the commit message, thus documenting all loaded R packages and their versions. Since your analysis code resides in a dedicated package with its own version number, this also documents the code used to create the git2rdata objects. We strongly recommend running the import from a fresh R session, so that the sessionInfo() at commit time contains only those packages which are strictly required for the import. Consider running the import from the command line, e.g. Rscript -e 'mypackage::import_body_temp("path/to/root")'.
#' Import the beaver body temperature data
#' @param path the root of the git repository
#' @importFrom git2rdata repository pull rm_data write_vc prune_meta commit push
#' @export
import_body_temp <- function(path) {
# step 1: setup the repository and data path
repo <- repository(path)
data_path <- file.path("data", "beaver")
# step 1b: sync the repository with the remote
pull(repo = repo)
# step 2: remove all existing data files
rm_data(root = repo, path = data_path, stage = TRUE)
# step 3: write all relevant git2rdata objects to the data path
beaver1$beaver <- 1
beaver2$beaver <- 2
body_temp <- rbind(beaver1, beaver2)
write_vc(x = body_temp, file = file.path(data_path, "body_temperature"),
root = repo, sorting = c("beaver", "time"), stage = TRUE)
# step 4: remove any dangling metadata files
prune_meta(root = repo, path = data_path, stage = TRUE)
# step 5: commit the changes
commit(repo = repo, message = "import", session = TRUE)
# step 5b: sync the repository with the remote
push(object = repo)
}
Analysis Workflow with Reproducible Data
The example below is a small, trivial example of a standardized analysis which documents the source of the data by recording the name of the dataset, the repository URL and the commit. We can use this information when reporting the results. This makes the data underlying the results traceable.
analysis <- function(ds_name, repo) {
ds <- read_vc(ds_name, repo)
list(
dataset = ds_name,
repository = git2r::remote_url(repo),
commit = recent_commit(ds_name, repo, data = TRUE),
model = lm(temp ~ activ, data = ds)
)
}
report <- function(x) {
knitr::kable(
coef(summary(x$model)),
caption = sprintf("**dataset:** %s \n**commit:** %s \n**repository:** %s",
x$dataset, x$commit$commit, x$repository)
)
}
In this case we can run every analysis by looping over the list of datasets in the repository.
repo <- repository(path)
current <- lapply(list_data(repo), analysis, repo = repo)
names(current) <- list_data(repo)
result <- lapply(current, report)
junk <- lapply(result, print)
|             | Estimate   | Std. Error | t value    | Pr(>|t|) |
|-------------|------------|------------|------------|----------|
| (Intercept) | 36.9084247 | 0.0198546  | 1858.93938 | 0        |
| activ       | 0.9346636  | 0.0352219  | 26.53644   | 0        |
The example below does the same thing for the first and second commit.
# checkout first commit
git2r::checkout(cm1)
# do analysis
previous <- lapply(list_data(repo), analysis, repo = repo)
names(previous) <- list_data(repo)
result <- lapply(previous, report)
junk <- lapply(result, print)
|             | Estimate   | Std. Error | t value     | Pr(>|t|) |
|-------------|------------|------------|-------------|----------|
| (Intercept) | 36.8421296 | 0.0167694  | 2196.987569 | 0e+00    |
| activ       | 0.3812037  | 0.0730961  | 5.215107    | 8e-07    |
# checkout second commit
git2r::checkout(cm2)
# do analysis
previous <- lapply(list_data(repo), analysis, repo = repo)
names(previous) <- list_data(repo)
result <- lapply(previous, report)
junk <- lapply(result, print)
|             | Estimate   | Std. Error | t value     | Pr(>|t|) |
|-------------|------------|------------|-------------|----------|
| (Intercept) | 36.8421296 | 0.0167694  | 2196.987569 | 0e+00    |
| activ       | 0.3812037  | 0.0730961  | 5.215107    | 8e-07    |

|             | Estimate   | Std. Error | t value    | Pr(>|t|) |
|-------------|------------|------------|------------|----------|
| (Intercept) | 37.0968421 | 0.0345624  | 1073.32955 | 0        |
| activ       | 0.8062224  | 0.0438943  | 18.36736   | 0        |
If you inspect the reported results, you'll notice that all the output (coefficients and commit hash) for the "beaver" object is identical for the first and second commit. This makes sense since the "beaver" object didn't change during the second commit. The output for the current (third) commit is different because the dataset changed.
Long Running Analysis
Imagine the case where an individual analysis takes a while to run. We store the most recent version of each analysis together with the information from recent_commit(). When preparing a new run, we call recent_commit() again on the dataset and compare the commit hash with the one stored with the available analysis. If the commit hashes match, the data hasn't changed and there is no need to rerun the analysis, saving valuable computing resources and time.
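A minimal sketch of such a check, reusing the analysis() function from above (the stored result and where it comes from are assumptions for this example):
# `stored` is a previously saved result of analysis(), e.g. the "beaver" entry
# of the `current` list created earlier
stored <- current[["beaver"]]
# look up the most recent commit that changed the "beaver" git2rdata object
latest <- recent_commit("beaver", repo, data = TRUE)
if (identical(latest$commit, stored$commit$commit)) {
  # identical hashes: the data did not change, so reuse the stored model
  result <- stored
} else {
  # the data changed: rerun the long running analysis
  result <- analysis("beaver", repo)
}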