Purpose

This vignette demonstrates how to use DataPackageR to build a data package.

DataPackageR aims to simplify data package construction.

It provides mechanisms for reproducibly preprocessing and tidying raw data into documented, versioned, and packaged analysis-ready data sets.

Long-running or computationally intensive data processing can be decoupled from the usual R CMD build process while maintaining data lineage.

In this vignette we will subset and package the mtcars data set.

Set up a new data package.

We’ll set up a new data package based on the mtcars example in the README. The datapackage_skeleton() API is used to set up a new package. The user needs to provide:

  • R or Rmd code files that do data processing.
  • A list of R object names created by those code files.
  • Optionally a path to a directory of raw data (will be copied into the package).
  • Optionally a list of additional code files that may be dependencies of your R scripts.
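The call that creates the skeleton used in the rest of this vignette might look like the sketch below. It assumes DataPackageR is installed and that the example script subsetCars.Rmd ships under extdata/tests in the installed package (an assumption about the package layout). The name pkg_parent is chosen here for illustration only.

```r
# Sketch: create the mtcars20 skeleton in a temporary directory.
# Assumption: subsetCars.Rmd is available under extdata/tests in the
# installed DataPackageR package.
pkg_parent <- tempdir()

if (requireNamespace("DataPackageR", quietly = TRUE)) {
  processing_code <- system.file("extdata", "tests", "subsetCars.Rmd",
                                 package = "DataPackageR")
  if (nzchar(processing_code)) {
    DataPackageR::datapackage_skeleton(
      name = "mtcars20",               # name of the new data package
      path = pkg_parent,               # where the source tree is created
      force = TRUE,                    # overwrite an existing skeleton
      code_files = processing_code,    # R or Rmd processing scripts
      r_object_names = "cars_over_20"  # objects those scripts create
      # raw_data_dir and dependencies are left empty in this example
    )
  }
}
```

The two optional arguments, raw_data_dir and dependencies, correspond to the last two items in the list above.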

What’s in the package skeleton structure?

This has created a data package source tree named “mtcars20” (in a temporary directory). For a real use case you would pick a path on your filesystem where you could then initialize a new GitHub repository for the package.

The contents of mtcars20 are:

                levelName
1  mtcars20              
2   ¦--data              
3   ¦--data-raw          
4   ¦   °--subsetCars.Rmd
5   ¦--datapackager.yml  
6   ¦--DESCRIPTION       
7   ¦--inst              
8   ¦   °--extdata       
9   ¦--R                 
10  °--Read-and-delete-me

You should fill out the DESCRIPTION file to describe your data package. It contains a new DataVersion string that will be automatically incremented when the data package is built if the packaged data has changed.

The user-provided code files reside in data-raw. They are executed during the data package build process.

A few words about the YAML config file

A datapackager.yml file is used to configure and control the build process.

The contents are:

configuration:
  files:
    subsetCars.Rmd:
      enabled: yes
  objects: cars_over_20
  render_root:
    tmp: '8072'

The two main pieces of information in the configuration are a list of the files to be processed and the data sets the package will store.

This example packages an R data set named cars_over_20 (the name was passed in to datapackage_skeleton()). It is created by the subsetCars.Rmd file.

The objects must be listed in the yaml configuration file. datapackage_skeleton() ensures this is done for you automatically.

DataPackageR provides an API for modifying this file, so it does not need to be done by hand.

Further information on the contents of the YAML configuration file and the API is available in YAML Configuration Details.
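As a rough illustration of that API, the sketch below reads the configuration from the package source tree and lists its files and objects without changing anything on disk. It assumes DataPackageR is installed and that the mtcars20 skeleton exists in tempdir(); otherwise it does nothing.

```r
# Sketch: inspect datapackager.yml through the DataPackageR API.
pkg_dir <- file.path(tempdir(), "mtcars20")
config <- NULL

if (requireNamespace("DataPackageR", quietly = TRUE) &&
    file.exists(file.path(pkg_dir, "datapackager.yml"))) {
  config <- DataPackageR::yml_find(pkg_dir)  # read the config as a nested list
  DataPackageR::yml_list_files(config)       # files to be processed
  DataPackageR::yml_list_objects(config)     # data objects to be stored
}
```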

Where do I put my raw datasets?

Raw data (provided the size is not prohibitive) can be placed in inst/extdata.

The datapackage_skeleton() API has the raw_data_dir argument, which will copy the contents of raw_data_dir (and its subdirectories) into inst/extdata automatically.

In this example we are reading the mtcars data set that is already in memory, rather than from the file system.

An API to read raw data sets from within an R or Rmd processing script.

As stated in the README, in order for your processing scripts to be portable, you should not use absolute paths to files. DataPackageR provides an API to point to the data package root directory and the inst/extdata and data subdirectories. These are useful for constructing portable paths in your code to read files from these locations.

For example: to construct a path to a file named “mydata.csv” located in inst/extdata in your data package source tree:
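A minimal sketch of such a call follows. The path helper is meant to be used from a processing script in data-raw while the package is being built, so the call is wrapped in a function here (read_mydata is a hypothetical name) and is not evaluated until it is called.

```r
# Sketch: read inst/extdata/mydata.csv through a portable path.
# Intended for use inside a processing script during a build.
read_mydata <- function() {
  raw_file <- DataPackageR::project_extdata_path("mydata.csv")
  read.csv(raw_file)
}
```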

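Similarly, project_path() returns the data package root directory and project_data_path() returns its data subdirectory.

Paths to raw data sets that are stored externally (outside the data package source tree) can be constructed relative to project_path().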
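The sketch below shows how these helpers might be combined. The directory big_raw_data and the file huge.csv are hypothetical names for data kept next to the package source tree, and the function wrapper only delays evaluation until the code runs inside a build.

```r
# Sketch: portable paths built from the package root.
# "big_raw_data" and "huge.csv" are hypothetical names.
portable_paths <- function() {
  root <- DataPackageR::project_path()
  list(
    root     = root,                               # package root directory
    data     = DataPackageR::project_data_path(),  # the data subdirectory
    external = file.path(dirname(root), "big_raw_data", "huge.csv")
  )
}
```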

YAML header metadata for R files and Rmd files.

If your processing scripts are Rmd files, the usual yaml header for rmarkdown documents should be present.

If you have R files, you can still include a yaml header, but it should be commented with #' and it should be at the top of your R file. For example, a test R file in the DataPackageR package looks as follows:

#'---
#'title: Sample report  from R script
#'author: Greg Finak
#'date: August 1, 2018
#'---
data <- runif(100)

This will be converted to an Rmd file with a proper yaml header, which will then be turned into a vignette and indexed in the built package.

Build the data package.

Once the skeleton framework is set up, run the preprocessing code and build the package with package_build():

# Run the preprocessing code to build cars_over_20
# and reproducibly enclose it in a package.
dir.create(file.path(tempdir(),"lib"))
DataPackageR::package_build(file.path(tempdir(),"mtcars20"), install = TRUE, lib = file.path(tempdir(),"lib"))

1 data set(s) created by subsetCars.Rmd
• cars_over_20
☘ Built  all datasets!
Non-interactive NEWS.md file update.
✔ Creating 'vignettes/'
✔ Creating 'inst/doc/'
First time using roxygen2. Upgrading automatically...
Loading mtcars20
Writing NAMESPACE
Writing mtcars20.Rd
Writing cars_over_20.Rd
  
✔  checking for file ‘/tmp/RtmpNGp8DC/mtcars20/DESCRIPTION’ (344ms)
─  preparing ‘mtcars20’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
     NB: this package now depends on R (>= 3.5.0)
     WARNING: Added dependency on R >= 3.5.0 because serialized objects in  serialize/load version 3 cannot be read in older versions of R.  File(s) containing such objects: ‘mtcars20/data/cars_over_20.rda’
─  building ‘mtcars20_1.0.tar.gz’

Reloading attached mtcars20
Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'mtcars20'
Next Steps 
1. Update your package documentation. 
   - Edit the documentation.R file in the package source data-raw subdirectory and update the roxygen markup. 
   - Rebuild the package documentation with  document() . 
2. Add your package to source control. 
   - Call  git init .  in the package source root directory. 
   -  git add  the package files. 
   -  git commit  your new package. 
   - Set up a github repository for your pacakge. 
   - Add the github repository as a remote of your local package repository. 
   -  git push  your local repository to gitub. 
[1] "/tmp/RtmpNGp8DC/mtcars20_1.0.tar.gz"

Documenting your data set changes in NEWS.md

When you build a package in interactive mode, you will be prompted to input text describing the changes to your data package (one line).

These will appear in the NEWS.md file in the following format:

DataVersion: xx.yy.zz
========
A description of your changes to the package

[The rest of the file]

Why not just use R CMD build?

If the processing script is time consuming or the data set is particularly large, then R CMD build would run the code each time the package is installed. In such cases, raw data may not be available, or the environment to do the data processing may not be set up for each user of the data. DataPackageR decouples data processing from package building/installation for data consumers.

A log of the build process

DataPackageR uses the futile.logger package to log progress.

If there are errors in the processing, the script will notify you via logging to the console and to inst/extdata/Logfiles/processing.log in the package source tree. Errors should be corrected and the build repeated.
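One way to inspect the log after a failed build is sketched below using base R only. The helper read_build_log is a hypothetical name, and filtering on the string "ERROR" assumes the log lines carry the log level as text.

```r
# Sketch: read the build log from a data package source tree.
read_build_log <- function(pkg_dir) {
  log_file <- file.path(pkg_dir, "inst", "extdata", "Logfiles", "processing.log")
  if (!file.exists(log_file)) {
    return(character(0))  # no build has been run yet
  }
  readLines(log_file, warn = FALSE)
}

log_lines <- read_build_log(file.path(tempdir(), "mtcars20"))

# Assumption: error entries contain the text "ERROR".
grep("ERROR", log_lines, value = TRUE)
```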

If everything goes smoothly, you will have a new package built in the parent directory.

In this case we have a new package mtcars20_1.0.tar.gz.

A note about the package source directory after building.

The package source directory changes after the first build.

                         levelName
1  mtcars20                       
2   ¦--data                       
3   ¦   °--cars_over_20.rda       
4   ¦--data-raw                   
5   ¦   ¦--documentation.R        
6   ¦   °--subsetCars.Rmd         
7   ¦--DATADIGEST                 
8   ¦--datapackager.yml           
9   ¦--DESCRIPTION                
10  ¦--inst                       
11  ¦   ¦--doc                    
12  ¦   ¦   ¦--subsetCars.html    
13  ¦   ¦   °--subsetCars.Rmd     
14  ¦   °--extdata                
15  ¦       °--Logfiles           
16  ¦           ¦--processing.log 
17  ¦           °--subsetCars.html
18  ¦--man                        
19  ¦   ¦--cars_over_20.Rd        
20  ¦   °--mtcars20.Rd            
21  ¦--NAMESPACE                  
22  ¦--NEWS.md                    
23  ¦--R                          
24  ¦   °--mtcars20.R             
25  ¦--Read-and-delete-me         
26  °--vignettes                  
27      °--subsetCars.Rmd         

Update the autogenerated documentation.

After the first build, the R directory contains mtcars20.R, which holds autogenerated roxygen2 markup documentation for the data package and for the packaged data cars_over_20.

The processed Rd files can be found in man.

The autogenerated documentation source is in the documentation.R file in data-raw.

You should update this file to properly document your objects. Then rebuild the documentation:
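A sketch of that step, assuming DataPackageR is installed and the mtcars20 source tree from the build above exists in tempdir(). Passing lib mirrors the package_build() call used earlier; that document() forwards it to the installer is an assumption.

```r
# Sketch: rebuild the documentation only.
pkg_dir <- file.path(tempdir(), "mtcars20")
lib_dir <- file.path(tempdir(), "lib")  # temporary library directory
dir.create(lib_dir, showWarnings = FALSE)

if (requireNamespace("DataPackageR", quietly = TRUE) && dir.exists(pkg_dir)) {
  DataPackageR::document(pkg_dir, lib = lib_dir)
}
```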

This is done without reprocessing the data.

Don’t forget to rebuild the package.

You should update the documentation in data-raw/documentation.R, then call package_build() again.

Installing and using the new data package
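The built tarball can be installed and used like any other source package. The sketch below assumes the build above produced mtcars20_1.0.tar.gz in tempdir() and installs it into a temporary library; it does nothing if the tarball is missing.

```r
# Sketch: install the built data package and load its data set.
tarball <- file.path(tempdir(), "mtcars20_1.0.tar.gz")
lib_dir <- file.path(tempdir(), "lib")  # temporary library to install into
dir.create(lib_dir, showWarnings = FALSE)

if (file.exists(tarball)) {
  install.packages(tarball, type = "source", repos = NULL, lib = lib_dir)
  library(mtcars20, lib.loc = lib_dir)
  data("cars_over_20")  # load the packaged data set
  head(cars_over_20)    # now it can be used
}
```

In a real project you would install into your regular library and simply call library(mtcars20).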

Migrating old data packages.

Version 1.12.0 has moved away from controlling the build process using datasets.R and an additional masterfile argument.

The build process is now controlled via a datapackager.yml configuration file located in the package root directory (see YAML Configuration Details).

Create a datapackager.yml file

You can migrate an old package by constructing such a config file using the construct_yml_config() API.

# assume I have file1.Rmd and file2.R located in /data-raw, 
# and these create 'object1' and 'object2' respectively.

config <- construct_yml_config(code = c("file1.Rmd", "file2.R"),
                              data = c("object1", "object2"))
cat(yaml::as.yaml(config))
configuration:
  files:
    file1.Rmd:
      enabled: yes
    file2.R:
      enabled: yes
  objects:
  - object1
  - object2
  render_root:
    tmp: '244377'

config is a newly constructed yaml configuration object. It can be written to the package directory:

path_to_package <- tempdir() #e.g., if tempdir() was the root of our package.
yml_write(config, path = path_to_package)

Now the package at path_to_package will build with version 1.12.0 or greater.

Reading data sets from Rmd files

In versions prior to 1.12.1 we would read data sets from inst/extdata in an Rmd script using paths relative to data-raw in the data package source tree.

For example:

The old way

# read 'myfile.csv' from inst/extdata relative to data-raw where the Rmd is rendered.
read.csv(file.path("../inst/extdata","myfile.csv"))

Now Rmd and R scripts are processed in render_root defined in the yaml config.

To read a raw data set we can get the path to the package source directory using an API call:

The new way

# DataPackageR::project_extdata_path() returns the path to the data package inst/extdata subdirectory.
# DataPackageR::project_path() returns the path to the data package root directory.
# DataPackageR::project_data_path() returns the path to the data package data subdirectory.
read.csv(
    DataPackageR::project_extdata_path("myfile.csv")
    )

Partial builds

We can also perform partial builds of a subset of files in a package by toggling the enabled key in the config file.

This can be done with the following API:

config <- yml_disable_compile(config,filenames = "file2.R")
yml_write(config, path = path_to_package) # write modified yml to the package.
configuration:
  files:
    file1.Rmd:
      enabled: yes
    file2.R:
      enabled: no
  objects:
  - object1
  - object2
  render_root:
    tmp: '244377'

Note that the modified configuration needs to be written back to the package source directory in order for the changes to take effect.

The consequence of toggling a file to enabled: no is that it will be skipped when the package is rebuilt, but the data will still be retained in the package, and the documentation will not be altered.

This is useful in situations where we have multiple data sets, and want to re-run one script to update a specific data set, but not the other scripts because they may be too time consuming, for example.
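Once the fast script has been re-run, the disabled file can be switched back on in the same way. The sketch below works on an in-memory configuration and assumes DataPackageR is installed. As noted above, a real change would also need yml_write() to take effect.

```r
# Sketch: disable a file, then re-enable it.
config <- NULL

if (requireNamespace("DataPackageR", quietly = TRUE)) {
  config <- DataPackageR::construct_yml_config(
    code = c("file1.Rmd", "file2.R"),
    data = c("object1", "object2")
  )
  config <- DataPackageR::yml_disable_compile(config, filenames = "file2.R")
  config <- DataPackageR::yml_enable_compile(config, filenames = "file2.R")
  DataPackageR::yml_list_files(config)  # show the files in the config
}
```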

Multi-script pipelines.

We may have situations where we have multi-script pipelines. There are two ways to share data among scripts.

  1. filesystem artifacts
  2. data objects passed to subsequent scripts.

File system artifacts

The yaml configuration property render_root specifies the working directory where scripts will be rendered.

If a script writes files to the working directory, that is where files will appear. These can be read by subsequent scripts.
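The idea can be sketched with base R alone. Here a temporary directory stands in for render_root, and the two commented halves stand in for two processing scripts. The file name intermediate.csv and the mtcars subset are illustrative only.

```r
# Sketch: two scripts sharing a file through a common working directory.
render_root <- file.path(tempdir(), "render_root_demo")  # stand-in for render_root
dir.create(render_root, showWarnings = FALSE)
old_wd <- setwd(render_root)

# What the first script might do: write an intermediate file.
intermediate <- subset(mtcars, mpg > 20)
write.csv(intermediate, "intermediate.csv")

# What a later script might do: read it back from the same directory.
shared <- read.csv("intermediate.csv", row.names = 1)

setwd(old_wd)  # restore the original working directory
nrow(shared)
```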

Passing data objects to subsequent scripts.

A script (e.g., script2.Rmd) running after script1.Rmd can access a stored data object named script1_dataset created by script1.Rmd by calling

script1_dataset <- DataPackageR::datapackager_object_read("script1_dataset")

Passing of data objects amongst scripts can be turned off via:

package_build(deps = FALSE)

Next steps

We recommend the following once your package is created.

Place your package under source control

You now have a data package source tree.

This will let you version control your data processing code, and provide a mechanism for sharing your package with others.

For more details on using git and GitHub with R, see the excellent guide by Jenny Bryan, Happy Git and GitHub for the useR, and Hadley Wickham’s book on R packages.

Additional Details

We provide some additional details for the interested.

Fingerprints of stored data objects

DataPackageR calculates an md5 checksum of each data object it stores, and keeps track of them in a file called DATADIGEST.

  • Each time the package is rebuilt, the md5 sums of the new data objects are compared against the DATADIGEST.
  • If they don’t match, the build process checks that the DataVersion string has been incremented in the DESCRIPTION file.
  • If it has not, the build process will exit and produce an error message.
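The check described in the list above can be illustrated with a small base R sketch. This is not DataPackageR's implementation: the fingerprint() helper, which hashes an uncompressed serialized copy of the object, and the stored values are assumptions made for the example.

```r
# Sketch: compare an md5 fingerprint against a stored digest.
fingerprint <- function(object) {
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp))
  saveRDS(object, tmp, compress = FALSE)  # serialize without compression
  unname(tools::md5sum(tmp))              # md5 checksum of the serialized file
}

# Hypothetical stored state from a previous build.
old_digest <- list(DataVersion = "0.1.0",
                   my_data = fingerprint(head(mtcars, 5)))

new_object  <- head(mtcars, 6)  # the data changed...
new_version <- "0.1.0"          # ...but the version string did not

data_changed   <- !identical(fingerprint(new_object), old_digest$my_data)
version_bumped <- package_version(new_version) >
  package_version(old_digest$DataVersion)

# The build would only be allowed to proceed in this case:
build_ok <- !data_changed || version_bumped
```

In this example the data changed and the version string did not, so build_ok is FALSE, which corresponds to the error case in the last bullet.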

DATADIGEST

The DATADIGEST file contains the following:

DataVersion: 0.1.0
cars_over_20: 3ccb5b0aaa74fe7cfc0d3ca6ab0b5cf3

DESCRIPTION

The DESCRIPTION file has the new DataVersion string.

Package: mtcars20
Title: What the Package Does (One Line, Title Case)
Version: 1.0
Authors@R: 
    person(given = "First",
           family = "Last",
           role = c("aut", "cre"),
           email = "[email protected]",
           comment = c(ORCID = "YOUR-ORCID-ID"))
Description: What the package does (one paragraph).
License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
    license
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
DataVersion: 0.1.0
Date: 2020-10-19
Suggests: 
    knitr,
    rmarkdown
VignetteBuilder: knitr
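Because DESCRIPTION is a DCF file, the DataVersion field can be read with base R. The sketch below writes a cut-down DESCRIPTION to a temporary file for illustration. With a real package you would point read.dcf() at the DESCRIPTION file in the package source tree.

```r
# Sketch: read the DataVersion field from a DESCRIPTION file.
desc_file <- tempfile()
writeLines(c("Package: mtcars20",
             "Version: 1.0",
             "DataVersion: 0.1.0"),
           desc_file)

# read.dcf() returns a character matrix with one column per requested field.
stored_version <- unname(read.dcf(desc_file, fields = "DataVersion")[1, "DataVersion"])
stored_version
```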