
A framework to automate the processing, tidying and packaging of raw data into analysis-ready data sets as R packages.


DataPackageR automates running data processing code, storing tidied data sets in an R package, producing data documentation stubs, tracking data object fingerprints (md5 hashes), and incrementing a "DataVersion" string in the DESCRIPTION file of the package when raw data or data objects change. The user supplies the data processing code and specifies the names of the tidy data objects to be stored, documented and tracked in the final package. Raw data should be read from "inst/extdata", but large raw data files can be read from sources external to the package source tree.

Configuration is controlled via the datapackager.yml file created at the package root. It lists the R and Rmd files that are rendered or sourced to read the raw data and do the actual processing, along with the names of the R objects those files create. These objects are stored in the final package and are accessible via the data() API. The documentation for each object is accessible via "?object-name", and md5 fingerprints of the objects are created and tracked.
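The relationship between the listed code files and the listed object names can be pictured with a small base R sketch. This is not DataPackageR's implementation; `code_files` and `object_names` are hypothetical stand-ins for the two configuration properties, and the processing script is a one-line placeholder.

```r
# Hypothetical stand-ins for the two pieces of configuration described above.
code_files <- tempfile(fileext = ".R")
writeLines("tbl <- table(sample(1:10, 1000, replace = TRUE))", code_files)
object_names <- c("tbl")

# Run each processing file in a dedicated environment so that only the
# objects it creates are visible afterwards.
env <- new.env()
for (f in code_files) {
  source(f, local = env)
}

# Verify that every configured object was created, then collect them.
missing_objects <- setdiff(object_names, ls(env))
if (length(missing_objects) > 0) {
  stop("Objects not created: ", paste(missing_objects, collapse = ", "))
}
objects <- mget(object_names, envir = env)
str(objects)
```

Running the code in a separate environment keeps the check simple: anything named in `object_names` but absent from `env` was not produced by the processing code.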

The Rmd and R files used to process the objects are transformed into vignettes accessible in the final package so that the processing is fully documented.

A DATADIGEST file in the package source keeps track of the data object fingerprints. A DataVersion string is added to the package DESCRIPTION file and updated when these objects are updated or changed on subsequent builds.
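The idea of detecting changes through fingerprints can be sketched in base R. The helper `fingerprint()` below is hypothetical and hashes the serialized bytes of an object with `tools::md5sum()`; it is an assumption for illustration and may differ from how the package computes or stores its own digests.

```r
# Sketch only: hash the serialized bytes of an object with base R tools.
fingerprint <- function(obj) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(obj, NULL), tmp)
  unname(tools::md5sum(tmp))
}

set.seed(1)
tbl <- table(sample(1:10, 1000, replace = TRUE))

# Fingerprints recorded at the previous build (a stand-in for DATADIGEST).
old_digest <- c(tbl = fingerprint(tbl))

# Simulate a change in the data on a later build.
tbl_new <- tbl
tbl_new[1] <- tbl_new[1] + 1L
new_digest <- c(tbl = fingerprint(tbl_new))

# Objects whose fingerprints differ are the ones that changed.
changed <- names(new_digest)[new_digest != old_digest[names(new_digest)]]
changed
```

Comparing stored and freshly computed fingerprints by object name is what makes it possible to update a version string only when the data actually changed.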

Once the package is built and installed, the data objects created in the package are accessible via the data() API.

Calling datapackage_skeleton() and passing in R / Rmd file names and R object names constructs a skeleton data package source tree and an associated datapackager.yml file.
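For readers unfamiliar with the data() API, here is a short sketch that uses the built-in "datasets" package as a stand-in for an installed data package. The same calls should apply to any package that ships objects under "data/".

```r
# The built-in "datasets" package stands in for a built data package here.
pkg <- "datasets"

# List the data sets a package provides.
available <- data(package = pkg)$results[, "Item"]
head(available)

# Load one data set into a dedicated environment rather than the workspace.
env <- new.env()
data(list = "mtcars", package = pkg, envir = env)
dim(env$mtcars)
```

Loading into a separate environment avoids overwriting objects of the same name in the global workspace.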

Calling package_build() sets the build process in motion.


# A simple Rmd file that creates one data object
# named "tbl".
f <- file.path(tempdir(), "foo.Rmd")
con <- file(f)
writeLines("```{r}\n tbl = table(sample(1:10,1000,replace=TRUE)) \n```\n", con = con)
close(con)  # release the connection once the file has been written

# construct a data package skeleton named by `pname` (a random temporary
# name) and pass in the Rmd file name with full path, and the name of the
# object(s) it creates.

pname <- basename(tempfile())
datapackage_skeleton(name = pname,
   path = tempdir(),
   force = TRUE,
   r_object_names = "tbl",
   code_files = f)

# call package_build to run the "foo.Rmd" processing and
# build a data package.
package_build(file.path(tempdir(), pname), install = FALSE)

# "install" the data package
devtools::load_all(file.path(tempdir(), pname))

# read the data version
data_version(pname)

# list the data sets in the package.
data(package = pname)

# The data objects are in the package source under "/data"
list.files(pattern="rda", path = file.path(tempdir(),pname,"data"), full.names = TRUE)

# The documentation that needs to be edited is in "/R"
list.files(pattern="R", path = file.path(tempdir(), pname,"R"), full.names = TRUE)
readLines(list.files(pattern="R", path = file.path(tempdir(),pname,"R"), full.names = TRUE))
# view the documentation with
?tbl
#>  Creating '/tmp/Rtmp91YS0s/file76f7c847b79/'
#>  Setting active project to '/tmp/Rtmp91YS0s/file76f7c847b79'
#>  Creating 'R/'
#>  Writing 'DESCRIPTION'
#> Package: file76f7c847b79
#> Title: What the Package Does (One Line, Title Case)
#> Version:
#> Authors@R (parsed):
#>     * First Last <first.last@example.com> [aut, cre] (YOUR-ORCID-ID)
#> Description: What the package does (one paragraph).
#> License: `use_mit_license()`, `use_gpl3_license()` or friends to
#>     pick a license
#> Encoding: UTF-8
#> Roxygen: list(markdown = TRUE)
#> RoxygenNote: 7.2.3
#>  Writing 'NAMESPACE'
#>  Setting active project to '<no active project>'
#>  Setting active project to '/tmp/Rtmp91YS0s/file76f7c847b79'
#>  Added DataVersion string to 'DESCRIPTION'
#>  Creating 'data-raw/'
#>  Creating 'data/'
#>  Creating 'inst/extdata/'
#>  Copied foo.Rmd into 'data-raw'
#>  configured 'datapackager.yml' file
#>  1 data set(s) created by foo.Rmd
#>  tbl
#>  Built  all datasets!
#> Non-interactive file update.
#>  Creating 'vignettes/'
#>  Creating 'inst/doc/'
#>  Loading file76f7c847b79
#> Writing NAMESPACE
#> Writing file76f7c847b79.Rd
#> Writing tbl.Rd
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#> * checking for file ‘/tmp/Rtmp91YS0s/file76f7c847b79/DESCRIPTION’ ... OK
#> * preparing ‘file76f7c847b79’:
#> * checking DESCRIPTION meta-information ... OK
#> * checking for LF line-endings in source and make files and shell scripts
#> * checking for empty or unneeded directories
#> * looking to see if a ‘data/datalist’ file should be added
#>   NB: this package now depends on R (>= 3.5.0)
#>   WARNING: Added dependency on R >= 3.5.0 because serialized objects in
#>   serialize/load version 3 cannot be read in older versions of R.
#>   File(s) containing such objects:
#>     ‘file76f7c847b79/data/tbl.rda’
#> * building ‘file76f7c847b79_1.0.tar.gz’
#> Next Steps 
#> 1. Update your package documentation. 
#>    - Edit the documentation.R file in the package source data-raw subdirectory and update the roxygen markup. 
#>    - Rebuild the package documentation with  document() . 
#> 2. Add your package to source control. 
#>    - Call  git init .  in the package source root directory. 
#>    -  git add  the package files. 
#>    -  git commit  your new package. 
#>    - Set up a github repository for your pacakge. 
#>    - Add the github repository as a remote of your local package repository. 
#>    -  git push  your local repository to gitub. 
#>  Loading file76f7c847b79
#>  Rendering development documentation for "tbl"
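
The log above reports that a DataVersion string was added to 'DESCRIPTION'. As a minimal sketch of what reading and incrementing such a field could look like, the following uses only base R (`read.dcf()` / `write.dcf()`) on a throwaway file. The field values are made up, and which version component gets incremented is an assumption made for this example, not a description of the package's own logic.

```r
# Write a minimal DESCRIPTION with a DataVersion field (values are made up).
desc_file <- file.path(tempdir(), "DESCRIPTION")
write.dcf(
  cbind(Package = "MyDataPackage", Version = "1.0", DataVersion = "0.1.0"),
  file = desc_file
)

# Read the field back.
desc <- read.dcf(desc_file)
current <- unname(desc[1, "DataVersion"])

# Bump the last component; which component to bump is an assumption here.
parts <- as.integer(strsplit(current, ".", fixed = TRUE)[[1]])
parts[length(parts)] <- parts[length(parts)] + 1L
bumped <- paste(parts, collapse = ".")

# Store the new value and confirm it round-trips.
desc[1, "DataVersion"] <- bumped
write.dcf(desc, file = desc_file)
read.dcf(desc_file, fields = "DataVersion")
```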