Data packages are a standard format for describing metadata for a collection of datasets. The `datapkg` package provides convenience functions for retrieving and parsing data packages in R. To install in R:
The `datapkg_read` function retrieves and parses data packages from local or remote sources. A few example packages are available from the datasets and testsuite-py repositories. The path needs to point to a directory on disk, a git remote, or a URL containing the root of the data package.
```r
# Load client
library(datapkg)

# Clone via git
cities <- datapkg_read("git://github.com/datasets/world-cities")

# Same data but download over http
cities <- datapkg_read("https://raw.githubusercontent.com/datasets/world-cities/master")
```
The output object contains data and metadata from the data package, with the actual datasets inside the `data` field.
In the case of multiple datasets, each one is either referenced by index or, if available, by name (names are optional in data packages).
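Access by index and by name can be sketched with a plain R list. The `pkg` object below is a hypothetical stand-in, not the output of `datapkg_read`; it assumes the `data` field behaves like an ordinary, possibly partially named, R list, and it is built by hand so the example runs without the package or a network connection.

```r
# Hypothetical stand-in for a parsed data package (an assumption, built by hand)
pkg <- list(
  data = list(
    cars = head(mtcars),  # a dataset that carries a name
    head(iris)            # a dataset without a name
  )
)

# Inspect which datasets have names; unnamed entries show up as ""
dataset_names <- names(pkg$data)

# Access by position always works
first  <- pkg$data[[1]]
second <- pkg$data[[2]]

# Access by name works only for datasets that have one
by_name <- pkg$data[["cars"]]
```

Indexing by position is the safer default when a package may omit names.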
The package also has basic functionality to save a data frame into a data package and update the `datapackage.json` file accordingly.
```r
# Create new data package
pkgdir <- tempfile()
datapkg_write(mtcars, path = pkgdir)
datapkg_write(iris, path = pkgdir)

# Read it back
mypkg <- datapkg_read(pkgdir)
print(mypkg$data$mtcars)
```
From here you can modify the `datapackage.json` file with other metadata.
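One way to add metadata is to edit the JSON from R. The sketch below assumes the `jsonlite` package is installed. It uses a hand-written, hypothetical `datapackage.json` in a temporary directory so that it runs on its own; in practice the file would live in the directory passed to `datapkg_write`. The `title` and `description` values are illustrative.

```r
library(jsonlite)

# Hypothetical minimal package directory, created by hand for this sketch
pkgdir <- tempfile()
dir.create(pkgdir)
json_path <- file.path(pkgdir, "datapackage.json")
writeLines(
  '{"name": "example", "resources": [{"name": "mtcars", "path": "data/mtcars.csv"}]}',
  json_path
)

# Read without simplification so JSON arrays and objects keep their shape
meta <- fromJSON(json_path, simplifyVector = FALSE)

# Add descriptive fields (illustrative values)
meta$title <- "Example data package"
meta$description <- "Motor Trend car road tests"

# Write the updated metadata back; auto_unbox keeps scalars as JSON scalars
write_json(meta, json_path, pretty = TRUE, auto_unbox = TRUE)

# Read it again to confirm that the existing entries survived the round trip
updated <- fromJSON(json_path, simplifyVector = FALSE)
```

Reading with `simplifyVector = FALSE` avoids turning the `resources` array into a data frame, which would change its structure when written back.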
This package is a work in progress. Current open issues:
- Parsing `0`/`1` values for booleans: PR#406
- Parsing year-only dates (`%Y`). Not sure if this constitutes a valid date actually: PR#407
- `readr` requires specifying which strings are interpreted as missing values. The defaults are the empty string and `NA`. A similar property needs to be defined in the spec.
- In some example packages, `datapackage.json` does not match the csv data. Examples: s-and-p-500 and currency-codes
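Until these issues are resolved, the affected columns could be post-processed by hand. The following is a minimal base R sketch, not part of `datapkg` or `readr`. The `raw` data frame is hypothetical, and pinning a year-only value to January 1 is an assumption, not something the spec prescribes.

```r
# Hypothetical raw columns, as they might look when everything is read as text
raw <- data.frame(
  active = c("1", "0", ""),
  year   = c("1999", "2005", "NA"),
  stringsAsFactors = FALSE
)

# Treat the empty string and "NA" as missing values
na_strings <- c("", "NA")
raw[] <- lapply(raw, function(col) ifelse(col %in% na_strings, NA, col))

# Convert 0/1 strings to logical values
active <- as.logical(as.integer(raw$active))

# Convert year-only strings to dates, pinned to January 1 (an assumption)
year_text <- ifelse(is.na(raw$year), NA, paste0(raw$year, "-01-01"))
year_date <- as.Date(year_text, format = "%Y-%m-%d")
```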