Overview
The dataset
package helps you create semantically rich, machine-readable, and interoperable datasets in R. It introduces S3 classes that extend data frames, vectors, and bibliographic entries with formal metadata structures inspired by:
-
SDMX (Statistical Data and Metadata eXchange), widely used in official statistics
-
Dublin Core and DataCite, for FAIR-compliant depositing and reuse in scientific and open data repositories
- Open Science publishing practices, to support transparent and reproducible research
The goal is to preserve metadata when reusing statistical and repository datasets, improve interoperability, and make it easy to turn tidy data frames into web-ready, publishable datasets that comply with ISO and W3C standards.
Installation
You can install the latest released version of dataset
from CRAN with:
install.packages("dataset")
To install the development version from GitHub with pak
or remotes
:
# install.packages("pak")
pak::pak("dataobservatory-eu/dataset")
# install.packages("remotes")
remotes::install_github("dataobservatory-eu/dataset")
Minimal Example
library(dataset)
df <- dataset_df(
country = defined(
c("AD", "LI"),
label = "Country",
namespace = "https://www.geonames.org/countries/$1/"
),
gdp = defined(c(3897, 7365),
label = "GDP",
unit = "million euros"
),
dataset_bibentry = dublincore(
title = "GDP Dataset",
creator = person("Jane", "Doe", role = "aut"),
publisher = "Small Repository"
)
)
print(df)
#> Doe (2025): GDP Dataset [dataset]
#> rowid country gdp
#> <defined> <defined> <defined>
#> 1 obs1 AD 3897
#> 2 obs2 LI 7365
Export as RDF triples:
dataset_to_triples(df, format = "nt")
#> [1] "<http://example.com/dataset#obsobs1> <http://example.com/prop/country> <https://www.geonames.org/countries/AD/> ."
#> [2] "<http://example.com/dataset#obsobs2> <http://example.com/prop/country> <https://www.geonames.org/countries/LI/> ."
#> [3] "<http://example.com/dataset#obsobs1> <http://example.com/prop/gdp> \"3897\"^^<xsd:decimal> ."
#> [4] "<http://example.com/dataset#obsobs2> <http://example.com/prop/gdp> \"7365\"^^<xsd:decimal> ."
Retain automatically recorded provenance:
provenance(df)
#> [1] "<http://example.com/dataset_prov.nt> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Bundle> ."
#> [2] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> ."
#> [3] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> ."
#> [4] "_:doejane <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> ."
#> [5] "<https://doi.org/10.32614/CRAN.package.dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> ."
#> [6] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."
#> [7] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2025-08-27T17:55:55Z\"^^<xsd:dateTime> ."
Contributing
We welcome contributions and discussion!
- Please see our CONTRIBUTING.md guide.
- Ideas, bug reports, and feedback are welcome via GitHub issues.
- The design principles and ideas for futher development are explained in Design Principles & Future Work Semantically Enriched, Standards-Aligned Datasets in R.
Code of Conduct
This project follows the rOpenSci Code of Conduct. By participating, you are expected to uphold these guidelines.