Overview
The dataset package enriches R’s native data structures with
machine-readable metadata. It allows variables and datasets to carry
semantic definitions (such as URIs, labels, units, and provenance),
which makes them suitable for long-term reuse, FAIR-compliant
publishing, and integration into semantic web systems.
Unlike most metadata packages, which attach metadata after the fact,
dataset follows a semantic early-binding approach: metadata is embedded
as soon as the data is created.
This vignette provides a high-level introduction. For details on key components, see the following:
- vignette("defined", package = "dataset"): Semantic vectors with defined()
- vignette("dataset_df", package = "dataset"): Structuring and metadata with dataset_df()
- vignette("rdf", package = "dataset"): Exporting to RDF and Linked Data
- vignette("bibrecord", package = "dataset"): Creating rich citation metadata using bibrecord()
Why extend tidy data?
Hadley Wickham (2014) defines tidy data with three principles:
- Each variable forms a column
- Each observation forms a row
- Each observational unit forms a table
This structure is ideal for analysis, but it lacks semantic clarity.
The gap shows most in a realistic, less-than-ideal scenario in which an
analyst works with several datasets received from various internet
services. For example, two datasets might both contain a column named
gdp, but one might be in euros and the other in dollars. Without
metadata, tools cannot detect this mismatch.
The dataset package addresses this by allowing you to define variables
explicitly, and to store dataset-level metadata within a tidy tibble.
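The unit mismatch described above can be illustrated with a minimal base-R sketch. It does not use the dataset package: plain attributes stand in for richer metadata, and the helper `combine_checked()`, the dollar values, and the `CP_MUSD` unit code are hypothetical, introduced only for this illustration.

```r
# Base-R sketch only: plain attributes stand in for richer metadata.
gdp_eur <- structure(c(2355, 2592, 2884), unit = "CP_MEUR")
# Hypothetical values and unit code, for illustration only.
gdp_usd <- structure(c(2600, 2800, 3100), unit = "CP_MUSD")

# Hypothetical helper: combine two vectors only if their declared units agree.
combine_checked <- function(x, y) {
  unit_x <- attr(x, "unit")
  unit_y <- attr(y, "unit")
  if (is.null(unit_x) || is.null(unit_y)) {
    stop("Missing unit metadata: cannot verify that the vectors are comparable.")
  }
  if (!identical(unit_x, unit_y)) {
    stop("Unit mismatch: ", unit_x, " vs ", unit_y)
  }
  # c() drops custom attributes, so the shared unit is re-attached explicitly.
  structure(c(x, y), unit = unit_x)
}

# The mismatch is caught instead of silently producing a mixed-unit vector.
result <- tryCatch(
  combine_checked(gdp_eur, gdp_usd),
  error = function(e) conditionMessage(e)
)
result
```

Once a unit is recorded on each vector, a tool has something to compare. Without it, the two `gdp` columns are indistinguishable numeric vectors.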
Example: defining semantically rich vectors
Semantically rich vectors are columns of a data.frame that carry more than a simple column name: a long-form, human-readable title (label); a machine- and human-readable variable definition; and, if needed, a reference to an external resource that contains the codebook.
library(dataset)
gdp <- defined(
  c(2355, 2592, 2884),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)
geo <- defined(
  rep("AD", 3),
  label = "Geopolitical Entity",
  concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  namespace = "https://www.geonames.org/countries/$1/"
)
gdp
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR
#> [1] 2355 2592 2884
geo
#> x: Geopolitical Entity
#> Defined as http://purl.org/linked-data/sdmx/2009/dimension#refArea
#> [1] "AD" "AD" "AD"
In this case, we define geo as the geopolitical entity
http://purl.org/linked-data/sdmx/2009/dimension#refArea. Because the
namespace template replaces $1 with the value, AD resolves to Andorra:
https://www.geonames.org/countries/AD/. These vectors now carry
metadata you can inspect directly (including their label, unit, and
concept URI), which will be preserved even after transformation or
storage.
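To see why preservation matters, here is a minimal base-R sketch that uses plain attributes rather than the package's defined() class. It makes no claim about how the package works internally. It only shows the behaviour of ordinary R vectors: custom attributes survive a round trip through an .rds file, but ordinary subsetting drops them.

```r
# Base-R sketch: plain attributes, not the package's defined() class.
x <- structure(
  c(2355, 2592, 2884),
  label = "Gross Domestic Product",
  unit = "CP_MEUR"
)

# Storage: attributes survive a round trip through an .rds file.
path <- tempfile(fileext = ".rds")
saveRDS(x, path)
restored <- readRDS(path)
attr(restored, "unit")

# Transformation: ordinary subsetting drops custom attributes.
subset_x <- x[1:2]
attr(subset_x, "unit")
```

The first lookup returns the stored unit, while the second returns NULL. Plain attributes are therefore a fragile carrier for metadata, and a dedicated vector class can guard against this kind of loss.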
Example: creating a dataset from a metadata-enriched data frame
small_dataset <- dataset_df(
  geo = geo,
  gdp = gdp,
  identifier = c(gdp = "http://example.com/dataset#gdp"),
  dataset_bibentry = dublincore(
    title = "Small GDP Dataset",
    creator = person("Jane", "Doe", role = "aut"),
    publisher = "Small Repository",
    subject = "Gross Domestic Product"
  )
)
small_dataset
#> Doe (2025): Small GDP Dataset [dataset]
#> rowid geo gdp
#> <defined> <defined> <defined>
#> 1 gdp1 AD 2355
#> 2 gdp2 AD 2592
#> 3 gdp3 AD 2884
This dataset not only stores the variables and values, but also includes embedded metadata that supports precise interpretation and repository-level publication.
as_dublincore(small_dataset)
#> Dublin Core Metadata Record
#> --------------------------
#> Title: Small GDP Dataset
#> Creator(s): Jane Doe [aut]
#> Contributor(s): :unas
#> Subject(s): Gross Domestic Product
#> Publisher: Small Repository
#> Year: 2025
#> Language: :unas
#> Description: :unas
Exporting to RDF
As Carl Boettiger has shown in the vignettes accompanying the rdflib R package, which takes its name from the popular Python library (see: A tidyverse lover’s intro to RDF), tidy datasets can be retrofitted with rich metadata if they are pivoted to a strictly three-column long format.
Our package tries to lower the burden of such retrofitting for those who are not familiar with RDF: it relies on early binding and sensible defaults to serialise the dataset’s contents and the dataset’s bibliographic data to this format.
You can convert any dataset_df object into a tidy 3-column
representation (subject–predicate–object) using dataset_to_triples():
triples <- dataset_to_triples(small_dataset,
  format = "nt"
)
triples
#> [1] "<http://example.com/dataset#gdpgdp1> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [2] "<http://example.com/dataset#gdpgdp2> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [3] "<http://example.com/dataset#gdpgdp3> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [4] "<http://example.com/dataset#gdpgdp1> <http://data.europa.eu/83i/aa/GDP> \"2355\"^^<xsd:decimal> ."
#> [5] "<http://example.com/dataset#gdpgdp2> <http://data.europa.eu/83i/aa/GDP> \"2592\"^^<xsd:decimal> ."
#> [6] "<http://example.com/dataset#gdpgdp3> <http://data.europa.eu/83i/aa/GDP> \"2884\"^^<xsd:decimal> ."
With format = "nt", each statement is serialised as one N-Triples line
holding a subject, a predicate, and an object. This form is compatible
with semantic web tooling such as SPARQL engines, rdflib, and triple
stores.
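The pivot to a three-column long form, and its serialisation to N-Triples, can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the small data frame, the `base_uri` value, and the column names `s`, `p`, and `o` are hypothetical, and the datatype is written as the full XML Schema IRI.

```r
# Base-R sketch with hypothetical names; not the package's implementation.
df <- data.frame(
  rowid = c("gdp1", "gdp2", "gdp3"),
  gdp = c(2355, 2592, 2884),
  stringsAsFactors = FALSE
)
base_uri <- "http://example.com/dataset#"        # assumed row namespace
predicate <- "http://data.europa.eu/83i/aa/GDP"  # concept URI of the gdp variable

# One row per statement: subject (row URI), predicate (variable URI), object (value).
triples_long <- data.frame(
  s = paste0(base_uri, df$rowid),
  p = predicate,
  o = as.character(df$gdp),
  stringsAsFactors = FALSE
)

# Serialise each row as an N-Triples line with a typed literal.
xsd_decimal <- "http://www.w3.org/2001/XMLSchema#decimal"
nt_lines <- sprintf(
  '<%s> <%s> "%s"^^<%s> .',
  triples_long$s, triples_long$p, triples_long$o, xsd_decimal
)
nt_lines[1]
```

Each cell of the original table becomes one statement, so a table with n rows and k variables yields n × k triples. That is why the long form can carry per-variable semantics that a wide table cannot.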
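# describe() writes the dataset's metadata statements to the file given in `con`.
# Note the leading dot: tempfile() appends `fileext` verbatim.
mycon <- tempfile("my_dataset", fileext = ".nt")
my_description <- describe(x = small_dataset, con = mycon)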
# Only three statements are shown:
readLines(mycon)[c(4, 8, 12)]
#> [1] "_:doejane <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> ."
#> [2] "<http://example.com/dataset_tba/> <http://purl.org/dc/terms/title> \"Small GDP Dataset\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [3] "<http://example.com/dataset_tba/> <http://purl.org/dc/terms/type> <http://purl.org/dc/dcmitype/Dataset> ."
## Show two lines of provenance:
provenance(small_dataset)[c(6, 7)]
#> [1] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."
#> [2] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2025-08-27T18:33:42Z\"^^<xsd:dateTime> ."
Summary
The dataset package enriches tidy data by attaching metadata from the start of the workflow. It helps avoid semantic mismatches, supports RDF publication, and meets interoperability standards like SDMX, DataCite, and Dublin Core. Use it when you need:
- Meaningful variable descriptions and URIs
- Dataset-level metadata embedded directly in .rds or .rda files
- Easy export to RDF and semantic web formats
For deeper examples, see:
- vignette("defined", package = "dataset"): Working with semantic vectors
- vignette("dataset_df", package = "dataset"): Dataset-level metadata and structure
- vignette("rdf", package = "dataset"): Linked Data and export
- vignette("bibrecord", package = "dataset"): Creating rich citation metadata using bibrecord()