An Introduction to the dataset Package

Overview

The dataset package enriches R’s native data structures with machine-readable metadata. It allows variables and datasets to carry semantic definitions — such as URIs, labels, units, and provenance — which makes them suitable for long-term reuse, FAIR-compliant publishing, and integration into semantic web systems.

Unlike most metadata packages that attach metadata after the fact, dataset follows a semantic early-binding approach: metadata is embedded as soon as the data is created.

This vignette provides a high-level introduction. For details on key components, see the following:

vignette("defined", package = "dataset"): Semantic vectors with defined()
vignette("dataset_df", package = "dataset"): Structuring and metadata with dataset_df()
vignette("rdf", package = "dataset"): Exporting to RDF and Linked Data
vignette("bibrecord", package = "dataset"): Creating rich citation metadata using bibrecord()

Why extend tidy data?

Hadley Wickham (2014) defines tidy data with three principles:

Each variable forms a column
Each observation forms a row
Each observational unit forms a table

This structure is ideal for analysis, but it carries no semantics. The gap shows in a realistic, less-than-ideal scenario in which an analyst combines several datasets received from different internet services. For example, two datasets might both contain a column named gdp, but one might be in euros and the other in dollars. Without metadata, tools cannot detect this mismatch.
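The mismatch is easy to reproduce in base R. The sketch below uses two made-up data frames (the values and the CP_MUSD unit code are illustrative assumptions): rbind() combines them without complaint, and only a unit recorded as an attribute makes the difference detectable.

```r
# Two hypothetical data frames that both use the column name "gdp".
gdp_eur <- data.frame(geo = c("AD", "LI"), gdp = c(2355, 5490))
gdp_usd <- data.frame(geo = c("SM", "MC"), gdp = c(1616, 6816))

# rbind() succeeds silently and mixes euros and dollars in one column.
mixed <- rbind(gdp_eur, gdp_usd)

# Recording the unit as an attribute makes the mismatch checkable.
attr(gdp_eur$gdp, "unit") <- "CP_MEUR"
attr(gdp_usd$gdp, "unit") <- "CP_MUSD" # assumed unit code, for illustration

same_unit <- identical(attr(gdp_eur$gdp, "unit"), attr(gdp_usd$gdp, "unit"))
same_unit
```

Plain attributes like these are dropped by many base R operations, such as subsetting a vector with [, which is one reason to prefer a dedicated vector class.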

The dataset package addresses this by allowing you to define variables explicitly, and to store dataset-level metadata within a tidy tibble.

Example: defining semantically rich vectors

Semantically rich vectors are columns of a data.frame that carry more than a simple column name: a long-form, human-readable label; a machine- and human-readable variable definition; and, if needed, a link to an external resource that contains the codebook.

library(dataset)

gdp <- defined(
  c(2355, 2592, 2884),
  label = "Gross Domestic Product",
  unit = "CP_MEUR",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

geo <- defined(
  rep("AD", 3),
  label = "Geopolitical Entity",
  concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  namespace = "https://www.geonames.org/countries/$1/"
)

gdp
#> x: Gross Domestic Product
#> Defined as http://data.europa.eu/83i/aa/GDP, measured in CP_MEUR 
#> [1] 2355 2592 2884
geo
#> x: Geopolitical Entity
#> Defined as http://purl.org/linked-data/sdmx/2009/dimension#refArea 
#> [1] "AD" "AD" "AD"

In this case, we define geo as the geopolitical entity http://purl.org/linked-data/sdmx/2009/dimension#refArea. The namespace is a URI template in which $1 stands for the value, so AD resolves to Andorra: https://www.geonames.org/countries/AD/. These vectors now carry metadata that you can inspect directly, including their label, unit, and concept URI, and that metadata is preserved after transformation or storage.
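How a namespace template turns a value into a URI can be sketched in base R. The resolve_uri() helper below is hypothetical and not part of the dataset package; it only assumes that $1 marks the place where the value is substituted. The LI code is an extra value added for illustration.

```r
# Hypothetical helper, not part of the dataset package.
resolve_uri <- function(value, namespace) {
  vapply(
    value,
    # fixed = TRUE treats "$1" as a literal string, not a regular expression.
    function(v) sub("$1", v, namespace, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

uris <- resolve_uri(c("AD", "LI"), "https://www.geonames.org/countries/$1/")
uris
```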

Example: creating a dataset from a metadata-enriched data frame

small_dataset <- dataset_df(
  geo = geo,
  gdp = gdp,
  identifier = c(gdp = "http://example.com/dataset#gdp"),
  dataset_bibentry = dublincore(
    title = "Small GDP Dataset",
    creator = person("Jane", "Doe", role = "aut"),
    publisher = "Small Repository",
    subject = "Gross Domestic Product"
  )
)

small_dataset
#> Doe (2025): Small GDP Dataset [dataset]
#>   rowid geo     gdp 
#>   <chr> <chr> <dbl>
#> 1 gdp1  AD     2355
#> 2 gdp2  AD     2592
#> 3 gdp3  AD     2884

This dataset not only stores the variables and values, but also includes embedded metadata that supports precise interpretation and repository-level publication.

as_dublincore(small_dataset)
#> Dublin Core Metadata Record
#> --------------------------
#> Title:       Small GDP Dataset
#> Creator(s):  Jane Doe [aut]
#> Contributor(s): :unas
#> Subject(s):  Gross Domestic Product
#> Publisher:   Small Repository
#> Year:        2025
#> Language:    :unas
#> Description: :unas

Exporting to RDF

As Carl Boettiger has shown in the vignettes accompanying the rdflib R package (see: A tidyverse lover’s intro to RDF), tidy datasets can be retrofitted with rich metadata if they are pivoted to a strictly three-column long format.
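A minimal base R sketch of such a pivot follows, using a plain copy of the values from small_dataset. The column names s, p, and o are assumptions chosen for this illustration, and the predicates are still bare column names, not URIs.

```r
# A plain data frame holding the same values as small_dataset.
tidy <- data.frame(
  rowid = c("gdp1", "gdp2", "gdp3"),
  geo   = c("AD", "AD", "AD"),
  gdp   = c(2355, 2592, 2884),
  stringsAsFactors = FALSE
)

# Every column except the row identifier becomes a predicate.
value_cols <- setdiff(names(tidy), "rowid")

# One block of (subject, predicate, object) rows per column, stacked.
triples_long <- do.call(rbind, lapply(value_cols, function(col) {
  data.frame(
    s = tidy$rowid,
    p = col,
    o = as.character(tidy[[col]]), # objects share one column, so use text
    stringsAsFactors = FALSE
  )
}))

triples_long
```

The objects are coerced to character because a single column has to hold both country codes and numbers; in RDF, the datatype is restored by typing each literal.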

Our package tries to lower the burden of such retrofitting through early binding and sensible defaults. For those who are not familiar with RDF, it serialises both the dataset’s contents and its bibliographic data to this format.

You can convert any dataset_df object into a tidy 3-column representation (subject–predicate–object) using dataset_to_triples():

triples <- dataset_to_triples(small_dataset,
  format = "nt"
)
triples
#> [1] "<http://example.com/dataset#gdpgdp1> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [2] "<http://example.com/dataset#gdpgdp2> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [3] "<http://example.com/dataset#gdpgdp3> <http://purl.org/linked-data/sdmx/2009/dimension#refArea> <https://www.geonames.org/countries/AD/> ."
#> [4] "<http://example.com/dataset#gdpgdp1> <http://data.europa.eu/83i/aa/GDP> \"2355\"^^<xsd:decimal> ."                                        
#> [5] "<http://example.com/dataset#gdpgdp2> <http://data.europa.eu/83i/aa/GDP> \"2592\"^^<xsd:decimal> ."                                        
#> [6] "<http://example.com/dataset#gdpgdp3> <http://data.europa.eu/83i/aa/GDP> \"2884\"^^<xsd:decimal> ."

This subject–predicate–object structure, serialised here as N-Triples, is compatible with semantic web tools such as SPARQL endpoints, rdflib, and triple stores.
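The shape of one such statement can be reproduced by hand. The following base R sketch rebuilds line [4] of the output above from its parts, assuming that the subject is the variable identifier followed by the row identifier. It illustrates the N-Triples syntax and is not the package's implementation.

```r
# Subject: variable identifier plus row identifier (an assumption read off
# the output above).
subject   <- paste0("http://example.com/dataset#gdp", "gdp1")
predicate <- "http://data.europa.eu/83i/aa/GDP"
value     <- 2355

# An N-Triples statement: <subject> <predicate> "literal"^^<datatype> .
nt_line <- sprintf(
  '<%s> <%s> "%s"^^<xsd:decimal> .',
  subject, predicate, as.character(value)
)

nt_line
```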

mycon <- tempfile("my_dataset", 
                  fileext = ".nt")
my_description <- describe(x = small_dataset, 
                           con = mycon)

# Only three statements are shown:
readLines(mycon)[c(4, 8, 12)]
#> [1] "_:doejane <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> ."                                      
#> [2] "<http://example.com/dataset_tba/> <http://purl.org/dc/terms/title> \"Small GDP Dataset\"^^<http://www.w3.org/2001/XMLSchema#string> ."
#> [3] "<http://example.com/dataset_tba/> <http://purl.org/dc/terms/type> <http://purl.org/dc/dcmitype/Dataset> ."

# Show two lines of provenance:
provenance(small_dataset)[c(6, 7)]
#> [1] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."
#> [2] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2025-12-16T10:45:36Z\"^^<xsd:dateTime> ."

Coercing back

There may be use cases where your richer dataset needs to be simplified to a base R data.frame or a tbl_df.

We offer two coercion forms:

small_df <- as.data.frame(small_dataset, 
              strip_attributes = FALSE)

attr(small_df, "subject")
#> $term
#> [1] "Data sets"
#> 
#> $subjectScheme
#> [1] "LCSH"
#> 
#> $schemeURI
#> [1] "http://id.loc.gov/authorities/subjects"
#> 
#> $valueURI
#> [1] "http://id.loc.gov/authorities/subjects/sh2018002256"
#> 
#> $classificationCode
#> NULL
#> 
#> $prefix
#> [1] "lcsh:"
#> 
#> attr(,"class")
#> [1] "subject" "list"

With strip_attributes = FALSE, the rich attributes remain in the base R data.frame. In most pipelines these attributes play no role, so you can keep them and, if needed, later restore the object to its richer form.
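That attributes on a base R data.frame survive storage can be checked with a round trip through an .rds file. This is a sketch with a plain data frame; the attribute names title and unit are illustrative assumptions and do not reflect the package's internal layout.

```r
df <- data.frame(geo = c("AD", "AD", "AD"), gdp = c(2355, 2592, 2884))

# Attach one column-level and one dataset-level attribute.
attr(df$gdp, "unit") <- "CP_MEUR"
attr(df, "title") <- "Small GDP Dataset"

# Save to a temporary .rds file and read it back.
path <- tempfile(fileext = ".rds")
saveRDS(df, path)
restored <- readRDS(path)
unlink(path)

attr(restored, "title")
attr(restored$gdp, "unit")
```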

You can also strip all these attributes and choose a tbl_df (if you have tibble installed):

small_tbl <- as_tibble(
  small_dataset, 
  strip_attributes = TRUE)

small_tbl
#> # A tibble: 3 × 3
#>   rowid geo     gdp
#>   <chr> <chr> <dbl>
#> 1 gdp1  AD     2355
#> 2 gdp2  AD     2592
#> 3 gdp3  AD     2884

Summary

The dataset package enriches tidy data by attaching metadata from the start of the workflow. It helps avoid semantic mismatches, supports RDF publication, and meets interoperability standards like SDMX, DataCite, and Dublin Core. Use it when you need:

Meaningful variable descriptions and URIs
Dataset-level metadata embedded directly in .rds or .rda files
Easy export to RDF and semantic web formats

For deeper examples, see:

vignette("defined", package = "dataset"): Working with semantic vectors
vignette("dataset_df", package = "dataset"): Dataset-level metadata and structure
vignette("rdf", package = "dataset"): Linked Data and export
vignette("bibrecord", package = "dataset"): Creating rich citation metadata using bibrecord()