
Design Principles & Future Work: Semantically Enriched, Standards-Aligned Datasets in R
Source: vignettes/design.Rmd
Abstract
“A dataset is an identifiable collection of data available for access or download in one or more formats.” — ISO/IEC 20546
The dataset
package enriches R data objects with
machine-readable metadata by embedding semantic definitions and
provenance at both the variable and dataset levels. It follows a
semantic early-binding design: metadata is attached at
creation time, not retrofitted post hoc. This ensures that meaning and
context are preserved throughout the data lifecycle — from exploration
to publication — and enables partial automation of documentation.
This article outlines the design philosophy behind the
dataset
package, including its theoretical foundations,
structure, relationship to other R tools, and example workflows. It
serves as a long-form complement to the package vignettes.
Introduction and Motivation
“The principles of tidy data provide a standard way to organise data values within a dataset.”
— Wickham (2014)
The dataset
package extends R’s native data structures
by embedding machine-readable semantics and provenance directly in tidy
data objects. It builds on tidy data principles (Wickham, 2014) but
introduces a semantic early-binding approach: metadata
is attached when the dataset is created, ensuring that context and
meaning are preserved through all stages of the workflow — including
transformation, validation, serialization, and reuse.
While tidyverse tools enforce structural clarity, they are generally
agnostic about semantics. Variables may be misinterpreted, joined
incorrectly, or published without context. dataset
addresses this gap by aligning with international metadata standards,
supporting RDF export, and providing an interface to the W3C Data Cube
model.
A tidy dataset, per Wickham’s definition, adheres to three core rules:
- Each variable forms a column
- Each observation forms a row
- Each value forms a cell

However, this tidy structure — typically implemented as a
data.frame
or tibble
— is not semantically
self-describing. In practical workflows, users often conflate the
in-memory structure with the abstract concept of a dataset, which in
metadata terms refers not just to structure but also to definitions,
units, provenance, and contributors.
Several ISO and W3C standards define what constitutes a dataset. According to ISO/IEC 20546, a dataset is an identifiable collection of data available for access or download in one or more formats. The Dublin Core DCMI Metadata Terms define a dataset as “data encoded in a defined structure.” The W3C’s Data Cube Vocabulary, widely used in official statistics, describes a dataset as a “collection of statistical data that corresponds to a defined structure.” That structure includes observations, metadata about their organisation, structural metadata (e.g., units of measure), and reference metadata (e.g., creator, publisher).
This differs from R’s data.frame
object, which is
defined as “tightly coupled collections of variables which share many of
the properties of matrices and of lists, used as the fundamental data
structure by most of R’s modeling software.” In practice, R users often
use the terms data frame (or tibble) and dataset
interchangeably. However, even a tidy data frame is underspecified for
use in scientific repositories, statistical data exchanges, or many
database applications. A data.frame
exists only in the
memory of an R session, limiting its interoperability and reusability.
While R can already serialise data frames to formats like
.rds
, .rda
, or .csv
, these
serialisations by default lack rich, standardised metadata. The
dataset
package bridges that gap by aligning with
established metadata standards, producing serialisations that are easier
to reuse and interpret.
The dataset
package extends R’s native data structures
with machine-readable metadata. It follows a semantic
early-binding approach: metadata is embedded as soon as the data is
created, making datasets suitable for long-term reuse, FAIR-compliant
publishing, and integration into semantic web systems.
The central innovation of the package is an extended data-frame-like
object: a tibble::tibble()
enhanced with R’s
attributes()
system to store standard metadata from ISO and
W3C vocabularies. This dataset_df
class integrates
naturally with tidy data principles (Wickham, 2014), where each variable
is a column, each observation is a row, and each type of observational
unit forms a table. On top of this tidy structure,
dataset_df
adds a semantic layer so that the meaning of
variables and datasets is explicit and machine-readable. This new class
is introduced in `vignette("dataset_df", package = "dataset")`.
In research or institutional contexts, a dataset is a form of digital resource, often archived, cited, or published. Such resources are typically described with metadata using the Resource Description Framework (RDF), enabling machine-actionable, language-independent, schema-neutral representation. Our aim is to facilitate the translation or annotation of a tidy R data.frame into such a resource.
RDF also enables description at the level of elementary statements — that is, per-cell metadata combining variable (column) and observation (row). This allows for fine-grained semantic annotation, supporting full data traceability and interoperability.
The original tidy workflow was designed for solo, interactive analysis where analysts had full context. But in collaborative, institutional, or public-sharing contexts, assumptions must be replaced with formal semantics. Not only structure, but also clear definitions — of units, classifications, codes, and contributors — become essential.
Moreover, many statistical data providers follow the data cube model, which resembles tidy data but supports higher dimensionality and more formal metadata. Examples include SDMX and the W3C Data Cube vocabulary.
Tidy data assumes that column names and structure are sufficient for
clarity. However, ambiguity arises quickly when combining datasets from
heterogeneous sources. A column named geo
might contain ISO
codes in one dataset and Eurostat codes in another. GDP figures may
differ in currency or base year. These inconsistencies often go
unnoticed until late-stage analytical errors.
For example:
data.frame(
geo = c("LI", "SM"),
CPI = c("0.8", "0.9"),
GNI = c("8976", "9672")
)
#> geo CPI GNI
#> 1 LI 0.8 8976
#> 2 SM 0.9 9672
This dataset is tidy, but not self-describing. Is geo using ISO 3166 or Eurostat codes? Is GNI measured in euros, dollars, or PPP-adjusted values?
The dataset package addresses these challenges by introducing
structures for semantically rich vectors (defined()
) and
annotated tibbles (dataset_df()
). It integrates
machine-readable metadata directly into R objects and ensures that
labels, units, concept URIs, and provenance are preserved from creation
to publication.
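For instance, the ambiguous geo and GNI columns shown above could be created as self-describing vectors with the package's defined() constructor (introduced in detail below). This is a sketch; the unit code and concept URIs here are illustrative:

```r
library(dataset)

# Self-describing versions of the ambiguous columns from the example above
geo <- defined(
  c("LI", "SM"),
  label = "Geopolitical Entity",
  concept = "http://dd.eionet.europa.eu/vocabulary/eurostat/geo/",
  namespace = "https://www.geonames.org/countries/$1/"
)
gni <- defined(
  c(8976, 9672),
  label = "Gross National Income",
  unit = "CP_MEUR"
)
```

A reader (or a program) can now resolve what "LI" means and in which unit GNI is expressed, without consulting external documentation.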
This approach bridges the gap between tidy data and RDF, making formal semantics part of the tidyverse workflow — without requiring users to leave R or manually manage external metadata schemas.
Related Work
Several R packages have offered tools to improve the metadata management of datasets within the tidyverse ecosystem or its surrounding statistical traditions.
The labelled
class in the labelled
and
haven
packages supports long-form variable labels and
improved handling of value label sets compared to base R’s
factor
class. This is particularly helpful for variables
collected in survey instruments — a major source of microdata in
statistical workflows. However, real statistical production, as
standardized by GSIM (Generic Statistical Information Model) and DDI
(Data Documentation Initiative), involves a far more complex metadata
model. Our contribution builds on these efforts by enabling users to
attach standardized, cross-domain codebook references to such variables,
ensuring that labelling follows recognized metadata vocabularies.
The dataspice
package allows users to create auxiliary
metadata datasets for publication. Its strength lies in its simplicity:
it uses human-readable CSV files to capture key metadata fields.
However, this simplicity introduces fragility: the metadata may become
detached, outdated, or unsynchronised from the data file, especially in
collaborative or iterative settings.
The rdflib
package, a high-level interface to the RDF
library of the same name in Python, supports RDF serialization and
querying. It allows tidy data to be mapped to RDF triples and serialized
into N-Triples, Turtle, RDF/XML, or JSON-LD. However, it assumes that
metadata is retrofitted — applied after the analytical workflow is
complete. While rdflib
is essential for interoperability,
it requires users to leave the tidyverse workflow or gain RDF modelling
expertise.
Our goal with the dataset
package is to bridge the
semantic and methodological gap between the tidyverse
and
rdflib
: to make semantically annotated, publication-ready
datasets part of the R-native workflow from the start.
Another important initiative is the Frictionless
Data project, which provides lightweight standards for
describing datasets (e.g., via datapackage.json
). It
enables platform-independent validation and metadata exchange. In R, the
frictionless
package supports reading, writing, and
validating data packages. However, this system relies heavily on
external JSON schemas and does not integrate metadata into the objects
used during analysis — meaning users must juggle separate metadata files
and validation steps, potentially losing semantic continuity during
transformation.
Tierney and Cook (2020), in their paper Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations, demonstrate how tidy data can be extended through consistent metadata structures and function design. Their concept of “nabular data” — datasets with shadow columns representing missingness — shows how tidy workflows can accommodate new dimensions of metadata. While their focus is on handling missing data, their methodological framing is closely aligned with ours: extending tidy conventions by designing new object classes and verbs, rather than retrofitting external metadata after the fact.
Collectively, these tools highlight a shared recognition: tidy data
principles offer a strong foundation, but do not, by themselves,
guarantee semantic clarity, interoperability, or reuse. The
dataset
package responds to this need by embedding
standardized metadata directly in R objects, enabling datasets to remain
semantically intact throughout transformation, validation, and
publication.
Design Principles
- Early binding of semantics: Metadata is attached at the point of dataset creation, not after the fact.
- Attribute-based, not schema-based: Metadata lives inside the R object itself, not in external schemas or files.
- Minimal friction with tidyverse workflows: Compatible with dplyr, tidyr, and vctrs, and coercible to tibble or data.frame.
- Persistence across save/load cycles: Metadata survives R serialization (.rds, .rda).
- Tidyverse-like grammar: Core verbs include defined(), dataset_df(), provenance(), describe(), datacite(), and dublincore().
- Full interoperability outside R: Datasets can be exported as RDF using dataset_to_triples() and ingested into triple stores.
Semantic Early Binding
The dataset
package introduces several new S3 classes
that remain fully compatible with tidyverse idioms and largely
interoperable with base R. These classes rely on R’s native attribute
system to embed metadata directly within vectors and tibbles. This
enables metadata such as labels, concept URIs, namespaces, and
provenance details to persist during filtering, joining, or
transformation.
The attribute system in R is underused, and most user-friendly packages offer little support or interface for working directly with object attributes. This leads to redundancy — with metadata often duplicated within the dataset content itself.
The defined() constructor builds on labelled::labelled (originally from haven) and provides a more expressive way to annotate vectors with:
- A human-readable label (e.g., "Gross Domestic Product")
- A unit or measurement system (e.g., "CP_MEUR"), accessible via var_unit() and set with var_unit() <-
- A concept URI that uniquely identifies the variable or dimension, handled via var_concept() and its assignment form
- A namespace URI pattern for resolving coded values (e.g., ISO or Eurostat country codes), via var_namespace()
The dataset_df()
class extends tibble
and
supports combining enriched vectors with dataset-level metadata. This
includes Dublin Core and DataCite elements such as title, creator,
publisher, subject, and contributors, along with provenance metadata
like creation time or software agent.
Attribute-Based, Not Schema-Based
The dataset
package adopts an attribute-based design
rather than a schema-based approach. Metadata is stored directly in R
objects using native attributes, ensuring semantic annotations remain
tightly coupled with the data throughout transformation, saving, and
reuse.
This approach eliminates the need for separate schema definitions or JSON metadata files — lowering the barrier to semantic data publishing within R workflows.
In R, most objects (especially vectors and data frames) can carry attributes such as:
- names
- class
- label
- unit
- concept
- namespace
These are lightweight, internal, and flexible. For example:
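A minimal base-R illustration (not using the dataset package itself) shows how attributes work, and also why wrapping them in a dedicated class is useful:

```r
# Base R lets any vector carry metadata as attributes.
x <- c(2355, 2592, 2884)
attr(x, "label") <- "Gross Domestic Product"
attr(x, "unit") <- "CP_MEUR"

attr(x, "unit")
#> [1] "CP_MEUR"

# Base R, however, silently drops plain attributes in many
# operations, for example when subsetting:
attributes(x[1:2])
#> NULL
```

The defined() class guards against this kind of silent metadata loss during subsetting and transformation.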
In the dataset package, this metadata is preserved in defined and dataset_df objects and moves with the data — whether it’s saved, joined, subsetted, or filtered.
By contrast, many CRAN or rOpenSci packages are schema-based: they require external metadata definitions that describe expected columns, data types, and semantic rules. While these can support more complex use cases — such as SDMX structural metadata or JSON Schema validation — they introduce additional overhead, increase complexity, and risk desynchronisation between data and metadata.
Schema-based solutions may be more appropriate when data analysts
work in teams alongside research data managers or other documentation
specialists. In contrast, the dataset
package is designed
for individual researchers or small teams who want to avoid semantic
errors when ingesting new data from external sources — while also
enabling standards-compliant data exchange and publication with minimal
additional tooling.
Persistence Across Save/Load Cycles
Because all metadata is stored as object attributes, it remains intact when datasets are saved using native R serialization formats like .rds or .rda. These attributes can be queried, extracted, or exported — but they do not interfere with regular data manipulation or analysis.
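Because the metadata consists of ordinary R attributes, persistence can be demonstrated with base R alone (a sketch; defined and dataset_df objects behave the same way):

```r
# Attributes survive native R serialization.
x <- c(2355, 2592, 2884)
attr(x, "unit") <- "CP_MEUR"

tmp <- tempfile(fileext = ".rds")
saveRDS(x, tmp)
y <- readRDS(tmp)

attr(y, "unit")
#> [1] "CP_MEUR"
```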
Metadata is added at the time of object creation, in contrast to workflows where metadata is generated after analysis or stored in sidecar files (e.g., JSON-LD). This design reduces the risk of metadata being detached, outdated, or incomplete.
Base Examples: Using the dataset
Grammar
This section demonstrates the core grammar of the
dataset
package using minimal, synthetic examples. These
illustrate how to define semantically enriched vectors, assemble them
into annotated datasets, and prepare them for RDF export or
validation.
Creating Defined Vectors
The defined()
constructor creates semantically enriched
vectors. It extends labelled::labelled
with additional
attributes such as unit
, concept
, and
namespace
.
library(dataset)
gdp <- defined(
c(2355, 2592, 2884),
label = "Gross Domestic Product",
unit = "CP_MEUR",
concept = "http://data.europa.eu/83i/aa/GDP"
)
geo <- defined(
rep("AD", 3),
label = "Geopolitical Entity",
concept = "http://dd.eionet.europa.eu/vocabulary/eurostat/geo/",
namespace = "https://www.geonames.org/countries/$1/"
)
These vectors behave like regular R vectors but carry internal metadata. This metadata can be retrieved or reassigned using the accessor and setter functions provided by the package:
var_concept(gdp)
#> [1] "http://data.europa.eu/83i/aa/GDP"
var_unit(gdp)
#> [1] "CP_MEUR"
var_namespace(geo)
#> [1] "https://www.geonames.org/countries/$1/"
These attributes are preserved across most data transformations, and
persist when saving to .rds
or .rda
.
Assembling a Dataset with Metadata
Use dataset_df()
to combine defined vectors into a
tibble-like object that includes dataset-level metadata, such as
bibliographic information, identifiers, and provenance.
small_dataset <- dataset_df(
geo = geo,
gdp = gdp,
identifier = c(gdp = "http://example.com/dataset#gdp"),
dataset_bibentry = dublincore(
title = "Small GDP Dataset",
creator = person("Jane", "Doe", role = "aut"),
publisher = "Small Repository",
subject = "Gross Domestic Product"
)
)
Behind the scenes, the package uses a custom bibrecord class that
extends utils::bibentry()
to accommodate all metadata
fields defined by Dublin Core and DataCite — two major standards used in
repositories, library systems, and FAIR data infrastructures.
You can review the dataset-level metadata in both formats:
as_dublincore(small_dataset)
#> Dublin Core Metadata Record
#> --------------------------
#> Title: Small GDP Dataset
#> Creator(s): Jane Doe [aut]
#> Contributor(s): :unas
#> Subject(s): Gross Domestic Product
#> Publisher: Small Repository
#> Year: 2025
#> Language: :unas
#> Description: :unas
as_datacite(small_dataset)
#> DataCite Metadata Record
#> --------------------------
#> Title: Small GDP Dataset
#> Creator(s): Jane Doe [aut]
#> Contributor(s): :unas
#> Subject(s): Gross Domestic Product
#> Identifier: :tba
#> Publisher: Small Repository
#> Year: 2025
#> Language: :unas
#> Description: :unas
Since these metadata models do not fully overlap, using
dublincore()
will leave some DataCite-specific fields
empty.
Provenance Tracking
One benefit of early metadata binding is that basic provenance is
automatically tracked. The provenance()
function returns
metadata about when and how the dataset was created — including the
system time and, optionally, the software environment.
provenance(small_dataset)
#> [1] "<http://example.com/dataset_prov.nt> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Bundle> ."
#> [2] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> ."
#> [3] "<http://example.com/dataset#> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> ."
#> [4] "_:doejane <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Agent> ."
#> [5] "<https://doi.org/10.32614/CRAN.package.dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> ."
#> [6] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."
#> [7] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2025-08-27T18:33:38Z\"^^<xsd:dateTime> ."
This provenance is also included in the machine-readable metadata
that can be exported using describe()
, which generates an
RDF description of the dataset.
description_nt <- tempfile(pattern = "small_dataset", fileext = ".nt")
describe(small_dataset, description_nt)
# Only a few lines shown:
readLines(description_nt)[5:8]
#> [1] "<https://doi.org/10.32614/CRAN.package.dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#SoftwareAgent> ."
#> [2] "<http://example.com/creation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Activity> ."
#> [3] "<http://example.com/creation> <http://www.w3.org/ns/prov#generatedAtTime> \"2025-08-27T18:33:38Z\"^^<xsd:dateTime> ."
#> [4] "<http://example.com/dataset_tba/> <http://purl.org/dc/terms/title> \"Small GDP Dataset\"^^<http://www.w3.org/2001/XMLSchema#string> ."
The dataset grammar provides a lightweight but standards-compliant way to attach metadata during the creation of R objects. Unlike retrofitted metadata tools, it keeps semantic annotations inside the object throughout filtering, saving, and publishing. In the next section, we apply this grammar to a real-world scenario involving statistical datasets with conflicting semantics.
Applied Example: Joining Data with Semantic Constraints
This example demonstrates how the dataset
package helps
avoid semantic errors when combining data from heterogeneous sources. We
create a small GDP dataset for three European microstates, measured in
millions of euros (CP_MEUR), and then attempt to append an observation
from Tuvalu, measured in US dollars (USD). The semantic mismatch
triggers an error.
Step 1: Create a Eurostat-Compatible GDP Dataset
euro_gdp <- defined(
c(2355, 2592),
label = "Gross Domestic Product",
unit = "CP_MEUR",
concept = "http://data.europa.eu/83i/aa/GDP"
)
geo_europe <- defined(
c("AD", "LI"),
label = "Geopolitical Entity",
concept = "http://dd.eionet.europa.eu/vocabulary/eurostat/geo/",
namespace = "https://www.geonames.org/countries/$1/"
)
euros_dataset <- dataset_df(
geo = geo_europe,
gdp = euro_gdp,
dataset_bibentry = dublincore(
title = "European Microstates GDP",
creator = person("Statistical Unit", role = "aut"),
publisher = "Eurostat",
subject = "Gross Domestic Product"
)
)
Step 2: Create a Dollar-based GDP Dataset
usd_gdp <- defined(
56,
label = "Gross Domestic Product",
unit = "USD_MILLIONS",
concept = "http://data.europa.eu/83i/aa/GDP"
)
geo_tuvalu <- defined(
"TV",
label = "Geopolitical Entity",
concept = "http://dd.eionet.europa.eu/vocabulary/eurostat/geo/",
namespace = "https://www.geonames.org/countries/$1/"
)
tuvalu_dataset <- dataset_df(
geo = geo_tuvalu,
gdp = usd_gdp,
dataset_bibentry = dublincore(
title = "Tuvalu GDP (USD)",
creator = person("Island", "Bureau", role = "aut"),
publisher = "PacificStats",
subject = "Gross Domestic Product"
)
)
The tidy workflow is built around five operational actions:
- reshaping converts data between long and wide formats;
- sorting arranges rows in a specific order;
- filtering removes rows based on a condition;
- transforming changes existing variables or adds new ones;
- aggregating collapses many values into a single value, for example when computing the minimum, maximum, or mean.
Ideally, each of these steps should be recorded in the metadata. We will only show reshaping and transforming, because an aggregate can be well described by defining the new aggregate variable with defined(), and sorting and filtering are trivial in a format where each observation is uniquely identified.
binded <- try(bind_defined_rows(euros_dataset, tuvalu_dataset), silent = TRUE)
This will raise an error or warning because the gdp column has inconsistent units (CP_MEUR vs USD_MILLIONS). The semantic definitions attached to each vector allow dataset to detect and prevent accidental joins across incompatible measurement systems.
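Before attempting the bind, the mismatch can also be inspected directly with the accessor functions. This sketch assumes the euros_dataset and tuvalu_dataset objects created above, and that $-extraction returns the defined column unchanged, as it does for tibbles:

```r
var_unit(euros_dataset$gdp)
#> [1] "CP_MEUR"
var_unit(tuvalu_dataset$gdp)
#> [1] "USD_MILLIONS"

# The units differ, so a row bind would mix incompatible measurements:
identical(var_unit(euros_dataset$gdp), var_unit(tuvalu_dataset$gdp))
#> [1] FALSE
```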
Step 3: Transform the Data and Document the Change
exchange_rate <- 1.02
eur_tuv_gdp <- defined(
56 * exchange_rate,
label = "Gross Domestic Product",
unit = "CP_MEUR",
concept = "http://data.europa.eu/83i/aa/GDP"
)
tuvalu_dataset <- dataset_df(
geo = geo_tuvalu,
gdp = eur_tuv_gdp,
dataset_bibentry = dublincore(
title = "Tuvalu GDP (USD)",
creator = person("Island", "Bureau", role = "aut"),
publisher = "PacificStats",
subject = "Gross Domestic Product"
)
)
In a larger dataset, users will likely apply the tidyverse grammar (or that of data.table), mutating the dollar values into euro values. In that case, the transformation should be recorded by updating the unit. If you added population data to the GDP dataset and computed GDP per capita, you would also want to add a new long-form variable label, and perhaps change the unit from millions of euros to euros.
var_unit(eur_tuv_gdp) <- "M_EUR"
The joined dataset needs a new title, and it can be attributed to a new author and publisher. The vocabularies of the Dublin Core and DataCite metadata standards used by most repositories and exchanges are covered with convenient helper functions that retrieve or set descriptive metadata values. Some of them, like the title, are protected with explicit overwrite permissions.
global_dataset <- bind_defined_rows(euros_dataset, tuvalu_dataset)
dataset_title(global_dataset, overwrite = TRUE) <- "Global Microstates GDP"
publisher(global_dataset) <- "My Research Institute"
creator(global_dataset) <- person("Jane Doe", role = "aut")
language(global_dataset) <- "en"
description(global_dataset) <- "A dataset created from various sources about the GDP of very small states."
global_dataset
#> Jane Doe [aut] (2025): Global Microstates GDP [dataset]
#> rowid geo gdp
#> <defined> <defined> <defined>
#> 1 obs1 AD 2355
#> 2 obs2 LI 2592
#> 3 obs3 TV 57.1
You can review the descriptive metadata of the dataset with as_dublincore() or as_datacite() in various formats.
as_dublincore(global_dataset)
#> Dublin Core Metadata Record
#> --------------------------
#> Title: Global Microstates GDP
#> Creator(s): Jane Doe [aut]
#> Contributor(s): :unas
#> Subject(s): Gross Domestic Product
#> Publisher: My Research Institute
#> Year: 2025
#> Language: eng
#> Description: A dataset created from various sources about the GDP of very small states.
A tidy dataset can be serialised to RDF with dataset_to_triples(), which reshapes the data from wide to long format. You can read more in the vignettes of rdflib, the high-level R binding to the Python RDFLib library, particularly "A tidyverse lover's introduction to RDF", on how to normalise data into a format that can be serialised to a flat RDF file or loaded into a graph database.
dataset_to_triples(global_dataset)
#> s
#> 1 http://example.com/dataset#obsobs1
#> 2 http://example.com/dataset#obsobs2
#> 3 http://example.com/dataset#obsobs3
#> 4 http://example.com/dataset#obsobs1
#> 5 http://example.com/dataset#obsobs2
#> 6 http://example.com/dataset#obsobs3
#> p
#> 1 http://dd.eionet.europa.eu/vocabulary/eurostat/geo/
#> 2 http://dd.eionet.europa.eu/vocabulary/eurostat/geo/
#> 3 http://dd.eionet.europa.eu/vocabulary/eurostat/geo/
#> 4 http://data.europa.eu/83i/aa/GDP
#> 5 http://data.europa.eu/83i/aa/GDP
#> 6 http://data.europa.eu/83i/aa/GDP
#> o
#> 1 https://www.geonames.org/countries/AD/
#> 2 https://www.geonames.org/countries/LI/
#> 3 https://www.geonames.org/countries/TV/
#> 4 "2355.00"^^<xsd:decimal>
#> 5 "2592.00"^^<xsd:decimal>
#> 6 "57.12"^^<xsd:decimal>
dataset_to_triples(global_dataset, format = "nt")
#> [1] "<http://example.com/dataset#obsobs1> <http://dd.eionet.europa.eu/vocabulary/eurostat/geo/> <https://www.geonames.org/countries/AD/> ."
#> [2] "<http://example.com/dataset#obsobs2> <http://dd.eionet.europa.eu/vocabulary/eurostat/geo/> <https://www.geonames.org/countries/LI/> ."
#> [3] "<http://example.com/dataset#obsobs3> <http://dd.eionet.europa.eu/vocabulary/eurostat/geo/> <https://www.geonames.org/countries/TV/> ."
#> [4] "<http://example.com/dataset#obsobs1> <http://data.europa.eu/83i/aa/GDP> \"2355.00\"^^<xsd:decimal> ."
#> [5] "<http://example.com/dataset#obsobs2> <http://data.europa.eu/83i/aa/GDP> \"2592.00\"^^<xsd:decimal> ."
#> [6] "<http://example.com/dataset#obsobs3> <http://data.europa.eu/83i/aa/GDP> \"57.12\"^^<xsd:decimal> ."
Full Interoperability
In the semantic web, datasets are often represented as collections of
triples: subject, predicate, and object. The
dataset_to_triples()
function enables this by converting
any dataset_df into a long-form representation where each row represents
a semantically annotated cell.
Unlike tidy datasets that require column-wise joins and reshape operations, RDF-based datasets eliminate structural joins by relying on identity, context, and concept URIs. Repeated values are normalized at the semantic level. This makes triple-based data more flexible for publishing, integration, and querying across domains.
This design choice affects how we implemented joins and bindings. The package avoids implementing column-wise joins or wide-format merging because semantically rich datasets can be recombined or queried directly via SPARQL or other RDF tooling. Instead, row-wise binding via bind_defined_rows() is supported, allowing users to append consistent datasets without losing semantics.
This reflects a deliberate philosophy: rather than duplicate tidyverse behaviours, dataset encourages upstream semantic modelling and downstream interoperability.
The dataset_to_triples() function exports a tidy dataset to RDF-style triplets:
triples <- dataset_to_triples(small_dataset)
head(triples)
Each row becomes a triple (subject, predicate, object), typed with XSD and optionally resolved via URIs. Export is supported through rdflib.
This example illustrates the core design goal of the
dataset
package: to make semantic metadata first-class
citizens of the R data workflow. By embedding units, concept URIs, and
provenance directly in data objects, the package supports not only
reproducible research but also semantically interoperable publication —
all without departing from familiar tidyverse idioms.
The dataset created in this example could be easily validated, documented, and exported as linked data using standard RDF tooling. This forms the basis for reproducible, standards-aligned workflows that extend beyond the analyst’s desktop — into repositories, triple stores, or domain-specific data services.
Yet, the applied example also reveals current limitations and areas
for growth in the dataset
package, which we now turn
to.
Export and Interoperability
The dataset
package is designed with FAIR principles in
mind, particularly the goal of enabling machine-actionable data
publishing. To support semantic web compatibility and downstream
interoperability, it provides functions that allow users to convert
annotated datasets into RDF-compatible formats.
The key function in this process is:
- dataset_to_triples(): Converts a dataset_df into a three-column long-form structure (subject, predicate, object), representing each cell as an RDF-style triple. These can be exported to tabular or text-based formats, or directly ingested by triple stores.

This structure aligns with the W3C's RDF and Data Cube vocabularies, where:
- The subject typically encodes an observation or observation unit
- The predicate is derived from a concept URI associated with the variable
- The object is the value, typed using XML Schema Definitions (e.g., xsd:integer, xsd:string)
These outputs are fully compatible with the rdflib package, which can serialize RDF datasets into:
- Turtle (.ttl)
- RDF/XML (.rdf)
- N-Triples (.nt)
- JSON-LD (.jsonld)
This enables dataset publication to:
- SPARQL endpoints
- FAIR data repositories
- Wikibase instances (via planned extensions)
- Semantic web catalogues
Triple-based export promotes structural normalization, eliminates redundancy, and facilitates data integration across domains and systems.
Limitations and Future Work
The dataset
package prioritizes ease of use and
integration with existing tidyverse workflows. It intentionally
implements a practical subset of features drawn from more formal
metadata and ontology systems used in statistical domains, such as SDMX,
DDI, and DCAT.
Some features have been deliberately left out to keep the package lightweight and analyst-friendly:
- No native support for data cube slicing (e.g., filtering all observations for a specific dimension level)
- No column-wise binding (e.g., bind_cols()) with semantic integrity checks
- No built-in validation against controlled vocabularies or semantic registries
- Limited UI or interactive support for defining or editing metadata
One key limitation is the lack of experience with `dataset` in large-scale, multi-institutional ingestion, exchange, or publication workflows. For example, it remains unclear whether column-wise binding is necessary in practice, given that many users will serialize data to RDF triples, where redundancy is automatically filtered out by triple stores.
Some internal components could be better developed as stand-alone packages:

- The `bibrecord()` S3 class and its constructor were created out of necessity, because the `utils::bibentry` class and `utils::person()` do not handle modern library and repository metadata well. Most of the work carried out on the `bibentry` class for the `dublincore()` and `datacite()` constructors could easily be adopted upstream in R's `utils` package, because it raises no backward compatibility problems.
- The `provenance()` function could safely be developed into a package of its own, because there are countless ways to improve the granularity of a dataset provenance description.
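The limitation of the base R classes can be seen in a short base R example. Note that `utils::person()` does accept an ORCID iD, but only as a free-form named entry in its `comment` argument, not as a first-class, typed identifier field; the name and iD below are of course made up.

```r
# Base R stores identifiers such as ORCID iDs only as free-form
# comment entries on a person record, which illustrates why richer
# repository metadata is hard to express with utils::person() and
# utils::bibentry().
p <- utils::person(
  given   = "Jane",
  family  = "Doe",
  role    = "aut",
  comment = c(ORCID = "0000-0001-2345-6789")
)

# The iD survives, but as an untyped named string rather than a
# machine-actionable identifier with its own semantics.
p$comment
```

A record class designed around Dublin Core or DataCite instead treats contributor identifiers, affiliations, and roles as distinct, typed fields.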
Several downstream features and companion packages are under development:

- `wbdataset`: export to the Wikibase data model for collaborative metadata curation
- Observation status flags: support for tagging cells as “estimated”, “provisional”, or “forecasted”
- Validation helpers: checks for missing concept URIs, unit mismatches, or inconsistent namespaces
- External thesauri support: integration with vocabularies such as EuroVoc or GEMET
- FAIR alignment: better coverage of 5-star and 8-star FAIR metadata criteria
We expect that tailored adaptations will be needed in specific domains, including environmental statistics, cultural heritage, and the social sciences, where existing metadata models often deviate from general-purpose ontologies.
The `dataset` package does not aim to replace enterprise-scale metadata infrastructure (e.g., SDMX registries), but rather to empower individual researchers and small teams to produce semantically valid, publication-ready datasets without high setup costs.