The Data Lake and Schema On Read
Carl Boettiger
2020-01-09
Source:vignettes/articles/data-lake.Rmd
data-lake.Rmd
A provocative recent analogy (e.g. Archer, 2017) for thinking about RDF is that of the Data Lake. Whereas adding new data to a traditional relational database frequently involves laborious wrangling into the an existing rigid data schema of tables and columns, RDF can enable a far more simple approach of just ‘tossing everything into the data lake.’ While this may sound like a recipe for a terrible mess down the road, RDF gives the promise of schema-on-read: rather than dictating the shape of your data when first adding it to the database (i.e. schema-on-write), a SPARQL query auto-magically reaches into the lake and retrieves all the relevant data, returning it in exactly the schema you ask for; that is, as a nice single table containing only the requested columns.
This can serve as a very effective means of data integration (provided a reasonably consistent and diligent use of URIs in identifying subjects and properties (predicates)), since just about any data can be added to the lake without worrying about whether it comes in a schema that matches the existing architecture of the database. It is this flexibility not to have to define your database schema at the start that is the primary strength of the RDF approach.
A key advantage of this approach is that it is equally as easy to
extract your desired data.frame from non-rectangular data as it is from
tables or relational database systems. While the previous vignette
focused on simple examples including data from single
data.frames
or small JSON files, this vignette showcases
the Data Lake approach on a more complex relational database organized
over several tables (nyflights13
data, used to teach
joins
and other relational data in Grolemund & Wickham, 2017), and
also in a large non-tabular file returned from the GitHub API (e.g. as
used in Bryan
& Wickham, 2017 to teach data rectangling.)
Load the necessary libraries to get started:
## Data
library(nycflights13)
library(repurrrsive)
## Error in library(repurrrsive): there is no package called 'repurrrsive'
## for comparison approaches
library(dplyr)
library(purrr)
library(jsonlite)
## Our focal package:
library(rdflib)
## experimental functions for rdflib package
source(system.file("examples/tidy_schema.R", package="rdflib"))
Configure RDF storage to use the BDB backend for on-disk storage.
rdf <- rdf(storage = "BDB", new_db = TRUE)
Relational data
The tidyverse
approach
tidyverse
operations are incredibly effective for
working with relational data. These dplyr
on the
nyflights13
dataset are easy to write and fast to
execute:
df <- flights %>%
left_join(airlines) %>%
left_join(planes, by="tailnum") %>%
select(carrier, name, manufacturer, model) %>%
distinct()
head(df)
## # A tibble: 6 x 4
## carrier name manufacturer model
## <chr> <chr> <chr> <chr>
## 1 UA United Air Lines Inc. BOEING 737-824
## 2 AA American Airlines Inc. BOEING 757-223
## 3 B6 JetBlue Airways AIRBUS A320-232
## 4 DL Delta Air Lines Inc. BOEING 757-232
## 5 UA United Air Lines Inc. BOEING 737-924ER
## 6 B6 JetBlue Airways AIRBUS INDUSTRIE A320-232
Still, joins are often a challenge in data preparation. Tabular
formats can often be sloppy about what is a key column and what is a
literal value, and also whether a column with the same name in different
tables means the same thing in both. Both of these things pose
challenges for later use when joining data. RDF representation
encourages greater discipline through the use of URIs (though we’ll run
a bit roughshod over that with the caviler use of x:
here.)
This example uses only data that is already part of the relational database. Adding additional information to the database is frequently more tricky, and can result in a rapidly expanding number of tables that can become difficult to work across.
RDF approach
Okay, now let’s dump the nyflights13
into the data lake.
Foreign keys in any table must be represented as URIs and not literal
strings:
uri_flights <- flights %>%
mutate(tailnum = paste0("planes:", tailnum),
carrier = paste0("airlines:", carrier))
We write the data.frame
s out as nquads. Recall that each
cell of a data.frame
can be represented as a triple, in
which the column is the predicate, the primary key (or row number) the
subject, and the cell value the object. We turn column names and primary
keys into URIs using a prefix based on the table name. (Note that
rdflib
does this conversion by merely munging cells and
calling write.table
, it is not a standard
redland
library transform).
system.time({
write_nquads(airlines, "airlines.nq", key = "carrier", prefix = "airlines:")
write_nquads(planes, "planes.nq", key = "tailnum", prefix = "planes:")
write_nquads(uri_flights, "flights.nq", prefix = "flights:")
})
## user system elapsed
## 16.807 2.561 19.128
We can now read these into our RDF data lake:
system.time({
read_nquads("airlines.nq", rdf = rdf)
read_nquads("flights.nq", rdf = rdf)
read_nquads("planes.nq", rdf = rdf)
})
## user system elapsed
## 79.267 51.865 70.052
Note that flights does not have a natural key (somewhat surprisingly,
flight
number is not a unique key for this table, as the
same flight number is reused on the same route at different times.) So,
we will treat each row as a unique anonymous key by setting the key to
NULL
.
Schema on read
We simply define the columns we want and we immediately get back the
desired data.frame
:
s <-
'SELECT ?carrier ?name ?manufacturer ?model ?dep_delay
WHERE {
?flight <flights:tailnum> ?tailnum .
?flight <flights:carrier> ?carrier .
?flight <flights:dep_delay> ?dep_delay .
?tailnum <planes:manufacturer> ?manufacturer .
?tailnum <planes:model> ?model .
?carrier <airlines:name> ?name
}'
system.time(
df <- rdf_query(rdf, s)
)
## user system elapsed
## 16.078 1.164 13.734
head(df)
## # A tibble: 6 x 5
## carrier name manufacturer model dep_delay
## <chr> <chr> <chr> <chr> <chr>
## 1 airlines:UA United Air Lines Inc. BOEING 737-824 2
## 2 airlines:UA United Air Lines Inc. BOEING 737-824 4
## 3 airlines:AA American Airlines Inc. BOEING 757-223 2
## 4 airlines:B6 JetBlue Airways AIRBUS A320-232 -1
## 5 airlines:DL Delta Air Lines Inc. BOEING 757-232 -6
## 6 airlines:UA United Air Lines Inc. BOEING 737-924ER -4
Note that in place of joins, we give more semantically meaningful
statements about the data: e.g. manufacturer
is a property
of a tailnum
(corresponding to a particular physical
aircraft), not of a flight
number. Departure delay
dep_delay
is a property of a flight, not of an aircraft
(tailnum
).
This is reminiscent of the way in which these data are organized in
the relational database tables to begin with: we find
deb_delay
in the flights
table and
manufacturer
in the planes
table. Good
relational design encourages this, but to work with the data the user is
often left having to do the required joins, which also creates tables
where these semantics are less clear.
Non-tabular data
The tidyverse
approach
We start with data from Bryan & Wickham, 2017 lesson on data rectangling:
f <- system.file("extdata/gh_repos.json", package="repurrrsive")
gh_data <- jsonlite::read_json(f)
## Error in readBin(structure(4L, class = c("file", "connection"), conn_id = <pointer: 0x263>), : can only read from a binary connection
The original lesson illustrates the power and reasonably concise
syntax of the tidyverse
package, purrr
, to
iterate over this complex structure to extract the necessary data. In
this approach, nesting of the data is largely a nuisance to overcome
rather than an asset to the data analyst:
gh_flat <- gh_data %>% purrr::flatten() # abandon nested structure and hope we didn't need it
## Error in eval(lhs, parent, parent): object 'gh_data' not found
gh_tibble <- tibble(
name = gh_flat %>% map_chr("name"),
issues = gh_flat %>% map_int("open_issues_count"),
wiki = gh_flat %>% map_lgl("has_wiki"),
homepage = gh_flat %>% map_chr("homepage", .default = ""),
owner = gh_flat %>% map_chr(c("owner", "login"))
)
## Error in eval(lhs, parent, parent): object 'gh_flat' not found
## Error in eval(lhs, parent, parent): object 'gh_tibble' not found
RDF on non-tabular data
The RDF approach merely treats JSON as JSON-LD within a given vocabulary. In this context, nesting implicitly provides important semantic information about relationships between the data, which are captured in the RDF triples. Here, we import the JSON data as RDF (and add it to our existing triplestore just for fun)
gh_rdf <- as_rdf(gh_data, rdf = rdf, prefix = "gh:")
## Error in as_rdf(gh_data, rdf = rdf, prefix = "gh:"): object 'gh_data' not found
And we can query it back out of the lake just by selecting the columns of interest.
s <-
'SELECT ?name ?issues ?wiki ?homepage ?owner
WHERE {
?repo <gh:homepage> ?homepage .
?repo <gh:has_wiki> ?wiki .
?repo <gh:open_issues_count> ?issues .
?repo <gh:name> ?name .
?repo <gh:owner> ?owner_id .
?owner_id <gh:login> ?owner
}'
system.time(
rdf_tibble <- rdf_query(rdf, s)
)
## user system elapsed
## 7.992 0.287 8.279
head(rdf_tibble)
## # A tibble: 0 x 5
## # … with 5 variables: name <chr>, issues <chr>, wiki <chr>, homepage <chr>, owner <chr>
Going further: A dplyr
-style syntax for SPARQL?
dplyr
provides a reasonably intuitive, powerful, and
concise interface for many common SQL commands. Indeed,
dplyr
function calls are literally serialized to the
corresponding SQL query when working with a relational database back-end
(via dbplyr
). Can a similar API be developed for SPARQL
queries?
SPARQL syntax is obviously inspired by SQL syntax, and includes many
of the same operations found in SQL and dplyr
(e.g. SELECT,
WHERE, FILTER, DISTINCT) as well as other more RDF specific queries.
If we are willing to make some assumptions about the most common queries, we can start to make some simplifying functions. For instance, in the above patterns, the variables being returned are always objects from the triples, with columns being named using the corresponding predicate. Assuming some additional convention to define the prefixes and indicate graph traversal (i.e. nested values), we could have constructed the above query from a call:
Though to generalize to arbitrary labels and predicates this might need to support something like: