Table Schema is a simple format to describe tabular data, including field names, types, constraints, missing values, foreign keys, etc.
In this document we use the terms “package” for Data Package, “resource” for Data Resource, “dialect” for Table Dialect, and “schema” for Table Schema.
General implementation
Frictionless supports schema$fields
and
schema$missingValues
to parse data types and missing values
when reading Tabular
Data Resources. Schema manipulation is limited to extracting a
schema from a resource, creating one from a data frame, and providing
one back to a resource. Schema metadata is including when writing a
package.
Read
get_schema()
extracts the schema from a resource:
library(frictionless)
package <- example_package()
# Get the Table Schema for the resource "observations"
schema <- get_schema(package, "observations")
str(schema)
#> List of 4
#> $ fields :List of 7
#> ..$ :List of 3
#> .. ..$ name : chr "observation_id"
#> .. ..$ type : chr "string"
#> .. ..$ constraints:List of 2
#> .. .. ..$ required: logi TRUE
#> .. .. ..$ unique : logi TRUE
#> ..$ :List of 3
#> .. ..$ name : chr "deployment_id"
#> .. ..$ type : chr "string"
#> .. ..$ constraints:List of 1
#> .. .. ..$ required: logi TRUE
#> ..$ :List of 4
#> .. ..$ name : chr "timestamp"
#> .. ..$ type : chr "datetime"
#> .. ..$ format : chr "%Y-%m-%dT%H:%M:%S%z"
#> .. ..$ constraints:List of 1
#> .. .. ..$ required: logi TRUE
#> ..$ :List of 3
#> .. ..$ name : chr "scientific_name"
#> .. ..$ type : chr "string"
#> .. ..$ constraints:List of 1
#> .. .. ..$ required: logi FALSE
#> ..$ :List of 3
#> .. ..$ name : chr "count"
#> .. ..$ type : chr "integer"
#> .. ..$ constraints:List of 2
#> .. .. ..$ required: logi FALSE
#> .. .. ..$ minimum : int 1
#> ..$ :List of 3
#> .. ..$ name : chr "life_stage"
#> .. ..$ type : chr "string"
#> .. ..$ constraints:List of 2
#> .. .. ..$ required: logi FALSE
#> .. .. ..$ enum : chr [1:5] "adult" "subadult" "juvenile" "offspring" ...
#> ..$ :List of 3
#> .. ..$ name : chr "comments"
#> .. ..$ type : chr "string"
#> .. ..$ constraints:List of 1
#> .. .. ..$ required: logi FALSE
#> $ missingValues: chr [1:3] "" "NA" "NaN"
#> $ primaryKey : chr "observation_id"
#> $ foreignKeys :List of 1
#> ..$ :List of 2
#> .. ..$ fields : chr "deployment_id"
#> .. ..$ reference:List of 2
#> .. .. ..$ resource: chr "deployments"
#> .. .. ..$ fields : chr "deployment_id"
read_resource()
uses schema$fields
to parse
the names and data types of the columns in a tabular data file. For
example, the third field in the schema (timestamp
) is
defined as a datetime type
with a specific
format
:
str(schema$fields[[3]])
#> List of 4
#> $ name : chr "timestamp"
#> $ type : chr "datetime"
#> $ format : chr "%Y-%m-%dT%H:%M:%S%z"
#> $ constraints:List of 1
#> ..$ required: logi TRUE
read_resource()
uses that information to correctly parse
the data type and to assign the name timestamp
to the
column:
observations <- read_resource(package, "observations")
observations$timestamp
#> [1] "2020-09-28 00:13:07 UTC" "2020-09-28 15:59:17 UTC"
#> [3] "2020-09-28 16:35:23 UTC" "2020-09-28 17:04:04 UTC"
#> [5] "2020-09-28 19:19:54 UTC" "2021-10-01 01:25:06 UTC"
#> [7] "2021-10-01 01:25:06 UTC" "2021-10-01 04:47:30 UTC"
The sixth field life_stage
has an enum
defined as one of its constraints
:
str(schema$fields[[6]])
#> List of 3
#> $ name : chr "life_stage"
#> $ type : chr "string"
#> $ constraints:List of 2
#> ..$ required: logi FALSE
#> ..$ enum : chr [1:5] "adult" "subadult" "juvenile" "offspring" ...
read_resource()
uses that information to parse the
column as a factor, using enum
as the factor levels:
Manipulate
A schema is a list which you can manipulate, but frictionless does
not provide functions to do that. Use purrr or base R
instead (see vignette("frictionless")
). You do not have to
start a schema from scratch though: use get_schema()
(see
above) or create_schema()
instead.
create_schema()
creates a schema from a data frame and
defines the name
, type
(and if a factor
constraints$enum
) for each field:
# Create a schema from the built-in dataset "iris"
iris_schema <- create_schema(iris)
str(iris_schema)
#> List of 1
#> $ fields:List of 5
#> ..$ :List of 2
#> .. ..$ name: chr "Sepal.Length"
#> .. ..$ type: chr "number"
#> ..$ :List of 2
#> .. ..$ name: chr "Sepal.Width"
#> .. ..$ type: chr "number"
#> ..$ :List of 2
#> .. ..$ name: chr "Petal.Length"
#> .. ..$ type: chr "number"
#> ..$ :List of 2
#> .. ..$ name: chr "Petal.Width"
#> .. ..$ type: chr "number"
#> ..$ :List of 3
#> .. ..$ name : chr "Species"
#> .. ..$ type : chr "string"
#> .. ..$ constraints:List of 1
#> .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"
add_resource()
allows to include the schema with a
resource. If no schema is provided, one is created with
create_schema()
:
package <- add_resource(
package,
resource_name = "iris",
data = iris,
schema = iris_schema
)
Write
write_package()
writes a package to disk as a
datapackage.json
file. This file includes the metadata of
all the resources, including the schema. To directly write a schema to
disk, use jsonlite::write_json()
.
Schema properties implementation
fields
fields
is required. It is used by read_resource()
to parse the
names and data types of the columns in a tabular data file.
create_schema()
sets fields
based on
information in a data frame. See Field properties
implementation for details.
missingValues
missingValues
is used by read_resource()
and defaults to ""
.
It is passed to na
in readr::read_delim()
.
create_schema()
does not set missingValues
.
write_package()
converts NA
values to
""
when writing a data frame to a CSV file. Since this is
the default, no missingValues
property is set.
primaryKey
primaryKey
is ignored by read_resource()
and not set by
create_schema()
.
foreignKeys
foreignKeys
is ignored by read_resource()
and not set by
create_schema()
.
Field properties implementation
name
name
is used by read_resource()
to assign a column name. The
vector of names is passed as col_names
to
readr::read_delim()
, ignoring names provided in the header
of the data file. create_schema()
uses the data frame
column name to set name
.
type and format
type
and (for some types) format
is used by read_resource()
to understand the column type.
The vector of types is passed as col_types
to
readr::read_delim()
, which warns if there are parsing
issues (inspect with problems()
).
create_schema()
uses the data frame column type to set
type
. See Field types
implementation for details.
read_resource()
interprets type
as
follows:
field type | column type |
---|---|
string |
character or factor
|
number |
double or factor
|
integer |
double or factor
|
boolean |
logical |
object |
character |
array |
character |
datetime |
POSIXct |
date |
Date |
time |
hms::hms() |
year |
Date |
yearmonth |
Date |
duration |
character |
geopoint |
character |
geojson |
character |
any |
character |
other value | error |
undefined | guessed |
create_schema()
sets type
as follows:
column type | field type |
---|---|
character |
string |
Date |
date |
difftime |
number |
factor |
string with factor levels as
constraints$enum
|
hms::hms() |
time |
integer |
integer |
logical |
boolean |
numeric |
number |
POSIXct /POSIXlt
|
datetime |
any other type | any |
create_schema()
does not set a format
,
since defaults are used for all types. This is also the case for
datetimes, dates and times, since readr::write_csv()
used
by write_package()
formats those to ISO8601, which is
considered the default.
title
title
is ignored by read_resource()
and not set by
create_schema()
.
description
description
is ignored by read_resource()
and not set by
create_schema()
.
example
example
is ignored by read_resource()
and not set by
create_schema()
.
constraints
constraints
is ignored by read_resource()
and not set by
create_schema()
, except for constraints$enum
.
read_resource()
uses it set the column type to
factor
, with enum
values as factor levels.
create_schema()
does the reverse.
rdfType
rdfType
is ignored by read_resource()
and not set by
create_schema()
.
Field types implementation
string
string
is interpreted as character
. Or factor
when
constraints$enum
is defined.
-
format
is ignored.
number
number
is interpreted as double
. Or factor
when
constraints$enum
is defined.
-
bareNumber
is supported. Iffalse
, whitespace and non-numeric characters are ignored. -
decimalChar
(.
by default) is supported, but as a single value for all number fields. If different values are defined, the most occurring one is selected. -
groupChar
(undefined by default) is supported, but as a single value for all number fields. If different values are defined, the most occurring one is selected.
integer
integer
is interpreted as double
(to avoid issues with big
numbers). Or factor
when constraints$enum
is
defined.
-
bareNumber
is supported. Iffalse
, whitespace and non-numeric characters are ignored.
boolean
boolean
is interpreted as logical
.
-
trueValues
that are not defaults are not supported. -
falseValues
that are not defaults are not supported.
object
object
is interpreted as character
array
array
is interpreted as character
.
datetime
datetime
is interpreted as POSIXct
.
-
format
is supported for the valuesdefault
(ISO datetime),any
(ISO datetime) and the same patterns as fordate
andtime
. The value%c
is not supported.
date
date
is interpreted as Date
.
-
format
is supported for the valuesdefault
(ISO date),any
(guessymd
) and Python/C strptime patterns, such as%a, %d %B %Y
forSat, 23 November 2013
.%x
is interpreted as%m/%d/%y
. The values%j
,%U
,%w
and%W
are not supported.
time
time
is interpreted as hms::hms()
.
-
format
is supported for the valuesdefault
(ISO time),any
(guesshms
) and Python/C strptime patterns, such as%I%p%M:%S.%f%z
for8AM30:00.300+0200
.
year
year
is interpreted as Date
with month and day set to
01
.
yearmonth
yearmonth
is interpreted as Date
with day set to 01
.
duration
duration
is interpreted as character
. You can parse these values
with lubridate::duration()
.
geopoint
geopoint
is interpreted as character
.
geojson
geojson
is interpreted as character
.
any
any
is interpreted as character