Data Resource is a simple format to describe a data resource such as an individual table or file, including its name, format, path, etc.
In this document we use the terms “package” for Data Package, “resource” for Data Resource, “dialect” for Table Dialect, and “schema” for Table Schema.
General implementation
Frictionless supports reading, manipulating and writing resources, but much of its functionality is limited to Tabular Data Resources.
Read
resources() lists all resources in a package:
library(frictionless)
package <- example_package()
# List the resources
resources(package)
#> [1] "deployments" "observations" "media"
read_resource() reads data from a tabular resource to a data frame:
read_resource(package, "deployments")
#> # A tibble: 3 × 5
#> deployment_id longitude latitude start comments
#> <chr> <dbl> <dbl> <date> <chr>
#> 1 1 4.62 50.8 2020-09-25 NA
#> 2 2 4.64 50.8 2020-10-01 "On \"forêt\" road."
#> 3 3 4.65 50.8 2020-10-05 "Malfunction/no photos, data"
Frictionless does not support reading data from non-tabular resources.
Manipulate
remove_resource() removes a resource (of any type):
remove_resource(package, "deployments")
#> A Data Package with 2 resources:
#> • observations
#> • media
#> Use `unclass()` to print the Data Package as a list.
# This and many other functions return "package", which you can update with
# package <- remove_resource(package, "deployments")
add_resource() adds or replaces a tabular resource. The provided data must be a data frame or a tabular data file (e.g. CSV):
# Add a resource with data from a data frame
add_resource(package, "iris", data = iris)
#> A Data Package with 4 resources:
#> • deployments
#> • observations
#> • media
#> • iris
#> Use `unclass()` to print the Data Package as a list.
# Replace a resource with one where data is stored in a tabular file
path <- system.file("extdata", "v1", "deployments.csv", package = "frictionless")
add_resource(package, "deployments", data = path, replace = TRUE)
#> A Data Package with 3 resources:
#> • deployments
#> • observations
#> • media
#> Use `unclass()` to print the Data Package as a list.
You can pipe most functions (see vignette("data-package")).
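As a minimal sketch of piping, the example below chains remove_resource() and add_resource() with the native pipe, which assumes R 4.1 or later. The object name updated is chosen for illustration only.

```r
library(frictionless)

# Start from the example package shipped with frictionless
package <- example_package()

# Chain two manipulations: each function takes and returns a package
updated <- package |>
  remove_resource("media") |>
  add_resource("iris", data = iris)

# List the resources of the resulting package
resources(updated)
```

Because every step returns a package, the result can be assigned once at the start of the chain instead of after each call.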
Write
write_package() writes a package to disk as a datapackage.json file. This file includes the metadata of all the resources. write_package() also writes resource data to CSV files, unless the data are referred to by URL or are inline. See the function documentation for details.
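The following sketch writes a small package to a temporary directory and lists the resulting files. The directory name write_example is an assumption made for this example, and the package is built from scratch with create_package() to keep it self-contained.

```r
library(frictionless)

# Build a small package with one resource from a data frame
package <- add_resource(create_package(), "iris", data = iris)

# Write it to a temporary directory (hypothetical location for this sketch)
directory <- file.path(tempdir(), "write_example")
dir.create(directory, showWarnings = FALSE, recursive = TRUE)
write_package(package, directory)

# The directory should now hold the descriptor and the resource data
list.files(directory)
```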
Properties implementation
name
name is required. It is used to identify a resource in read_resource(), add_resource() and remove_resource() (always as the second argument):
deployments <- read_resource(package, resource_name = "deployments")
add_resource() sets name to the provided resource_name:
add_resource(package, resource_name = "iris", data = iris)
#> A Data Package with 4 resources:
#> • deployments
#> • observations
#> • media
#> • iris
#> Use `unclass()` to print the Data Package as a list.
path
path or data (see further) is required. Providing both is not allowed.
path is for data in files (e.g. a CSV file). It can be a local path or URL. Supported protocols are http, https, ftp and sftp. Absolute paths (/) or relative parent paths (../) are not allowed to avoid security vulnerabilities.
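To make these rules concrete, here is a rough sketch of how a path could be classified. The helper classify_path() is hypothetical and written for this example only. It is not a frictionless function and does not claim to reproduce the package's internal checks.

```r
# Hypothetical helper for illustration, not part of frictionless.
# Expects a single path (character vector of length 1).
classify_path <- function(path) {
  if (grepl("^(http|https|ftp|sftp)://", path)) {
    "url"
  } else if (startsWith(path, "/")) {
    "absolute path (not allowed)"
  } else if (startsWith(path, "../")) {
    "relative parent path (not allowed)"
  } else {
    "local relative path"
  }
}

classify_path("https://example.com/deployments.csv")
classify_path("/etc/passwd")
classify_path("../secrets.csv")
classify_path("data/deployments.csv")
```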
When multiple paths are provided ("path": ["myfile1.csv", "myfile2.csv"]), the files are expected to have the same structure. read_resource() merges these into a single data frame in the order the paths are provided (using dplyr::bind_rows()):
# The "observations" resource has multiple files in path
package$resources[[2]]$path
#> [1] "observations_1.tsv" "observations_2.tsv"
# These are combined into a single data frame when reading
read_resource(package, "observations")
#> # A tibble: 8 × 7
#> observation_id deployment_id timestamp scientific_name count
#> <chr> <chr> <dttm> <chr> <dbl>
#> 1 1-1 1 2020-09-28 00:13:07 Capreolus capreolus 1
#> 2 1-2 1 2020-09-28 15:59:17 Capreolus capreolus 1
#> 3 1-3 1 2020-09-28 16:35:23 Lepus europaeus 1
#> 4 1-4 1 2020-09-28 17:04:04 Lepus europaeus 1
#> 5 1-5 1 2020-09-28 19:19:54 Sus scrofa 2
#> 6 2-1 2 2021-10-01 01:25:06 Sus scrofa 1
#> 7 2-2 2 2021-10-01 01:25:06 Sus scrofa 1
#> 8 2-3 2 2021-10-01 04:47:30 Sus scrofa 1
#> # ℹ 2 more variables: life_stage <fct>, comments <chr>
add_resource() sets path to the path(s) provided in data:
path <- system.file("extdata", "v1", "deployments.csv", package = "frictionless")
add_resource(package, "deployments", data = path, replace = TRUE)
#> A Data Package with 3 resources:
#> • deployments
#> • observations
#> • media
#> Use `unclass()` to print the Data Package as a list.
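The same applies when several files are provided at once. The sketch below assumes the two observation files shipped with frictionless and starts from an empty package created with create_package(). It uses basename() so the result does not depend on where the package is installed.

```r
library(frictionless)

# Locate the two tab-separated example files shipped with frictionless
path_1 <- system.file("extdata", "v1", "observations_1.tsv", package = "frictionless")
path_2 <- system.file("extdata", "v1", "observations_2.tsv", package = "frictionless")

# Add both files as a single resource
package <- add_resource(
  create_package(),
  "observations",
  data = c(path_1, path_2),
  delim = "\t"
)

# The path property holds both files, in the order provided
basename(package$resources[[1]]$path)
```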
data
Support for inline data is currently limited: JSON objects and strings are not supported, and schema, mediatype and format are ignored.
data is for inline data (included in the datapackage.json). read_resource() attempts to read data if it is provided as a JSON array:
# The "media" resource has inline data
str(package$resources[[3]]$data)
#> List of 3
#> $ :List of 5
#> ..$ media_id : chr "aed5fa71-3ed4-4284-a6ba-3550d1a4de8d"
#> ..$ deployment_id : chr "1"
#> ..$ observation_id: chr "1-1"
#> ..$ timestamp : chr "2020-09-28 02:14:59+02:00"
#> ..$ file_path : chr "https://multimedia.agouti.eu/assets/aed5fa71-3ed4-4284-a6ba-3550d1a4de8d/file"
#> $ :List of 5
#> ..$ media_id : chr "da81a501-8236-4cbd-aa95-4bc4b10a05df"
#> ..$ deployment_id : chr "1"
#> ..$ observation_id: chr "1-1"
#> ..$ timestamp : chr "2020-09-28 02:15:00+02:00"
#> ..$ file_path : chr "https://multimedia.agouti.eu/assets/da81a501-8236-4cbd-aa95-4bc4b10a05df/file"
#> $ :List of 5
#> ..$ media_id : chr "0ba57608-3cf1-49d6-a5a2-fe680851024d"
#> ..$ deployment_id : chr "1"
#> ..$ observation_id: chr "1-1"
#> ..$ timestamp : chr "2020-09-28 02:15:01+02:00"
#> ..$ file_path : chr "https://multimedia.agouti.eu/assets/0ba57608-3cf1-49d6-a5a2-fe680851024d/file"
read_resource(package, "media")
#> # A tibble: 3 × 5
#> media_id deployment_id observation_id timestamp file_path
#> <chr> <chr> <chr> <chr> <chr>
#> 1 aed5fa71-3ed4-4284-a6ba-3550… 1 1-1 2020-09-… https://…
#> 2 da81a501-8236-4cbd-aa95-4bc4… 1 1-1 2020-09-… https://…
#> 3 0ba57608-3cf1-49d6-a5a2-fe68… 1 1-1 2020-09-… https://…
add_resource() adds the provided data frame to data:
df <- data.frame("col_1" = c(1, 2), "col_2" = c("a", "b"))
package <- add_resource(package, "df", df)
package$resources[[4]]$data
#> col_1 col_2
#> 1 1 a
#> 2 2 b
write_package() writes that data frame to a CSV file, adds its path to path and removes data.
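One way to observe this is to write a package and read its descriptor back, as in the sketch below. The directory name inline_example is an assumption for this example.

```r
library(frictionless)

# Add a data frame as a resource: the data are stored in the data property
df <- data.frame(col_1 = c(1, 2), col_2 = c("a", "b"))
package <- add_resource(create_package(), "df", data = df)

# Write the package to a temporary directory (hypothetical location)
directory <- file.path(tempdir(), "inline_example")
dir.create(directory, showWarnings = FALSE, recursive = TRUE)
write_package(package, directory)

# Read the written descriptor back to inspect the resource
written <- read_package(file.path(directory, "datapackage.json"))
written$resources[[1]]$path
is.null(written$resources[[1]]$data)
```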
profile
profile is required to have the value "tabular-data-resource". add_resource() sets profile to that value.
schema
schema is required. It is used by read_resource() to parse data types and missing values. It can either be a JSON object or a path or URL referencing a JSON object. See vignette("table-schema") for details.
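As a small illustration of a schema as an object, the sketch below derives one from a data frame with create_schema() and passes it to add_resource(). The data frame and its column names are made up for this example.

```r
library(frictionless)

# A small data frame, made up for this example
df <- data.frame(id = c(1L, 2L), name = c("a", "b"))

# Derive a schema (a list that maps to a JSON object) from the data frame
schema <- create_schema(df)

# Inspect the field names and types in the schema
vapply(schema$fields, function(field) field$name, character(1))
vapply(schema$fields, function(field) field$type, character(1))

# Provide the schema explicitly when adding the resource
package <- add_resource(create_package(), "df", data = df, schema = schema)
```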
dialect
dialect is used by read_resource() to parse a tabular data file. It can either be a JSON object or a path or URL referencing a JSON object. See vignette("table-dialect") for details.
title
title is ignored by read_resource() and not set by add_resource(), unless provided:
add_resource(
package,
"iris",
iris,
title = "Edgar Anderson's Iris Data",
replace = TRUE
)
#> A Data Package with 4 resources:
#> • deployments
#> • observations
#> • media
#> • df
#> Use `unclass()` to print the Data Package as a list.
description
description is ignored by read_resource() and not set by add_resource() unless provided (cf. title).
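The sketch below shows how such optional properties end up in the resource when they are passed to add_resource(). The description text is written for this example.

```r
library(frictionless)

# Pass optional properties as extra named arguments
package <- add_resource(
  create_package(),
  "iris",
  data = iris,
  title = "Edgar Anderson's Iris Data",
  description = "Sepal and petal measurements for three iris species."
)

# The provided values are stored as properties of the resource
package$resources[[1]]$title
package$resources[[1]]$description
```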
format
format is ignored by read_resource(). add_resource() sets format when data are provided as a file, based on the provided delim:
delim | format |
---|---|
"," (default) | "csv" |
"\t" | "tsv" |
any other value | "csv" |
path <- system.file("extdata", "v1", "observations_1.tsv", package = "frictionless")
package <- add_resource(package, "observations", data = path, delim = "\t", replace = TRUE)
package$resources[[2]]$format
#> [1] "tsv"
add_resource() leaves format undefined when data are provided as a data frame. write_package() sets it to "csv" when writing to disk.
mediatype
mediatype is ignored by read_resource(). add_resource() sets mediatype when data are provided as a file, based on the provided delim:
delim | mediatype |
---|---|
"," (default) | "text/csv" |
"\t" | "text/tab-separated-values" |
any other value | "text/csv" |
path <- system.file("extdata", "v1", "observations_1.tsv", package = "frictionless")
package <- add_resource(package, "observations", data = path, delim = "\t", replace = TRUE)
package$resources[[2]]$mediatype
#> [1] "text/tab-separated-values"
add_resource() leaves mediatype undefined when data are provided as a data frame. write_package() sets it to "text/csv" when writing to disk.
encoding
encoding (e.g. "windows-1252") is used by read_resource() to parse the file. It defaults to UTF-8 if no encoding is provided or if it cannot be recognized. The returned data frame is always UTF-8.
add_resource() guesses the encoding (using readr::guess_encoding()) when data are provided as a file. It leaves the encoding undefined when data are provided as a data frame. write_package() sets it to "utf-8" when writing to disk.
path <- system.file("extdata", "v1", "deployments.csv", package = "frictionless")
package <- add_resource(package, "deployments", data = path, delim = ",", replace = TRUE)
package$resources[[1]]$encoding
#> [1] "UTF-8"
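For a resource added from a data frame, format, mediatype and encoding are only filled in on writing. The sketch below checks this by writing a package and reading the descriptor back. The directory name properties_example is an assumption for this example.

```r
library(frictionless)

# Add a data frame: format is left undefined at this point
package <- add_resource(create_package(), "iris", data = iris)
is.null(package$resources[[1]]$format)

# Write the package to a temporary directory (hypothetical location)
directory <- file.path(tempdir(), "properties_example")
dir.create(directory, showWarnings = FALSE, recursive = TRUE)
write_package(package, directory)

# Read the written descriptor back and inspect the three properties
written <- read_package(file.path(directory, "datapackage.json"))
written$resources[[1]]$format
written$resources[[1]]$mediatype
written$resources[[1]]$encoding
```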
bytes
bytes is ignored by read_resource() and not set by add_resource() unless provided (cf. title).
hash
hash is ignored by read_resource() and not set by add_resource() unless provided (cf. title).
sources
sources is ignored by read_resource() and not set by add_resource() unless provided (cf. title).
licenses
licenses is ignored by read_resource() and not set by add_resource() unless provided (cf. title).
compression
compression (a recipe) is ignored by read_resource() and not set by add_resource(). Compression is derived from the provided path instead.
If the path ends in .gz, .bz2, .xz, or .zip, the files are automatically decompressed by read_resource() (using default readr::read_delim() functionality). Only .gz files can be read directly from URL paths.
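A sketch of reading a compressed local file is given below. It assumes the compress argument of write_package() to produce a gzip-compressed CSV file, and the directory name compression_example is made up for this example.

```r
library(frictionless)

# Write a package with gzip-compressed resource data
package <- add_resource(create_package(), "iris", data = iris)
directory <- file.path(tempdir(), "compression_example")
dir.create(directory, showWarnings = FALSE, recursive = TRUE)
write_package(package, directory, compress = TRUE)

# The path of the written resource carries the compression extension
written <- read_package(file.path(directory, "datapackage.json"))
written$resources[[1]]$path

# read_resource() decompresses the file based on that extension
iris_read <- read_resource(written, "iris")
nrow(iris_read)
```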