Validation in JSON-LD
Validating and consuming in JSON-LD in software development
Carl Boettiger
2024-11-27
Source:vignettes/articles/validation-in-json-ld.Rmd
validation-in-json-ld.Rmd
Introduction
Schema validation is a useful and important concept to the distribution of metadata in formats such as XML and JSON, in which the standard-provider creates a schema (specified in an XML-schema, XSD, for XML documents, or json-schema for JSON documents). Schemas allow us to go beyond the basic notation of making sure a file is simply valid XML or valid JSON, a requirement just to be read in by any parser. By detailing how the metadata must be structured, what elements must, can, and may not be included, and what data types may be used for those elements, schema help developers consuming the data to anticipate these details and thus build applications which know how to process them. For the data creator, validation is a convenient way to catch data input errors and ensure a consistent data structure.
Because schema validation must ensure predictable behavior without knowledge of what any specific application is going to do with the data, it tends to be very strict. A simple application may not care if certain fields are missing or if integers are mistaken for characters, while to another application these differences could lead it to throw fatal errors.
The approach of JSON-LD is less prescriptive. JSON-LD uses the notion of “framing” to let each application specify how it expects it data to be structured. JSON frames allow each developer consuming the data to handle many of the same issues that schema validation have previously assured. Readers should consult the official json-ld framing documentation for details on this approach.
A motivating example:
Consider the following codemeta document:
codemeta <-
'
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"name": "codemetar: Generate CodeMeta Metadata for R Packages",
"datePublished":" 2017-05-20",
"version": 1.0,
"author": [
{
"@type": "Person",
"givenName": "Carl",
"familyName": "Boettiger",
"email": "cboettig@gmail.com",
"@id": "http://orcid.org/0000-0002-1642-628X"
}],
"maintainer": {"@id": "http://orcid.org/0000-0002-1642-628X"}
}
'
jsonld::jsonld_compact(codemeta, "https://doi.org/10.5063/schema/codemeta-2.0")
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"type": "SoftwareSourceCode",
"author": {
"id": "http://orcid.org/0000-0002-1642-628X",
"type": "Person",
"email": "cboettig@gmail.com",
"familyName": "Boettiger",
"givenName": "Carl"
},
"datePublished": " 2017-05-20",
"maintainer": {
"id": "http://orcid.org/0000-0002-1642-628X"
},
"name": "codemetar: Generate CodeMeta Metadata for R Packages",
"version": 1
}
Framing: subsetting data
By default, frames return all the input data, while our application
may only be interested in some subset. Often it is sufficient just to
ignore these additional terms: in the example above it’s just as easy
for our application to work with author elements whether or not we have
dropped other elements such as meta$version
. To restrict a
frame to returning only the nodes we explicitly mention, we can use the
keyword @explicit
:
frame <- '{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@explicit": "true",
"@type": "Person",
"givenName": {},
"familyName": {}
}'
meta <-
jsonld_frame(codemeta, frame) %>%
fromJSON(codemeta, simplifyVector = FALSE) %>%
getElement("@graph")
meta[[1]]
$id
[1] "http://orcid.org/0000-0002-1642-628X"
$type
[1] "Person"
$familyName
[1] "Boettiger"
$givenName
[1] "Carl"
Note that this has only returned the requested fields in the graph
(along with the @id
and @type
, which are
always included if provided, since they may be required to interpret the
data properly). This frame extracts the givenName
and
familyName
of any Person
node it finds,
regardless of where it occurs, while omitting the rest of the data. Note
that since the frame requests these elements at the top level, they are
returned as such, with each match a separate entry in the
@graph
. Our example has only one person in
meta[[1]]
, had we more matches they would appear in
meta[[2]]
, etc. Note these returns are un-ordered.
Framing: expanding node references
The same underlying data can often be expressed in different ways,
particularly when dealing with nested data. Framing can be of great help
here to reshape the data into the structure required by the application.
For instance, it would be natural to access the email
of
the maintainer
in the same manner we did the author, but
this fails for our example as maintainer
is defined only by
reference to an ID:
meta <- fromJSON(codemeta, simplifyVector = FALSE)
paste("For complaints, email", meta$maintainer$email)
[1] "For complaints, email "
We can confirm that maintainer
is just an ID:
meta$maintainer
$`@id`
[1] "http://orcid.org/0000-0002-1642-628X"
We can use a frame with the special directive
"@embed": "@always"
to say that we want the full maintainer
information embedded an not just referred to by id alone. Then we can
subset maintainer
just like we do author.
frame <- '{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@embed": "@always"
}'
meta <-
jsonld_frame(codemeta, frame) %>%
fromJSON(codemeta, simplifyVector = FALSE) %>%
getElement("@graph") %>% getElement(1)
Now we can do
paste("For complaints, email", meta$maintainer$email)
[1] "For complaints, email cboettig@gmail.com"
and see that email
has been successfully returned from
the matching ID under author data.
Handling unexpected types
JSON-LD routines will simply refuse to compact data if the type
differs from what the context expects. Here is a sample data file that
declares that the buildInstructions
are included as text,
which differs from the context
file which explicitly states
that buildInstructions
should be a URL:
codemeta <-
'
{
"@context": "https://purl.org/codemeta/2.0",
"@type": "SoftwareSourceCode",
"name": "codemetar: Generate CodeMeta Metadata for R Packages",
"buildInstructions": {
"@value": "Just install this package using devtools::install_github",
"@type": "Text"
}
}
'
When we perform a framing or compaction operation,
buildInstructions
gets de-referenced to
codemeta:buildInstructions
, because it does not match the
context. This means that if our application asks for
meta$buildInstructions
:
meta <-
jsonld_frame(codemeta, '{"@context": "https://purl.org/codemeta/2.0"}') %>%
fromJSON(codemeta, simplifyVector = FALSE) %>%
getElement("@graph") %>% getElement(1)
## above is same as compacting:
#jsonld_compact(codemeta, "https://doi.org/10.5063/schema/codemeta-2.0") %>%
# fromJSON(codemeta, simplifyVector = FALSE)
meta$buildInstructions
NULL
We just get NULL
, rather than some unexpected type of
object (e.g. a string that is not a URL.) Note that the data is not
lost, but simply not dereferenced:
names(meta)
[1] "type" "name"
[3] "codemeta:buildInstructions"
meta["codemeta:buildInstructions"]
$`codemeta:buildInstructions`
$`codemeta:buildInstructions`$type
[1] "Text"
$`codemeta:buildInstructions`$`@value`
[1] "Just install this package using devtools::install_github"
Note that this behavior only happens because the data declared the
"@type": "Text"
explicitly. JSON-LD algorithms only believe
what they are told about type and only look for consistency in declared
types. If you give text but declare it as a "@type": "URL"
,
or don’t declare the type at all, JSON-LD algorithms won’t know anything
is amiss and the property will be compacted as usual.