Evaluate and Upload Data

The EDI data repository has a “staging” environment for testing the upload and rendering of new data packages before publishing to “production”. The two environments are functionally equivalent, but their contents are independent. For example, a data package identifier reserved by a user in “staging” will not work in “production”, and vice versa.

library(EDIutils)

Evaluation and upload to the repository require that data entities be described with EML metadata. There are many tools for creating EML. EDI supports two: the EMLassemblyline R package for programmatic workflows, and the ezEML web form wizard. Research groups managing large volumes of metadata may want to consider the LTER core-metabase.

Authenticate

Authentication is required by functions involving data evaluation and upload, audit report access, event notifications, and other account-based features. Request an account from support@edirepository.org. There are three options for authenticating:

# Interactively at the console
login()
#> User name: "my_name"
#> User password: "my_secret"

# Programmatically with function arguments
login(userId = "my_name", userPass = "my_secret")

# Programmatically with a file containing userId and userPass arguments
login(config = paste0(tempdir(), "/config.txt"))

The login() function exchanges credentials for a temporary (~10 hour) authentication token, which is written to the EDI_TOKEN environment variable referenced by EDIutils functions requiring authentication.
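Because authenticated functions read the token from EDI_TOKEN, a script can check that the variable is set before starting a long workflow. The following is a minimal base R sketch. has_edi_token() is a hypothetical helper, not part of EDIutils, and it only checks that a token is present, not that it is still valid.

```r
# Hypothetical helper (not part of EDIutils): TRUE if EDI_TOKEN is non-empty.
# It cannot tell whether the token has expired.
has_edi_token <- function() {
  nzchar(Sys.getenv("EDI_TOKEN"))
}

if (!has_edi_token()) {
  message("No EDI_TOKEN found; call login() before evaluating or uploading.")
}
```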

Reserve a Data Package ID

Data package reservations prevent conflicting use of the same identifier.

# Create reservation
identifier <- create_reservation(scope = "edi", env = "staging")
identifier
#> [1] 595
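The reserved identifier is combined with a scope and a revision number to form the packageId used in the EML file names throughout this article (e.g. “edi.595.1”). A small sketch of that step follows. make_package_id() is a hypothetical helper, not part of EDIutils, and the identifier is hard-coded here so the example runs without a repository connection.

```r
# Hypothetical helper (not part of EDIutils): build "scope.identifier.revision".
# as.integer() avoids scientific notation for large identifiers.
make_package_id <- function(scope, identifier, revision = 1) {
  paste(scope, as.integer(identifier), as.integer(revision), sep = ".")
}

identifier <- 595  # value returned by create_reservation() in the example above
package_id <- make_package_id("edi", identifier)
package_id
#> [1] "edi.595.1"

# Matching EML file name, following the naming used in this article
eml_path <- file.path(tempdir(), paste0(package_id, ".xml"))
```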

Evaluate

Evaluation checks for metadata accuracy and completeness.


# Evaluate data package
transaction <- evaluate_data_package(
 eml = paste0(tempdir(), "/edi.595.1.xml"), 
 env = "staging")
transaction
#> [1] "evaluate_163966785813042760"

# Check status
status <- check_status_evaluate(transaction, env = "staging")
status
#> [1] TRUE

Interpreting the Evaluation Report

Report Summary

The evaluation report summary provides a quick look at the data package evaluation results. It lists the total number of checks run and how many of them resulted in a status of “valid”, “info”, “warn”, or “error”.

# Summarize report
read_evaluate_report_summary(transaction, env = "staging")
#> ===================================================
#>   EVALUATION REPORT
#> ===================================================
#>
#> PackageId: edi.595.1
#> Report Date/Time: 2021-12-16T22:49:25
#> Total Quality Checks: 29
#> Valid: 21
#> Info: 8
#> Warn: 0
#> Error: 0

The meaning of these status messages:

Valid - The result of the quality check matches the expectation.
Info - The result of the quality check may or may not match the expectation, but since the expectation is not required, information is returned instead of a Warn or Error.
Warn - The result of the quality check does not match the expectation. A match is not explicitly required to publish the data package, but strongly recommended.
Error - The result of the quality check does not match the expectation. A match is required before the data package can be published.

Any evaluation check that results in a warning or error status should be resolved before moving ahead (note that errors must be corrected). Resolve problems with the data and metadata and repeat the evaluation process until all errors (and preferably all warnings) are resolved.
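In a scripted workflow, the “resolve before moving ahead” rule can be turned into a simple gate by reading the counts out of the report text. The sketch below assumes the text is laid out like the summary shown above (one “Label: count” pair per line). count_statuses() is a hypothetical helper, not part of EDIutils, and the sample text is copied from the summary above rather than fetched from the repository.

```r
# Hypothetical helper (not part of EDIutils): pull the status counts out of
# report text laid out like the summary above.
count_statuses <- function(report_text) {
  lines <- trimws(unlist(strsplit(report_text, "\n", fixed = TRUE)))
  labels <- c("Valid", "Info", "Warn", "Error")
  vapply(labels, function(label) {
    hit <- grep(paste0("^", label, ": *[0-9]+$"), lines, value = TRUE)
    if (length(hit) == 0) NA_integer_ else as.integer(sub("^.*: *", "", hit[1]))
  }, integer(1))
}

# Sample text copied from the summary above
sample_summary <- paste(
  "PackageId: edi.595.1",
  "Total Quality Checks: 29",
  "Valid: 21",
  "Info: 8",
  "Warn: 0",
  "Error: 0",
  sep = "\n")

counts <- count_statuses(sample_summary)

# Proceed only when there are no errors and no warnings
ready <- counts[["Error"]] == 0 && counts[["Warn"]] == 0
ready
#> [1] TRUE
```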

Full Report

The full evaluation report provides detailed information on each check and some diagnostics to help resolve issues. The full report can be printed to the console, or written to file as plain text or as HTML to be viewed in a web browser (recommended). See ?read_evaluate_report for details.

# Read the evaluation report
report <- read_evaluate_report(transaction, as = "char", env = "staging")
message(report)
#> ===================================================
#>   EVALUATION REPORT
#> ===================================================
#>   
#> PackageId: edi.595.1
#> Report Date/Time: 2021-12-16T08:17:40
#> Total Quality Checks: 29
#> Valid: 21
#> Info: 8
#> Warn: 0
#> Error: 0
#> 
#> ---------------------------------------------------
#>   DATASET REPORT
#> ---------------------------------------------------
#>   
#> IDENTIFIER: packageIdPattern
#> NAME: packageId pattern matches "scope.identifier.revision"
#> DESCRIPTION: Check against LTER requirements for scope.identifier.revision
#> EXPECTED: 'scope.n.m', where 'n' and 'm' are integers and 'scope' is one ...
#> FOUND: edi.595.1
#> STATUS: valid
#> EXPLANATION: 
#> SUGGESTION: 
#> REFERENCE: 
#> 
#> IDENTIFIER: emlVersion
#> NAME: EML version 2.1.0 or beyond
#> DESCRIPTION: Check the EML document declaration for version 2.1.0 or higher
#> EXPECTED: eml://ecoinformatics.org/eml-2.1.0 or higher
#> FOUND: https://eml.ecoinformatics.org/eml-2.2.0
#> STATUS: valid
#> EXPLANATION: Validity of this quality report is dependent on this check ...
#> SUGGESTION: 
#> REFERENCE: 
#> ...

The Evaluation Report is broken into multiple parts, always starting with the Dataset Report, and followed by an Entity Report for each entity (data object/file) included in the data package. These are differentiated by header lines with the Entity Name and Identifier.

The Dataset and Entity Reports share the same layout:

# - The number of the quality check
Identifier - The identifier of the quality check
Status - The status of the result of the quality check
Quality Check - Describes the type of the quality check (data, metadata, or congruency), the system (knb, lter), and the status that results on failure
Name - The name of the quality check
Description - Brief description of the quality check
Expected - The result that the quality check is expecting
Found - The actual result of the quality check
Explanation - Additional information describing the rationale of the quality check
Suggestion - Potential data package improvements to implement to pass the quality check
Reference - Source of the rationale for the quality check or where to find more information

Parse through the document and address any errors or warnings (denoted by the Error and Warn labels). To understand why a quality check failed, first read the Name and Description of the quality check to determine what was being tested and how the test was being conducted. Then, compare the Expected result to what was Found. If it is still not clear what caused the failure, try to gain additional insight from the Explanation, Suggestion, and Reference fields, or contact the EDI Data Curation Team for clarification (info@edirepository.org).
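For long reports, the checks that need attention can be listed programmatically. The sketch below assumes the character report uses the “IDENTIFIER:” and “STATUS:” line prefixes shown in the console output above. failing_checks() is a hypothetical helper, not part of EDIutils, and the second entry in the sample text (“exampleCheck”) is made up for illustration.

```r
# Hypothetical helper (not part of EDIutils): return the IDENTIFIER of every
# check whose STATUS is "warn" or "error".
failing_checks <- function(report_text) {
  lines <- trimws(unlist(strsplit(report_text, "\n", fixed = TRUE)))
  ids <- character(0)
  current <- NA_character_
  for (line in lines) {
    if (startsWith(line, "IDENTIFIER:")) {
      # Remember which check the following lines belong to
      current <- trimws(sub("^IDENTIFIER:", "", line))
    } else if (startsWith(line, "STATUS:")) {
      status <- tolower(trimws(sub("^STATUS:", "", line)))
      if (status %in% c("warn", "error")) ids <- c(ids, current)
    }
  }
  ids
}

# Illustrative text: the first entry is from the report above,
# the second ("exampleCheck") is invented for this example.
sample_report <- paste(
  "IDENTIFIER: packageIdPattern",
  "FOUND: edi.595.1",
  "STATUS: valid",
  "",
  "IDENTIFIER: exampleCheck",
  "STATUS: warn",
  sep = "\n")

failing_checks(sample_report)
#> [1] "exampleCheck"
```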

Upload

Upload the data package once all errors, and preferably all warnings, are resolved.

# Create a new data package
transaction <- create_data_package(
 eml = paste0(tempdir(), "/edi.595.1.xml"), 
 env = "staging")
transaction
#> [1] "create_163966765080210573__edi.595.1"

# Check status
status <- check_status_create(
 transaction = transaction, 
 env = "staging")
status
#> [1] TRUE

Update

Update is the same as upload, but with an incremented data package version number (e.g. “edi.595.2” supersedes “edi.595.1”). NOTE: The new identifier must be set in the “packageId” attribute of the EML and used as the new EML file name.

# Update data package
transaction <- update_data_package(
  eml = paste0(tempdir(), "/edi.595.2.xml"),
  env = "staging")
transaction
#> [1] "update_edi.595_163966788658131920__edi.595.2"

# Check status
status <- check_status_update(
  transaction = transaction,
  env = "staging")
status
#> [1] TRUE
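The version increment described above can be scripted so that the file name always matches the new packageId. A minimal sketch follows. next_revision() is a hypothetical helper, not part of EDIutils, and it only computes the new identifier; the packageId inside the EML document still has to be edited separately.

```r
# Hypothetical helper (not part of EDIutils): increment the revision of a
# "scope.identifier.revision" packageId.
next_revision <- function(package_id) {
  parts <- strsplit(package_id, ".", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 3)
  parts[3] <- as.character(as.integer(parts[3]) + 1L)
  paste(parts, collapse = ".")
}

new_id <- next_revision("edi.595.1")
new_id
#> [1] "edi.595.2"

# File name for the updated EML, following the naming used in this article
new_eml <- file.path(tempdir(), paste0(new_id, ".xml"))
```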

Once everything looks good in the “staging” environment, repeat the above reservation and upload steps in the “production” environment, where the data package will be assigned a DOI and made discoverable with other published data.
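Since the same steps run in both environments, one way to avoid copy-and-paste mistakes is to wrap them in a function that takes the environment as an argument. The sketch below is one possible arrangement, not an EDIutils feature. evaluate_then_upload() is a hypothetical wrapper that only chains the calls shown in this article, assumes login() has already been called, and uploads only when explicitly asked so that the evaluation summary can be reviewed first.

```r
# Hypothetical wrapper (not part of EDIutils). It chains the calls shown in
# this article and assumes an authentication token is already set.
evaluate_then_upload <- function(eml, env = "staging", upload = FALSE) {
  # Evaluate and print the summary for review
  transaction <- EDIutils::evaluate_data_package(eml = eml, env = env)
  EDIutils::check_status_evaluate(transaction, env = env)
  EDIutils::read_evaluate_report_summary(transaction, env = env)

  # Stop here unless the caller has reviewed the report and asked to upload
  if (!upload) {
    return(invisible(transaction))
  }

  transaction <- EDIutils::create_data_package(eml = eml, env = env)
  EDIutils::check_status_create(transaction = transaction, env = env)
}
```

A typical sequence would be to call the function with env = "staging", fix any reported problems, and then call it again with env = "production" and an EML file whose packageId uses the identifier reserved in “production”.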