vignettes/issues.Rmd
rgbif now has the ability to clean data retrieved from GBIF based on GBIF issues. These issues are returned in data retrieved from GBIF, e.g., through the occ_search() function. Inspired by magrittr, we’ve set up a workflow for cleaning data using the %>% operator. You don’t have to use it, but as we show below, it can make the process quite easy.
Note that you can also query based on issues, e.g., occ_search(taxonKey=1, issue='DEPTH_UNLIKELY'). However, we imagine you are more likely to search for occurrences by taxonomic name or geographic area than by issue, so it makes sense to pull the data down first, then clean it as needed using the workflow below with occ_issues().
Note that occ_issues() only affects the data element in the gbif class that is returned from a call to occ_search(). In a future version we may also remove the associated records from the hierarchy and media elements as they are removed from the data element.
occ_issues() also works with data from occ_download().
Install from CRAN
install.packages("rgbif")
Or install the development version from GitHub
remotes::install_github("ropensci/rgbif")
Load rgbif
library("rgbif")
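Get the taxon key for Helianthus annuus. Recent versions of name_suggest() return a list, with the matches in its data element, so the key is extracted from there:
(key <- name_suggest(q='Helianthus annuus', rank='species')$data$key[1])
Then pass the key to occ_search(). Note that the output shown below was generated with key set to NULL, so it is not restricted to Helianthus annuus and includes records from other taxa.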
(res <- occ_search(taxonKey=key, limit=100))
#> Records found [1650859425]
#> Records returned [100]
#> No. unique hierarchies [68]
#> No. media records [100]
#> No. facets [0]
#> Args [limit=100, offset=0, fields=all]
#> # A tibble: 100 x 96
#> key scientificName decimalLatitude decimalLongitude issues datasetKey
#> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 1135… Cryptosphaeri… 48.0 16.5 colma… 0afba960-…
#> 2 1632… Belenois java… -34.9 139. txmat… fa375330-…
#> 3 1632… Belenois java… -34.9 139. txmat… fa375330-…
#> 4 1830… Lopadostoma g… 48.0 16.5 colma… 0afba960-…
#> 5 1830… Platystomum o… 48.0 16.5 colma… 0afba960-…
#> 6 1831… Trechispora f… 48.0 16.5 colma… 0afba960-…
#> 7 1831… Nemania serpe… 48.0 16.5 colma… 0afba960-…
#> 8 1897… Anisotome aro… -38.9 175. cdrou… 83ae84cf-…
#> 9 1897… Lagenifera Ca… -38.9 175. cdrou… 83ae84cf-…
#> 10 1897… Aciphylla hec… -45.5 169. cdrou… 83ae84cf-…
#> # … with 90 more rows, and 90 more variables: publishingOrgKey <chr>,
#> # installationKey <chr>, publishingCountry <chr>, protocol <chr>,
#> # lastCrawled <chr>, lastParsed <chr>, crawlId <int>,
#> # hostingOrganizationKey <chr>, basisOfRecord <chr>, occurrenceStatus <chr>,
#> # taxonKey <int>, kingdomKey <int>, phylumKey <int>, classKey <int>,
#> # orderKey <int>, familyKey <int>, genusKey <int>, speciesKey <int>,
#> # acceptedTaxonKey <int>, acceptedScientificName <chr>, kingdom <chr>,
#> # phylum <chr>, order <chr>, family <chr>, genus <chr>, species <chr>,
#> # genericName <chr>, specificEpithet <chr>, infraspecificEpithet <chr>,
#> # taxonRank <chr>, taxonomicStatus <chr>, elevation <dbl>, year <int>,
#> # month <int>, day <int>, eventDate <chr>, lastInterpreted <chr>,
#> # license <chr>, identifiers <chr>, facts <chr>, relations <chr>,
#> # institutionKey <chr>, isInCluster <lgl>, geodeticDatum <chr>, class <chr>,
#> # countryCode <chr>, recordedByIDs <chr>, identifiedByIDs <chr>,
#> # country <chr>, identifier <chr>, recordedBy <chr>, catalogNumber <chr>,
#> # institutionCode <chr>, locality <chr>, gbifID <chr>, collectionCode <chr>,
#> # occurrenceID <chr>, identifiedBy <chr>, name <chr>,
#> # X.99d66b6c.9087.452f.a9d4.f15f2c2d0e7e. <chr>,
#> # coordinateUncertaintyInMeters <dbl>, stateProvince <chr>, references <chr>,
#> # eventID <chr>, dataGeneralizations <chr>, vernacularName <chr>,
#> # otherCatalogNumbers <chr>, taxonConceptID <chr>, modified <chr>,
#> # collectionKey <chr>, higherGeography <chr>, language <chr>,
#> # verbatimLocality <chr>, type <chr>, verbatimElevation <chr>,
#> # typeStatus <chr>, gadm <chr>, individualCount <int>,
#> # dynamicProperties <chr>, municipality <chr>, associatedReferences <chr>,
#> # county <chr>, locationID <chr>, fieldNotes <chr>, eventTime <chr>,
#> # behavior <chr>, sex <chr>, habitat <chr>,
#> # identificationVerificationStatus <chr>, identificationRemarks <chr>
The dataset gbifissues can be retrieved using the function gbif_issues(). The first column, code, is the short code used by default in the results from occ_search(), while the second column, issue, is the full issue name given by GBIF. The third column, description, is a full description of the issue, and the fourth, type, gives the kind of record the issue applies to.
head(gbif_issues())
#> code issue
#> 1 bri BASIS_OF_RECORD_INVALID
#> 2 ccm CONTINENT_COUNTRY_MISMATCH
#> 3 cdc CONTINENT_DERIVED_FROM_COORDINATES
#> 4 conti CONTINENT_INVALID
#> 5 cdiv COORDINATE_INVALID
#> 6 cdout COORDINATE_OUT_OF_RANGE
#> description
#> 1 The given basis of record is impossible to interpret or seriously different from the recommended vocabulary.
#> 2 The interpreted continent and country do not match up.
#> 3 The interpreted continent is based on the coordinates, not the verbatim string information.
#> 4 Uninterpretable continent values found.
#> 5 Coordinate value given in some form but GBIF is unable to interpret it.
#> 6 Coordinate has invalid lat/lon values out of their decimal max range.
#> type
#> 1 occurrence
#> 2 occurrence
#> 3 occurrence
#> 4 occurrence
#> 5 occurrence
#> 6 occurrence
You can query to get certain issues
gbif_issues()[ gbif_issues()$code %in% c('cdround','cudc','gass84','txmathi'), ]
#> code issue
#> 10 cdround COORDINATE_ROUNDED
#> 12 cudc COUNTRY_DERIVED_FROM_COORDINATES
#> 23 gass84 GEODETIC_DATUM_ASSUMED_WGS84
#> 39 txmathi TAXON_MATCH_HIGHERRANK
#> description
#> 10 Original coordinate modified by rounding to 5 decimals.
#> 12 The interpreted country is based on the coordinates, not the verbatim string information.
#> 23 Indicating that the interpreted coordinates assume they are based on WGS84 datum as the datum was either not indicated or interpretable.
#> 39 Matching to the taxonomic backbone can only be done on a higher rank and not the scientific name.
#> type
#> 10 occurrence
#> 12 occurrence
#> 23 occurrence
#> 39 occurrence
The code cdround represents the GBIF issue COORDINATE_ROUNDED, which means: "Original coordinate modified by rounding to 5 decimals."
The content for this information comes from https://gbif.github.io/gbif-api/apidocs/org/gbif/api/vocabulary/OccurrenceIssue.html
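The mapping from short codes to full issue names is a plain table lookup. The following is a minimal sketch in base R that does not call rgbif: the lookup table is a hand-copied subset of the gbif_issues() rows shown above, and code_to_issue() is a hypothetical helper name, not an rgbif function.

```r
# Hand-copied subset of the gbif_issues() table shown above (illustration only)
lookup <- data.frame(
  code  = c("cdround", "cudc", "gass84", "txmathi"),
  issue = c("COORDINATE_ROUNDED", "COUNTRY_DERIVED_FROM_COORDINATES",
            "GEODETIC_DATUM_ASSUMED_WGS84", "TAXON_MATCH_HIGHERRANK"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate short codes to full issue names.
# Codes missing from the table come back as NA.
code_to_issue <- function(codes, table = lookup) {
  table$issue[match(codes, table$code)]
}

code_to_issue(c("gass84", "cdround", "zzz"))
```

With the real package, the same lookup could be run against the full table returned by gbif_issues() instead of the hand-copied subset.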
Now that we know a bit about GBIF issues, we can parse the data based on them. Using the data generated above and the %>% operator imported from magrittr, we can keep only the records with the issue gass84, or GEODETIC_DATUM_ASSUMED_WGS84 (note how the number of records returned goes down to 77 from the initial 100).
res %>%
occ_issues(gass84)
#> Records found [1650859425]
#> Records returned [77]
#> No. unique hierarchies [68]
#> No. media records [100]
#> No. facets [0]
#> Args [limit=100, offset=0, fields=all]
#> # A tibble: 77 x 96
#> key scientificName decimalLatitude decimalLongitude issues datasetKey
#> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 1897… Anisotome aro… -38.9 175. cdrou… 83ae84cf-…
#> 2 1897… Lagenifera Ca… -38.9 175. cdrou… 83ae84cf-…
#> 3 1897… Aciphylla hec… -45.5 169. cdrou… 83ae84cf-…
#> 4 1897… Ozothamnus va… -39.2 175. cdrou… 83ae84cf-…
#> 5 1897… Huperzia aust… -39.2 175. cdrou… 83ae84cf-…
#> 6 1897… Celmisia dens… -45.5 169. cdrou… 83ae84cf-…
#> 7 1897… Schizaea aust… -39.2 175. cdrou… 83ae84cf-…
#> 8 1897… Ourisia macro… -39.2 175. cdrou… 83ae84cf-…
#> 9 1897… Epilobium als… -39.2 175. cdrou… 83ae84cf-…
#> 10 1897… Lycopodium fa… -39.2 175. cdrou… 83ae84cf-…
#> # … with 67 more rows, and 90 more variables: publishingOrgKey <chr>,
#> # installationKey <chr>, publishingCountry <chr>, protocol <chr>,
#> # lastCrawled <chr>, lastParsed <chr>, crawlId <int>,
#> # hostingOrganizationKey <chr>, basisOfRecord <chr>, occurrenceStatus <chr>,
#> # taxonKey <int>, kingdomKey <int>, phylumKey <int>, classKey <int>,
#> # orderKey <int>, familyKey <int>, genusKey <int>, speciesKey <int>,
#> # acceptedTaxonKey <int>, acceptedScientificName <chr>, kingdom <chr>,
#> # phylum <chr>, order <chr>, family <chr>, genus <chr>, species <chr>,
#> # genericName <chr>, specificEpithet <chr>, infraspecificEpithet <chr>,
#> # taxonRank <chr>, taxonomicStatus <chr>, elevation <dbl>, year <int>,
#> # month <int>, day <int>, eventDate <chr>, lastInterpreted <chr>,
#> # license <chr>, identifiers <chr>, facts <chr>, relations <chr>,
#> # institutionKey <chr>, isInCluster <lgl>, geodeticDatum <chr>, class <chr>,
#> # countryCode <chr>, recordedByIDs <chr>, identifiedByIDs <chr>,
#> # country <chr>, identifier <chr>, recordedBy <chr>, catalogNumber <chr>,
#> # institutionCode <chr>, locality <chr>, gbifID <chr>, collectionCode <chr>,
#> # occurrenceID <chr>, identifiedBy <chr>, name <chr>,
#> # X.99d66b6c.9087.452f.a9d4.f15f2c2d0e7e. <chr>,
#> # coordinateUncertaintyInMeters <dbl>, stateProvince <chr>, references <chr>,
#> # eventID <chr>, dataGeneralizations <chr>, vernacularName <chr>,
#> # otherCatalogNumbers <chr>, taxonConceptID <chr>, modified <chr>,
#> # collectionKey <chr>, higherGeography <chr>, language <chr>,
#> # verbatimLocality <chr>, type <chr>, verbatimElevation <chr>,
#> # typeStatus <chr>, gadm <chr>, individualCount <int>,
#> # dynamicProperties <chr>, municipality <chr>, associatedReferences <chr>,
#> # county <chr>, locationID <chr>, fieldNotes <chr>, eventTime <chr>,
#> # behavior <chr>, sex <chr>, habitat <chr>,
#> # identificationVerificationStatus <chr>, identificationRemarks <chr>
Note also that we’ve set up occ_issues() so that you can pass in issue names without having to quote them, thereby speeding up data cleaning.
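To make the idea of keeping records by issue concrete, here is a minimal base R sketch on a toy data frame. It is not how occ_issues() is implemented; the toy values are invented, has_issue() is a hypothetical helper, and the sketch assumes the issues column holds comma-separated codes, as the output in this vignette suggests.

```r
# Toy stand-in for the data element of an occ_search() result (values invented)
toy <- data.frame(
  key    = c("1", "2", "3", "4"),
  issues = c("cdround,gass84", "txmathi", "gass84", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each row whose comma-separated
# issues string contains exactly the given code
has_issue <- function(issues, code) {
  vapply(strsplit(issues, ",", fixed = TRUE),
         function(x) code %in% x, logical(1))
}

# Keep only the rows flagged with gass84
kept <- toy[has_issue(toy$issues, "gass84"), ]
kept$key
```

Splitting on commas and comparing whole codes, rather than pattern matching on the raw string, avoids accidental hits when one code is a substring of another.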
Next, we can remove data with certain issues just as easily by putting a - sign in front of the issue code. Here we remove data with the issues depunl and mdatunl. None of the 100 records in this result carry either issue, so all 100 are returned.
res %>%
occ_issues(-depunl, -mdatunl)
#> Records found [1650859425]
#> Records returned [100]
#> No. unique hierarchies [68]
#> No. media records [100]
#> No. facets [0]
#> Args [limit=100, offset=0, fields=all]
#> # A tibble: 100 x 96
#> key scientificName decimalLatitude decimalLongitude issues datasetKey
#> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 1135… Cryptosphaeri… 48.0 16.5 colma… 0afba960-…
#> 2 1632… Belenois java… -34.9 139. txmat… fa375330-…
#> 3 1632… Belenois java… -34.9 139. txmat… fa375330-…
#> 4 1830… Lopadostoma g… 48.0 16.5 colma… 0afba960-…
#> 5 1830… Platystomum o… 48.0 16.5 colma… 0afba960-…
#> 6 1831… Trechispora f… 48.0 16.5 colma… 0afba960-…
#> 7 1831… Nemania serpe… 48.0 16.5 colma… 0afba960-…
#> 8 1897… Anisotome aro… -38.9 175. cdrou… 83ae84cf-…
#> 9 1897… Lagenifera Ca… -38.9 175. cdrou… 83ae84cf-…
#> 10 1897… Aciphylla hec… -45.5 169. cdrou… 83ae84cf-…
#> # … with 90 more rows, and 90 more variables: publishingOrgKey <chr>,
#> # installationKey <chr>, publishingCountry <chr>, protocol <chr>,
#> # lastCrawled <chr>, lastParsed <chr>, crawlId <int>,
#> # hostingOrganizationKey <chr>, basisOfRecord <chr>, occurrenceStatus <chr>,
#> # taxonKey <int>, kingdomKey <int>, phylumKey <int>, classKey <int>,
#> # orderKey <int>, familyKey <int>, genusKey <int>, speciesKey <int>,
#> # acceptedTaxonKey <int>, acceptedScientificName <chr>, kingdom <chr>,
#> # phylum <chr>, order <chr>, family <chr>, genus <chr>, species <chr>,
#> # genericName <chr>, specificEpithet <chr>, infraspecificEpithet <chr>,
#> # taxonRank <chr>, taxonomicStatus <chr>, elevation <dbl>, year <int>,
#> # month <int>, day <int>, eventDate <chr>, lastInterpreted <chr>,
#> # license <chr>, identifiers <chr>, facts <chr>, relations <chr>,
#> # institutionKey <chr>, isInCluster <lgl>, geodeticDatum <chr>, class <chr>,
#> # countryCode <chr>, recordedByIDs <chr>, identifiedByIDs <chr>,
#> # country <chr>, identifier <chr>, recordedBy <chr>, catalogNumber <chr>,
#> # institutionCode <chr>, locality <chr>, gbifID <chr>, collectionCode <chr>,
#> # occurrenceID <chr>, identifiedBy <chr>, name <chr>,
#> # X.99d66b6c.9087.452f.a9d4.f15f2c2d0e7e. <chr>,
#> # coordinateUncertaintyInMeters <dbl>, stateProvince <chr>, references <chr>,
#> # eventID <chr>, dataGeneralizations <chr>, vernacularName <chr>,
#> # otherCatalogNumbers <chr>, taxonConceptID <chr>, modified <chr>,
#> # collectionKey <chr>, higherGeography <chr>, language <chr>,
#> # verbatimLocality <chr>, type <chr>, verbatimElevation <chr>,
#> # typeStatus <chr>, gadm <chr>, individualCount <int>,
#> # dynamicProperties <chr>, municipality <chr>, associatedReferences <chr>,
#> # county <chr>, locationID <chr>, fieldNotes <chr>, eventTime <chr>,
#> # behavior <chr>, sex <chr>, habitat <chr>,
#> # identificationVerificationStatus <chr>, identificationRemarks <chr>
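Because the sample above happens to contain no records with depunl or mdatunl, the effect of exclusion is easier to see on toy data. The sketch below is base R only and is not the occ_issues() implementation; the data are invented, drop_issues() is a hypothetical helper, and it assumes a record is dropped when it carries any one of the listed codes.

```r
# Toy records, two of which carry an issue we want to exclude (values invented)
toy <- data.frame(
  key    = c("1", "2", "3", "4"),
  issues = c("cdround,gass84", "depunl", "gass84,mdatunl", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop every row that carries at least one of `codes`
drop_issues <- function(df, codes) {
  flagged <- vapply(strsplit(df$issues, ",", fixed = TRUE),
                    function(x) any(codes %in% x), logical(1))
  df[!flagged, , drop = FALSE]
}

cleaned <- drop_issues(toy, c("depunl", "mdatunl"))
cleaned$key
```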
Another thing we can do with occ_issues() is go from issue codes to full issue names in case you want those in your dataset (here, showing only a few columns to see the data better for this demo):
out <- res %>% occ_issues(mutate = "expand")
head(out$data[,c(1,5)])
#> # A tibble: 6 x 2
#> key issues
#> <chr> <chr>
#> 1 1135442454 COLLECTION_MATCH_NONE,INSTITUTION_MATCH_FUZZY
#> 2 1632784162 TAXON_MATCH_HIGHERRANK,INDIVIDUAL_COUNT_INVALID,INSTITUTION_COLLEC…
#> 3 1632784175 TAXON_MATCH_HIGHERRANK,INDIVIDUAL_COUNT_INVALID,INSTITUTION_COLLEC…
#> 4 1830979757 COLLECTION_MATCH_NONE,INSTITUTION_MATCH_FUZZY
#> 5 1830979760 COLLECTION_MATCH_NONE,INSTITUTION_MATCH_FUZZY
#> 6 1831004050 COLLECTION_MATCH_NONE,INSTITUTION_MATCH_FUZZY
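The expansion step can be pictured as replacing each code inside the comma-separated string with its full name. This is a base R sketch only, not the code behind mutate = "expand"; the three code-to-name pairs are copied from output shown in this vignette, and expand_issues() is a hypothetical helper.

```r
# Code-to-name pairs copied from output shown in this vignette (subset only)
issue_names <- c(
  colmano = "COLLECTION_MATCH_NONE",
  inmafu  = "INSTITUTION_MATCH_FUZZY",
  txmathi = "TAXON_MATCH_HIGHERRANK"
)

# Hypothetical helper: expand every code in each comma-separated string.
# An empty string stays empty; a code missing from the map becomes "NA".
expand_issues <- function(issues, names_map = issue_names) {
  vapply(strsplit(issues, ",", fixed = TRUE), function(x) {
    paste(unname(names_map[x]), collapse = ",")
  }, character(1))
}

expanded <- expand_issues(c("colmano,inmafu", "txmathi", ""))
expanded
```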
Sometimes you may want each type of issue in a separate column. The split option does this, with the number of added columns equal to the number of issue types.
out <- res %>% occ_issues(mutate = "split")
head(out$data[,c(1,5:10)])
#> # A tibble: 6 x 7
#> name colmano inmafu txmathi indci incomis cdround
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Cryptosphaeria eunomia var. frax… y y n n n n
#> 2 Belenois java (Sparrman, 1768) n y y y y n
#> 3 Belenois java (Sparrman, 1768) n y y y y n
#> 4 Lopadostoma gastrinum (Fr.) Trav… y y n n n n
#> 5 Platystomum obtectum (Peck) Lind… y y n n n n
#> 6 Trechispora farinacea (Pers.) Li… y y n n n n
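The y/n layout above amounts to one indicator column per distinct issue code. A minimal base R sketch of that reshaping follows; it works on invented toy data, does not call rgbif, and is not the implementation behind mutate = "split".

```r
# Toy records with comma-separated issue codes (values invented)
toy <- data.frame(
  name   = c("sp1", "sp2", "sp3"),
  issues = c("colmano,inmafu", "inmafu,txmathi", ""),
  stringsAsFactors = FALSE
)

# Split each issues string into its codes and collect the distinct codes
parts <- strsplit(toy$issues, ",", fixed = TRUE)
codes <- unique(unlist(parts))

# One column per code: "y" if the record carries it, "n" otherwise.
# vapply() returns a matrix with one row per record and one column per code.
flags <- vapply(codes, function(code) {
  ifelse(vapply(parts, function(x) code %in% x, logical(1)), "y", "n")
}, character(nrow(toy)))

wide <- cbind(toy["name"], as.data.frame(flags, stringsAsFactors = FALSE))
wide
```

Columns appear in the order in which each code is first encountered, which is one possible convention; the real function may order them differently.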
Or you can expand each issue type into its full name, and split each issue into a separate column.
out <- res %>% occ_issues(mutate = "split_expand")
head(out$data[,c(1,5:10)])
#> # A tibble: 6 x 7
#> name COLLECTION_MATC… INSTITUTION_MAT… TAXON_MATCH_HIG… INDIVIDUAL_COUN…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Cryp… y y n n
#> 2 Bele… n y y y
#> 3 Bele… n y y y
#> 4 Lopa… y y n n
#> 5 Plat… y y n n
#> 6 Trec… y y n n
#> # … with 2 more variables: INSTITUTION_COLLECTION_MISMATCH <chr>,
#> # COORDINATE_ROUNDED <chr>
We hope this helps users get just the data they want, and nothing more. Let us know if you have feedback on data cleaning functionality in rgbif at [email protected] or at https://github.com/ropensci/rgbif/issues.