Getting Dataset Metadata From GBIF
John Waller
2023-12-12
Source:vignettes/getting_dataset_info.Rmd
getting_dataset_info.Rmd
It can sometimes be useful to get information about datasets rather than the data which they contain.
In this article, I will explain the various methods available to access dataset meta-data stored within the GBIF registry.
dataset_search() and dataset_export()
-
dataset_search()
: Use this function if you want meta-data, counts, facets, and not necessarily all of the results. -
dataset_export()
: Use this function if you only want meta-data and all of the results as a table.
If you want just a (non-random) sample of datasets from the registry,
you can run dataset_search()
with no arguments, which will
return 100 datasets of various types. However, running
dataset_export()
with no filters, will download all
of the dataset meta-data in the registry.
dataset_search()
# dataset_export() # beware this will download 93K datasets!
There are a few types of datasets GBIF supports. The most well known
is the occurrence dataset. You can search for
occurrence type datasets using the type
filter.
dataset_search(type = "OCCURRENCE")
# dataset_export(type = "OCCURRENCE") # download all of the meta-data
Checklists are also another common type of dataset mediated by GBIF. You can use the multiple values separator “;”, in order to get both checklist and occurrence types.
dataset_search(type = "OCCURRENCE;CHECKLIST")
# dataset_export(type = "OCCURRENCE;CHECKLIST") # download both types
You might be wondering what the other possible types of datasets are
called. With the facets interface, it is possible to
get group-by counts for most of the dataset_search()
filters.
dataset_search(facet="type",limit=0)$facets
$type
name count
1 CHECKLIST 49628
2 OCCURRENCE 38523
3 SAMPLING_EVENT 3036
4 METADATA 386
Use facetLimit
to control the number of results returned
with the facets interface.
dataset_search(facet="publishingCountry",facetLimit=200,limit=0)$facets
name count
1 CH 47107
2 FR 15559
3 DE 9012
4 CO 2812
<...>
136 RS 1
137 SO 1
138 SS 1
139 TR 1
140 WF 1
Facets can also be used with other filters. For example to get the top countries publishing occurrence datasets.
dataset_search(facet="publishingCountry",type="OCCURRENCE",facetLimit=200,limit=0)$facets
Here are some more examples of using dataset_search()
filters:
# datasets published by Ukraine
dataset_search(publishingCountry = "UA")
# checklist datasets with a CC0 license
dataset_export(type="CHECKLIST", license = "CC0_1_0")
# Be aware that not all publishers fill in a subType
dataset_search(subType="TAXONOMIC_AUTHORITY")
# Get datasets hosted by Norway
dataset_search(hostingCountry = "NO")
# counts of datasets hosted by Norway but published by other countries
dataset_search(facet="publishingCountry",hostingCountry = "NO",limit=0,facetLimit=100)$facets
# get all datasets within the GRIIS porject
dataset_export(projectId = "GRIIS")
# keywords used by the GRIIS project
dataset_search(facet="keyword",projectId="GRIIS",limit=0,facetLimit=100)$facets
# datasets with data collected between 1600 and 1800
dataset_search(decade = "1600,1800")
# group-by license counts of occurrence type datasets
dataset_search(facet="license",type="OCCURRENCE",limit=0,facetLimit=10)$facets
# search for dataset by doi of the dataset
dataset_search(doi="10.15468/aomfnb")
# datasets hosted by Scandinavia
dataset_search(hostingCountry = "IS;FI;DK;NO;SE")
dataset_search(facet="hostingCountry",hostingCountry = "IS;FI;DK;NO;SE",limit=0,facetLimit=5)$facets
# all datasests in the VertNet network
dataset_export(networkKey = "99d66b6c-9087-452f-a9d4-f15f2c2d0e7e")
# all datasets with the keyword "DEPOBIO" hosted by France
dataset_export(keyword="DEPOBIO",hostingCountry="FR")
# number of occurrences
dataset_export(keyword="DEPOBIO",hostingCountry="FR")$occurrenceRecordsCount |> sum()
# datasets published by Cornell Lab of Ornithology
dataset_search(publishingOrg = "e2e717bf-551a-4917-bdc9-4fa0f342c530")
I haven’t yet mentioned dataset_suggest()
, which will
return less data than dataset_search()
, but is practically
the same function. Most of the time you will be wanting to use
dataset_search()
and dataset_export()
rather
than dataset_suggest()
. The endpoint for
dataset_suggest()
was designed for allowing the GBIF
website to function efficiently, and isn’t really that interesting for
rgbif users, but there might be some edge cases where
it is useful.
dataset
dataset()
returns other meta-data not necessarily found
with a dataset_search()
. Most of the time you will want to
use dataset_search()
, but there are times when you have to
use dataset()
. For example, when you want to search by
machine tag.
# return all datasets tagged as "citizen science"
dataset(machineTagNamespace="citizenScience.gbif.org")
Other functions
There are various other dataset functions that you might find useful. Particularly if you know the datasetKey uuid, there are a group of functions that can be used.
# get details of a single dataset
dataset_get("38b4c89f-584c-41bb-bd8f-cd1def33e92f")
# get the details of how the dataset is being ingested by GBIF
dataset_process("38b4c89f-584c-41bb-bd8f-cd1def33e92f",limit=3)
# what networks does the dataset belong to?
dataset_networks("3dab037f-a520-4bc3-b888-508755c2eb52")
# what datasets compose the dataset? Not many datasets have constituents.
dataset_constituents("7ddf754f-d193-4cc9-b351-99906754a03b",limit=3)
# what contacts did the publishers give for the dataset?
dataset_contact("7ddf754f-d193-4cc9-b351-99906754a03b")
# only works for CHECKLIST type datasets
dataset_metrics("7ddf754f-d193-4cc9-b351-99906754a03b")