Getting Occurrence Data From GBIF
There are two ways to get occurrence data from GBIF:
- occ_download(): unlimited records. Useful for research and citation.
- occ_search(): limited to 100K records. Useful primarily for testing.
occ_search() (and the related function occ_data()) should not be used for serious research. Users sometimes find it easier to use occ_search() rather than occ_download() because they do not need to supply a username or password, and do not need to wait for a download to finish. However, any serious research project should always use occ_download(), which is the best way to get GBIF mediated occurrences.
The main functions related to downloads are:
- occ_download(): start a download on GBIF servers.
- occ_download_prep(): preview a download request before sending to GBIF.
- occ_download_get(): retrieve a download from GBIF to your computer.
- occ_download_import(): load a download from your computer to R.
To make a download request,
occ_download() uses helper functions starting with pred. These functions define filters on the large GBIF occurrence table, so that only a usable subset is returned. The predicate functions are named for the ‘type’ of operation they do, following the terminology used by GBIF.
|Function|Description|
|pred()|key is equal to value|
|pred_lt()|key is less than value|
|pred_lte()|key is less than or equal to value|
|pred_gt()|key is greater than value|
|pred_gte()|key is greater than or equal to value|
|pred_not()|key is not value|
|pred_like()|key like pattern|
|pred_within()|lat-lon values within WKT polygon|
|pred_notnull()|column is not NULL|
|pred_isnull()|column is NULL|
|pred_and()|a logical and of predicate functions|
|pred_or()|a logical or of predicate functions|
|pred_in()|values are in the column|
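As a minimal sketch of how these helpers fit together, the lines below build a few predicates and combine them. Nothing is sent to GBIF here; the object names (hare, recent, specimens, combined) are only illustrative.

```r
library(rgbif)

# Single-condition predicates take a key and a value.
hare <- pred("taxonKey", 2436775)   # key is equal to value
recent <- pred_gte("year", 1900)    # key is greater than or equal to value

# pred_in() accepts several values for one key.
specimens <- pred_in("basisOfRecord", c("FOSSIL_SPECIMEN", "LIVING_SPECIMEN"))

# pred_not() negates a predicate and pred_and() requires all of them to hold.
combined <- pred_and(hare, recent, pred_not(specimens))

# Printing shows the filter that would be sent with a download request.
combined
```

Predicates built this way can be passed directly to occ_download() or occ_download_prep().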
It is required to set up your GBIF credentials to make downloads from GBIF. I suggest that you follow this short tutorial before continuing.
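A minimal sketch of one way to do this, assuming the credentials are supplied through the GBIF_USER, GBIF_PWD and GBIF_EMAIL environment variables. All values shown are placeholders. Keeping them in your .Renviron file avoids writing a password into a script.

```r
# Contents of ~/.Renviron (placeholder values, replace with your own):
# GBIF_USER="your_username"
# GBIF_PWD="your_password"
# GBIF_EMAIL="you@example.org"

# For a single session the same variables can be set from R instead.
Sys.setenv(
  GBIF_USER = "your_username",
  GBIF_PWD = "your_password",
  GBIF_EMAIL = "you@example.org"
)

# Confirm that all three variables are visible to the current session.
creds <- Sys.getenv(c("GBIF_USER", "GBIF_PWD", "GBIF_EMAIL"))
all(nzchar(creds))
```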
The following will download all occurrences of Lepus saxatilis. You can use
name_backbone("Lepus saxatilis") to find the taxonKey (usageKey).
# remember to set up your GBIF credentials
occ_download(pred("taxonKey", 2436775), format = "SIMPLE_CSV")
<<gbif download>>
  Your download is being processed by GBIF:
  https://www.gbif.org/occurrence/download/0079311-210914110416597
  Most downloads finish within 15 min.
  Check status with
  occ_download_wait('0079311-210914110416597')
  After it finishes, use
  d <- occ_download_get('0079311-210914110416597') %>%
    occ_download_import()
  to retrieve your download.
Download Info:
  Username: jwaller
  E-mail: firstname.lastname@example.org
  Format: SIMPLE_CSV
  Download key: 0079311-210914110416597
  Created: 2021-12-14T13:02:09.610+00:00
Citation Info:
  Please always cite the download DOI when using this data.
  https://www.gbif.org/citation-guidelines
  DOI: 10.15468/dl.dqp6a3
  Citation:
  GBIF Occurrence Download https://doi.org/10.15468/dl.dqp6a3 Accessed from R via rgbif (https://github.com/ropensci/rgbif) on 2021-12-14
The print out tells us that we can wait for the download to finish with
occ_download_wait(). Most downloads under 100K records run very quickly. You can also check the status of a download on your GBIF user page.
occ_download_wait('0079311-210914110416597') # checks if download is finished
It is also possible to save your download into an object and pass that object into occ_download_wait() and occ_download_get().
gbif_download <- occ_download(pred("taxonKey", 2436775), format = "SIMPLE_CSV")

occ_download_wait(gbif_download)

d <- occ_download_get(gbif_download) %>%
  occ_download_import()
Note that the citation appears in the printout. This is what you would use if you used this download in a research paper. Please also see GBIF’s citation guidelines when using GBIF mediated data.
GBIF Occurrence Download https://doi.org/10.15468/dl.dqp6a3 Accessed from R via rgbif (https://github.com/ropensci/rgbif) on 2021-12-14
gbif_citation('0078589-210914110416597')
# or
# gbif_citation(gbif_download)
Typically GBIF downloads follow a particular pattern, and the same filters are used again and again. These are some common filters that you should probably be using.
This download will …
- Remove default geospatial issues.
- Keep only records with coordinates.
- Remove absent records.
- Remove fossils and living specimens
- Retrieve all Lepus saxatilis.
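The filters listed above can be sketched as the following request. This is a sketch, assuming your GBIF credentials are already set up; running it submits a real download. The object name gbif_download is only illustrative.

```r
library(rgbif)

# Requires GBIF credentials; this submits a download request to GBIF.
gbif_download <- occ_download(
  pred("hasGeospatialIssue", FALSE),   # remove default geospatial issues
  pred("hasCoordinate", TRUE),         # keep only records with coordinates
  pred("occurrenceStatus", "PRESENT"), # remove absent records
  pred_not(pred_in("basisOfRecord",
                   c("FOSSIL_SPECIMEN", "LIVING_SPECIMEN"))), # no fossils or living specimens
  pred("taxonKey", 2436775),           # all Lepus saxatilis
  format = "SIMPLE_CSV"
)
```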
The code above is commonly used, but pretty long. This is why
pred_default() was created.
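A minimal sketch of the shorter form, assuming pred_default() is called without arguments and stands in for the common filters described above. As before, it needs GBIF credentials and submits a real request.

```r
library(rgbif)

# pred_default() bundles the commonly used filters into one predicate.
gbif_download <- occ_download(
  pred_default(),
  pred("taxonKey", 2436775), # all Lepus saxatilis
  format = "SIMPLE_CSV"
)
```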
Another common download pattern is long species list downloads. There is a tutorial about downloading from a long list of species here.
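A minimal sketch of that pattern, assuming name_backbone_checklist() is used to match the names and that its result has a usageKey column. The three species names are only an illustrative list, and the linked tutorial may do this differently.

```r
library(rgbif)

# Illustrative species list; replace with your own names.
species_names <- c("Lepus saxatilis", "Lepus capensis", "Lepus europaeus")

# Match the names against the GBIF backbone (needs an internet connection).
matched <- name_backbone_checklist(species_names)

# Keep the keys of names that were matched.
taxon_keys <- matched$usageKey[!is.na(matched$usageKey)]

# One download for the whole list (requires GBIF credentials).
gbif_download <- occ_download(
  pred_in("taxonKey", taxon_keys),
  format = "SIMPLE_CSV"
)
```

It is worth checking the matched names by eye before downloading, since a misspelled name can match the wrong taxon or nothing at all.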
Here I make an overly complex download to highlight some of the capabilities of
occ_download(). Most useful downloads are much simpler.
occ_download(
  type = "and",
  pred("taxonKey", 2436775),
  pred("hasGeospatialIssue", FALSE),
  pred("hasCoordinate", TRUE),
  pred("occurrenceStatus", "PRESENT"),
  pred_gte("year", 1900),
  pred_not(pred_in("basisOfRecord", c("FOSSIL_SPECIMEN", "LIVING_SPECIMEN"))),
  pred_or(
    pred("country", "ZA"),
    pred("gadm", "ETH")
  ),
  pred_or(
    pred_not(pred_in("establishmentMeans", c("MANAGED", "INTRODUCED"))),
    pred_isnull("establishmentMeans")
  ),
  pred_or(
    pred_lt("coordinateUncertaintyInMeters", 10000),
    pred_isnull("coordinateUncertaintyInMeters")
  ),
  format = "SIMPLE_CSV"
)
This download will …
- pred("taxonKey", 2436775): retrieve all Lepus saxatilis records.
- pred("hasGeospatialIssue", FALSE): remove default geospatial issues.
- pred("hasCoordinate", TRUE): keep only records with coordinates.
- pred("occurrenceStatus","PRESENT"): remove absent records.
- pred_not(pred_in("basisOfRecord",c("FOSSIL_SPECIMEN","LIVING_SPECIMEN"))): remove fossils and living specimens.
- pred_gte("year", 1900): keep records from the year 1900 or later.
- pred_or(pred("country","ZA"),pred("gadm","ETH")): keep records in South Africa or Ethiopia, each matched with a different area system (a country code and a GADM identifier). See enumeration_country() for country codes.
- pred_or(pred_not(pred_in("establishmentMeans",c("MANAGED","INTRODUCED"))),pred_isnull("establishmentMeans")): the establishmentMeans column does not contain managed or introduced, but can be left blank.
- pred_or(pred_lt("coordinateUncertaintyInMeters",10000),pred_isnull("coordinateUncertaintyInMeters")): coordinateUncertaintyInMeters is less than 10,000 meters or is left blank.
- format = "SIMPLE_CSV": return just a tab-separated file of occurrences.
Another sometimes useful pattern is downloading all occurrences except some group. Birds make up a large portion of GBIF occurrences. If you wanted to download everything but birds, you could wrap a taxonKey predicate for birds in pred_not().
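A minimal sketch of such a request, assuming 212 is the GBIF backbone taxonKey for the class Aves (you can confirm this with name_backbone("Aves")). Note that this would be a very large download, and it requires GBIF credentials.

```r
library(rgbif)

# Assumed backbone key for the class Aves; verify before use.
aves_key <- 212

# Everything in GBIF except birds.
gbif_download <- occ_download(
  pred_not(pred("taxonKey", aves_key)),
  format = "SIMPLE_CSV"
)
```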
Sometimes users will want to download records using a large polygon. It is worth noting that many land-based polygons can be captured using the gadm filter. Here I will download all occurrences within this biodiversity hotspot known as Wallacea.
A polygon may contain a maximum of 10,000 points, but in practice this number might be lower depending on the complexity of the polygon. You also have to make sure the points of your polygons are in “anticlockwise” order. See downloads documentation.
# Simple code to go from shapefile to WKT
# large_wkt <- sf::st_read("large_shapefile") %>%
#   sf::st_geometry() %>%
#   sf::st_as_text()

large_wkt <- "POLYGON ((127.0171 4.9391, 124.5973 4.7960, 121.7968 3.7617, 119.0816 3.0776, 119.1999 0.5229, 117.3936 -5.1010, 116.4971 -6.7425, 115.9096 -8.2031, 115.5687 -9.9150, 117.2358 -10.0975, 120.9361 -11.4096, 122.5775 -11.8123, 123.5516 -11.8544, 125.5775 -11.2832, 128.6224 -9.7196, 131.1873 -9.1914, 132.1547 -8.3925, 133.4920 -6.4151, 133.6129 -5.8375, 133.5079 -5.1369, 133.1861 -4.7011, 131.4894 -3.3231, 129.8271 -2.4649, 129.3679 -2.0044, 129.1699 -1.1486, 129.7026 -0.2859, 129.7691 0.2902, 129.4364 2.4420, 128.9881 3.3626, 128.3585 4.1683, 127.7041 4.6918, 127.0171 4.9391))"

occ_download(pred_within(large_wkt), format = "SIMPLE_CSV")
Sometimes GBIF data publishers will not know the exact lat-lon location of a record and will enter the lat-long center of the country instead. This is a data issue because users might be unaware that an observation is pinned to a country center and assume it is a precise location.
It is possible to filter out country/area centroids in a download using the distanceFromCentroidInMeters filter.
# download occurrences that are at least 2km from a centroid in Sweden
occ_download(
  pred_gte("distanceFromCentroidInMeters", "2000"),
  pred("country", "SE"),
  format = "SIMPLE_CSV"
)
GBIF currently uses only PCLI level centroids from the catalogue of centroids.
GBIF is a large data aggregator. It mediates occurrence records from a large variety of sources:
- Citizen Science Apps
- Ecological Surveys
- Camera Traps
- Satellite Tracking
- Research Projects
For this reason, not all of the occurrences from GBIF are “fit for use”, meaning they are not suitable for a particular purpose or project. Some data-quality issues are so well understood that there are automated ways to detect and remove them from a dataset.
- Country Centroids
- Living Specimens
- Uncertain Records
- Country Coordinate Mismatch
- Zero-Zero Coordinate
- Any-Zero Coordinates
- Gridded Datasets
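As a rough sketch of what such automated checks look like, the base R lines below flag zero-zero coordinates, any-zero coordinates and living specimens. The data frame is a made-up toy example that borrows column names from a SIMPLE_CSV download; it is not real GBIF data.

```r
# Toy data frame with hypothetical records (not real GBIF data).
occ <- data.frame(
  species = rep("Lepus saxatilis", 4),
  decimalLatitude = c(-33.9, 0, -25.7, 0),
  decimalLongitude = c(18.4, 0, 0, 28.2),
  basisOfRecord = c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
                    "HUMAN_OBSERVATION", "LIVING_SPECIMEN"),
  stringsAsFactors = FALSE
)

# Zero-zero: both coordinates are exactly 0.
zero_zero <- occ$decimalLatitude == 0 & occ$decimalLongitude == 0

# Any-zero: at least one coordinate is exactly 0.
any_zero <- occ$decimalLatitude == 0 | occ$decimalLongitude == 0

# Living specimens (for example zoo or botanical garden records).
living <- occ$basisOfRecord == "LIVING_SPECIMEN"

# Keep only records that pass every check.
clean <- occ[!any_zero & !living, ]
nrow(clean)
```

Checks such as country centroids or gridded datasets need reference data and are better left to dedicated cleaning tools.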
Since rgbif is not a data cleaning package, please see the following resources for post-processing your occurrence downloads: