There are two ways to get occurrence data from GBIF:

  1. occ_download(): unlimited records. Useful for research and citation.
  2. occ_search(): limited to 100K records. Useful primarily for testing.

The function occ_search() (and related function occ_data()) should not be used for serious research. Users sometimes find it easier to use occ_search() rather than occ_download() because they do not need to supply a username or password, and also do not need to wait for a download to finish. However, any serious research project should always use occ_download() instead.
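For the quick-testing use case described above, a minimal sketch of an occ_search() call might look like the following. The taxonKey is the Lepus saxatilis key used later in this article; the limit of 5 is an arbitrary choice for illustration.

```r
library(rgbif)

# Quick look at a handful of records; no credentials or waiting needed.
res <- occ_search(taxonKey = 2436775, limit = 5)

res$meta$count  # total number of matching records reported by GBIF
head(res$data)  # the (at most 5) records actually returned
```

This is convenient for checking that a filter behaves as expected before requesting the full dataset with occ_download().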

occ_download()

occ_download() is the best way to get GBIF mediated occurrences.

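The main functions related to downloads, all of which are used later in this article, are:

  • occ_download(): start a download request on GBIF.
  • occ_download_wait(): wait for a download to finish.
  • occ_download_get(): retrieve a finished download.
  • occ_download_import(): load a retrieved download into R.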

To make a download request, occ_download() uses helper functions starting with pred. These functions define filters on the large GBIF occurrence table, so that only a usable subset is returned. The predicate functions are named for the ‘type’ of operation they do, following the terminology used by GBIF.

function        description                               example
pred()          key is equal to value                     pred("taxonKey", 212)
pred_lt()       key is less than value                    pred_lt("coordinateUncertaintyInMeters", 5000)
pred_lte()      key is less than or equal to value        pred_lte("year", 1900)
pred_gt()       key is greater than value                 pred_gt("elevation", 1000)
pred_gte()      key is greater than or equal to value     pred_gte("depth", 1000)
pred_not()      negates a predicate (key is not value)    pred_not(pred("taxonKey", 212))
pred_like()     key is like pattern                       pred_like("catalogNumber", "PAPS5-560*")
pred_within()   lat-lon values within WKT polygon         pred_within('POLYGON((-14 42, 9 38, -7 26, -14 42))')
pred_notnull()  column is not NULL                        pred_notnull("establishmentMeans")
pred_isnull()   column is NULL                            pred_isnull("recordedBy")
pred_and()      a logical and of predicate functions      pred_and(pred_lte("elevation", 5000), pred("taxonKey", 212))
pred_or()       a logical or of predicate functions       pred_or(pred_gt("elevation", 1000), pred_isnull("elevation"))
pred_in()       key matches any of the given values       pred_in("taxonKey", c(2977832, 2977901, 2977966))
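Predicates are ordinary R objects, so they can be built and inspected before anything is sent to GBIF. The sketch below reuses two examples from the table; the variable names are made up for illustration.

```r
library(rgbif)

# Build filters without contacting GBIF.
elevation_and_taxon <- pred_and(
  pred_lte("elevation", 5000),
  pred("taxonKey", 212)
)

high_or_unknown <- pred_or(
  pred_gt("elevation", 1000),
  pred_isnull("elevation")
)

# Printing a predicate shows how its parts are combined.
elevation_and_taxon
high_or_unknown
```

Inspecting a filter this way is a cheap check that the logic reads as intended before committing to a download.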

A Very Simple Download

You must set up your GBIF credentials before you can make downloads from GBIF. I suggest that you follow this short tutorial before continuing.
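As a rough check that credentials are in place, the sketch below looks for the environment variables GBIF_USER, GBIF_PWD and GBIF_EMAIL, which rgbif reads when no credentials are passed explicitly (they are typically stored in your .Renviron file). The helper name missing_gbif_credentials() is hypothetical and not part of rgbif.

```r
# Hypothetical helper: report which GBIF credential variables are unset.
missing_gbif_credentials <- function() {
  vars <- c("GBIF_USER", "GBIF_PWD", "GBIF_EMAIL")
  values <- Sys.getenv(vars, unset = "")
  vars[values == ""]
}

missing <- missing_gbif_credentials()
if (length(missing) > 0) {
  message("Not set: ", paste(missing, collapse = ", "))
} else {
  message("All GBIF credentials are set.")
}
```

If any variable is reported as not set, add it to .Renviron and restart R before requesting a download.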

The following will download all occurrences of Lepus saxatilis. You can use name_backbone("Lepus saxatilis") to find the taxonKey (usageKey).
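A minimal sketch of the name lookup mentioned above, assuming the result contains a usageKey column as the text implies:

```r
library(rgbif)

# Match the name against the GBIF backbone taxonomy.
backbone_match <- name_backbone("Lepus saxatilis")

# The key to pass to pred("taxonKey", ...).
backbone_match$usageKey
```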

# remember to set up your GBIF credentials
occ_download(pred("taxonKey", 2436775),format = "SIMPLE_CSV")
<<gbif download>>
  Your download is being processed by GBIF:
  https://www.gbif.org/occurrence/download/0079311-210914110416597
  Most downloads finish within 15 min.
  Check status with
  occ_download_wait('0079311-210914110416597')
  After it finishes, use
  d <- occ_download_get('0079311-210914110416597') %>%
    occ_download_import()
  to retrieve your download.
Download Info:
  Username: jwaller
  E-mail: jwaller@gbif.org
  Format: SIMPLE_CSV
  Download key: 0079311-210914110416597
  Created: 2021-12-14T13:02:09.610+00:00
Citation Info:  
  Please always cite the download DOI when using this data.
  https://www.gbif.org/citation-guidelines
  DOI: 10.15468/dl.dqp6a3
  Citation:
  GBIF Occurrence Download https://doi.org/10.15468/dl.dqp6a3 Accessed from R via rgbif (https://github.com/ropensci/rgbif) on 2021-12-14

The print out tells us that we can wait for the download to finish with occ_download_wait(). Most downloads under 100K records run very quickly. You can also check the status of a download on your GBIF user page.

occ_download_wait('0079311-210914110416597') # checks if download is finished

The print out also tells you that you can retrieve this download using occ_download_get() and occ_download_import().

d <- occ_download_get('0079311-210914110416597') %>%
  occ_download_import()

It is also possible to save your download request into an object and pass that object to occ_download_wait() and occ_download_get().

gbif_download <- occ_download(pred("taxonKey", 2436775),format = "SIMPLE_CSV")

occ_download_wait(gbif_download)

d <- occ_download_get(gbif_download) %>%
  occ_download_import()

Note that the citation appears in the print out. This is what you would cite if you used this download in a research paper. Please also see GBIF’s citation guidelines when using GBIF mediated data.

GBIF Occurrence Download https://doi.org/10.15468/dl.dqp6a3 Accessed from R via rgbif (https://github.com/ropensci/rgbif) on 2021-12-14

You could also get this citation by running gbif_citation() or checking your user page.

gbif_citation('0079311-210914110416597')
# or
# gbif_citation(gbif_download)

A More Realistic Download

Typically GBIF downloads follow a particular pattern, and the same filters are used again and again. These are some common filters that you should probably be using.

occ_download(
  pred("hasGeospatialIssue", FALSE),
  pred("hasCoordinate", TRUE),
  pred("occurrenceStatus", "PRESENT"),
  pred_not(pred_in("basisOfRecord", c("FOSSIL_SPECIMEN", "LIVING_SPECIMEN"))),
  pred("taxonKey", 2436775),
  format = "SIMPLE_CSV"
)

This download will …

  • Remove default geospatial issues.
  • Keep only records with coordinates.
  • Remove absent records.
  • Remove fossils and living specimens.
  • Retrieve all Lepus saxatilis.

The code above is commonly used, but pretty long. This is why pred_default() was created.

# shorter equivalent to download above
occ_download(
  pred_default(),
  pred("taxonKey", 2436775),
  format = "SIMPLE_CSV"
)
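To see which filters pred_default() bundles, it should be enough to print the object it returns. This is a small sketch; the printed form depends on your installed rgbif version.

```r
library(rgbif)

# Build the default filter set without sending a request, then inspect it.
default_filter <- pred_default()
default_filter
```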

Long species list downloads

Another common download pattern is long species list downloads. There is a tutorial about downloading from a long list of species here.
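One way such a species-list download could be assembled is sketched below. It assumes that name_backbone_checklist() is available in your installed rgbif version and that its result has a usageKey column; the three species names are arbitrary examples and not taken from the linked tutorial.

```r
library(rgbif)

# Arbitrary example names; replace with your own list.
species <- c("Lepus saxatilis", "Lepus capensis", "Lepus europaeus")

# Match every name against the GBIF backbone in one call.
matches <- name_backbone_checklist(species)

# Keep only the keys of names that were matched.
keys <- matches$usageKey[!is.na(matches$usageKey)]

# Request all occurrences for the matched taxa (credentials required).
occ_download(
  pred_in("taxonKey", keys),
  pred_default(),
  format = "SIMPLE_CSV"
)
```

Reviewing the matches before downloading is worthwhile, since a misspelled or ambiguous name can silently drop out of the key list.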

A Complex Download For Illustration

Here I make an overly complex download to highlight some of the capabilities of occ_download(). Most useful downloads are much simpler.

occ_download(
  type = "and",
  pred("taxonKey", 2436775),
  pred("hasGeospatialIssue", FALSE),
  pred("hasCoordinate", TRUE),
  pred("occurrenceStatus", "PRESENT"),
  pred_gte("year", 1900),
  pred_not(pred_in("basisOfRecord", c("FOSSIL_SPECIMEN", "LIVING_SPECIMEN"))),
  pred_or(
    pred("country", "ZA"),
    pred("gadm", "ETH")
  ),
  pred_or(
    pred_not(pred_in("establishmentMeans", c("MANAGED", "INTRODUCED"))),
    pred_isnull("establishmentMeans")
  ),
  pred_or(
    pred_lt("coordinateUncertaintyInMeters", 10000),
    pred_isnull("coordinateUncertaintyInMeters")
  ),
  format = "SIMPLE_CSV"
)

This download will …

  • pred("taxonKey", 2436775) : all Lepus saxatilis records
  • pred("hasGeospatialIssue", FALSE) : remove default geospatial issues.
  • pred("hasCoordinate", TRUE) : keep only records with coordinates.
  • pred("occurrenceStatus","PRESENT") : remove absent records.
  • pred_not(pred_in("basisOfRecord",c("FOSSIL_SPECIMEN","LIVING_SPECIMEN"))) : remove fossils and living specimens.
  • pred_gte("year", 1900) : keep records from the year 1900 or later.
  • pred_or(pred("country","ZA"),pred("gadm","ETH")) : keep records in South Africa or Ethiopia, using separate polygon systems. See enumeration_country() for country codes.
  • pred_or(pred_not(pred_in("establishmentMeans",c("MANAGED","INTRODUCED"))),pred_isnull("establishmentMeans")) : the establishmentMeans column is not managed or introduced, but can be left blank.
  • pred_or(pred_lt("coordinateUncertaintyInMeters",10000),pred_isnull("coordinateUncertaintyInMeters")) : coordinateUncertaintyInMeters is less than 10,000 meters or is left blank.
  • format = "SIMPLE_CSV" : return just a tsv file of occurrences.

Not Downloads

Another sometimes useful pattern is downloading all occurrences except some group. Birds make up a large portion of GBIF occurrences. If you wanted to download everything but birds, you could use pred_not().

# name_backbone("Aves")
occ_download(pred_not(pred("taxonKey", 212)), format = "SIMPLE_CSV")

Big Polygon Downloads

Sometimes users will want to download records using a large polygon. It is worth noting that many land-based polygons can be captured using the gadm filter. Here I will download all occurrences within this biodiversity hotspot known as Wallacea.

A polygon may contain a maximum of 10,000 points, but in practice this number might be lower depending on the complexity of the polygon. You also have to make sure the points of your polygon are listed in “anticlockwise” order. See the downloads documentation.


# Simple code to go from shapefile to WKT
# large_wkt <- sf::st_read("large_shapefile") %>% 
# sf::st_geometry() %>% 
# sf::st_as_text()

large_wkt <- "POLYGON ((127.0171 4.9391, 124.5973 4.7960, 121.7968 3.7617,
119.0816 3.0776, 119.1999 0.5229, 117.3936 -5.1010, 116.4971 -6.7425,
115.9096 -8.2031, 115.5687 -9.9150, 117.2358 -10.0975, 120.9361 -11.4096,
122.5775 -11.8123, 123.5516 -11.8544, 125.5775 -11.2832, 128.6224 -9.7196,
131.1873 -9.1914, 132.1547 -8.3925, 133.4920 -6.4151, 133.6129 -5.8375,
133.5079 -5.1369, 133.1861 -4.7011, 131.4894 -3.3231, 129.8271 -2.4649, 
129.3679 -2.0044, 129.1699 -1.1486, 129.7026 -0.2859, 129.7691 0.2902, 
129.4364 2.4420, 128.9881 3.3626, 128.3585 4.1683, 127.7041 4.6918,
127.0171 4.9391))" 

occ_download(pred_within(large_wkt), format = "SIMPLE_CSV")
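The anticlockwise requirement can be checked before submitting a polygon. The sketch below uses the shoelace formula on plain longitude and latitude vectors; ring_is_anticlockwise() is a hypothetical helper, not part of rgbif, and it treats coordinates as planar, which is only an approximation for very large polygons or ones crossing the antimeridian.

```r
# Hypothetical helper: TRUE if a closed ring is listed anticlockwise.
# x, y: vertex coordinates, with the first vertex repeated at the end.
ring_is_anticlockwise <- function(x, y) {
  n <- length(x)
  # Shoelace formula: a positive signed area means anticlockwise order.
  signed_area <- sum(x[-n] * y[-1] - x[-1] * y[-n]) / 2
  signed_area > 0
}

# A unit square listed anticlockwise, then the same square reversed.
square_x <- c(0, 1, 1, 0, 0)
square_y <- c(0, 0, 1, 1, 0)
ring_is_anticlockwise(square_x, square_y)
ring_is_anticlockwise(rev(square_x), rev(square_y))
```

If a ring turns out to be clockwise, reversing the order of its vertices with rev() before writing the WKT string should fix it.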

Downloading verbatim DWCA extensions

Additional Darwin Core extension data can also be included in a DWCA download. These data tables are not processed by GBIF. They are as-published.

The extension tables available for download are provided using occ_download_describe("dwca")$verbatimExtensions. They can be requested by adding a verbatim_extensions() expression to the occ_download() request. The format has to be “DWCA” for verbatim_extensions() to work.
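A short sketch of listing the available extensions with the call named above (this contacts GBIF, so it needs an internet connection):

```r
library(rgbif)

# Describe the DWCA download format and list its verbatim extensions.
dwca_description <- occ_download_describe("dwca")
dwca_description$verbatimExtensions
```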

occ_download(
  pred("country", "US"),
  verbatim_extensions(
    "http://rs.tdwg.org/chrono/terms/ChronometricAge",
    "http://rs.gbif.org/terms/1.0/DNADerivedData"
  ),
  format = "DWCA"
)

Filter Country Centroids

Sometimes GBIF data publishers will not know the exact lat-lon location of a record and will enter the lat-lon center of the country instead. This is a data issue because users might be unaware that an observation is pinned to a country center and assume it is a precise location.

It is possible to filter out country/area centroids in a download using the distanceFromCentroidInMeters filter.

# download occurrences that are at least 2km from a centroid in Sweden
occ_download(
  pred_gte("distanceFromCentroidInMeters", "2000"),
  pred("country", "SE"),
  format = "SIMPLE_CSV"
)

GBIF currently uses only PCLI level centroids from the catalogue of centroids.

Data Quality

GBIF is a large data aggregator. It mediates occurrence records from a large variety of sources:

  • Museums
  • eDNA
  • Citizen Science Apps
  • Ecological Surveys
  • Camera Traps
  • Satellite Tracking
  • Herbaria
  • Paleontology
  • Research Projects

For this reason, not all of the occurrences from GBIF are “fit for use”: some will not be suitable for a particular purpose or project. Some data-quality issues are so well understood that there are automated ways to detect and remove them from a dataset:

  • Country Centroids
  • Living Specimens
  • Fossils
  • Uncertain Records
  • Country Coordinate Mismatch
  • Zero-Zero Coordinate
  • Any-Zero Coordinates
  • Gridded Datasets

Since rgbif is not a data cleaning package, please see the following resources for post-processing your occurrence downloads: