Clean Biological Occurrence Records

Clean using the following use cases (checkmarks indicate that functions exist, not necessarily that they are complete):

  • [x] Impossible lat/long values: e.g., a latitude of 95 (valid latitudes range from -90 to 90)
  • [x] Incomplete cases: one or the other of lat/long missing
  • [x] Unlikely lat/long values: e.g., points at 0,0
  • [x] Deduplication: try to identify duplicates, especially when pulling data from multiple sources, e.g., by using occurrence IDs, if provided
  • [x] Date based cleaning
  • [x] Outside political boundary: User input to check for points in the wrong country, or points outside of a known country
  • [x] Taxonomic name based cleaning: via taxize (one method so far)
  • Political centroids: occurrences are unlikely to fall exactly on these points; such coordinates are more likely a default position (draft function started, but not exported, and commented out), see issue #6
  • Herbaria/Museums: many specimens may carry the location of the collection they are housed in rather than where they were collected, see issue #20
  • Habitat type filtering: e.g., fish should not be on land; marine fish should not be in fresh water
  • Check for contextually wrong values: That is, if 99 out of 100 lat/long coordinates are within the continental US, but 1 is in China, then perhaps something is wrong with that one point
  • Collector/recorder names: see issue #19

A note about examples: We think that a piping workflow with %>% makes code easier to build up and easier to understand. However, some examples omit the pipe to demonstrate traditional usage.
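
To make the two styles concrete, here is a minimal sketch of the same cleaning step written with and without the pipe. It assumes that scrubr is installed and that loading it makes %>% available, as the piped examples in this README imply; the variable names piped and nested are ours, for illustration only.

```r
library(scrubr)

# Piped style: read left to right, one step per call
piped <- dframe(sample_data_1) %>% coord_incomplete()

# Traditional style: the same call, nested
nested <- coord_incomplete(dframe(sample_data_1))

# Both forms should keep the same number of records
NROW(piped) == NROW(nested)
```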


Stable CRAN version
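
A minimal sketch of installing the released version, assuming the package is still distributed on CRAN under the name scrubr. The requireNamespace() guard is our addition so that the snippet does not reinstall a package that is already present.

```r
# Install from CRAN only if scrubr is not already installed
# (assumption: the package is available on CRAN as "scrubr")
if (!requireNamespace("scrubr", quietly = TRUE)) {
  install.packages("scrubr")
}

library(scrubr)
```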

Development version
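
A minimal sketch of installing the development version with the remotes package. The GitHub location ropensci/scrubr is an assumption on our part; adjust it if the repository lives elsewhere.

```r
# remotes provides install_github(); install it first if needed
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# Assumption: development sources are hosted at ropensci/scrubr
remotes::install_github("ropensci/scrubr")

library(scrubr)
```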


Coordinate based cleaning


Remove impossible coordinates (using sample data included in the pkg)
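
A short sketch of this step, assuming coord_impossible() can find the latitude and longitude columns of sample_data_1 on its own. The variable name possible is ours.

```r
library(scrubr)

# Drop records whose coordinates cannot exist on the globe
possible <- dframe(sample_data_1) %>% coord_impossible()

# Compare record counts before and after
NROW(sample_data_1)
NROW(possible)
```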

Remove incomplete coordinates
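
A short sketch of this step, using the same sample data and coord_incomplete() with its default of dropping flagged records. The variable name complete is ours.

```r
library(scrubr)

# Drop records where latitude, longitude, or both are missing
complete <- dframe(sample_data_1) %>% coord_incomplete()

# Compare record counts before and after
NROW(sample_data_1)
NROW(complete)
```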

Remove unlikely coordinates (e.g., those at 0,0)
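
A short sketch of this step, assuming coord_unlikely() with its default arguments drops the flagged records. The variable name likely is ours.

```r
library(scrubr)

# Drop records sitting at suspicious positions such as 0,0
likely <- dframe(sample_data_1) %>% coord_unlikely()

# Compare record counts before and after
NROW(sample_data_1)
NROW(likely)
```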

Do all three
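
The three coordinate checks can be chained, each one receiving the output of the previous one. This is a sketch with default arguments throughout; the variable name cleaned is ours.

```r
library(scrubr)

# Apply the impossible, incomplete, and unlikely checks in sequence
cleaned <- dframe(sample_data_1) %>%
  coord_impossible() %>%
  coord_incomplete() %>%
  coord_unlikely()

# Number of records that pass all three checks
NROW(cleaned)
```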

Don’t drop bad data

dframe(sample_data_1) %>% coord_incomplete(drop = TRUE) %>% NROW
#> [1] 1306
dframe(sample_data_1) %>% coord_incomplete(drop = FALSE) %>% NROW
#> [1] 1500


Standardize/convert dates
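
A sketch of converting dates to a single format with date_standardize(). We assume the sample data stores its dates in a column named date, and the format string "%d%b%Y" is an arbitrary choice for illustration.

```r
library(scrubr)

# Rewrite the date column in day-month-year form
# (assumption: the dates live in a column called "date")
standardized <- dframe(sample_data_1) %>%
  date_standardize(format = "%d%b%Y")

head(standardized$date)
```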

Drop records without dates
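
A sketch of this step with date_missing(), again assuming the dates are in a column named date and that the default behavior is to drop flagged records. The variable name with_dates is ours.

```r
library(scrubr)

# Drop records that have no date
with_dates <- dframe(sample_data_1) %>% date_missing()

# Compare record counts before and after
NROW(sample_data_1)
NROW(with_dates)
```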

Create date field from other fields
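
A sketch of building a date field with date_create(). It assumes the bundled sample_data_2 dataset has separate year, month, and day columns, and that the result is written to a column named date.

```r
library(scrubr)

# Combine the year, month, and day columns into a single date field
# (assumption: sample_data_2 has columns named year, month, and day)
created <- dframe(sample_data_2) %>% date_create(year, month, day)

head(created$date)
```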


Filter by FAO areas

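# Bounding polygon (WKT) used to restrict the GBIF occurrence search
wkt <- 'POLYGON((72.2 38.5,-173.6 38.5,-173.6 -41.5,72.2 -41.5,72.2 38.5))'
# Look up the GBIF backbone key for the species, then fetch occurrences
manta_ray <- rgbif::name_backbone("Mobula alfredi")$usageKey
res <- rgbif::occ_data(manta_ray, geometry = wkt, limit = 300, hasCoordinate = TRUE)
# Convert all returned records to an sf object in WGS84 (EPSG:4326)
dat <- sf::st_as_sf(res$data, coords = c("decimalLongitude", "decimalLatitude"))
dat <- sf::st_set_crs(dat, 4326)
# Filter the records by the FAO area "OCEAN:Indian"
tmp <- eco_region(dframe(res$data), dataset = "fao", region = "OCEAN:Indian")
# Remove rows without a longitude before converting to sf
tmp <- tmp[!is.na(tmp$decimalLongitude), ]
tmp2 <- sf::st_as_sf(tmp, coords = c("decimalLongitude", "decimalLatitude"))
tmp2 <- sf::st_set_crs(tmp2, 4326)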