eBird is an online tool for recording bird observations. Since its inception, nearly 500 million records of bird sightings (i.e. combinations of location, date, time, and bird species) have been collected, making eBird one of the largest citizen science projects in history and an extremely valuable resource for bird research and conservation. The full eBird database is packaged as a text file and available for download as the eBird Basic Dataset (EBD). Due to the large size of this dataset, it must be filtered to a smaller subset of desired observations before reading into R. This filtering is most efficiently done using AWK, a Unix utility and programming language for processing column formatted text data. This package acts as a front end for AWK, allowing users to filter eBird data before import into R.
This vignette is divided into three sections. The first section provides background on the eBird data and motivation for the development of this package. The second section outlines the use of auk for filtering the EBD text file to produce a presence-only dataset. The final section demonstrates how auk can be used to produce zero-filled, presence-absence (or more correctly presence–non-detection) data, a necessity for many modeling and analysis applications.
Quick start
This package uses the command-line program AWK to extract subsets of the eBird Basic Dataset for use in R. This is a multi-step process:
- Define a reference to the eBird data file.
- Define a set of spatial, temporal, or taxonomic filters. Each type of filter corresponds to a different function, e.g. auk_species() to filter by species. At this stage the filters are only set up; no actual filtering is done until the next step.
- Filter the eBird data text file, producing a new text file with only the selected rows.
- Import this text file into R as a data frame.
Because the eBird dataset is so large, the third step (filtering) typically takes several hours to run. Here’s a simple example that extracts all Canada Jay records from within Canada.
library(auk)
# path to the ebird data file, here a sample included in the package
# in practice, provide path to ebd, e.g. input_file <- "data/ebd_relFeb-2018.txt"
input_file <- system.file("extdata/ebd-sample.txt", package = "auk")
# output text file
output_file <- "ebd_filtered_grja.txt"
ebird_data <- input_file %>%
# 1. reference file
auk_ebd() %>%
# 2. define filters
auk_species(species = "Canada Jay") %>%
auk_country(country = "Canada") %>%
# 3. run filtering
auk_filter(file = output_file) %>%
# 4. read text file into r data frame
read_ebd()
For those not familiar with the pipe operator (%>%), the above code could be rewritten:
input_file <- system.file("extdata/ebd-sample.txt", package = "auk")
output_file <- "ebd_filtered_grja.txt"
ebd <- auk_ebd(input_file)
ebd_filters <- auk_species(ebd, species = "Canada Jay")
ebd_filters <- auk_country(ebd_filters, country = "Canada")
ebd_filtered <- auk_filter(ebd_filters, file = output_file)
ebd_df <- read_ebd(ebd_filtered)
Background
The eBird Basic Dataset
The eBird database currently contains nearly 500 million bird observations, and the rate at which new observations are added is accelerating as new users join eBird. These data are an extremely valuable tool both for basic science and conservation; however, given the sheer amount of data, accessing eBird data poses a unique challenge. Currently, access to the complete set of eBird observations is provided via the eBird Basic Dataset (EBD). This is a tab-separated text file, released quarterly, containing all validated bird sightings in the eBird database at the time of release. Each row corresponds to the sighting of a single species within a checklist and, in addition to the species and number of individuals reported, information is provided at the checklist level (location, time, date, search effort, etc.).
In addition, eBird provides a Sampling Event Data file that contains the checklist-level data for every valid checklist submitted to eBird, including checklists for which no species of birds were reported. In this file, each row corresponds to a checklist and only the checklist-level variables are included, not the associated bird data. While the eBird Basic Dataset provides presence-only data, it can be combined with the Sampling Event Data file to produce presence-absence data. This process is described below.
For full metadata on both datasets, consult the documentation provided when the files are downloaded.
auk vs. rebird
Those interested in eBird data may also want to consider rebird, an R package that provides an interface to the eBird APIs. The functions in rebird are mostly limited to accessing recent (i.e. within the last 30 days) observations, although ebirdfreq() does provide historical frequency of observation data. In contrast, auk gives access to the full set of ~500 million eBird observations. For most ecological applications, users will require auk; however, for some use cases, e.g. building tools for birders, rebird provides a quick and easy way to access data.
Data access
To access eBird data, begin by creating an eBird account and signing in. Then visit the Download Data page. eBird data access is free; however, you will need to request access to the EBD before you can download it. Filling out the access request form allows eBird to keep track of the number of people using the data and obtain information on the applications for which the data are used.
Once you have access to the data, proceed to the download page. There are two download options: prepackaged download and custom download. Downloading the prepackaged option gives you access to the full global dataset. If you choose this route, you’ll likely want to download both the EBD (~ 25 GB) and corresponding Sampling Event Data (~ 2.5 GB). If you know you’re likely to only need data for a single species, or a small region, you can request a custom download be prepared consisting of only a subset of the data. This will result in significantly smaller files; however, note that custom requests that would result in huge numbers of checklists (e.g. all records from the US) won’t work. In either case, download and decompress the files.
Example data
This package comes with two example datasets. The first is suitable for practicing filtering the EBD and producing presence-only data. It’s a sample of 400 records from the EBD. It contains data from North and Central America from 2010-2012 on 3 jay species: Canada Jay, Blue Jay, and Green Jay. It can be accessed with:
library(auk)
library(dplyr)
system.file("extdata/ebd-sample.txt", package = "auk")
The second is suitable for producing zero-filled, presence-absence data. It contains every sighting from Singapore in the first half of 2012 of Collared Kingfisher, White-throated Kingfisher, and Blue-eared Kingfisher. The full Sampling Event Data file is also included, and contains all checklists from Singapore in the first half of 2012. These files can be accessed with:
# ebd
system.file("extdata/zerofill-ex_ebd.txt", package = "auk")
# sampling event data
system.file("extdata/zerofill-ex_sampling.txt", package = "auk")
Important note: in this vignette, system.file() is used to return the path to the example data included in this package. When using auk in practice, provide the path to the location of the EBD on your computer. This could be a relative path, e.g. "data/ebd_relFeb-2018.txt", or an absolute path, e.g. "~/ebird/ebd_relFeb-2018/ebd_relFeb-2018.txt".
AWK
R typically works with objects in memory and, as a result, there is a hard limit on the size of objects that can be brought into R. Because eBird contains nearly 500 million sightings, the eBird Basic Dataset is an inherently large file (~150 GB uncompressed) and therefore impossible to manipulate directly in R. Thus it is generally necessary to create a subset of the file outside of R, then import this smaller subset for analysis.
AWK is a Unix utility and programming language for processing column formatted text data. It is highly flexible and extremely fast, making it a valuable tool for pre-processing the eBird data. Users of the data can use AWK to subset the full text file taxonomically, spatially, or temporally, producing a smaller file that can then be loaded into R for visualization, analysis, and modelling.
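To make the idea of filtering outside of memory concrete, here is a minimal base R sketch that streams a tab-separated file in small chunks and writes only the matching rows to a new file. It is not how auk works internally (auk hands this job to AWK, which is far faster); the toy file, its column names, and the helper filter_stream() are all made up for illustration.

```r
# hypothetical helper: stream a tab-separated file, keeping only matching rows
filter_stream <- function(infile, outfile, column, values, chunk_size = 2) {
  con_in <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({
    close(con_in)
    close(con_out)
  })
  # the header is always kept, and tells us which field to test
  header <- readLines(con_in, n = 1)
  writeLines(header, con_out)
  col_idx <- match(column, strsplit(header, "\t", fixed = TRUE)[[1]])
  stopifnot(!is.na(col_idx))
  repeat {
    # only chunk_size rows are ever held in memory at once
    chunk <- readLines(con_in, n = chunk_size)
    if (length(chunk) == 0) break
    fields <- strsplit(chunk, "\t", fixed = TRUE)
    keep <- vapply(fields, function(f) f[col_idx] %in% values, logical(1))
    writeLines(chunk[keep], con_out)
  }
  invisible(outfile)
}

# toy stand-in for the EBD (made-up rows, not real eBird data)
toy_in <- tempfile(fileext = ".txt")
toy_out <- tempfile(fileext = ".txt")
writeLines(c(
  "COMMON NAME\tCOUNTRY CODE\tOBSERVATION COUNT",
  "Canada Jay\tCA\t2",
  "Blue Jay\tUS\t1",
  "Green Jay\tMX\t3",
  "Canada Jay\tUS\t1",
  "Blue Jay\tCA\t4"
), toy_in)

filter_stream(toy_in, toy_out, column = "COMMON NAME", values = "Canada Jay")
filtered <- read.delim(toy_out, quote = "", stringsAsFactors = FALSE)
filtered
```

The point of the sketch is only that the full file never needs to be loaded: rows are tested one chunk at a time and the much smaller result is what gets imported.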
Although AWK is a powerful tool, it has three disadvantages: it requires learning the syntax of a new language, it is only accessible via the command line, and it results in a portion of your workflow existing outside of R. This package is a wrapper for AWK specifically designed for filtering eBird data. The goal is to ease the use of these data by removing the hurdle of learning and using AWK.
Linux and Mac users should already have AWK installed on their machines; however, Windows users will need to install Cygwin to gain access to AWK. Note that Cygwin should be installed in the default location (C:/cygwin/bin/gawk.exe or C:/cygwin64/bin/gawk.exe) in order for auk to work. To check that AWK is installed and can be found, run auk_get_awk_path().
If AWK is installed in a non-standard location, or can’t be found by auk, you can manually set the path to AWK. To do so, set the AWK_PATH environment variable in your .Renviron file. For example, Mac and Linux users might add the following line:
AWK_PATH=/usr/bin/awk
while Windows users might add:
AWK_PATH=C:/cygwin64/bin/gawk.exe
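As a quick sanity check after editing .Renviron, the following base R sketch looks for an explicit AWK_PATH setting and otherwise falls back to whatever awk is on the system search path. This is only an illustration of the lookup idea, not a reproduction of auk's own search logic. Remember that .Renviron is read when R starts, so a restart is needed before a new value shows up.

```r
# look for an explicit setting first, then fall back to the system PATH
awk_path <- Sys.getenv("AWK_PATH", unset = NA)
if (is.na(awk_path) || !nzchar(awk_path)) {
  # Sys.which() returns "" when the program cannot be found
  awk_path <- unname(Sys.which("awk"))
}

if (nzchar(awk_path) && file.exists(awk_path)) {
  message("AWK found at: ", awk_path)
} else {
  message("AWK not found; set AWK_PATH in .Renviron and restart R")
}
```

For a one-off session, Sys.setenv(AWK_PATH = "/usr/bin/awk") sets the variable without touching .Renviron; the path shown is just the example from above and may differ on your machine.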
A note on versions
This package contains a current (as of the time of package release) version of the bird taxonomy used by eBird. This taxonomy determines the species that can be reported in eBird and therefore the species that users of auk can extract from the EBD. eBird releases an updated taxonomy once a year, typically in August, at which time auk will be updated to include the current taxonomy. When using auk, users should be careful to ensure that the version they’re using is in sync with the EBD file they’re working with. This is most easily accomplished by always using the most recent version of auk and the most recent release of the eBird Basic Dataset.
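One way to check which versions you are working with is sketched below. It assumes the installed release of auk exports auk_version(), which reports the package version together with the EBD and taxonomy versions it was built for; the requireNamespace() guard is only there so the snippet also runs on a machine without auk.

```r
# auk_version() is assumed to be available in the installed release of auk
if (requireNamespace("auk", quietly = TRUE)) {
  versions <- auk::auk_version()
  print(versions)
} else {
  versions <- NULL
  message("auk is not installed, so no version information is available")
}
```

Comparing the reported EBD version against the release date in your EBD filename (e.g. the relFeb-2018 part of ebd_relFeb-2018.txt) is a simple way to spot a mismatch.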
Presence data
The most common use of the eBird data is to produce a set of bird sightings, i.e. where and when was a given species seen. For example, this type of data could be used to produce a map of sighting locations, or to determine if a given bird has been seen in an area of interest. For more analytic work, such as species distribution modeling, presence and absence data are likely preferred (see Guillera-Arroita et al. 2015). Producing presence-absence data will be covered in the next section.
The auk_ebd object
This package uses an auk_ebd object to keep track of the input data file, any filters defined, and the output file that is produced after filtering has been executed. By keeping everything wrapped up in one object, the user can keep track of exactly what set of input data and filters produced any given output data. To set up the initial auk_ebd object, use auk_ebd():
ebd <- system.file("extdata/ebd-sample.txt", package = "auk") %>%
auk_ebd()
ebd
#> Input
#> EBD: /usr/local/lib/R/site-library/auk/extdata/ebd-sample.txt
#>
#> Output
#> Filters not executed
#>
#> Filters
#> Species: all
#> Countries: all
#> States: all
#> Counties: all
#> BCRs: all
#> Bounding box: full extent
#> Years: all
#> Date: all
#> Start time: all
#> Last edited date: all
#> Protocol: all
#> Project code: all
#> Duration: all
#> Distance travelled: all
#> Records with breeding codes only: no
#> Exotic Codes: all
#> Complete checklists only: no
Defining filters
auk uses a pipeline-based workflow for defining filters, which can then be compiled into an AWK script. Any of the following filters can be applied:
- auk_species(): filter by species using common or scientific names.
- auk_country(): filter by country using the standard English names or ISO 2-letter country codes.
- auk_state(): filter by state using the eBird state codes, see ?ebird_states.
- auk_bcr(): filter by Bird Conservation Region (BCR) using BCR codes, see ?bcr_codes.
- auk_bbox(): filter by spatial bounding box, i.e. a range of latitudes and longitudes in decimal degrees.
- auk_date(): filter to checklists from a range of dates. To extract observations from a range of dates, regardless of year, use the wildcard “*” in place of the year, e.g. date = c("*-05-01", "*-06-30") for observations from May and June of any year.
- auk_last_edited(): filter to checklists from a range of last edited dates, useful for extracting just new or recently edited data.
- auk_protocol(): filter to checklists that follow a specific search protocol, either stationary, traveling, or casual.
- auk_project(): filter to checklists collected as part of a specific project (e.g. a breeding bird survey).
- auk_time(): filter to checklists started during a range of times-of-day.
- auk_duration(): filter to checklists with observation durations within a given range.
- auk_distance(): filter to checklists with distances travelled within a given range.
- auk_breeding(): only retain observations that have an associated breeding bird atlas code.
- auk_complete(): only retain checklists in which the observer has specified that they recorded all species seen or heard. It is necessary to retain only complete records for the creation of presence-absence data, because the “absence” information is inferred by the lack of reporting of a species on checklists.
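To illustrate what the year wildcard in auk_date() means, the base R sketch below tests whether dates fall inside a month-day window regardless of year. The helper in_season() is hypothetical and is not part of auk; auk performs the equivalent comparison inside the generated AWK script. This simple version does not handle windows that wrap around the new year (e.g. December to January).

```r
# hypothetical helper: is each date inside a month-day window, ignoring year?
in_season <- function(dates, start_md, end_md) {
  # zero-padded "MM-DD" strings sort in calendar order
  md <- format(as.Date(dates), "%m-%d")
  md >= start_md & md <= end_md
}

obs_dates <- as.Date(c("2010-05-15", "2011-07-04", "2012-06-30", "2012-04-30"))
keep <- in_season(obs_dates, "05-01", "06-30")
obs_dates[keep]
```

Here the May 2010 and June 2012 dates are retained even though they come from different years, which is the behaviour requested by date = c("*-05-01", "*-06-30").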
Note that all of the functions listed above only modify the auk_ebd object, in order to define the filters. Once the filters have been defined, the filtering is actually conducted using auk_filter().
ebd_filters <- ebd %>%
# species: common and scientific names can be mixed
auk_species(species = c("Canada Jay", "Cyanocitta cristata")) %>%
# country: codes and names can be mixed; case insensitive
auk_country(country = c("US", "Canada", "mexico")) %>%
# bbox: long and lat in decimal degrees
# formatted as `c(lng_min, lat_min, lng_max, lat_max)`
auk_bbox(bbox = c(-100, 37, -80, 52)) %>%
# date: use standard ISO date format `"YYYY-MM-DD"`
auk_date(date = c("2012-01-01", "2012-12-31")) %>%
# time: 24h format
auk_time(start_time = c("06:00", "09:00")) %>%
# duration: length in minutes of checklists
auk_duration(duration = c(0, 60)) %>%
# complete: all species seen or heard are recorded
auk_complete()
ebd_filters
#> Input
#> EBD: /usr/local/lib/R/site-library/auk/extdata/ebd-sample.txt
#>
#> Output
#> Filters not executed
#>
#> Filters
#> Species: Cyanocitta cristata, Perisoreus canadensis
#> Countries: CA, MX, US
#> States: all
#> Counties: all
#> BCRs: all
#> Bounding box: Lon -100 - -80; Lat 37 - 52
#> Years: all
#> Date: 2012-01-01 - 2012-12-31
#> Start time: 06:00-09:00
#> Last edited date: all
#> Protocol: all
#> Project code: all
#> Duration: 0-60 minutes
#> Distance travelled: all
#> Records with breeding codes only: no
#> Exotic Codes: all
#> Complete checklists only: yes
In all cases, extensive checks are performed to ensure filters are valid. For example, species are checked against the official eBird taxonomy and countries are checked using the countrycode package. This is particularly important because filtering is a time consuming process, so catching errors in advance can avoid several hours of wasted time.
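If you want to screen a list of species names yourself before building filters, something like the sketch below may help. It assumes ebird_species() from auk, which looks names up in the packaged taxonomy and returns NA for names it cannot match; the misspelled name is deliberate, and the guard only keeps the snippet runnable without auk installed.

```r
candidates <- c("Canada Jay", "Cyanocitta cristata", "Grey Jayy")

if (requireNamespace("auk", quietly = TRUE)) {
  # ebird_species() is assumed to return NA for names not in the taxonomy
  matched <- auk::ebird_species(candidates)
} else {
  # without auk there is nothing to match against
  matched <- rep(NA_character_, length(candidates))
}

# names that would need fixing before defining a species filter
unmatched <- candidates[is.na(matched)]
unmatched
```

Catching a typo at this stage costs a fraction of a second, compared with discovering it after a multi-hour filtering run.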
Executing filters
Each of the functions described in the Defining filters section only defines a filter. Once all of the required filters have been set, auk_filter() should be used to compile them into an AWK script and execute it to produce an output file. So, as an example of bringing all of these steps together, the following commands will extract all Canada Jay and Blue Jay records from Canada and save the results to a tab-separated text file for subsequent use:
output_file <- "ebd_filtered_blja-grja.txt"
ebd_jays <- system.file("extdata/ebd-sample.txt", package = "auk") %>%
auk_ebd() %>%
auk_species(species = c("Canada Jay", "Cyanocitta cristata")) %>%
auk_country(country = "Canada") %>%
auk_filter(file = output_file)
Filtering the full EBD typically takes at least a couple hours, so set it running then go grab lunch!
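Because a full run is so slow, it can be worth guarding against accidentally repeating it. The sketch below shows one possible pattern in base R: a hypothetical helper run_once() that only calls an expensive step when its output file does not exist yet. A stand-in function plays the role of the filtering step so the example runs in an instant.

```r
# hypothetical helper: only run an expensive step if its output is missing
run_once <- function(path, expensive_step) {
  if (!file.exists(path)) {
    expensive_step(path)
  }
  path
}

# stand-in for the slow filtering step; counts how often it actually runs
calls <- 0
fake_filter <- function(path) {
  calls <<- calls + 1
  writeLines("COMMON NAME\tCOUNTRY CODE", path)
}

out <- file.path(tempdir(), "ebd_filtered_example.txt")
unlink(out)
run_once(out, fake_filter)  # runs the step and creates the file
run_once(out, fake_filter)  # file exists, so the step is skipped
calls
```

In a real workflow the stand-in would be replaced by a function wrapping auk_filter(), e.g. function(p) auk_filter(ebd_filters, file = p), where ebd_filters is the filter object defined earlier.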
Reading
eBird Basic Dataset files can be read with read_ebd(). This is a wrapper around readr::read_delim(). read_ebd() uses stringsAsFactors = FALSE, quote = "", sets column classes, and converts variable names to snake_case.
system.file("extdata/ebd-sample.txt", package = "auk") %>%
read_ebd() %>%
glimpse()
#> Rows: 398
#> Columns: 48
#> $ checklist_id <chr> "G1131664", "G1131665", "G1158137", "G115813…
#> $ global_unique_identifier <chr> "URN:CornellLabOfOrnithology:EBIRD:OBS294400…
#> $ last_edited_date <chr> "2021-03-29 21:21:52.583259", "2020-02-01 20…
#> $ taxonomic_order <dbl> 20724, 20724, 20674, 20674, 20724, 20724, 20…
#> $ category <chr> "species", "species", "species", "species", …
#> $ taxon_concept_id <chr> "avibase-361B447A", "avibase-361B447A", "avi…
#> $ common_name <chr> "Green Jay", "Green Jay", "Canada Jay", "Can…
#> $ scientific_name <chr> "Cyanocorax yncas", "Cyanocorax yncas", "Per…
#> $ exotic_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ observation_count <chr> "2", "6", "1", "1", "X", "3", "4", "3", "6",…
#> $ breeding_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ breeding_category <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ behavior_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ age_sex <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ country <chr> "United States", "United States", "Canada", …
#> $ country_code <chr> "US", "US", "CA", "CA", "MX", "MX", "CA", "U…
#> $ state <chr> "Texas", "Texas", "British Columbia", "Briti…
#> $ state_code <chr> "US-TX", "US-TX", "CA-BC", "CA-BC", "MX-NLE"…
#> $ county <chr> "Zapata", "Starr", "Northern Rockies", "Nort…
#> $ county_code <chr> "US-TX-505", "US-TX-427", "CA-BC-NR", "CA-BC…
#> $ iba_code <chr> NA, NA, NA, NA, "MX_69", NA, NA, NA, NA, NA,…
#> $ bcr_code <int> 36, 36, 6, 6, 48, 36, 13, 10, 36, 48, 48, 8,…
#> $ usfws_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ atlas_block <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ locality <chr> "Zapata Library / City Park (LTC 085)", "Fal…
#> $ locality_id <chr> "L846015", "L128962", "L343808", "L343808", …
#> $ locality_type <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H",…
#> $ latitude <dbl> 26.90220, 26.58361, 58.82617, 58.82617, 25.5…
#> $ longitude <dbl> -99.27106, -99.14529, -122.90187, -122.90187…
#> $ observation_date <date> 2011-11-14, 2011-11-14, 2011-06-14, 2011-06…
#> $ time_observations_started <chr> "06:45:00", "08:15:00", "10:30:00", "07:00:0…
#> $ observer_id <chr> "obsr554038", "obsr146271", "obsr12384", "ob…
#> $ sampling_event_identifier <chr> "S21633922", "S9118288", "S22036612", "S2203…
#> $ protocol_type <chr> "Traveling", "Traveling", "Stationary", "Sta…
#> $ protocol_code <chr> "P22", "P22", "P21", "P21", "P22", "P22", "P…
#> $ project_code <chr> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD",…
#> $ duration_minutes <int> 30, 60, 60, 90, 90, 90, 90, 35, 60, 60, 75, …
#> $ effort_distance_km <dbl> 1.609, 3.219, NA, NA, 1.000, 1.500, NA, 3.21…
#> $ effort_area_ha <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 4.04…
#> $ number_observers <int> 2, 2, 13, 13, 7, 2, 2, 5, 4, 5, 5, 2, 5, 10,…
#> $ all_species_reported <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ group_identifier <chr> "G1131664", "G1131665", "G1158137", "G115813…
#> $ has_media <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
#> $ approved <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ reviewed <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
#> $ reason <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ trip_comments <chr> NA, NA, "BCFO extension trip", "BCFO extensi…
#> $ species_comments <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
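For readers curious about what these import settings accomplish, here is a rough base R approximation on a toy file. It is not the code read_ebd() runs; it only illustrates why quoting is disabled (free-text comments can contain quote characters) and how header names can be converted to snake_case. The toy rows are invented.

```r
# toy tab-separated file with EBD-style headers (made-up rows)
toy <- tempfile(fileext = ".txt")
writeLines(c(
  "COMMON NAME\tOBSERVATION COUNT\tTRIP COMMENTS",
  "Canada Jay\t2\tsaw a \"whiskey jack\" at the feeder",
  "Blue Jay\tX\t"
), toy)

# quote = "" stops quote characters in comments from being treated as
# field delimiters; everything is read as character for simplicity
raw <- read.delim(toy, quote = "", stringsAsFactors = FALSE,
                  check.names = FALSE, na.strings = "",
                  colClasses = "character")

# convert "COMMON NAME" style headers to snake_case
names(raw) <- gsub("[^a-z0-9]+", "_", tolower(names(raw)))
str(raw)
```

In practice read_ebd() should be preferred, since it also assigns appropriate column classes for the real EBD columns.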
auk_filter() returns an auk_ebd object with the output file paths stored in it. This auk_ebd object can then be passed directly to read_ebd(), allowing for a complete pipeline. For example, we can create an auk_ebd object, define filters, filter with AWK, and read in the results all at once.
output_file <- "ebd_filtered_blja-grja.txt"
ebd_df <- system.file("extdata/ebd-sample.txt", package = "auk") %>%
auk_ebd() %>%
auk_species(species = c("Canada Jay", "Cyanocitta cristata")) %>%
auk_country(country = "Canada") %>%
auk_filter(file = output_file) %>%
read_ebd()
Saving the AWK command
The AWK script can be saved for future reference by providing an output filename to awk_file. In addition, by setting execute = FALSE the AWK script will be generated but not run.
awk_script <- system.file("extdata/ebd-sample.txt", package = "auk") %>%
auk_ebd() %>%
auk_species(species = c("Canada Jay", "Cyanocitta cristata")) %>%
auk_country(country = "Canada") %>%
auk_filter(awk_file = "awk-script.txt", execute = FALSE)
# read back in and prepare for printing
awk_file <- readLines(awk_script)
unlink("awk-script.txt")
awk_file[!grepl("^[[:space:]]*$", awk_file)] %>%
paste0(collapse = "\n") %>%
cat()
#> BEGIN {
#> FS = OFS = " "
#> split("Cyanocitta cristata Perisoreus canadensis", speciesValues, " ")
#> for (i in speciesValues) species[speciesValues[i]] = 1
#> split("CA", countryValues, " ")
#> for (i in countryValues) countries[countryValues[i]] = 1
#> }
#> {
#> keep = 1
#> # filters
#> if (keep == 1 && ($7 in species)) {
#> keep = 1
#> } else {
#> keep = 0
#> }
#> if (keep == 1 && ($17 in countries)) {
#> keep = 1
#> } else {
#> keep = 0
#> }
#> # keeps header
#> if (NR == 1) {
#> keep = 1
#> }
#> if (keep == 1) {
#> print $0
#> }
#> }
Group checklists
eBird allows observers birding together to share checklists. This process creates a new copy of the original checklist for each observer with whom the original checklist was shared; these copies can then be tweaked to add or remove species that weren’t seen by the entire group, or to alter the sampling-event data. For most applications, it’s best to remove these duplicate (or near-duplicate) checklists. auk_unique() removes duplicates resulting from group checklists by selecting the observation with the lowest sampling_event_identifier (a unique ID for each checklist); this is the original checklist from which shared copies were generated. In addition to removing duplicates, a checklist_id field is added, which is equal to the sampling_event_identifier for non-group checklists and the group_identifier for grouped checklists. After running auk_unique(), every species will have a single entry for each checklist_id.
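The de-duplication rule described above can be sketched on a toy data frame in base R. This is an illustration of the logic only, under the assumption that "lowest" refers to the numeric part of the sampling event ID; it is not auk_unique()'s actual implementation, and the IDs are invented.

```r
# toy observations: S101 is a shared copy of S100 (both in group G1)
obs <- data.frame(
  sampling_event_identifier = c("S101", "S100", "S102", "S103"),
  group_identifier = c("G1", "G1", NA, NA),
  scientific_name = c("Cyanocitta cristata", "Cyanocitta cristata",
                      "Cyanocitta cristata", "Perisoreus canadensis"),
  stringsAsFactors = FALSE
)

# checklist_id: the group ID for shared checklists, otherwise the
# checklist's own sampling event ID
obs$checklist_id <- ifelse(is.na(obs$group_identifier),
                           obs$sampling_event_identifier,
                           obs$group_identifier)

# sort by the numeric part of the sampling event ID so that the
# lowest ID in each group comes first
id_num <- as.numeric(sub("^S", "", obs$sampling_event_identifier))
obs <- obs[order(id_num), ]

# keep the first record of each species on each checklist
is_dupe <- duplicated(obs[, c("checklist_id", "scientific_name")])
obs_unique <- obs[!is_dupe, ]
obs_unique
```

After this step the shared copy S101 is gone, and each species appears once per checklist_id.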
read_ebd() automatically runs auk_unique(); however, we can use unique = FALSE and then manually run auk_unique().
# read in an ebd file and don't automatically remove duplicates
ebd_dupes <- system.file("extdata/ebd-sample.txt", package = "auk") %>%
read_ebd(unique = FALSE)
# remove duplicates
ebd_unique <- auk_unique(ebd_dupes)
# compare number of rows
nrow(ebd_dupes)
#> [1] 400
nrow(ebd_unique)
#> [1] 398
Taxonomic rollup
The eBird Basic Dataset includes both true species and other taxa, including domestics, hybrids, subspecies, “spuhs”, and recognizable forms. In some cases, a checklist may contain multiple records for the same species, for example, both Audubon’s and Myrtle Yellow-rumped Warblers, as well as some records that are not resolvable to species, for example, “warbler sp.”. For most use cases, a single record for each species on each checklist is desired. The function auk_rollup() addresses these cases by removing taxa not identifiable to species and rolling up taxa identified below species level to a single record for each species in each checklist.
# read in sample data without rolling up
ebd <- system.file("extdata/ebd-rollup-ex.txt", package = "auk") %>%
read_ebd(rollup = FALSE)
# apply roll up
ebd_ru <- auk_rollup(ebd)
# all taxa not identifiable to species are dropped
# taxa below species have been rolled up to species
unique(ebd$category)
#> [1] "domestic" "form" "hybrid" "intergrade" "slash"
#> [6] "spuh" "species" "issf"
unique(ebd_ru$category)
#> [1] "species"
# yellow-rumped warbler subspecies rollup
# without rollup, there are four observations across two checklists
ebd %>%
filter(common_name == "Yellow-rumped Warbler") %>%
select(checklist_id, category, common_name, subspecies_common_name,
observation_count)
#> # A tibble: 4 × 5
#> checklist_id category common_name subspecies_common_name observation_count
#> <chr> <chr> <chr> <chr> <chr>
#> 1 S44943108 intergrade Yellow-rumpe… Yellow-rumped Warbler… 1
#> 2 S129851825 species Yellow-rumpe… NA 1
#> 3 S129851825 issf Yellow-rumpe… Yellow-rumped Warbler… 1
#> 4 S129851825 issf Yellow-rumpe… Yellow-rumped Warbler… 2
# with rollup, they have been combined
ebd_ru %>%
filter(common_name == "Yellow-rumped Warbler") %>%
select(checklist_id, category, common_name, observation_count)
#> # A tibble: 2 × 4
#> checklist_id category common_name observation_count
#> <chr> <chr> <chr> <chr>
#> 1 S129851825 species Yellow-rumped Warbler 4
#> 2 S44943108 species Yellow-rumped Warbler 1
By default, read_ebd() calls auk_rollup() when importing an eBird Basic Dataset file. To avoid this, and retain subspecies, use read_ebd(rollup = FALSE).
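The rollup idea can be sketched on toy data in base R. This is a simplified illustration, not auk_rollup() itself: the set of categories treated as unresolvable, and the rule that any "X" count makes the combined count "X", are assumptions made for this example, and the rows are invented.

```r
# toy observations (made-up); counts are character because "X" is allowed
obs <- data.frame(
  checklist_id = c("S1", "S1", "S1", "S1", "S2"),
  category = c("species", "issf", "issf", "spuh", "intergrade"),
  common_name = c(rep("Yellow-rumped Warbler", 3), "warbler sp.",
                  "Yellow-rumped Warbler"),
  observation_count = c("1", "1", "2", "3", "X"),
  stringsAsFactors = FALSE
)

# assumption for this sketch: these categories cannot be resolved to species
unresolved <- c("spuh", "slash")
kept <- obs[!obs$category %in% unresolved, ]

# assumption for this sketch: any "X" makes the combined count "X"
sum_counts <- function(x) {
  if (any(x == "X")) "X" else as.character(sum(as.integer(x)))
}

# one record per species per checklist, with counts combined
key <- paste(kept$checklist_id, kept$common_name, sep = "|")
rolled <- kept[!duplicated(key), c("checklist_id", "common_name")]
counts_by_key <- split(kept$observation_count, key)
rolled$observation_count <- unname(
  vapply(counts_by_key[unique(key)], sum_counts, character(1))
)
rolled$category <- "species"
rolled
```

The three Yellow-rumped Warbler records on checklist S1 collapse to a single record with the counts added together, mirroring the rolled-up output shown above.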
Zero-filled, presence-absence data
For many applications, presence-only data are sufficient; however, for modeling and analysis, presence-absence data are required. eBird observers only explicitly collect presence data, but they have the option of flagging their checklist as “complete” meaning that they are reporting all the species they saw or heard, and identified. Therefore, given a list of positive sightings (the basic dataset) and a list of all checklists (the sampling event data) it is possible to infer absences by filling zeros for all species not explicitly reported. This section of the vignette describes functions for producing zero-filled, presence-absence data.
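Before turning to the auk functions, the inference described above can be sketched in base R on toy data for a single species. This is a conceptual illustration rather than the package's implementation, and the checklists and counts are invented.

```r
# all checklists (the role played by the sampling event data)
checklists <- data.frame(
  checklist_id = c("S1", "S2", "S3", "S4"),
  all_species_reported = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
# positive sightings of one species (the role played by the EBD)
sightings <- data.frame(
  checklist_id = c("S1", "S3"),
  observation_count = c("2", "1"),
  stringsAsFactors = FALSE
)

# only complete checklists can support inferred non-detections
complete <- checklists[checklists$all_species_reported, , drop = FALSE]

# left join: every complete checklist is kept, sightings where available
zf <- merge(complete, sightings, by = "checklist_id", all.x = TRUE)

# checklists with no matching sighting become zero counts
zf$species_observed <- !is.na(zf$observation_count)
zf$observation_count[!zf$species_observed] <- "0"
zf$scientific_name <- "Todiramphus chloris"
zf
```

Checklist S3 is dropped because it is incomplete: a species missing from an incomplete checklist may simply not have been recorded, so no absence can be inferred from it.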
Filtering
When preparing to create zero-filled data, the eBird Basic Dataset and sampling event data must be filtered to the same set of checklists to ensure consistency. To ensure these two datasets are synced, provide both to auk_ebd(), then filter as described in the previous section. This will ensure that all the filters applied to the EBD (except species) will be applied to the sampling event data so that we’ll be working with the same set of checklists. It is critical that auk_complete() is called, since complete checklists are a requirement for zero-filling.
For example, the following filters to only include sightings of Collared Kingfisher between 6 and 10am:
# to produce zero-filled data, provide an EBD and sampling event data file
f_ebd <- system.file("extdata/zerofill-ex_ebd.txt", package = "auk")
f_smp <- system.file("extdata/zerofill-ex_sampling.txt", package = "auk")
filters <- auk_ebd(f_ebd, file_sampling = f_smp) %>%
auk_species("Collared Kingfisher") %>%
auk_time(c("06:00", "10:00")) %>%
auk_complete()
filters
#> Input
#> EBD: /usr/local/lib/R/site-library/auk/extdata/zerofill-ex_ebd.txt
#> Sampling events: /usr/local/lib/R/site-library/auk/extdata/zerofill-ex_sampling.txt
#>
#> Output
#> Filters not executed
#>
#> Filters
#> Species: Todiramphus chloris
#> Countries: all
#> States: all
#> Counties: all
#> BCRs: all
#> Bounding box: full extent
#> Years: all
#> Date: all
#> Start time: 06:00-10:00
#> Last edited date: all
#> Protocol: all
#> Project code: all
#> Duration: all
#> Distance travelled: all
#> Records with breeding codes only: no
#> Exotic Codes: all
#> Complete checklists only: yes
As with presence-only data, call auk_filter() to actually run AWK. Output files must be provided for both the EBD and sampling event data.
## ebd_sed_filtered <- auk_filter(filters,
## file = "ebd-filtered.txt",
## file_sampling = "sampling-filtered.txt")
ebd_sed_filtered
#> Input
#> EBD: /usr/local/lib/R/site-library/auk/extdata/zerofill-ex_ebd.txt
#> Sampling events: /usr/local/lib/R/site-library/auk/extdata/zerofill-ex_sampling.txt
#>
#> Output
#> EBD: ebd-filtered.txt
#> Sampling events: sampling-filtered.txt
#>
#> Filters
#> Species: Todiramphus chloris
#> Countries: all
#> States: all
#> Counties: all
#> BCRs: all
#> Bounding box: full extent
#> Years: all
#> Date: all
#> Start time: 06:00-10:00
#> Last edited date: all
#> Protocol: all
#> Project code: all
#> Duration: all
#> Distance travelled: all
#> Records with breeding codes only: no
#> Exotic Codes: all
#> Complete checklists only: yes
Reading and zero-filling
The filtered datasets can now be combined into a zero-filled, presence-absence dataset using auk_zerofill().
## ebd_zf <- auk_zerofill(ebd_sed_filtered)
ebd_zf
#> Zero-filled EBD: 131 unique checklists, for 1 species.
Filenames or data frames of the basic dataset and sampling event data can also be passed to auk_zerofill(); see the documentation for these cases. By default, auk_zerofill() returns an auk_zerofill object consisting of two data frames that can be linked by a common checklist_id field:
- ebd_zf$sampling_events contains the checklist information
- ebd_zf$observations contains the species counts and a binary presence-absence variable
head(ebd_zf$observations)
#> # A tibble: 6 × 8
#> checklist_id scientific_name breeding_code breeding_category behavior_code
#> <chr> <chr> <chr> <chr> <chr>
#> 1 G2470228 Todiramphus chloris NA NA NA
#> 2 G366411 Todiramphus chloris NA NA NA
#> 3 S10006552 Todiramphus chloris NA NA NA
#> 4 S10006731 Todiramphus chloris NA NA NA
#> 5 S10006786 Todiramphus chloris NA NA NA
#> 6 S10011787 Todiramphus chloris NA NA NA
#> # ℹ 3 more variables: age_sex <chr>, observation_count <chr>,
#> # species_observed <lgl>
glimpse(ebd_zf$sampling_events)
#> Rows: 131
#> Columns: 31
#> $ checklist_id <chr> "S9843037", "S34396450", "S9589770", "S16642…
#> $ last_edited_date <chr> "2022-01-13 07:47:42.702684", "2022-01-13 07…
#> $ country <chr> "Singapore", "Singapore", "Singapore", "Sing…
#> $ country_code <chr> "SG", "SG", "SG", "SG", "SG", "SG", "SG", "S…
#> $ state <chr> "Singapore", "Singapore", "Singapore", "Sing…
#> $ state_code <chr> "SG-", "SG-", "SG-", "SG-", "SG-", "SG-", "S…
#> $ county <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ county_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ iba_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ bcr_code <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ usfws_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ atlas_block <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ locality <chr> "Pulau Ubin", "Pulau Ubin", "Pulau Ubin", "P…
#> $ locality_id <chr> "L1055540", "L1055540", "L1055540", "L105554…
#> $ locality_type <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H",…
#> $ latitude <dbl> 1.403608, 1.403608, 1.403608, 1.403608, 1.35…
#> $ longitude <dbl> 103.9688, 103.9688, 103.9688, 103.9688, 103.…
#> $ observation_date <date> 2012-02-16, 2012-06-23, 2012-01-15, 2012-01…
#> $ time_observations_started <chr> "08:00:00", "09:00:00", "08:00:00", "08:00:0…
#> $ observer_id <chr> "obs204697", "obs816783", "obs205759", "obs4…
#> $ sampling_event_identifier <chr> "S9843037", "S34396450", "S9589770", "S16642…
#> $ protocol_type <chr> "Traveling", "Traveling", "Traveling", "Trav…
#> $ protocol_code <chr> "P22", "P22", "P22", "P22", "P21", "P22", "P…
#> $ project_code <chr> "EBIRD", "EBIRD", "EBIRD_CAN", "EBIRD_AU", "…
#> $ duration_minutes <int> 300, 180, 300, 510, 23, 195, 150, 105, 190, …
#> $ effort_distance_km <dbl> 1.609, 4.000, 3.000, 10.000, NA, 3.000, 3.00…
#> $ effort_area_ha <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ number_observers <int> 2, 2, 6, 1, 2, 3, 12, 3, 1, 1, 3, 3, 2, 1, 1…
#> $ all_species_reported <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ group_identifier <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ trip_comments <chr> "With Ailin Chuah on day trip", NA, NA, "Spe…
This format is efficient for storage because the checklist information isn’t duplicated; however, a single flat data frame is often required for analysis. To collapse the two data frames together use collapse_zerofill(), or call auk_zerofill() with collapse = TRUE.
## ebd_zf_df <- auk_zerofill(ebd_sed_filtered, collapse = TRUE)
ebd_zf_df <- collapse_zerofill(ebd_zf)
class(ebd_zf_df)
#> [1] "tbl_df" "tbl" "data.frame"
ebd_zf_df
#> # A tibble: 131 × 38
#> checklist_id last_edited_date country country_code state state_code county
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 S9843037 2022-01-13 07:47:4… Singap… SG Sing… SG- NA
#> 2 S34396450 2022-01-13 07:47:4… Singap… SG Sing… SG- NA
#> 3 S9589770 2023-01-31 19:39:0… Singap… SG Sing… SG- NA
#> 4 S16642917 2023-01-31 19:39:0… Singap… SG Sing… SG- NA
#> 5 S10410041 2013-10-14 16:08:30 Singap… SG Sing… SG- NA
#> 6 S10366236 2021-04-02 03:48:3… Singap… SG Sing… SG- NA
#> 7 S34396153 2021-08-03 11:08:2… Singap… SG Sing… SG- NA
#> 8 S9760550 2022-03-12 04:34:1… Singap… SG Sing… SG- NA
#> 9 S16899954 2022-03-02 06:47:0… Singap… SG Sing… SG- NA
#> 10 S10920563 2019-08-10 00:41:22 Singap… SG Sing… SG- NA
#> # ℹ 121 more rows
#> # ℹ 31 more variables: county_code <chr>, iba_code <chr>, bcr_code <int>,
#> # usfws_code <chr>, atlas_block <chr>, locality <chr>, locality_id <chr>,
#> # locality_type <chr>, latitude <dbl>, longitude <dbl>,
#> # observation_date <date>, time_observations_started <chr>,
#> # observer_id <chr>, sampling_event_identifier <chr>, protocol_type <chr>,
#> # protocol_code <chr>, project_code <chr>, duration_minutes <int>, …
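Note that observation_count is stored as character because eBird lets observers enter "X" to indicate that a species was present without providing a count. For count-based analyses, a common follow-up step is to convert this column to integer with "X" treated as missing. The sketch below shows one way to do this on a toy vector; how to treat "X" is an analysis decision, and this choice is only an assumption for illustration.

```r
# toy counts as they might appear in observation_count
counts_chr <- c("2", "X", "0", "15")

# convert to integer, leaving "X" (present, but not counted) as NA
counts_int <- rep(NA_integer_, length(counts_chr))
is_count <- counts_chr != "X"
counts_int[is_count] <- as.integer(counts_chr[is_count])
counts_int

# presence can still be recovered: an "X" or a positive count means detected
detected <- counts_chr == "X" | (is_count & counts_int > 0)
detected
```

The same steps could be applied to the observation_count column of the collapsed data frame, which already carries a species_observed column for the presence-absence view.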
Acknowledgements
This package is based on the AWK scripts provided in a presentation given by Wesley Hochachka, Daniel Fink, Tom Auer, and Frank La Sorte at the 2016 NAOC eBird Data Workshop on August 15, 2016.
auk benefited significantly from the rOpenSci review process, including helpful suggestions from Auriel Fournier and Edmund Hart.
References
eBird Basic Dataset. Version: ebd_relFeb-2018. Cornell Lab of Ornithology, Ithaca, New York. May 2013.
Guillera-Arroita, G., J.J. Lahoz-Monfort, J. Elith, A. Gordon, H. Kujala, P.E. Lentini, M.A. McCarthy, R. Tingley, and B.A. Wintle. 2015. Is my species distribution model fit for purpose? Matching data and models to applications. Global Ecology and Biogeography 24:276-292.