Introduction to popler
Aldo Compagnoni, Sam Levin
2023-05-05
Source:vignettes/introduction-to-popler.Rmd
The popler R package is an interface that allows browsing and querying population data collected at Long Term Ecological Research (LTER) network sites located across the United States of America. A subset of the population data from the LTER network is contained in an online database called popler. The popler R package is an interface to this online database, allowing users to:
- explore what type of population data is contained in the popler database
- download data contained in the popler database
- filter and validate the data once it is downloaded
Installation
The popler R package is currently in the development phase, and it should be downloaded directly from its GitHub page. Before doing this, make sure to install the devtools R package. Once devtools is installed on your machine, install and load popler:
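A minimal sketch of these steps, assuming the package is hosted in the ropensci/popler GitHub repository (the path is inferred from the issue-tracker URL that appears in this vignette's console output, so check it against the project's GitHub page) and that a CRAN mirror is reachable:

```r
# Install devtools from CRAN only if it is not already available.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Assumption: the repository path "ropensci/popler" is inferred, not confirmed here.
devtools::install_github("ropensci/popler")

# Load the package for the current session.
library(popler)
```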
Metadata: what type of data is contained in the popler database?
popler provides data from hundreds of research projects. The metadata of these projects describe what population data each project provides. The popler R package provides three functions to explore these metadata.
pplr_dictionary()
pplr_dictionary() shows:
- the variables contained in the metadata of each project, and their meaning.
- the data these variables contain.
To see metadata variables and their meaning:
pplr_dictionary()
##    variable          description
## 1  title             title of project
## 2  proj_metadata_key unique project id
## 3  lterid            lter name
## 4  datatype          type of abundance data (e.g. count,biomass)
## 5  structured_data   are abundance observations grouped (e.g. based on age)?
## 6  studytype         experimental or observational study?
## 7  duration_years    duration of project in years
## 8  community         does data set contain multiple taxa?
## 9  structure         types of indidivual structure
## 10 treatment         types of treatment
## 11 lat_lter          lter site latitude
## 12 lng_lter          lter site longitude
## 13 species           specific epithet of a taxonomic unit
## 14 kingdom           kingdom
## 15 phylum            phylum
## 16 class             class
## 17 order             order
## 18 family            family
## 19 genus             genus
To show what data each variable actually contains, specify one or more variables:
pplr_dictionary(lterid, duration_years)
## $`lterid (NA)`
## [1] "SBC" "SEV" "SGS" "VCR" "AND" "NWT" "BNZ" "CDR" "GCE" "ARC" "CAP" "FCE"
## [13] "HFR" "KBS" "CWT" "HBR" "MCM" "JRN" "CCE" "KNZ" "LUQ" "MCR" "NTL" "PAL"
## [25] "PIE"
##
## $`duration_years (NA)`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##  0.00000  9.00000 12.00000 15.87258 20.00000 86.00000
Finally, the same information provided by pplr_dictionary() can be viewed in an HTML page containing hyperlinks. To open this page, run the pplr_report_dictionary() function.
pplr_browse()
pplr_browse() accesses and subsets the popler metadata table directly. Calling the function without arguments returns a table that contains the metadata of all the projects in popler:
all_studies <- pplr_browse()
This metadata table can be subset by specifying a logical expression. This is useful to focus on datasets of interest.
poa_metadata <- pplr_browse(genus == "Poa" & species == "fendleriana")
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## ℹ The deprecated feature was likely used in the popler package.
## Please report the issue at <https://github.com/ropensci/popler/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
## ℹ The deprecated feature was likely used in the dplyr package.
## Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
poa_metadata
## # A tibble: 4 × 20
## # Groups: title, proj_metadata_key, lterid, datatype, structured_data,
## # studytype, duration_years, community, studystartyr, studyendyr,
## # structured_type_1, structured_type_2, structured_type_3, structured_type_4,
## # treatment_type_1, treatment_type_2, treatment_type_3, lat_lter, lng_lter
## # [4]
##   title              proj_metadata_key lterid datatype structured_data studytype
## * <chr>                          <int> <chr>  <chr>    <chr>           <chr>
## 1 Livestock Exclosu…                35 SEV    cover    no              exp
## 2 Pinon Juniper Net…                36 SEV    cover    yes             obs
## 3 Pinon-Juniper (Co…                53 SEV    cover    yes             obs
## 4 Transect Plant Li…               681 JRN    cover    no              obs
## # ℹ 14 more variables: duration_years <int>, community <chr>,
## # studystartyr <chr>, studyendyr <chr>, structured_type_1 <chr>,
## # structured_type_2 <chr>, structured_type_3 <chr>, structured_type_4 <chr>,
## # treatment_type_1 <chr>, treatment_type_2 <chr>, treatment_type_3 <chr>,
## # lat_lter <dbl>, lng_lter <dbl>, taxas <named list>
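To make the effect of a logical expression concrete, the sketch below applies the same kind of filter to a small hand-made data frame using base R's subset(). The data frame is a mock that copies a few values from the output above. It does not query the popler database, and it is not meant to show how pplr_browse() is implemented.

```r
# Mock metadata table: a few columns copied by hand from the output above.
mock_metadata <- data.frame(
  proj_metadata_key = c(35L, 36L, 53L, 681L),
  lterid            = c("SEV", "SEV", "SEV", "JRN"),
  datatype          = c("cover", "cover", "cover", "cover"),
  structured_data   = c("no", "yes", "yes", "no"),
  studytype         = c("exp", "obs", "obs", "obs"),
  stringsAsFactors  = FALSE
)

# subset() evaluates the unquoted expression inside the data frame,
# so column names can be used directly in the condition.
sev_obs <- subset(mock_metadata, lterid == "SEV" & studytype == "obs")

# Project identifiers of the rows that satisfy both conditions.
print(sev_obs$proj_metadata_key)
```

Conditions can be combined with & (and) and | (or), exactly as in any other logical expression in R.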
Moreover, like pplr_report_dictionary(), pplr_browse() can generate a report and open it as an HTML page. To do so, set the report argument to TRUE. Alternatively, you can pass an object created by pplr_browse() to pplr_report_metadata() to create the same report.
pplr_browse(lterid == "SEV", report = TRUE)
SEV <- pplr_browse(lterid == "SEV")
pplr_report_metadata(SEV)
The keyword argument
pplr_browse() can also single out projects based on partial matching across the metadata variables that contain characters. Specify the character string you want to search for using the keyword argument (note that this function ignores variables that contain numeric values):
pplr_browse(keyword = "parasite", report = TRUE)
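The idea of partial matching across character variables can be sketched in base R as follows. This is an illustration only: keyword_match() is a hypothetical helper, the titles and identifiers in the mock table are invented, and the case-insensitive literal matching is an assumption rather than a description of what pplr_browse() does internally.

```r
# Mock metadata table with invented values (not taken from the popler database).
mock_metadata <- data.frame(
  proj_metadata_key = c(1L, 2L, 3L),
  title = c("Parasite loads in intertidal snails",
            "Grassland plant cover transects",
            "Fish counts and parasite surveys"),
  lterid = c("AAA", "BBB", "CCC"),
  duration_years = c(5L, 12L, 8L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the rows in which any character column
# contains the keyword. Numeric columns are skipped.
keyword_match <- function(tbl, keyword) {
  char_cols <- tbl[vapply(tbl, is.character, logical(1))]
  if (length(char_cols) == 0) {
    return(tbl[0, , drop = FALSE])
  }
  # One logical vector per character column, combined with "or".
  hits <- lapply(char_cols, function(col) {
    grepl(tolower(keyword), tolower(col), fixed = TRUE)
  })
  tbl[Reduce(`|`, hits), , drop = FALSE]
}

parasite_projects <- keyword_match(mock_metadata, "parasite")
print(parasite_projects$proj_metadata_key)
```

Because only character columns are searched, a keyword such as "12" does not match the numeric duration_years column in this sketch.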
Download data
Once you have identified one or more datasets of interest, download their raw data using pplr_get_data(). You can use this function to download data in three ways:
1. Providing pplr_get_data() with an object created through pplr_browse().
2. Providing pplr_get_data() with an object created by pplr_browse(), and with an additional logical expression to further subset this object of class browse.
3. Providing pplr_get_data() with a logical expression. This logical expression will typically indicate the specific project(s) the user is interested in downloading.
Below are examples of the three ways to use pplr_get_data():
# option 1
poa_metadata <- pplr_browse(genus == "Poa" & species == "fendleriana")
poa_data <- pplr_get_data(poa_metadata)
# option 2
poa_data_11 <- pplr_get_data(poa_metadata, duration_years > 10)
# option 3
parasite_data <- pplr_get_data(proj_metadata_key == 25)
Here, we emphasize two important characteristics of pplr_get_data(). First, like pplr_browse(), the function selects datasets based on the variables described in pplr_dictionary(). Second, pplr_get_data() downloads entire datasets that satisfy the user-defined conditions. For example, in the call above where genus == "Poa" & species == "fendleriana", the function downloads every dataset that includes data on Poa fendleriana (the four projects listed in the poa_metadata table above), along with the many other taxa that co-occur with Poa fendleriana in those datasets.
On a slow internet connection, datasets may take some time to download. Therefore, popler provides two utility functions, pplr_save() and pplr_load(), for saving downloaded data locally and efficiently. They are thin wrappers around saveRDS() and readRDS() that allow you to store large data sets in highly compressed formats. .rds files also have the advantage of rapid read and write times from R, making them well suited to saving data sets for later use. Note that, as in the examples below, you should not include the file extension when specifying the path.
# save the large data set for later usage
pplr_save(poa_data, file = "some/file/path/Poa_Data")
# when you're ready to use it again, pick up where you left off.
poa_data_reloaded <- pplr_load(file = "some/file/path/Poa_Data")
# These will be identical
stopifnot(identical(poa_data, poa_data_reloaded))
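For readers who want to see what a thin wrapper around saveRDS() and readRDS() could look like, here is a small base R sketch. The helpers save_sketch() and load_sketch() are hypothetical and are not the popler functions. That they append ".rds" to the path and use "xz" compression are assumptions made for illustration.

```r
# Hypothetical helpers (not popler functions): append ".rds" to the path so the
# caller does not have to, then delegate to saveRDS() / readRDS().
save_sketch <- function(x, file) {
  full_path <- paste0(file, ".rds")
  saveRDS(x, file = full_path, compress = "xz")
  invisible(full_path)
}

load_sketch <- function(file) {
  readRDS(paste0(file, ".rds"))
}

# Small stand-in for a downloaded data set.
mock_data <- data.frame(
  year = 2001:2005,
  abundance_observation = c(3, 5, 2, 8, 1)
)

# Write to a temporary directory, giving the path without an extension.
path <- file.path(tempdir(), "mock_data")
save_sketch(mock_data, path)
mock_reloaded <- load_sketch(path)

# The reloaded object is identical to the original.
stopifnot(identical(mock_data, mock_reloaded))
```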
Carefully vet the methods of downloaded data sets.
We urge the user to carefully read the documentation of each project before using it for research purposes. Data sets downloaded with popler share the same data structure, but each project has its peculiarities. To show the metadata of the downloaded data sets, use pplr_report_metadata() on the data object produced by pplr_get_data(). To read the methods of each project, click on the ‘metadata link’ hyperlink provided in the HTML page.
pplr_report_metadata(poa_data)
Data structure
In popler, datasets are objects produced by pplr_get_data(), and they all share the same structure. This structure is documented formally in vignette('popler-database-structure', package = 'popler'). Here, we briefly describe how spatial replicates and taxonomic information are stored in the database.
Spatial replicates are identified using variables that match the patterns spatial_replication_level_X and spatial_replication_level_X_label. Here X is a number from 1 to 5 that refers to one of up to five nested levels of spatial replication. Level 1 is the largest spatial replication level, within which all smaller spatial replicates are nested. For example, spatial_replication_level_1 can represent a site, and spatial_replication_level_2 a plot within that site. In this specific case, spatial_replication_level_1_label will contain the string ‘site’, and spatial_replication_level_2_label will contain the string ‘plot’.
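The nesting described above can be illustrated with a small mock data frame. The site and plot names below are invented, and only two of the possible five levels are shown. The column names follow the patterns given in the text.

```r
# Mock data: two sites, each with two plots, sampled in two years.
mock_data <- data.frame(
  year = rep(c(2001L, 2002L), each = 4),
  spatial_replication_level_1       = rep(c("site_A", "site_A", "site_B", "site_B"), times = 2),
  spatial_replication_level_1_label = "site",
  spatial_replication_level_2       = rep(c("plot_1", "plot_2", "plot_1", "plot_2"), times = 2),
  spatial_replication_level_2_label = "plot",
  stringsAsFactors = FALSE
)

# Select the replicate columns (not the *_label columns) by pattern.
level_cols <- grep("^spatial_replication_level_[1-5]$", names(mock_data), value = TRUE)
print(level_cols)

# Count the distinct level-2 replicates (plots) nested within each
# level-1 replicate (site).
plots_per_site <- tapply(
  mock_data$spatial_replication_level_2,
  mock_data$spatial_replication_level_1,
  function(x) length(unique(x))
)
print(plots_per_site)
```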
Taxonomic units are identified through species codes in the sppcode variable, or through the genus and species variables. The sppcode variable usually contains alphanumeric codes. The genus and species variables together hold the Latin binomial name. Occasionally, some datasets will contain higher taxonomic classifications (such as family, class, etc.).
Spatio-temporal replication
Users can explore the level of temporal replication at each nested level of spatial replication using the pplr_site_rep_plot() and pplr_site_rep() functions.
pplr_site_rep_plot() produces a scatterplot that shows which sites (spatial_replication_level_1) were sampled in a given year.
pplr_site_rep() allows the user to subset datasets downloaded by pplr_get_data() based on the frequency and number of yearly replicates contained at a specific level of spatial replication. For example, this function lets you identify the replicates of the second level of spatial replication (e.g. plots within sites) that contain two samples per year (their frequency) for 10 years (the number of yearly replicates). pplr_site_rep() returns a logical vector that can be used to subset the pplr_get_data() object. For additional examples on how to explore and vet popler data, see vignette('vetting-popler', package = 'popler').
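The selection logic described above (a minimum number of samples per year, sustained for a minimum number of years) can be sketched in base R. This is not the implementation of pplr_site_rep(): site_rep_sketch() and its freq and duration arguments are hypothetical, the data are invented, and the sketch assumes that each row of the data frame is one sampling event.

```r
# Mock data: plot_1 is sampled twice a year for three years,
# plot_2 only once a year. Each row is assumed to be one sampling event.
mock_data <- data.frame(
  year = c(2001, 2001, 2002, 2002, 2003, 2003, 2001, 2002, 2003),
  spatial_replication_level_2 = c(rep("plot_1", 6), rep("plot_2", 3)),
  abundance_observation = c(4, 6, 5, 3, 7, 2, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: flag the rows belonging to replicates that have at
# least `freq` samples per year in at least `duration` years.
site_rep_sketch <- function(data, rep_col, freq, duration) {
  # Number of samples for every replicate-by-year combination.
  n_samples <- table(data[[rep_col]], data$year)
  # Number of years in which each replicate reaches the required frequency.
  n_years <- rowSums(n_samples >= freq)
  # Replicates that meet the duration requirement.
  keep <- names(n_years)[n_years >= duration]
  # Logical vector aligned with the rows of `data`.
  data[[rep_col]] %in% keep
}

keep_rows <- site_rep_sketch(mock_data, "spatial_replication_level_2",
                             freq = 2, duration = 3)

# Use the logical vector to subset the data.
vetted <- mock_data[keep_rows, ]
print(unique(vetted$spatial_replication_level_2))
```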
Extra covariates
Most data sets provided by the USA LTER network contain more variables than those accommodated by the schema of popler. In order not to lose the original data, popler stores all extra information in a character variable named covariates. The popler package provides two ways to format these covariates into a data frame: the cov_unpack argument in pplr_get_data(), and the pplr_cov_unpack() function.
Setting the cov_unpack argument to TRUE returns a data frame that combines the variables of a default query to popler, and the covariates contained in each particular study downloaded through popler:
d_47_cov <- pplr_get_data(proj_metadata_key == 47, cov_unpack = TRUE)
## You have downloaded data from 1 project.
## The identification number of this project is: 47.
##
## To learn more about study design, use metadata_url()
## To cite the study, use pplr_citation().
head(d_47_cov)
## year month day spatial_replication_level_1 spatial_replication_level_2
## 1 2010 6 18 site_sev_pdog_restoration_B NA
## 2 2010 6 18 site_sev_pdog_restoration_B NA
## 3 2010 6 18 site_sev_pdog_restoration_B NA
## 4 2010 6 18 site_sev_pdog_restoration_B NA
## 5 2010 6 18 site_sev_pdog_restoration_B NA
## 6 2010 6 18 site_sev_pdog_restoration_B NA
## treatment_type_1 structure_type_1 structure_type_2 structure_type_3
## 1 B 349 M A
## 2 B 348 M A
## 3 B 347 M A
## 4 B 245 M A
## 5 B 244 M A
## 6 B NA M A
## abundance_observation structure_type_4 authors
## 1 1 1085 Ana Davidson, Stephanie Baker
## 2 1 1580 Ana Davidson, Stephanie Baker
## 3 1 1350 Ana Davidson, Stephanie Baker
## 4 1 1130 Ana Davidson, Stephanie Baker
## 5 1 1480 Ana Davidson, Stephanie Baker
## 6 1 1140 Ana Davidson, Stephanie Baker
## authors_contact genus species datatype
## 1 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
## 2 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
## 3 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
## 4 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
## 5 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
## 6 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
## spatial_replication_level_1_label spatial_replication_level_2_label
## 1 PLOT Capture_Site
## 2 PLOT Capture_Site
## 3 PLOT Capture_Site
## 4 PLOT Capture_Site
## 5 PLOT Capture_Site
## 6 PLOT Capture_Site
## proj_metadata_key X1_label X1_value X2_label X2_value X3_label X3_value
## 1 47 SEASON SUMMER YEAR 2010 RECAP RELEASE
## 2 47 SEASON SUMMER YEAR 2010 RECAP RELEASE
## 3 47 SEASON SUMMER YEAR 2010 RECAP RELEASE
## 4 47 SEASON SUMMER YEAR 2010 RECAP RELEASE
## 5 47 SEASON SUMMER YEAR 2010 RECAP RELEASE
## 6 47 SEASON SUMMER YEAR 2010 RECAP RELEASE
## X4_label X4_value X5_label X5_value X6_label X6_value X7_label
## 1 Comments NA PIT_TAG NA TAG_2_RT 249.0 Capture_Site
## 2 Comments NA PIT_TAG NA TAG_2_RT 248.0 Capture_Site
## 3 Comments NA PIT_TAG NA TAG_2_RT 246.0 Capture_Site
## 4 Comments NA PIT_TAG NA TAG_2_RT 346.0 Capture_Site
## 5 Comments NA PIT_TAG NA TAG_2_RT NA Capture_Site
## 6 Comments NA PIT_TAG NA TAG_2_RT 224.0 Capture_Site
## X7_value X8_label X8_value
## 1 Las Colinas <NA> <NA>
## 2 Las Colinas <NA> <NA>
## 3 Las Colinas <NA> <NA>
## 4 Las Colinas <NA> <NA>
## 5 Las Colinas <NA> <NA>
## 6 Las Colinas <NA> <NA>
Using the pplr_cov_unpack() function on a data frame downloaded using pplr_get_data() returns a separate data frame of the covariates contained in the downloaded object.
d_47 <- pplr_get_data(proj_metadata_key == 47)
## You have downloaded data from 1 project.
## The identification number of this project is: 47.
##
## To learn more about study design, use metadata_url()
## To cite the study, use pplr_citation().
head(pplr_cov_unpack(d_47))
## X1_label X1_value X2_label X2_value X3_label X3_value X4_label X4_value
## 1 SEASON SUMMER YEAR 2010 RECAP RELEASE Comments NA
## 2 SEASON SUMMER YEAR 2010 RECAP RELEASE Comments NA
## 3 SEASON SUMMER YEAR 2010 RECAP RELEASE Comments NA
## 4 SEASON SUMMER YEAR 2010 RECAP RELEASE Comments NA
## 5 SEASON SUMMER YEAR 2010 RECAP RELEASE Comments NA
## 6 SEASON SUMMER YEAR 2010 RECAP RELEASE Comments NA
## X5_label X5_value X6_label X6_value X7_label X7_value X8_label
## 1 PIT_TAG NA TAG_2_RT 249.0 Capture_Site Las Colinas <NA>
## 2 PIT_TAG NA TAG_2_RT 248.0 Capture_Site Las Colinas <NA>
## 3 PIT_TAG NA TAG_2_RT 246.0 Capture_Site Las Colinas <NA>
## 4 PIT_TAG NA TAG_2_RT 346.0 Capture_Site Las Colinas <NA>
## 5 PIT_TAG NA TAG_2_RT NA Capture_Site Las Colinas <NA>
## 6 PIT_TAG NA TAG_2_RT 224.0 Capture_Site Las Colinas <NA>
## X8_value
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
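To give a feel for how a single character variable can be expanded into the paired Xn_label and Xn_value columns shown above, here is a base R sketch. The "LABEL=VALUE;LABEL=VALUE" encoding used below is invented for illustration and is not the format in which popler stores covariates. unpack_sketch() is a hypothetical helper, not pplr_cov_unpack().

```r
# Mock covariates column. Assumption: a made-up "LABEL=VALUE;..." encoding,
# which is NOT the format used by the popler database.
covariates <- c("SEASON=SUMMER;YEAR=2010;RECAP=RELEASE",
                "SEASON=SUMMER;YEAR=2011;RECAP=RELEASE")

# Hypothetical helper: one row per record, with alternating label/value columns.
unpack_sketch <- function(x) {
  rows <- lapply(strsplit(x, ";", fixed = TRUE), function(pairs) {
    kv <- strsplit(pairs, "=", fixed = TRUE)
    labels <- vapply(kv, `[`, character(1), 1)
    values <- vapply(kv, `[`, character(1), 2)
    # Interleave labels and values: label 1, value 1, label 2, value 2, ...
    out <- as.list(c(rbind(labels, values)))
    names(out) <- paste0("X", rep(seq_along(labels), each = 2),
                         c("_label", "_value"))
    as.data.frame(out, stringsAsFactors = FALSE)
  })
  # Stack the one-row data frames; assumes every record has the same labels.
  do.call(rbind, rows)
}

unpacked <- unpack_sketch(covariates)
print(names(unpacked))
```

All values come back as character strings in this sketch, so numeric covariates would need to be converted explicitly before analysis.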