Introduction to popler

The popler R package lets users browse and query population data collected at Long Term Ecological Research (LTER) network sites across the United States of America. A subset of the population data from the LTER network is stored in an online database, also called popler. The popler R package is an interface to this database, allowing users to:

explore what type of population data is contained in the popler database
download data contained in the popler database
filter and validate the data once it is downloaded

Installation

The popler R package is currently under development and should be installed directly from its GitHub page. Before doing this, make sure the devtools R package is installed on your machine. Then install and load popler:

# devtools::install_github("AldoCompagnoni/popler", build_vignettes = TRUE)
library(popler)

Metadata: what type of data is contained in the popler database?

popler provides data from hundreds of research projects. The metadata of these projects describe what population data each project provides. The popler R package provides three functions to explore these metadata.

pplr_dictionary()

pplr_dictionary() shows:

the variables contained in the metadata of each project, along with their meaning.
the data these variables contain.

To see metadata variables and their meaning:

pplr_dictionary()

##             variable                                             description
## 1              title                                        title of project
## 2  proj_metadata_key                                       unique project id
## 3             lterid                                               lter name
## 4           datatype             type of abundance data (e.g. count,biomass)
## 5    structured_data are abundance observations grouped (e.g. based on age)?
## 6          studytype                    experimental or observational study?
## 7     duration_years                            duration of project in years
## 8          community                    does data set contain multiple taxa?
## 9          structure                           types of indidivual structure
## 10         treatment                                      types of treatment
## 11          lat_lter                                      lter site latitude
## 12          lng_lter                                     lter site longitude
## 13           species                    specific epithet of a taxonomic unit
## 14           kingdom                                                 kingdom
## 15            phylum                                                  phylum
## 16             class                                                   class
## 17             order                                                   order
## 18            family                                                  family
## 19             genus                                                   genus

To show what data each variable actually contains, specify one or more variables:

pplr_dictionary(lterid, duration_years)

## $`lterid (NA)`
##  [1] "SBC" "SEV" "SGS" "VCR" "AND" "NWT" "BNZ" "CDR" "GCE" "ARC" "CAP" "FCE"
## [13] "HFR" "KBS" "CWT" "HBR" "MCM" "JRN" "CCE" "KNZ" "LUQ" "MCR" "NTL" "PAL"
## [25] "PIE"
## 
## $`duration_years (NA)`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  9.00000 12.00000 15.87258 20.00000 86.00000
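The output above lists the unique values of a character variable and a numeric summary of a numeric one. A minimal base R sketch of that behaviour follows. The table `toy_metadata` and the helper `describe_column()` are hypothetical and made up for illustration; this is not popler's own code.

```r
# Toy stand-in for the popler metadata table (values are made up).
toy_metadata <- data.frame(
  lterid           = c("SEV", "SEV", "JRN", "KNZ"),
  duration_years   = c(5, 12, 20, 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique values for character columns,
# a numeric summary for every other column.
describe_column <- function(x) {
  if (is.character(x)) unique(x) else summary(x)
}

# One list element per metadata variable.
toy_summary <- lapply(toy_metadata, describe_column)
toy_summary
```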

Finally, the information provided by pplr_dictionary() can also be viewed as an HTML page containing hyperlinks. To open this page, call the pplr_report_dictionary() function.

pplr_report_dictionary()

pplr_browse()

pplr_browse() accesses and subsets the popler metadata table directly. Calling the function returns a table that contains the metadata of all the projects in popler:

all_studies <- pplr_browse()

This metadata table can be subset by specifying a logical expression. This is useful to focus on datasets of interest.

poa_metadata  <- pplr_browse(genus == "Poa" & species == "fendleriana")

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## ℹ The deprecated feature was likely used in the popler package.
##   Please report the issue at <https://github.com/ropensci/popler/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The `.dots` argument of `group_by()` is deprecated as of dplyr 1.0.0.
## ℹ The deprecated feature was likely used in the dplyr package.
##   Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

poa_metadata

## # A tibble: 4 × 20
## # Groups:   title, proj_metadata_key, lterid, datatype, structured_data,
## #   studytype, duration_years, community, studystartyr, studyendyr,
## #   structured_type_1, structured_type_2, structured_type_3, structured_type_4,
## #   treatment_type_1, treatment_type_2, treatment_type_3, lat_lter, lng_lter
## #   [4]
##   title              proj_metadata_key lterid datatype structured_data studytype
## * <chr>                          <int> <chr>  <chr>    <chr>           <chr>    
## 1 Livestock Exclosu…                35 SEV    cover    no              exp      
## 2 Pinon Juniper Net…                36 SEV    cover    yes             obs      
## 3 Pinon-Juniper (Co…                53 SEV    cover    yes             obs      
## 4 Transect Plant Li…               681 JRN    cover    no              obs      
## # ℹ 14 more variables: duration_years <int>, community <chr>,
## #   studystartyr <chr>, studyendyr <chr>, structured_type_1 <chr>,
## #   structured_type_2 <chr>, structured_type_3 <chr>, structured_type_4 <chr>,
## #   treatment_type_1 <chr>, treatment_type_2 <chr>, treatment_type_3 <chr>,
## #   lat_lter <dbl>, lng_lter <dbl>, taxas <named list>
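The logical expression passed to pplr_browse() works much like row filtering in base R. The sketch below shows the same idea with `subset()` on a made-up table. `toy_metadata` is hypothetical and has one taxon per row, which is a simplification of the real metadata table.

```r
# Toy metadata table (made-up values, one taxon per project).
toy_metadata <- data.frame(
  proj_metadata_key = c(1L, 2L, 3L, 4L),
  lterid            = c("SEV", "SEV", "JRN", "KNZ"),
  genus             = c("Poa", "Bouteloua", "Poa", "Poa"),
  species           = c("fendleriana", "gracilis", "fendleriana", "pratensis"),
  stringsAsFactors  = FALSE
)

# Keep only the rows for which the logical expression is TRUE.
poa_rows <- subset(toy_metadata, genus == "Poa" & species == "fendleriana")
poa_rows$proj_metadata_key
```

Both conditions must hold for a row to be kept, so the row for Poa pratensis is dropped even though its genus matches.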

Moreover, like pplr_report_dictionary(), pplr_browse() can generate a report and open it as an HTML page. To do so, set the report argument to TRUE. Alternatively, you can pass an object created by pplr_browse() to pplr_report_metadata() to create the same report.

pplr_browse(lterid == "SEV", report = TRUE)

SEV <- pplr_browse(lterid == "SEV")

pplr_report_metadata(SEV)

The keyword argument

pplr_browse() can also single out projects based on partial matching across the metadata variables that contain characters. Specify the character string you want to search using the keyword argument (note that this function ignores variables that contain numeric values):

pplr_browse(keyword = "parasite", report = TRUE)
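To make the idea of partial matching across character variables concrete, here is a base R sketch. It is an illustration only: `keyword_match()` and `toy_metadata` are hypothetical, and the case-insensitive matching is an assumption rather than a documented property of pplr_browse().

```r
# Toy metadata table (made-up values).
toy_metadata <- data.frame(
  title            = c("Parasite loads in snails", "Grass cover transects", "Fish counts"),
  lterid           = c("GCE", "SEV", "SBC"),
  duration_years   = c(3, 15, 20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where any character column contains the keyword.
keyword_match <- function(df, keyword) {
  # Numeric columns are dropped from the search.
  char_cols <- df[vapply(df, is.character, logical(1))]
  # One logical vector per character column (assumed case-insensitive).
  hit_list <- lapply(char_cols, function(col) grepl(keyword, col, ignore.case = TRUE))
  # A row is kept if it matches in at least one column.
  any_hit <- Reduce(`|`, hit_list)
  df[any_hit, , drop = FALSE]
}

parasite_rows <- keyword_match(toy_metadata, "parasite")
numeric_rows  <- keyword_match(toy_metadata, "15")  # no hit: numeric columns are ignored
```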

Download data

Once you have identified one or more datasets of interest, download their raw data using pplr_get_data(). You can use this function to download data in three ways:

Providing pplr_get_data() with an object created through pplr_browse().
Providing pplr_get_data() with an object created by pplr_browse(), and with an additional logical expression to further subset this object of class browse.
Providing pplr_get_data() with a logical expression. This logical expression will typically indicate the specific project(s) the user is interested in downloading.

Below are examples of the three ways to use pplr_get_data():

# option 1
poa_metadata    <- pplr_browse(genus == "Poa" & species == "fendleriana") 
poa_data        <- pplr_get_data(poa_metadata) 
# option 2
poa_data_11     <- pplr_get_data(poa_metadata, duration_years > 10) 
# option 3
parasite_data   <- pplr_get_data(proj_metadata_key == 25)

Here, we emphasize two important characteristics of pplr_get_data(). First, like pplr_browse(), the function selects datasets based on the variables described in pplr_dictionary(). Second, pplr_get_data() downloads every dataset that satisfies the user-defined conditions in its entirety. For example, with the condition genus == "Poa" & species == "fendleriana" used above, the function downloads the four datasets listed in poa_metadata. These contain data on Poa fendleriana along with the many other taxa that happen to co-occur with Poa fendleriana in those datasets.

If you are using a slow internet connection, datasets may take some time to download. popler therefore provides two utility functions for saving downloaded data locally and efficiently. They are thin wrappers around saveRDS and readRDS that store large data sets in a highly compressed format. .rds files are also fast to read and write from R, which makes them well suited to saving data sets for later use. Note that, as in the examples below, you should not include the file extension when specifying the path.

# save the large data set for later usage
pplr_save(poa_data, file = "some/file/path/Poa_Data")

# when you're ready to use it again, pick up where you left off.

poa_data_reloaded <- pplr_load(file = "some/file/path/Poa_Data")

# These will be identical
stopifnot(identical(poa_data, poa_data_reloaded))
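The same round trip can be sketched with base R alone. The wrappers `save_sketch()` and `load_sketch()` below are hypothetical; appending ".rds" to the path is an assumption based on the note that the file type should not be specified, and this is not popler's actual implementation.

```r
# Hypothetical thin wrappers: the ".rds" extension is added for the user.
save_sketch <- function(x, file) saveRDS(x, file = paste0(file, ".rds"))
load_sketch <- function(file) readRDS(paste0(file, ".rds"))

# Toy data set (made-up values) written to a temporary directory.
toy_data <- data.frame(year = 2001:2003, abundance_observation = c(4, 7, 2))
path     <- file.path(tempdir(), "toy_data")

save_sketch(toy_data, path)
toy_reloaded <- load_sketch(path)

# The reloaded object should match the original exactly.
stopifnot(identical(toy_data, toy_reloaded))
```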

Carefully vet the methods of downloaded data sets.

We urge users to read the documentation of each project carefully before using its data for research purposes. Data sets downloaded with popler share the same data structure, but each project has its peculiarities. To show the metadata of the downloaded data sets, use pplr_report_metadata() on the data object produced by pplr_get_data(). To read the methods of each project, click on the ‘metadata link’ hyperlink provided in the HTML page.

pplr_report_metadata(poa_data)

Data structure

In popler, datasets are objects produced by pplr_get_data(), and they all share the same structure. This structure is documented formally in vignette('popler-database-structure', package = 'popler'). Here, we briefly describe how spatial replicates and taxonomic information are stored in the database.

Spatial replicates are identified using variables that match the patterns spatial_replication_level_X and spatial_replication_level_X_label. Here X is a number from 1 to 5 that refers to one of at most five nested levels of spatial replication. Level 1 is the largest spatial replication level, the one within which all smaller spatial replicates are nested. For example, spatial_replication_level_1 can represent a site, and spatial_replication_level_2 a plot within that site. In this specific case, spatial_replication_level_1_label will contain the string ‘site’, and spatial_replication_level_2_label will contain the string ‘plot’.
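As a rough illustration of this nesting, the sketch below builds a made-up data frame with two levels of spatial replication and counts the distinct plots within each site. The values in `toy_data` are hypothetical; only the column names follow the pattern described above.

```r
# Toy data with two nested levels of spatial replication (made-up values).
toy_data <- data.frame(
  spatial_replication_level_1       = c("A", "A", "A", "B", "B"),
  spatial_replication_level_1_label = "site",
  spatial_replication_level_2       = c("1", "2", "2", "1", "3"),
  spatial_replication_level_2_label = "plot",
  stringsAsFactors = FALSE
)

# Distinct site-plot combinations, one row each.
site_plot <- unique(
  toy_data[c("spatial_replication_level_1", "spatial_replication_level_2")]
)

# Number of distinct plots nested within each site.
plots_per_site <- table(site_plot$spatial_replication_level_1)
plots_per_site
```

Note that plot "1" appears in both sites. A level 2 identifier is only meaningful together with its level 1 identifier, which is why the two columns are combined before counting.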

Taxonomic units are identified through species codes in the sppcode variable, or through the genus and species variables. The sppcode variable usually contains alphanumeric codes. The genus and species variables together form the Latin binomial name. Occasionally, datasets also contain higher taxonomic classifications (such as family, class, etc.).
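For instance, a binomial name can be assembled from the genus and species variables in one line of base R. The table `toy_taxa` and its species codes are made up for illustration.

```r
# Toy taxonomic table (made-up species codes).
toy_taxa <- data.frame(
  sppcode          = c("POFE", "BOGR2"),
  genus            = c("Poa", "Bouteloua"),
  species          = c("fendleriana", "gracilis"),
  stringsAsFactors = FALSE
)

# Combine genus and specific epithet into the Latin binomial name.
toy_taxa$binomial <- paste(toy_taxa$genus, toy_taxa$species)
toy_taxa$binomial
```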

Spatio-temporal replication

Users can explore the level of temporal replication at each nested level of spatial replication using the pplr_site_rep_plot() and pplr_site_rep() functions.

pplr_site_rep_plot() produces a scatterplot that shows which sites (spatial_replication_level_1) were sampled in a given year.

pplr_site_rep() allows the user to subset datasets downloaded by pplr_get_data() based on the frequency and number of yearly replicates contained at a specific level of spatial replication. For example, this function can identify the replicates at the second level of spatial replication (e.g. plots within sites) that contain two samples per year (their frequency) for 10 years (the number of yearly replicates). pplr_site_rep() returns a logical vector that can be used to subset the pplr_get_data() object. For additional examples of how to explore and vet popler data, see vignette('vetting-popler', package = 'popler').
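The selection logic described above can be sketched in base R. This is an illustration under stated assumptions, not the code of pplr_site_rep(): the function `site_rep_sketch()`, its arguments, and `toy_data` are hypothetical, and each row is treated as one sample, which is a simplification.

```r
# Toy data: plot "p1" is sampled twice in both years, "p2" only once in 2001.
toy_data <- data.frame(
  year                        = c(2001, 2001, 2002, 2002, 2001, 2002, 2002),
  spatial_replication_level_2 = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
  stringsAsFactors            = FALSE
)

# Hypothetical helper: flag rows whose replicate has at least `freq` samples
# per year in at least `duration` years.
site_rep_sketch <- function(df, rep_col, freq, duration) {
  # Samples per replicate (rows) and year (columns).
  counts <- table(df[[rep_col]], df$year)
  # Number of years in which each replicate reaches the required frequency.
  years_ok <- rowSums(counts >= freq)
  keep <- names(years_ok)[years_ok >= duration]
  # Logical vector with one element per row of the data.
  df[[rep_col]] %in% keep
}

keep_rows <- site_rep_sketch(
  toy_data, "spatial_replication_level_2", freq = 2, duration = 2
)
toy_subset <- toy_data[keep_rows, ]
```

Returning a logical vector rather than a filtered data frame leaves the original object untouched, so the same data can be subset under several different criteria.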

Extra covariates

Most data sets provided by the USA LTER network contain more variables than the popler schema can accommodate. In order not to lose the original data, popler stores all extra information in a character variable named covariates. The popler package provides two ways to format these covariates into a data frame: the cov_unpack argument of pplr_get_data(), and the pplr_cov_unpack() function.

Setting the cov_unpack argument to TRUE returns a data frame that combines the variables of a default query to popler, and the covariates contained in each particular study downloaded through popler:

d_47_cov <- pplr_get_data(proj_metadata_key == 47, cov_unpack = TRUE)

## You have downloaded data from 1 project.
## The identification number of this project is: 47.
## 
## To learn more about study design, use metadata_url()
## To cite the study, use pplr_citation().

head(d_47_cov)

##   year month day spatial_replication_level_1 spatial_replication_level_2
## 1 2010     6  18 site_sev_pdog_restoration_B                          NA
## 2 2010     6  18 site_sev_pdog_restoration_B                          NA
## 3 2010     6  18 site_sev_pdog_restoration_B                          NA
## 4 2010     6  18 site_sev_pdog_restoration_B                          NA
## 5 2010     6  18 site_sev_pdog_restoration_B                          NA
## 6 2010     6  18 site_sev_pdog_restoration_B                          NA
##   treatment_type_1 structure_type_1 structure_type_2 structure_type_3
## 1                B              349                M                A
## 2                B              348                M                A
## 3                B              347                M                A
## 4                B              245                M                A
## 5                B              244                M                A
## 6                B               NA                M                A
##   abundance_observation structure_type_4                       authors
## 1                     1             1085 Ana Davidson, Stephanie Baker
## 2                     1             1580 Ana Davidson, Stephanie Baker
## 3                     1             1350 Ana Davidson, Stephanie Baker
## 4                     1             1130 Ana Davidson, Stephanie Baker
## 5                     1             1480 Ana Davidson, Stephanie Baker
## 6                     1             1140 Ana Davidson, Stephanie Baker
##                               authors_contact   genus   species   datatype
## 1 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
## 2 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
## 3 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
## 4 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
## 5 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
## 6 davidson@unm.edu, srbaker@sevilleta.unm.edu Cynomys gunnisoni individual
##   spatial_replication_level_1_label spatial_replication_level_2_label
## 1                              PLOT                      Capture_Site
## 2                              PLOT                      Capture_Site
## 3                              PLOT                      Capture_Site
## 4                              PLOT                      Capture_Site
## 5                              PLOT                      Capture_Site
## 6                              PLOT                      Capture_Site
##   proj_metadata_key X1_label X1_value X2_label X2_value X3_label X3_value
## 1                47   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE
## 2                47   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE
## 3                47   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE
## 4                47   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE
## 5                47   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE
## 6                47   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE
##   X4_label X4_value X5_label X5_value X6_label X6_value     X7_label
## 1 Comments       NA  PIT_TAG       NA TAG_2_RT    249.0 Capture_Site
## 2 Comments       NA  PIT_TAG       NA TAG_2_RT    248.0 Capture_Site
## 3 Comments       NA  PIT_TAG       NA TAG_2_RT    246.0 Capture_Site
## 4 Comments       NA  PIT_TAG       NA TAG_2_RT    346.0 Capture_Site
## 5 Comments       NA  PIT_TAG       NA TAG_2_RT       NA Capture_Site
## 6 Comments       NA  PIT_TAG       NA TAG_2_RT    224.0 Capture_Site
##      X7_value X8_label X8_value
## 1 Las Colinas     <NA>     <NA>
## 2 Las Colinas     <NA>     <NA>
## 3 Las Colinas     <NA>     <NA>
## 4 Las Colinas     <NA>     <NA>
## 5 Las Colinas     <NA>     <NA>
## 6 Las Colinas     <NA>     <NA>

Using the pplr_cov_unpack() function on a data frame downloaded using pplr_get_data() returns a separate data frame of the covariates contained in the downloaded object.

d_47 <- pplr_get_data(proj_metadata_key == 47)

## You have downloaded data from 1 project.
## The identification number of this project is: 47.
## 
## To learn more about study design, use metadata_url()
## To cite the study, use pplr_citation().

head(pplr_cov_unpack(d_47))

##   X1_label X1_value X2_label X2_value X3_label X3_value X4_label X4_value
## 1   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE Comments       NA
## 2   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE Comments       NA
## 3   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE Comments       NA
## 4   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE Comments       NA
## 5   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE Comments       NA
## 6   SEASON   SUMMER     YEAR     2010    RECAP  RELEASE Comments       NA
##   X5_label X5_value X6_label X6_value     X7_label    X7_value X8_label
## 1  PIT_TAG       NA TAG_2_RT    249.0 Capture_Site Las Colinas     <NA>
## 2  PIT_TAG       NA TAG_2_RT    248.0 Capture_Site Las Colinas     <NA>
## 3  PIT_TAG       NA TAG_2_RT    246.0 Capture_Site Las Colinas     <NA>
## 4  PIT_TAG       NA TAG_2_RT    346.0 Capture_Site Las Colinas     <NA>
## 5  PIT_TAG       NA TAG_2_RT       NA Capture_Site Las Colinas     <NA>
## 6  PIT_TAG       NA TAG_2_RT    224.0 Capture_Site Las Colinas     <NA>
##   X8_value
## 1     <NA>
## 2     <NA>
## 3     <NA>
## 4     <NA>
## 5     <NA>
## 6     <NA>
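The paired Xn_label and Xn_value columns in the outputs above can be reproduced with a small base R sketch. The string encoding used here ("LABEL=value" pairs separated by ";") is purely hypothetical and chosen for illustration; the actual format of the covariates variable is not described in this document, and `unpack_sketch()` is not popler's implementation.

```r
# HYPOTHETICAL encoding of extra covariates as one string per row.
toy_covariates <- c("SEASON=SUMMER;YEAR=2010", "SEASON=WINTER;YEAR=2011")

# Hypothetical helper: turn each string into X1_label, X1_value, X2_label, ...
unpack_sketch <- function(cov_strings) {
  rows <- lapply(strsplit(cov_strings, ";", fixed = TRUE), function(pairs) {
    parts  <- strsplit(pairs, "=", fixed = TRUE)
    labels <- vapply(parts, `[`, character(1), 1)
    values <- vapply(parts, `[`, character(1), 2)
    # Interleave labels and values: label 1, value 1, label 2, value 2, ...
    out <- as.list(c(rbind(labels, values)))
    names(out) <- paste0(
      "X", rep(seq_along(labels), each = 2), c("_label", "_value")
    )
    as.data.frame(out, stringsAsFactors = FALSE)
  })
  # Stack the one-row data frames (assumes every row has the same labels).
  do.call(rbind, rows)
}

unpacked <- unpack_sketch(toy_covariates)
unpacked
```

In this sketch all values stay as character strings; converting columns such as the year to numbers would be a separate step.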

Aldo Compagnoni, Sam Levin

2023-05-05