Vetting popler

Introduction: identifying groups of data sets

The popler R package was built to foster scientific synthesis using LTER long-term population data. The premise of such synthesis is using data from many research projects that share characteristics of scientific interest. To identify projects sharing salient attributes, popler uses the metadata information associated with each LTER project. In particular, it is fairly easy to select projects based on one or more of the following features:

Replication, temporal or spatial.
Taxonomic group(s).
Study characteristics.
Geographic location.

Vetting the database based on these criteria is intuitive. However, popler also facilitates identifying data sets in other ways. Below we provide several examples of how to select LTER data based on the four types of features described above. In the final section, we also show how to carry out more complicated searches.

1. Replication

Temporal replication

If you are interested in long-term data, you will likely want to select projects based on the number of years over which data were collected. This is straightforward:

library(popler)
pplr_browse(duration_years > 10)

## # A tibble: 163 × 20
## # Groups:   title, proj_metadata_key, lterid, datatype, structured_data,
## #   studytype, duration_years, community, studystartyr, studyendyr,
## #   structured_type_1, structured_type_2, structured_type_3, structured_type_4,
## #   treatment_type_1, treatment_type_2, treatment_type_3, lat_lter, lng_lter
## #   [163]
##    title             proj_metadata_key lterid datatype structured_data studytype
##  * <chr>                         <int> <chr>  <chr>    <chr>           <chr>    
##  1 SBC LTER: Reef: …                 1 SBC    individ… no              obs      
##  2 SBC LTER: Reef: …                 2 SBC    count    no              obs      
##  3 SBC LTER: Reef: …                 3 SBC    count    yes             obs      
##  4 SBC LTER: Reef: …                 4 SBC    cover    no              obs      
##  5 SBC LTER: Time s…                12 SBC    density  no              obs      
##  6 SBC LTER: Santa …                13 SBC    count    no              obs      
##  7 SBC LTER: Santa …                14 SBC    cover    no              obs      
##  8 SBC LTER: Santa …                15 SBC    biomass  no              obs      
##  9 SBC LTER: Reef: …                17 SBC    biomass  no              obs      
## 10 Long-Term Core S…                21 SEV    count    yes             obs      
## # ℹ 153 more rows
## # ℹ 14 more variables: duration_years <int>, community <chr>,
## #   studystartyr <chr>, studyendyr <chr>, structured_type_1 <chr>,
## #   structured_type_2 <chr>, structured_type_3 <chr>, structured_type_4 <chr>,
## #   treatment_type_1 <chr>, treatment_type_2 <chr>, treatment_type_3 <chr>,
## #   lat_lter <dbl>, lng_lter <dbl>, taxas <named list>

Most LTER projects sample at a yearly or sub-yearly frequency, so studies longer than 10 years usually provide a longitudinal series of 10 or more observations. Note, however, that the duration_years variable is calculated as studyendyr - studystartyr and therefore carries no information on how often sampling occurred. For this reason, an additional variable named samplefreq characterizes the approximate sampling frequency of each study.

pplr_dictionary(samplefreq)

## $`samplefreq (NA)`
##  [1] "year"         "yr"           "season:yr"    "biweekly"     "month"       
##  [6] "month:year"   "monthly"      "season:year"  "bimonthly"    "NaN"         
## [11] "biennial"     "quadrennial"  "irregular"    "quinquennial" "day"

pplr_browse(samplefreq == "monthly")

## # A tibble: 1 × 20
## # Groups:   title, proj_metadata_key, lterid, datatype, structured_data,
## #   studytype, duration_years, community, studystartyr, studyendyr,
## #   structured_type_1, structured_type_2, structured_type_3, structured_type_4,
## #   treatment_type_1, treatment_type_2, treatment_type_3, lat_lter, lng_lter
## #   [1]
##   title              proj_metadata_key lterid datatype structured_data studytype
## * <chr>                          <int> <chr>  <chr>    <chr>           <chr>    
## 1 SBC LTER: Cross-s…                20 SBC    count    no              obs      
## # ℹ 14 more variables: duration_years <int>, community <chr>,
## #   studystartyr <chr>, studyendyr <chr>, structured_type_1 <chr>,
## #   structured_type_2 <chr>, structured_type_3 <chr>, structured_type_4 <chr>,
## #   treatment_type_1 <chr>, treatment_type_2 <chr>, treatment_type_3 <chr>,
## #   lat_lter <dbl>, lng_lter <dbl>, taxas <named list>

Note that samplefreq is not among the default variables shown by pplr_dictionary() or pplr_browse(). To view it, specify the full_tbl = TRUE argument in either function.
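Because duration_years only spans the start and end years, the length of a time series also depends on the sampling frequency. The sketch below illustrates this with a small, made-up metadata table (it is not popler output). The lookup vector samples_per_year, which maps a few samplefreq labels to an approximate number of samples per year, is a hypothetical helper introduced here for illustration.

```r
# Toy metadata table; all values are invented for illustration only
meta <- data.frame(
  proj_metadata_key = c(1001, 1002, 1003),   # hypothetical keys
  studystartyr      = c("2000", "2000", "2000"),
  studyendyr        = c("2012", "2012", "2012"),
  samplefreq        = c("year", "monthly", "biennial"),
  stringsAsFactors  = FALSE
)

# Start and end years print as <chr> in pplr_browse() output, so convert first
meta$duration_years <- as.integer(meta$studyendyr) - as.integer(meta$studystartyr)

# Hypothetical mapping from samplefreq labels to samples per year (an assumption)
samples_per_year <- c(year = 1, monthly = 12, biennial = 0.5)

# Approximate number of observations in each series
meta$approx_n_samples <- meta$duration_years *
  unname(samples_per_year[meta$samplefreq])

meta[, c("proj_metadata_key", "duration_years", "samplefreq", "approx_n_samples")]
```

In this toy table all three studies have the same duration, yet the biennial study yields far fewer observations than the yearly or monthly ones.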

Spatial replication

Before downloading data

If you wish to select data sets based on their spatial replication, you need to consider that popler organizes data in nested spatial levels. For example, in many plant studies data are collected at the plot level, which can be nested within block, which in turn can be nested within site. popler labels spatial levels using numbers. Spatial level 1 is the coarsest level of replication, which contains all other spatial replicates. In the example above, spatial level 1 is site, spatial level 2 is block, and spatial level 3 is plot. popler allows for a total of 5 spatial levels. Given the above, you can select studies based on three criteria:

The total number of spatial replicates.
The number of replicates within a specific spatial level.
The number of nested spatial replicates.

Below we provide one example for each of these three cases, in the same order.

pplr_browse(tot_spat_rep > 100)

## # A tibble: 158 × 20
## # Groups:   title, proj_metadata_key, lterid, datatype, structured_data,
## #   studytype, duration_years, community, studystartyr, studyendyr,
## #   structured_type_1, structured_type_2, structured_type_3, structured_type_4,
## #   treatment_type_1, treatment_type_2, treatment_type_3, lat_lter, lng_lter
## #   [158]
##    title             proj_metadata_key lterid datatype structured_data studytype
##  * <chr>                         <int> <chr>  <chr>    <chr>           <chr>    
##  1 SBC LTER: Reef: …                 1 SBC    individ… no              obs      
##  2 SBC LTER: Reef: …                 2 SBC    count    no              obs      
##  3 SBC LTER: Reef: …                 3 SBC    count    yes             obs      
##  4 SBC LTER: Reef: …                 5 SBC    individ… no              exp      
##  5 SBC LTER: Reef: …                 6 SBC    count    yes             exp      
##  6 SBC LTER: Reef: …                 7 SBC    count    no              exp      
##  7 SBC LTER: Time s…                12 SBC    density  no              obs      
##  8 SBC LTER: Santa …                13 SBC    count    no              obs      
##  9 SBC LTER: Santa …                14 SBC    cover    no              obs      
## 10 SBC LTER: Santa …                15 SBC    biomass  no              obs      
## # ℹ 148 more rows
## # ℹ 14 more variables: duration_years <int>, community <chr>,
## #   studystartyr <chr>, studyendyr <chr>, structured_type_1 <chr>,
## #   structured_type_2 <chr>, structured_type_3 <chr>, structured_type_4 <chr>,
## #   treatment_type_1 <chr>, treatment_type_2 <chr>, treatment_type_3 <chr>,
## #   lat_lter <dbl>, lng_lter <dbl>, taxas <named list>

pplr_browse(spatial_replication_level_5_number_of_unique_reps > 1)

## # A tibble: 4 × 20
## # Groups:   title, proj_metadata_key, lterid, datatype, structured_data,
## #   studytype, duration_years, community, studystartyr, studyendyr,
## #   structured_type_1, structured_type_2, structured_type_3, structured_type_4,
## #   treatment_type_1, treatment_type_2, treatment_type_3, lat_lter, lng_lter
## #   [4]
##   title              proj_metadata_key lterid datatype structured_data studytype
## * <chr>                          <int> <chr>  <chr>    <chr>           <chr>    
## 1 Plant succession …               141 AND    cover    no              obs      
## 2 e093: Soil Hetero…               287 CDR    cover    no              exp      
## 3 Macroinfaunal cou…               862 PIE    count    no              exp      
## 4 Meiofaunal counts…               868 PIE    count    no              exp      
## # ℹ 14 more variables: duration_years <int>, community <chr>,
## #   studystartyr <chr>, studyendyr <chr>, structured_type_1 <chr>,
## #   structured_type_2 <chr>, structured_type_3 <chr>, structured_type_4 <chr>,
## #   treatment_type_1 <chr>, treatment_type_2 <chr>, treatment_type_3 <chr>,
## #   lat_lter <dbl>, lng_lter <dbl>, taxas <named list>

pplr_browse(n_spat_levs == 3)

## # A tibble: 96 × 20
## # Groups:   title, proj_metadata_key, lterid, datatype, structured_data,
## #   studytype, duration_years, community, studystartyr, studyendyr,
## #   structured_type_1, structured_type_2, structured_type_3, structured_type_4,
## #   treatment_type_1, treatment_type_2, treatment_type_3, lat_lter, lng_lter
## #   [96]
##    title             proj_metadata_key lterid datatype structured_data studytype
##  * <chr>                         <int> <chr>  <chr>    <chr>           <chr>    
##  1 SBC LTER: Santa …                13 SBC    count    no              obs      
##  2 SBC LTER: Santa …                15 SBC    biomass  no              obs      
##  3 SBC LTER: Santa …                16 SBC    count    no              obs      
##  4 Long-Term Core S…                21 SEV    count    yes             obs      
##  5 Rodent Parasite …                25 SEV    count    yes             obs      
##  6 Burn Exclosure R…                28 SEV    individ… no              exp      
##  7 Nitrogen Fertili…                29 SEV    cover    no              exp      
##  8 Pino Gate Prairi…                33 SEV    count    no              obs      
##  9 Warming-El Nino-…                34 SEV    cover    no              exp      
## 10 Livestock Exclos…                35 SEV    cover    no              exp      
## # ℹ 86 more rows
## # ℹ 14 more variables: duration_years <int>, community <chr>,
## #   studystartyr <chr>, studyendyr <chr>, structured_type_1 <chr>,
## #   structured_type_2 <chr>, structured_type_3 <chr>, structured_type_4 <chr>,
## #   treatment_type_1 <chr>, treatment_type_2 <chr>, treatment_type_3 <chr>,
## #   lat_lter <dbl>, lng_lter <dbl>, taxas <named list>
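To make the idea of nested spatial levels concrete, the sketch below builds a toy data set with three levels (site, block, plot) and counts replicates in two ways. The data are invented, the column names for levels 2 and 3 are assumed to follow the pattern of spatial_replication_level_1, and the counting shown here is only an illustration of nesting, not a description of how popler computes its metadata variables.

```r
# Toy observations: 2 sites, each with 2 blocks, each with 2 plots (invented data)
obs <- expand.grid(
  spatial_replication_level_3 = c("plot1", "plot2"),
  spatial_replication_level_2 = c("blockA", "blockB"),
  spatial_replication_level_1 = c("site1", "site2"),
  year                        = 2001:2003,
  stringsAsFactors            = FALSE
)

lev_cols <- paste0("spatial_replication_level_", 1:3)

# Number of unique labels used at each level, considered on its own
n_unique <- vapply(obs[lev_cols], function(x) length(unique(x)), integer(1))

# Number of distinct nested replicates: unique site/block/plot combinations
n_nested <- nrow(unique(obs[lev_cols]))

n_unique
n_nested
```

Each level uses only two labels, but because the levels are nested there are eight distinct plots in total.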

After downloading data

Users can also explore the spatial and temporal replication of the data more explicitly after downloading it with pplr_get_data(), using two functions: pplr_site_rep() and pplr_site_rep_plot().

pplr_site_rep() provides two options for exploring data that meet temporal replication requirements at a given spatial resolution. The user can filter data by specifying a minimum sampling frequency per year and a minimum number of years sampled at that frequency. Because this function uses the sampling dates to calculate the frequency, it provides information that is not contained in the samplefreq column of the main metadata table.

# download some data (note: this download is >100MB)
SEV <- pplr_get_data(proj_metadata_key == 21)

# Create a summary table containing the names of replication levels that were sampled at least 2 times per year for at least 10 years.
SEV_long_studies <- pplr_site_rep(SEV, 
                                  freq = 2, 
                                  duration = 10, 
                                  return_logical = FALSE)

# you can also subset the data directly by asking the function to return a logical vector
subset_vec <- pplr_site_rep(SEV,
                            freq = 2,
                            duration = 10,
                            return_logical = TRUE)
# store subset of data
SEV_long_data <- SEV[subset_vec, ]
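The two thresholds used above (a minimum number of samples per year, and a minimum number of years that reach it) can be sketched in base R on toy data. This is not the implementation of pplr_site_rep(); the column names site, year, and month and all values are hypothetical, and the sketch only shows what the freq and duration arguments mean.

```r
# Toy sampling records (invented): site A is sampled twice a year, site B once
samp <- rbind(
  expand.grid(site = "A", year = 2001:2010, month = c(4, 9),
              stringsAsFactors = FALSE),
  expand.grid(site = "B", year = 2001:2010, month = 6,
              stringsAsFactors = FALSE)
)

freq     <- 2   # minimum number of samples per year
duration <- 10  # minimum number of years sampled at that frequency

# Count distinct sampling months per site and year
per_year <- aggregate(month ~ site + year, data = samp,
                      FUN = function(m) length(unique(m)))

# For each site, count the years that reach the minimum frequency
ok_years <- tapply(per_year$month >= freq, per_year$site, sum)

# Sites that satisfy both thresholds, and a row-wise logical vector for subsetting
keep_sites <- names(ok_years)[ok_years >= duration]
keep_rows  <- samp$site %in% keep_sites
samp_long  <- samp[keep_rows, ]
```

Only site A is retained, because site B never reaches two samples in a year.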

Users can also visualize the frequency of sampling at the coarsest level of spatial replication using the pplr_site_rep_plot() function. This generates a ggplot that denotes whether or not a particular site was sampled in a particular year. Note that the coarsest level of spatial replication is called site and it is contained in the variable spatial_replication_level_1.

library(ggplot2)

# return the plot object w/ return_plot = TRUE
pplr_site_rep_plot(SEV_long_data, return_plot = TRUE) +
  ggtitle("Long Term Data from Sevilleta LTER")
  
# or return an invisible copy of the input data and keep piping
library(dplyr)
SEV_long_data %>%
  pplr_site_rep_plot(return_plot = FALSE) %>%
  pplr_report_metadata()

2. Taxonomic group

popler is not limited to specific taxonomic groups, although it currently contains mostly data on animals and plants. To select studies based on taxonomy, specify the taxonomic rank and the category you wish to select. By default, popler provides seven taxonomic ranks in each request: kingdom, phylum, class, order, family, genus, and species. The sppcode column provides the identifier, usually an alphanumeric code, associated with each taxonomic entity in the original dataset. Note that not all LTER studies provide full taxonomic information; hence, browsing studies by taxonomy will return partial results (in the example below, not all insect studies will be identified).

pplr_dictionary(class)

## $`class (class)`
##  [1] "Phaeophycea"            "Actinopterygii"         "Chondrichthyes"        
##  [4] "Osteichthes"            "Asteroidea"             "Gastropoda"            
##  [7] "Anthozoa"               "Cephalopoda"            "Malacostraca"          
## [10] "Phaeophyceae"           "Bivalvia"               "Holothuroidea"         
## [13] "Echinoidea"             "Ascidiacea"             "Demospongiae"          
## [16] "Polychaeta"             "Ophiuroidea"            "Ascidiacae"            
## [19] "Rhodophyceae"           "Hydrozoa"               "Gymnolaemata"          
## [22] "Liliopsida"             "Ascidacea"              "Chlorophyceae"         
## [25] "Bacillariophyta"        "Maxillopoda"            "Calcarea"              
## [28] "Ophiuroidea/Asteroidea" "Ophiuroidae"            "Floriophyccae"         
## [31] "Mammalia"               "Bacillariophyceae"      "Conoidasida"           
## [34] "Secernentea"            "Cestoda"                "Archiacanthocephala"   
## [37] "cestode"                "Adenophorea"            "Insecta"               
## [40] "Arachnida"              "Catenotaeniidae"        "Insect"                
## [43] "Reptilia"               "Aves"                   "Collembola"            
## [46] "Clitellata"             "Hexapoda"               "Lecanoromycetes"       
## [49] "Turbellaria"            "Ostracoda"              "Branchiobdellida"      
## [52] "Branchiopoda"           "Hirudinea"              "Oligochaeta"           
## [55] "Pelecypoda"             "Entogatha"              "Annelida"              
## [58] "Crustacea"              "Nematoda"               "Hydracarina"           
## [61] "Phylum Nemertea"        "Phylum Nematoda"        "Phylum Cnidaria"

pplr_browse(class == "Insecta")

## # A tibble: 7 × 20
## # Groups:   title, proj_metadata_key, lterid, datatype, structured_data,
## #   studytype, duration_years, community, studystartyr, studyendyr,
## #   structured_type_1, structured_type_2, structured_type_3, structured_type_4,
## #   treatment_type_1, treatment_type_2, treatment_type_3, lat_lter, lng_lter
## #   [7]
##   title              proj_metadata_key lterid datatype structured_data studytype
## * <chr>                          <int> <chr>  <chr>    <chr>           <chr>    
## 1 Rodent Parasite D…                25 SEV    count    yes             obs      
## 2 Effect of Habitat…                43 SEV    count    no              obs      
## 3 Small Mammal Excl…                60 SEV    count    no              exp      
## 4 SGS-LTER Long-Ter…                86 SGS    count    no              obs      
## 5 Aquatic insect sa…               133 AND    count    no              obs      
## 6 Bonanza Creek Exp…               156 BNZ    count    no              obs      
## 7 North Temperate L…               822 NTL    count    no              obs      
## # ℹ 14 more variables: duration_years <int>, community <chr>,
## #   studystartyr <chr>, studyendyr <chr>, structured_type_1 <chr>,
## #   structured_type_2 <chr>, structured_type_3 <chr>, structured_type_4 <chr>,
## #   treatment_type_1 <chr>, treatment_type_2 <chr>, treatment_type_3 <chr>,
## #   lat_lter <dbl>, lng_lter <dbl>, taxas <named list>

Note that the taxonomic information returned by pplr_browse() is housed in a data structure called a list column. Each entry of this list column is itself a list that contains a data.frame with eight columns. Users can access this information using the following syntax:

insects <- pplr_browse(class == 'Insecta')

# access the taxonomic table from the first project in the insects object
insects$taxas[[1]]

## # A tibble: 7 × 8
##   sppcode species     kingdom  phylum     class   order        family    genus  
##   <chr>   <chr>       <chr>    <chr>      <chr>   <chr>        <chr>     <chr>  
## 1 cune    neomexicana Animalia Arthropoda Insecta Diptera      Oestridae Cutere…
## 2 cune    neomexicana Animalia Arthropoda Insecta Diptera      Oestridae Cutere…
## 3 cuau    austeni     Animalia Arthropoda Insecta Diptera      Oestridae Cutere…
## 4 flea    sp          Animalia Arthropoda Insecta Siphonaptera NA        NA     
## 5 cuau    austeni     Animalia Arthropoda Insecta Diptera      Oestridae Cutere…
## 6 flea    sp          Animalia Arthropoda Insecta Siphonaptera NA        NA     
## 7 cusp    species     Animalia Arthropoda Insecta Diptera      Oestridae Cutere…

# second table (etc.)
insects$taxas[[2]]

## # A tibble: 205 × 8
##    sppcode  species   kingdom  phylum     class   order       family       genus
##    <chr>    <chr>     <chr>    <chr>      <chr>   <chr>       <chr>        <chr>
##  1 ANPERPUL NA        Animalia Arthropoda Insecta Hymenoptera NA           NA   
##  2 APHABMOR morrisoni Animalia Arthropoda Insecta Hymenoptera APIDAE       Habr…
##  3 HAAGAANG angelicus Animalia Arthropoda Insecta Hymenoptera HALICTIDAE   Agap…
##  4 APDIAENA NA        Animalia Arthropoda Insecta Hymenoptera NA           NA   
##  5 MEOSMTIT titusi    Animalia Arthropoda Insecta Hymenoptera MEGACHILIDAE Osmia
##  6 ANPER005 5         Animalia Arthropoda Insecta Hymenoptera ANDRENIDAE   Perd…
##  7 HALASCOA NA        Animalia Arthropoda Insecta Hymenoptera NA           NA   
##  8 APTETALB NA        Animalia Arthropoda Insecta Hymenoptera NA           NA   
##  9 APANTPHE NA        Animalia Arthropoda Insecta Hymenoptera NA           NA   
## 10 HASPH002 2         Animalia Arthropoda Insecta Hymenoptera HALICTIDAE   Sphe…
## # ℹ 195 more rows
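For readers unfamiliar with list columns, the sketch below builds a toy table of the same general shape as the taxas column and shows a few common ways to work with it in base R. The table and its keys are invented for illustration; only the column names order and family mirror the output above.

```r
# Toy project table with a list column (invented data, mimicking the shape of `taxas`)
projects <- data.frame(proj_metadata_key = c(1001, 1002))  # hypothetical keys
projects$taxas <- list(
  data.frame(order  = c("Diptera", "Siphonaptera"),
             family = c("Oestridae", NA),
             stringsAsFactors = FALSE),
  data.frame(order  = c("Hymenoptera", "Hymenoptera"),
             family = c("APIDAE", "HALICTIDAE"),
             stringsAsFactors = FALSE)
)

# Number of taxa recorded in each project
n_taxa <- vapply(projects$taxas, nrow, integer(1))

# Flag projects whose taxonomic table contains a given order
has_diptera <- vapply(projects$taxas,
                      function(tx) any(tx$order == "Diptera", na.rm = TRUE),
                      logical(1))

# Stack all taxonomic tables, keeping track of the project each row came from
all_taxa <- do.call(
  rbind,
  Map(function(key, tx) cbind(proj_metadata_key = key, tx),
      projects$proj_metadata_key, projects$taxas)
)
```

The same vapply() and do.call(rbind, ...) patterns should apply to any list of data frames, such as the entries extracted with insects$taxas[[1]] above.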

3. Study characteristics

Metadata information provides a few variables to select studies based on their design. In particular:

studytype: indicates whether the study is observational or experimental. Options are obs or exp for observational and experimental studies, respectively.
treatment_type: type of treatment, if the study is experimental (stored in the columns treatment_type_1 to treatment_type_3).
community: indicates whether the project provides data on multiple species. Options are yes or no.
structured_data: indicates whether the project provides information on population structure. For example, a population can be sub-divided in age, size, or developmental classes. Options are yes or no.

Below we show how to use three of these fields: community, treatment, and studytype.

pplr_dictionary(community)

## $`community (NA)`
## [1] "no"  "yes"

pplr_browse(community == "no") # 43 single-species studies

## # A tibble: 43 × 20
## # Groups:   title, proj_metadata_key, lterid, datatype, structured_data,
## #   studytype, duration_years, community, studystartyr, studyendyr,
## #   structured_type_1, structured_type_2, structured_type_3, structured_type_4,
## #   treatment_type_1, treatment_type_2, treatment_type_3, lat_lter, lng_lter
## #   [43]
##    title             proj_metadata_key lterid datatype structured_data studytype
##  * <chr>                         <int> <chr>  <chr>    <chr>           <chr>    
##  1 SBC LTER: Reef: …                 1 SBC    individ… no              obs      
##  2 SBC LTER: Reef: …                 5 SBC    individ… no              exp      
##  3 SBC LTER: Santa …                16 SBC    count    no              obs      
##  4 SBC LTER: Reef: …                17 SBC    biomass  no              obs      
##  5 SBC LTER: Reef: …                18 SBC    count    yes             obs      
##  6 Population Ecolo…                44 SEV    individ… no              obs      
##  7 Gunnison's Prair…                47 SEV    individ… no              exp      
##  8 SGS-LTER Long-Te…                84 SGS    individ… no              obs      
##  9 Density of Seagr…                90 VCR    density  no              exp      
## 10 Spruce Seedling …               158 BNZ    individ… no              exp      
## # ℹ 33 more rows
## # ℹ 14 more variables: duration_years <int>, community <chr>,
## #   studystartyr <chr>, studyendyr <chr>, structured_type_1 <chr>,
## #   structured_type_2 <chr>, structured_type_3 <chr>, structured_type_4 <chr>,
## #   treatment_type_1 <chr>, treatment_type_2 <chr>, treatment_type_3 <chr>,
## #   lat_lter <dbl>, lng_lter <dbl>, taxas <named list>

pplr_dictionary(treatment)

## $`treatment (type of treatment)`
##  [1] "observational"                    "removal"                         
##  [3] "fire"                             "resource"                        
##  [5] "temp(T); precip(P); resources(N)" "consumer"                        
##  [7] "precip"                           "precipitation"                   
##  [9] "density"                          "disturbance"                     
## [11] "exclosure"                        "temperature"                     
## [13] "competition"                      "diversity"                       
## [15] "restoration"

nrow( pplr_browse(treatment == "fire") ) # 18 fire studies

## [1] 18

pplr_dictionary(studytype)

## $`studytype (NA)`
## [1] "obs" "exp"

nrow( pplr_browse(studytype == "obs") ) # 183 observational studies

## [1] 183
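The dictionaries shown in this document contain near-duplicate labels, for example "precip" and "precipitation" for treatment, or "Insecta" and "Insect" for class, so an exact match on a single label can miss relevant studies. The sketch below shows the general idea on a toy data frame with invented rows. Whether pplr_browse() accepts the %in% operator in its filter is not shown here, so the example is applied to an ordinary data frame.

```r
# Toy metadata (invented rows); the labels are taken from the treatment dictionary
studies <- data.frame(
  proj_metadata_key = 1001:1005,   # hypothetical keys
  treatment = c("precip", "fire", "precipitation", "removal", "precip"),
  stringsAsFactors = FALSE
)

# An exact match on one spelling finds only part of the precipitation studies
exact <- studies[studies$treatment == "precipitation", ]

# Matching against all known spellings finds them all
precip_labels <- c("precip", "precipitation")
all_precip <- studies[studies$treatment %in% precip_labels, ]

nrow(exact)
nrow(all_precip)
```

Checking pplr_dictionary() for alternative spellings before filtering is therefore a sensible first step.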

4. Geographic location

To select studies based on the latitude and longitude of the LTER headquarters around which datasets were (or are being) collected, simply use the lat_lter and lng_lter numeric variables:

pplr_dictionary( lat_lter, lng_lter )

## $`lat_lter (NA)`
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -77.00000  33.43000  39.09000  35.65512  45.40000  66.63000 
## 
## $`lng_lter (NA)`
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -149.8300 -119.8400 -106.7400 -103.4849  -93.2000  162.5200

pplr_browse( lat_lter > 40 & lng_lter < -100 ) # studies north of 40 degrees latitude and west of -100 degrees longitude

## # A tibble: 58 × 20
## # Groups:   title, proj_metadata_key, lterid, datatype, structured_data,
## #   studytype, duration_years, community, studystartyr, studyendyr,
## #   structured_type_1, structured_type_2, structured_type_3, structured_type_4,
## #   treatment_type_1, treatment_type_2, treatment_type_3, lat_lter, lng_lter
## #   [58]
##    title             proj_metadata_key lterid datatype structured_data studytype
##  * <chr>                         <int> <chr>  <chr>    <chr>           <chr>    
##  1 SGS-LTER Long-Te…                63 SGS    cover    no              obs      
##  2 SGS-LTER Standar…                65 SGS    biomass  no              obs      
##  3 Open Top Chamber…                66 SGS    cover    no              exp      
##  4 SGS-LTER Boutelo…                69 SGS    count    no              exp      
##  5 SGS-LTER Boutelo…                70 SGS    cover    no              exp      
##  6 SGS-LTER Disturb…                71 SGS    cover    no              exp      
##  7 SGS-LTER Ecosyst…                72 SGS    count    no              exp      
##  8 SGS-LTER Ecosyst…                73 SGS    basal_c… no              exp      
##  9 SGS-LTER Effects…                74 SGS    cover    no              exp      
## 10 SGS-LTER Effects…                76 SGS    count    no              exp      
## # ℹ 48 more rows
## # ℹ 14 more variables: duration_years <int>, community <chr>,
## #   studystartyr <chr>, studyendyr <chr>, structured_type_1 <chr>,
## #   structured_type_2 <chr>, structured_type_3 <chr>, structured_type_4 <chr>,
## #   treatment_type_1 <chr>, treatment_type_2 <chr>, treatment_type_3 <chr>,
## #   lat_lter <dbl>, lng_lter <dbl>, taxas <named list>

5. More complicated searches

popler supports more complicated searches in two ways: i) searching several types of metadata variables simultaneously, and ii) searching for studies that match a string pattern. In the first case, simply provide pplr_browse() with a logical statement involving more than one metadata variable. For example, if you want studies on plants with exactly 4 nested spatial levels and more than 10 years of data:

pplr_browse(kingdom == "Plantae" & n_spat_levs == 4 & duration_years > 10)

In the second case, the keyword argument of pplr_browse() searches for string patterns within the metadata of each study. For example, if we are interested in studies that use traps:

pplr_browse(keyword = 'trap')

Note that the keyword argument works with regular expressions as well:

# look for studies that include the words "trap" or "spatial"
pplr_browse(keyword = 'trap|spatial')
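To preview how a regular expression behaves before passing it to the keyword argument, it can be tested on a few strings with base R's grepl(). The titles below are invented, and the use of ignore.case = TRUE is a choice made for this sketch; whether keyword matching in pplr_browse() is case sensitive is not stated here.

```r
# Toy study titles (invented for illustration)
titles <- c("Pitfall trap arthropod counts",
            "Spatial patterns of shrub cover",
            "Seedling survival in burned plots")

# "trap|spatial" matches titles containing either word
hits <- grepl("trap|spatial", titles, ignore.case = TRUE)
titles[hits]

# Word boundaries narrow the match to "trap" as a whole word
whole_word <- grepl("\\btrap\\b", c("trap", "traps", "entrapment"))
whole_word
```

The first pattern selects the first two titles; the second matches only the exact word "trap".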

Aldo Compagnoni, Sam Levin

2023-05-05