popler is the R package to browse and query the popler database.
popler is a PostgreSQL database that contains population-level datasets from the US Long Term Ecological Research (LTER) network. This package is currently only available on GitHub, but our ultimate goal is to submit it to CRAN. A detailed explanation of the structure of the popler online database is contained in the draft of the manuscript presenting popler and in the dedicated vignette.
The popler database is organized around four types of tables:
A. The tables containing information on population abundance. Population abundance can be of five types: count, biomass, cover, density, and individual-level data.
B. A table linking each abundance value to the taxonomic unit it refers to. In popler, “taxonomic unit” generally refers to a species.
C. The location of each “site”. By “site” we refer to the largest spatial replicate available for a dataset. Many datasets provide abundance data from more than one site.
D. Metadata information referring to each separate dataset.
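As a rough illustration of how the four table types relate to one another, the sketch below builds toy versions of them in base R and joins them into a single data frame. All table and column names used here (`abundance`, `taxa`, `sites`, `metadata`, `taxa_id`, `site_id`, `proj_id`) are hypothetical and chosen for the example only; they are not the names used in the popler database.

```r
# Toy stand-ins for the four table types; all names are hypothetical
abundance <- data.frame(
  proj_id = c(1, 1, 2),
  site_id = c(10, 11, 20),
  taxa_id = c(100, 100, 200),
  year    = c(2001, 2001, 1999),
  count   = c(12, 7, 30)
)
taxa     <- data.frame(taxa_id = c(100, 200),
                       genus   = c("Genus_a", "Genus_b"))
sites    <- data.frame(site_id   = c(10, 11, 20),
                       site_name = c("s1", "s2", "s3"))
metadata <- data.frame(proj_id = c(1, 2),
                       title   = c("Project one", "Project two"))

# Link each abundance value to its taxonomic unit, site, and dataset metadata
full <- merge(abundance, taxa, by = "taxa_id")
full <- merge(full, sites, by = "site_id")
full <- merge(full, metadata, by = "proj_id")

print(full)
```

Each abundance row ends up carrying its taxonomic, spatial, and metadata information, which is similar in spirit to the single data structure that popler downloads share.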
popler aims to facilitate finding, retrieving, and using time-series data of population abundance associated with the US LTER network. To find datasets, the functions in popler aid in understanding and browsing the metadata information referring to each dataset. To retrieve data, the function `pplr_get_data()` downloads single datasets or groups of datasets. These downloads share the same data structure. To use downloaded data, the package provides ancillary functions to consult and cite the original data sources and to examine the temporal replication of each dataset, as well as methods for a couple of `dplyr` verbs to assist with data manipulation.
All exported functions in the popler R package use the `pplr_` prefix. Moreover, the functions use lazy and/or tidy evaluation, meaning you do not need to manually quote most inputs.
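To give a feel for what unquoted inputs mean in practice, here is a minimal base R sketch of non-standard evaluation. The helper `subset_unquoted()` and the toy `studies` data frame are hypothetical and not part of popler; the package's own implementation may well differ.

```r
# Hypothetical helper, not part of popler: subsets a data frame using an
# unquoted logical expression
subset_unquoted <- function(data, condition) {
  expr <- substitute(condition)              # capture the expression unevaluated
  keep <- eval(expr, data, parent.frame())   # evaluate it with columns as variables
  data[keep, , drop = FALSE]
}

# Toy metadata table; values are made up for the example
studies <- data.frame(
  lterid         = c("SEV", "PAL", "SEV"),
  duration_years = c(12, 4, 8),
  stringsAsFactors = FALSE
)

# Column names are written bare, without quotes or `studies$`
long_sev <- subset_unquoted(studies, lterid == "SEV" & duration_years > 10)
print(long_sev)
```

Calls such as `pplr_browse(duration_years > 5)` follow the same pattern from the user's point of view: the condition is written directly in terms of metadata variable names.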
We suggest starting by exploring the metadata variables describing each dataset in popler using `pplr_dictionary()`. This function works in two ways: without an argument, as in `pplr_dictionary()`, or with the name of a metadata variable, as in `pplr_dictionary( proj_metadata_key )`.
Without an argument, the output of `pplr_dictionary()` is a data frame showing a description of each metadata variable.
```
##    variable           description
## 1  title              title of project
## 2  proj_metadata_key  unique project id
## 3  lterid             lter name
## 4  datatype           type of abundance data (e.g. count,biomass)
## 5  structured_data    are abundance observations grouped (e.g. based on age)?
## 6  studytype          experimental or observational study?
## 7  duration_years     duration of project in years
## 8  community          does data set contain multiple taxa?
## 9  structure          types of indidivual structure
## 10 treatment          types of treatment
## 11 lat_lter           lter site latitude
## 12 lng_lter           lter site longitude
## 13 species            specific epithet of a taxonomic unit
## 14 kingdom            kingdom
## 15 phylum             phylum
## 16 class              class
## 17 order              order
## 18 family             family
## 19 genus              genus
```
However, this function is more powerful when used with an argument. When `pplr_dictionary()` is provided with the name of a metadata variable, it returns the possible unique values of the variable. For example, providing `datatype` shows that popler contains five types of abundance data: count, biomass, cover, density, and individual-level data.
The `pplr_report_dictionary()` function generates an `Rmd` file and renders it into html. This html file contains both the meaning of the variables and their unique values.
Once you are familiar with the meaning and content of popler’s metadata variables, `pplr_browse()` provides the metadata of the studies contained in popler. `pplr_browse()` also works with and without an input. Without input, the function produces a data frame including the metadata variables describing every study currently contained in the popler database. Note that this data frame is a `tbl` that inherits from the `browse` class. Inputs to `pplr_browse()` allow users to subset this data frame (e.g. `duration_years > 5`). When subsetting, the unique values provided by `pplr_dictionary()` are particularly useful. For more nuanced subsetting of available datasets, the `keyword` argument allows subsetting variables using partial matching. Note that `keyword` acts primarily on information contained in the titles of studies.
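The idea of partial matching on titles can be sketched in base R as follows. The `studies` data frame and its titles are made up for the example, and the case-insensitive matching is an assumption of this sketch rather than a documented property of the `keyword` argument.

```r
# Toy metadata table; keys and titles are made up for the example
studies <- data.frame(
  proj_metadata_key = 1:3,
  title = c("Parasite loads in salt marsh snails",
            "Small mammal trapping",
            "Host-parasite dynamics in kelp forest fish"),
  stringsAsFactors = FALSE
)

# Keep every study whose title contains the keyword anywhere in the string
keyword <- "parasite"
matches <- studies[grepl(keyword, studies$title, ignore.case = TRUE), ]

print(matches$proj_metadata_key)
```

A partial match keeps a study even when the keyword is only part of a longer word, which is why it is more forgiving than an exact comparison such as `title == "parasite"`.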
You can generate a human-readable report on the metadata variables of the projects you subset using `pplr_browse()` by providing the function with the argument `report = TRUE`. This argument uses `rmarkdown` to render the metadata into an html file, and opens it in your default browser. Alternatively, you can perform the same action by passing the `browse` object produced by `pplr_browse()` to the function `pplr_report_metadata()`.
```r
# generate metadata report for all studies
pplr_browse(report = TRUE)

# alternatively
all_studies <- pplr_browse()
pplr_report_metadata(all_studies)

# generate metadata report for parasite studies
pplr_browse(keyword = 'parasite', report = TRUE)

# alternatively
parasite_studies <- pplr_browse(keyword = 'parasite')
pplr_report_metadata(parasite_studies)
```
Once you have explored the metadata and decided which projects interest you, it’s time to actually download the data! `pplr_get_data()` connects to the database via an API, and downloads the raw data based on the criteria supplied. Alternatively, if you’re happy with the projects represented in the `browse` object you created earlier, you can simply pass that object to `pplr_get_data()`. Note that if your `browse` object contains 5 rows, `pplr_get_data()` will download 5 separate datasets. All objects created with `pplr_get_data()` inherit from the `get_data` class.
```r
# create a browse object and use it to get data
penguins <- pplr_browse(lterid == 'PAL')

# unpack covariates as well
penguin_raw_data <- pplr_get_data(penguins, cov_unpack = TRUE)

# A very specific query
more_raw_data <- pplr_get_data((proj_metadata_key == 43 | proj_metadata_key == 25) & year < 1995)
```
We provide three ancillary functions to facilitate the use of the objects downloaded through `pplr_get_data()`.
`pplr_metadata_url()` opens up a webpage containing study details. Before doing scientific analyses, we urge users to review the peculiarities of each dataset by vetting its online documentation. Importantly, `pplr_metadata_url()` also works on objects produced by `pplr_browse()`.
`pplr_cov_unpack()` transforms the data contained in the `covariates` column of each downloaded dataset into separate columns. This may or may not be useful depending on the objectives of the user. Note: you can also transform covariates into a data frame directly through `pplr_get_data()` by providing the function with the argument `cov_unpack = TRUE`.
`pplr_citation()` produces a citation for each downloaded dataset.
The datasets contained in popler are heterogeneous, especially in terms of their spatio-temporal replication. Most studies present at least a few spatial replicates which were not sampled every year. Note that most datasets in popler present at least one additional replication level. These spatial replicates are denoted with numbered variables of the form `spatial_replication_level_X`, where `X` refers to the replication level, which can go from 1 to 5. The names of these replication levels (e.g. plot, subplot, etc.) are contained in variable `spatial_replication_level_X_label`.
Once you download a dataset, you can examine the temporal replication of the largest spatial replicate (the site, or `spatial_replication_level_1`) using the function `pplr_site_rep_plot()`. This function produces a plot showing whether or not a given site was sampled in a year.
`pplr_site_rep()` produces either a logical vector for subsetting an existing `get_data` object or a summary table of temporal replication for a given spatial resolution. You can control the minimum frequency of sampling and the minimum duration of sampling using the `freq` and `duration` arguments, respectively. Additionally, you can choose the level of spatial replication to filter by providing an integer between 1 and 5 to the `rep_level` argument. `return_logical` allows you to control what is returned by the function. `TRUE` returns a logical vector that flags the rows of the `get_data` object belonging to spatial replicates that meet the replication criteria specified in the function. `FALSE` returns a summary table describing the number of samples per year at the selected spatial resolution.
```r
# Example with piping and subsetting w/ the logical vector output
library(dplyr)

SEV_studies <- pplr_get_data(lterid == 'SEV' & datatype == 'individual')

long_SEV_studies <- SEV_studies %>%
  .[pplr_site_rep(input = ., duration = 12, rep_level = 3), ] %>%
  pplr_site_rep_plot()

# Or, create the summary table
SEV_summary <- SEV_studies %>%
  pplr_site_rep(duration = 13, rep_level = 1, return_logical = FALSE)

# Modify the site_rep_plot() by hand using ggplot2 syntax
library(ggplot2)

pplr_site_rep_plot(long_SEV_studies, return_plot = TRUE) +
  ggtitle('Sevilleta LTER Temporal Replication')
```
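To make the two kinds of output more concrete, the following base R sketch computes both a summary table and a logical vector from a toy data frame. It is a simplified illustration, not the code used by `pplr_site_rep()`: the toy data, the thresholds `min_freq` and `min_years`, and the exact replication criterion are assumptions of this example.

```r
# Toy data: one row per observation; values are made up for the example
obs <- data.frame(
  spatial_replication_level_1 = c("A", "A", "A", "B", "B", "C"),
  year = c(2000, 2001, 2002, 2000, 2000, 2001),
  stringsAsFactors = FALSE
)

# Summary table: number of samples per site (rows) and year (columns)
samples_per_year <- table(obs$spatial_replication_level_1, obs$year)

# Hypothetical criteria: at least `min_freq` samples per year,
# in at least `min_years` distinct years
min_freq  <- 1
min_years <- 3
years_ok   <- rowSums(samples_per_year >= min_freq)
good_sites <- names(years_ok)[years_ok >= min_years]

# Logical vector aligned with the rows of `obs`, usable for subsetting
keep     <- obs$spatial_replication_level_1 %in% good_sites
long_obs <- obs[keep, ]

print(samples_per_year)
print(long_obs)
```

The summary table helps when choosing thresholds, while the logical vector plugs directly into row subsetting, as in the piped example above.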
popler supplies methods for a couple of `dplyr` verbs to assist with data manipulation. `filter` and `mutate` methods are available for objects of the `browse` and `get_data` classes. Other `dplyr` verbs change the structure of the object too much for those classes to retain their meaning, so they are not included in the package, but you can still use them for your own purposes.
Additional information on popler is contained in a manuscript, and in the vignettes associated with the R package.
The manuscript, currently in draft form, presents the popler database, the R package, and our recommendations on how to use them.
The R package contains three vignettes: one vignette illustrates the structure of the popler database, and two vignettes provide an introduction and a more detailed look at the intended workflow of the popler package.
If these vignettes do not cover your particular use case, you still have questions, or you discover a bug, please don’t hesitate to create an issue.