The stantargets
package makes it easy to run a single
Stan model and keep track of the results. cmdstanr
fits the models, and targets
manages the workflow and helps avoid unnecessary computation.
First, write a Stan model file.
lines <- "data {
int <lower = 1> n;
vector[n] x;
vector[n] y;
real true_beta;
}
parameters {
real beta;
}
model {
y ~ normal(x * beta, 1);
beta ~ normal(0, 1);
}"
writeLines(lines, "x.stan")
A typical workflow proceeds as follows:
- Prepare a list of input data to Stan, including vector elements
x
andy
. - Fit the Stan model using the list of input data.
- Use the fitted model object to compute posterior summaries and convergence diagnostics.
- Use the fitted model object to extract posterior draws of parameters and store them in a tidy data frame.
- Use the fitted model to compute Hamiltonian Monte Carlo (HMC) diagnostics.
stantargets
expresses this workflow using the tar_stan_mcmc()
function. To use it in a targets
pipeline, invoke it from the _targets.R
script of the
project.
# _targets.R
library(targets)
library(stantargets)
generate_data <- function(n = 10) {
true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- stats::rnorm(n, x * true_beta, 1)
list(n = n, x = x, y = y, true_beta = true_beta)
}
# The _targets.R file ends with a list of target objects
# produced by stantargets::tar_stan_mcmc(), targets::tar_target(), or similar.
list(
tar_stan_mcmc(
example,
"x.stan",
generate_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
)
)
Above, tar_stan_mcmc(example, ...)
only defines
the pipeline. It does not actually run Stan, it declares the targets
that will eventually run Stan. Run tar_manifest()
to show
specific details about the targets.
tar_manifest()
#> # A tibble: 6 × 3
#> name command description
#> <chr> <chr> <chr>
#> 1 example_data "generate_data()" NA
#> 2 example_file_x "\"x.stan\"" x.stan
#> 3 example_mcmc_x "stantargets::tar_stan_mcmc_run(stan_file =… x.stan
#> 4 example_diagnostics_x "tibble::as_tibble(posterior::as_draws_df(e… x.stan
#> 5 example_summary_x "stantargets::tar_stan_summary_join_data(su… x.stan
#> 6 example_draws_x "tibble::as_tibble(posterior::as_draws_df(e… x.stan
Each target listed above is responsible for a piece of the workflow.
-
example_file_x
: Reproducibly track changes to the Stan model file. -
example_data
: Run the code you supplied to thedata
argument oftar_stan_mcmc()
and return a dataset compatible with Stan. -
example_mcmc_x
: Run the MCMC and return an object of classCmdStanMCMC
. -
example_draws_X
: Return a friendlytibble
of the posterior draws fromexample
. Uses the$draws()
method. Suppress withdraws = FALSE
intar_stan_mcmc()
. -
example_summaries_x
: Return a friendlytibble
of the posterior summaries fromexample
. Uses the$summary()
method. Suppress withsummary = FALSE
intar_stan_mcmc()
. -
example_diagnostics_x
: Return a friendlytibble
of the sampler diagnostics fromexample
. Uses the$sampler_diagnostics()
method. Suppress withdiagnostics = FALSE
intar_stan_mcmc()
.
The suffix _x
comes from the base name of the model
file, in this case x.stan
. If you supply multiple model
files to the stan_files
argument, all the models share the
same dataset, and the suffixes distinguish among the various
targets.
targets
produces a graph to show the dependency relationships among the targets.
Below, the MCMC depends on the model file and the data, and the draws,
summary, and diagnostics depend on the MCMC.
tar_visnetwork(targets_only = TRUE)
Run the computation with tar_make()
.
tar_make()
#> ▶ dispatched target example_data
#> ● completed target example_data [0.001 seconds, 219 bytes]
#> ▶ dispatched target example_file_x
#> ● completed target example_file_x [0 seconds, 166 bytes]
#> ▶ dispatched target example_mcmc_x
#> ● completed target example_mcmc_x [8.98 seconds, 178.216 kilobytes]
#> ▶ dispatched target example_diagnostics_x
#> ● completed target example_diagnostics_x [0.001 seconds, 85.808 kilobytes]
#> ▶ dispatched target example_summary_x
#> ● completed target example_summary_x [0.041 seconds, 1.076 kilobytes]
#> ▶ dispatched target example_draws_x
#> ● completed target example_draws_x [0.002 seconds, 72.04 kilobytes]
#> ▶ ended pipeline [9.62 seconds]
#>
The output lives in a special folder called _targets/
and you can retrieve it with functions tar_load()
and
tar_read()
(from targets
).
tar_read(example_summary_x)
#> # A tibble: 2 × 11
#> variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 lp__ -4.20 -3.93 0.706 0.311 -5.64 -3.70 1.00 1882. 1839.
#> 2 beta -0.885 -0.886 0.442 0.443 -1.60 -0.148 1.00 1654. 1614.
#> # ℹ 1 more variable: .join_data <dbl>
At this point, all our results are up to date because their dependencies did not change.
tar_make()
#> ✔ skipped target example_data
#> ✔ skipped target example_file_x
#> ✔ skipped target example_mcmc_x
#> ✔ skipped target example_diagnostics_x
#> ✔ skipped target example_summary_x
#> ✔ skipped target example_draws_x
#> ✔ skipped pipeline [0.071 seconds]
#>
But if we change the underlying code or data, some of the targets
will no longer be valid, and they will rerun during the next
tar_make()
. Below, we change the Stan model file, so the
MCMC reruns while the data is skipped. This behavior saves time and
enhances reproducibility.
write(" ", file = "x.stan", append = TRUE)
tar_outdated()
#> [1] "example_summary_x" "example_diagnostics_x" "example_file_x"
#> [4] "example_draws_x" "example_mcmc_x"
tar_visnetwork(targets_only = TRUE)
tar_make()
#> ✔ skipped target example_data
#> ▶ dispatched target example_file_x
#> ● completed target example_file_x [0 seconds, 168 bytes]
#> ▶ dispatched target example_mcmc_x
#> ● completed target example_mcmc_x [9.082 seconds, 178.226 kilobytes]
#> ▶ dispatched target example_diagnostics_x
#> ● completed target example_diagnostics_x [0.002 seconds, 85.808 kilobytes]
#> ▶ dispatched target example_summary_x
#> ● completed target example_summary_x [0.042 seconds, 1.076 kilobytes]
#> ▶ dispatched target example_draws_x
#> ● completed target example_draws_x [0.002 seconds, 72.04 kilobytes]
#> ▶ ended pipeline [9.693 seconds]
#>
At this point, we can add more targets and custom functions for additional post-processing.
# _targets.R
library(targets)
library(stantargets)
generate_data <- function(n = 10) {
true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- stats::rnorm(n, x * true_beta, 1)
list(n = n, x = x, y = y, true_beta = true_beta)
}
list(
tar_stan_mcmc(
example,
"x.stan",
generate_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
),
tar_stan_summary(
custom_summary,
fit = example_mcmc_x,
summaries = list(~posterior::quantile2(.x, probs = c(0.25, 0.75)))
)
)
In the graph, our new custom_summary
target should be
connected to the upstream example
target, and only
custom_summary
should be out of date.
tar_visnetwork(targets_only = TRUE)
In the next tar_make()
, we skip the expensive MCMC and
just run the custom summary.
tar_make()
#> ✔ skipped target example_data
#> ✔ skipped target example_file_x
#> ✔ skipped target example_mcmc_x
#> ✔ skipped target example_diagnostics_x
#> ✔ skipped target example_summary_x
#> ▶ dispatched target custom_summary
#> ● completed target custom_summary [0.023 seconds, 580 bytes]
#> ✔ skipped target example_draws_x
#> ▶ ended pipeline [0.207 seconds]
#>
tar_read(custom_summary)
#> # A tibble: 2 × 4
#> variable q25 q75 .join_data
#> <chr> <dbl> <dbl> <dbl>
#> 1 lp__ -4.36 -3.75 NA
#> 2 beta -1.18 -0.580 NA
Multiple models
tar_stan_mcmc()
and related functions allow you to
supply multiple models to stan_files
. If you do, each model
will run on the same dataset. Consider a new model
y.stan
.
lines <- "data {
int <lower = 1> n;
vector[n] x;
vector[n] y;
real true_beta;
}
parameters {
real beta;
}
model {
y ~ normal(x * x * beta, 1); // Regress on x^2 instead of x.
beta ~ normal(0, 1);
}"
writeLines(lines, "y.stan")
To include this y.stan
, we add it to the
stan_files
argument of tar_stan_mcmc()
.
#> [1] FALSE
# _targets.R
library(targets)
library(stantargets)
generate_data <- function(n = 10) {
true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- stats::rnorm(n, x * true_beta, 1)
list(n = n, x = x, y = y, true_beta = true_beta)
}
list(
tar_stan_mcmc(
example,
c("x.stan", "y.stan"), # another model
generate_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
),
tar_stan_summary(
custom_summary,
fit = example_mcmc_x,
summaries = list(~posterior::quantile2(.x, probs = c(0.25, 0.75)))
)
)
In the graph below, notice how the *_x
targets and
*_y
targets are both connected to example_data
upstream.
tar_visnetwork(targets_only = TRUE)
Generated quantities
It is possible to use the CmdStanMCMC
object from one
run to simulate generated quantities downstream. For example, the
tar_stan_gq_rep_summaries()
function takes a single
CmdStanMCMC
object, produces multiple replications of
generated quantities from multiple models, and aggregates the summaries
from each. The following pipeline uses this technique to repeatedly draw
from the posterior predictive distribution.
lines <- "data {
int <lower = 1> n;
vector[n] x;
vector[n] y;
}
parameters {
real beta;
}
model {
y ~ normal(x * beta, 1);
beta ~ normal(0, 1);
}
generated quantities {
array[n] real y_rep = normal_rng(x * beta, 1); // posterior predictive draws
}"
writeLines(lines, "gen.stan")
# _targets.R
library(targets)
library(stantargets)
generate_data <- function(n = 10) {
true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- stats::rnorm(n, x * true_beta, 1)
list(n = n, x = x, y = y, true_beta = true_beta)
}
list(
tar_stan_mcmc(
example,
"x.stan",
generate_data(),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
),
tar_stan_gq_rep_summary(
postpred,
stan_files = "gen.stan",
fitted_params = example_mcmc_x, # one CmdStanFit object
data = generate_data(), # Function runs once per rep.
batches = 2, # 2 dynamic branches
reps = 5, # 5 replications per branch
quiet = TRUE,
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
)
)
Since we have defined many objects in the pipeline, it is extra important to check the graph to be sure everything is connected.
tar_visnetwork(targets_only = TRUE)
Then, we run the computation. The original MCMC is already up to date, so we only run the targets relevant to the generated quantities.
tar_make()
#> ▶ dispatched target postpred_batch
#> ● completed target postpred_batch [0.001 seconds, 98 bytes]
#> ▶ dispatched target postpred_file_gen
#> ● completed target postpred_file_gen [8.634 seconds, 2.615 megabytes]
#> ✔ skipped target example_data
#> ✔ skipped target example_file_x
#> ▶ dispatched branch postpred_data_3f82502bb81fc7e3
#> ● completed branch postpred_data_3f82502bb81fc7e3 [0.005 seconds, 670 bytes]
#> ▶ dispatched branch postpred_data_f7ab7117578bb2a5
#> ● completed branch postpred_data_f7ab7117578bb2a5 [0.001 seconds, 672 bytes]
#> ● completed pattern postpred_data
#> ✔ skipped target example_mcmc_x
#> ▶ dispatched branch postpred_gen_7f380f9fd103c950
#> ● completed branch postpred_gen_7f380f9fd103c950 [3.284 seconds, 8.241 kilobytes]
#> ▶ dispatched branch postpred_gen_b23471e06f80f1f3
#> ● completed branch postpred_gen_b23471e06f80f1f3 [3.107 seconds, 8.241 kilobytes]
#> ● completed pattern postpred_gen
#> ✔ skipped target example_diagnostics_x
#> ✔ skipped target example_summary_x
#> ✔ skipped target example_draws_x
#> ▶ dispatched target postpred
#> ● completed target postpred [0.001 seconds, 13.564 kilobytes]
#> ▶ ended pipeline [15.764 seconds]
#>
Finally, we read the summaries of posterior predictive samples.
tar_read(postpred)
#> # A tibble: 100 × 16
#> variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 y_rep[1] 0.869 0.877 1.08 1.05 -0.945 2.65 1.00 3486. 3724.
#> 2 y_rep[2] 0.638 0.640 1.07 1.09 -1.09 2.40 1.00 3120. 3392.
#> 3 y_rep[3] 0.450 0.432 1.01 1.03 -1.16 2.14 0.999 3581. 3351.
#> 4 y_rep[4] 0.345 0.311 1.03 1.06 -1.33 2.05 1.00 4070. 4043.
#> 5 y_rep[5] 0.0632 0.0414 1.01 0.982 -1.57 1.76 1.00 3797. 4103.
#> 6 y_rep[6] -0.110 -0.0980 1.00 0.965 -1.78 1.57 0.999 4311. 3900.
#> 7 y_rep[7] -0.277 -0.263 1.02 0.997 -1.96 1.36 1.00 3611. 3701.
#> 8 y_rep[8] -0.451 -0.432 1.01 0.971 -2.13 1.18 1.00 3260. 3615.
#> 9 y_rep[9] -0.709 -0.696 1.04 1.04 -2.42 0.979 1.00 3823. 3883.
#> 10 y_rep[10] -0.875 -0.893 1.13 1.13 -2.71 1.01 1.00 3688. 3526.
#> # ℹ 90 more rows
#> # ℹ 6 more variables: .join_data <dbl>, .rep <chr>, .dataset_id <chr>,
#> # .seed <int>, .file <chr>, .name <chr>
More information
For more on targets
,
please visit the reference website https://docs.ropensci.org/targets/ or the user manual https://books.ropensci.org/targets/. The manual walks
though advanced features of targets
such as high-performance
computing and cloud
storage support.