Background

The introductory vignette caters to Bayesian data analysis workflows with only a few datasets to analyze. However, it is sometimes desirable to run one or more Bayesian models repeatedly across multiple simulated datasets. Examples:

  1. Validate the implementation of a Bayesian model using simulation.
  2. Simulate a randomized controlled experiment to explore frequentist properties such as power and Type I error.

This vignette focuses on (1).

Example project

Visit https://github.com/wlandau/stantargets-example-validation for an example project based on this vignette. The example has an RStudio Cloud workspace, which allows you to run the project in a web browser.

Simulation-based validation study

This particular example uses the concept of calibration that Bob Carpenter explains in a blog post (Carpenter 2017). The goal is to simulate multiple datasets from the model below, analyze each dataset, and assess how often the estimated posterior intervals cover the true parameters from the prior predictive simulations. If coverage is not systematically different from nominal, this is evidence that the model was implemented correctly. The quantile method by Cook, Gelman, and Rubin (2006) generalizes this concept, and simulation-based calibration (SBC; Talts et al. 2020) generalizes it further. The interval-based technique featured in this vignette is not as robust as SBC, but it may be more expedient for large models because it does not require visual inspection of multiple histograms.
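To make the coverage check concrete, here is a toy sketch in plain R with made-up numbers (not part of the pipeline): for a single simulated dataset, we ask whether the true parameter value drawn from the prior falls inside the estimated 95% posterior interval.

beta_true <- 0.8                          # true value drawn from the prior
interval <- c(q2.5 = -0.1, q97.5 = 1.3)   # 95% posterior interval from one fit
interval[["q2.5"]] < beta_true && beta_true < interval[["q97.5"]]
#> [1] TRUE

Repeating this check across many simulated datasets and averaging the results gives the coverage statistic. The Stan model for this study is a simple linear regression of y on x with standard normal priors on the coefficients, written to the file model.stan: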

lines <- "data {
  int <lower = 1> n;
  vector[n] x;
  vector[n] y;
}
parameters {
  vector[2] beta;
}
model {
  y ~ normal(beta[1] + x * beta[2], 1);
  beta ~ normal(0, 1);
}"
writeLines(lines, "model.stan")

Next, we define a pipeline to simulate multiple datasets and fit each dataset with the model. In our data-generating function, we put the true parameter values of each simulation in a special .join_data list. stantargets will automatically join the elements of .join_data to the correspondingly named variables in the summary output, which makes it easy to check how often our posterior intervals capture the truth. As for scale, we generate 10 datasets (5 batches with 2 replications each) and run the model on each of the 10 datasets.1 By default, each of the 10 model runs computes 4 MCMC chains with 2000 MCMC iterations each (including burn-in), which you can adjust with the chains, iter_sampling, and iter_warmup arguments of tar_stan_mcmc_rep_summary().

# _targets.R
library(targets)
library(stantargets)
options(crayon.enabled = FALSE)
# Use computer memory more sparingly:
tar_option_set(memory = "transient", garbage_collection = TRUE)

simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}

list(
  tar_stan_mcmc_rep_summary(
    model,
    "model.stan",
    simulate_data(), # Runs once per rep.
    batches = 5, # Number of branch targets.
    reps = 2, # Number of model reps per branch target.
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.975))
    ),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  )
)

We now have a pipeline that runs the model 10 times: 5 batches (branch targets) with 2 replications per batch.
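Before running anything, you can optionally list the targets the pipeline defines with tar_manifest() from the targets package (output omitted here):

# Show one row per target, including the command each target runs.
tar_manifest()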

Run the computation with tar_make().

tar_make()
#> • start target model_batch
#> • built target model_batch
#> • start target model_file_model
#> • built target model_file_model
#> • start branch model_data_b0b9380a
#> • built branch model_data_b0b9380a
#> • start branch model_data_ffcdb73c
#> • built branch model_data_ffcdb73c
#> • start branch model_data_b968a03a
#> • built branch model_data_b968a03a
#> • start branch model_data_f8763cb2
#> • built branch model_data_f8763cb2
#> • start branch model_data_0bfdabdc
#> • built branch model_data_0bfdabdc
#> • built pattern model_data
#> • start branch model_model_5d061b58
#> • built branch model_model_5d061b58
#> • start branch model_model_a9336683
#> • built branch model_model_a9336683
#> • start branch model_model_bde6a6d6
#> • built branch model_model_bde6a6d6
#> • start branch model_model_384f982f
#> • built branch model_model_384f982f
#> • start branch model_model_0d59666a
#> • built branch model_model_0d59666a
#> • built pattern model_model
#> • start target model
#> • built target model
#> • end pipeline

The result is an aggregated data frame of summary statistics, where the .rep column distinguishes among individual replicates. The posterior intervals for beta are in columns q2.5 and q97.5, and thanks to .join_data in simulate_data(), the special .join_data column in the output records the true value of each parameter from its simulation.

tar_load(model)
model
#> # A tibble: 20 × 7
#>    variable    q2.5   q97.5 .join_data .rep     .file      .name
#>    <chr>      <dbl>   <dbl>      <dbl> <chr>    <chr>      <chr>
#>  1 beta[1]   0.411   1.58        0.768 44806031 model.stan model
#>  2 beta[2]  -0.816   0.916       0.163 44806031 model.stan model
#>  3 beta[1]  -0.610   0.582      -0.602 8d780769 model.stan model
#>  4 beta[2]  -0.527   1.27        1.15  8d780769 model.stan model
#>  5 beta[1]   0.0582  1.26        0.631 c5dcc6df model.stan model
#>  6 beta[2]  -2.95   -1.16       -2.68  c5dcc6df model.stan model
#>  7 beta[1]  -0.740   0.446      -0.481 8c8449cd model.stan model
#>  8 beta[2]  -0.524   1.25        0.256 8c8449cd model.stan model
#>  9 beta[1]  -0.0260  1.14        0.875 32077a38 model.stan model
#> 10 beta[2]  -1.33    0.453      -0.543 32077a38 model.stan model
#> 11 beta[1]  -0.820   0.354      -0.450 cc5b2f80 model.stan model
#> 12 beta[2]  -0.113   1.64        1.31  cc5b2f80 model.stan model
#> 13 beta[1]   0.920   2.11        1.50  95b5f1c3 model.stan model
#> 14 beta[2]   0.894   2.68        1.35  95b5f1c3 model.stan model
#> 15 beta[1]  -3.37   -2.22       -2.84  88732b4f model.stan model
#> 16 beta[2]  -1.18    0.552      -0.164 88732b4f model.stan model
#> 17 beta[1]  -0.646   0.511      -0.679 4c00e370 model.stan model
#> 18 beta[2]  -1.55    0.146      -0.469 4c00e370 model.stan model
#> 19 beta[1]  -1.10    0.0979     -0.158 0dd342ae model.stan model
#> 20 beta[2]  -0.695   1.06        0.397 0dd342ae model.stan model

Now, let’s assess how often the estimated 95% posterior intervals capture the true values of beta. If the model is implemented correctly, the coverage value below should be close to 95%. (Ordinarily, we would increase the number of batches and reps per batch and run the batches in parallel, e.g. on distributed computing resources.)

library(dplyr)
model %>%
  group_by(variable) %>%
  summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
#> # A tibble: 2 × 2
#>   variable coverage
#>   <chr>       <dbl>
#> 1 beta[1]       0.9
#> 2 beta[2]       1
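With only 10 replications per parameter, these coverage estimates are noisy, so a value of 0.9 for beta[1] is still consistent with a nominal 95% interval. As a rough sanity check (a sketch using base R's binom.test(), not part of the pipeline), we can compute an exact binomial confidence interval for the underlying coverage probability:

# 95% confidence interval for the true coverage probability of beta[1],
# based on 9 covered intervals out of 10 replications.
binom.test(x = 9, n = 10)$conf.int

The interval is wide (roughly 0.55 to 1), which is one reason a serious validation study uses far more replications than this toy example.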

For maximum reproducibility, we should express the coverage assessment as its own target in the pipeline so it stays up to date with the model results.

# _targets.R
library(targets)
library(stantargets)
library(dplyr) # for group_by() and summarize() in the coverage target

simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}

list(
  tar_stan_mcmc_rep_summary(
    model,
    "model.stan",
    simulate_data(),
    batches = 5, # Number of branch targets.
    reps = 2, # Number of model reps per branch target.
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.975))
    ),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  ),
  tar_target(
    coverage,
    model %>%
      group_by(variable) %>%
      summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
  )
)

The new coverage target should be the only outdated target, and it should be connected to the upstream model target.
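You can verify this without running anything, for example with tar_outdated(), which lists the targets that are not up to date (output omitted):

# Should list only the coverage target, since model and its upstream
# targets are still valid.
tar_outdated()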

When we run the pipeline, only the coverage assessment should run. That way, we skip all the expensive computation of simulating datasets and running MCMC multiple times.

tar_make()
#> ✔ skip target model_batch
#> ✔ skip target model_file_model
#> ✔ skip branch model_data_b0b9380a
#> ✔ skip branch model_data_ffcdb73c
#> ✔ skip branch model_data_b968a03a
#> ✔ skip branch model_data_f8763cb2
#> ✔ skip branch model_data_0bfdabdc
#> ✔ skip pattern model_data
#> ✔ skip branch model_model_5d061b58
#> ✔ skip branch model_model_a9336683
#> ✔ skip branch model_model_bde6a6d6
#> ✔ skip branch model_model_384f982f
#> ✔ skip branch model_model_0d59666a
#> ✔ skip pattern model_model
#> ✔ skip target model
#> • start target coverage
#> • built target coverage
#> • end pipeline
tar_read(coverage)
#> # A tibble: 2 × 2
#>   variable coverage
#>   <chr>       <dbl>
#> 1 beta[1]       0.9
#> 2 beta[2]       1
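At this point, scaling the study up is mostly a matter of increasing batches and reps and running the branch targets in parallel. Below is a minimal sketch assuming the crew package is installed; the controller and worker count are illustrative and not part of the original pipeline.

# _targets.R (excerpt): let up to 4 branch targets run at once on local workers.
library(targets)
tar_option_set(
  controller = crew::crew_controller_local(workers = 4)
)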

Multiple models

tar_stan_mcmc_rep_summary() and similar functions allow you to supply multiple Stan models. If you do, each model will share the same collection of datasets. Suppose we have a new model, model2.stan.

lines <- "data {
  int <lower = 1> n;
  vector[n] x;
  vector[n] y;
}
parameters {
  vector[2] beta;
}
model {
  y ~ normal(beta[1] + x * x * beta[2], 1); // Regress on x^2 instead of x.
  beta ~ normal(0, 1);
}"
writeLines(lines, "model2.stan")

To set up the simulation workflow to run on both models, we add model2.stan to the stan_files argument of tar_stan_mcmc_rep_summary(). In the coverage summary below, we also group by .name to compute a separate coverage statistic for each model.

# _targets.R
library(targets)
library(stantargets)
library(dplyr) # for group_by() and summarize() in the coverage target

simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}

list(
  tar_stan_mcmc_rep_summary(
    model,
    c("model.stan", "model2.stan"), # another model
    simulate_data(),
    batches = 5,
    reps = 2,
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.975))
    ),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  ),
  tar_target(
    coverage,
    model %>%
      group_by(.name, variable) %>%
      summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
  )
)

In the pipeline's dependency graph, notice how targets model_model and model_model2 are both connected to model_data upstream. Downstream, model is equivalent to dplyr::bind_rows(model_model, model_model2), and it has special columns .name and .file to distinguish among the models.
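To render the dependency graph yourself (this assumes the pipeline above is saved as _targets.R and the visNetwork package is installed), call:

# Display only the targets, hiding global functions and objects.
tar_visnetwork(targets_only = TRUE)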

References

Carpenter, Bob. 2017. “Bayesian Posteriors are Calibrated by Definition.” https://statmodeling.stat.columbia.edu/2017/04/12/bayesian-posteriors-calibrated/.

Cook, Samantha R., Andrew Gelman, and Donald B. Rubin. 2006. “Validation of Software for Bayesian Models Using Posterior Quantiles.” Journal of Computational and Graphical Statistics 15 (3): 675–92. http://www.jstor.org/stable/27594203.

Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2020. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” http://arxiv.org/abs/1804.06788.


  1. Internally, each batch is a dynamic branch target, and the number of replications determines the amount of work done within a branch. In the general case, batching is a way to find the right compromise between target-specific overhead and the horizontal scale of the pipeline. For example, batches = 25 with reps = 40 runs 1000 simulations while creating only 25 branch targets.