
Background

The introductory vignette caters to Bayesian data analysis workflows with only a few datasets to analyze. However, it is sometimes desirable to run one or more Bayesian models repeatedly across multiple simulated datasets. Examples:

  1. Validate the implementation of a Bayesian model using simulation.
  2. Simulate a randomized controlled experiment to explore frequentist properties such as power and Type I error.

This vignette focuses on (1).

Example project

Visit https://github.com/wlandau/stantargets-example-validation for an example project based on this vignette. The example has an RStudio Cloud workspace which allows you to run the project in a web browser.

Interval-based model validation pipeline

This particular example uses the concept of calibration that Bob Carpenter explains in a blog post (Carpenter 2017). The goal is to simulate multiple datasets from the model below, analyze each dataset, and assess how often the estimated posterior intervals cover the true parameters from the prior predictive simulations. If coverage is not systematically different from the nominal level, this is evidence that the model was implemented correctly. The quantile method of Cook, Gelman, and Rubin (2006) generalizes this concept, and simulation-based calibration (Talts et al. 2020) generalizes it further. The interval-based technique featured in this vignette is not as robust as SBC, but it may be more expedient for large models because it does not require visual inspection of multiple histograms. See a later section in this vignette for an example of simulation-based calibration on this same model.

lines <- "data {
  int <lower = 1> n;
  vector[n] x;
  vector[n] y;
}
parameters {
  vector[2] beta;
}
model {
  y ~ normal(beta[1] + x * beta[2], 1);
  beta ~ normal(0, 1);
}"
writeLines(lines, "model.stan")

Next, we define a pipeline to simulate multiple datasets and fit each dataset with the model. In our data-generating function, we put the true parameter values of each simulation in a special .join_data list. stantargets automatically joins the elements of .join_data to the correspondingly named variables in the summary output, which makes it easy to check how often our posterior intervals capture the truth. To keep the example small, we generate 10 datasets (5 batches with 2 replications each) and run the model on each of them. By default, each of the 10 model runs computes 4 MCMC chains with 2000 MCMC iterations each (including warmup); you can adjust these settings with the chains, iter_sampling, and iter_warmup arguments of tar_stan_mcmc_rep_summary().
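
For illustration only, the sketch below shows how those MCMC settings could be overridden; the chains, iter_warmup, and iter_sampling values are arbitrary examples, and the snippet is not part of the pipeline that follows.

tar_stan_mcmc_rep_summary(
  model,
  "model.stan",
  simulate_data(),
  batches = 5,
  reps = 2,
  variables = "beta",
  chains = 2,           # fewer chains per model run
  iter_warmup = 500,    # warmup iterations per chain
  iter_sampling = 1000  # post-warmup iterations per chain
)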

# _targets.R
library(targets)
library(stantargets)
options(crayon.enabled = FALSE)
# Use computer memory more sparingly:
tar_option_set(memory = "transient", garbage_collection = TRUE)

simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}

list(
  tar_stan_mcmc_rep_summary(
    model,
    "model.stan",
    simulate_data(), # Runs once per rep.
    batches = 5, # Number of branch targets.
    reps = 2, # Number of model reps per branch target.
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.975))
    ),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  )
)

We now have a pipeline that runs the model 10 times: 5 batches (branch targets) with 2 replications per batch.

tar_visnetwork()

Run the computation with tar_make().

tar_make()
#>  dispatched target model_batch
#>  completed target model_batch [0.001 seconds]
#>  dispatched target model_file_model
#>  completed target model_file_model [15.249 seconds]
#>  dispatched branch model_data_b0b9380a
#>  completed branch model_data_b0b9380a [0.008 seconds]
#>  dispatched branch model_data_ffcdb73c
#>  completed branch model_data_ffcdb73c [0.003 seconds]
#>  dispatched branch model_data_b968a03a
#>  completed branch model_data_b968a03a [0.003 seconds]
#>  dispatched branch model_data_f8763cb2
#>  completed branch model_data_f8763cb2 [0.003 seconds]
#>  dispatched branch model_data_0bfdabdc
#>  completed branch model_data_0bfdabdc [0.003 seconds]
#>  completed pattern model_data
#>  dispatched branch model_model_5d061b58
#>  completed branch model_model_5d061b58 [1.529 seconds]
#>  dispatched branch model_model_a9336683
#>  completed branch model_model_a9336683 [1.307 seconds]
#>  dispatched branch model_model_bde6a6d6
#>  completed branch model_model_bde6a6d6 [1.31 seconds]
#>  dispatched branch model_model_384f982f
#>  completed branch model_model_384f982f [1.309 seconds]
#>  dispatched branch model_model_0d59666a
#>  completed branch model_model_0d59666a [1.312 seconds]
#>  completed pattern model_model
#>  dispatched target model
#>  completed target model [0.001 seconds]
#>  ended pipeline [23.279 seconds]

The result is an aggregated data frame of summary statistics, where the .rep column distinguishes among individual replicates. We have the posterior intervals for beta in columns q2.5 and q97.5. And thanks to .join_data in simulate_data(), there is a special .join_data column in the output to indicate the true value of each parameter from the simulation.

tar_load(model)
model
#> # A tibble: 20 × 9
#>    variable    q2.5    q97.5 .join_data .rep      .dataset_id  .seed .file .name
#>    <chr>      <dbl>    <dbl>      <dbl> <chr>     <chr>        <int> <chr> <chr>
#>  1 beta[1]  -0.920   0.223      -0.199  79f81b21… model_data… 1.15e9 mode… model
#>  2 beta[2]  -0.189   1.58        1.29   79f81b21… model_data… 1.15e9 mode… model
#>  3 beta[1]   0.0542  1.28        0.783  4c16247b… model_data… 5.38e8 mode… model
#>  4 beta[2]  -1.05    0.655       0.241  4c16247b… model_data… 5.38e8 mode… model
#>  5 beta[1]   0.160   1.35        0.825  6d6a140b… model_data… 7.25e8 mode… model
#>  6 beta[2]   0.343   2.10        1.05   6d6a140b… model_data… 7.25e8 mode… model
#>  7 beta[1]   1.27    2.45        1.94   bd102c21… model_data… 5.61e8 mode… model
#>  8 beta[2]  -1.77    0.00163    -0.110  bd102c21… model_data… 5.61e8 mode… model
#>  9 beta[1]  -0.665   0.509       0.0841 03ae35b2… model_data… 1.57e8 mode… model
#> 10 beta[2]  -0.571   1.10       -0.114  03ae35b2… model_data… 1.57e8 mode… model
#> 11 beta[1]   0.0290  1.25        0.715  97fd3bc5… model_data… 3.10e8 mode… model
#> 12 beta[2]  -0.177   1.52        0.751  97fd3bc5… model_data… 3.10e8 mode… model
#> 13 beta[1]  -0.720   0.427      -0.434  fab012f0… model_data… 3.07e7 mode… model
#> 14 beta[2]   0.132   1.82        0.790  fab012f0… model_data… 3.07e7 mode… model
#> 15 beta[1]  -2.17   -1.04       -1.24   65033a7c… model_data… 1.28e9 mode… model
#> 16 beta[2]  -1.19    0.596      -0.604  65033a7c… model_data… 1.28e9 mode… model
#> 17 beta[1]  -0.266   0.903      -0.0354 ffae976f… model_data… 2.11e9 mode… model
#> 18 beta[2]  -1.32    0.378       0.305  ffae976f… model_data… 2.11e9 mode… model
#> 19 beta[1]  -0.806   0.411       0.137  ba3791e8… model_data… 1.10e9 mode… model
#> 20 beta[2]  -0.525   1.26        0.528  ba3791e8… model_data… 1.10e9 mode… model

Now, let’s assess how often the estimated 95% posterior intervals capture the true values of beta. If the model is implemented correctly, the coverage value below should be close to 95%. (Ordinarily, we would increase the number of batches and reps per batch and run the batches in parallel, as sketched after the output below.)

library(dplyr)
model %>%
  group_by(variable) %>%
  summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
#> # A tibble: 2 × 2
#>   variable coverage
#>   <chr>       <dbl>
#> 1 beta[1]         1
#> 2 beta[2]         1
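
As noted above, a real validation study would use many more simulations, and the branch targets could run in parallel. One option (a sketch, assuming the crew package is installed) is to register a crew controller in _targets.R so that tar_make() distributes the simulation and MCMC branches across parallel worker processes.

# _targets.R (excerpt)
library(targets)
library(stantargets)
tar_option_set(
  # Run up to 4 branch targets at a time in separate R processes:
  controller = crew::crew_controller_local(workers = 4)
)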

For maximum reproducibility, we should express the coverage assessment as a new target in the pipeline.

# _targets.R
library(targets)
library(stantargets)
tar_option_set(packages = "dplyr") # the coverage target below uses dplyr verbs

simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}

list(
  tar_stan_mcmc_rep_summary(
    model,
    "model.stan",
    simulate_data(),
    batches = 5, # Number of branch targets.
    reps = 2, # Number of model reps per branch target.
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.975))
    ),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  ),
  tar_target(
    coverage,
    model %>%
      group_by(variable) %>%
      summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
  )
)

The new coverage target should be the only outdated target, and it should be connected to the upstream model target.

tar_visnetwork()

When we run the pipeline, only the coverage assessment should run. That way, we skip all the expensive computation of simulating datasets and running MCMC multiple times.

tar_make()
#>  skipped target model_batch
#>  skipped target model_file_model
#>  skipped branch model_data_b0b9380a
#>  skipped branch model_data_ffcdb73c
#>  skipped branch model_data_b968a03a
#>  skipped branch model_data_f8763cb2
#>  skipped branch model_data_0bfdabdc
#>  skipped pattern model_data
#>  skipped branch model_model_5d061b58
#>  skipped branch model_model_a9336683
#>  skipped branch model_model_bde6a6d6
#>  skipped branch model_model_384f982f
#>  skipped branch model_model_0d59666a
#>  skipped pattern model_model
#>  skipped target model
#>  dispatched target coverage
#>  completed target coverage [0.011 seconds]
#>  ended pipeline [0.205 seconds]
tar_read(coverage)
#> # A tibble: 2 × 2
#>   variable coverage
#>   <chr>       <dbl>
#> 1 beta[1]         1
#> 2 beta[2]         1

Multiple models

tar_stan_mcmc_rep_summary() and similar functions allow you to supply multiple Stan models. If you do, each model will share the same collection of datasets, and the .dataset_id column of the model target output allows for custom analyses that compare the models against each other (an example sketch appears at the end of this section). Suppose we have a new model, model2.stan.

lines <- "data {
  int <lower = 1> n;
  vector[n] x;
  vector[n] y;
}
parameters {
  vector[2] beta;
}
model {
  y ~ normal(beta[1] + x .* x * beta[2], 1); // Regress on x^2 instead of x.
  beta ~ normal(0, 1);
}"
writeLines(lines, "model2.stan")

To set up the simulation workflow to run on both models, we add model2.stan to the stan_files argument of tar_stan_mcmc_rep_summary(). In the coverage summary below, we also group by .name to compute a separate coverage statistic for each model.

# _targets.R
library(targets)
library(stantargets)
tar_option_set(packages = "dplyr") # the coverage target below uses dplyr verbs

simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}

list(
  tar_stan_mcmc_rep_summary(
    model,
    c("model.stan", "model2.stan"), # another model
    simulate_data(),
    batches = 5,
    reps = 2,
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.975))
    ),
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile()
  ),
  tar_target(
    coverage,
    model %>%
      group_by(.name, variable) %>%
      summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
  )
)

In the graph below, notice how targets model_model and model_model2 are both connected to model_data upstream. Downstream, model is equivalent to dplyr::bind_rows(model_model, model_model2), and it will have special columns .name and .file to distinguish among all the models.

tar_visnetwork()
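
Because both models analyze the same simulated datasets, the .dataset_id column supports head-to-head comparisons beyond the coverage target above. The sketch below (not part of the pipeline; width and covered are ad hoc helper columns) lines up interval width and coverage for the two models on each dataset and parameter.

library(dplyr)
library(tidyr)
tar_load(model)
model %>%
  mutate(
    width = q97.5 - q2.5,                            # interval width
    covered = q2.5 < .join_data & .join_data < q97.5 # interval contains the truth?
  ) %>%
  select(.dataset_id, .name, variable, width, covered) %>%
  pivot_wider(names_from = .name, values_from = c(width, covered))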

Simulation-based calibration

This section explores a more rigorous validation study that adopts the full simulation-based calibration (SBC) method of Talts et al. (2020). To use this method, we need a function that generates rank statistics from a simulated dataset and a data frame of posterior draws. If the model is implemented correctly, these rank statistics will be uniformly distributed for each model parameter. Our function will use the calculate_ranks_draws_matrix() function from the SBC R package (Kim et al. 2022).

get_ranks <- function(data, draws) {
  draws <- select(draws, starts_with(names(data$.join_data)))
  truth <- map_dbl(
    names(draws),
    ~eval(parse(text = .x), envir = data$.join_data)
  )
  out <- SBC::calculate_ranks_draws_matrix(truth, as_draws_matrix(draws))
  as_tibble(as.list(out))
}

To demonstrate this function, we simulate a dataset,

simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}

data <- simulate_data()

str(data)
#> List of 4
#>  $ n         : int 10
#>  $ x         : num [1:10] -1 -0.778 -0.556 -0.333 -0.111 ...
#>  $ y         : num [1:10] 2.27 1.49 1.6 1.01 2.02 ...
#>  $ .join_data:List of 1
#>   ..$ beta: num [1:2] 1.147 0.347

we make up a hypothetical set of posterior draws,

draws <- tibble(`beta[1]` = rnorm(100), `beta[2]` = rnorm(100))

draws
#> # A tibble: 100 × 2
#>    `beta[1]` `beta[2]`
#>        <dbl>     <dbl>
#>  1     0.653     0.425
#>  2    -0.786    -0.688
#>  3     0.961     0.523
#>  4     0.361    -1.19 
#>  5    -0.114     1.53 
#>  6     0.351    -1.37 
#>  7    -1.68      0.915
#>  8    -1.30     -0.839
#>  9    -1.70     -1.81 
#> 10     0.351     0.516
#> # ℹ 90 more rows

and we call get_ranks() to get the SBC rank statistics for each model parameter.

library(dplyr)
library(posterior)
library(purrr)
get_ranks(data = data, draws = draws)
#> # A tibble: 1 × 2
#>   `beta[1]` `beta[2]`
#>       <dbl>     <dbl>
#> 1        92        64

To put this into practice in a pipeline, we supply the symbol get_ranks to the transform argument of tar_stan_mcmc_rep_draws(). That way, instead of a full set of draws, each replication will return only the output of get_ranks() on those draws (plus a few helper columns). If supplied, the transform argument of tar_stan_mcmc_rep_draws() must be the name of a function in the pipeline. This function must accept arguments data and draws, and it must return a data frame.

# _targets.R
library(targets)
library(stantargets)

tar_option_set(packages = c("dplyr", "posterior", "purrr", "tibble"))

simulate_data <- function(n = 10L) {
  beta <- rnorm(n = 2, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- rnorm(n, beta[1] + x * beta[2], 1)
  list(
    n = n,
    x = x,
    y = y,
    .join_data = list(beta = beta)
  )
}

get_ranks <- function(data, draws) {
  draws <- select(draws, starts_with(names(data$.join_data)))
  truth <- map_dbl(
    names(draws),
    ~eval(parse(text = .x), envir = data$.join_data)
  )
  out <- SBC::calculate_ranks_draws_matrix(truth, as_draws_matrix(draws))
  as_tibble(as.list(out))
}

list(
  tar_stan_mcmc_rep_draws(
    model,
    c("model.stan"),
    simulate_data(),
    batches = 5,
    reps = 2,
    variables = "beta",
    stdout = R.utils::nullfile(),
    stderr = R.utils::nullfile(),
    transform = get_ranks # Supply the transform to get SBC ranks.
  )
)

Our new function get_ranks() is a dependency of one of our downstream targets, so any changes to get_ranks() will force the results to refresh in the next run of the pipeline.

tar_visnetwork()

Let’s run the pipeline to compute a set of rank statistics for each simulated dataset.

tar_make()
#>  skipped target model_batch
#>  skipped target model_file_model
#>  skipped branch model_data_b0b9380a
#>  skipped branch model_data_ffcdb73c
#>  skipped branch model_data_b968a03a
#>  skipped branch model_data_f8763cb2
#>  skipped branch model_data_0bfdabdc
#>  skipped pattern model_data
#>  dispatched branch model_model_5d061b58
#>  completed branch model_model_5d061b58 [1.616 seconds]
#>  dispatched branch model_model_a9336683
#>  completed branch model_model_a9336683 [1.315 seconds]
#>  dispatched branch model_model_bde6a6d6
#>  completed branch model_model_bde6a6d6 [1.315 seconds]
#>  dispatched branch model_model_384f982f
#>  completed branch model_model_384f982f [1.312 seconds]
#>  dispatched branch model_model_0d59666a
#>  completed branch model_model_0d59666a [1.315 seconds]
#>  completed pattern model_model
#>  ended pipeline [7.54 seconds]

We have a data frame of rank statistics with one row per simulation rep and one column per model parameter.

tar_load(model_model)

model_model
#> # A tibble: 10 × 7
#>    `beta[1]` `beta[2]` .rep             .dataset_id            .seed .file .name
#>        <dbl>     <dbl> <chr>            <chr>                  <int> <chr> <chr>
#>  1      2786      3594 616d38be8447ad0b model_data_b0b9380a_1 1.15e9 mode… model
#>  2      2569      3349 6ec1b2359a7a8307 model_data_b0b9380a_2 5.38e8 mode… model
#>  3      2200      1321 e4bf5c09907fe828 model_data_ffcdb73c_1 7.25e8 mode… model
#>  4      2517      3811 4ee660b811cf326f model_data_ffcdb73c_2 5.61e8 mode… model
#>  5      2868       783 9d89e20669debc49 model_data_b968a03a_1 1.57e8 mode… model
#>  6      2393      2224 373fb9023c4346f8 model_data_b968a03a_2 3.10e8 mode… model
#>  7       617      1391 27232d1d592b5dff model_data_f8763cb2_1 3.07e7 mode… model
#>  8      3581       980 a37cc91dd8c3a25f model_data_f8763cb2_2 1.28e9 mode… model
#>  9       500      3853 e1a292714b94b21a model_data_0bfdabdc_1 2.11e9 mode… model
#> 10      3413      2573 86ebdc202b680a39 model_data_0bfdabdc_2 1.10e9 mode… model

If the model is implemented correctly, then the rank statistics of each model parameter should be uniformly distributed. In practice, you may need thousands of simulation reps to make a judgment.

library(ggplot2)
library(tidyr)
model_model %>%
  pivot_longer(
    starts_with("beta"),
    names_to = "parameter",
    values_to = "ranks"
  ) %>%
  ggplot(.) +
  geom_histogram(aes(x = ranks), bins = 10) +
  facet_wrap(~parameter) +
  theme_gray(12)
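
With enough reps, a numerical check can complement the histograms. The sketch below (not part of the pipeline) bins the ranks and applies a chi-squared test for uniformity, assuming each model run kept all 4000 post-warmup draws (4 chains times 1000 sampling iterations, the defaults noted earlier).

max_rank <- 4000 # assumed number of post-warmup draws per model run
bins <- 10
model_model %>%
  pivot_longer(
    starts_with("beta"),
    names_to = "parameter",
    values_to = "rank"
  ) %>%
  group_by(parameter) %>%
  summarize(
    p_value = chisq.test(
      table(cut(rank, seq(0, max_rank, length.out = bins + 1), include.lowest = TRUE))
    )$p.value
  )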

References

Carpenter, Bob. 2017. “Bayesian Posteriors are Calibrated by Definition.” https://statmodeling.stat.columbia.edu/2017/04/12/bayesian-posteriors-calibrated/.
Cook, Samantha R., Andrew Gelman, and Donald B. Rubin. 2006. “Validation of Software for Bayesian Models Using Posterior Quantiles.” Journal of Computational and Graphical Statistics 15 (3): 675–92. http://www.jstor.org/stable/27594203.
Kim, Shinyoung, Hyunji Moon, Martin Modrák, and Teemu Säilynoja. 2022. SBC: Simulation Based Calibration for Rstan/Cmdstanr Models.
Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2020. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” https://arxiv.org/abs/1804.06788.