Background
The introductory vignette vignette caters to Bayesian data analysis workflows with few datasets to analyze. However, it is sometimes desirable to run one or more Bayesian models repeatedly across multiple simulated datasets. Examples:
- Validate the implementation of a Bayesian model using simulation.
- Simulate a randomized controlled experiment to explore frequentist properties such as power and Type I error.
This vignette focuses on (1).
Example project
Visit https://github.com/wlandau/stantargets-example-validation for an example project based on this vignette. The example has an RStudio Cloud workspace which allows you to run the project in a web browser.
Interval-based model validation pipeline
This particular example uses the concept of calibration that Bob Carpenter explains here (Carpenter 2017). The goal is to simulate multiple datasets from the model below, analyze each dataset, and assess how often the estimated posterior intervals cover the true parameters from the prior predictive simulations. If coverage is no systematically different from nominal, this is evidence that the model was implemented correctly. The quantile method by Cook, Gelman, and Rubin (2006) generalizes this concept, and simulation-based calibration (Talts et al. 2020) generalizes further. The interval-based technique featured in this vignette is not as robust as SBC, but it may be more expedient for large models because it does not require visual inspection of multiple histograms. See a later section in this vignette for an example of simulation-based calibration on this same model.
lines <- "data {
int <lower = 1> n;
vector[n] x;
vector[n] y;
}
parameters {
vector[2] beta;
}
model {
y ~ normal(beta[1] + x * beta[2], 1);
beta ~ normal(0, 1);
}"
writeLines(lines, "model.stan")
Next, we define a pipeline to simulate multiple datasets and fit each
dataset with the model. In our data-generating function, we put the true
parameter values of each simulation in a special .join_data
list. stantargets
will automatically join the elements of
.join_data
to the correspondingly named variables in the
summary output. This will make it super easy to check how often our
posterior intervals capture the truth. As for scale, generate 10
datasets (5 batches with 2 replications each) and run the model on each
of the 10 datasets.1 By default, each of the 10 model runs
computes 4 MCMC chains with 2000 MCMC iterations each (including
burn-in) and you can adjust with the chains
,
iter_sampling
, and iter_warmup
arguments of
tar_stan_mcmc_rep_summary()
.
# _targets.R
library(targets)
library(stantargets)
options(crayon.enabled = FALSE)
# Use computer memory more sparingly:
tar_option_set(memory = "transient", garbage_collection = TRUE)
simulate_data <- function(n = 10L) {
beta <- rnorm(n = 2, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- rnorm(n, beta[1] + x * beta[2], 1)
list(
n = n,
x = x,
y = y,
.join_data = list(beta = beta)
)
}
list(
tar_stan_mcmc_rep_summary(
model,
"model.stan",
simulate_data(), # Runs once per rep.
batches = 5, # Number of branch targets.
reps = 2, # Number of model reps per branch target.
variables = "beta",
summaries = list(
~posterior::quantile2(.x, probs = c(0.025, 0.975))
),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
)
)
We now have a pipeline that runs the model 10 times: 5 batches (branch targets) with 2 replications per batch.
Run the computation with tar_make()
tar_make()
#> ▶ dispatched target model_batch
#> ● completed target model_batch [0.001 seconds]
#> ▶ dispatched target model_file_model
#> ● completed target model_file_model [9.214 seconds]
#> ▶ dispatched branch model_data_5fcdec5f855f2d9c
#> ● completed branch model_data_5fcdec5f855f2d9c [0.005 seconds]
#> ▶ dispatched branch model_data_b6c9a18333c6a8ca
#> ● completed branch model_data_b6c9a18333c6a8ca [0 seconds]
#> ▶ dispatched branch model_data_5db4354944466148
#> ● completed branch model_data_5db4354944466148 [0.001 seconds]
#> ▶ dispatched branch model_data_4a40cb783277d5dc
#> ● completed branch model_data_4a40cb783277d5dc [0.001 seconds]
#> ▶ dispatched branch model_data_104af6d505e730d6
#> ● completed branch model_data_104af6d505e730d6 [0.001 seconds]
#> ● completed pattern model_data
#> ▶ dispatched branch model_model_50b3d9bcb9189fef
#> ● completed branch model_model_50b3d9bcb9189fef [1.586 seconds]
#> ▶ dispatched branch model_model_93bc2c2a4b8dc29f
#> ● completed branch model_model_93bc2c2a4b8dc29f [1.429 seconds]
#> ▶ dispatched branch model_model_e2ab729f4fa1dd45
#> ● completed branch model_model_e2ab729f4fa1dd45 [1.417 seconds]
#> ▶ dispatched branch model_model_5871bb9227fbbf93
#> ● completed branch model_model_5871bb9227fbbf93 [1.414 seconds]
#> ▶ dispatched branch model_model_820c742ab2ba1134
#> ● completed branch model_model_820c742ab2ba1134 [1.41 seconds]
#> ● completed pattern model_model
#> ▶ dispatched target model
#> ● completed target model [0 seconds]
#> ▶ ended pipeline [17.777 seconds]
#>
The result is an aggregated data frame of summary statistics, where
the .rep
column distinguishes among individual replicates.
We have the posterior intervals for beta
in columns
q2.5
and q97.5
. And thanks to
.join_data
in simulate_data()
, there is a
special .join_data
column in the output to indicate the
true value of each parameter from the simulation.
tar_load(model)
model
#> # A tibble: 20 × 9
#> variable q2.5 q97.5 .join_data .rep .dataset_id .seed .file .name
#> <chr> <dbl> <dbl> <dbl> <chr> <chr> <int> <chr> <chr>
#> 1 beta[1] -0.966 0.249 -0.199 79f81b21… model_data… 1.15e9 mode… model
#> 2 beta[2] -0.116 1.62 1.29 79f81b21… model_data… 1.15e9 mode… model
#> 3 beta[1] 0.0862 1.28 0.783 4c16247b… model_data… 5.38e8 mode… model
#> 4 beta[2] -1.06 0.706 0.241 4c16247b… model_data… 5.38e8 mode… model
#> 5 beta[1] 0.202 1.39 0.825 6d6a140b… model_data… 7.25e8 mode… model
#> 6 beta[2] 0.341 2.06 1.05 6d6a140b… model_data… 7.25e8 mode… model
#> 7 beta[1] 1.25 2.45 1.94 bd102c21… model_data… 5.61e8 mode… model
#> 8 beta[2] -1.73 -0.00195 -0.110 bd102c21… model_data… 5.61e8 mode… model
#> 9 beta[1] -0.697 0.527 0.0841 03ae35b2… model_data… 1.57e8 mode… model
#> 10 beta[2] -0.613 1.09 -0.114 03ae35b2… model_data… 1.57e8 mode… model
#> 11 beta[1] 0.0154 1.25 0.715 97fd3bc5… model_data… 3.10e8 mode… model
#> 12 beta[2] -0.195 1.55 0.751 97fd3bc5… model_data… 3.10e8 mode… model
#> 13 beta[1] -0.721 0.481 -0.434 fab012f0… model_data… 3.07e7 mode… model
#> 14 beta[2] 0.0793 1.83 0.790 fab012f0… model_data… 3.07e7 mode… model
#> 15 beta[1] -2.20 -1.04 -1.24 65033a7c… model_data… 1.28e9 mode… model
#> 16 beta[2] -1.15 0.557 -0.604 65033a7c… model_data… 1.28e9 mode… model
#> 17 beta[1] -0.278 0.877 -0.0354 ffae976f… model_data… 2.11e9 mode… model
#> 18 beta[2] -1.32 0.392 0.305 ffae976f… model_data… 2.11e9 mode… model
#> 19 beta[1] -0.799 0.409 0.137 ba3791e8… model_data… 1.10e9 mode… model
#> 20 beta[2] -0.475 1.20 0.528 ba3791e8… model_data… 1.10e9 mode… model
Now, let’s assess how often the estimated 95% posterior intervals
capture the true values of beta
. If the model is
implemented correctly, the coverage value below should be close to 95%.
(Ordinarily, we would increase
the number of batches and reps per batch and run batches in
parallel computing.)
library(dplyr)
model %>%
group_by(variable) %>%
summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
#> # A tibble: 2 × 2
#> variable coverage
#> <chr> <dbl>
#> 1 beta[1] 1
#> 2 beta[2] 1
For maximum reproducibility, we should express the coverage assessment as a custom function and a target in the pipeline.
# _targets.R
library(targets)
library(stantargets)
simulate_data <- function(n = 10L) {
beta <- rnorm(n = 2, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- rnorm(n, beta[1] + x * beta[2], 1)
list(
n = n,
x = x,
y = y,
.join_data = list(beta = beta)
)
}
list(
tar_stan_mcmc_rep_summary(
model,
"model.stan",
simulate_data(),
batches = 5, # Number of branch targets.
reps = 2, # Number of model reps per branch target.
variables = "beta",
summaries = list(
~posterior::quantile2(.x, probs = c(0.025, 0.975))
),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
),
tar_target(
coverage,
model %>%
group_by(variable) %>%
summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
)
)
The new coverage
target should the only outdated target,
and it should be connected to the upstream model
target.
When we run the pipeline, only the coverage assessment should run. That way, we skip all the expensive computation of simulating datasets and running MCMC multiple times.
tar_make()
#> ✔ skipped target model_batch
#> ✔ skipped target model_file_model
#> ✔ skipped branch model_data_5fcdec5f855f2d9c
#> ✔ skipped branch model_data_b6c9a18333c6a8ca
#> ✔ skipped branch model_data_5db4354944466148
#> ✔ skipped branch model_data_4a40cb783277d5dc
#> ✔ skipped branch model_data_104af6d505e730d6
#> ✔ skipped pattern model_data
#> ✔ skipped branch model_model_50b3d9bcb9189fef
#> ✔ skipped branch model_model_93bc2c2a4b8dc29f
#> ✔ skipped branch model_model_e2ab729f4fa1dd45
#> ✔ skipped branch model_model_5871bb9227fbbf93
#> ✔ skipped branch model_model_820c742ab2ba1134
#> ✔ skipped pattern model_model
#> ✔ skipped target model
#> ▶ dispatched target coverage
#> ● completed target coverage [0.01 seconds]
#> ▶ ended pipeline [0.214 seconds]
#>
tar_read(coverage)
#> # A tibble: 2 × 2
#> variable coverage
#> <chr> <dbl>
#> 1 beta[1] 1
#> 2 beta[2] 1
Multiple models
tar_stan_rep_mcmc_summary()
and similar functions allow
you to supply multiple Stan models. If you do, each model will share the
the same collection of datasets, and the .dataset_id
column
of the model target output allows for custom analyses that compare
different models against each other. Suppose we have a new model,
model2.stan
.
lines <- "data {
int <lower = 1> n;
vector[n] x;
vector[n] y;
}
parameters {
vector[2] beta;
}
model {
y ~ normal(beta[1] + x * x * beta[2], 1); // Regress on x^2 instead of x.
beta ~ normal(0, 1);
}"
writeLines(lines, "model2.stan")
To set up the simulation workflow to run on both models, we add
model2.stan
to the stan_files
argument of
tar_stan_rep_mcmc_summary()
. And in the coverage summary
below, we group by .name
to compute a coverage statistic
for each model.
# _targets.R
library(targets)
library(stantargets)
simulate_data <- function(n = 10L) {
beta <- rnorm(n = 2, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- rnorm(n, beta[1] + x * beta[2], 1)
list(
n = n,
x = x,
y = y,
.join_data = list(beta = beta)
)
}
list(
tar_stan_mcmc_rep_summary(
model,
c("model.stan", "model2.stan"), # another model
simulate_data(),
batches = 5,
reps = 2,
variables = "beta",
summaries = list(
~posterior::quantile2(.x, probs = c(0.025, 0.975))
),
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile()
),
tar_target(
coverage,
model %>%
group_by(.name, variable) %>%
summarize(coverage = mean(q2.5 < .join_data & .join_data < q97.5))
)
)
In the graph below, notice how targets model_model
and
model_model2
are both connected to model_data
upstream. Downstream, model
is equivalent to
dplyr::bind_rows(model_model, model_model2)
, and it will
have special columns .name
and .file
to
distinguish among all the models.
Simulation-based calibration
This section explores a more rigorous validation study which adopts
the proper simulation-based calibration (SBC) method from (Talts et al. 2020). To use this method, we need
a function that generates rank statistics from a simulated dataset and a
data frame of posterior draws. If the model is implemented correctly,
these rank statistics will be uniformly distributed for each model
parameter. Our function will use the calculate_ranks_draws_matrix()
function from the SBC
R package
(Kim et al. 2022).
get_ranks <- function(data, draws) {
draws <- select(draws, starts_with(names(data$.join_data)))
truth <- map_dbl(
names(draws),
~eval(parse(text = .x), envir = data$.join_data)
)
out <- SBC::calculate_ranks_draws_matrix(truth, as_draws_matrix(draws))
as_tibble(as.list(out))
}
To demonstrate this function, we simulate a dataset,
simulate_data <- function(n = 10L) {
beta <- rnorm(n = 2, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- rnorm(n, beta[1] + x * beta[2], 1)
list(
n = n,
x = x,
y = y,
.join_data = list(beta = beta)
)
}
data <- simulate_data()
str(data)
#> List of 4
#> $ n : int 10
#> $ x : num [1:10] -1 -0.778 -0.556 -0.333 -0.111 ...
#> $ y : num [1:10] 2.11 1.91 2.3 2.3 2.2 ...
#> $ .join_data:List of 1
#> ..$ beta: num [1:2] 0.893 -0.778
we make up a hypothetical set of posterior draws,
draws <- tibble(`beta[1]` = rnorm(100), `beta[2]` = rnorm(100))
draws
#> # A tibble: 100 × 2
#> `beta[1]` `beta[2]`
#> <dbl> <dbl>
#> 1 -0.679 -0.937
#> 2 0.738 0.861
#> 3 -0.861 0.622
#> 4 0.421 -2.14
#> 5 1.45 0.485
#> 6 0.194 -0.447
#> 7 -0.691 0.546
#> 8 1.34 1.44
#> 9 2.74 0.243
#> 10 -0.944 0.169
#> # ℹ 90 more rows
and we call get_ranks()
to get the SBC rank statistics
for each model parameter.
library(dplyr)
library(posterior)
library(purrr)
get_ranks(data = data, draws = draws)
#> # A tibble: 1 × 2
#> `beta[1]` `beta[2]`
#> <dbl> <dbl>
#> 1 75 22
To put this into practice in a pipeline, we supply the symbol
get_ranks
to the transform
argument of
tar_stan_mcmc_rep_draws()
. That way, instead of a full set
of draws, each replication will return only the output of
get_ranks()
on those draws (plus a few helper columns). If
supplied, the transform
argument of
tar_stan_mcmc_rep_draws()
must be the name of a function in
the pipeline. This function must accept arguments data
and
draws
, and it must return a data frame.
# _targets.R
library(targets)
library(stantargets)
tar_option_set(packages = c("dplyr", "posterior", "purrr", "tibble"))
simulate_data <- function(n = 10L) {
beta <- rnorm(n = 2, mean = 0, sd = 1)
x <- seq(from = -1, to = 1, length.out = n)
y <- rnorm(n, beta[1] + x * beta[2], 1)
list(
n = n,
x = x,
y = y,
.join_data = list(beta = beta)
)
}
get_ranks <- function(data, draws) {
draws <- select(draws, starts_with(names(data$.join_data)))
truth <- map_dbl(
names(draws),
~eval(parse(text = .x), envir = data$.join_data)
)
out <- SBC::calculate_ranks_draws_matrix(truth, as_draws_matrix(draws))
as_tibble(as.list(out))
}
list(
tar_stan_mcmc_rep_draws(
model,
c("model.stan"),
simulate_data(),
batches = 5,
reps = 2,
variables = "beta",
stdout = R.utils::nullfile(),
stderr = R.utils::nullfile(),
transform = get_ranks # Supply the transform to get SBC ranks.
)
)
Our new function get_ranks()
is a dependency of one of
our downstream targets, so any changes to get_ranks()
will
force the results to refresh in the next run of the pipeline.
Let’s run the pipeline to compute a set of rank statistics for each simulated dataset.
tar_make()
#> ✔ skipped target model_batch
#> ✔ skipped target model_file_model
#> ✔ skipped branch model_data_5fcdec5f855f2d9c
#> ✔ skipped branch model_data_b6c9a18333c6a8ca
#> ✔ skipped branch model_data_5db4354944466148
#> ✔ skipped branch model_data_4a40cb783277d5dc
#> ✔ skipped branch model_data_104af6d505e730d6
#> ✔ skipped pattern model_data
#> ▶ dispatched branch model_model_50b3d9bcb9189fef
#> ● completed branch model_model_50b3d9bcb9189fef [1.634 seconds]
#> ▶ dispatched branch model_model_93bc2c2a4b8dc29f
#> ● completed branch model_model_93bc2c2a4b8dc29f [1.33 seconds]
#> ▶ dispatched branch model_model_e2ab729f4fa1dd45
#> ● completed branch model_model_e2ab729f4fa1dd45 [1.334 seconds]
#> ▶ dispatched branch model_model_5871bb9227fbbf93
#> ● completed branch model_model_5871bb9227fbbf93 [1.337 seconds]
#> ▶ dispatched branch model_model_820c742ab2ba1134
#> ● completed branch model_model_820c742ab2ba1134 [1.335 seconds]
#> ● completed pattern model_model
#> ▶ ended pipeline [7.671 seconds]
#>
We have a data frame of rank statistics with one row per simulation rep and one column per model parameter.
tar_load(model_model)
model_model
#> # A tibble: 10 × 7
#> `beta[1]` `beta[2]` .rep .dataset_id .seed .file .name
#> <dbl> <dbl> <chr> <chr> <int> <chr> <chr>
#> 1 2807 3602 616d38be8447ad0b model_data_5fcdec5f8… 1.15e9 mode… model
#> 2 2564 3332 6ec1b2359a7a8307 model_data_5fcdec5f8… 5.38e8 mode… model
#> 3 2228 1338 e4bf5c09907fe828 model_data_b6c9a1833… 7.25e8 mode… model
#> 4 2470 3825 4ee660b811cf326f model_data_b6c9a1833… 5.61e8 mode… model
#> 5 2839 739 9d89e20669debc49 model_data_5db435494… 1.57e8 mode… model
#> 6 2446 2282 373fb9023c4346f8 model_data_5db435494… 3.10e8 mode… model
#> 7 644 1370 27232d1d592b5dff model_data_4a40cb783… 3.07e7 mode… model
#> 8 3580 977 a37cc91dd8c3a25f model_data_4a40cb783… 1.28e9 mode… model
#> 9 543 3842 e1a292714b94b21a model_data_104af6d50… 2.11e9 mode… model
#> 10 3462 2600 86ebdc202b680a39 model_data_104af6d50… 1.10e9 mode… model
If the model is implemented correctly, then each the rank statistics each model parameter should be uniformly distributed. In practice, you may need thousands of simulation reps to make a judgment.
library(ggplot2)
library(tidyr)
model_model %>%
pivot_longer(
starts_with("beta"),
names_to = "parameter",
values_to = "ranks"
) %>%
ggplot(.) +
geom_histogram(aes(x = ranks), bins = 10) +
facet_wrap(~parameter) +
theme_gray(12)