Approximate bayesian computation (ABC) with nlrx
Approximate bayesian computation (ABC) algorithms have been
increasingly used for calibration of agent-based simulation models. The
nlrx package provides different algorithms from the EasyABC package.
These algorithms can be used by attaching the corresponding simdesigns
(simdesign_ABCmcmc_Marjoram()
,
simdesign_ABCmcmc_Marjoram_original()
,
simdesign_ABCmcmc_Wegmann()
). Example 1 shows the process
of how to use ABC with nlrx. Additionally, Latin Hypercube Sampling
output can be used to calculate parameter distributions based on
rejection sampling and local linear regression. Example 2 shows, how the
simdesign_lhs()
can be used in combination with the abc package.
Example 1: Approximate bayesian computation with Monte-Carlo Markov-Chain
Here we present one example for the widely used Marjoram algorithm which combines ABC with a Markov-Chain Monte-Carlo parameter sampling scheme. However, the other two ABCmcmc simdesigns work in a very similar way except for the parameter definitions within the simdesigns (see respective documentation pages for help).
We use the Wolf Sheep Predation model from the models library to show a basic example of the calibration workflow.
Step 1: Create a nl object:
library(nlrx)
# Windows default NetLogo installation path (adjust to your needs!):
netlogopath <- file.path("C:/Program Files/NetLogo 6.0.3")
modelpath <- file.path(netlogopath, "app/models/Sample Models/Biology/Wolf Sheep Predation.nlogo")
outpath <- file.path("C:/out")
# Unix default NetLogo installation path (adjust to your needs!):
netlogopath <- file.path("/home/NetLogo 6.0.3")
modelpath <- file.path(netlogopath, "app/models/Sample Models/Biology/Wolf Sheep Predation.nlogo")
outpath <- file.path("/home/out")
nl <- nl(nlversion = "6.0.3",
nlpath = netlogopath,
modelpath = modelpath,
jvmmem = 1024)
Step 2: Attach an experiment
Because we want to apply a calibration algorithm, we need to define proper variable ranges. The algorithm is allowed to change the values of these parameters within these ranges in order to reach our specified target values. Possible choices for the distribution type are qunif, qnorm, qlnorm and qexp. Each model run is evaluated for each specified metric within the metrics slot. When the simdesign is attached (see step 3) we will define target values for each metric. Thus, we should only enter metrics and reporters that should be used for calibrating the model.
For this simple example, we just want to check if we can find a
parameterization that leads to a specific number of wolves and sheep
after a runtime of 100 ticks. Thus, we define count sheep
and count wolves
as metrics, set the runtime to 100 and set
tickmetrics to false
(because we only want to measure the
last simulation tick).
If more than one tick would be measured, the algorithm automatically calculates the mean value of the selected reporter over all measured ticks. If you wish to apply other functions to aggregate temporal information into one value, you can use a self-defined post-processing function when attaching the simdesign (see step 3).
nl@experiment <- experiment(expname="wolf-sheep",
outpath=outpath,
repetition=1,
tickmetrics="false",
idsetup="setup",
idgo="go",
runtime=100,
metrics=c("count sheep", "count wolves"),
variables = list("sheep-gain-from-food" = list(min=2, max=6, qfun="qunif"),
"wolf-gain-from-food" = list(min=10, max=30, qfun="qunif")),
constants = list('initial-number-sheep' = 100,
'initial-number-wolves' = 50,
"grass-regrowth-time" = 30,
"sheep-reproduce" = 4,
"wolf-reproduce" = 5,
"model-version" = "\"sheep-wolves-grass\"",
"show-energy?" = "false"))
Step 3: Attach a simulation design
We use the simdesign_ABCmcmc_Marjoram()
function to
attach a calibration simdesign. The summary_stat_target
represents the vector of our target values that we want to reach. These
values corresponds to the defined metrics of the experiment and should
have the same length and order. n_rec
defines the number of
samples and n_calibration
defines the number of calibration
runs. If this value is too low, a subscript out of bounds error message
might appear. If use_seed
is set to TRUE
, the
algorithm will automatically use a newly generated seed for each model
run. If it is set to false, a user-specified seed (set in the
run_nl_dyn()
function) will be used instead. The
progress_bar
gives you expected runtime information during
the execution of the algorithm. The nseeds command allows to generate a
vector of random-seeds that may be used for setting model seeds.
nl@simdesign <- simdesign_ABCmcmc_Marjoram(nl=nl,
summary_stat_target = c(100, 80),
n_rec = 100,
n_calibration=200,
use_seed = TRUE,
progress_bar = TRUE,
nseeds = 1)
These are the most important simdesign parameters, but there are many
more to fine control the behavior of the algorithm. Check out the
simdesign help page for more information. As already mentioned before,
it is also possible to define a custom post-processing function. To
apply this function for model output, the function name needs to be
entered as postpro_function
within the simdesign. For
example, we might want to use the maximum value of some measured metrics
to calibrate our model. Or maybe we want to run some tests and use the
test statistics as calibration criterion. In such cases, we can define a
function that accepts the nl object (with simulation results attached)
as function input and returns a vector of numerics. This vector needs to
represent the values that we defined as sumary_stat_target
and thus should have the same length and order. Below is an example of a
custom post-processing function that calculates the maximum value of
selected metrics over all simulation ticks.
post <- function(nl){
res <- getsim(nl, "simoutput") %>%
dplyr::select(getexp(nl, "metrics")) %>%
dplyr::summarise_each(list(max=max))
return(as.numeric(res))
}
nl@simdesign <- simdesign_ABCmcmc_Marjoram(nl=nl,
postpro_function = post,
summary_stat_target = c(100, 80),
n_rec = 100,
n_calibration=200,
use_seed = TRUE,
progress_bar = TRUE,
nseeds = 1)
Step 4: Run simulations
For calibration simdesigns, the run_nl_dyn()
function
lets you execute the simulations. There are some notable differences
between run_nl_all()
and run_nl_dyn()
. First,
because parameterizations depend of results from previous runs,
run_nl_dyn()
can not be parallelized. Second, the procedure
does not automatically loop over created random seeds of the simdesign.
If you want to repeat the same algorithm several times, just embed the
run_nl_dyn()
function in any kind of loop and iterate
through the nl@simdesign@simseeds
vector. We set the
use_seed
parameter of the simdesign to true, however we
still need to define a random-seed in run_nl_dyn although it will be
overwritten by the EasyABC functions.
results <- run_nl_dyn(nl, seed = nl@simdesign@simseeds[1])
Step 5: Investigate output
The output is reported as nested tibble which can be attached to the
nl object. There are many possible ways to inspect the simulation output
of the ABCmcmc
functions. Below you find some guidance on
how to summarize the output for calculating parameter statistics,
sampling distributions, sampling density and exporting the best
parameter combination.
setsim(nl, "simoutput") <- results
saveRDS(nl, file.path(nl@experiment@outpath, "ABCmcmc.rds"))
## Calculate descriptive statistics
getsim(nl, "simoutput") %>% # get simulation results from nl object
dplyr::select(param) %>% # select param column
tidyr::unnest(cols=param) %>% # unnest param column
dplyr::summarise_each(list(min=min, max=max, mean=mean, median=median)) %>% # calculate statistics
tidyr::gather(parameter, value) %>% # convert to long format
tidyr::separate(parameter, into = c("parameter", "stat"), sep = "_") %>% # seperate parameter name and statistic
tidyr::spread(stat, value) # convert back to wide format
## Plot histogram of parameter sampling distribution:
getsim(nl, "simoutput") %>% # get simulation results from nl object
dplyr::select(param) %>% # select param column
tidyr::unnest(cols=param) %>% # unnest param column
tidyr::gather(parameter, value) %>% # convert to long format
ggplot2::ggplot() + # plot histogram with a facet for each parameter
ggplot2::facet_wrap(~parameter, scales="free") +
ggplot2::geom_histogram(ggplot2::aes(x=value), bins = 40)
## Plot density of parameter sampling distribution:
getsim(nl, "simoutput") %>% # get simulation results from nl object
dplyr::select(param) %>% # select param column
tidyr::unnest(cols=param) %>% # unnest param column
tidyr::gather(parameter, value) %>% # convert to long format
ggplot2::ggplot() + # plot density with a facet for each parameter
ggplot2::facet_wrap(~parameter, scales="free") +
ggplot2::geom_density(ggplot2::aes(x=value, fill=parameter))
## Get best parameter combinations and corresponding function values
getsim(nl, "simoutput") %>% # get simulation results from nl object
dplyr::select(dist,epsilon) %>% # select dist and epsilon columns
tidyr::unnest(cols=c(dist,epsilon)) %>% # unnest dist and epsilon columns
dplyr::mutate(runID=dplyr::row_number()) %>% # add row ID column
dplyr::filter(dist == epsilon) %>% # only keep runs with dist=epsilon
dplyr::left_join(getsim(nl, "simoutput") %>% # join parameter values of best runs
dplyr::select(param) %>%
tidyr::unnest(cols=param) %>%
dplyr::mutate(runID=dplyr::row_number())) %>%
dplyr::left_join(getsim(nl, "simoutput") %>% # join output values best runs
dplyr::select(stats) %>%
tidyr::unnest(cols=stats) %>%
dplyr::mutate(runID=dplyr::row_number())) %>%
dplyr::select(runID, dist, epsilon, dplyr::everything()) # update order of columns
## Analyse mcmc using coda summary and plot functions:
summary(coda::as.mcmc(getsim(nl, "simoutput") %>%
dplyr::select(param) %>%
tidyr::unnest(cols=param)), quantiles =c(0.05,0.95,0.5))
plot(coda::as.mcmc(getsim(nl, "simoutput") %>%
dplyr::select(param) %>%
tidyr::unnest(cols=param)))
Example 2: Using Latin Hypercube Sampling for Approximate bayesian computation
Step 1: Create a nl object:
library(nlrx)
# Windows default NetLogo installation path (adjust to your needs!):
netlogopath <- file.path("C:/Program Files/NetLogo 6.0.3")
modelpath <- file.path(netlogopath, "app/models/Sample Models/Biology/Wolf Sheep Predation.nlogo")
outpath <- file.path("C:/out")
# Unix default NetLogo installation path (adjust to your needs!):
netlogopath <- file.path("/home/NetLogo 6.0.3")
modelpath <- file.path(netlogopath, "app/models/Sample Models/Biology/Wolf Sheep Predation.nlogo")
outpath <- file.path("/home/out")
nl <- nl(nlversion = "6.0.3",
nlpath = netlogopath,
modelpath = modelpath,
jvmmem = 1024)
Step 2: Attach an experiment
We want to run a Latin Hypercube sampling, thus we need to define proper variable ranges. We also need to define our output metrics. These metrics are also used for the rejection sampling later on.
nl@experiment <- experiment(expname="wolf-sheep",
outpath=outpath,
repetition=1,
tickmetrics="false",
idsetup="setup",
idgo="go",
runtime=100,
metrics=c("count sheep", "count wolves"),
variables = list("sheep-gain-from-food" = list(min=2, max=6, qfun="qunif"),
"wolf-gain-from-food" = list(min=10, max=30, qfun="qunif")),
constants = list('initial-number-sheep' = 100,
'initial-number-wolves' = 50,
"grass-regrowth-time" = 30,
"sheep-reproduce" = 4,
"wolf-reproduce" = 5,
"model-version" = "\"sheep-wolves-grass\"",
"show-energy?" = "false"))
Step 3: Attach a simulation design
We use the simdesign_lhs()
helper function to generate a
Latin Hypercube Sampling with 500 samples
nl@simdesign <- simdesign_lhs(nl,
samples=500,
nseeds=1,
precision=3)
Step 4: Run simulations
We can simply use run_nl_all()
to execute our
simulations.
results <- run_nl_all(nl)
Step 5: Investigate output
We first attach the output results to our nl object and store a copy of the nl object on disk.
For post-processing, we need a tibble with input parameter
distributions. This can be easily extracted from the
siminput
slot of the simdesign
by selecting
only the columns with variable (non-constant) parameters. Next, we need
corresponding outputs for these parameters. In this example we just take
the measured output tibble (simoutput
) and only select the
columns with our metrics. Of course, you can also perform additional
post-processing of these outputs if desired. Third, we need to define
expected values for our outputs. In our case, we just assume that
count sheep
should have a value of 100, whereas
count wolves
should have a value of 80.
input <- getsim(nl, "siminput") %>%
dplyr::select(names(getexp(nl, "variables")))
output <- getsim(nl, "simoutput") %>%
dplyr::select(getexp(nl, "metrics"))
target <- c("count sheep"=100, "count wolves"=80)
We use the abc
function of the abc package to perform
the rejection sampling. For this example, we perform both algorithms
provided by this function (“rejection” and “loclinear”).
results.abc.reject <- abc::abc(target=target,
param=input,
sumstat=output,
tol=0.3,
method="rejection")
results.abc.loclin <- abc::abc(target=target,
param=input,
sumstat=output,
tol=0.3,
method="loclinear")
Finally, we might want to compare the accepted parameter distributions of both algorithms and the initial distribution of the Latin Hypercube sampling. Thus, we reformat the results to a tidy data format and attach the initial parameter distributions. This dataset can now be used for displaying the parameter distributions with ggplot.
results.abc.all <- tibble::as_tibble(results.abc.reject$unadj.values) %>% # results from rejection method
tidyr::gather(parameter, value) %>%
dplyr::mutate(method="rejection") %>%
dplyr::bind_rows(tibble::as_tibble(results.abc.loclin$adj.values) %>% # results from local linear regression method
tidyr::gather(parameter, value) %>%
dplyr::mutate(method="loclinear")) %>%
dplyr::bind_rows(input %>% # initial parameter distribution (lhs)
tidyr::gather(parameter, value) %>%
dplyr::mutate(method="lhs"))
ggplot2::ggplot(results.abc.all) +
ggplot2::facet_grid(method~parameter, scales="free") +
ggplot2::geom_histogram(ggplot2::aes(x=value))
ggplot2::ggplot(results.abc.all) +
ggplot2::facet_wrap(~parameter, scales="free") +
ggplot2::geom_density(ggplot2::aes(x=value, fill=method), alpha=0.1)