Simulation-Based Calibration
A Bayesian posterior is calibrated if the posterior intervals have appropriate coverage. For example, 80% intervals are expected to contain the true parameter 80% of the time. If data is generated according to a model, Bayesian posterior inference with respect to that model is calibrated by construction. Simulation-based calibration (SBC) exploits this property of Bayesian inference to asses the soundness of a posterior sampler. Roughly, the way it works is by simulating parameters according to the prior, then simulating data conditioned on the simulated parameters, then testing posterior calibration of the inference algorithm over independently simulated data sets. This chapter follows Talts et al. (2018), which improves on the original approach developed by Cook, Gelman, and Rubin (2006).
Bayes is calibrated by construction
Suppose a Bayesian model is given in the form of a prior density
This is one way to define calibration, because it follows that posterior intervals will have appropriate coverage (Dawid 1982; Gneiting, Balabdaoui, and Raftery 2007). If the rank of
Simulation-based calibration
Suppose the Bayesian model to test has joint density
If the algorithm generates posterior draws according to the posterior, the ranks should be uniformly distributed from
SBC in Stan
Running simulation-based calibration in Stan will test whether Stan’s sampling algorithm can sample from the posterior associated with data generated according to the model. The data simulation and posterior fitting and rank calculation can all be done within a single Stan program. Then Stan’s posterior sampler has to be run multiple times. Each run produces a rank for each parameter being assessed for uniformity. The total set of ranks can then be tested for uniformity.
Example model
For illustration, a very simple model will suffice. Suppose there are two parameters
For example, suppose the following two parameter values are drawn from the prior in the first simulation,
Testing a Stan program with simulation-based calibration
To code simulation-based calibration in a Stan program, the transformed data block can be used to simulate parameters and data from the model. The parameters, transformed parameters, and model block then define the model over the simulated data. Then, in the generated quantities block, the program records an indicator for whether each parameter is less than the simulated value. As shown above, the rank is then the sum of the simulated indicator variables.
transformed data {
real mu_sim = normal_rng(0, 1);
real<lower=0> sigma_sim = lognormal_rng(0, 1);
int<lower=0> J = 10;
vector[J] y_sim;
for (j in 1:J) {
y_sim[j] = normal_rng(mu_sim, sigma_sim);
}
}parameters {
real mu;
real<lower=0> sigma;
}model {
0, 1);
mu ~ normal(0, 1);
sigma ~ lognormal(
y_sim ~ normal(mu, sigma);
}generated quantities {
array[2] int<lower=0, upper=1> lt_sim
= { mu < mu_sim, sigma < sigma_sim }; }
To avoid confusion with the number of simulated data sets used for simulation-based calibration, J
is used for the number of simulated data points.
The model is implemented twice—once as a data generating process using random number generators in the transformed data block, then again in the parameters and model block. This duplication is a blessing and a curse. The curse is that it’s more work and twice the chance for errors. The blessing is that by implementing the model twice and comparing results, the chance of there being a mistake in the model is reduced.
Pseudocode for simulation-based calibration
The entire simulation-based calibration process is as follows, where
p(theta)
is the prior densityp(y | theta)
is the sampling densityK
is the number of parametersN
is the total number of simulated data sets and fitsM
is the number of posterior draws per simulated data set
SBC(p(theta), p(y | theta), K, N, M)
------------------------------------
for (n in 1:N) {
// simulate parameters and data
theta(sim(n)) ~ p(theta)
y(sim(n)) ~ p(y | theta(sim(n)))
// posterior draws given simulated data
for (m in 1:M) {
theta(n, m) ~ p(theta | y(sim(n)))
}
// calculate rank of sim among posterior draws
for (k in 1:K) {
rank(n, k) = SUM_m I(theta[k](n,m) < theta[k](sim(n)))
}
}
// test uniformity of each parameter
for (k in 1:K) {
test uniformity of rank(1:N, k)
}
The importance of thinning
The draws from the posterior are assumed to be roughly independent. If they are not, artifacts may arise in the uniformity tests due to correlation in the posterior draws. Thus it is best to think the posterior draws down to the point where the effective sample size is roughly the same as the number of thinned draws. This may require running the code a few times to judge the number of draws required to produce a target effective sample size. This operation that can be put into a loop that doubles the number of iterations until all parameters have an effective sample size of M
, then thinning down to M
draws.
Testing uniformity
A simple, though not very highly powered,
The bins don’t need to be exactly the same size. In general, if
Indexing to simplify arithmetic
Because there are
Distributing the ranks into bins is another fiddly operation that can be done with integer arithmetic or the floor operation. Using floor, the following function determines the bin for a rank,
To summarize, the following pseudocode computes the
Inputs: M draws, J bins, N parameters, ranks r[n, m]
b[1:J] = 0
for (m in 1:M) {
++b[1 + floor(r[n, m] * J / (M + 1))]
}
where the ++b[n]
notation is a common form of syntactic sugar for b[n] = b[n] + 1.
In general, a great deal of care must be taken in visualizing discrete data because it’s easy to introduce off-by-one errors and artifacts at the edges because of the way boundaries are computed by default. That’s why so much attention must be devoted to indexing and binning.
Examples of simulation-based calibration
This section will show what the results look like when the tests pass and then when they fail. The passing test will compare a normal model and normal data generating process, whereas the second will compare a normal model with a Student-t data generating process. The first will produce calibrated posteriors, the second will not.
When things go right
Consider the following simple model for a normal distribution with standard normal and lognormal priors on the location and scale parameters.
transformed data {
real mu_sim = normal_rng(0, 1);
real<lower=0> sigma_sim = lognormal_rng(0, 1);
int<lower=0> J = 10;
vector[J] y_sim;
for (j in 1:J) {
4, mu_sim, sigma_sim);
y_sim[j] = student_t_rng(
}
}parameters {
real mu;
real<lower=0> sigma;
}model {
0, 1);
mu ~ normal(0, 1);
sigma ~ lognormal(
y_sim ~ normal(mu, sigma);
}generated quantities {
array[2] int<lower=0, upper=1> I_lt_sim
= { mu < mu_sim, sigma < sigma_sim }; }
After running this for enough iterations so that the effective sample size is larger than
When things go wrong
Now consider using a Student-t data generating process with a normal model. Compare the apparent uniformity of the well specified model with the ill-specified situation with Student-t generative process and normal model.
When Stan’s sampler goes wrong
The example in the previous sections show hard-coded pathological behavior. The usual application of SBC is to diagnose problems with a sampler.
This can happen in Stan with well-specified models if the posterior geometry is too difficult (usually due to extreme stiffness that varies). A simple example is the eight schools problem, the data for which consists of sample means
The meta-analysis model has parameters for a population mean
This model can be coded in Stan with a data-generating process that simulates the parameters and then simulates data according to the parameters.
transformed data {
real mu_sim = normal_rng(0, 5);
real tau_sim = abs(normal_rng(0, 5));
int<lower=0> J = 8;
array[J] real theta_sim = normal_rng(rep_vector(mu_sim, J), tau_sim);
array[J] real<lower=0> sigma = abs(normal_rng(rep_vector(0, J), 5));
array[J] real y = normal_rng(theta_sim, sigma);
}parameters {
real mu;
real<lower=0> tau;
array[J] real theta;
}model {
0, 5);
tau ~ normal(0, 5);
mu ~ normal(
theta ~ normal(mu, tau);
y ~ normal(theta, sigma);
}generated quantities {
int<lower=0, upper=1> mu_lt_sim = mu < mu_sim;
int<lower=0, upper=1> tau_lt_sim = tau < tau_sim;
int<lower=0, upper=1> theta1_lt_sim = theta[1] < theta_sim[1];
}
As usual for simulation-based calibration, the transformed data encodes the data-generating process using random number generators. Here, the population parameters
When fitting the model in Stan, multiple warning messages are provided that the sampler has diverged. The divergence warnings are in Stan’s sampler precisely to diagnose the sampler’s inability to follow the curvature in the posterior and provide independent confirmation that Stan’s sampler cannot fit this model as specified.
SBC also diagnoses the problem. Here’s the rank plots for running
Although the population mean and standard deviation