17 diagnose
: Diagnosing Biased Hamiltonian Monte Carlo Inferences
CmdStan is distributed with a utility that is able to read in and analyze the output of one or more Markov chains to check for the following potential problems:
- Divergent transitions
- Transitions that hit the maximum treedepth
- Low E-BFMI values
- Low effective sample sizes
- High \(\hat{R}\) values
The meanings of several of these problems are discussed in https://arxiv.org/abs/1701.02434.
17.1 Building the diagnose command
The CmdStan makefile task build
compiles the diagnose
utility
into the bin
directory.
It can be compiled directly using the makefile as follows:
> cd <cmdstan-home>
> make bin/diagnose
17.2 Running the diagnose command
The diagnose
command is executed on one or more output files,
which are provided as command-line arguments separated by spaces. If
there are no apparent problems with the output files passed to
diagnose
, it outputs a message that all transitions are
within treedepth limit and that no divergent transitions were found.
It problems are detected, it outputs a summary of the problem along
with possible ways to mitigate it.
To fully exercise the diagnose
command, we run 4 chains to
sample from the Neal’s funnel distribution,
discussed in the Stan User’s Guide reparameterization section
https://mc-stan.org/docs/stan-users-guide/reparameterization-section.html.
This program defines a distribution which
exemplifies the difficulties of sampling from some hierarchical models:
parameters {
real y;
vector[9] x;
}
model {
y ~ normal(0, 3);
x ~ normal(0, exp(y/2));
}
This program is available on GitHub: https://github.com/stan-dev/example-models/blob/master/misc/funnel/funnel.stan
Stan has trouble sampling from the region where y
is small and thus x
is constrained to be near 0.
This is due to the fact that the density’s scale changes with y
,
so that a step size that works well when y
is large is inefficient
when y
is small and vice-versa.
Running 4 chains produces output files output_1.csv
, …, output_4.csv
.
We run diagnose
command on this fileset:
> bin/diagnose output_*.csv
The output is printed to the terminal window:
Processing csv files: output_1.csv, output_2.csv, output_3.csv, output_4.csv
Checking sampler transitions treedepth.
9 of 4000 (0.23%) transitions hit the maximum treedepth limit of 10, or 2^10 leapfrog steps.
Trajectories that are prematurely terminated due to this limit will result in slow exploration.
For optimal performance, increase this limit.
Checking sampler transitions for divergences.
9 of 4000 (0.23%) transitions ended with a divergence.
These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.
Try increasing adapt delta closer to 1.
If this doesn't remove all divergences, try to reparameterize the model.
Checking E-BFMI - sampler transitions HMC potential energy.
The E-BFMI, 0.078, is below the nominal threshold of 0.3 which suggests that HMC may have trouble exploring the target distribution.
If possible, try to reparameterize the model.
Effective sample size satisfactory.
The following parameters had split R-hat greater than 1.1:
y
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizing your model with additional prior information or a more effective parameterization.
Processing complete.
In this example, changing the model to use a non-centered parameterization
is the only way to correct these problems.
In this second model, the parameters x_raw
and y_raw
are sampled as independent
standard normals, which is easy for Stan.
parameters {
real y_raw;
vector[9] x_raw;
}
transformed parameters {
real y;
vector[9] x;
y = 3.0 * y_raw;
x = exp(y/2) * x_raw;
}
model {
y_raw ~ std_normal(); // implies y ~ normal(0, 3)
x_raw ~ std_normal(); // implies x ~ normal(0, exp(y/2))
}
This program is available on GitHub: https://github.com/stan-dev/example-models/blob/master/misc/funnel/funnel_reparam.stan
We compile the program and run 4 chains, as before.
Now the diagnose
command doesn’t detect any problems:
Processing csv files: output_1.csv, output_2.csv, output_3.csv, output_4.csv
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory for all transitions.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
17.3 diagnose
warnings and recommendations
17.3.1 Divergent transitions after warmup
Stan uses Hamiltonian Monte Carlo (HMC) to explore the target distribution — the posterior defined by a Stan program + data — by simulating the evolution of a Hamiltonian system. In order to approximate the exact solution of the Hamiltonian dynamics we need to choose a step size governing how far we move each time we evolve the system forward. That is, the step size controls the resolution of the sampler.
Unfortunately, for particularly hard problems there are features of the target distribution that are too small for this resolution. Consequently the sampler misses those features and returns biased estimates. Fortunately, this mismatch of scales manifests as divergences which provide a practical diagnostic. If there are any divergences after warmup, then the samples may be biased.
If the divergent transitions cannot be eliminated by increasing the
adapt_delta
parameter, we have to find a different way to
write the model that is logically equivalent but simplifies the
geometry of the posterior distribution. This problem occurs frequently
with hierarchical models and one of the simplest examples is Neal’s
Funnel, which is discussed in the
reparameterization section
of the Stan User’s Guide.
17.3.2 Maximum treedepth exceeded
Warnings about hitting the maximum treedepth are not as serious as warnings about divergent transitions. While divergent transitions are a validity concern, hitting the maximum treedepth is an efficiency concern. Configuring the No-U-Turn-Sampler (the variant of HMC used by Stan) requires putting a cap on the depth of the trees that it evaluates during each iteration (for details on this see the Hamiltonian Monte Carlo Sampling chapter in the Stan Reference Manual. When the maximum allowed tree depth is reached it indicates that NUTS is terminating prematurely to avoid excessively long execution time.
This is controlled through the max_depth
argument. If the
number of transitions which exceed maximum treedepth is low,
increasing max_depth
may correct this problem.
17.3.3 Low E-BFMI values - sampler transitions HMC potential energy.
The sampler csv output column energy__
is used to diagnose
the accuracy of any Hamiltonian Monte Carlo sampler.
If the standard deviation of
energy
is much larger than \(\sqrt{D / 2}\), where \(D\) is the
number of unconstrained parameters, then the sampler is
unlikely to be able to explore the posterior adequately. This is
usually due to heavy-tailed posteriors and can sometimes be remedied by
reparameterizing the model.
The warning that some number of chains had an estimated Bayesian Fraction of Missing Information (BFMI) that was too low implies that the adaptation phase of the Markov Chains did not turn out well and those chains likely did not explore the posterior distribution efficiently. For more details on this diagnostic, see https://arxiv.org/abs/1604.00695. Should this occur, you can either run the sampler for more iterations, or consider reparameterizing your model.
17.3.4 Low effective sample sizes
Roughly speaking, the effective sample size (ESS) of a quantity of interest captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm. Clearly, the higher the ESS the better. Stan uses \(\hat{R}\) adjustment to use the between-chain information in computing the ESS. For example, in case of multimodal distributions with well-separated modes, this leads to an ESS estimate that is close to the number of distinct modes that are found.
Bulk-ESS refers to the effective sample size based on the rank normalized draws. This does not directly compute the ESS relevant for computing the mean of the parameter, but instead computes a quantity that is well defined even if the chains do not have finite mean or variance. Overall bulk-ESS estimates the sampling efficiency for the location of the distribution (e.g. mean and median).
Often quite smaller ESS would be sufficient for the desired estimation accuracy, but the estimation of ESS and convergence diagnostics themselves require higher ESS. We recommend requiring that the bulk-ESS is greater than 100 times the number of chains. For example, when running four chains, this corresponds to having a rank-normalized effective sample size of at least 400.
17.3.5 High \(\hat{R}\)
\(\hat{R}\) (R-hat) convergence diagnostic compares the between- and within-chain estimates for model parameters and other univariate quantities of interest. If chains have not mixed well (ie, the between- and within-chain estimates don’t agree), \(\hat{R}\) is larger than 1. We recommend running at least four chains by default and only using the sample if \(\hat{R}\) is less than 1.01. Stan reports \(\hat{R}\) which is the maximum of rank normalized split-R-hat and rank normalized folded-split-R-hat, which works for thick tailed distributions and is sensitive also to differences in scale. For more details on this diagnostic, see https://arxiv.org/abs/1903.08008.
There is further discussion in https://arxiv.org/abs/1701.02434; however the correct resolution is necessarily model specific, hence all suggestions general guidelines only.