17 diagnose: Diagnosing Biased Hamiltonian Monte Carlo Inferences

CmdStan is distributed with a utility that is able to read in and analyze the output of one or more Markov chains to check for the following potential problems:

  • Divergent transitions
  • Transitions that hit the maximum treedepth
  • Low E-BFMI values
  • Low effective sample sizes
  • High \(\hat{R}\) values

The meanings of several of these problems are discussed in https://arxiv.org/abs/1701.02434.

17.1 Building the diagnose command

The CmdStan makefile task build compiles the diagnose utility into the bin directory. It can be compiled directly using the makefile as follows:

> cd <cmdstan-home>
> make bin/diagnose

17.2 Running the diagnose command

The diagnose command is executed on one or more output files, which are provided as command-line arguments separated by spaces. If there are no apparent problems with the output files passed to diagnose, it outputs a message that all transitions are within treedepth limit and that no divergent transitions were found. It problems are detected, it outputs a summary of the problem along with possible ways to mitigate it.

To fully exercise the diagnose command, we run 4 chains to sample from the Neal’s funnel distribution, discussed in the Stan User’s Guide reparameterization section https://mc-stan.org/docs/stan-users-guide/reparameterization-section.html. This program defines a distribution which exemplifies the difficulties of sampling from some hierarchical models:

parameters {
  real y;
  vector[9] x;
model {
  y ~ normal(0, 3);
  x ~ normal(0, exp(y/2));

This program is available on GitHub: https://github.com/stan-dev/example-models/blob/master/misc/funnel/funnel.stan

Stan has trouble sampling from the region where y is small and thus x is constrained to be near 0. This is due to the fact that the density’s scale changes with y, so that a step size that works well when y is large is inefficient when y is small and vice-versa.

Running 4 chains produces output files output_1.csv, …, output_4.csv. We run diagnose command on this fileset:

> bin/diagnose output_*.csv

The output is printed to the terminal window:

Processing csv files: output_1.csv, output_2.csv, output_3.csv, output_4.csv

Checking sampler transitions treedepth.
9 of 4000 (0.23%) transitions hit the maximum treedepth limit of 10, or 2^10 leapfrog steps.
Trajectories that are prematurely terminated due to this limit will result in slow exploration.
For optimal performance, increase this limit.

Checking sampler transitions for divergences.
9 of 4000 (0.23%) transitions ended with a divergence.
These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.
Try increasing adapt delta closer to 1.
If this doesn't remove all divergences, try to reparameterize the model.

Checking E-BFMI - sampler transitions HMC potential energy.
The E-BFMI, 0.078, is below the nominal threshold of 0.3 which suggests that HMC may have trouble exploring the target distribution.
If possible, try to reparameterize the model.

Effective sample size satisfactory.

The following parameters had split R-hat greater than 1.1:
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizing your model with additional prior information or a more effective parameterization.

Processing complete.

In this example, changing the model to use a non-centered parameterization is the only way to correct these problems. In this second model, the parameters x_raw and y_raw are sampled as independent standard normals, which is easy for Stan.

parameters {
  real y_raw;
  vector[9] x_raw;
transformed parameters {
  real y;
  vector[9] x;

  y = 3.0 * y_raw;
  x = exp(y/2) * x_raw;
model {
  y_raw ~ std_normal(); // implies y ~ normal(0, 3)
  x_raw ~ std_normal(); // implies x ~ normal(0, exp(y/2))

This program is available on GitHub: https://github.com/stan-dev/example-models/blob/master/misc/funnel/funnel_reparam.stan

We compile the program and run 4 chains, as before. Now the diagnose command doesn’t detect any problems:

Processing csv files: output_1.csv, output_2.csv, output_3.csv, output_4.csv

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory for all transitions.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.

17.3 diagnose warnings and recommendations

17.3.1 Divergent transitions after warmup

Stan uses Hamiltonian Monte Carlo (HMC) to explore the target distribution — the posterior defined by a Stan program + data — by simulating the evolution of a Hamiltonian system. In order to approximate the exact solution of the Hamiltonian dynamics we need to choose a step size governing how far we move each time we evolve the system forward. That is, the step size controls the resolution of the sampler.

Unfortunately, for particularly hard problems there are features of the target distribution that are too small for this resolution. Consequently the sampler misses those features and returns biased estimates. Fortunately, this mismatch of scales manifests as divergences which provide a practical diagnostic. If there are any divergences after warmup, then the samples may be biased.

If the divergent transitions cannot be eliminated by increasing the adapt_delta parameter, we have to find a different way to write the model that is logically equivalent but simplifies the geometry of the posterior distribution. This problem occurs frequently with hierarchical models and one of the simplest examples is Neal’s Funnel, which is discussed in the reparameterization section of the Stan User’s Guide.

17.3.2 Maximum treedepth exceeded

Warnings about hitting the maximum treedepth are not as serious as warnings about divergent transitions. While divergent transitions are a validity concern, hitting the maximum treedepth is an efficiency concern. Configuring the No-U-Turn-Sampler (the variant of HMC used by Stan) requires putting a cap on the depth of the trees that it evaluates during each iteration (for details on this see the Hamiltonian Monte Carlo Sampling chapter in the Stan Reference Manual. When the maximum allowed tree depth is reached it indicates that NUTS is terminating prematurely to avoid excessively long execution time.

This is controlled through the max_depth argument. If the number of transitions which exceed maximum treedepth is low, increasing max_depth may correct this problem.

17.3.3 Low E-BFMI values - sampler transitions HMC potential energy.

The sampler csv output column energy__ is used to diagnose the accuracy of any Hamiltonian Monte Carlo sampler. If the standard deviation of energy is much larger than \(\sqrt{D / 2}\), where \(D\) is the number of unconstrained parameters, then the sampler is unlikely to be able to explore the posterior adequately. This is usually due to heavy-tailed posteriors and can sometimes be remedied by reparameterizing the model.

The warning that some number of chains had an estimated Bayesian Fraction of Missing Information (BFMI) that was too low implies that the adaptation phase of the Markov Chains did not turn out well and those chains likely did not explore the posterior distribution efficiently. For more details on this diagnostic, see https://arxiv.org/abs/1604.00695. Should this occur, you can either run the sampler for more iterations, or consider reparameterizing your model.

17.3.4 Low effective sample sizes

Roughly speaking, the effective sample size (ESS) of a quantity of interest captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm. Clearly, the higher the ESS the better. Stan uses \(\hat{R}\) adjustment to use the between-chain information in computing the ESS. For example, in case of multimodal distributions with well-separated modes, this leads to an ESS estimate that is close to the number of distinct modes that are found.

Bulk-ESS refers to the effective sample size based on the rank normalized draws. This does not directly compute the ESS relevant for computing the mean of the parameter, but instead computes a quantity that is well defined even if the chains do not have finite mean or variance. Overall bulk-ESS estimates the sampling efficiency for the location of the distribution (e.g. mean and median).

Often quite smaller ESS would be sufficient for the desired estimation accuracy, but the estimation of ESS and convergence diagnostics themselves require higher ESS. We recommend requiring that the bulk-ESS is greater than 100 times the number of chains. For example, when running four chains, this corresponds to having a rank-normalized effective sample size of at least 400.

17.3.5 High \(\hat{R}\)

\(\hat{R}\) (R-hat) convergence diagnostic compares the between- and within-chain estimates for model parameters and other univariate quantities of interest. If chains have not mixed well (ie, the between- and within-chain estimates don’t agree), \(\hat{R}\) is larger than 1. We recommend running at least four chains by default and only using the sample if \(\hat{R}\) is less than 1.01. Stan reports \(\hat{R}\) which is the maximum of rank normalized split-R-hat and rank normalized folded-split-R-hat, which works for thick tailed distributions and is sensitive also to differences in scale. For more details on this diagnostic, see https://arxiv.org/abs/1903.08008.

There is further discussion in https://arxiv.org/abs/1701.02434; however the correct resolution is necessarily model specific, hence all suggestions general guidelines only.