How to Diagnose and Resolve Convergence Problems
A big advantage of Stan is that it employs a range of diagnostics to let you notice many potential problems with your model — Stan is conservative and throws warnings for anything suspicious. Here we walk through the types of warnings and hints to help you diagnose and resolve underlying modelling problems. If you fail to diagnose/resolve the problems with the model yourself or if you have trouble understanding or applying some of the hints, don’t worry, you are welcome to ask on Stan Discourse, we’ll try to help!
For guidance on warnings that occur when compiling the model, see Stan User’s guide on errors and warnings.
When can warnings be ignored
In most cases the warnings actually indicate important problems with your model. This does not mean that every time you see a warning the model estimates are meaningless, but when you see warnings you shouldn’t trust your estimates without first understanding what the warnings mean.
However, in early stages of a modelling workflow, we often don’t need completely reliable inference, and a roughly correct posterior can be enough to let us check if the model is sensible using posterior predictive checks. If warnings occur rarely or the diagnostics are just somewhat above the recommended threshold, it often makes sense to do some rough sanity checks before investigating the warnings in detail. This can help to avoid investing a lot of time debugging a model that would be discarded anyway due to lack of fit to data or other conceptual problems.
Types of warnings
- Divergent transitions after warmup
- Exceptions thrown when computing target density and its gradient (Hamiltonian proposal rejected)
- R-hat
- Bulk- and Tail-ESS
- Maximum treedepth
- BFMI low
Divergent transitions after warmup
Example:
1: There were 15 divergent transitions after warmup.
Stan uses Hamiltonian Monte Carlo (HMC) to explore the target distribution — the posterior defined by a Stan program + data — by simulating the evolution of a Hamiltonian system. In order to approximate the exact solution of the Hamiltonian dynamics we need to choose a step size governing how far we move each time we evolve the system forward. The step size controls the resolution of the sampler.
For hard problems the curvature of the posterior can vary a lot and there can be features of the target distribution that are too small for this resolution. For example, divergences are likely if the log posterior density doesn’t have continuous derivative everywhere. Consequently the sampler misses those features and returns biased estimates. Fortunately, this mismatch of scales manifests as divergences which provide a practical diagnostic.
Even a small number of divergences after warmup cannot be safely ignored if completely reliable inference is desired. But if you get only few divergences and you get good Rhat and ESS values, the resulting posterior is often good enough to move forward. There are also cases when a small number divergences without any pattern in their locations can be verified to be unimportant, but this cannot be safely assumed without a careful investigation of the model.
Further reading on divergences:
- Identity Crisis - a discussion of the causes of divergences, diagnosis and treatment.
- Taming divergences in Stan models a less mathematical but hopefully more accessible intuition on what divergent transitions are.
- A Conceptual Introduction to Hamiltonian Monte Carlo
Exceptions thrown when Hamiltonian proposal rejected
Examples:
24: normal_log: Scale parameter is 0, but must be positive!
Exception thrown at line -----
:
Rejecting initial value
Gradient evaluated at the initial value is not finite. Stan can’t start sampling from this initial value.
The first warning indicates that the standard deviation (scale parameter) of the normal distribution (at line 24 in the Stan program) is 0, but it must be positive for Stan to compute the value of density function.
The second message indicates that the gradient of the target (as computed by Stan’s automatic differention) is infinite, indicating numerical problems somewhere in the model but unfortunately without clear information about where exactly.
There are many other types of these warnings. A message like this does not necessarily indicate that something is wrong with your model, as it’s possible that even in a correctly specified and generally well-behaved model a numerical inaccuracy is occasionally seen, especially during warmup (for example, scale parameter with zero lower bound can still become numerically indistinguishable from 0). If such warnings occur only before the first iteration it is a sign that initialization is problematic. In this case changing the initial values can help but is not necessary if there are no further warnings. Warnings after the first few iterations may indicate bugs or numerical issues in the model and if exceptions occur many times, investigation is definitely warranted.
Further reading on numerical accuracy of floating point format used in the computation: - Lecture notes by Geyer (2020) “Stat 3701 Lecture Notes: Computer Arithmetic” provide code examples in R illustrating the most common issues in floating-point arithmetic which can cause also unwanted rounding to the boundary of the constraint.
R-hat
R-hat (sometimes also Rhat) convergence diagnostic compares the between- and within-chain estimates for model parameters and other univariate quantities of interest. If chains have not mixed well (so that between- and within-chain estimates don’t agree), R-hat is larger than 1. We recommend running at least four chains by default and in general only fully trust the sample if R-hat is less than 1.01. In early workflow, R-hat below 1.1 is often sufficient.
Stan reports R-hat as the maximum of rank normalized split-R-hat and rank normalized folded-split-R-hat, which works for thick tailed distributions and is sensitive also to differences in scale.
High R-hat means that the chains have not mixed well and so there is not a good reason to think of them as them being fully representative of the posterior.
When there are divergent transitions in the model, high R-hat is often just another symptom of the problematic geometry that caused the divergences. If there are “maximum treedepth” warnings alongside high R-hat, it usually reflects a problematic geometry of the posterior that is hard to traverse efficiently and is often a side effect of setting very high adapt_delta
in an attempt to resolve divergences. High R-hat without other warnings is commonly associated with posteriors that have multiple well-separated modes for the offending parameters, but can also arise in unimodal posteriors with difficult geometry, in case of high correlations, or in near improper posteriors.
R-hat and ESSs (see below) are useful quick summaries, but for the final results, it is useful to check also Monte Carlo standard error for the quantities of interest and compare that to the domain knowledge of the required accuracy, and if it is low, then run longer chains to get more samples from the posterior.
Further reading on R-hat:
- Rank-normalization, folding, and localization: An improved \(\hat{R}\) for assessing convergence of MCMC
- Documentation for the rhat() function in the
posterior
package.
Bulk and Tail ESS
Roughly speaking, the effective sample size (ESS) of a quantity of interest captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm. The higher the ESS the better. Stan uses R-hat adjustment to use the within- and between-chain information in computing the ESS. For example, in case of multimodal distributions with well-separated modes, this leads to an ESS estimate that is close to the number of distinct modes that are found.
Bulk-ESS refers to the effective sample size based on the rank normalized draws. This does not directly compute the ESS relevant for computing the mean of the parameter, but instead computes a quantity that is well defined even if the chains do not have finite mean or variance. Overall bulk-ESS estimates the sampling efficiency for location summaries such as mean and median. Often smaller ESS would be sufficient for the desired estimation accuracy, but the estimation of ESS and convergence diagnostics themselves require higher ESS. For final results, we recommend requiring that the bulk-ESS is greater than 100 times the number of chains. For example, when running four chains, this corresponds to having a rank-normalized effective sample size of at least 400. In early workflow, ESS > 20 is often sufficient.
Tail-ESS computes the minimum of the effective sample sizes (ESS) of the 5% and 95% quantiles. Tail-ESS can help diagnose problems due to different scales of the chains and slow mixing in the tails. If one or more chains has no draws below the 5% or above 95% quantiles computed from the all draws, there will be no direct way to assess that those chains are not completely failing, and thus NA (not available) can be reported for tail-ESS. > [name=Paul] We recently changed this policy in posterior because constant within chain was too conservative and turned out to return NA to oft
In most cases, low bulk-ESS and tail-ESS is accompanied by large R-hat, and all recommendations for large Rhat are often useful in dealing with low ESS.
R-hat and ESS are useful quick summaries, but for final results, it can useful to check also Monte Carlo standard error for the quantities of interest and compare that to the domain knowledge of the required accuracy, and if needed then sample from the posterior.
Further reading on ESS:
- Rank-normalization, folding, and localization: An improved \(\hat{R}\) for assessing convergence of MCMC
- Comparison of MCMC effective sample size estimators.
- Documentation for ess_bulk() and ess_tail() in the
posterior
package.
Maximum treedepth
Warnings about hitting the maximum treedepth are not as serious as other warnings. While divergent transitions, high R-hat and low ESS are a validity concern, hitting the maximum treedepth is an efficiency concern. Configuring the No-U-Turn-Sampler (the variant of HMC used by Stan) involves putting a cap on the number of simulation steps it evaluates during each iteration (for details on this see the Hamiltonian Monte Carlo Sampling chapter in the Stan manual). This is controlled through a maximum tree depth parameter max_treedepth
where the maximum number of steps is 2^max_treedepth
. When the maximum allowed tree depth is reached it indicates that NUTS might be terminating prematurely to avoid excessively long execution time.
If this is the only warning you are getting and your ESS and R-hat diagnostics are good, it is likely safe to ignore this warning, but finding the root cause could result in a more efficient model.
We do not generally recommend increasing max treedepth. In practice, the max treedepth limit being reached can be a sign of model misspecification, and to increase max treedepth can then result in just taking longer to fit a model that you don’t want to be fitting.
BFMI low
Example:
The E-BFMI 0.2786 is below the nominal threshold of 0.3 which suggests that
HMC may have trouble exploring the target distribution.
You may see a warning that says some number of chains had an estimated BFMI that was too low. This implies that the adaptation phase of the Markov chains did not turn out well or the posterior has thick tails that were not well explored in the simulation. As with other warnings, if the BFMI value is just slightly below the threshold, the posterior is likely good for sanity checks but probably still not good for final inferences.
A brief technical description of the diagnostic can be found in section 6.1 of A Conceptual Introduction to Hamiltonian Monte Carlo and a thorough account is given in https://arxiv.org/abs/1604.00695.
Diagnosing and resolving problems
Diagnosing problems is best thought of as a part of larger workflow of model building, testing and critique/evaluation. Building blocks of such workflow are provided in our Bayesian Workflow article.
Some approaches that can be helpful for most types of warnings follow. This is not, and can’t be, a definitive guide — each case is problematic in its own way and there is no single approach that would always work.
If you are using Stan via a package or other higher-level interface such as rstanarm
or brms
, your options are more limited, but the general advice below still applies. Additionally you may want to consult the package documentation or the underlying Stan code to get a good understanding on what is happening under the hood.
Reduce your model. Find the smallest / least complex model and a (preferably simulated) dataset that shows problems. Only add more complexity after you resolve or at least understand all the issues with the small model. If your model has multiple components (for example, a linear predictor for parameters in an ODE model), build and test small models where each of the components is separate (for example, a separate linear model and separate ODE model with constant parameters). For higher-level interfaces, this usually means removing predictor terms.
- The small model implementation workflow vignette from the SBC package shows an example of building a larger model from simpler components step-by-step.
Reduce your data (if it is big). Diagnosing a model that takes a long time to fit and has many parameters is difficult, so working with a suitable subset of the data can speed the process up. This is however a double-edged sword as some problems can be caused by having not enough data, notably fitting varying slopes/intercepts for less than 3 categories can be problematic if priors are weak.
- You can often reduce the model and data at the same time - e.g. if your linear model has multiple predictors, you can filter the data to only contain rows with a specific value of a categorical predictor (or narrow range of a continuous predictor) and then remove the predictor from the model.
Use stronger priors. Often a computation is slow to converge because it is drifting all over parameter space. A reasonable prior can control things by reducing the range of operation. Including a reasonable prior will not always solve problems of computation, but often it gets rid of the worst problems.
Visualisations: in R some very useful plots are provided by the
bayesplot
package (e.g.mcmc_parcoord
,mcmc_pairs
,mcmc_rank_overlay
) and theshinystan
package (which provides an interactive GUI). In Python and Julia theArviZ
andArviZ.jl
packages provide similar functionality. The most commonly useful visualisation is the pairs plot where well behaved models commonly show a set of “Gaussian blobs” while any other pattern can indicate problems. Further reading:- bayesplot vignette on diagnostics
- ArviZ example gallery
- Case study - diagnosing a multilevel model
- The small model implementation workflow vignette from the SBC package shows (among other things) how several problems manifest in a
pairs
plot. - Gabry et al. 2019 - Visualization in Bayesian workflow
Make sure all parameters in your model are well informed by the model. Typical problematic situations are when large changes in parameters can result in almost the same posterior density (for example complete separation in a logistic regression) and multimodality (multiple local maxima of the posterior distributions). Further reading:
- Case study - mixture models
- Identifying non-identifiability - some informal intuition of the concept and examples of problematic models and how to spot them.
- Underdetermined linear regression discusses problems arising when the data cannot inform all parameters.
- https://discourse.mc-stan.org/t/interpretation-of-cor-term-from-multivariate-animal-models/16703/26 has an example where a varying intercept at individual-level is not informed by the data.
If computation is very slow in the warmup stage, this can sometimes be fixed by choosing more reasonable initial values rather than using Stan’s defaults, which can be so widely dispersed that for models with complex geometries, chains can get stuck in unimportant regions of parameter space.
Specifically for divergences/max treedepth:
- Check that the log posterior density and its derivatives are everywhere continuous. Numerical instabilities and inaccuracies can also cause the log density or its derivatives to behave discontinuously in practice, even if they are continuous in theory.
- If the model seems “almost OK” or it is very fast to fit, you might be able to resolve the warnings by playing a bit with
adapt_delta
andstepsize
; see here for an example. Increasingadapt_delta
in particular has become common as the go-to first thing people try, and while there are cases where it can be useful, it may also be possible to obtain more efficient sampling with the defaultadapt_delta
and modified model, prior or parameterization. Increasingadapt_delta
/max_treedepth
will likely increase the sampling time per iteration. You are more likely to achieve both better sampling performance and a more robust model (not to mention understanding thereof) by pursing the above options and leaving adjustment ofadapt_delta
as a last-resort. Increasingadapt_delta
beyond 0.99 andmax_treedepth
beyond 12 is seldom useful. For the purpose of diagnosis, it is sometimes helpful to have more divergences, so reverting to default settings for diagnosis can be useful. Increasingadapt_delta
/max_treedepth
is rarely useful to resolve Rhat/ESS/BFMI issues.
Create a simulated dataset with known true values of all parameters. If the errors disappear on simulated data and the posterior mostly covers the “true” parameters from the simulation, your model may be a bad fit for the actual observed data. If you are using a higher-level interface, writing code to simulate data is a great way to test whether you understand what the model is actually doing. Further reading:
- Falling (In Love With Principled Modeling) has examples of using Stan to simulate data.
- https://discourse.mc-stan.org/t/failure-to-recover-simulated-group-means-in-cross-classified-lmm-with-monotonic-predictor/12978 shows an R simulation for a complex model and how it helped diagnose a bug.
brms
can simulate datasets using thesample_prior = "only"
argument (see the docs for more details).
Move parameters to the
data
block and set them to their true values (from simulated data). Then return them one by one toparameters
block. Which parameter introduces the problems?Using a simulated dataset, introduce tight priors centered at true parameter values (known from the simulation). How tight need the priors to be to let the model fit? Useful for identifying multimodality.
Check your code. Problems are almost as likely a result of a programming error as they are a truly statistical issue. This is most relevant for exceptions and rejections.
- Are all the declared parameters used in the model?
- Do all parameters have informative priors?
- Do your array indices and for loops match?
- Do you have correct hard bounds (e.g. standard deviation parameters have
<lower=0>
, probabilities have<lower=0, upper=1>
). Don’t use hard bounds to express soft prior knowledge as the effects to the shape of the posterior are not trivial. - Check for other errors/warnings in the Stan output (for example
Location parameter is inf, but must be finite!
) and investigate their sources, as they usually hint at coding errors. Inrstan
, those warnings might not get displayed unless you run withchains = 1
.
Specifically for exceptions/rejections: Help the sampler avoid bounds of the parameter space and numerical inaccuracies.
- Use priors that place less weight on values near the boundary (if that’s consistent with domain expertise).
- Use numerically stable expressions. This includes always performing calculations on the log-scale when possible and making use of Stan’s special composed functions. For example,
log_sum_exp(x)
is more robust numerically thanlog(sum(exp(x))
. Other examples like this includelog1m(x)
instead oflog(1-x)
,log1p_exp(x)
instead oflog(1 + exp(x))
, and many others. See the Composed Functions subsection of the the Stan functions reference for a complete listing. - Use the
print()
function in your Stan model to display intermediate values in a computation to see where an invalid value originates.
Introduce more informative priors. If your posterior includes values that imply some clearly absurd predictions (e.g. 10m tall humans, almost zero between-subject differences in survival, …), removing those by a stricter prior might be useful.
- When working on the logarithmic scale (e.g. logistic/Poisson regression) even seemingly narrow priors like
normal(0, 1);
can be actually quite wide (this makes an odds ratio/multiplicative effect ofexp(2)
or roughly7.4
still a-priori plausible - is that consistent with your domain expertise?). - Some (slightly outdated) advice on priors is at https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
- Half-normal or half-student distribution is usually preferable to half-cauchy for sd parameters in varying intercept/effect models. Forum discussion, Gelman 2006
- If you have additional knowledge that would let you defensibly constrain your priors use it. Identity Crisis has some discussion of when this can help.
- Simulating data from the prior (a.k.a prior predictive check) is a good way to check if the priors and their interaction are roughly reasonable. Gabry et al. 2019 has an example.
- When working on the logarithmic scale (e.g. logistic/Poisson regression) even seemingly narrow priors like
Compare your priors to the posterior distribution. If the model is sampling heavily in the very tails of your priors or on the boundaries of parameter constraints, this is a bad sign, indicating that your priors might be substantially influencing the sampling. Here, setting wider priors can sometimes help.
Reparametrize your model. An ideal posterior is close to independent normal distributions in each parameter as those result in constant curvature. Any deviations from this ideal — sharp corners, edges, cusps, banana shapes, tight correlations or other irregularities signal potential targets for reparametrization. In default setting, Stan will automatically rescale each parameter by a constant factor during adaptation, so the absolute scale of the posterior is usually not very important. The main part of reparametrization is to change the actual parameters and compute your parameters of interest in the
transformed parameters
block. Further reading:- Stan user’s guide chapter on reparametrization.
- Case study - diagnosing a multilevel model discusses non-centered parametrization which is frequently useful.
- The case study on hierarchical models by Mike Betancourt goes into more detail on the non-centered parametrization, Betancourt & Girolami 2015 addresses the same topic.
- Non-centered parametrization for the exponential distribution
- Stan users guide chapter on QR reparametrization for linear models
- Identifying non-identifiability - a sigmoid model shows an example of where the parameters are not well informed by data, while https://discourse.mc-stan.org/t/difficulties-with-logistic-population-growth-model/8286/3 show a potential reparametrization.
- Reparametrizing the Sigmoid Model of Gene Regulation shows problems and solutions in an ODE model.
- Multiple parametrizations of a sum-to-zero constraint.
Specifically for low BFMI warnings:
- Look at the
pairs
plot to see which primitive parameters are correlated with theenergy__
margin. - The primitive parameters that are correlated with the
energy__
margin in thepairs
plot are a good place to start thinking about reparameterizations. There should be be a negative relationship betweenlp__
andenergy__
in thepairs
plot, but this is not a concern becauselp__
is the logarithm of the posterior kernel rather than a primitive parameter.
- Look at the
Specifically for Rhat, ESS, low BFMI warnings: You might try setting a higher number of warmup or sampling iterations. Increasing the number of iterations is rarely helpful for resolving divergences/max treedepth warnings.
- Look at change in bulk-ESS and tail-ESS when the number of iterations increase. If R-hat is less than 1.01 and ESS grows linearly with the number of iterations and eventually exceeds the recommended limit, the mixing is sufficient but MCMC has high autocorrelation requiring a large number of iterations.
Run the
optimizing
mode (penalized maximum likelihood) instead of sampling (NUTS) to check if the resulting values look at least roughly reasonable, if not try to find out why.Run Stan in diagnose (
test_grad
) mode to test the gradient computations. This can sometimes detect numerical instabilities in your model.
I got no warnings when fitting the same model in JAGS/WinBUGS/…
You might not be used to seeing so many warnings from other software you use, but that does not mean that Stan has more problems than that other software. The Stan Development Team places a high priority on notifying users about any issue that could potentially be important and Stan will always tell you about problems it encounters instead of hiding them from you.
A huge advantage of the algorithms used by Stan is that they permit certain unique diagnostics that are unavailable when using other algorithms. This can lead to more warnings from Stan, but this is a feature rather than a drawback. Sometimes when people reimplement a model from some other software in Stan, they get unexpected warnings, and further investigation reveals that the problem is not Stan being too sensitive, but that the original model in fact did not compute the correct posterior.
Getting help
Te best place to get help from Stan developers and users if you have difficulties fitting a model is to visit the Stan Forums.
In order to both reduce the amount of help you need and allow us to give the best help when you do need it, it is recommended to: * Put Stan programs in a stand-alone file with a .stan
extension. Even though some Stan interfaces allow specifying the model as a string, the line numbers in warning and error messages are only meaningful if you use a separate file. * Maintain reproducibility by saving the model and initial values in files and the RStan (other other Stan interface) commands in scripts. * Use version control such as git on your files and scripts so that you have a history of the changes you’ve made. * Start simple! Build your model in stages, and check for good fits at each stage, only adding complexity if there are no red flags. If you start by writing a complicated model it can be more difficult to figure out where things are going wrong.