A big advantage of Stan is that it employs a range of diagnostics to let you notice many potential problems with your model — Stan is conservative and throws warnings for anything suspicious. Here we walk through the types of warnings and give hints to help you diagnose and resolve the underlying modelling problems. If you cannot diagnose or resolve the problems yourself, or if you have trouble understanding or applying some of the hints, don't worry: you are welcome to ask on Stan Discourse and we'll try to help!
Resolving modeling issues is generally hard and requires some understanding of probability theory and Hamiltonian Monte Carlo; see Understanding basics of Bayesian statistics and modelling for more general resources.
For guidance on warnings that occur when compiling the model, see Stan User’s guide on errors and warnings.
In most cases the warnings actually indicate important problems with your model. This does not mean that every time you see a warning the model estimates are meaningless, but when you see warnings you shouldn’t trust your estimates without first understanding what the warnings mean.
However, in early stages of a modelling workflow, we often don’t need completely reliable inference, and a roughly correct posterior can be enough to let us check if the model is sensible using posterior predictive checks. If warnings occur rarely or the diagnostics are just somewhat above the recommended threshold, it often makes sense to do some rough sanity checks before investigating the warnings in detail. This can help to avoid investing a lot of time debugging a model that would be discarded anyway due to lack of fit to data or other conceptual problems.
Example:
1: There were 15 divergent transitions after warmup.
Stan uses Hamiltonian Monte Carlo (HMC) to explore the target distribution — the posterior defined by a Stan program + data — by simulating the evolution of a Hamiltonian system. In order to approximate the exact solution of the Hamiltonian dynamics we need to choose a step size governing how far we move each time we evolve the system forward. The step size controls the resolution of the sampler.
For hard problems the curvature of the posterior can vary a lot, and there can be features of the target distribution that are too small for this resolution (for example, divergences are likely if the log posterior density doesn't have a continuous derivative everywhere). Consequently, the sampler misses those features and returns biased estimates. Fortunately, this mismatch of scales manifests as divergences, which provide a practical diagnostic.
Even a small number of divergences after warmup cannot be safely ignored if completely reliable inference is desired. But if you get only a few divergences and you get good R-hat and ESS values, the resulting posterior is often good enough to move forward. There are also cases when a small number of divergences without any pattern in their locations can be verified to be unimportant, but this cannot be safely assumed without a careful investigation of the model.
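As a concrete illustration (a hypothetical model, not taken from this guide), the centered parameterization of a hierarchical model is a classic source of divergences: when the data only weakly inform the group-level effects, the joint posterior of the group effects and their scale forms a funnel whose narrow neck is too small for any fixed step size.

```stan
data {
  int<lower=1> J;            // number of groups
  vector[J] y;               // one noisy observation per group
  vector<lower=0>[J] sigma;  // known observation noise
}
parameters {
  real mu;                   // population mean
  real<lower=0> tau;         // population scale
  vector[J] theta;           // group effects (centered parameterization)
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 5);        // half-normal because of the lower bound
  theta ~ normal(mu, tau);   // funnel-shaped geometry when tau can be near 0
  y ~ normal(theta, sigma);
}
```

The non-centered version of this model, which usually removes such divergences, is sketched in the reparameterization advice further below.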
Further reading on divergences:
- Identity Crisis: a discussion of the causes of divergences, their diagnosis and treatment.
- Taming divergences in Stan models: a less mathematical but hopefully more accessible intuition on what divergent transitions are.
- A Conceptual Introduction to Hamiltonian Monte Carlo
Examples:
Exception thrown at line 24: normal_log: Scale parameter is 0, but must be positive!
-----
Rejecting initial value:
Gradient evaluated at the initial value is not finite.
Stan can’t start sampling from this initial value.
The first warning indicates that the standard deviation (scale parameter) of the normal distribution (at line 24 in the Stan program) is 0, but it must be positive for Stan to compute the value of the density function.
The second message indicates that the gradient of the target (as computed by Stan's automatic differentiation) is not finite, indicating numerical problems somewhere in the model, unfortunately without clear information about where exactly.
There are many other types of these warnings. A message like this does not necessarily indicate that something is wrong with your model, as it's possible that even in a correctly specified and generally well-behaved model a numerical inaccuracy is occasionally seen, especially during warmup (for example, a scale parameter with a zero lower bound can still become numerically indistinguishable from 0). If such warnings occur only before the first iteration, it is a sign that initialization is problematic. In this case changing the initial values can help but is not necessary if there are no further warnings. Warnings after the first few iterations may indicate bugs or numerical issues in the model, and if exceptions occur many times, investigation is definitely warranted.
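As a hypothetical illustration (not from the original text) of a coding error that produces such messages over and over: forgetting the positivity constraint on a scale parameter lets the sampler propose zero or negative values, and every such proposal is rejected with a message like the one above.

```stan
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real sigma;              // bug: unconstrained, so negative values get proposed
  // real<lower=0> sigma;  // fix: constraint matching the support of a scale
}
model {
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);    // with the constraint this acts as a half-normal
  y ~ normal(mu, sigma);   // the line number in the exception points here
}
```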
Further reading on numerical accuracy of the floating point format used in the computation: - Lecture notes by Geyer (2020), "Stat 3701 Lecture Notes: Computer Arithmetic", provide code examples in R illustrating the most common issues in floating-point arithmetic, which can also cause unwanted rounding to the boundary of a constraint.
The R-hat (sometimes also Rhat) convergence diagnostic compares the between- and within-chain estimates for model parameters and other univariate quantities of interest. If chains have not mixed well (so that between- and within-chain estimates don't agree), R-hat is larger than 1. We recommend running at least four chains by default and, in general, only fully trusting the sample if R-hat is less than 1.01. In early workflow, R-hat below 1.1 is often sufficient.
Stan reports R-hat as the maximum of rank-normalized split-R-hat and rank-normalized folded-split-R-hat, which works also for thick-tailed distributions and is sensitive to differences in scale.
High R-hat means that the chains have not mixed well, and so there is no good reason to think that they are fully representative of the posterior.
When there are divergent transitions in the model, high R-hat is often just another symptom of the problematic geometry that caused the divergences. If there are "maximum treedepth" warnings alongside high R-hat, it usually reflects a problematic geometry of the posterior that is hard to traverse efficiently and is often a side effect of setting a very high adapt_delta in an attempt to resolve divergences. High R-hat without other warnings is commonly associated with posteriors that have multiple well-separated modes for the offending parameters, but it can also arise in unimodal posteriors with difficult geometry, in the case of high correlations, or in nearly improper posteriors.
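As a hypothetical sketch of the multimodal case: a two-component mixture without an ordering constraint has two symmetric, well-separated modes (label switching). Each chain may mix well within one mode, so a large R-hat for the component means can be the only warning.

```stan
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  vector[2] mu;              // unordered: two equivalent, well-separated modes
  // ordered[2] mu;          // one way to remove the label-switching multimodality
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 1);
  for (n in 1:N)
    target += log_mix(0.5, normal_lpdf(y[n] | mu[1], sigma),
                           normal_lpdf(y[n] | mu[2], sigma));
}
```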
R-hat and ESS (see below) are useful quick summaries, but for the final results it is useful to also check the Monte Carlo standard error for the quantities of interest and compare it to the accuracy required by your domain knowledge; if the accuracy is not yet sufficient, run longer chains to get more draws from the posterior.
Further reading on R-hat: the documentation of the posterior package.

Roughly speaking, the effective sample size (ESS) of a quantity of interest captures how many independent draws would contain the same amount of information as the dependent sample obtained by the MCMC algorithm. The higher the ESS, the better. Stan uses both within- and between-chain information when computing the ESS; for example, in the case of multimodal distributions with well-separated modes, this leads to an ESS estimate that is close to the number of distinct modes that are found.
Bulk-ESS refers to the effective sample size based on the rank-normalized draws. This does not directly compute the ESS relevant for computing the mean of the parameter, but instead computes a quantity that is well defined even if the chains do not have finite mean or variance. Overall, bulk-ESS estimates the sampling efficiency for location summaries such as the mean and median. Often a smaller ESS would be sufficient for the desired estimation accuracy, but estimating the ESS and the convergence diagnostics themselves requires a higher ESS. For final results, we recommend requiring that the bulk-ESS is greater than 100 times the number of chains. For example, when running four chains, this corresponds to having a rank-normalized effective sample size of at least 400. In early workflow, ESS > 20 is often sufficient.
Tail-ESS computes the minimum of the effective sample sizes of the 5% and 95% quantiles. Tail-ESS can help diagnose problems due to different scales of the chains and slow mixing in the tails. If one or more chains has no draws below the 5% quantile or above the 95% quantile computed from all draws, there is no direct way to assess whether those chains are failing in the tails, and thus NA (not available) can be reported for tail-ESS.
In most cases, low bulk-ESS and tail-ESS are accompanied by large R-hat, and the recommendations for large R-hat are often also useful in dealing with low ESS.
R-hat and ESS are useful quick summaries, but for final results it can be useful to also check the Monte Carlo standard error for the quantities of interest, compare it to the accuracy required by your domain knowledge, and, if needed, obtain more draws from the posterior.
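As a rough rule of thumb (not stated in the original text): for a posterior mean, the Monte Carlo standard error is approximately the posterior standard deviation divided by the square root of the ESS, i.e. MCSE ≈ sd / sqrt(ESS), so quadrupling the effective sample size roughly halves the Monte Carlo error.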
Further reading on ESS: the documentation of the posterior package.

Warnings about hitting the maximum treedepth are not as serious as the other warnings. While divergent transitions, high R-hat and low ESS are a validity concern, hitting the maximum treedepth is an efficiency concern. Configuring the No-U-Turn Sampler (the variant of HMC used by Stan) involves putting a cap on the number of simulation steps it evaluates during each iteration (for details on this, see the Hamiltonian Monte Carlo Sampling chapter in the Stan manual). This is controlled through the maximum tree depth parameter max_treedepth, where the maximum number of steps is 2^max_treedepth. When the maximum allowed tree depth is reached, it indicates that NUTS might be terminating prematurely to avoid excessively long execution time.
If this is the only warning you are getting and your ESS and R-hat diagnostics are good, it is likely safe to ignore this warning, but finding the root cause could result in a more efficient model.
We do not generally recommend increasing max_treedepth. In practice, hitting the max treedepth limit can be a sign of model misspecification, and increasing max_treedepth can then just result in taking longer to fit a model that you don't want to be fitting anyway.
Example:
The E-BFMI 0.2786 is below the nominal threshold of 0.3 which suggests that
HMC may have trouble exploring the target distribution.
You may see a warning that says some number of chains had an estimated BFMI that was too low. This implies that the adaptation phase of the Markov chains did not turn out well or the posterior has thick tails that were not well explored in the simulation. As with other warnings, if the BFMI value is just slightly below the threshold, the posterior is likely good for sanity checks but probably still not good for final inferences.
A brief technical description of the diagnostic can be found in section 6.1 of A Conceptual Introduction to Hamiltonian Monte Carlo and a thorough account is given in https://arxiv.org/abs/1604.00695.
Diagnosing problems is best thought of as part of a larger workflow of model building, testing and critique/evaluation. Building blocks of such a workflow are provided in our Bayesian Workflow article.
Some approaches that can be helpful for most types of warnings follow. This is not, and can’t be, a definitive guide — each case is problematic in its own way and there is no single approach that would always work.
If you are using Stan via a package or other higher-level interface such as rstanarm or brms, your options are more limited, but the general advice below still applies. Additionally, you may want to consult the package documentation or the underlying Stan code to get a good understanding of what is happening under the hood.
Reduce your model. Find the smallest / least complex model and a (preferably simulated) dataset that shows problems. Only add more complexity after you resolve or at least understand all the issues with the small model. If your model has multiple components (for example, a linear predictor for parameters in an ODE model), build and test small models where each of the components is separate (for example, a separate linear model and separate ODE model with constant parameters). For higher-level interfaces, this usually means removing predictor terms.
Reduce your data (if it is big). Diagnosing a model that takes a long time to fit and has many parameters is difficult, so working with a suitable subset of the data can speed the process up. This is however a double-edged sword, as some problems can be caused by not having enough data; notably, fitting varying slopes/intercepts with fewer than 3 categories can be problematic if priors are weak.
Use stronger priors. Often a computation is slow to converge because it is drifting all over parameter space. A reasonable prior can control things by reducing the range of operation. Including a reasonable prior will not always solve problems of computation, but often it gets rid of the worst problems.
Visualisations: in R, some very useful plots are provided by the bayesplot package (e.g. mcmc_parcoord, mcmc_pairs, mcmc_rank_overlay) and the shinystan package (which provides an interactive GUI). In Python and Julia, the ArviZ and ArviZ.jl packages provide similar functionality. The most commonly useful visualisation is the pairs plot, where well-behaved models commonly show a set of "Gaussian blobs" while any other pattern can indicate problems. Further reading: the documentation of the pairs plot.

Make sure all parameters in your model are well informed by the data. Typical problematic situations are when large changes in parameters can result in almost the same posterior density (for example, complete separation in a logistic regression) and multimodality (multiple local maxima of the posterior distribution).
If computation is very slow in the warmup stage, this can sometimes be fixed by choosing more reasonable initial values rather than using Stan’s defaults, which can be so widely dispersed that for models with complex geometries, chains can get stuck in unimportant regions of parameter space.
Specifically for divergences/max treedepth: you can change the parameters governing the sampler's behaviour, most notably adapt_delta and stepsize; see here for an example. Increasing adapt_delta in particular has become the go-to first thing people try, and while there are cases where it can be useful, it may also be possible to obtain more efficient sampling with the default adapt_delta and a modified model, prior or parameterization. Increasing adapt_delta/max_treedepth will likely increase the sampling time per iteration. You are more likely to achieve both better sampling performance and a more robust model (not to mention a better understanding thereof) by pursuing the options above and leaving adjustment of adapt_delta as a last resort. Increasing adapt_delta beyond 0.99 and max_treedepth beyond 12 is seldom useful. For the purpose of diagnosis it is sometimes helpful to have more divergences, so reverting to the default settings while diagnosing can be useful. Increasing adapt_delta/max_treedepth is rarely useful for resolving Rhat/ESS/BFMI issues.

Create a simulated dataset with known true values of all parameters. If the errors disappear on simulated data and the posterior mostly covers the "true" parameters from the simulation, your model may be a bad fit for the actual observed data. If you are using a higher-level interface, writing code to simulate data is a great way to test whether you understand what the model is actually doing. Note that brms can simulate datasets using the sample_prior = "only" argument (see the docs for more details).
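A minimal sketch of a stand-alone simulation program (hypothetical, with made-up parameter names): it draws data from known "true" parameter values so you can later check whether fitting your real model to the simulated y recovers them.

```stan
data {
  int<lower=0> N;
  real mu_true;              // "true" values chosen by you
  real<lower=0> sigma_true;
}
generated quantities {
  vector[N] y;
  for (n in 1:N)
    y[n] = normal_rng(mu_true, sigma_true);  // simulated dataset
}
```

Because the program has no parameters block, it can be run with the fixed-parameter sampler (e.g. algorithm = "Fixed_param" in rstan).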
Move parameters to the data block and set them to their true values (from simulated data). Then return them one by one to the parameters block. Which parameter introduces the problems?
Using a simulated dataset, introduce tight priors centered at the true parameter values (known from the simulation). How tight do the priors need to be for the model to fit? This is useful for identifying multimodality.
Check your code. Problems are almost as likely to be the result of a programming error as of a genuinely statistical issue. This is most relevant for exceptions and rejections.
Check that parameter constraints match the support of the parameters (e.g. standard deviations have <lower=0>, probabilities have <lower=0, upper=1>). Don't use hard bounds to express soft prior knowledge, as the effects on the shape of the posterior are not trivial (see the sketch after the next point).

Watch for rejection/exception messages (e.g. Location parameter is inf, but must be finite!) and investigate their sources, as they usually hint at coding errors. In rstan, those warnings might not get displayed unless you run with chains = 1.
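A minimal sketch of the hard-bounds point above (hypothetical parameter name): a soft prior usually expresses "the effect is probably small" better than a hard interval constraint, which distorts the posterior near its boundaries and can cause divergences when mass piles up there.

```stan
parameters {
  // Hard bounds used as soft knowledge (often problematic):
  //   real<lower=-1, upper=1> beta;
  real beta;                // unconstrained parameter
}
model {
  beta ~ normal(0, 0.5);    // soft prior: |beta| > 1 is unlikely but not impossible
}
```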
Specifically for exceptions/rejections: help the sampler avoid the bounds of the parameter space and numerical inaccuracies.

Prefer Stan's composed functions: log_sum_exp(x) is more robust numerically than log(sum(exp(x))). Other examples like this include log1m(x) instead of log(1 - x), log1p_exp(x) instead of log(1 + exp(x)), and many others. See the Composed Functions subsection of the Stan functions reference for a complete listing.

Use the print() function in your Stan model to display intermediate values in a computation to see where an invalid value originates.
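A minimal sketch (hypothetical model) combining both points: the log-probability of "at least one success in N trials" computed with composed functions, plus a print() call to locate invalid values.

```stan
data {
  int<lower=1> N;
}
parameters {
  real<lower=0, upper=1> p;
}
model {
  // Stable version of log(1 - (1 - p)^N); the naive expression
  //   log(1 - (1 - p)^N)
  // suffers from cancellation and can round to log(0) = -inf.
  real lp = log1m_exp(N * log1m(p));
  if (is_nan(lp))
    print("invalid log-probability for p = ", p);
  target += lp;
}
```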
function in your Stan model to display intermediate values in a computation to see where an invalid value originates.Introduce more informative priors. If your posterior includes values that imply some clearly absurd predictions (e.g. 10m tall humans, almost zero between-subject differences in survival, …), removing those by a stricter prior might be useful.
normal(0, 1);
can be actually quite wide (this makes an odds ratio/multiplicative effect of exp(2)
or roughly 7.4
still a-priori plausible - is that consistent with your domain expertise?).Compare your priors to the posterior distribution. If the model is sampling heavily in the very tails of your priors or on the boundaries of parameter constraints, this is a bad sign, indicating that your priors might be substantially influencing the sampling. Here, setting wider priors can sometimes help.
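A minimal sketch of the prior-width point above (hypothetical logistic regression): the priors act on the log-odds scale, so normal(0, 1) already allows odds ratios between roughly 0.14 and 7.4 within two prior standard deviations.

```stan
data {
  int<lower=0> N;
  vector[N] x;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real alpha;
  real beta;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);   // log-odds scale: exp(+-2), i.e. ~0.14 to ~7.4 as odds ratios
  y ~ bernoulli_logit(alpha + beta * x);
}
```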
Reparametrize your model. An ideal posterior is close to independent normal distributions in each parameter, as those result in constant curvature. Any deviation from this ideal (sharp corners, edges, cusps, banana shapes, tight correlations or other irregularities) signals a potential target for reparametrization. In the default setting, Stan automatically rescales each parameter by a constant factor during adaptation, so the absolute scale of the posterior is usually not very important. The main part of a reparametrization is to change the actual parameters and compute your parameters of interest in the transformed parameters block.
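A minimal sketch of the most common such change (the non-centered parameterization of the hierarchical model sketched in the divergences section above, with hypothetical variable names): the sampler works with a standard-normal theta_raw, and the group effects are reconstructed in transformed parameters, which removes the funnel between theta and tau.

```stan
data {
  int<lower=1> J;
  vector[J] y;
  vector<lower=0>[J] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta_raw;                      // standard normal a priori
}
transformed parameters {
  vector[J] theta = mu + tau * theta_raw;   // implies theta ~ normal(mu, tau)
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 5);
  theta_raw ~ std_normal();
  y ~ normal(theta, sigma);
}
```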
Specifically for low BFMI warnings: look at the pairs plot to see which primitive parameters are correlated with the energy__ margin. The primitive parameters that are correlated with the energy__ margin in the pairs plot are a good place to start thinking about reparameterizations. There should be a negative relationship between lp__ and energy__ in the pairs plot, but this is not a concern because lp__ is the logarithm of the posterior kernel rather than a primitive parameter.

Specifically for Rhat, ESS, low BFMI warnings: you might try setting a higher number of warmup or sampling iterations. Increasing the number of iterations is rarely helpful for resolving divergences/max treedepth warnings.
Run the optimizing mode (penalized maximum likelihood) instead of sampling (NUTS) to check whether the resulting values look at least roughly reasonable; if not, try to find out why.
Run Stan in diagnose (test_grad) mode to test the gradient computations. This can sometimes detect numerical instabilities in your model.
You might not be used to seeing so many warnings from other software you use, but that does not mean that Stan has more problems than that other software. The Stan Development Team places a high priority on notifying users about any issue that could potentially be important and Stan will always tell you about problems it encounters instead of hiding them from you.
A huge advantage of the algorithms used by Stan is that they permit certain unique diagnostics that are unavailable when using other algorithms. This can lead to more warnings from Stan, but this is a feature rather than a drawback. Sometimes when people reimplement a model from some other software in Stan, they get unexpected warnings, and further investigation reveals that the problem is not Stan being too sensitive, but that the original model in fact did not compute the correct posterior.
The best place to get help from Stan developers and users if you have difficulties fitting a model is to visit the Stan Forums.
In order to both reduce the amount of help you need and allow us to give the best help when you do need it, it is recommended to:
- Put Stan programs in a stand-alone file with a .stan extension. Even though some Stan interfaces allow specifying the model as a string, the line numbers in warning and error messages are only meaningful if you use a separate file.
- Maintain reproducibility by saving the model and initial values in files and the RStan (or other Stan interface) commands in scripts.
- Use version control such as git on your files and scripts so that you have a history of the changes you've made.
- Start simple! Build your model in stages, and check for good fits at each stage, only adding complexity if there are no red flags. If you start by writing a complicated model it can be more difficult to figure out where things are going wrong.