Posterior Analysis
Stan uses Markov chain Monte Carlo (MCMC) techniques to generate draws from the posterior distribution for full Bayesian inference. MCMC methods were developed for situations in which it is not straightforward to make independent draws (Metropolis et al. 1953).
Stan’s variational inference algorithm provides draws from the variational approximation to the posterior, which may be analyzed just like any other MCMC output even though the draws do not form a Markov chain.
Stan’s Laplace algorithm produces a sample from a normal approximation centered at the mode of a distribution in the unconstrained space. If the mode is a maximum a posteriori (MAP) estimate, the sample provides an estimate of the mean and standard deviation of the posterior distribution. If the mode is a maximum likelihood estimate (MLE), the sample provides an estimate of the standard error of the MLE.
Markov chains
A Markov chain is a sequence of random variables $\theta^{(1)}, \theta^{(2)}, \ldots$ in which each variable is conditionally independent of all earlier variables given the value of the previous one. Thus if $\theta = (\theta^{(1)}, \ldots, \theta^{(N)})$, then

$$
p(\theta) = p(\theta^{(1)}) \prod_{n=2}^{N} p(\theta^{(n)} \mid \theta^{(n-1)}).
$$
Stan uses Hamiltonian Monte Carlo to generate a next state in a manner described in the Hamiltonian Monte Carlo chapter.
The Markov chains Stan and other MCMC samplers generate are ergodic in the sense required by the Markov chain central limit theorem, meaning roughly that there is a reasonable chance of reaching one value of $\theta$ from another.
Stationary Markov chains have an equilibrium distribution on states in which each has the same marginal probability function, so that $p(\theta^{(n)})$ is the same probability function for every $n$.
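As a hypothetical illustration (not Stan’s sampler), a Gaussian AR(1) process $x^{(n+1)} = \phi x^{(n)} + \epsilon$ with $\epsilon \sim \textsf{normal}(0, 1)$ is a Markov chain whose equilibrium distribution is $\textsf{normal}(0, 1/(1-\phi^2))$; started at equilibrium, every marginal stays there:

```python
import numpy as np

# Gaussian AR(1): x[n+1] = phi * x[n] + eps, eps ~ normal(0, 1).
# Its equilibrium distribution is normal(0, 1 / (1 - phi**2)).
rng = np.random.default_rng(0)
phi = 0.5
n_paths, n_steps = 4000, 50

# Initialize from the equilibrium distribution, so every marginal
# p(x^(n)) should match it at each subsequent step.
x = rng.normal(0.0, np.sqrt(1 / (1 - phi**2)), size=n_paths)
marginal_vars = []
for _ in range(n_steps):
    x = phi * x + rng.normal(size=n_paths)
    marginal_vars.append(x.var())

# Each marginal variance stays near the equilibrium value 1/(1 - 0.25).
print(min(marginal_vars), max(marginal_vars))
```

Every step’s marginal variance hovers around $4/3 \approx 1.33$, the stationary variance, demonstrating the equilibrium property empirically.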
Using MCMC methods introduces two difficulties that are not faced by independent sample Monte Carlo methods. The first problem is determining when a randomly initialized Markov chain has converged to its equilibrium distribution. The second problem is that the draws from a Markov chain may be correlated or even anti-correlated, and thus the central limit theorem’s bound on estimation error no longer applies. These problems are addressed in the next two sections.
Stan’s posterior analysis tools compute a number of summary statistics, estimates, and diagnostics for Markov chain Monte Carlo (MCMC) samples. Stan’s estimators and diagnostics are more robust in the face of non-convergence, antithetical sampling, and long-term Markov chain correlations than most of the other tools available. The algorithms Stan uses to achieve this are described in this chapter.
Convergence
By definition, a Markov chain samples from the target distribution only after it has converged to equilibrium (i.e., equilibrium is achieved when $p(\theta^{(n)})$ is the target density $p(\theta \mid y)$).
In theory, convergence is only guaranteed asymptotically as the number of draws grows without bound.
In practice, diagnostics must be applied to monitor convergence for the finite number of draws actually available.
Notation for samples, chains, and draws
To establish basic notation, suppose a target Bayesian posterior density $p(\theta \mid y)$ given real-valued vectors of parameters $\theta$ and real- and discrete-valued data $y$.¹
An MCMC sample consists of a set of $M$ Markov chains, each consisting of $N$ draws $\theta^{(nm)}$.²
Potential scale reduction
One way to monitor whether a chain has converged to the equilibrium distribution is to compare its behavior to other randomly initialized chains. This is the motivation for the Gelman and Rubin (1992) potential scale reduction statistic, $\hat{R}$.
Gelman and Rubin’s recommendation is that the independent Markov chains be initialized with diffuse starting values for the parameters and sampled until all values for $\hat{R}$ are below some threshold near 1 (a traditional choice is 1.1; Vehtari et al. (2021) suggest the stricter 1.01).
For $M$ chains of $N$ draws each, the between-chain variance estimate is

$$
B = \frac{N}{M-1} \sum_{m=1}^{M} \left( \bar{\theta}^{(\bullet m)} - \bar{\theta}^{(\bullet\bullet)} \right)^2,
$$

where

$$
\bar{\theta}^{(\bullet m)} = \frac{1}{N} \sum_{n=1}^{N} \theta^{(nm)}
$$

and

$$
\bar{\theta}^{(\bullet\bullet)} = \frac{1}{M} \sum_{m=1}^{M} \bar{\theta}^{(\bullet m)}.
$$

The within-chain variance is averaged over the chains,

$$
W = \frac{1}{M} \sum_{m=1}^{M} s_m^2,
$$

where

$$
s_m^2 = \frac{1}{N-1} \sum_{n=1}^{N} \left( \theta^{(nm)} - \bar{\theta}^{(\bullet m)} \right)^2.
$$

The variance estimator is a mixture of the within-chain and cross-chain sample variances,

$$
\widehat{\mathrm{var}}^{+}(\theta \mid y) = \frac{N-1}{N}\, W + \frac{1}{N}\, B.
$$

Finally, the potential scale reduction statistic is defined by

$$
\hat{R} = \sqrt{\frac{\widehat{\mathrm{var}}^{+}(\theta \mid y)}{W}}.
$$
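The potential scale reduction computation can be sketched in a few lines of numpy. This is a minimal sketch of the basic statistic only, not Stan’s actual implementation, which additionally uses split chains and rank normalization as described below:

```python
import numpy as np

def potential_scale_reduction(chains):
    """Basic R-hat from an (M chains, N draws) array."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    between = n * chain_means.var(ddof=1)             # B
    within = chains.var(axis=1, ddof=1).mean()        # W
    var_plus = (n - 1) / n * within + between / n     # mixture estimator
    return np.sqrt(var_plus / within)

rng = np.random.default_rng(1)
# Four chains sampling the same distribution: R-hat should be near 1.
same = rng.normal(size=(4, 1000))
# One chain centered elsewhere: R-hat should be well above 1.
shifted = same.copy()
shifted[0] += 3.0
print(potential_scale_reduction(same), potential_scale_reduction(shifted))
```

The first call returns a value very close to 1; the second is far above 1, flagging that the chains are exploring different regions.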
Split R-hat for detecting non-stationarity
Before calculating the potential-scale-reduction statistic $\hat{R}$, Stan splits each chain in half. This provides an additional means to detect non-stationarity within individual chains: if one chain drifts slowly upward and another drifts slowly downward, their overall means may agree, but the half-chains will have divergent means and be flagged as non-convergent by $\hat{R}$.
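The effect of splitting can be demonstrated with a small sketch (illustrative only, using the basic R-hat formula without rank normalization): two drifting chains whose overall means agree pass plain $\hat{R}$ but fail split-$\hat{R}$.

```python
import numpy as np

def rhat(chains):
    # Basic potential scale reduction over an (M, N) array of draws.
    m, n = chains.shape
    between = n * chains.mean(axis=1).var(ddof=1)
    within = chains.var(axis=1, ddof=1).mean()
    return np.sqrt(((n - 1) / n * within + between / n) / within)

def split_rhat(chains):
    # Split each chain in half, then treat the halves as 2*M chains.
    m, n = chains.shape
    half = n // 2
    return rhat(chains[:, : 2 * half].reshape(2 * m, half))

rng = np.random.default_rng(2)
noise = rng.normal(size=(2, 1000))
# One chain drifts up while the other drifts down: the full-chain means
# agree, so plain R-hat looks fine, but the split halves disagree.
trend = np.linspace(-2, 2, 1000)
drifting = noise + np.stack([trend, -trend])
print(rhat(drifting), split_rhat(drifting))
```

Plain `rhat` comes out near 1 despite the obvious non-stationarity, while `split_rhat` is well above 1, which is exactly the failure mode splitting is designed to catch.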
Rank normalization helps when there are heavy tails
Split $\hat{R}$ and the effective sample size (ESS) are well defined only if the marginal posteriors have finite mean and variance. Therefore, following Vehtari et al. (2021), we compute rank normalized parameter values and then feed them into the formulas for split $\hat{R}$ and ESS.
Rank normalization proceeds as follows:
First, replace each value $\theta^{(nm)}$ by its rank $r^{(nm)}$ within the pooled draws from all chains. Average ranks are used for ties to preserve the number of unique values of discrete quantities. Second, transform the ranks to normal scores using the inverse normal transformation and a fractional offset:

$$
z^{(nm)} = \Phi^{-1}\!\left( \frac{r^{(nm)} - 3/8}{NM + 1/4} \right).
$$
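A minimal sketch of this transformation, using only numpy and the standard library’s `statistics.NormalDist` for the inverse normal CDF (the function name `rank_normalize` is ours, not Stan’s):

```python
import numpy as np
from statistics import NormalDist

def rank_normalize(draws):
    """Map pooled draws to normal scores via average ranks and the
    fractional offset (r - 3/8) / (S + 1/4)."""
    flat = draws.ravel()
    s = flat.size
    # Rank each value; ties get the average of their ranks.
    order = np.argsort(flat, kind="stable")
    ranks = np.empty(s)
    ranks[order] = np.arange(1, s + 1)
    for v in np.unique(flat):
        mask = flat == v
        ranks[mask] = ranks[mask].mean()
    inv_cdf = NormalDist().inv_cdf
    z = np.array([inv_cdf((r - 0.375) / (s + 0.25)) for r in ranks])
    return z.reshape(draws.shape)

# Heavy-tailed (Cauchy) draws become well-behaved normal scores.
rng = np.random.default_rng(3)
draws = rng.standard_cauchy(size=(4, 500))
z = rank_normalize(draws)
print(z.mean(), z.std())
```

The Cauchy draws have no finite mean or variance, but their normal scores have mean essentially 0 and standard deviation near 1, which is why the rank normalized values are safe inputs to the split-$\hat{R}$ and ESS formulas.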
To further improve sensitivity to chains having different scales, the rank normalized $\hat{R}$ is also computed for the corresponding folded draws, the absolute deviations from the median,

$$
\zeta^{(nm)} = \left| \theta^{(nm)} - \operatorname{median}(\theta) \right|.
$$

To obtain a single conservative estimate, we report the maximum of the rank normalized split-$\hat{R}$ and the rank normalized folded split-$\hat{R}$ for each parameter.
Bulk-ESS is defined as the ESS for the rank normalized split chains. Tail-ESS is defined as the minimum ESS for the 5% and 95% quantiles. See the effective sample size section for details on how ESS is estimated.
Convergence is global
A question that often arises is whether it is acceptable to monitor convergence of only a subset of the parameters or generated quantities. The short answer is “no,” but this is elaborated further in this section.
For example, consider the value `lp__`, which is the log posterior density (up to a constant).³ It is thus a mistake to declare convergence in any practical sense if `lp__` has not converged, because different chains are really in different parts of the space. Yet measuring convergence for `lp__` is particularly tricky, as noted below.
Asymptotics and transience vs. equilibrium
Markov chain convergence is a global property in the sense that it does not depend on the choice of function of the parameters that is monitored. There is no hard cutoff between pre-convergence “transience” and post-convergence “equilibrium.” What happens is that as the number of states in the chain approaches infinity, the distribution of possible states in the chain approaches the target distribution and in that limit the expected value of the Monte Carlo estimator of any integrable function converges to the true expectation. There is nothing like warmup here, because in the limit, the effects of initial state are completely washed out.
Multivariate convergence of functions
The transformation from unconstrained space to constrained space is just another function of the Markov chain state, so it does not affect convergence.
Different functions may have different autocorrelations, but if the Markov chain has equilibrated, then all Markov chain plus function compositions should be consistent with convergence. Formally, any function that appears inconsistent is of concern, and although it would be unreasonable to test every function, `lp__` and other measured quantities should at least be consistent.
The obvious difference in `lp__` is that it tends to vary quickly with position and is consequently susceptible to outliers.
Finite numbers of states
The question is what happens for a finite number of states. If we can prove a strong geometric ergodicity property (which depends on the sampler and the target distribution), then one can show that there exists a finite time after which the chain forgets its initial state with large probability. This is both the autocorrelation time and the warmup time. But even if you can show such a time exists and is finite (which is nigh impossible), you cannot compute an actual value analytically.
So what we do in practice is hope that the finite number of draws is large enough for the expectations to be reasonably accurate. Removing warmup iterations improves the accuracy of the expectations but there is no guarantee that removing any finite number of draws will be enough.
Why inconsistent R-hat?
Firstly, as noted above, for any finite number of draws there will always be some residual effect of the initial state, which typically manifests as some small (or large, if the autocorrelation time is huge) probability of having a large outlier. Functions robust to such outliers (say, quantiles) will appear more stable and have better $\hat{R}$ values.
Secondly, the distribution of `lp__` (the log posterior density) almost never looks Gaussian; instead it features long tails that can lead to large $\hat{R}$ values even in the limit.
Final words on convergence monitoring
“Convergence” is a global property and holds for all integrable functions at once, but employing the $\hat{R}$ statistic requires choosing particular functions to monitor.
Note that if you just compare the expectations between chains, then you can rely on the Markov chain asymptotics for Gaussian distributions and apply the standard tests.
Effective sample size
The second technical difficulty posed by MCMC methods is that the draws will typically be autocorrelated (or anticorrelated) within a chain. This increases (or reduces) the uncertainty of the estimation of posterior quantities of interest, such as means, variances, or quantiles; see Charles J. Geyer (2011).
Stan estimates an effective sample size for each parameter, which plays the same role in the Markov chain Monte Carlo central limit theorem (MCMC CLT) that the number of independent draws plays in the standard central limit theorem (CLT).
Unlike most packages, the particular calculations used by Stan follow the split-chain, rank normalized approach of Vehtari et al. (2021).
Definition of effective sample size
The amount by which autocorrelation within the chains increases uncertainty in estimates can be measured by the effective sample size (ESS). Given an independent sample (with finite variance), the central limit theorem bounds uncertainty in estimates based on the number of draws $N$. For example, the Monte Carlo standard error (MCSE) is proportional to $1 / \sqrt{N}$.
The effective sample size of a sequence is defined in terms of the autocorrelations within the sequence at different lags. The autocorrelation $\rho_t$ at lag $t \geq 0$ for a chain with joint probability function $p(\theta)$, mean $\mu$, and variance $\sigma^2$ is defined to be

$$
\rho_t = \frac{1}{\sigma^2}\, \mathbb{E}\!\left[ \left(\theta^{(n)} - \mu\right)\left(\theta^{(n+t)} - \mu\right) \right].
$$

This is the correlation between the two sequences offset by $t$ positions (i.e., a lag in time-series terminology). The effective sample size of $N$ draws generated by a process with autocorrelations $\rho_t$ is defined by

$$
N_{\mathrm{eff}}
= \frac{N}{\sum_{t=-\infty}^{\infty} \rho_t}
= \frac{N}{1 + 2 \sum_{t=1}^{\infty} \rho_t}.
$$
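For a stationary AR(1) process with parameter $\phi$ (an idealized example, not a Stan sampler), the autocorrelations are $\rho_t = \phi^t$, so the definition gives the closed form $N_{\mathrm{eff}} = N (1 - \phi) / (1 + \phi)$:

```python
# ESS for an AR(1) process with autocorrelations rho_t = phi**t:
#   N_eff = N / (1 + 2 * sum_{t>=1} phi**t) = N * (1 - phi) / (1 + phi).
def ar1_ess(n, phi):
    return n * (1 - phi) / (1 + phi)

print(ar1_ess(1000, 0.5))   # positively correlated: fewer effective draws
print(ar1_ess(1000, 0.0))   # independent: N_eff equals N
print(ar1_ess(1000, -0.5))  # anticorrelated: N_eff exceeds N
```

With $\phi = 0.5$ the 1000 draws are worth about 333 independent draws; with $\phi = -0.5$ they are worth 3000, illustrating the anticorrelated case discussed next.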
For independent draws, the effective sample size is just the number of iterations. For correlated draws, the effective sample size is usually lower than the number of iterations, but in case of anticorrelated draws, the effective sample size can be larger than the number of iterations. In this latter case, MCMC can work better than independent sampling for some estimation problems. Hamiltonian Monte Carlo, including the no-U-turn sampler used by default in Stan, can produce anticorrelated draws if the posterior is close to Gaussian with little posterior correlation.
Estimation of effective sample size
In practice, the probability function in question cannot be tractably integrated, and thus neither the autocorrelations nor the effective sample size can be calculated analytically. Instead, these quantities must be estimated from the draws themselves. The rest of this section describes the autocorrelation and split-chain based effective sample size estimator that Stan uses, following Vehtari et al. (2021).
Stan carries out the autocorrelation computations for all lags simultaneously using Eigen’s fast Fourier transform (FFT) package with appropriate padding; see Charles J. Geyer (2011) for more detail on using FFT for autocorrelation calculations. The autocorrelation estimates $\hat{\rho}_{t,m}$ at lag $t$ from each chain $m$ are combined with the within- and cross-chain variance estimates into a combined autocorrelation estimate

$$
\hat{\rho}_t = 1 - \frac{W - \frac{1}{M} \sum_{m=1}^{M} s_m^2\, \hat{\rho}_{t,m}}{\widehat{\mathrm{var}}^{+}}.
$$
If the chains have not converged, the variance estimator $\widehat{\mathrm{var}}^{+}$ will overestimate the variance, leading to an overestimate of the autocorrelation and an underestimate of the effective sample size.
Because of the noise in the correlation estimates $\hat{\rho}_t$ as $t$ increases, the infinite sum is truncated. Stan uses Geyer’s initial positive sequence and initial monotone sequence criteria (Geyer 2011): the sum runs over the paired autocorrelations

$$
\hat{P}_{t'} = \hat{\rho}_{2t'} + \hat{\rho}_{2t'+1},
$$

truncated at the last index $k$ for which all pairs are positive, with the $\hat{P}_{t'}$ further forced to be monotonically decreasing. The effective sample size is then estimated as

$$
\hat{N}_{\mathrm{eff}} = \frac{N M}{\hat{\tau}},
\quad \text{where} \quad
\hat{\tau} = -1 + 2 \sum_{t'=0}^{k} \hat{P}_{t'}.
$$
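A simplified single-chain sketch of this estimator is shown below. It uses an FFT with zero padding for the autocorrelations and Geyer’s initial positive sequence for truncation; Stan’s actual estimator additionally combines multiple chains and enforces the monotone decreasing pair sequence.

```python
import numpy as np

def ess_single_chain(x):
    """Single-chain ESS sketch using Geyer's initial positive sequence."""
    n = x.size
    x = x - x.mean()
    # FFT-based autocovariance with zero padding.
    f = np.fft.rfft(x, 2 * n)
    acov = np.fft.irfft(f * np.conj(f))[:n] / n
    rho = acov / acov[0]
    # Sum paired autocorrelations until a pair goes negative.
    tau = -1.0
    for t in range(n // 2):
        pair = rho[2 * t] + rho[2 * t + 1]
        if pair < 0:
            break
        tau += 2.0 * pair
    return n / tau

# AR(1) draws with phi = 0.9; theory gives ESS ~ N * (1 - phi) / (1 + phi).
rng = np.random.default_rng(4)
phi, n = 0.9, 20000
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = eps[0] / np.sqrt(1 - phi**2)
for i in range(1, n):
    x[i] = phi * x[i - 1] + eps[i]
print(ess_single_chain(x), n * (1 - phi) / (1 + phi))
```

The empirical estimate lands near the theoretical value of about 1053 effective draws out of 20,000, i.e., an autocorrelation time of roughly 19.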
Estimation of MCMC standard error
The posterior standard deviation of a parameter $\theta$ conditioned on observed data $y$ is just the standard deviation of the posterior density $p(\theta \mid y)$. The previous section showed how to estimate the effective sample size $N_{\mathrm{eff}}$ for a parameter. The mean of the posterior draws of $\theta$,

$$
\hat{\theta} = \frac{1}{NM} \sum_{m=1}^{M} \sum_{n=1}^{N} \theta^{(nm)},
$$

is treated as an estimator of the true posterior mean, $\mathbb{E}[\theta \mid y]$, based on the observed data $y$. The standard error for the estimator $\hat{\theta}$ is given by the posterior standard deviation divided by the square root of the effective sample size,

$$
\mathrm{se}(\hat{\theta}) = \frac{\mathrm{sd}(\theta \mid y)}{\sqrt{N_{\mathrm{eff}}}}.
$$
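A minimal sketch of this computation (the helper name `mcse_mean` is ours; the posterior standard deviation is estimated by the sample standard deviation of the draws):

```python
import numpy as np

def mcse_mean(draws, ess):
    # MCSE of the posterior mean: sd of draws over sqrt(ESS).
    return draws.std(ddof=1) / np.sqrt(ess)

rng = np.random.default_rng(5)
draws = rng.normal(size=4000)  # stand-in for posterior draws
# For independent draws, ESS equals the number of draws.
print(draws.mean(), mcse_mean(draws, ess=4000.0))
```

For 4000 independent standard normal draws, the MCSE is about $1/\sqrt{4000} \approx 0.016$, which is the scale at which the reported posterior mean can be trusted.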
Thinning samples
In complex posteriors, draws are almost always positively correlated. In these situations, the autocorrelation $\rho_t$ typically decreases as the lag $t$ increases. When this happens, thinning the draws (saving only every $k$-th draw) reduces the autocorrelation among the saved draws.
For instance, consider generating one thousand posterior draws in one of the following two ways.
- Generate 1000 draws after convergence and save all of them.
- Generate 10,000 draws after convergence and save every tenth draw.
Even though both produce a sample consisting of one thousand draws, the second approach with thinning can produce a higher effective sample size when the draws are positively correlated. That’s because the autocorrelation between draws that are ten iterations apart is typically much lower than the autocorrelation between consecutive draws.
Now contrast the second approach above with the unthinned alternative,
- Generate 10,000 draws after convergence and save every draw.
This will typically have a higher effective sample size than the thinned sample consisting of every tenth draw, though the gap might not be very large. To summarize, the only reason to thin a sample is to reduce memory requirements.
If draws are anticorrelated, then thinning will increase correlation and further reduce the overall effective sample size.
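These tradeoffs can be made concrete with the idealized AR(1) example: subsampling an AR(1) chain at every $k$-th draw yields another AR(1) chain with parameter $\phi^k$, so the AR(1) ESS formula applies at each lag (an idealization, not a claim about any particular posterior):

```python
# AR(1) idealization with phi = 0.9: rho_t = 0.9**t, and
# ESS = (number of saved draws) * (1 - rho) / (1 + rho).
phi = 0.9

full = 10_000 * (1 - phi) / (1 + phi)            # save all 10,000 draws
thinned = 1_000 * (1 - phi**10) / (1 + phi**10)  # every 10th: rho = phi**10
short = 1_000 * (1 - phi) / (1 + phi)            # only 1,000 draws, unthinned

print(full, thinned, short)
```

The full run gives about 526 effective draws, the thinned run about 483, and the short run only about 53: thinning beats stopping early, but keeping every draw beats thinning, and the gap between the last two is indeed not large.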
References
Footnotes
Using vectors simplifies high level exposition at the expense of collapsing structure.↩︎
The structure is assumed to be rectangular; in the future, this needs to be generalized to ragged samples.↩︎
The `lp__` value also represents the potential energy in the Hamiltonian system and is rate bounded by the randomly supplied kinetic energy each iteration, which follows a chi-square distribution in the number of parameters.↩︎