Posterior Analysis

Stan uses Markov chain Monte Carlo (MCMC) techniques to generate draws from the posterior distribution for full Bayesian inference. Markov chain Monte Carlo (MCMC) methods were developed for situations in which it is not straightforward to make independent draws Metropolis et al. (1953).

Stan’s variational inference algorithm provides draws from the variational approximation to the posterior which may be analyzed just as any other MCMC output, despite the fact that it is not actually a Markov chain.

Stan’s Laplace algorithm produces a sample from a normal approximation centered at the mode of a distribution in the unconstrained space. If the mode is a maximum a posteriori (MAP) estimate, the sample provides an estimate of the mean and standard deviation of the posterior distribution. If the mode is a maximum likelihood estimate (MLE), the sample provides an estimate of the standard error of the likelihood.

Markov chains

A Markov chain is a sequence of random variables \(\theta^{(1)}, \theta^{(2)},\ldots\) where each variable is conditionally independent of all other variables given the value of the previous value. Thus if \(\theta = \theta^{(1)}, \theta^{(2)},\ldots, \theta^{(N)}\), then

\[ p(\theta) = p(\theta^{(1)}) \prod_{n=2}^N p(\theta^{(n)}|\theta^{(n-1)}). \]

Stan uses Hamiltonian Monte Carlo to generate a next state in a manner described in the Hamiltonian Monte Carlo chapter.

The Markov chains Stan and other MCMC samplers generate are ergodic in the sense required by the Markov chain central limit theorem, meaning roughly that there is a reasonable chance of reaching one value of \(\theta\) from another. The Markov chains are also stationary, meaning that the transition probabilities do not change at different positions in the chain, so that for \(n, n' \geq 0\), the probability function \(p(\theta^{(n+1)}|\theta^{(n)})\) is the same as \(p(\theta^{(n'+1)}|\theta^{(n')})\) (following the convention of overloading random and bound variables and picking out a probability function by its arguments).

Stationary Markov chains have an equilibrium distribution on states in which each has the same marginal probability function, so that \(p(\theta^{(n)})\) is the same probability function as \(p(\theta^{(n+1)})\). In Stan, this equilibrium distribution \(p(\theta^{(n)})\) is the target density \(p(\theta)\) defined by a Stan program, which is typically a proper Bayesian posterior density \(p(\theta | y)\) defined on the log scale up to a constant.

Using MCMC methods introduces two difficulties that are not faced by independent sample Monte Carlo methods. The first problem is determining when a randomly initialized Markov chain has converged to its equilibrium distribution. The second problem is that the draws from a Markov chain may be correlated or even anti-correlated, and thus the central limit theorem’s bound on estimation error no longer applies. These problems are addressed in the next two sections.

Stan’s posterior analysis tools compute a number of summary statistics, estimates, and diagnostics for Markov chain Monte Carlo (MCMC) sample. Stan’s estimators and diagnostics are more robust in the face of non-convergence, antithetical sampling, and long-term Markov chain correlations than most of the other tools available. The algorithms Stan uses to achieve this are described in this chapter.

Convergence

By definition, a Markov chain samples from the target distribution only after it has converged to equilibrium (i.e., equilibrium is defined as being achieved when \(p(\theta^{(n)})\) is the target density). The following point cannot be expressed strongly enough:

In theory, convergence is only guaranteed asymptotically as the number of draws grows without bound.
In practice, diagnostics must be applied to monitor convergence for the finite number of draws actually available.

Notation for samples, chains, and draws

To establish basic notation, suppose a target Bayesian posterior density \(p(\theta | y)\) given real-valued vectors of parameters \(\theta\) and real- and discrete-valued data \(y\).¹

An MCMC sample consists of a set of a sequence of \(M\) Markov chains, each consisting of an ordered sequence of \(N\) draws from the posterior.² The sample thus consists of \(M \times N\) draws from the posterior.

Potential scale reduction

One way to monitor whether a chain has converged to the equilibrium distribution is to compare its behavior to other randomly initialized chains. This is the motivation for the Gelman and Rubin (1992) potential scale reduction statistic, \(\hat{R}\). The \(\hat{R}\) statistic measures the ratio of the average variance of drawss within each chain to the variance of the pooled draws across chains; if all chains are at equilibrium, these will be the same and \(\hat{R}\) will be one. If the chains have not converged to a common distribution, the \(\hat{R}\) statistic will be greater than one.

Gelman and Rubin’s recommendation is that the independent Markov chains be initialized with diffuse starting values for the parameters and sampled until all values for \(\hat{R}\) are below some threshold. Vehtari et al. (2021) suggest in general to use a threshold \(1.01\), but othe thresholds can be used depending on the use case. Stan allows users to specify initial values for parameters and it is also able to draw diffuse random initializations automatically satisfying the declared parameter constraints.

The \(\hat{R}\) statistic is defined for a set of \(M\) Markov chains, \(\theta_m\), each of which has \(N\) draws \(\theta^{(n)}_m\). The between-chain variance estimate is

\[ B = \frac{N}{M-1} \, \sum_{m=1}^M (\bar{\theta}^{(\bullet)}_{m} - \bar{\theta}^{(\bullet)}_{\bullet})^2, \]

where

\[ \bar{\theta}_m^{(\bullet)} = \frac{1}{N} \sum_{n = 1}^N \theta_m^{(n)} \]

and

\[ \bar{\theta}^{(\bullet)}_{\bullet} = \frac{1}{M} \, \sum_{m=1}^M \bar{\theta}_m^{(\bullet)}. \]

The within-chain variance is averaged over the chains,

\[ W = \frac{1}{M} \, \sum_{m=1}^M s_m^2, \]

where

\[ s_m^2 = \frac{1}{N-1} \, \sum_{n=1}^N (\theta^{(n)}_m - \bar{\theta}^{(\bullet)}_m)^2. \]

The variance estimator is a mixture of the within-chain and cross-chain sample variances,

\[ \widehat{\mbox{var}}^{+}\!(\theta|y) = \frac{N-1}{N}\, W \, + \, \frac{1}{N} \, B. \]

Finally, the potential scale reduction statistic is defined by

\[ \hat{R} \, = \, \sqrt{\frac{\widehat{\mbox{var}}^{+}\!(\theta|y)}{W}}. \]

Split R-hat for detecting non-stationarity

Before Stan calculating the potential-scale-reduction statistic \(\hat{R}\), each chain is split into two halves. This provides an additional means to detect non-stationarity in the individual chains. If one chain involves gradually increasing values and one involves gradually decreasing values, they have not mixed well, but they can have \(\hat{R}\) values near unity. In this case, splitting each chain into two parts leads to \(\hat{R}\) values substantially greater than 1 because the first half of each chain has not mixed with the second half.

Rank normalization helps when there are heavy tails

Split R-hat and the effective sample size (ESS) are well defined only if the marginal posteriors have finite mean and variance. Therefore, following Vehtari et al. (2021), we compute the rank normalized parameter values and then feed them into the formulas for split R-hat and ESS.

Rank normalization proceeds as follows:

First, replace each value \(\theta^{(nm)}\) by its rank \(r^{(nm)}\) within the pooled draws from all chains. Average rank for ties are used to conserve the number of unique values of discrete quantities.
Second, transform ranks to normal scores using the inverse normal transformation and a fractional offset:

\[ z_{(nm)} = \Phi^{-1} \left( \frac{r_{(nm)} - 3/8}{S - 1/4} \right) \]

To further improve sensitivity to chains having different scales,

rank normalized R-hat is computed also for the for the corresponding folded draws \(\zeta^{(mn)}\), absolute deviations from the median, \[ \label{zeta} \zeta^{(mn)} = \left|\theta^{(nm)}-{\rm median}(\theta)\right|. \] The rank normalized split-\(\widehat{R}\) measure computed on the \(\zeta^{(mn)}\) values is called -\(\widehat{R}\). This measures convergence in the tails rather than in the bulk of the distribution.

To obtain a single conservative \(\widehat{R}\) estimate, we propose to report the maximum of rank normalized split-\(\widehat{R}\) and rank normalized folded-split-\(\widehat{R}\) for each parameter.

Bulk-ESS is defined as ESS for rank normalized split chains. Tail-ESS is defined as the minimum ESS for the 5% and 95% quantiles. See Effective Sample Size section for details on how ESS is estimated.

Convergence is global

A question that often arises is whether it is acceptable to monitor convergence of only a subset of the parameters or generated quantities. The short answer is “no,” but this is elaborated further in this section.

For example, consider the value lp__, which is the log posterior density (up to a constant).³

It is thus a mistake to declare convergence in any practical sense if lp__ has not converged, because different chains are really in different parts of the space. Yet measuring convergence for lp__ is particularly tricky, as noted below.

Asymptotics and transience vs. equilibrium

Markov chain convergence is a global property in the sense that it does not depend on the choice of function of the parameters that is monitored. There is no hard cutoff between pre-convergence “transience” and post-convergence “equilibrium.” What happens is that as the number of states in the chain approaches infinity, the distribution of possible states in the chain approaches the target distribution and in that limit the expected value of the Monte Carlo estimator of any integrable function converges to the true expectation. There is nothing like warmup here, because in the limit, the effects of initial state are completely washed out.

Multivariate convergence of functions

The \(\hat{R}\) statistic considers the composition of a Markov chain and a function, and if the Markov chain has converged then each Markov chain and function composition will have converged. Multivariate functions converge when all of their margins have converged by the Cramer-Wold theorem.

The transformation from unconstrained space to constrained space is just another function, so does not effect convergence.

Different functions may have different autocorrelations, but if the Markov chain has equilibrated then all Markov chain plus function compositions should be consistent with convergence. Formally, any function that appears inconsistent is of concern and although it would be unreasonable to test every function, lp__ and other measured quantities should at least be consistent.

The obvious difference in lp__ is that it tends to vary quickly with position and is consequently susceptible to outliers.

Finite numbers of states

The question is what happens for finite numbers of states? If we can prove a strong geometric ergodicity property (which depends on the sampler and the target distribution), then one can show that there exists a finite time after which the chain forgets its initial state with a large probability. This is both the autocorrelation time and the warmup time. But even if you can show it exists and is finite (which is nigh impossible) you can’t compute an actual value analytically.

So what we do in practice is hope that the finite number of draws is large enough for the expectations to be reasonably accurate. Removing warmup iterations improves the accuracy of the expectations but there is no guarantee that removing any finite number of draws will be enough.

Why inconsistent R-hat?

Firstly, as noted above, for any finite number of draws, there will always be some residual effect of the initial state, which typically manifests as some small (or large if the autocorrelation time is huge) probability of having a large outlier. Functions robust to such outliers (say, quantiles) will appear more stable and have better \(\hat{R}\). Functions vulnerable to such outliers may show fragility.

Secondly, use of the \(\hat{R}\) statistic makes very strong assumptions. In particular, it assumes that the functions being considered are Gaussian or it only uses the first two moments and assumes some kind of independence. The point is that strong assumptions are made that do not always hold. In particular, the distribution for the log posterior density (lp__) almost never looks Gaussian, instead it features long tails that can lead to large \(\hat{R}\) even in the large \(N\) limit. Tweaks to \(\hat{R}\), such as using quantiles in place of raw values, have the flavor of making the sample of interest more Gaussian and hence the \(\hat{R}\) statistic more accurate.

Final words on convergence monitoring

“Convergence” is a global property and holds for all integrable functions at once, but employing the \(\hat{R}\) statistic requires additional assumptions and thus may not work for all functions equally well.

Note that if you just compare the expectations between chains then we can rely on the Markov chain asymptotics for Gaussian distributions and can apply the standard tests.

Effective sample size

The second technical difficulty posed by MCMC methods is that the draws will typically be autocorrelated (or anticorrelated) within a chain. This increases (or reduces) the uncertainty of the estimation of posterior quantities of interest, such as means, variances, or quantiles; see Charles J. Geyer (2011).

Stan estimates an effective sample size for each parameter, which plays the role in the Markov chain Monte Carlo central limit theorem (MCMC CLT) as the number of independent draws plays in the standard central limit theorem (CLT).

Unlike most packages, the particular calculations used by Stan follow those for split-\(\hat{R}\), which involve both cross-chain (mean) and within-chain calculations (autocorrelation); see Gelman et al. (2013) and Vehtari et al. (2021).

Definition of effective sample size

The amount by which autocorrelation within the chains increases uncertainty in estimates can be measured by effective sample size (ESS). Given independent sample (with finite variance), the central limit theorem bounds uncertainty in estimates based on the sample size \(N\). Given dependent sample, the sample size is replaced with the effective sample size \(N_{\mathrm{eff}}\).
For example, Monte Carlo standard error (MCSE) is proportional to \(1 / \sqrt{N_{\mathrm{eff}}}\) rather than \(1/\sqrt{N}\).

The effective sample size of a sequence is defined in terms of the autocorrelations within the sequence at different lags. The autocorrelation \(\rho_t\) at lag \(t \geq 0\) for a chain with joint probability function \(p(\theta)\) with mean \(\mu\) and variance \(\sigma^2\) is defined to be

\[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} (\theta^{(n)} - \mu) (\theta^{(n+t)} - \mu) \, p(\theta) \, d\theta. \]

This is the correlation between the two chains offset by \(t\) positions (i.e., a lag in time-series terminology). Because we know \(\theta^{(n)}\) and \(\theta^{(n+t)}\) have the same marginal distribution in an MCMC setting, multiplying the two difference terms and reducing yields

\[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} \theta^{(n)} \, \theta^{(n+t)} \, p(\theta) \, d\theta - \frac{\mu^2}{\sigma^2}. \]

The effective sample size of \(N\) draws generated by a process with autocorrelations \(\rho_t\) is defined by \[ N_{\mathrm{eff}} \ = \ \frac{N}{\sum_{t = -\infty}^{\infty} \rho_t} \ = \ \frac{N}{1 + 2 \sum_{t = 1}^{\infty} \rho_t}. \]

For independent draws, the effective sample size is just the number of iterations. For correlated draws, the effective sample size is usually lower than the number of iterations, but in case of anticorrelated draws, the effective sample size can be larger than the number of iterations. In this latter case, MCMC can work better than independent sampling for some estimation problems. Hamiltonian Monte Carlo, including the no-U-turn sampler used by default in Stan, can produce anticorrelated draws if the posterior is close to Gaussian with little posterior correlation.

Estimation of effective sample size

In practice, the probability function in question cannot be tractably integrated and thus the autocorrelation cannot be calculated, nor the effective sample size. Instead, these quantities must be estimated from the draws themselves. The rest of this section describes a autocorrelations and split-\(\hat{R}\) based effective sample size estimator, based on multiple chains. As before, each chain \(\theta_m\) will be assumed to be of length \(N\).

Stan carries out the autocorrelation computations for all lags simultaneously using Eigen’s fast Fourier transform (FFT) package with appropriate padding; see Charles J. Geyer (2011) for more detail on using FFT for autocorrelation calculations. The autocorrelation estimates \(\hat{\rho}_{t,m}\) at lag \(t\) from multiple chains \(m \in (1,\ldots,M)\) are combined with within-sample variance estimate \(W\) and multi-chain variance estimate \(\widehat{\mbox{var}}^{+}\) introduced in the previous section to compute the combined autocorrelation at lag \(t\) as

\[ \hat{\rho}_t = 1 - \frac{\displaystyle W - \textstyle \frac{1}{M}\sum_{m=1}^M s_m^2 \hat{\rho}_{t,m}} {\widehat{\mbox{var}}^{+}}. \]

If the chains have not converged, the variance estimator \(\widehat{\mbox{var}}^{+}\) will overestimate variance, leading to an overestimate of autocorrelation and an underestimate effective sample size.

Because of the noise in the correlation estimates \(\hat{\rho}_t\) as \(t\) increases, a typical truncated sum of \(\hat{\rho}_t\) is used. Negative autocorrelations may occur only on odd lags and by summing over pairs starting from lag 0, the paired autocorrelation is guaranteed to be positive, monotone and convex modulo estimator noise Charles J. Geyer (1992), Charles J. Geyer (2011). Stan uses Geyer’s initial monotone sequence criterion. The effective sample size estimator is defined as

\[ \hat{N}_{\mathrm{eff}} = \frac{M \cdot N}{\hat{\tau}}, \]

where

\[ \hat{\tau} = 1 + 2 \sum_{t=1}^{2m+1} \hat{\rho}_t = -1 + 2 \sum_{t'=0}^{m} \hat{P}_{t'}, \]

where \(\hat{P}_{t'}=\hat{\rho}_{2t'}+\hat{\rho}_{2t'+1}\). Initial positive sequence estimators is obtained by choosing the largest \(m\) such that \(\hat{P}_{t'}>0, \quad t' = 1,\ldots,m\). The initial monotone sequence is obtained by further reducing \(\hat{P}_{t'}\) to the minimum of the preceding ones so that the estimated sequence is monotone.

Estimation of MCMC standard error

The posterior standard deviation of a parameter \(\theta_n\) conditioned on observed data \(y\) is just the standard deviation of the posterior density \(p(\theta_n | y)\). This is estimated by the standard deviation of the combined posterior draws across chains,

\[ \hat{\sigma}_n = \mathrm{sd}(\theta^{(1)}_n, \ldots, \theta^{(m)}_n). \]

The previous section showed how to estimate \(N_{\mathrm{eff}}\) for a parameter \(\theta_n\) based on multiple chains of posterior draws.

The mean of the posterior draws of \(\theta_n\) \[ \hat{\theta}_n = \mathrm{mean}(\theta^{(1)}_n, \ldots, \theta^{(m)}_n) \]

is treated as an estimator of the true posterior mean,

\[ \mathbb{E}[\theta_n \mid y] \ = \ \int_{-\infty}^{\infty} \, \theta \, p(\theta | y) \, \mathrm{d}\theta_n, \]

based the observed data \(y\).

The standard error for the estimator \(\hat{\theta}_n\) is given by the posterior standard deviation divided by the square root of the effective sample size. This standard error is itself estimated as \(\hat{\sigma}_n / \sqrt{N_{\mathrm{eff}}}\). The smaller the standard error, the closer the estimate \(\hat{\theta}_n\) is expected to be to the true value. This is just the MCMC CLT applied to an estimator; see Charles J. Geyer (2011) for more details of the MCMC central limit theorem.

Thinning

In complex posteriors, draws are almost always positively correlated. In these situations, the autocorrelation at lag \(t\), \(\rho_t\), decreases as the lag, \(t\), increases. In this situation, thinning the sample by keeping only every \(N\)-th draw will reduce the autocorrelation of the resulting chain. This is particularly useful if we need to save storage or re-use the draws for inference.

For instance, consider generating one thousand posterior draws in one of the following two ways.

Generate 1000 draws after convergence and save all of them.
Generate 10,000 draws after convergence and save every tenth draw.

Even though both produce a sample consisting one thousand draws, the second approach with thinning can produce a higher effective sample size when the draws are positively correlated. That’s because the autocorrelation \(\rho_t\) for the thinned sequence is equivalent to \(\rho_{10t}\) in the unthinned sequence, so the sum of the autocorrelations usually will be lower and thus the effective sample size higher.

Now contrast the second approach above with the unthinned alternative,

Generate 10,000 draws after convergence and save every draw.

This will typically have a higher effective sample than the thinned sample consisting of every tenth drawn. But the gap might not be very large. To summarize, the only reason to thin a sample is to reduce memory requirements.

If draws are anticorrelated, then thinning will increase correlation and further reduce the overall effective sample size.

References

Gelman, Andrew, J. B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third Edition. London: Chapman & Hall / CRC Press.

Gelman, Andrew, and Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72.

Geyer, Charles J. 1992. “Practical Markov Chain Monte Carlo.” Statistical Science, 473–83.

Geyer, Charles J. 2011. “Introduction to Markov Chain Monte Carlo.” In Handbook of Markov Chain Monte Carlo, edited by Steve Brooks, Andrew Gelman, Galin L. Jones, and Xiao-Li Meng, 3–48. Chapman; Hall/CRC.

Metropolis, N., A. Rosenbluth, M. Rosenbluth, M. Teller, and E. Teller. 1953. “Equations of State Calculations by Fast Computing Machines.” Journal of Chemical Physics 21: 1087–92.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2021. “Rank-Normalization, Folding, and Localization: An Improved \(\widehat{R}\) for Assessing Convergence of MCMC.” Bayesian Analysis 16: 667–718.

Footnotes

Using vectors simplifies high level exposition at the expense of collapsing structure.↩︎
The structure is assumed to be rectangular; in the future, this needs to be generalized to ragged samples.↩︎
The lp__ value also represents the potential energy in the Hamiltonian system and is rate bounded by the randomly supplied kinetic energy each iteration, which follows a Chi-square distribution in the number of parameters.↩︎