16.3 Notation for samples, chains, and draws

This is an old version, view current version.

16.3 Notation for samples, chains, and draws

To establish basic notation, suppose a target Bayesian posterior density \(p(\theta | y)\) given real-valued vectors of parameters \(\theta\) and real- and discrete-valued data \(y\).²⁵

An MCMC sample consists of a set of a sequence of \(M\) Markov chains, each consisting of an ordered sequence of \(N\) draws from the posterior.²⁶ The sample thus consists of \(M \times N\) draws from the posterior.

16.3.1 Potential Scale Reduction

One way to monitor whether a chain has converged to the equilibrium distribution is to compare its behavior to other randomly initialized chains. This is the motivation for the Gelman and Rubin (1992) potential scale reduction statistic, \(\hat{R}\). The \(\hat{R}\) statistic measures the ratio of the average variance of samples within each chain to the variance of the pooled samples across chains; if all chains are at equilibrium, these will be the same and \(\hat{R}\) will be one. If the chains have not converged to a common distribution, the \(\hat{R}\) statistic will be greater than one.

Gelman and Rubin’s recommendation is that the independent Markov chains be initialized with diffuse starting values for the parameters and sampled until all values for \(\hat{R}\) are below 1.1. Stan allows users to specify initial values for parameters and it is also able to draw diffuse random initializations automatically satisfying the declared parameter constraints.

The \(\hat{R}\) statistic is defined for a set of \(M\) Markov chains, \(\theta_m\), each of which has \(N\) samples \(\theta^{(n)}_m\). The between-chain variance estimate is

\[ B = \frac{N}{M-1} \, \sum_{m=1}^M (\bar{\theta}^{(\bullet)}_{m} - \bar{\theta}^{(\bullet)}_{\bullet})^2, \]

where

\[ \bar{\theta}_m^{(\bullet)} = \frac{1}{N} \sum_{n = 1}^N \theta_m^{(n)} \]

and

\[ \bar{\theta}^{(\bullet)}_{\bullet} = \frac{1}{M} \, \sum_{m=1}^M \bar{\theta}_m^{(\bullet)}. \]

The within-chain variance is averaged over the chains,

\[ W = \frac{1}{M} \, \sum_{m=1}^M s_m^2, \]

where

\[ s_m^2 = \frac{1}{N-1} \, \sum_{n=1}^N (\theta^{(n)}_m - \bar{\theta}^{(\bullet)}_m)^2. \]

The variance estimator is a mixture of the within-chain and cross-chain sample variances,

\[ \widehat{\mbox{var}}^{+}\!(\theta|y) = \frac{N-1}{N}\, W \, + \, \frac{1}{N} \, B. \]

Finally, the potential scale reduction statistic is defined by

\[ \hat{R} \, = \, \sqrt{\frac{\widehat{\mbox{var}}^{+}\!(\theta|y)}{W}}. \]

16.3.2 Split R-hat for Detecting Non-Stationarity

Before Stan calculating the potential-scale-reduction statistic \(\hat{R}\), each chain is split into two halves. This provides an additional means to detect non-stationarity in the individual chains. If one chain involves gradually increasing values and one involves gradually decreasing values, they have not mixed well, but they can have \(\hat{R}\) values near unity. In this case, splitting each chain into two parts leads to \(\hat{R}\) values substantially greater than 1 because the first half of each chain has not mixed with the second half.

16.3.3 Convergence is Global

A question that often arises is whether it is acceptable to monitor convergence of only a subset of the parameters or generated quantities. The short answer is “no,” but this is elaborated further in this section.

For example, consider the value lp__, which is the log posterior density (up to a constant).²⁷

It is thus a mistake to declare convergence in any practical sense if lp__ has not converged, because different chains are really in different parts of the space. Yet measuring convergence for lp__ is particularly tricky, as noted below.

16.3.3.1 Asymptotics and transience vs. equilibrium

Markov chain convergence is a global property in the sense that it does not depend on the choice of function of the parameters that is monitored. There is no hard cutoff between pre-convergence “transience” and post-convergence “equilibrium.” What happens is that as the number of states in the chain approaches infinity, the distribution of possible states in the chain approaches the target distribution and in that limit the expected value of the Monte Carlo estimator of any integrable function converges to the true expectation. There is nothing like warmup here, because in the limit, the effects of initial state are completely washed out.

16.3.3.2 Multivariate convergence of functions

The \(\hat{R}\) statistic considers the composition of a Markov chain and a function, and if the Markov chain has converged then each Markov chain and function composition will have converged. Multivariate functions converge when all of their margins have converged by the Cramer-Wold theorem.

The transformation from unconstrained space to constrained space is just another function, so does not effect convergence.

Different functions may have different autocorrelations, but if the Markov chain has equilibrated then all Markov chain plus function compositions should be consistent with convergence. Formally, any function that appears inconsistent is of concern and although it would be unreasonable to test every function, lp__ and other measured quantities should at least be consistent.

The obvious difference in lp__ is that it tends to vary quickly with position and is consequently susceptible to outliers.

16.3.3.3 Finite numbers of states

The question is what happens for finite numbers of states? If we can prove a strong geometric ergodicity property (which depends on the sampler and the target distribution), then one can show that there exists a finite time after which the chain forgets its initial state with a large probability. This is both the autocorrelation time and the warmup time. But even if you can show it exists and is finite (which is nigh impossible) you can’t compute an actual value analytically.

So what we do in practice is hope that the finite number of draws is large enough for the expectations to be reasonably accurate. Removing warmup iterations improves the accuracy of the expectations but there is no guarantee that removing any finite number of samples will be enough.

16.3.3.4 Why inconsistent R-hat?

Firstly, as noted above, for any finite number of draws, there will always be some residual effect of the initial state, which typically manifests as some small (or large if the autocorrelation time is huge) probability of having a large outlier. Functions robust to such outliers (say, quantiles) will appear more stable and have better \(\hat{R}\). Functions vulnerable to such outliers may show fragility.

Secondly, use of the \(\hat{R}\) statistic makes very strong assumptions. In particular, it assumes that the functions being considered are Gaussian or it only uses the first two moments and assumes some kind of independence. The point is that strong assumptions are made that do not always hold. In particular, the distribution for the log posterior density (lp__) almost never looks Gaussian, instead it features long tails that can lead to large \(\hat{R}\) even in the large \(N\) limit. Tweaks to \(\hat{R}\), such as using quantiles in place of raw values, have the flavor of making the samples of interest more Gaussian and hence the \(\hat{R}\) statistic more accurate.

16.3.3.5 Final words on convergence monitoring

“Convergence” is a global property and holds for all integrable functions at once, but employing the \(\hat{R}\) statistic requires additional assumptions and thus may not work for all functions equally well.

Note that if you just compare the expectations between chains then we can rely on the Markov chain asymptotics for Gaussian distributions and can apply the standard tests.

References

Gelman, Andrew, and Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72.

Using vectors simplifies high level exposition at the expense of collapsing structure.↩
The structure is assumed to be rectangular; in the future, this needs to be generalized to ragged samples.↩
The lp__ value also represents the potential energy in the Hamiltonian system and is rate bounded by the randomly supplied kinetic energy each iteration, which follows a Chi-square distribution in the number of parameters.↩