16.3 Notation for samples, chains, and draws
To establish basic notation, suppose a target Bayesian posterior density \(p(\theta | y)\) given real-valued vectors of parameters \(\theta\) and real- and discrete-valued data \(y\).24
An MCMC sample consists of a set of \(M\) Markov chains, each consisting of an ordered sequence of \(N\) draws from the posterior.25 The sample thus consists of \(M \times N\) draws from the posterior.
16.3.1 Potential scale reduction
One way to monitor whether a chain has converged to the equilibrium distribution is to compare its behavior to other randomly initialized chains. This is the motivation for the Gelman and Rubin (1992) potential scale reduction statistic, \(\hat{R}\). The \(\hat{R}\) statistic measures the ratio of the variance of the pooled samples across chains to the average variance of samples within each chain (more precisely, the square root of that ratio, as defined below); if all chains are at equilibrium, these will be the same and \(\hat{R}\) will be one. If the chains have not converged to a common distribution, the \(\hat{R}\) statistic will be greater than one.
Gelman and Rubin’s recommendation is that the independent Markov chains be initialized with diffuse starting values for the parameters and sampled until all values for \(\hat{R}\) are below 1.1. Stan allows users to specify initial values for parameters and it is also able to draw diffuse random initializations automatically satisfying the declared parameter constraints.
The \(\hat{R}\) statistic is defined for a set of \(M\) Markov chains, \(\theta_m\), each of which has \(N\) samples \(\theta^{(n)}_m\). The between-chain variance estimate is
\[ B = \frac{N}{M-1} \, \sum_{m=1}^M (\bar{\theta}^{(\bullet)}_{m} - \bar{\theta}^{(\bullet)}_{\bullet})^2, \]
where
\[ \bar{\theta}_m^{(\bullet)} = \frac{1}{N} \sum_{n = 1}^N \theta_m^{(n)} \]
and
\[ \bar{\theta}^{(\bullet)}_{\bullet} = \frac{1}{M} \, \sum_{m=1}^M \bar{\theta}_m^{(\bullet)}. \]
The within-chain variance is averaged over the chains,
\[ W = \frac{1}{M} \, \sum_{m=1}^M s_m^2, \]
where
\[ s_m^2 = \frac{1}{N-1} \, \sum_{n=1}^N (\theta^{(n)}_m - \bar{\theta}^{(\bullet)}_m)^2. \]
The variance estimator is a mixture of the within-chain and cross-chain sample variances,
\[ \widehat{\mbox{var}}^{+}\!(\theta|y) = \frac{N-1}{N}\, W \, + \, \frac{1}{N} \, B. \]
Finally, the potential scale reduction statistic is defined by
\[ \hat{R} \, = \, \sqrt{\frac{\widehat{\mbox{var}}^{+}\!(\theta|y)}{W}}. \]
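The definitions above translate almost line for line into array operations. The following is a minimal sketch in Python with NumPy, not Stan's own implementation; the function name `potential_scale_reduction` and the simulated chains `mixed` and `stuck` are hypothetical names introduced for illustration only.

```python
import numpy as np


def potential_scale_reduction(draws):
    """R-hat for one scalar quantity.

    `draws` has shape (M, N): M chains with N draws each (an assumed layout).
    """
    draws = np.asarray(draws, dtype=float)
    M, N = draws.shape
    chain_means = draws.mean(axis=1)          # per-chain means
    # Between-chain variance: N / (M - 1) * sum of squared deviations of chain means.
    B = N * chain_means.var(ddof=1)
    # Within-chain variance: average of the per-chain sample variances s_m^2.
    W = draws.var(axis=1, ddof=1).mean()
    var_plus = (N - 1) / N * W + B / N        # mixture variance estimator
    return np.sqrt(var_plus / W)


# Simulated stand-ins for real sampler output (assumption: independent normal draws).
rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))                        # four chains, same distribution
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])    # two chains shifted away

print(potential_scale_reduction(mixed))
print(potential_scale_reduction(stuck))
```

In this sketch the chains that share a distribution should give a value close to one, while the chains centered in different places should give a value well above one, because the between-chain term \(B\) inflates the mixture variance relative to \(W\).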
16.3.2 Split R-hat for detecting non-stationarity
Before Stan calculates the potential scale reduction statistic \(\hat{R}\), each chain is split into two halves. This provides an additional means to detect non-stationarity in the individual chains. If one chain involves gradually increasing values and one involves gradually decreasing values, they have not mixed well, but they can have \(\hat{R}\) values near unity. In this case, splitting each chain into two parts leads to \(\hat{R}\) values substantially greater than 1 because the first half of each chain has not mixed with the second half.
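The rising-and-falling scenario can be reproduced with simulated data. The sketch below is an illustration under stated assumptions rather than Stan's code: `rhat` and `split_rhat` are hypothetical helper names, and the two trending chains are synthetic.

```python
import numpy as np


def rhat(draws):
    """Potential scale reduction for draws of shape (M, N)."""
    draws = np.asarray(draws, dtype=float)
    N = draws.shape[1]
    B = N * draws.mean(axis=1).var(ddof=1)    # between-chain variance
    W = draws.var(axis=1, ddof=1).mean()      # within-chain variance
    return np.sqrt(((N - 1) / N * W + B / N) / W)


def split_rhat(draws):
    """Split each chain into two halves and treat the halves as separate chains."""
    draws = np.asarray(draws, dtype=float)
    half = draws.shape[1] // 2
    # If N is odd, the middle draw is dropped so both halves have equal length.
    halves = np.concatenate([draws[:, :half], draws[:, -half:]], axis=0)
    return rhat(halves)


# Synthetic chains: one drifts upward, the other drifts downward.
rng = np.random.default_rng(1)
N = 1000
trend = np.linspace(-2.0, 2.0, N)
chains = np.stack([trend, -trend]) + rng.normal(scale=0.5, size=(2, N))

print(rhat(chains))        # whole-chain means nearly coincide
print(split_rhat(chains))  # half-chain means differ sharply
```

Because the two whole-chain means nearly coincide, the unsplit statistic stays close to one here, whereas the four half-chains have clearly different means and the split statistic is much larger.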
16.3.3 Convergence is global
A question that often arises is whether it is acceptable to monitor convergence of only a subset of the parameters or generated quantities. The short answer is “no,” but this is elaborated further in this section.
For example, consider the value lp__, which is the log posterior density (up to a constant).26 It is thus a mistake to declare convergence in any practical sense if lp__ has not converged, because different chains are really in different parts of the space. Yet measuring convergence for lp__ is particularly tricky, as noted below.
16.3.3.1 Asymptotics and transience vs. equilibrium
Markov chain convergence is a global property in the sense that it does not depend on the choice of function of the parameters that is monitored. There is no hard cutoff between pre-convergence “transience” and post-convergence “equilibrium.” What happens is that as the number of states in the chain approaches infinity, the distribution of possible states in the chain approaches the target distribution and in that limit the expected value of the Monte Carlo estimator of any integrable function converges to the true expectation. There is nothing like warmup here, because in the limit, the effects of initial state are completely washed out.
16.3.3.2 Multivariate convergence of functions
The \(\hat{R}\) statistic considers the composition of a Markov chain and a function, and if the Markov chain has converged then each Markov chain and function composition will have converged. By the Cramér-Wold theorem, multivariate functions converge when all of their margins have converged.
The transformation from unconstrained space to constrained space is just another function, so it does not affect convergence.
Different functions may have different autocorrelations, but if the Markov chain has equilibrated then all Markov chain plus function compositions should be consistent with convergence. Formally, any function that appears inconsistent is of concern and although it would be unreasonable to test every function, lp__ and other measured quantities should at least be consistent.
The obvious difference in lp__ is that it tends to vary quickly with position and is consequently susceptible to outliers.
16.3.3.3 Finite numbers of states
The question is what happens for finite numbers of states. If we can prove a strong geometric ergodicity property (which depends on the sampler and the target distribution), then we can show that there exists a finite time after which the chain forgets its initial state with large probability. This is both the autocorrelation time and the warmup time. But even if we can show that it exists and is finite (which is nigh impossible), we cannot compute an actual value analytically.
So what we do in practice is hope that the finite number of draws is large enough for the expectations to be reasonably accurate. Removing warmup iterations improves the accuracy of the expectations but there is no guarantee that removing any finite number of samples will be enough.
16.3.3.4 Why inconsistent R-hat?
Firstly, as noted above, for any finite number of draws, there will always be some residual effect of the initial state, which typically manifests as some small (or large if the autocorrelation time is huge) probability of having a large outlier. Functions robust to such outliers (say, quantiles) will appear more stable and have better \(\hat{R}\). Functions vulnerable to such outliers may show fragility.
Secondly, use of the \(\hat{R}\) statistic makes very strong assumptions. In particular, it assumes that the functions being considered are Gaussian or it only uses the first two moments and assumes some kind of independence. The point is that strong assumptions are made that do not always hold. In particular, the distribution for the log posterior density (lp__) almost never looks Gaussian; instead it features long tails that can lead to large \(\hat{R}\) even in the large \(N\) limit. Tweaks to \(\hat{R}\), such as using quantiles in place of raw values, have the flavor of making the samples of interest more Gaussian and hence the \(\hat{R}\) statistic more accurate.
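One way to read the quantile tweak is to replace each draw by a normal score computed from its rank among the pooled draws, and then compute \(\hat{R}\) on those scores. The sketch below illustrates that idea under explicit assumptions: the helper names `rhat` and `rank_normalize`, the offset `(rank - 0.5) / S`, and the Cauchy test data are choices made here for illustration and are not taken from the text above or from Stan.

```python
import numpy as np
from statistics import NormalDist


def rhat(draws):
    """Potential scale reduction for draws of shape (M, N)."""
    draws = np.asarray(draws, dtype=float)
    N = draws.shape[1]
    B = N * draws.mean(axis=1).var(ddof=1)    # between-chain variance
    W = draws.var(axis=1, ddof=1).mean()      # within-chain variance
    return np.sqrt(((N - 1) / N * W + B / N) / W)


def rank_normalize(draws):
    """Map pooled ranks to standard normal quantiles (ties broken arbitrarily)."""
    draws = np.asarray(draws, dtype=float)
    flat = draws.ravel()
    ranks = np.empty(flat.size)
    ranks[np.argsort(flat)] = np.arange(1, flat.size + 1)   # ranks 1..S
    probs = (ranks - 0.5) / flat.size                        # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf
    scores = np.array([inv_cdf(p) for p in probs])
    return scores.reshape(draws.shape)


# Heavy-tailed synthetic draws standing in for a quantity such as lp__ (assumption).
rng = np.random.default_rng(2)
heavy = rng.standard_cauchy(size=(4, 1000))
shifted = heavy + np.array([[0.0], [0.0], [0.0], [5.0]])   # one chain displaced

print(rhat(heavy))                      # raw values: can be erratic with long tails
print(rhat(rank_normalize(heavy)))      # normal scores of chains sharing a distribution
print(rhat(rank_normalize(shifted)))    # displaced chain should still be flagged
```

The transformed draws are approximately Gaussian by construction, so a single extreme value can no longer dominate the variance estimates, while a chain that sits in a different region still receives systematically different ranks.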
16.3.3.5 Final words on convergence monitoring
“Convergence” is a global property and holds for all integrable functions at once, but employing the \(\hat{R}\) statistic requires additional assumptions and thus may not work for all functions equally well.
Note that if we just compare the expectations between chains, then we can rely on the Markov chain asymptotics for Gaussian distributions and can apply the standard tests.
References
24. Using vectors simplifies high level exposition at the expense of collapsing structure.
25. The structure is assumed to be rectangular; in the future, this needs to be generalized to ragged samples.
26. The lp__ value also represents the potential energy in the Hamiltonian system and is rate bounded by the randomly supplied kinetic energy each iteration, which follows a Chi-square distribution in the number of parameters.