Stan Reference Manual

This is an old version, view current version.

14.1 Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) method that uses the derivatives of the density function being sampled to generate efficient transitions spanning the posterior (see, e.g., Betancourt and Girolami (2013), Neal (2011) for more details). It uses an approximate Hamiltonian dynamics simulation based on numerical integration which is then corrected by performing a Metropolis acceptance step.

This section translates the presentation of HMC by Betancourt and Girolami (2013) into the notation of Gelman et al. (2013).

Target Density

The goal of sampling is to draw from a density \(p(\theta)\) for parameters \(\theta\). This is typically a Bayesian posterior \(p(\theta|y)\) given data \(y\), and in particular, a Bayesian posterior coded as a Stan program.

Auxiliary Momentum Variable

HMC introduces auxiliary momentum variables \(\rho\) and draws from a joint density

\[ p(\rho, \theta) = p(\rho | \theta) p(\theta). \]

In most applications of HMC, including Stan, the auxiliary density is a multivariate normal that does not depend on the parameters \(\theta\),

\[ \rho \sim \mathsf{MultiNormal}(0, \Sigma). \]

The covariance matrix \(\Sigma\) acts as a Euclidean metric to rotate and scale the target distribution; see Betancourt and Stein (2011) for details of the geometry.

In Stan, this matrix may be set to the identity matrix (i.e., unit diagonal) or estimated from warmup draws and optionally restricted to a diagonal matrix. The inverse \(\Sigma^{-1}\) is known as the mass matrix, and will be a unit, diagonal, or dense if \(\Sigma\) is.

The Hamiltonian

The joint density \(p(\rho, \theta)\) defines a Hamiltonian

\[ \begin{array}{rcl} H(\rho, \theta) & = & - \log p(\rho, \theta) \\[3pt] & = & - \log p(\rho | \theta) - \log p(\theta). \\[3pt] & = & T(\rho | \theta) + V(\theta), \end{array} \]

where the term

\[ T(\rho | \theta) = - \log p(\rho | \theta) \]

is called the “kinetic energy” and the term

\[ V(\theta) = - \log p(\theta) \]

is called the “potential energy.” The potential energy is specified by the Stan program through its definition of a log density.

Generating Transitions

Starting from the current value of the parameters \(\theta\), a transition to a new state is generated in two stages before being subjected to a Metropolis accept step.

First, a value for the momentum is drawn independently of the current parameter values,

\[ \rho \sim \mathsf{MultiNormal}(0, \Sigma). \]

Thus momentum does not persist across iterations.

Next, the joint system \((\theta,\rho)\) made up of the current parameter values \(\theta\) and new momentum \(\rho\) is evolved via Hamilton’s equations,

\[ \begin{array}{rcccl} \displaystyle \frac{d\theta}{dt} & = & \displaystyle + \frac{\partial H}{\partial \rho} & = & \displaystyle + \frac{\partial T}{\partial \rho} \\[12pt] \displaystyle \frac{d\rho}{dt} & = & \displaystyle - \frac{\partial H}{\partial \theta } & = & \displaystyle - \frac{\partial T}{\partial \theta} - \frac{\partial V}{\partial \theta}. \end{array} \]

With the momentum density being independent of the target density, i.e., \(p(\rho | \theta) = p(\rho)\), the first term in the momentum time derivative, \({\partial T} / {\partial \theta}\) is zero, yielding the pair time derivatives

\[ \begin{array}{rcl} \frac{d \theta}{d t} & = & +\frac{\partial T}{\partial \rho} \\[2pt] \frac{d \rho}{d t} & = & -\frac{\partial V}{\partial \theta}. \end{array} \]

Leapfrog Integrator

The last section leaves a two-state differential equation to solve. Stan, like most other HMC implementations, uses the leapfrog integrator, which is a numerical integration algorithm that’s specifically adapted to provide stable results for Hamiltonian systems of equations.

Like most numerical integrators, the leapfrog algorithm takes discrete steps of some small time interval \(\epsilon\). The leapfrog algorithm begins by drawing a fresh momentum term independently of the parameter values \(\theta\) or previous momentum value.

\[ \rho \sim \mathsf{MultiNormal}(0, \Sigma). \] It then alternates half-step updates of the momentum and full-step updates of the position.

\[ \begin{array}{rcl} \rho & \leftarrow & \rho \, - \, \frac{\epsilon}{2} \frac{\partial V}{\partial \theta} \\[6pt] \theta & \leftarrow & \theta \, + \, \epsilon \, \Sigma \, \rho \\[6pt] \rho & \leftarrow & \rho \, - \, \frac{\epsilon}{2} \frac{\partial V}{\partial \theta}. \end{array} \]

By applying \(L\) leapfrog steps, a total of \(L \, \epsilon\) time is simulated. The resulting state at the end of the simulation (\(L\) repetitions of the above three steps) will be denoted \((\rho^{*}, \theta^{*})\).

The leapfrog integrator’s error is on the order of \(\epsilon^3\) per step and \(\epsilon^2\) globally, where \(\epsilon\) is the time interval (also known as the step size); Leimkuhler and Reich (2004) provide a detailed analysis of numerical integration for Hamiltonian systems, including a derivation of the error bound for the leapfrog integrator.

Metropolis Accept Step

If the leapfrog integrator were perfect numerically, there would no need to do any more randomization per transition than generating a random momentum vector. Instead, what is done in practice to account for numerical errors during integration is to apply a Metropolis acceptance step, where the probability of keeping the proposal \((\rho^{*}, \theta^{*})\) generated by transitioning from \((\rho, \theta)\) is

\[ \min \! \left( 1, \ \exp \! \left( H(\rho, \theta) - H(\rho^{*}, \theta^{*}) \right) \right). \]

If the proposal is not accepted, the previous parameter value is returned for the next draw and used to initialize the next iteration.

Algorithm Summary

The Hamiltonian Monte Carlo algorithm starts at a specified initial set of parameters \(\theta\); in Stan, this value is either user-specified or generated randomly. Then, for a given number of iterations, a new momentum vector is sampled and the current value of the parameter \(\theta\) is updated using the leapfrog integrator with discretization time \(\epsilon\) and number of steps \(L\) according to the Hamiltonian dynamics. Then a Metropolis acceptance step is applied, and a decision is made whether to update to the new state \((\theta^{*}, \rho^{*})\) or keep the existing state.

References

Betancourt, Michael, and Mark Girolami. 2013. “Hamiltonian Monte Carlo for Hierarchical Models.” arXiv 1312.0906. http://arxiv.org/abs/1312.0906.

Neal, Radford. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo, edited by Steve Brooks, Andrew Gelman, Galin L. Jones, and Xiao-Li Meng, 116–62. Chapman; Hall/CRC.

Gelman, Andrew, J. B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third. London: Chapman &Hall/CRC Press.

Betancourt, Michael, and Leo C. Stein. 2011. “The Geometry of Hamiltonian Monte Carlo.” arXiv 1112.4118. http://arxiv.org/abs/1112.4118.

Leimkuhler, Benedict, and Sebastian Reich. 2004. Simulating Hamiltonian Dynamics. Cambridge: Cambridge University Press.