15.1 Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) method that uses the derivatives of the density function being sampled to generate efficient transitions spanning the posterior (see, e.g., Betancourt and Girolami (2013), Neal (2011) for more details). It uses an approximate Hamiltonian dynamics simulation based on numerical integration which is then corrected by performing a Metropolis acceptance step.
This section translates the presentation of HMC by Betancourt and Girolami (2013) into the notation of Gelman et al. (2013).
Target density
The goal of sampling is to draw from a density \(p(\theta)\) for parameters \(\theta\). This is typically a Bayesian posterior \(p(\theta|y)\) given data \(y\), and in particular, a Bayesian posterior coded as a Stan program.
Auxiliary momentum variable
HMC introduces auxiliary momentum variables \(\rho\) and draws from a joint density
\[ p(\rho, \theta) = p(\rho | \theta) p(\theta). \]
In most applications of HMC, including Stan, the auxiliary density is a multivariate normal that does not depend on the parameters \(\theta\),
\[ \rho \sim \mathsf{MultiNormal}(0, M). \]
\(M\) is the Euclidean metric. It can be seen as a transform of parameter space that makes sampling more efficient; see Betancourt (2017) for details.
By default Stan sets \(M^{-1}\) equal to a diagonal estimate of the covariance computed during warmup.
The Hamiltonian
The joint density \(p(\rho, \theta)\) defines a Hamiltonian
\[ \begin{array}{rcl} H(\rho, \theta) & = & - \log p(\rho, \theta) \\[3pt] & = & - \log p(\rho | \theta) - \log p(\theta). \\[3pt] & = & T(\rho | \theta) + V(\theta), \end{array} \]
where the term
\[ T(\rho | \theta) = - \log p(\rho | \theta) \]
is called the “kinetic energy” and the term
\[ V(\theta) = - \log p(\theta) \]
is called the “potential energy.” The potential energy is specified by the Stan program through its definition of a log density.
Generating transitions
Starting from the current value of the parameters \(\theta\), a transition to a new state is generated in two stages before being subjected to a Metropolis accept step.
First, a value for the momentum is drawn independently of the current parameter values,
\[ \rho \sim \mathsf{MultiNormal}(0, M). \]
Thus momentum does not persist across iterations.
Next, the joint system \((\theta,\rho)\) made up of the current parameter values \(\theta\) and new momentum \(\rho\) is evolved via Hamilton’s equations,
\[ \begin{array}{rcccl} \displaystyle \frac{d\theta}{dt} & = & \displaystyle + \frac{\partial H}{\partial \rho} & = & \displaystyle + \frac{\partial T}{\partial \rho} \\[12pt] \displaystyle \frac{d\rho}{dt} & = & \displaystyle - \frac{\partial H}{\partial \theta } & = & \displaystyle - \frac{\partial T}{\partial \theta} - \frac{\partial V}{\partial \theta}. \end{array} \]
With the momentum density being independent of the target density, i.e., \(p(\rho | \theta) = p(\rho)\), the first term in the momentum time derivative, \({\partial T} / {\partial \theta}\) is zero, yielding the pair time derivatives
\[ \begin{array}{rcl} \frac{d \theta}{d t} & = & +\frac{\partial T}{\partial \rho} \\[2pt] \frac{d \rho}{d t} & = & -\frac{\partial V}{\partial \theta}. \end{array} \]
Leapfrog integrator
The last section leaves a two-state differential equation to solve. Stan, like most other HMC implementations, uses the leapfrog integrator, which is a numerical integration algorithm that’s specifically adapted to provide stable results for Hamiltonian systems of equations.
Like most numerical integrators, the leapfrog algorithm takes discrete steps of some small time interval \(\epsilon\). The leapfrog algorithm begins by drawing a fresh momentum term independently of the parameter values \(\theta\) or previous momentum value.
\[ \rho \sim \mathsf{MultiNormal}(0, M). \] It then alternates half-step updates of the momentum and full-step updates of the position.
\[ \begin{array}{rcl} \rho & \leftarrow & \rho \, - \, \frac{\epsilon}{2} \frac{\partial V}{\partial \theta} \\[6pt] \theta & \leftarrow & \theta \, + \, \epsilon \, M^{-1} \, \rho \\[6pt] \rho & \leftarrow & \rho \, - \, \frac{\epsilon}{2} \frac{\partial V}{\partial \theta}. \end{array} \]
By applying \(L\) leapfrog steps, a total of \(L \, \epsilon\) time is simulated. The resulting state at the end of the simulation (\(L\) repetitions of the above three steps) will be denoted \((\rho^{*}, \theta^{*})\).
The leapfrog integrator’s error is on the order of \(\epsilon^3\) per step and \(\epsilon^2\) globally, where \(\epsilon\) is the time interval (also known as the step size); Leimkuhler and Reich (2004) provide a detailed analysis of numerical integration for Hamiltonian systems, including a derivation of the error bound for the leapfrog integrator.
Metropolis accept step
If the leapfrog integrator were perfect numerically, there would no need to do any more randomization per transition than generating a random momentum vector. Instead, what is done in practice to account for numerical errors during integration is to apply a Metropolis acceptance step, where the probability of keeping the proposal \((\rho^{*}, \theta^{*})\) generated by transitioning from \((\rho, \theta)\) is
\[ \min \! \left( 1, \ \exp \! \left( H(\rho, \theta) - H(\rho^{*}, \theta^{*}) \right) \right). \]
If the proposal is not accepted, the previous parameter value is returned for the next draw and used to initialize the next iteration.
Algorithm summary
The Hamiltonian Monte Carlo algorithm starts at a specified initial set of parameters \(\theta\); in Stan, this value is either user-specified or generated randomly. Then, for a given number of iterations, a new momentum vector is sampled and the current value of the parameter \(\theta\) is updated using the leapfrog integrator with discretization time \(\epsilon\) and number of steps \(L\) according to the Hamiltonian dynamics. Then a Metropolis acceptance step is applied, and a decision is made whether to update to the new state \((\theta^{*}, \rho^{*})\) or keep the existing state.