Variational Inference
Stan implements an automatic variational inference algorithm, called Automatic Differentiation Variational Inference (ADVI) Kucukelbir et al. (2017). In this chapter, we describe the specifics of how ADVI maximizes the variational objective.
Stochastic gradient ascent
ADVI optimizes the ELBO in the real-coordinate space using stochastic gradient ascent. We obtain noisy (yet unbiased) gradients of the variational objective using automatic differentiation and Monte Carlo integration. The algorithm ascends these gradients using an adaptive stepsize sequence. We evaluate the ELBO also using Monte Carlo integration and measure convergence similar to the relative tolerance scheme in Stan’s optimization feature.
Monte Carlo approximation of the ELBO
ADVI uses Monte Carlo integration to approximate the variational objective function, the ELBO. The number of draws used to approximate the ELBO is denoted by elbo_samples
. We recommend a default value of \(100\), as we only evaluate the ELBO every eval_elbo
iterations, which also defaults to \(100\).
Monte Carlo approximation of the gradients
ADVI uses Monte Carlo integration to approximate the gradients of the ELBO. The number of draws used to approximate the gradients is denoted by grad_samples
. We recommend a default value of \(1\), as this is the most efficient. It also a very noisy estimate of the gradient, but stochastic gradient ascent is capable of following such gradients.
Adaptive stepsize sequence
ADVI uses a finite-memory version of adaGrad Duchi, Hazan, and Singer (2011). This has a single parameter that we expose, denoted eta
. We now have a warmup adaptation phase that selects a good value for eta
. The procedure does a heuristic search over eta
values that span 5 orders of magnitude.
Assessing convergence
ADVI tracks the progression of the ELBO through the stochastic optimization. Specifically, ADVI heuristically determines a rolling window over which it computes the average and the median change of the ELBO. Should either number fall below a threshold, denoted by tol_rel_obj
, we consider the algorithm to have converged. The change in ELBO is calculated the same way as in Stan’s optimization module.