17.2 BFGS and L-BFGS configuration

Convergence monitoring

Convergence monitoring in (L-)BFGS is controlled by a number of tolerance values; the algorithm terminates with a solution as soon as any one of the corresponding tests is satisfied. Any of the convergence tests can be disabled by setting its corresponding tolerance parameter to zero. The tests for convergence are as follows.

Parameter convergence

The parameters $\theta_i$ in iteration $i$ are considered to have converged with respect to tolerance tol_param if

$|| \theta_{i} - \theta_{i-1} || < \mathtt{tol\_param}.$

Density convergence

The (unnormalized) log density $\log p(\theta_{i}|y)$ for the parameters $\theta_i$ in iteration $i$ given data $y$ is considered to have converged with respect to tolerance tol_obj if

$\left| \log p(\theta_{i}|y) - \log p(\theta_{i-1}|y) \right| < \mathtt{tol\_obj}.$

The log density is considered to have converged to within relative tolerance tol_rel_obj if

$\frac{\left| \log p(\theta_{i}|y) - \log p(\theta_{i-1}|y) \right|}{\ \max\left(\left| \log p(\theta_{i}|y)\right|,\left| \log p(\theta_{i-1}|y)\right|,1.0\right)} < \mathtt{tol\_rel\_obj} * \epsilon.$

Gradient convergence

The gradient is considered to have converged to 0 relative to a specified tolerance tol_grad if

$|| g_{i} || < \mathtt{tol\_grad},$

where $\nabla_{\theta}$ is the gradient operator with respect to $\theta$ and $g_{i} = \nabla_{\theta} \log p(\theta | y)$ is the gradient evaluated at $\theta_{i}$, the parameter value at iteration $i$.

The gradient is considered to have converged to 0 relative to a specified relative tolerance tol_rel_grad if

$\frac{g_{i}^T \hat{H}_{i}^{-1} g_{i} }{ \max\left(\left|\log p(\theta_{i}|y)\right|,1.0\right) } \ < \ \mathtt{tol\_rel\_grad} * \epsilon,$

where $\hat{H}_{i}$ is the estimate of the Hessian at iteration $i$, $|u|$ is the absolute value (L1 norm) of $u$, $||u||$ is the vector length (L2 norm) of $u$, and $\epsilon \approx 2 \times 10^{-16}$ is machine precision.
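The five tests above can be sketched in a few lines of Python. This is only an illustration of the formulas, not Stan's implementation: the function name `convergence_tests`, its argument names, and the tolerance values in the usage lines are all assumptions made for the example. The caller is assumed to supply `hinv_grad`, the product $\hat{H}_{i}^{-1} g_{i}$. Each test is explicitly skipped when its tolerance is zero, mirroring the rule that a zero tolerance disables the test.

```python
import math
import sys

# Double-precision machine epsilon; the text rounds this to about 2e-16.
EPS = sys.float_info.epsilon


def l2_norm(v):
    """Euclidean length ||v|| of a sequence of floats."""
    return math.sqrt(sum(x * x for x in v))


def convergence_tests(theta, theta_prev, lp, lp_prev, grad, hinv_grad,
                      tol_param, tol_obj, tol_rel_obj, tol_grad, tol_rel_grad):
    """Return a dict mapping each tolerance name to whether its test passed.

    hinv_grad is assumed to be the inverse Hessian estimate applied to
    the gradient; lp and lp_prev are the current and previous log densities.
    """
    step = [a - b for a, b in zip(theta, theta_prev)]
    abs_change = abs(lp - lp_prev)
    rel_change = abs_change / max(abs(lp), abs(lp_prev), 1.0)
    # Quadratic form g^T H^{-1} g, scaled by the size of the log density.
    quad = sum(g * h for g, h in zip(grad, hinv_grad))
    rel_grad = quad / max(abs(lp), 1.0)
    # A tolerance of zero disables the corresponding test.
    return {
        "tol_param": tol_param > 0 and l2_norm(step) < tol_param,
        "tol_obj": tol_obj > 0 and abs_change < tol_obj,
        "tol_rel_obj": tol_rel_obj > 0 and rel_change < tol_rel_obj * EPS,
        "tol_grad": tol_grad > 0 and l2_norm(grad) < tol_grad,
        "tol_rel_grad": tol_rel_grad > 0 and rel_grad < tol_rel_grad * EPS,
    }


# Illustrative values only: the parameters barely moved, but the log
# density and the gradient are still changing noticeably.
results = convergence_tests(
    theta=[1.0, 2.0], theta_prev=[1.0, 2.0 + 1e-9],
    lp=-10.0, lp_prev=-10.0 - 1e-6,
    grad=[1e-3, 0.0], hinv_grad=[1e-3, 0.0],
    tol_param=1e-8, tol_obj=1e-12, tol_rel_obj=1e4,
    tol_grad=1e-8, tol_rel_grad=1e7,
)
terminate = any(results.values())
```

In this usage only the parameter test passes, which is enough to terminate because any single satisfied test stops the algorithm.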

Initial step size

The initial step size parameter $\alpha$ for BFGS-style optimizers may be specified. If the first iteration takes a long time (and requires a lot of function evaluations), initialize $\alpha$ to be roughly equal to the $\alpha$ used in that first iteration. The default value is intentionally small, 0.001, which is reasonable for many problems but might be too large or too small depending on the objective function and initialization. A value that is too large or too small just means that the first iteration will take longer (i.e., require more gradient evaluations) before the line search finds a good step length. It is not a critical parameter, but when optimizing the same model multiple times (as you tweak things or with different data), tuning $\alpha$ can save some real time.
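To see why a poorly matched initial step size costs extra evaluations, consider the toy line search below. It is a deliberately simple expand-or-shrink scheme with a sufficient-decrease check, written for this illustration; it is not the line search Stan uses, and the names `line_search`, `phi`, and `DPHI0` are hypothetical. For simplicity it minimizes a one-dimensional quadratic rather than maximizing a log density, and it counts function evaluations rather than gradient evaluations.

```python
def line_search(phi, dphi0, alpha0, c1=1e-4, max_evals=100):
    """Toy expand-or-shrink search; returns (alpha, number of evaluations).

    phi(alpha) is the objective along the search direction and dphi0 is
    its slope at alpha = 0 (negative for a descent direction).
    """
    phi0 = phi(0.0)
    evals = 1

    def sufficient_decrease(alpha, value):
        return value <= phi0 + c1 * alpha * dphi0

    alpha = alpha0
    value = phi(alpha)
    evals += 1
    if sufficient_decrease(alpha, value):
        # Initial step may be too small: keep doubling while it still helps.
        while evals < max_evals:
            trial = 2.0 * alpha
            trial_value = phi(trial)
            evals += 1
            if sufficient_decrease(trial, trial_value) and trial_value < value:
                alpha, value = trial, trial_value
            else:
                break
    else:
        # Initial step is too large: halve until the decrease is sufficient.
        while not sufficient_decrease(alpha, value) and evals < max_evals:
            alpha *= 0.5
            value = phi(alpha)
            evals += 1
    return alpha, evals


def phi(alpha):
    """f(x) = 0.5 * x**2 starting at x = 10 along d = -f'(10) = -10."""
    x = 10.0 - 10.0 * alpha
    return 0.5 * x * x


DPHI0 = -100.0  # slope at alpha = 0: f'(10) * d = 10 * (-10)

for alpha0 in (0.001, 1.0, 1000.0):
    alpha, evals = line_search(phi, DPHI0, alpha0)
    print(f"alpha0={alpha0:g}: accepted alpha={alpha:g} after {evals} evaluations")
```

In this toy problem the well-matched initial value needs the fewest evaluations, while both the very small and the very large values need more before an acceptable step is found. The result is still a usable step in every case, which is why the parameter affects cost rather than correctness.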

L-BFGS history size

L-BFGS has a command-line argument which controls the size of the history it uses to approximate the Hessian. The value should be less than the dimensionality of the parameter space and, in general, relatively small values (5–10) are sufficient; the default value is 5.

If L-BFGS performs poorly but BFGS performs well, consider increasing the history size. Increasing history size will increase the memory usage, although this is unlikely to be an issue for typical Stan models.
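The role of the history can be illustrated with the standard textbook two-loop recursion, which applies an implicit inverse-Hessian estimate built from the most recent parameter and gradient differences. The sketch below is not Stan's code: the helper names `dot` and `apply_inverse_hessian` and the example vectors are assumptions for illustration. A `deque` with `maxlen` equal to the history size shows where the memory cost comes from, since only that many pairs of vectors are kept.

```python
from collections import deque


def dot(a, b):
    """Inner product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def apply_inverse_hessian(grad, history):
    """Two-loop recursion: approximate H^{-1} applied to grad.

    history holds (s, y) pairs, oldest first, where s is a parameter
    difference and y the matching gradient difference, with dot(s, y) > 0.
    """
    q = list(grad)
    alphas = []
    # First loop: newest pair to oldest.
    for s, y in reversed(history):
        a = dot(s, q) / dot(y, s)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # Initial scaling from the most recent pair (identity if no history).
    if history:
        s, y = history[-1]
        gamma = dot(s, y) / dot(y, y)
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for (s, y), a in zip(history, reversed(alphas)):
        b = dot(y, r) / dot(y, s)
        r = [ri + si * (a - b) for ri, si in zip(r, s)]
    return r


# A history size of 2 keeps only the two most recent (s, y) pairs.
history = deque(maxlen=2)
history.append(([1.0, 0.0], [2.0, 0.0]))
history.append(([0.0, 1.0], [0.0, 3.0]))
history.append(([1.0, 1.0], [2.0, 3.0]))  # the oldest pair is dropped here

direction = apply_inverse_hessian([2.0, 3.0], history)
```

Memory use grows linearly with the history size because each retained pair stores two vectors of the parameter dimension. A larger history lets the estimate reflect more past curvature information, which is consistent with the advice to increase it when L-BFGS lags behind BFGS.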