Optimization
Stan provides optimization algorithms which find modes of the density specified by a Stan program. Such modes may be used as parameter estimates or as the basis of approximations to a Bayesian posterior.
Stan provides three optimizers: a Newton optimizer and two related quasi-Newton algorithms, BFGS and L-BFGS; see Nocedal and Wright (2006) for a thorough description and analysis of all of these algorithms. The L-BFGS algorithm is the default optimizer. Newton's method is the least efficient of the three, but has the advantage of setting its own stepsize.
General configuration
All of the optimizers have the option of including the log absolute Jacobian determinant of inverse parameter transforms in the log probability computation. Without the Jacobian adjustment, optimization returns the maximum likelihood estimate (MLE), \(\operatorname{argmax}_{\theta}\, p(y \mid \theta)\), the value that maximizes the likelihood of the data given the parameters (including prior terms). Applying the Jacobian adjustment produces the maximum a posteriori (MAP) estimate, the maximum value of the posterior distribution, \(\operatorname{argmax}_{\theta}\, p(y \mid \theta)\, p(\theta)\).
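As an illustration, recent versions of the CmdStan interface expose this choice through a jacobian flag on the optimize method; the sketch below assumes that interface, with my_model and my_data.json as placeholder names, and the flag's availability depends on the interface version.

./my_model optimize jacobian=0 data file=my_data.json  # MLE (no adjustment)
./my_model optimize jacobian=1 data file=my_data.json  # MAP (Jacobian included)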
All of the optimizers are iterative and allow the maximum number of iterations to be specified; the default maximum number of iterations is 2000.
All of the optimizers are able to stream intermediate output reporting on their progress. Whether or not to save the intermediate iterations and stream progress is configurable.
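For example, with CmdStan (a sketch: iter caps the number of iterations and save_iterations controls whether intermediate iterations are written to the output file; the executable and data file names are placeholders):

./my_model optimize iter=5000 save_iterations=1 data file=my_data.json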
BFGS and L-BFGS configuration
Convergence monitoring
Convergence monitoring in (L-)BFGS is controlled by a number of tolerance values; the algorithm terminates with a solution as soon as any one of the tolerances is satisfied. Any of the convergence tests can be disabled by setting its corresponding tolerance parameter to zero. The tests for convergence are as follows.
Parameter convergence
The parameters \(\theta_i\) in iteration \(i\) are considered to have converged with respect to tolerance tol_param if

\[ \lVert \theta_i - \theta_{i-1} \rVert < \mathtt{tol\_param}. \]
Density convergence
The (unnormalized) log density \(\log p(\theta_i \mid y)\) for parameters \(\theta_i\) in iteration \(i\) given data \(y\) is considered to have converged with respect to tolerance tol_obj if

\[ \left| \log p(\theta_i \mid y) - \log p(\theta_{i-1} \mid y) \right| < \mathtt{tol\_obj}. \]
The log density is considered to have converged to within relative tolerance tol_rel_obj if

\[ \frac{\left| \log p(\theta_i \mid y) - \log p(\theta_{i-1} \mid y) \right|}{\max\!\left( \left| \log p(\theta_i \mid y) \right|,\ \left| \log p(\theta_{i-1} \mid y) \right|,\ 1.0 \right)} < \mathtt{tol\_rel\_obj} \times \epsilon. \]
Gradient convergence
The gradient is considered to have converged to 0 relative to a specified tolerance tol_grad if

\[ \lVert g_i \rVert < \mathtt{tol\_grad}, \]

where \(g_i = \nabla_{\theta} \log p(\theta_i \mid y)\) is the gradient of the log density at iteration \(i\).
The gradient is considered to have converged to 0 relative to a specified relative tolerance tol_rel_grad if

\[ \frac{g_i^{\top} \hat{H}_i^{-1} g_i}{\max\!\left( \left| \log p(\theta_i \mid y) \right|,\ 1.0 \right)} < \mathtt{tol\_rel\_grad} \times \epsilon, \]

where \(\hat{H}_i\) is the estimate of the Hessian at iteration \(i\), \(|u|\) is the absolute value of \(u\), \(\lVert u \rVert\) is the vector norm of \(u\), and \(\epsilon\) is machine precision.
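As a concrete illustration of configuring these tests, the following CmdStan sketch tightens the parameter tolerance and disables the relative-objective test by setting its tolerance to zero (argument names follow CmdStan's optimize method; the executable and data file names are placeholders):

./my_model optimize algorithm=lbfgs tol_param=1e-10 tol_rel_obj=0 data file=my_data.json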
Initial step size
The initial step size parameter \(\alpha\) for BFGS-style optimizers may be specified. If the first iteration takes a long time (and requires a lot of function evaluations), initialize \(\alpha\) to be roughly the characteristic step length for the problem. The default value is intentionally small, 0.001, which is reasonable for many problems but might be too large or too small depending on the objective function and initialization. A poor choice just means that the first iteration will take longer (i.e., require more gradient evaluations) before the line search finds a good step length; it is not a critical parameter. However, when optimizing the same model repeatedly (as you tweak things or run with different data), tuning \(\alpha\) can save some real time.
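For instance, with CmdStan (a sketch; init_alpha is CmdStan's name for the initial step size, and the file names are placeholders):

./my_model optimize algorithm=bfgs init_alpha=0.01 data file=my_data.json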
L-BFGS history size
L-BFGS has a command-line argument which controls the size of the history it uses to approximate the Hessian. The value should be less than the dimensionality of the parameter space and, in general, relatively small values (5–10) are sufficient; the default value is 5.
If L-BFGS performs poorly but BFGS performs well, consider increasing the history size. Increasing history size will increase the memory usage, although this is unlikely to be an issue for typical Stan models.
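For example, doubling the default history size from the CmdStan command line might look like this (a sketch; history_size is the CmdStan argument name, and the file names are placeholders):

./my_model optimize algorithm=lbfgs history_size=10 data file=my_data.json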
Writing models for optimization
Constrained vs. unconstrained parameters
For constrained optimization problems, for instance, with a standard deviation parameter sigma constrained so that \(\sigma > 0\), it can be much more efficient to declare the parameter sigma with no constraints. This allows the optimizer to easily get close to 0 without having to tend toward \(-\infty\) on the \(\log \sigma\) scale.
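As a sketch, the two alternative declarations look like this (only the parameters block is shown; the rest of the model is assumed):

// Constrained: Stan optimizes on the unconstrained scale log(sigma),
// so values of sigma near 0 require log(sigma) to tend toward -infinity.
parameters {
  real<lower=0> sigma;
}

// Unconstrained: the optimizer works on sigma directly and can
// approach 0 without the transform, but nothing prevents sigma < 0.
parameters {
  real sigma;
}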
With unconstrained parameterizations of parameters with constrained support, it is important to provide a custom initialization that is within the support. For example, declaring a vector
vector[M] sigma;
and using the default random initialization, which is \(\textsf{uniform}(-2, 2)\) on the unconstrained scale, means that there is only a \(2^{-M}\) chance that the initialization will be within the support, because each of the \(M\) components is positive with probability one half.
For any given optimization problem, it is probably worthwhile trying the program both ways, with and without the constraint, to see which one is more efficient.
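A minimal sketch of supplying an in-support initialization with CmdStan (the init file is JSON; all names here are placeholders):

# inits.json pins every component of sigma to a positive value:
#   {"sigma": [1, 1, 1]}
./my_model optimize init=inits.json data file=my_data.json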