10 Maximum Likelihood Estimation

The optimize method finds the mode of the posterior distribution, assuming that there is one. If the posterior is not convex, there is no guarantee Stan will be able to find the global mode as opposed to a local optimum of log probability. For optimization, the mode is calculated without the Jacobian adjustment for constrained variables, which shifts the mode due to the change of variables. Thus modes correspond to modes of the model as written.

The full set of configuration options available for the optimize method is reported at the beginning of the sampler output file as CSV comments. When the example model bernoulli.stan is run with method=optimize via the command line with all default arguments, the resulting Stan CSV file header comments show the complete set of default configuration options:

# model = bernoulli_model
# method = optimize
#   optimize
#     algorithm = lbfgs (Default)
#       lbfgs
#         init_alpha = 0.001 (Default)
#         tol_obj = 9.9999999999999998e-13 (Default)
#         tol_rel_obj = 10000 (Default)
#         tol_grad = 1e-08 (Default)
#         tol_rel_grad = 10000000 (Default)
#         tol_param = 1e-08 (Default)
#         history_size = 5 (Default)
#     iter = 2000 (Default)
#     save_iterations = 0 (Default)

10.1 Optimization algorithms

The algorithm argument specifies the optimization algorithm. This argument takes one of the following three values:

  • lbfgs A quasi-Newton optimizer. This is the default optimizer and also much faster than the other optimizers.

  • bfgs A quasi-Newton optimizer.

  • newton A Newton optimizer. This is the least efficient optimization algorithm, but has the advantage of setting its own stepsize.

See the Stan Reference Manual’s Optimization chapter for a description of these algorithms.

All of the optimizers stream per-iteration intermediate approximations to the command line console. The sub-argument save_iterations specifies whether or not to save the intermediate iterations to the output file. Allowed values are \(0\) or \(1\), corresponding to False and True respectively. The default value is \(0\), i.e., intermediate iterations are not saved to the output file.

10.2 The quasi-Newton optimizers

For both BFGS and L-BFGS optimizers, convergence monitoring is controlled by a number of tolerance values, any one of which being satisfied causes the algorithm to terminate with a solution. See the BFGS and L-BFGS configuration chapter for details on the convergence tests.

Both BFGS and L-BFGS have the following configuration arguments:

  • init_alpha - The initial step size parameter. Must be a positive real number. Default value is \(0.001\)

  • tol_obj - Convergence tolerance on changes in objective function value. Must be a positive real number. Default value is \(1^{-12}\).

  • tol_rel_obj - Convergence tolerance on relative changes in objective function value. Must be a positive real number. Default value is \(1^{4}\).

  • tol_grad - Convergence tolerance on the norm of the gradient. Must be a positive real number. Default value is \(1^{-8}\).

  • tol_rel_grad - Convergence tolerance on the relative norm of the gradient. Must be a positive real number. Default value is \(1^{7}\).

  • tol_param - Convergence tolerance on changes in parameter value. Must be a positive real number. Default value is \(1^{-8}\).

The init_alpha argument specifies the first step size to try on the initial iteration. If the first iteration takes a long time (and requires a lot of function evaluations), set init_alpha to be the roughly equal to the alpha used in that first iteration. The default value is very small, which is reasonable for many problems but might be too large or too small depending on the objective function and initialization. Being too big or too small just means that the first iteration will take longer (i.e., require more gradient evaluations) before the line search finds a good step length.

In addition to the above, the L-BFGS algorithm has argument history_size which controls the size of the history it uses to approximate the Hessian. The value should be less than the dimensionality of the parameter space and, in general, relatively small values (\(5\)-\(10\)) are sufficient; the default value is \(5\).

If L-BFGS performs poorly but BFGS performs well, consider increasing the history size. Increasing history size will increase the memory usage, although this is unlikely to be an issue for typical Stan models.

10.3 The Newton optimizer

There are no configuration parameters for the Newton optimizer. It is not recommended because of the slow Hessian calculation involving finite differences.