11 Maximum Likelihood Estimation
The optimize
method finds the mode of the posterior distribution, assuming that there is one.
If the posterior is not convex, there is no guarantee Stan will be able to find the global mode
as opposed to a local optimum of log probability.
For optimization, the mode is calculated without the Jacobian adjustment
for constrained variables; that adjustment would shift the mode due to the change of variables.
Thus modes correspond to modes of the model as written.
The full set of configuration options available for the optimize
method is
reported at the beginning of the output file as CSV comments.
When the example model bernoulli.stan
is run with method=optimize
via the command line with all default arguments,
the resulting Stan CSV file header comments show the complete set
of default configuration options:
# model = bernoulli_model
# method = optimize
# optimize
# algorithm = lbfgs (Default)
# lbfgs
# init_alpha = 0.001 (Default)
# tol_obj = 9.9999999999999998e-13 (Default)
# tol_rel_obj = 10000 (Default)
# tol_grad = 1e-08 (Default)
# tol_rel_grad = 10000000 (Default)
# tol_param = 1e-08 (Default)
# history_size = 5 (Default)
# jacobian = 0 (Default)
# iter = 2000 (Default)
# save_iterations = 0 (Default)
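For reference, the kind of invocation that produces such a file is sketched below; it assumes the example model has been compiled into an executable named bernoulli and that a data file bernoulli.data.json is available, as in the CmdStan examples directory.
./bernoulli optimize data file=bernoulli.data.json output file=output.csv
The header comments shown above then appear at the top of output.csv.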
11.1 Jacobian adjustments
The jacobian
argument specifies whether or not the call to the model’s
log probability function should include
the log absolute Jacobian determinant of inverse parameter transforms.
Without the Jacobian adjustment, optimization
returns the (regularized) maximum likelihood estimate (MLE),
\(\mathrm{argmax}_{\theta}\ p(y | \theta)\),
the value which maximizes the likelihood of the data given the parameters;
any priors in the model enter this objective as regularization (penalty) terms.
Applying the Jacobian adjustment produces the maximum a posteriori (MAP) estimate,
the value which maximizes the posterior density,
\(\mathrm{argmax}_{\theta}\ p(y | \theta)\,p(\theta)\).
By default this argument is 0 (false), i.e., the Jacobian adjustment is not included.
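For example, to compute the MAP estimate rather than the penalized MLE, the adjustment can be turned on from the command line. This is a sketch that reuses the bernoulli executable and data file assumed above.
./bernoulli optimize jacobian=1 data file=bernoulli.data.json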
11.2 Optimization algorithms
The algorithm
argument specifies the optimization algorithm.
This argument takes one of the following three values:
lbfgs - A quasi-Newton optimizer. This is the default optimizer and also much faster than the other optimizers.
bfgs - A quasi-Newton optimizer.
newton - A Newton optimizer. This is the least efficient optimization algorithm, but has the advantage of setting its own stepsize.
See the Stan Reference Manual’s Optimization chapter for a description of these algorithms.
All of the optimizers stream per-iteration intermediate approximations to the command line console.
The sub-argument save_iterations
specifies whether or not to save
the intermediate iterations to the output file.
Allowed values are \(0\) or \(1\), corresponding to False
and True
respectively.
The default value is \(0\), i.e., intermediate iterations are not saved to the output file.
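For example, the following sketch (again assuming the bernoulli executable and data file from above) selects the BFGS optimizer and saves every intermediate iteration to the output file.
./bernoulli optimize save_iterations=1 algorithm=bfgs data file=bernoulli.data.json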
11.3 The quasi-Newton optimizers
For both the BFGS and L-BFGS optimizers, convergence monitoring is controlled by a number of tolerance values; the algorithm terminates with a solution as soon as any one of them is satisfied. See the BFGS and L-BFGS configuration chapter for details on the convergence tests.
Both BFGS and L-BFGS have the following configuration arguments:
init_alpha - The initial step size parameter. Must be a positive real number. Default value is \(0.001\).
tol_obj - Convergence tolerance on changes in objective function value. Must be a positive real number. Default value is \(10^{-12}\).
tol_rel_obj - Convergence tolerance on relative changes in objective function value. Must be a positive real number. Default value is \(10^{4}\).
tol_grad - Convergence tolerance on the norm of the gradient. Must be a positive real number. Default value is \(10^{-8}\).
tol_rel_grad - Convergence tolerance on the relative norm of the gradient. Must be a positive real number. Default value is \(10^{7}\).
tol_param - Convergence tolerance on changes in parameter value. Must be a positive real number. Default value is \(10^{-8}\).
The init_alpha
argument specifies the first step size to try on the initial iteration.
If the first iteration takes a long time (and requires a lot of function evaluations),
set init_alpha
to be roughly equal to the alpha used in that first iteration.
The default value is very small, which is reasonable for many problems but might be too large
or too small depending on the objective function and initialization.
Being too big or too small just means that the first iteration will take longer
(i.e., require more gradient evaluations) before the line search finds a good step length.
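As a sketch of how these arguments are set (the values here are illustrative only, and the bernoulli executable and data file are assumed as above), the L-BFGS sub-arguments follow algorithm=lbfgs on the command line:
./bernoulli optimize algorithm=lbfgs init_alpha=0.01 tol_rel_grad=1e5 data file=bernoulli.data.json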
In addition to the above, the L-BFGS algorithm has argument history_size
which controls the size of the history it uses to approximate the Hessian.
The value should be less than the dimensionality of the parameter space and,
in general, relatively small values (\(5\)-\(10\)) are sufficient; the default value is \(5\).
If L-BFGS performs poorly but BFGS performs well, consider increasing the history size. Increasing history size will increase the memory usage, although this is unlikely to be an issue for typical Stan models.
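For example, a longer history can be requested as follows (a sketch, with the same assumptions as the earlier examples):
./bernoulli optimize algorithm=lbfgs history_size=10 data file=bernoulli.data.json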