This is an old version, view current version.

5 Optimization

The CmdStan executable can run Stan’s optimization algorithms for penalized maximum likelihood estimation which provide a deterministic method to find the posterior mode. If the posterior is not convex, there is no guarantee Stan will be able to find the global mode as opposed to a local optimum of log probability.

The executable does not need to be recompiled in order to switch from sampling to optimization, and the data input format is the same. The following is a minimal call to Stan’s optimizer using defaults for everything but the location of the data file.

> ./bernoulli optimize data

Executing this command prints both output to the console and to a csv file.

The first part of the console output reports on the configuration used. The above command uses all default configurations, therefore the optimizer used is the L-BFGS optimizer and its default initial stepsize and tolerances for monitoring convergence:

 ./bernoulli optimize data
method = optimize
    algorithm = lbfgs (Default)
        init_alpha = 0.001 (Default)
        tol_obj = 1e-12 (Default)
        tol_rel_obj = 10000 (Default)
        tol_grad = 1e-08 (Default)
        tol_rel_grad = 10000000 (Default)
        tol_param = 1e-08 (Default)
        history_size = 5 (Default)
    iter = 2000 (Default)
    save_iterations = 0 (Default)
id = 0 (Default)
  file =
init = 2 (Default)
  seed = 87122538 (Default)
  file = output.csv (Default)
  diagnostic_file =  (Default)
  refresh = 100 (Default)

The second part of the output indicates how well the algorithm fared, here converging and terminating normally. The numbers reported indicate that it took 5 iterations and 8 gradient evaluations. This is, not surprisingly, far fewer iterations than required for sampling; even fewer iterations would be used with less stringent user-specified convergence tolerances. The alpha value is for step size used. In the final state the change in parameters was roughly \(0.002\) and the length of the gradient roughly 3e-05 (\(0.00003\)).

Initial log joint probability = -6.85653
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
       5      -5.00402    0.00184936   3.35074e-05           1           1        8   
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance

The output from optimization is written into the file output.csv by default. The output follows the same pattern as the output for sampling, first dumping the entire set of parameters used as comment lines:

# stan_version_major = 2
# stan_version_minor = 23
# stan_version_patch = 0
# model = bernoulli_model
# method = optimize
#   optimize
#     algorithm = lbfgs (Default)

Following the config information, are two lines of output: the CSV headers and the recorded values:


Note that everything is a comment other than a line for the header, and a line for the values. Here, the header indicates the unnormalized log probability with lp__ and the model parameter theta. The maximum log probability is -5.0 and the posterior mode for theta is 0.20. The mode exactly matches what we would expect from the data.3

Because the prior was uniform, the result 0.20 represents the maximum likelihood estimate (MLE) for the very simple Bernoulli model. Note that no uncertainty is reported.

All of the optimizers stream per-iteration intermediate approximations to the command line console. The sub-argument save_iterations specifies whether or not to save the intermediate iterations to the output file. Allowed values are \(0\) or \(1\), corresponding to False and True respectively. The default value is \(0\), i.e., intermediate iterations are not saved to the output file. Running the optimizer with save_iterations=1 writes both the initial log joint probability and values for all iterations to the output CSV file.

Running the example model with option save_iterations=1, i.e., the command

> ./bernoulli optimize save_iterations=1 data

produces CSV file output rows:


  1. The Jacobian adjustment included for the sampler’s log probability function is not applied during optimization, because it can change the shape of the posterior and hence the solution.↩︎