Initializing the NUTS-HMC sampler

In this example, we show how to use parameter estimates returned by any of Stan’s inference algorithms as initial values for Stan’s NUTS-HMC sampler. These include:

- the Pathfinder variational method (the pathfinder method)
- automatic differentiation variational inference, ADVI (the variational method)
- the Laplace approximation (the laplace_sample method)
- penalized maximum likelihood estimation (the optimize method)
- a previous run of the NUTS-HMC sampler itself (the sample method)

By default, the NUTS-HMC sampler randomly initializes all (unconstrained) model parameters uniformly in the interval (-2, 2). If this interval is far from the typical set of the posterior, initializing the sampler with estimates from these approximation algorithms can speed up and improve adaptation.

Model and data

The Stan model and data are taken from the posteriordb package.

We use the blr model, a Bayesian standard linear regression model with noninformative priors, and its corresponding simulated dataset sblri.json, which was simulated via script sblr.R. For convenience, this example assumes the posteriordb model and data are local, in files blr.stan and sblri.json.

[1]:
import os
from cmdstanpy import CmdStanModel

stan_file = 'blr.stan' # basic linear regression
data_file = 'sblri.json' # simulated data

model = CmdStanModel(stan_file=stan_file)

print(model.code())
19:24:03 - cmdstanpy - INFO - compiling stan file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/blr.stan to exe file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/blr
19:24:11 - cmdstanpy - INFO - compiled model executable: /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/blr
data {
  int<lower=0> N;
  int<lower=0> D;
  matrix[N, D] X;
  vector[N] y;
}
parameters {
  vector[D] beta;
  real<lower=0> sigma;
}
model {
  // prior
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 10);
  // likelihood
  target += normal_lpdf(y | X * beta, sigma);
}


Demonstration with Stan’s pathfinder method

Initializing the sampler with estimates from any previous inference algorithm follows the same general usage pattern. First, we call the corresponding method on the CmdStanModel object. From the resulting fit, we call the .create_inits() method to construct a set of per-chain initializations for the model parameters. To make this concrete, we will walk through the process using the .pathfinder() method (which wraps CmdStan's Pathfinder algorithm).

Pathfinder locates normal approximations to the target density along a quasi-Newton optimization path, with local covariance estimated using the negative inverse Hessian estimates produced by the L-BFGS optimizer. Pathfinder returns draws from the Gaussian approximation with the lowest estimated Kullback-Leibler (KL) divergence to the true posterior. By default, CmdStanPy runs multi-path Pathfinder, which returns an importance-resampled set of draws over the outputs of 4 independent single-path Pathfinders. This better matches non-normal target densities and also mitigates the problem of L-BFGS getting stuck at local optima or in saddle points on plateaus.

We obtain Pathfinder estimates by calling the .pathfinder() method which returns a CmdStanPathfinder object:

[2]:
pathfinder_fit = model.pathfinder(data=data_file, seed=123)
19:24:11 - cmdstanpy - INFO - Chain [1] start processing
19:24:11 - cmdstanpy - INFO - Chain [1] done processing
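
As noted above, multi-path Pathfinder combines 4 independent single-path runs by default. The number of paths can be changed via the num_paths argument; as a minimal sketch (not run here, and assuming num_paths is available in your CmdStan version), setting num_paths=1 would give single-path Pathfinder:

# sketch only: single-path Pathfinder by setting num_paths=1
single_path_fit = model.pathfinder(data=data_file, num_paths=1, seed=123)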

Posteriordb provides reference posteriors for all models. For the blr model, conditioned on the dataset sblri.json, the reference posteriors can be found in the sblri-blr.json file.

The reference posterior means for all elements of beta and for sigma are very close to 1.0.

The experiments reported in Figure 3 of the paper Pathfinder: Parallel quasi-Newton variational inference by Zhang et al. show that Pathfinder provides a better estimate of the posterior, as measured by the 1-Wasserstein distance to the reference posterior, than 75 iterations of the Phase I warmup algorithm used by the NUTS-HMC sampler. Furthermore, Pathfinder is more computationally efficient, requiring fewer evaluations of the log density and gradient functions. Therefore, using the Pathfinder estimates to initialize the parameter values for the NUTS-HMC sampler can allow the sampler to do a better job of adapting the stepsize and metric during warmup, resulting in better performance and estimation.
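
As a quick check, we can compare the means of the Pathfinder draws to the reference value of 1.0. A minimal sketch (assuming the CmdStanPathfinder object exposes its draws via stan_variable):

import numpy as np

# per-coefficient means of the Pathfinder draws; these should be close to 1.0
print(np.mean(pathfinder_fit.stan_variable('beta'), axis=0))
print(np.mean(pathfinder_fit.stan_variable('sigma')))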

We construct the parameter inits for full MCMC sampling below. By default, .create_inits() creates inits for four chains, matching the sampler's default number of chains. You can request more or fewer by modifying the chains keyword argument (a brief sketch follows the output below).

[3]:
pathfinder_inits = pathfinder_fit.create_inits()
for chain_init in pathfinder_inits:
    print(chain_init)
{'beta': array([1.0013605 , 0.99982007, 1.0013418 , 1.0015203 , 1.0024493 ]), 'sigma': array(0.91712333)}
{'beta': array([0.99892799, 1.000094  , 1.0015255 , 0.99978958, 1.0009933 ]), 'sigma': array(0.90733134)}
{'beta': array([0.99944904, 1.0001847 , 0.99985551, 1.002342  , 1.0015932 ]), 'sigma': array(0.85289529)}
{'beta': array([1.000951 , 1.0008332, 1.0017069, 1.0014231, 1.0048865]), 'sigma': array(1.0797625)}
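
As mentioned above, the chains keyword argument controls how many per-chain inits are produced. A sketch (not run here) for a two-chain run:

# sketch only: inits for a 2-chain sampler run
two_chain_inits = pathfinder_fit.create_inits(chains=2)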

We see that the Pathfinder inits are close to the reference posterior values for the parameters. To use these inits, we pass the pathfinder_inits object to the inits keyword argument:

[4]:
mcmc_pathfinder_inits_fit = model.sample(
    data=data_file, inits=pathfinder_inits, iter_warmup=75, seed=12345
)
19:24:11 - cmdstanpy - INFO - CmdStan start processing

19:24:12 - cmdstanpy - INFO - CmdStan done processing.

19:24:12 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Consider re-running with show_console=True if the above output is unclear!
[5]:
print(mcmc_pathfinder_inits_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

The following parameters had rank-normalized split R-hat greater than 1.01:
  sigma
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizing your model with additional prior information or a more effective parameterization.

Processing complete.

Despite only running 75 warmup iterations, all posterior diagnostics from the sampler look good.

[6]:
mcmc_pathfinder_inits_fit.summary()
[6]:
Mean MCSE StdDev MAD 5% 50% 95% ESS_bulk ESS_tail ESS_bulk/s R_hat
lp__ -156.785000 0.051593 1.692230 1.501810 -159.916000 -156.485000 -154.61700 1087.49 2193.760 2036.500 1.004260
beta[1] 0.999460 0.000014 0.000958 0.000964 0.997890 0.999449 1.00104 4717.81 3266.460 8834.850 1.000180
beta[2] 1.000210 0.000017 0.001119 0.001110 0.998362 1.000190 1.00207 4433.28 2668.210 8302.020 1.000110
beta[3] 1.000470 0.000014 0.000949 0.000950 0.998891 1.000460 1.00200 4662.19 3373.000 8730.690 1.000380
beta[4] 1.001120 0.000016 0.001029 0.001048 0.999476 1.001120 1.00284 4349.06 3024.630 8144.300 0.999956
beta[5] 1.001550 0.000015 0.001049 0.001055 0.999851 1.001560 1.00324 4670.27 3083.370 8745.830 1.001010
sigma 0.956363 0.004191 0.065810 0.068009 0.848430 0.954728 1.06149 248.35 271.356 465.075 1.017340

If we were to instead use the default random parameter initializations, we would need to run more warmup iterations to produce useful samples. For example, if we only run the same 75 warmup iterations with random inits, the result fails to estimate sigma correctly:

[7]:
mcmc_random_inits_fit = model.sample(data=data_file, iter_warmup=75, seed=12345)
19:24:12 - cmdstanpy - INFO - CmdStan start processing

19:24:12 - cmdstanpy - INFO - CmdStan done processing.
19:24:12 - cmdstanpy - WARNING - Some chains may have failed to converge.
        Chain 1 had 10 divergent transitions (1.0%)
        Chain 3 had 180 divergent transitions (18.0%)
        Chain 4 had 94 divergent transitions (9.4%)
        Use the "diagnose()" method on the CmdStanMCMC object to see further information.

[8]:
mcmc_random_inits_fit.summary()
[8]:
Mean MCSE StdDev MAD 5% 50% 95% ESS_bulk ESS_tail ESS_bulk/s R_hat
lp__ -187.24500 24.598100 35.016100 32.136000 -244.788000 -178.206000 -154.91900 5.04533 2000.0 20.5095 2.36310
beta[1] 0.99954 0.000124 0.002125 0.001442 0.996106 0.999495 1.00339 321.44500 2000.0 1306.6900 1.16149
beta[2] 1.00021 0.000131 0.002556 0.001839 0.996256 1.000160 1.00461 381.00600 2000.0 1548.8100 1.25825
beta[3] 1.00040 0.000117 0.002194 0.001474 0.996540 1.000390 1.00372 306.93700 2000.0 1247.7100 1.22836
beta[4] 1.00133 0.000125 0.002322 0.001385 0.997753 1.001310 1.00577 407.67000 2000.0 1657.1900 1.13671
beta[5] 1.00144 0.000117 0.002437 0.001650 0.997061 1.001470 1.00544 480.40300 2000.0 1952.8600 1.19502
sigma 1.87069 0.733300 1.046490 0.913512 0.892552 1.587390 3.58658 5.00048 2000.0 20.3271 2.49136
[9]:
print(mcmc_random_inits_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
284 of 4000 (7.10%) transitions ended with a divergence.
These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.
Try increasing adapt delta closer to 1.
If this doesn't remove all divergences, try to reparameterize the model.

Checking E-BFMI - sampler transitions HMC potential energy.
The E-BFMI, 0.01, is below the nominal threshold of 0.30 which suggests that HMC may have trouble exploring the target distribution.
If possible, try to reparameterize the model.

Rank-normalized split effective sample size satisfactory for all parameters.

The following parameters had rank-normalized split R-hat greater than 1.01:
  beta[1], beta[2], beta[3], beta[4], beta[5], sigma
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizing your model with additional prior information or a more effective parameterization.

Processing complete.

The diagnostics clearly indicate problems with estimating sigma. In this case, it is necessary to run the model with at least 150 warmup iterations to produce a good set of estimates when starting from the default initialization.
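
For reference, such a run would look like the following sketch (not executed here; the variable name is illustrative):

# default random inits, but with a longer warmup phase
mcmc_longer_warmup_fit = model.sample(data=data_file, iter_warmup=150, seed=12345)
print(mcmc_longer_warmup_fit.diagnose())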

Other inference algorithms

We can follow the same pattern with Stan’s ADVI algorithm by first using the CmdStanModel.variational method. Because this algorithm is unstable and may fail to converge, we run it with the argument require_converged set to False. We also specify a seed, both to avoid instabilities and for reproducibility.

[10]:
vb_fit = model.variational(data=data_file, require_converged=False, seed=123)
19:24:12 - cmdstanpy - INFO - Chain [1] start processing
19:24:12 - cmdstanpy - INFO - Chain [1] done processing
19:24:12 - cmdstanpy - WARNING - The algorithm may not have converged.
Proceeding because require_converged is set to False

The ADVI algorithm provides estimates of all model parameters.
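
These estimates can be inspected directly; a minimal sketch (assuming the CmdStanVB object exposes the estimates via stan_variables):

# dictionary mapping parameter names to their ADVI estimates
print(vb_fit.stan_variables())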

The variational method returns a CmdStanVB object, which similarly can construct a set of inits with the .create_inits() method:

[11]:
vb_inits = vb_fit.create_inits()
for chain_init in vb_inits:
    print(chain_init)
{'beta': array([1.0079412 , 1.0006877 , 1.0090487 , 0.99748171, 0.98823884]), 'sigma': array(1.4914566)}
{'beta': array([1.0085629 , 0.99790276, 1.0083249 , 0.99795014, 0.98990057]), 'sigma': array(1.4706955)}
{'beta': array([1.0064491 , 0.99778578, 1.0078978 , 0.99819192, 0.98947641]), 'sigma': array(1.5340995)}
{'beta': array([1.005883  , 0.9996351 , 1.007835  , 0.99823935, 0.99011108]), 'sigma': array(1.4694405)}

These inits can be passed to the inits keyword argument of the sample method:

[12]:
mcmc_vb_inits_fit = model.sample(
    data=data_file, inits=vb_inits, iter_warmup=75, seed=12345
)
19:24:12 - cmdstanpy - INFO - CmdStan start processing

19:24:12 - cmdstanpy - INFO - CmdStan done processing.
19:24:12 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Consider re-running with show_console=True if the above output is unclear!

[13]:
print(mcmc_vb_inits_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

Rank-normalized split R-hat values satisfactory for all parameters.

Processing complete, no problems detected.

[14]:
mcmc_vb_inits_fit.summary()
[14]:
Mean MCSE StdDev MAD 5% 50% 95% ESS_bulk ESS_tail ESS_bulk/s R_hat
lp__ -156.921000 0.098333 1.921290 1.621930 -160.738000 -156.519000 -154.62800 535.233 575.839 1035.270 1.00523
beta[1] 0.999469 0.000013 0.000932 0.000934 0.997969 0.999470 1.00099 4853.540 3253.430 9387.900 1.00130
beta[2] 1.000230 0.000017 0.001165 0.001152 0.998214 1.000230 1.00215 4617.920 2995.360 8932.140 1.00213
beta[3] 1.000410 0.000013 0.000966 0.000962 0.998840 1.000430 1.00197 5346.620 3700.420 10341.600 1.00187
beta[4] 1.001180 0.000016 0.001063 0.001058 0.999418 1.001190 1.00292 4481.650 2890.660 8668.570 1.00024
beta[5] 1.001570 0.000016 0.001051 0.001040 0.999879 1.001560 1.00330 4361.920 2872.490 8436.990 1.00034
sigma 0.967508 0.005134 0.076829 0.070556 0.853894 0.962404 1.10761 252.375 172.951 488.153 1.01498

The sampler estimates match the reference posterior with no diagnostic issues.

Inits can also be constructed from the laplace method:

[15]:
laplace_inits = model.laplace_sample(data=data_file, seed=123).create_inits()
19:24:12 - cmdstanpy - INFO - Chain [1] start processing
19:24:12 - cmdstanpy - INFO - Chain [1] done processing
19:24:12 - cmdstanpy - INFO - Chain [1] start processing
19:24:12 - cmdstanpy - INFO - Chain [1] done processing
[16]:
mcmc_laplace_inits_fit = model.sample(
    data=data_file, inits=laplace_inits, iter_warmup=75, seed=12345
)
19:24:12 - cmdstanpy - INFO - CmdStan start processing

19:24:13 - cmdstanpy - INFO - CmdStan done processing.

[17]:
print(mcmc_laplace_inits_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

Rank-normalized split R-hat values satisfactory for all parameters.

Processing complete, no problems detected.

Inits can also be constructed from the optimize method. Since optimization attempts to return a posterior mode (a single point), this will initialize all chains at the same point, which is not typically ideal.

[18]:
optimized_inits = model.optimize(data=data_file, seed=123).create_inits()
19:24:13 - cmdstanpy - INFO - Chain [1] start processing
19:24:13 - cmdstanpy - INFO - Chain [1] done processing
19:24:13 - cmdstanpy - WARNING - The default behavior of CmdStanMLE.stan_variable() will change in a future release to always return a numpy.ndarray, even for scalar variables.
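
Printing these inits would show that every chain starts from the same point estimate (a quick check, not run here):

print(optimized_inits)
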
[19]:
mcmc_optimize_inits_fit = model.sample(
    data=data_file, inits=optimized_inits, iter_warmup=75, seed=12345
)
19:24:13 - cmdstanpy - INFO - CmdStan start processing

19:24:13 - cmdstanpy - INFO - CmdStan done processing.

[20]:
print(mcmc_optimize_inits_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

The following parameters had rank-normalized split R-hat greater than 1.01:
  sigma
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizing your model with additional prior information or a more effective parameterization.

Processing complete.

It is also possible to use the output of the sample() method itself to construct inits to be fed into a future sampling run:

[21]:
first_mcmc_inits = model.sample(
    data=data_file, iter_warmup=75, seed=12345
).create_inits()
19:24:13 - cmdstanpy - INFO - CmdStan start processing

19:24:13 - cmdstanpy - INFO - CmdStan done processing.
19:24:13 - cmdstanpy - WARNING - Some chains may have failed to converge.
        Chain 1 had 10 divergent transitions (1.0%)
        Chain 3 had 180 divergent transitions (18.0%)
        Chain 4 had 94 divergent transitions (9.4%)
        Use the "diagnose()" method on the CmdStanMCMC object to see further information.

[22]:
second_mcmc_fit = model.sample(data=data_file, inits=first_mcmc_inits, iter_warmup=75, seed=12345)
19:24:13 - cmdstanpy - INFO - CmdStan start processing

19:24:13 - cmdstanpy - INFO - CmdStan done processing.
19:24:13 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Consider re-running with show_console=True if the above output is unclear!

[23]:
print(second_mcmc_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

The following parameters had rank-normalized split R-hat greater than 1.01:
  sigma
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizing your model with additional prior information or a more effective parameterization.

Processing complete.

We see that despite the initial sampling issues in the first MCMC run, the inits sourced from that run result in reasonable sampling in the second run.