Initializing the NUTS-HMC sampler

In this example, we show how to use parameter estimates returned by any of Stan’s inference algorithms as initial values for Stan’s NUTS-HMC sampler. These include:

- the Pathfinder variational method (the pathfinder method)
- automatic differentiation variational inference, ADVI (the variational method)
- the Laplace approximation (the laplace_sample method)
- penalized maximum likelihood estimation (the optimize method)
- a previous run of the NUTS-HMC sampler itself (the sample method)

By default, the NUTS-HMC sampler randomly initializes all (unconstrained) model parameters uniformly in the interval (-2, 2). If this interval is far from the typical set of the posterior, initializing the sampler with estimates from these approximation algorithms can speed up and improve adaptation.

Model and data

The Stan model and data are taken from the posteriordb package.

We use the blr model, a Bayesian standard linear regression model with noninformative priors, and its corresponding simulated dataset sblri.json, which was simulated via script sblr.R. For convenience, this example assumes the posteriordb model and data are local, in files blr.stan and sblri.json.

[1]:
import os
from cmdstanpy import CmdStanModel

stan_file = 'blr.stan' # basic linear regression
data_file = 'sblri.json' # simulated data

model = CmdStanModel(stan_file=stan_file)

print(model.code())
19:24:03 - cmdstanpy - INFO - compiling stan file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/blr.stan to exe file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/blr
19:24:11 - cmdstanpy - INFO - compiled model executable: /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/blr
data {
  int<lower=0> N;
  int<lower=0> D;
  matrix[N, D] X;
  vector[N] y;
}
parameters {
  vector[D] beta;
  real<lower=0> sigma;
}
model {
  // prior
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 10);
  // likelihood
  target += normal_lpdf(y | X * beta, sigma);
}


Demonstration with Stan’s pathfinder method

Initializing the sampler with estimates from any previous inference algorithm follows the same general usage pattern. First, we call the corresponding method on the CmdStanModel object. From the resulting fit, we call the .create_inits() method to construct a set of per-chain initializations for the model parameters. To make this concrete, we will walk through the process using the .pathfinder() method (which wraps CmdStan's Pathfinder algorithm).

Pathfinder locates normal approximations to the target density along a quasi-Newton optimization path, with local covariance estimated using the negative inverse Hessian estimates produced by the L-BFGS optimizer. Pathfinder returns draws from the Gaussian approximation with the lowest estimated Kullback-Leibler (KL) divergence to the true posterior. By default, CmdStanPy runs multi-path Pathfinder, which returns an importance-resampled set of draws over the outputs of 4 independent single-path Pathfinders. This better matches non-normal target densities and also mitigates the problem of L-BFGS getting stuck at local optima or in saddle points on plateaus.

We obtain Pathfinder estimates by calling the .pathfinder() method which returns a CmdStanPathfinder object:

[2]:
pathfinder_fit = model.pathfinder(data=data_file, seed=123)
19:24:11 - cmdstanpy - INFO - Chain [1] start processing
19:24:11 - cmdstanpy - INFO - Chain [1] done processing
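
As noted above, multi-path Pathfinder combines 4 independent single-path runs by default. The number of paths can be changed via the num_paths argument; as a minimal sketch (not run here, and assuming num_paths is available in your CmdStan version), setting num_paths=1 would give single-path Pathfinder:

# sketch only: single-path Pathfinder by setting num_paths=1
single_path_fit = model.pathfinder(data=data_file, num_paths=1, seed=123)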

Posteriordb provides reference posteriors for all models. For the blr model, conditioned on the dataset sblri.json, the reference posteriors can be found in the sblri-blr.json file.

The reference posterior means for all elements of beta and for sigma are very close to 1.0.

The experiments reported in Figure 3 of the paper Pathfinder: Parallel quasi-Newton variational inference by Zhang et al. show that Pathfinder provides a better estimate of the posterior, as measured by the 1-Wasserstein distance to the reference posterior, than 75 iterations of the Phase I warmup algorithm used by the NUTS-HMC sampler. Furthermore, Pathfinder is more computationally efficient, requiring fewer evaluations of the log density and gradient functions. Therefore, using the Pathfinder estimates to initialize the parameter values for the NUTS-HMC sampler can allow the sampler to do a better job of adapting the stepsize and metric during warmup, resulting in better performance and estimation.
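
As a quick check, we can compare the means of the Pathfinder draws to the reference value of 1.0. A minimal sketch (assuming the CmdStanPathfinder object exposes its draws via stan_variable):

import numpy as np

# per-coefficient means of the Pathfinder draws; these should be close to 1.0
print(np.mean(pathfinder_fit.stan_variable('beta'), axis=0))
print(np.mean(pathfinder_fit.stan_variable('sigma')))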

We construct the parameter inits for full MCMC sampling below. By default, .create_inits() creates inits for four chains, matching the sampler's default number of chains. You can request more or fewer by modifying the chains keyword argument (a brief sketch follows the output below).

[3]:
pathfinder_inits = pathfinder_fit.create_inits()
for chain_init in pathfinder_inits:
    print(chain_init)
{'beta': array([1.0013605 , 0.99982007, 1.0013418 , 1.0015203 , 1.0024493 ]), 'sigma': array(0.91712333)}
{'beta': array([0.99892799, 1.000094  , 1.0015255 , 0.99978958, 1.0009933 ]), 'sigma': array(0.90733134)}
{'beta': array([0.99944904, 1.0001847 , 0.99985551, 1.002342  , 1.0015932 ]), 'sigma': array(0.85289529)}
{'beta': array([1.000951 , 1.0008332, 1.0017069, 1.0014231, 1.0048865]), 'sigma': array(1.0797625)}
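
As mentioned above, the chains keyword argument controls how many per-chain inits are produced. A sketch (not run here) for a two-chain run:

# sketch only: inits for a 2-chain sampler run
two_chain_inits = pathfinder_fit.create_inits(chains=2)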

We see that the Pathfinder inits are close to the reference posterior values for the parameters. To use these inits, we pass the pathfinder_inits object to the inits keyword argument:

[4]:
mcmc_pathfinder_inits_fit = model.sample(
    data=data_file, inits=pathfinder_inits, iter_warmup=75, seed=12345
)
19:24:11 - cmdstanpy - INFO - CmdStan start processing

19:24:12 - cmdstanpy - INFO - CmdStan done processing.

19:24:12 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Consider re-running with show_console=True if the above output is unclear!
[5]:
print(mcmc_pathfinder_inits_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

The following parameters had rank-normalized split R-hat greater than 1.01:
  sigma
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizing your model with additional prior information or a more effective parameterization.

Processing complete.

Despite only running 75 warmup iterations, all posterior diagnostics from the sampler look good.

[6]:
mcmc_pathfinder_inits_fit.summary()
[6]:
Mean MCSE StdDev MAD 5% 50% 95% ESS_bulk ESS_tail ESS_bulk/s R_hat
lp__ -156.785000 0.051593 1.692230 1.501810 -159.916000 -156.485000 -154.61700 1087.49 2193.760 2036.500 1.004260
beta[1] 0.999460 0.000014 0.000958 0.000964 0.997890 0.999449 1.00104 4717.81 3266.460 8834.850 1.000180
beta[2] 1.000210 0.000017 0.001119 0.001110 0.998362 1.000190 1.00207 4433.28 2668.210 8302.020 1.000110
beta[3] 1.000470 0.000014 0.000949 0.000950 0.998891 1.000460 1.00200 4662.19 3373.000 8730.690 1.000380
beta[4] 1.001120 0.000016 0.001029 0.001048 0.999476 1.001120 1.00284 4349.06 3024.630 8144.300 0.999956
beta[5] 1.001550 0.000015 0.001049 0.001055 0.999851 1.001560 1.00324 4670.27 3083.370 8745.830 1.001010
sigma 0.956363 0.004191 0.065810 0.068009 0.848430 0.954728 1.06149 248.35 271.356 465.075 1.017340

If we were to instead use the default random parameter initializations, we would need to run more warmup iterations to produce useful samples. For example, if we only run the same 75 warmup iterations with random inits, the result fails to estimate sigma correctly:

[7]:
mcmc_random_inits_fit = model.sample(data=data_file, iter_warmup=75, seed=12345)
19:24:12 - cmdstanpy - INFO - CmdStan start processing

19:24:12 - cmdstanpy - INFO - CmdStan done processing.
19:24:12 - cmdstanpy - WARNING - Some chains may have failed to converge.
        Chain 1 had 10 divergent transitions (1.0%)
        Chain 3 had 180 divergent transitions (18.0%)
        Chain 4 had 94 divergent transitions (9.4%)
        Use the "diagnose()" method on the CmdStanMCMC object to see further information.

[8]:
mcmc_random_inits_fit.summary()
[8]:
Mean MCSE StdDev MAD 5% 50% 95% ESS_bulk ESS_tail ESS_bulk/s R_hat
lp__ -187.24500 24.598100 35.016100 32.136000 -244.788000 -178.206000 -154.91900 5.04533 2000.0 20.5095 2.36310
beta[1] 0.99954 0.000124 0.002125 0.001442 0.996106 0.999495 1.00339 321.44500 2000.0 1306.6900 1.16149
beta[2] 1.00021 0.000131 0.002556 0.001839 0.996256 1.000160 1.00461 381.00600 2000.0 1548.8100 1.25825
beta[3] 1.00040 0.000117 0.002194 0.001474 0.996540 1.000390 1.00372 306.93700 2000.0 1247.7100 1.22836
beta[4] 1.00133 0.000125 0.002322 0.001385 0.997753 1.001310 1.00577 407.67000 2000.0 1657.1900 1.13671
beta[5] 1.00144 0.000117 0.002437 0.001650 0.997061 1.001470 1.00544 480.40300 2000.0 1952.8600 1.19502
sigma 1.87069 0.733300 1.046490 0.913512 0.892552 1.587390 3.58658 5.00048 2000.0 20.3271 2.49136
[9]:
print(mcmc_random_inits_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
284 of 4000 (7.10%) transitions ended with a divergence.
These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.
Try increasing adapt delta closer to 1.
If this doesn't remove all divergences, try to reparameterize the model.

Checking E-BFMI - sampler transitions HMC potential energy.
The E-BFMI, 0.01, is below the nominal threshold of 0.30 which suggests that HMC may have trouble exploring the target distribution.
If possible, try to reparameterize the model.

Rank-normalized split effective sample size satisfactory for all parameters.

The following parameters had rank-normalized split R-hat greater than 1.01:
  beta[1], beta[2], beta[3], beta[4], beta[5], sigma
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizing your model with additional prior information or a more effective parameterization.

Processing complete.

The diagnostics clearly indicate problems with estimating sigma. In this case, it is necessary to run the model with at least 150 warmup iterations to produce a good set of estimates when starting from the default initialization.
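
For reference, such a run would look like the following sketch (not executed here; the variable name is illustrative):

# default random inits, but with a longer warmup phase
mcmc_longer_warmup_fit = model.sample(data=data_file, iter_warmup=150, seed=12345)
print(mcmc_longer_warmup_fit.diagnose())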

Other inference algorithms

We can follow the same pattern with Stan’s ADVI algorithm by first using the CmdStanModel.variational method. Because this algorithm is unstable and may fail to converge, we run it with the argument require_converged set to False. We also specify a seed, both to avoid instabilities and for reproducibility.

[10]:
vb_fit = model.variational(data=data_file, require_converged=False, seed=123)
19:24:12 - cmdstanpy - INFO - Chain [1] start processing
19:24:12 - cmdstanpy - INFO - Chain [1] done processing
19:24:12 - cmdstanpy - WARNING - The algorithm may not have converged.
Proceeding because require_converged is set to False

The ADVI algorithm provides estimates of all model parameters.
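
These estimates can be inspected directly; a minimal sketch (assuming the CmdStanVB object exposes the estimates via stan_variables):

# dictionary mapping parameter names to their ADVI estimates
print(vb_fit.stan_variables())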

The variational method returns a CmdStanVB object, which similarly can construct a set of inits with the .create_inits() method:

[11]:
vb_inits = vb_fit.create_inits()
for chain_init in vb_inits:
    print(chain_init)
{'beta': array([1.0079412 , 1.0006877 , 1.0090487 , 0.99748171, 0.98823884]), 'sigma': array(1.4914566)}
{'beta': array([1.0085629 , 0.99790276, 1.0083249 , 0.99795014, 0.98990057]), 'sigma': array(1.4706955)}
{'beta': array([1.0064491 , 0.99778578, 1.0078978 , 0.99819192, 0.98947641]), 'sigma': array(1.5340995)}
{'beta': array([1.005883  , 0.9996351 , 1.007835  , 0.99823935, 0.99011108]), 'sigma': array(1.4694405)}

These inits can be passed to the inits keyword argument of the sample method:

[12]:
mcmc_vb_inits_fit = model.sample(
    data=data_file, inits=vb_inits, iter_warmup=75, seed=12345
)
19:24:12 - cmdstanpy - INFO - CmdStan start processing

19:24:12 - cmdstanpy - INFO - CmdStan done processing.
19:24:12 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Consider re-running with show_console=True if the above output is unclear!

[13]:
print(mcmc_vb_inits_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

Rank-normalized split R-hat values satisfactory for all parameters.

Processing complete, no problems detected.

[14]:
mcmc_vb_inits_fit.summary()
[14]:
Mean MCSE StdDev MAD 5% 50% 95% ESS_bulk ESS_tail ESS_bulk/s R_hat
lp__ -156.921000 0.098333 1.921290 1.621930 -160.738000 -156.519000 -154.62800 535.233 575.839 1035.270 1.00523
beta[1] 0.999469 0.000013 0.000932 0.000934 0.997969 0.999470 1.00099 4853.540 3253.430 9387.900 1.00130
beta[2] 1.000230 0.000017 0.001165 0.001152 0.998214 1.000230 1.00215 4617.920 2995.360 8932.140 1.00213
beta[3] 1.000410 0.000013 0.000966 0.000962 0.998840 1.000430 1.00197 5346.620 3700.420 10341.600 1.00187
beta[4] 1.001180 0.000016 0.001063 0.001058 0.999418 1.001190 1.00292 4481.650 2890.660 8668.570 1.00024
beta[5] 1.001570 0.000016 0.001051 0.001040 0.999879 1.001560 1.00330 4361.920 2872.490 8436.990 1.00034
sigma 0.967508 0.005134 0.076829 0.070556 0.853894 0.962404 1.10761 252.375 172.951 488.153 1.01498

The sampler estimates match the reference posterior with no diagnostic issues.

Inits can also be constructed from the laplace method:

[15]:
laplace_inits = model.laplace_sample(data=data_file, seed=123).create_inits()
19:24:12 - cmdstanpy - INFO - Chain [1] start processing
19:24:12 - cmdstanpy - INFO - Chain [1] done processing
19:24:12 - cmdstanpy - INFO - Chain [1] start processing
19:24:12 - cmdstanpy - INFO - Chain [1] done processing
[16]:
mcmc_laplace_inits_fit = model.sample(
    data=data_file, inits=laplace_inits, iter_warmup=75, seed=12345
)
19:24:12 - cmdstanpy - INFO - CmdStan start processing

19:24:13 - cmdstanpy - INFO - CmdStan done processing.

[17]:
print(mcmc_laplace_inits_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

Rank-normalized split R-hat values satisfactory for all parameters.

Processing complete, no problems detected.

Inits can also be constructed from the optimize method. Since optimization attempts to return a posterior mode (a single point), this will initialize all chains at the same point, which is not typically ideal.

[18]:
optimized_inits = model.optimize(data=data_file, seed=123).create_inits()
19:24:13 - cmdstanpy - INFO - Chain [1] start processing
19:24:13 - cmdstanpy - INFO - Chain [1] done processing
19:24:13 - cmdstanpy - WARNING - The default behavior of CmdStanMLE.stan_variable() will change in a future release to always return a numpy.ndarray, even for scalar variables.
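
Printing these inits would show that every chain starts from the same point estimate (a quick check, not run here):

print(optimized_inits)
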
[19]:
mcmc_optimize_inits_fit = model.sample(
    data=data_file, inits=optimized_inits, iter_warmup=75, seed=12345
)
19:24:13 - cmdstanpy - INFO - CmdStan start processing

19:24:13 - cmdstanpy - INFO - CmdStan done processing.

[20]:
print(mcmc_optimize_inits_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

The following parameters had rank-normalized split R-hat greater than 1.01:
  sigma
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizing your model with additional prior information or a more effective parameterization.

Processing complete.

It is also possible to use the output of the sample() method itself to construct inits to be fed into a future sampling run:

[21]:
first_mcmc_inits = model.sample(
    data=data_file, iter_warmup=75, seed=12345
).create_inits()
19:24:13 - cmdstanpy - INFO - CmdStan start processing

19:24:13 - cmdstanpy - INFO - CmdStan done processing.
19:24:13 - cmdstanpy - WARNING - Some chains may have failed to converge.
        Chain 1 had 10 divergent transitions (1.0%)
        Chain 3 had 180 divergent transitions (18.0%)
        Chain 4 had 94 divergent transitions (9.4%)
        Use the "diagnose()" method on the CmdStanMCMC object to see further information.

[22]:
second_mcmc_fit = model.sample(data=data_file, inits=first_mcmc_inits, iter_warmup=75, seed=12345)
19:24:13 - cmdstanpy - INFO - CmdStan start processing

19:24:13 - cmdstanpy - INFO - CmdStan done processing.
19:24:13 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'blr.stan', line 16, column 2 to column 45)
Consider re-running with show_console=True if the above output is unclear!

[23]:
print(second_mcmc_fit.diagnose())
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Rank-normalized split effective sample size satisfactory for all parameters.

The following parameters had rank-normalized split R-hat greater than 1.01:
  sigma
Such high values indicate incomplete mixing and biased estimation.
You should consider regularizing your model with additional prior information or a more effective parameterization.

Processing complete.

We see that despite the initial sampling issues in the first MCMC run, the inits sourced from that run result in reasonable sampling in the second run.