Using Variational Estimates to Initialize the NUTS-HMC Sampler

In this example we show how to use the parameter estimates returned by Stan’s variational inference algorithms, Pathfinder and ADVI, as the initial parameter values for Stan’s NUTS-HMC sampler. By default, the sampler randomly initializes all model parameters uniformly in the range [-2, 2] on the unconstrained scale. When the true parameter values lie outside of this range, starting from the Pathfinder or ADVI estimates will speed up and improve adaptation.

Model and data

The Stan model and data are taken from the posteriordb package.

We use the blr model, a standard Bayesian linear regression model with noninformative priors, and its corresponding simulated dataset sblri.json, which was generated by the script sblr.R. For convenience, we have copied the posteriordb model and data to this directory, as files blr.stan and sblri.json.

[1]:
import os
from cmdstanpy import CmdStanModel

stan_file = 'blr.stan' # basic linear regression
data_file = 'sblri.json' # simulated data

model = CmdStanModel(stan_file=stan_file)

print(model.code())
data {
  int<lower=0> N;
  int<lower=0> D;
  matrix[N, D] X;
  vector[N] y;
}
parameters {
  vector[D] beta;
  real<lower=0> sigma;
}
model {
  // prior
  target += normal_lpdf(beta | 0, 10);
  target += normal_lpdf(sigma | 0, 10);
  // likelihood
  target += normal_lpdf(y | X * beta, sigma);
}


Run Stan’s Pathfinder or variational (ADVI) algorithm, obtain fitted estimates

The CmdStanModel pathfinder method wraps the CmdStan pathfinder method.

Pathfinder locates normal approximations to the target density along a quasi-Newton optimization path, with local covariance estimated using the negative inverse Hessian estimates produced by the L-BFGS optimizer. Pathfinder returns draws from the Gaussian approximation with the lowest estimated Kullback-Leibler (KL) divergence to the true posterior. By default, CmdStanPy runs multi-path Pathfinder, which returns an importance-resampled set of draws over the outputs of 4 independent single-path Pathfinders. This better matches non-normal target densities and also mitigates the problem of L-BFGS getting stuck at local optima or in saddle points on plateaus.
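If you want to compare multi-path and single-path Pathfinder for yourself, the number of optimization paths can be reduced. A minimal sketch (not run in this notebook), assuming the num_paths argument of the pathfinder method and the model and data_file objects defined in the first cell:

# Sketch: single-path Pathfinder, i.e. one L-BFGS optimization path.
# Assumes `model` and `data_file` from the first cell above and that
# the pathfinder method accepts a num_paths argument.
single_path_fit = model.pathfinder(data=data_file, num_paths=1, seed=123)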

The method create_inits returns a set of per-chain initializations for the model parameters, as a list of Python dicts, one per chain. Each initialization is a random draw from the Pathfinder sample.

[2]:
pathfinder_fit = model.pathfinder(data=data_file, seed=123)
17:01:02 - cmdstanpy - INFO - Chain [1] start processing
17:01:02 - cmdstanpy - INFO - Chain [1] done processing

Posteriordb provides reference posteriors for all models. For the blr model, conditioned on the dataset sblri.json, the reference posterior is in file sblri-blr.json.

The reference posterior means for all elements of beta and for sigma are very close to 1.0.

The experiments reported in Figure 3 of the paper Pathfinder: Parallel quasi-Newton variational inference by Zhang et al. show that Pathfinder provides a better estimate of the posterior, as measured by the 1-Wasserstein distance to the reference posterior, than 75 iterations of the Phase I warmup algorithm used by the NUTS-HMC sampler. Furthermore, Pathfinder is more computationally efficient, requiring fewer evaluations of the log density and gradient functions. Therefore, using the Pathfinder estimates to initialize the parameter values for the NUTS-HMC sampler allows the sampler to do a better job of adapting the stepsize and metric during warmup, resulting in better performance and estimation.

[3]:
pathfinder_inits = pathfinder_fit.create_inits()
print(pathfinder_inits)
[{'beta': array([0.996649, 0.999455, 1.00093 , 0.99873 , 1.00207 ]), 'sigma': array(0.934232)}, {'beta': array([1.00016 , 0.998764, 1.00055 , 1.00212 , 1.00047 ]), 'sigma': array(1.04441)}, {'beta': array([1.00139 , 0.997917, 1.00134 , 1.00123 , 1.00116 ]), 'sigma': array(0.946814)}, {'beta': array([0.999491, 0.999225, 1.00114 , 0.999147, 0.998943]), 'sigma': array(0.977812)}]
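If you plan to run a different number of sampler chains, create_inits can be asked for a matching number of initializations. A minimal sketch, assuming create_inits accepts chains and seed arguments:

# Sketch: one initialization per planned chain (here, 2 chains).
# Assumes the chains and seed arguments of create_inits.
two_chain_inits = pathfinder_fit.create_inits(chains=2, seed=123)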
[4]:
mcmc_pathfinder_inits_fit = model.sample(
    data=data_file, inits=pathfinder_inits, iter_warmup=75, seed=12345
)
17:01:05 - cmdstanpy - INFO - CmdStan start processing

17:01:05 - cmdstanpy - INFO - CmdStan done processing.
17:01:05 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in '/Users/mitzi/github/stan-dev/cmdstanpy/docsrc/users-guide/examples/blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in '/Users/mitzi/github/stan-dev/cmdstanpy/docsrc/users-guide/examples/blr.stan', line 16, column 2 to column 45)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in '/Users/mitzi/github/stan-dev/cmdstanpy/docsrc/users-guide/examples/blr.stan', line 16, column 2 to column 45)
Consider re-running with show_console=True if the above output is unclear!
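These exception messages are informational rather than errors: early in warmup the sampler may propose a value of sigma that underflows to zero, the proposal is rejected, and sampling continues. As long as such messages do not persist throughout sampling, they can be safely ignored.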

[ ]:
mcmc_pathfinder_inits_fit.summary()

With the default random parameter initializations, more warmup iterations are needed. If we run only 75 warmup iterations with random inits, the result fails to estimate sigma correctly; at least 150 warmup iterations are needed to produce a good set of estimates (see the sketch after the cells below).

[ ]:
mcmc_random_inits_fit = model.sample(data=data_file, iter_warmup=75, seed=12345)
[ ]:
mcmc_random_inits_fit.summary()
[ ]:
print(mcmc_random_inits_fit.diagnose())
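As noted above, a longer warmup is needed when starting from the default random initializations. A minimal sketch of that run (not executed in this notebook); the variable name is hypothetical:

# Sketch: with random inits, roughly twice the warmup is needed
# to recover good estimates of sigma for this model.
mcmc_random_inits_150_fit = model.sample(data=data_file, iter_warmup=150, seed=12345)
mcmc_random_inits_150_fit.summary()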

The CmdStanModel method variational runs CmdStan’s ADVI algorithm. Because this algorithm can be unstable and may fail to converge, we run it with the argument require_converged set to False. We also specify a seed, both to avoid instabilities and for reproducibility.

[ ]:
vb_fit = model.variational(data=data_file, require_converged=False, seed=123)

The ADVI algorithm provides estimates of all model parameters.

The variational method returns a CmdStanVB object, whose stan_variables method returns the approximate posterior draws of all model parameters as a Python dictionary. Here, we report the approximate posterior mean of each parameter.

[ ]:
vb_mean = {
    var: samples.mean(axis=0)
    for var, samples in vb_fit.stan_variables(mean=False).items()
}
print(vb_mean)
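Alternatively, depending on your CmdStanPy version, the per-variable means may be available directly. A minimal sketch, assuming stan_variables(mean=True) returns the ADVI mean estimates rather than the draws:

# Sketch: request the variational means directly
# (assumes the mean=True option of stan_variables).
vb_mean_direct = vb_fit.stan_variables(mean=True)
print(vb_mean_direct)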
[ ]:
mcmc_vb_inits_fit = model.sample(
    data=data_file, inits=vb_mean, iter_warmup=75, seed=12345
)
[ ]:
mcmc_vb_inits_fit.summary()

The sampler estimates match the reference posterior.

[ ]:
print(mcmc_vb_inits_fit.diagnose())
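To compare the three sampler runs at a glance, the summary DataFrames can be printed side by side. A minimal sketch, assuming all three fits above have been run and the Mean, StdDev, and R_hat column labels of the summary DataFrame:

# Sketch: compare posterior means, standard deviations, and R-hat
# across the three sampler runs.
fits = {
    'pathfinder inits': mcmc_pathfinder_inits_fit,
    'random inits': mcmc_random_inits_fit,
    'ADVI inits': mcmc_vb_inits_fit,
}
for label, fit in fits.items():
    print(label)
    print(fit.summary()[['Mean', 'StdDev', 'R_hat']])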