“Hello, World!”

Fitting a Stan model using the NUTS-HMC sampler

In order to verify the installation and also to demonstrate the CmdStanPy workflow, we use CmdStanPy to fit the the example Stan model bernoulli.stan to the dataset bernoulli.data.json. The bernoulli.stan is a Hello, World! program which illustrates the basic syntax of the Stan language. It allows the user to verify that CmdStanPy, CmdStan, the StanC compiler, and the C++ toolchain have all been properly installed.

For substantive example models and guidance on coding statistical models in Stan, see the Stan User’s Guide.

The Stan model

The model bernoulli.stan is a trivial model: given a set of N observations of i.i.d. binary data y[1] … y[N], it calculates the Bernoulli chance-of-success theta.

data {
  int<lower=0> N;
  array[N] int<lower=0,upper=1> y;
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);  // uniform prior on interval 0,1
  y ~ bernoulli(theta);
}

The CmdStanModel class manages the Stan program and its corresponding compiled executable. It provides properties and functions to inspect the model code and filepaths. A CmdStanModel can be instantiated from a Stan file or its corresponding compiled executable file.

# import packages
In [1]: import os

In [2]: from cmdstanpy import CmdStanModel

# specify Stan program file
In [3]: stan_file = os.path.join('users-guide', 'examples', 'bernoulli.stan')

# instantiate the model object
In [4]: model = CmdStanModel(stan_file=stan_file)
INFO:cmdstanpy:compiling stan file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli.stan to exe file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli
INFO:cmdstanpy:compiled model executable: /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli

# inspect model object
In [5]: print(model)
CmdStanModel: name=bernoulli
	 stan_file=/home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli.stan
	 exe_file=/home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli
	 compiler_options=stanc_options={}, cpp_options={}

# inspect compiled model
In [6]: print(model.exe_info())
{'stan_version_major': '2', 'stan_version_minor': '34', 'stan_version_patch': '1', 'STAN_THREADS': 'false', 'STAN_MPI': 'false', 'STAN_OPENCL': 'false', 'STAN_NO_RANGE_CHECKS': 'false', 'STAN_CPP_OPTIMS': 'false'}

Data inputs

CmdStanPy accepts input data either as a Python dictionary which maps data variable names to values, or as the corresponding JSON file.

The bernoulli model requires two inputs: the number of observations N, and an N-length vector y of binary outcomes. The data file bernoulli.data.json contains the following inputs:

{
 "N" : 10,
 "y" : [0,1,0,0,0,0,0,0,0,1]
}

Fitting the model

The sample() method is used to do Bayesian inference over the model conditioned on data using using Hamiltonian Monte Carlo (HMC) sampling. It runs Stan’s HMC-NUTS sampler on the model and data and returns a CmdStanMCMC object. The data can be specified either as a filepath or a Python dictionary; in this example, we use the example datafile bernoulli.data.json: By default, the sample method runs 4 sampler chains.

# specify data file
In [7]: data_file = os.path.join('users-guide', 'examples', 'bernoulli.data.json')

# fit the model
In [8]: fit = model.sample(data=data_file)
INFO:cmdstanpy:CmdStan start processing
                                                                                                                                                                                                                                                                                                                                
INFO:cmdstanpy:CmdStan done processing.

Note this model can be fit using other methods

Accessing the results

The sampler outputs are the set of per-chain Stan CSV files, a non-standard CSV file format. Each data row of the Stan CSV file contains the per-iteration estimate of the Stan model parameters, transformed parameters, and generated quantities variables. Container variables, i.e., vector, row-vector, matrix, and array variables are necessarily serialized into a single row’s worth of data. The output objects parse the set of Stan CSV files into a set of in-memory data structures and provide accessor functions for the all estimates and metadata. CmdStanPy makes a distinction between the per-iteration model outputs and the per-iteration algorithm outputs: the former are ‘stan_variables’ and the latter are ‘method_variables’.

The CmdStanMCMC object provides the following accessor methods:

  • stan_variable(): returns an numpy.ndarray whose structure corresponds to the Stan program variable structure

  • stan_variables(): returns an Python dictionary mapping the Stan program variable names to the corresponding numpy.ndarray.

  • draws(): returns a numpy.ndarray which is either a 3-D array draws X chains X CSV columns, or a 2-D array draws X columns, where the chains are concatenated into a single column. The argument vars can be used to restrict this to just the columns for one or more variables.

  • draws_pd(): returns a pandas.DataFrame over all columns in the Stan CSV file. The argument vars can be used to restrict this to one or more variables.

  • draws_xr(): returns an xarray.Dataset which maps model variable names to their respective values. The argument vars can be used to restrict this to one or more variables.

  • method_variables(): returns a Python dictionary over the sampler diagnostic/information output columns which by convention end in __, e.g., lp__.

# access model variable by name
In [9]: print(fit.stan_variable('theta'))
[0.174379 0.142577 0.104175 ... 0.296568 0.275516 0.289409]

In [10]: print(fit.draws_pd('theta')[:3])
      theta
0  0.174379
1  0.142577
2  0.104175

In [11]: print(fit.draws_xr('theta'))
<xarray.Dataset> Size: 40kB
Dimensions:  (draw: 1000, chain: 4)
Coordinates:
  * chain    (chain) int64 32B 1 2 3 4
  * draw     (draw) int64 8kB 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999
Data variables:
    theta    (chain, draw) float64 32kB 0.1744 0.1426 0.1042 ... 0.2755 0.2894
Attributes:
    stan_version:        2.34.1
    model:               bernoulli_model
    num_draws_sampling:  1000

# access all model variables
In [12]: for k, v in fit.stan_variables().items():
   ....:     print(f'{k}\t{v.shape}')
   ....: 
theta	(4000,)

# access the sampler method variables
In [13]: for k, v in fit.method_variables().items():
   ....:     print(f'{k}\t{v.shape}')
   ....: 
lp__	(1000, 4)
accept_stat__	(1000, 4)
stepsize__	(1000, 4)
treedepth__	(1000, 4)
n_leapfrog__	(1000, 4)
divergent__	(1000, 4)
energy__	(1000, 4)

# access all Stan CSV file columns
In [14]: print(f'numpy.ndarray of draws: {fit.draws().shape}')
numpy.ndarray of draws: (1000, 4, 8)

In [15]: fit.draws_pd()
Out[15]: 
      chain__  iter__  draw__  ...  divergent__  energy__     theta
0         1.0     1.0     1.0  ...          0.0   6.96995  0.174379
1         1.0     2.0     2.0  ...          0.0   7.23168  0.142577
2         1.0     3.0     3.0  ...          0.0   8.29890  0.104175
3         1.0     4.0     4.0  ...          0.0   8.29236  0.083123
4         1.0     5.0     5.0  ...          0.0   8.59298  0.076468
...       ...     ...     ...  ...          ...       ...       ...
3995      4.0   996.0  3996.0  ...          0.0   7.11186  0.362054
3996      4.0   997.0  3997.0  ...          0.0   7.09438  0.333609
3997      4.0   998.0  3998.0  ...          0.0   6.90865  0.296568
3998      4.0   999.0  3999.0  ...          0.0   6.79980  0.275516
3999      4.0  1000.0  4000.0  ...          0.0   6.79844  0.289409

[4000 rows x 11 columns]

In addition to the MCMC sample itself, the CmdStanMCMC object provides access to the the per-chain HMC tuning parameters from the NUTS-HMC adaptive sampler, (if present).

In [16]: print(fit.metric_type)
diag_e

In [17]: print(fit.metric)
[[0.548827]
 [0.618353]
 [0.488154]
 [0.506284]]

In [18]: print(fit.step_size)
[0.916475 0.945805 0.890823 1.05791 ]

The CmdStanMCMC object also provides access to metadata about the model and the sampler run.

In [19]: print(fit.metadata.cmdstan_config['model'])
bernoulli_model

In [20]: print(fit.metadata.cmdstan_config['seed'])
5650

CmdStan utilities: stansummary, diagnose

CmdStan is distributed with a posterior analysis utility stansummary that reads the outputs of all chains and computes summary statistics for all sampler and model parameters and quantities of interest. The CmdStanMCMC method summary() runs this utility and returns summaries of the total joint log-probability density lp__ plus all model parameters and quantities of interest in a pandas.DataFrame:

In [21]: fit.summary()
Out[21]: 
           Mean      MCSE    StdDev  ...    N_Eff  N_Eff/s    R_hat
lp__  -7.244140  0.016066  0.686744  ...  1827.16  45679.0  1.00054
theta  0.249347  0.003168  0.116642  ...  1356.00  33900.0  1.00428

[2 rows x 9 columns]

CmdStan is distributed with a second posterior analysis utility diagnose which analyzes the per-draw sampler parameters across all chains looking for potential problems which indicate that the sample isn’t a representative sample from the posterior. The diagnose() method runs this utility and prints the output to the console.

In [22]: print(fit.diagnose())
Processing csv files: /tmp/tmpgvv426mx/bernoullig33uprfe/bernoulli-20240326163037_1.csv, /tmp/tmpgvv426mx/bernoullig33uprfe/bernoulli-20240326163037_2.csv, /tmp/tmpgvv426mx/bernoullig33uprfe/bernoulli-20240326163037_3.csv, /tmp/tmpgvv426mx/bernoullig33uprfe/bernoulli-20240326163037_4.csv

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.