Generating new quantities of interest.

The generated quantities block computes quantities of interest based on the data, transformed data, parameters, and transformed parameters. It can be used to:

generate simulated data for model testing by forward sampling
generate predictions for new data
calculate posterior event probabilities, including multiple comparisons, sign tests, etc.
calculate posterior expectations
transform parameters for reporting
apply full Bayesian decision theory
calculate log likelihoods, deviances, etc. for model comparison
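Several of these uses reduce to averaging a function over posterior draws. As a minimal sketch, not tied to any particular model, the following uses stand-in draws from a Beta(3, 9) distribution (an assumption chosen only for illustration) to estimate an event probability as the mean of an indicator and an expectation as a plain average. The names `draws`, `prob_gt_half`, and `expected_odds` are hypothetical.

```python
import random
import statistics

# Stand-in for posterior draws of a probability parameter; in practice these
# would come from a fitted model rather than from random.betavariate.
rng = random.Random(1234)
draws = [rng.betavariate(3, 9) for _ in range(4000)]

# Posterior event probability: the fraction of draws in which the event holds.
prob_gt_half = statistics.mean(1.0 if d > 0.5 else 0.0 for d in draws)

# Posterior expectation of a derived quantity, here the odds theta / (1 - theta).
expected_odds = statistics.mean(d / (1.0 - d) for d in draws)

print(f"Pr(theta > 0.5) ~ {prob_gt_half:.3f}")
print(f"E[odds]         ~ {expected_odds:.3f}")
```

In a Stan program the same quantities would typically be declared in the generated quantities block, so that the averaging happens over the draws Stan already produces.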

Example: add posterior predictive checks to `bernoulli.stan`

In this example we use the CmdStan example model bernoulli.stan and data file bernoulli.data.json as our existing model and data.

We instantiate the model bernoulli, as in the “Hello World” section of the CmdStanPy tutorial notebook.

[1]:

import os
from cmdstanpy import cmdstan_path, CmdStanModel, CmdStanMCMC, CmdStanGQ

bernoulli_dir = os.path.join(cmdstan_path(), 'examples', 'bernoulli')
stan_file = os.path.join(bernoulli_dir, 'bernoulli.stan')
data_file = os.path.join(bernoulli_dir, 'bernoulli.data.json')

# instantiate, compile bernoulli model
model = CmdStanModel(stan_file=stan_file)
print(model.code())

data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(1, 1); // uniform prior on interval 0,1
  y ~ bernoulli(theta);
}

The input data consists of N, the number of Bernoulli trials, and y, the list of observed outcomes. Inspection of the data shows that 2 of the 10 trials were successes, an observed success rate of 20%.

[2]:

# examine bernoulli data
import json
import statistics
with open(data_file,'r') as fp:
    data_dict = json.load(fp)
print(data_dict)
print('mean of y: {}'.format(statistics.mean(data_dict['y'])))

{'N': 10, 'y': [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}
mean of y: 0.2

As in the “Hello World” tutorial, we produce a sample from the posterior of the model conditioned on the data:

[3]:

# fit the model to the data
fit = model.sample(data=data_file)

14:28:47 - cmdstanpy - INFO - CmdStan start processing

14:28:47 - cmdstanpy - INFO - CmdStan done processing.

The fitted model produces an estimate of theta, the chance of success:

[4]:

fit.summary()

[4]:

	Mean	MCSE	StdDev	5%	50%	95%	N_Eff	N_Eff/s	R_hat
lp__	-7.289450	0.021530	0.713553	-8.717780	-7.01163	-6.750090	1098.43	21968.7	1.00177
theta	0.249577	0.003296	0.121900	0.079791	0.23454	0.477362	1367.84	27356.8	1.00060
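Because the beta(1, 1) prior is conjugate to the Bernoulli likelihood, this particular posterior is also available in closed form, which gives a quick sanity check on the sampler output. The sketch below assumes the data shown earlier (2 successes in 10 trials) and uses the standard mean and variance formulas for a Beta distribution; the variable names are illustrative.

```python
import math

y = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # observed outcomes from bernoulli.data.json
successes = sum(y)
failures = len(y) - successes

# beta(1, 1) prior + Bernoulli likelihood -> Beta(1 + successes, 1 + failures)
a = 1 + successes
b = 1 + failures

post_mean = a / (a + b)
post_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

print(f"posterior: Beta({a}, {b})")
print(f"mean = {post_mean:.4f}, sd = {post_sd:.4f}")
```

The closed-form mean of 0.25 and standard deviation of about 0.12 agree with the sampled values for theta in the summary table, up to Monte Carlo error.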

To run a posterior predictive check, we add a generated quantities block to the model, in which we generate a new data vector y_rep from the current value of theta in each draw. The resulting model is in file bernoulli_ppc.stan.

[5]:

model_ppc = CmdStanModel(stan_file='bernoulli_ppc.stan')
print(model_ppc.code())

14:28:47 - cmdstanpy - INFO - compiling stan file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli_ppc.stan to exe file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli_ppc
14:28:54 - cmdstanpy - INFO - compiled model executable: /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli_ppc

data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(1, 1);
  y ~ bernoulli(theta);
}
generated quantities {
  array[N] int y_rep;
  for (n in 1 : N) {
    y_rep[n] = bernoulli_rng(theta);
  }
}

We run the generate_quantities method on bernoulli_ppc using the existing sample fit as input. The generate_quantities method takes the values of theta in the fit sample as the set of draws from the posterior used to generate the corresponding y_rep quantities of interest.

The arguments to the generate_quantities method are:

+ data - the data used to fit the model
+ previous_fit - either a CmdStanMCMC, CmdStanVB, or CmdStanMLE object, or a list of Stan CSV files

[6]:

new_quantities = model_ppc.generate_quantities(data=data_file, previous_fit=fit)

14:28:54 - cmdstanpy - INFO - Chain [1] start processing
14:28:54 - cmdstanpy - INFO - Chain [2] start processing
14:28:54 - cmdstanpy - INFO - Chain [1] done processing
14:28:54 - cmdstanpy - INFO - Chain [3] start processing
14:28:54 - cmdstanpy - INFO - Chain [2] done processing
14:28:54 - cmdstanpy - INFO - Chain [4] start processing
14:28:54 - cmdstanpy - INFO - Chain [3] done processing
14:28:54 - cmdstanpy - INFO - Chain [4] done processing
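Conceptually, what this call does can be mimicked in plain Python: for every stored draw of theta, simulate N new Bernoulli outcomes. The sketch below is not how CmdStan implements it; `simulate_y_rep` is a hypothetical helper, and the three theta values are placeholders rather than draws taken from `fit`.

```python
import random

def simulate_y_rep(theta_draws, n_obs, seed=0):
    """Return one replicated data set of length n_obs per draw of theta."""
    rng = random.Random(seed)
    return [
        # Each outcome is 1 with probability theta, mirroring bernoulli_rng(theta).
        [1 if rng.random() < theta else 0 for _ in range(n_obs)]
        for theta in theta_draws
    ]

theta_draws = [0.34, 0.32, 0.27]  # placeholder values, not real posterior draws
y_rep = simulate_y_rep(theta_draws, n_obs=10)

for theta, rep in zip(theta_draws, y_rep):
    print(theta, rep)
```

The point of the design is that no new posterior sampling takes place: the expensive draws of theta are reused, and only the cheap random number generation is run again.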

The generate_quantities method returns a CmdStanGQ object which contains the values for all variables in the generated quantities block of the program bernoulli_ppc.stan. Unlike the output from the sample method, it doesn’t contain any information on the joint log probability density, the sampler state, or the values of parameters and transformed parameters.

In this example, each draw consists of the N-length array y_rep, a replicate of the bernoulli model’s input variable y, which is an N-length array of Bernoulli outcomes.

[7]:

print(new_quantities.draws().shape, new_quantities.column_names)
for i in range(3):
    print (new_quantities.draws()[i,:])

14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".

(1000, 4, 10) ('y_rep[1]', 'y_rep[2]', 'y_rep[3]', 'y_rep[4]', 'y_rep[5]', 'y_rep[6]', 'y_rep[7]', 'y_rep[8]', 'y_rep[9]', 'y_rep[10]')
[[0. 0. 0. 0. 0. 1. 0. 1. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]
[[1. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 1. 0. 1. 1. 1. 1. 0. 0.]]
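A posterior predictive check compares a statistic of the observed data with the same statistic computed on each replicate. As a small sketch, the list below holds the success counts of the twelve replicated arrays printed above (the first three iterations of each of the four chains); the observed count is 2. Twelve replicates are far too few for a real check, so this only illustrates the arithmetic. The variable names are hypothetical.

```python
import statistics

observed_successes = 2  # sum of y in bernoulli.data.json

# Success counts of the twelve y_rep arrays printed above.
rep_successes = [3, 1, 2, 1, 0, 0, 0, 2, 2, 1, 0, 6]

# Fraction of replicates with at least as many successes as the observed data.
p_value = statistics.mean(
    1.0 if s >= observed_successes else 0.0 for s in rep_successes
)

print(f"mean replicated successes: {statistics.mean(rep_successes)}")
print(f"fraction of replicates >= observed: {p_value:.3f}")
```

A fraction close to 0 or 1 would suggest that the model rarely reproduces data like those observed. With the full output, the same statistic would be computed over all 4000 replicates rather than twelve.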

We can also use draws_pd(inc_sample=True) to get a pandas DataFrame which combines the input drawset with the generated quantities.

[8]:

sample_plus = new_quantities.draws_pd(inc_sample=True)
print(type(sample_plus),sample_plus.shape)
names = list(sample_plus.columns.values[7:18])
sample_plus.iloc[0:3, :]

14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".

<class 'pandas.core.frame.DataFrame'> (4000, 21)

[8]:

	lp__	accept_stat__	stepsize__	treedepth__	n_leapfrog__	energy__	theta	chain__	iter__	...	y_rep[1]	y_rep[5]	y_rep[6]	y_rep[8]	y_rep[10]
0	-6.96784	0.709141	1.02818	2.0	3.0	8.18547	0.338275	1.0	1.0	...	0.0	0.0	1.0	1.0	1.0
1	-6.90772	1.000000	1.02818	1.0	1.0	6.99126	0.324645	1.0	2.0	...	0.0	0.0	0.0	0.0	0.0
2	-6.76658	0.906379	1.02818	2.0	3.0	7.29067	0.274579	1.0	3.0	...	1.0	1.0	0.0	0.0	0.0

3 rows × 21 columns

For models as simple as the bernoulli models here, it would be trivial to re-run the sampler and generate a new sample which contains both the estimates of the parameter theta and the y_rep values. For models which are difficult to fit, i.e., when producing a sample is computationally expensive, the generate_quantities method is preferred.
