Generating new quantities of interest.

The generated quantities block computes quantities of interest based on the data, transformed data, parameters, and transformed parameters. It can be used to:

  • generate simulated data for model testing by forward sampling

  • generate predictions for new data

  • calculate posterior event probabilities, including multiple comparisons, sign tests, etc.

  • calculate posterior expectations

  • transform parameters for reporting

  • apply full Bayesian decision theory

  • calculate log likelihoods, deviances, etc. for model comparison

Example: add posterior predictive checks to bernoulli.stan

In this example we use the CmdStan example model bernoulli.stan and data file bernoulli.data.json as our existing model and data.

We instantiate the model bernoulli, as in the “Hello World” section of the CmdStanPy tutorial notebook.

[1]:
import os
from cmdstanpy import cmdstan_path, CmdStanModel, CmdStanMCMC, CmdStanGQ

bernoulli_dir = os.path.join(cmdstan_path(), 'examples', 'bernoulli')
stan_file = os.path.join(bernoulli_dir, 'bernoulli.stan')
data_file = os.path.join(bernoulli_dir, 'bernoulli.data.json')

# instantiate, compile bernoulli model
model = CmdStanModel(stan_file=stan_file)
print(model.code())
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(1, 1); // uniform prior on interval 0,1
  y ~ bernoulli(theta);
}

The input data consists of N, the number of Bernoulli trials, and y, the list of observed outcomes. Inspection of the data shows that, on average, there is a 20% chance of success for any given Bernoulli trial.

[2]:
# examine bernoulli data
import json
import statistics
with open(data_file,'r') as fp:
    data_dict = json.load(fp)
print(data_dict)
print('mean of y: {}'.format(statistics.mean(data_dict['y'])))
{'N': 10, 'y': [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}
mean of y: 0.2

As in the “Hello World” tutorial, we produce a sample from the posterior of the model conditioned on the data:

[3]:
# fit the model to the data
fit = model.sample(data=data_file)
14:28:47 - cmdstanpy - INFO - CmdStan start processing

14:28:47 - cmdstanpy - INFO - CmdStan done processing.

The fitted model produces an estimate of theta, the chance of success.

[4]:
fit.summary()
[4]:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -7.289450 0.021530 0.713553 -8.717780 -7.01163 -6.750090 1098.43 21968.7 1.00177
theta 0.249577 0.003296 0.121900 0.079791 0.23454 0.477362 1367.84 27356.8 1.00060
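Beyond the summary table, the per-draw values of theta can be pulled out of the fit directly. The following is a minimal sketch, assuming the CmdStanMCMC object fit from the cell above; it uses the stan_variable accessor to get the draws as a NumPy array and computes the posterior mean.

import numpy as np

# extract all post-warmup draws of theta as a 1-D NumPy array,
# flattened across chains (num_chains * num_draws entries)
theta_draws = fit.stan_variable('theta')
print(theta_draws.shape)
print('posterior mean of theta: {:.3f}'.format(np.mean(theta_draws)))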

To run a posterior predictive check, we add a generated quantities block to the model, in which we generate a new data vector y_rep using the current draw of theta. The resulting model is in the file bernoulli_ppc.stan.

[5]:
model_ppc = CmdStanModel(stan_file='bernoulli_ppc.stan')
print(model_ppc.code())
14:28:47 - cmdstanpy - INFO - compiling stan file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli_ppc.stan to exe file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli_ppc
14:28:54 - cmdstanpy - INFO - compiled model executable: /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli_ppc
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(1, 1);
  y ~ bernoulli(theta);
}
generated quantities {
  array[N] int y_rep;
  for (n in 1 : N) {
    y_rep[n] = bernoulli_rng(theta);
  }
}


We run the generate_quantities method on the bernoulli_ppc model, using the existing sample fit as input. The generate_quantities method takes the values of theta in the fit sample as the set of draws from the posterior used to generate the corresponding y_rep quantities of interest.

The arguments to the generate_quantities method are:

  • data - the data used to fit the model

  • previous_fit - either a CmdStanMCMC, CmdStanVB, or CmdStanMLE object, or a list of stan-csv files

[6]:
new_quantities = model_ppc.generate_quantities(data=data_file, previous_fit=fit)
14:28:54 - cmdstanpy - INFO - Chain [1] start processing
14:28:54 - cmdstanpy - INFO - Chain [2] start processing
14:28:54 - cmdstanpy - INFO - Chain [1] done processing
14:28:54 - cmdstanpy - INFO - Chain [3] start processing
14:28:54 - cmdstanpy - INFO - Chain [2] done processing
14:28:54 - cmdstanpy - INFO - Chain [4] start processing
14:28:54 - cmdstanpy - INFO - Chain [3] done processing
14:28:54 - cmdstanpy - INFO - Chain [4] done processing
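As noted above, previous_fit can also be a list of Stan CSV files from an earlier run instead of an in-memory fit object. A sketch of that variant is shown below; the file names are hypothetical placeholders, not files produced in this tutorial.

# hypothetical file names: pass retained Stan CSV files from a previous run
csv_files = [
    'bernoulli-run-1.csv',
    'bernoulli-run-2.csv',
    'bernoulli-run-3.csv',
    'bernoulli-run-4.csv',
]
new_quantities_from_csv = model_ppc.generate_quantities(
    data=data_file, previous_fit=csv_files
)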

The generate_quantities method returns a CmdStanGQ object which contains the values of all variables in the generated quantities block of the program bernoulli_ppc.stan. Unlike the output from the sample method, it doesn’t contain any information on the joint log probability density, sampler state, or parameter and transformed parameter values.

In this example, each draw consists of y_rep, an N-length array of replicated Bernoulli outcomes corresponding to the model’s input variable y.

[7]:
print(new_quantities.draws().shape, new_quantities.column_names)
for i in range(3):
    print (new_quantities.draws()[i,:])
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
(1000, 4, 10) ('y_rep[1]', 'y_rep[2]', 'y_rep[3]', 'y_rep[4]', 'y_rep[5]', 'y_rep[6]', 'y_rep[7]', 'y_rep[8]', 'y_rep[9]', 'y_rep[10]')
[[0. 0. 0. 0. 0. 1. 0. 1. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]
[[1. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 1. 0. 1. 1. 1. 1. 0. 0.]]
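If you only need the generated values as an array, the y_rep draws can also be accessed by name. A minimal sketch, assuming the new_quantities object above and that CmdStanGQ exposes the same stan_variable accessor as the fit object:

# y_rep draws flattened across chains: one row per draw, one column per element
y_rep = new_quantities.stan_variable('y_rep')
print(y_rep.shape)                 # expected: (4000, 10) for this run
# per-draw proportion of replicated successes
print(y_rep.mean(axis=1)[:5])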

We can also use draws_pd(inc_sample=True) to get a pandas DataFrame which combines the input drawset with the generated quantities.

[8]:
sample_plus = new_quantities.draws_pd(inc_sample=True)
print(type(sample_plus),sample_plus.shape)
names = list(sample_plus.columns.values[7:18])
sample_plus.iloc[0:3, :]
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
<class 'pandas.core.frame.DataFrame'> (4000, 21)
[8]:
lp__ accept_stat__ stepsize__ treedepth__ n_leapfrog__ divergent__ energy__ theta chain__ iter__ ... y_rep[1] y_rep[2] y_rep[3] y_rep[4] y_rep[5] y_rep[6] y_rep[7] y_rep[8] y_rep[9] y_rep[10]
0 -6.96784 0.709141 1.02818 2.0 3.0 0.0 8.18547 0.338275 1.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0
1 -6.90772 1.000000 1.02818 1.0 1.0 0.0 6.99126 0.324645 1.0 2.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 -6.76658 0.906379 1.02818 2.0 3.0 0.0 7.29067 0.274579 1.0 3.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0

3 rows × 21 columns
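The combined DataFrame makes it easy to carry out the posterior predictive check itself, for example by comparing the per-draw mean of y_rep to the observed mean of y. This is a minimal sketch of one such check, assuming the sample_plus DataFrame and the data_dict and statistics objects defined in the cells above.

# posterior predictive check on the mean: compare the per-draw
# proportion of successes in y_rep to the observed proportion
y_rep_cols = [col for col in sample_plus.columns if col.startswith('y_rep')]
t_rep = sample_plus[y_rep_cols].mean(axis=1)
t_obs = statistics.mean(data_dict['y'])
print('observed mean of y: {:.2f}'.format(t_obs))
print('proportion of draws with replicated mean >= observed: {:.2f}'.format(
    (t_rep >= t_obs).mean()))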

For a model as simple as the Bernoulli model here, it would be trivial to re-run the sampler and generate a new sample which contains both the estimate of the parameter theta and the y_rep values. For models which are difficult to fit, i.e., when producing a sample is computationally expensive, the generate_quantities method is preferred.
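For completeness, the re-run alternative mentioned above is a one-liner for this cheap model, since bernoulli_ppc.stan already contains the generated quantities block. A sketch:

# refit bernoulli_ppc.stan directly, producing theta and y_rep in one sample
fit_ppc = model_ppc.sample(data=data_file)
print(fit_ppc.stan_variable('y_rep').shape)   # expected: (4000, 10)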