Generating new quantities of interest.¶
The generated quantities block computes quantities of interest based on the data, transformed data, parameters, and transformed parameters. It can be used to:
generate simulated data for model testing by forward sampling
generate predictions for new data
calculate posterior event probabilities, including multiple comparisons, sign tests, etc.
calculating posterior expectations
transform parameters for reporting
apply full Bayesian decision theory
calculate log likelihoods, deviances, etc. for model comparison
Example: add posterior predictive checks to bernoulli.stan
¶
In this example we use the CmdStan example model bernoulli.stan and data file bernoulli.data.json as our existing model and data.
We instantiate the model bernoulli
, as in the “Hello World” section of the CmdStanPy tutorial notebook.
[1]:
import os
from cmdstanpy import cmdstan_path, CmdStanModel, CmdStanMCMC, CmdStanGQ
bernoulli_dir = os.path.join(cmdstan_path(), 'examples', 'bernoulli')
stan_file = os.path.join(bernoulli_dir, 'bernoulli.stan')
data_file = os.path.join(bernoulli_dir, 'bernoulli.data.json')
# instantiate, compile bernoulli model
model = CmdStanModel(stan_file=stan_file)
print(model.code())
data {
int<lower=0> N;
array[N] int<lower=0, upper=1> y;
}
parameters {
real<lower=0, upper=1> theta;
}
model {
theta ~ beta(1, 1); // uniform prior on interval 0,1
y ~ bernoulli(theta);
}
The input data consists of N
- the number of bernoulli trials and y
- the list of observed outcomes. Inspection of the data shows that on average, there is a 20% chance of success for any given Bernoulli trial.
[2]:
# examine bernoulli data
import json
import statistics
with open(data_file,'r') as fp:
data_dict = json.load(fp)
print(data_dict)
print('mean of y: {}'.format(statistics.mean(data_dict['y'])))
{'N': 10, 'y': [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}
mean of y: 0.2
As in the “Hello World” tutorial, we produce a sample from the posterior of the model conditioned on the data:
[3]:
# fit the model to the data
fit = model.sample(data=data_file)
14:28:47 - cmdstanpy - INFO - CmdStan start processing
14:28:47 - cmdstanpy - INFO - CmdStan done processing.
The fitted model produces an estimate of theta
- the chance of success
[4]:
fit.summary()
[4]:
Mean | MCSE | StdDev | 5% | 50% | 95% | N_Eff | N_Eff/s | R_hat | |
---|---|---|---|---|---|---|---|---|---|
lp__ | -7.289450 | 0.021530 | 0.713553 | -8.717780 | -7.01163 | -6.750090 | 1098.43 | 21968.7 | 1.00177 |
theta | 0.249577 | 0.003296 | 0.121900 | 0.079791 | 0.23454 | 0.477362 | 1367.84 | 27356.8 | 1.00060 |
To run a prior predictive check, we add a generated quantities
block to the model, in which we generate a new data vector y_rep
using the current estimate of theta. The resulting model is in file bernoulli_ppc.stan
[5]:
model_ppc = CmdStanModel(stan_file='bernoulli_ppc.stan')
print(model_ppc.code())
14:28:47 - cmdstanpy - INFO - compiling stan file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli_ppc.stan to exe file /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli_ppc
14:28:54 - cmdstanpy - INFO - compiled model executable: /home/runner/work/cmdstanpy/cmdstanpy/docsrc/users-guide/examples/bernoulli_ppc
data {
int<lower=0> N;
array[N] int<lower=0, upper=1> y;
}
parameters {
real<lower=0, upper=1> theta;
}
model {
theta ~ beta(1, 1);
y ~ bernoulli(theta);
}
generated quantities {
array[N] int y_rep;
for (n in 1 : N) {
y_rep[n] = bernoulli_rng(theta);
}
}
We run the generate_quantities
method on bernoulli_ppc
using existing sample fit
as input. The generate_quantities
method takes the values of theta
in the fit
sample as the set of draws from the posterior used to generate the corresponding y_rep
quantities of interest.
The arguments to the generate_quantities
method are: + data
- the data used to fit the model + previous_fit
- either a CmdStanMCMC
, CmdStanVB
, or CmdStanMLE
object or a list of stan-csv files
[6]:
new_quantities = model_ppc.generate_quantities(data=data_file, previous_fit=fit)
14:28:54 - cmdstanpy - INFO - Chain [1] start processing
14:28:54 - cmdstanpy - INFO - Chain [2] start processing
14:28:54 - cmdstanpy - INFO - Chain [1] done processing
14:28:54 - cmdstanpy - INFO - Chain [3] start processing
14:28:54 - cmdstanpy - INFO - Chain [2] done processing
14:28:54 - cmdstanpy - INFO - Chain [4] start processing
14:28:54 - cmdstanpy - INFO - Chain [3] done processing
14:28:54 - cmdstanpy - INFO - Chain [4] done processing
The generate_quantities
method returns a CmdStanGQ
object which contains the values for all variables in the generated quantities block of the program bernoulli_ppc.stan
. Unlike the output from the sample
method, it doesn’t contain any information on the joint log probability density, sampler state, or parameters or transformed parameter values.
In this example, each draw consists of the N-length array of replicate of the bernoulli
model’s input variable y
, which is an N-length array of Bernoulli outcomes.
[7]:
print(new_quantities.draws().shape, new_quantities.column_names)
for i in range(3):
print (new_quantities.draws()[i,:])
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
(1000, 4, 10) ('y_rep[1]', 'y_rep[2]', 'y_rep[3]', 'y_rep[4]', 'y_rep[5]', 'y_rep[6]', 'y_rep[7]', 'y_rep[8]', 'y_rep[9]', 'y_rep[10]')
[[0. 0. 0. 0. 0. 1. 0. 1. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]
[[1. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 1. 0. 1. 1. 1. 1. 0. 0.]]
We can also use draws_pd(inc_sample=True)
to get a pandas DataFrame which combines the input drawset with the generated quantities.
[8]:
sample_plus = new_quantities.draws_pd(inc_sample=True)
print(type(sample_plus),sample_plus.shape)
names = list(sample_plus.columns.values[7:18])
sample_plus.iloc[0:3, :]
14:28:54 - cmdstanpy - WARNING - Sample doesn't contain draws from warmup iterations, rerun sampler with "save_warmup=True".
<class 'pandas.core.frame.DataFrame'> (4000, 21)
[8]:
lp__ | accept_stat__ | stepsize__ | treedepth__ | n_leapfrog__ | divergent__ | energy__ | theta | chain__ | iter__ | ... | y_rep[1] | y_rep[2] | y_rep[3] | y_rep[4] | y_rep[5] | y_rep[6] | y_rep[7] | y_rep[8] | y_rep[9] | y_rep[10] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -6.96784 | 0.709141 | 1.02818 | 2.0 | 3.0 | 0.0 | 8.18547 | 0.338275 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
1 | -6.90772 | 1.000000 | 1.02818 | 1.0 | 1.0 | 0.0 | 6.99126 | 0.324645 | 1.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | -6.76658 | 0.906379 | 1.02818 | 2.0 | 3.0 | 0.0 | 7.29067 | 0.274579 | 1.0 | 3.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 rows × 21 columns
For models as simple as the bernoulli models here, it would be trivial to re-run the sampler and generate a new sample which contains both the estimate of the parameters theta
as well as y_rep
values. For models which are difficult to fit, i.e., when producing a sample is computationally expensive, the generate_quantities
method is preferred.