Generating Quantities of Interest from a Fitted Model
The generate_quantities
method allows you to generate additional quantities of interest from a fitted model without re-running the sampler. Instead, you write a modified version of the original Stan program and add a generated quantities block or modify the existing one which specifies how to compute the new quantities of interest. Running the generate_quantities
method on the new program together with sampler outputs (i.e., a set of draws) from the fitted model runs the generated quantities block of the new program using the the existing sample by plugging in the per-draw parameter estimates for the computations in the generated quantities block.
This method requires sub-argument fitted_params
which takes as its value an existing Stan CSV file that contains a parameter values from an equivalent model, i.e., a model with the same parameters block, conditioned on the same data.
The generated quantities block computes quantities of interest (QOIs) based on the data, transformed data, parameters, and transformed parameters. It can be used to:
- generate simulated data for model testing by forward sampling
- generate predictions for new data
- calculate posterior event probabilities, including multiple comparisons, sign tests, etc.
- calculate posterior expectations
- transform parameters for reporting
- apply full Bayesian decision theory
- calculate log likelihoods, deviances, etc. for model comparison
For an overview of the uses of this feature, see the Stan User’s Guide section on Stand-alone generated quantities and ongoing prediction.
Example
To illustrate how this works we use the generate_quantities
method to do posterior predictive checks using the estimate of theta
given the example bernoulli model and data, following the posterior predictive simulation procedure in the Stan User’s Guide.
We write a program bernoulli_ppc.stan
which contains the following generated quantities block, with comments to explain the procedure:
generated quantities {
array[N] int y_sim;
// use current estimate of theta to generate new sample
for (n in 1:N) {
y_sim[n] = bernoulli_rng(theta);
}// estimate theta_rep from new sample
real<lower=0, upper=1> theta_rep = sum(y_sim) * 1.0 / N;
}
The rest of the program is the same as in bernoulli.stan
.
The generate_method
requires the sub-argument fitted_params
which takes as its value the name of a Stan CSV file. The per-draw parameter values from the fitted_params
file will be used to run the generated quantities block.
If we run the bernoulli.stan
program for a single chain to generate a sample in file bernoulli_fit.csv
:
> ./bernoulli sample data file=bernoulli.data.json output file=bernoulli_fit.csv
Then we can run the bernoulli_ppc.stan
to carry out the posterior predictive checks:
> ./bernoulli_ppc generate_quantities fitted_params=bernoulli_fit.csv \
data file=bernoulli.data.json \
output file=bernoulli_ppc.csv
The output file bernoulli_ppc.csv
contains only the values for the variables declared in the generated quantities
block, i.e., theta_rep
and the elements of y_sim
:
# model = bernoulli_ppc_model
# method = generate_quantities
# generate_quantities
# fitted_params = bernoulli_fit.csv
# id = 1 (Default)
# data
# file = bernoulli.data.json
# init = 2 (Default)
# random
# seed = 2983956445 (Default)
# output
# file = output.csv (Default)
y_sim.1,y_sim.2,y_sim.3,y_sim.4,y_sim.5,y_sim.6,y_sim.7,y_sim.8,y_sim.9,y_sim.10,theta_rep
1,1,1,0,0,0,1,1,0,1,0.6
1,1,0,1,0,0,1,0,1,0,0.5
1,0,1,1,1,1,1,1,0,1,0.8
0,1,0,1,0,1,0,1,0,0,0.4
1,0,0,0,0,0,0,0,0,0,0.1
0,0,0,0,0,1,1,1,0,0,0.3
0,0,1,0,1,0,0,0,0,0,0.2
1,0,1,0,1,1,0,1,1,0,0.6
...
Given the current implementation, to see the fitted parameter values for each draw, create a copy variable in the generated quantities block, e.g.:
generated quantities {
array[N] int y_sim;
// use current estimate of theta to generate new sample
for (n in 1:N) {
y_sim[n] = bernoulli_rng(theta);
}real<lower=0, upper=1> theta_cp = theta;
// estimate theta_rep from new sample
real<lower=0, upper=1> theta_rep = sum(y_sim) * 1.0 / N;
}
Now the output is slightly more interpretable: theta_cp
is the same as the theta
used to generate the values y_sim[1]
through y_sim[1]
. Comparing columns theta_cp
and theta_rep
allows us to see how the uncertainty in our estimate of theta
is carried forward into our predictions:
y_sim.1,y_sim.2,y_sim.3,y_sim.4,y_sim.5,y_sim.6,y_sim.7,y_sim.8,y_sim.9,y_sim.10,theta_cp,theta_rep
0,1,1,0,1,0,0,1,1,0,0.545679,0.5
1,1,1,1,1,1,0,1,1,0,0.527164,0.8
1,1,1,1,0,1,1,1,1,0,0.529116,0.8
1,0,1,1,1,1,0,0,1,0,0.478844,0.6
0,1,0,0,0,0,1,0,1,0,0.238793,0.3
0,0,0,0,0,1,1,0,0,0,0.258294,0.2
1,1,1,0,0,0,0,0,0,0,0.258465,0.3
Errors
The fitted_params
file must be a Stan CSV file; attempts to use a regular CSV file will result an error message of the form:
Error reading fitted param names from sample csv file <filename.csv>
The fitted_params
file must contain columns corresponding to legal values for all parameters defined in the model. If any parameters are missing, the program will exit with an error message of the form:
Error reading fitted param names from sample csv file <filename.csv>
The parameter values of the fitted_params
are on the constrained scale and must obey all constraints. For example, if we modify the contents of the first reported draw in bernoulli_fit.csv
so that the value of theta
is outside the declared bounds real<lower=0, upper=1>
, the program will return the following error message:
Exception: lub_free: Bounded variable is 1.21397, but must be in the interval [0, 1] \
(in 'bernoulli_ppc.stan', line 5, column 2 to column 30)