24.7 Stand-alone generated quantities and ongoing prediction


Stan’s sampling algorithms take a Stan program representing a posterior $p(\theta \mid y, x)$ along with actual data $x$ and $y$ to produce a set of draws $\theta^{(1)}, \ldots, \theta^{(M)}$ from the posterior. Posterior predictive draws $\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)$ can be generated by drawing $\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, \theta^{(m)})$ from the sampling distribution. Note that drawing $\tilde{y}^{(m)}$ depends only on the new predictors $\tilde{x}$ and the posterior draws $\theta^{(m)}$. Most importantly, neither the original data nor the model density is required.
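The reason the original data drop out can be made explicit. As a sketch, assuming that $\tilde{y}$ is conditionally independent of $(x, y)$ given $\theta$ and that the posterior for $\theta$ does not depend on $\tilde{x}$ (both hold for the linear regression used in this section), the posterior predictive density is the sampling distribution averaged over the posterior:

$$
p(\tilde{y} \mid \tilde{x}, x, y)
= \int p(\tilde{y} \mid \tilde{x}, \theta) \, p(\theta \mid x, y) \, \mathrm{d}\theta .
$$

Under the same assumptions, drawing one $\tilde{y}^{(m)}$ per posterior draw $\theta^{(m)}$ gives the usual Monte Carlo estimate of any posterior predictive expectation, where $f$ is an arbitrary function of the prediction introduced here for illustration:

$$
\mathbb{E}[f(\tilde{y}) \mid \tilde{x}, x, y]
\approx \frac{1}{M} \sum_{m=1}^{M} f\!\left(\tilde{y}^{(m)}\right).
$$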

By saving the posterior draws, predictions for new data items $\tilde{x}$ may be generated whenever needed. In Stan’s interfaces, this is done by writing a second Stan program that takes as input the original program’s parameters and the new predictors. For example, in the linear regression case, the program used to draw from the posterior declares the data and parameters and defines the model.

data {
  int<lower = 0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ lognormal(0, 0.5);
}

A second program can be used to generate new observations. This follow-on program need only declare the parameters as they were originally defined. This may require defining constants in the data block, such as sizes and hyperparameters, that are involved in parameter size or constraint declarations. Additional data is then read in corresponding to the predictors for new outcomes that have yet to be observed. There is no need to repeat the model block, or any transformed parameters or generated quantities that the predictions do not use. The complete follow-on program for prediction just declares the predictors in the data block, the original parameters, and then the predictions in the generated quantities block.

data {
  int<lower = 0> N_tilde;
  vector[N_tilde] x_tilde;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
generated quantities {
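  // normal_rng returns an array when given vector arguments,
  // so convert the result before assigning it to a vector
  vector[N_tilde] y_tilde
    = to_vector(normal_rng(alpha + beta * x_tilde, sigma));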
}

When running stand-alone generated quantities, the inputs required are the original draws for the parameters and any predictors corresponding to new predictions, and the output will be draws for $\tilde{y}$ or derived quantities such as event probabilities.

Any posterior predictive quantities desired may be generated this way. For example, event probabilities are estimated in the usual way by defining indicator variables in the generated quantities block.
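As a minimal sketch of the indicator-variable approach, the follow-on program below flags, for each new predictor, whether the simulated outcome exceeds a cutoff. The names `y_threshold` and `exceeds` are hypothetical and chosen only for illustration. The rest mirrors the prediction program for the linear regression above.

```stan
data {
  int<lower = 0> N_tilde;
  vector[N_tilde] x_tilde;
  real y_threshold;  // hypothetical cutoff that defines the event
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
generated quantities {
  // exceeds[n] is 1 if the simulated outcome for x_tilde[n]
  // is above the cutoff in this draw, and 0 otherwise
  vector[N_tilde] exceeds;
  for (n in 1:N_tilde) {
    exceeds[n] = normal_rng(alpha + beta * x_tilde[n], sigma) > y_threshold;
  }
}
```

Averaging `exceeds[n]` over the $M$ draws would then estimate the event probability $\Pr[\tilde{y}_n > \texttt{y\_threshold} \mid \tilde{x}_n, x, y]$. The indicators are stored in a vector rather than an integer array only so that the sketch does not depend on a particular array declaration syntax.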