
26.1 Simulating from the posterior predictive distribution

The posterior predictive distribution is the distribution over new observations given previous observations. It’s predictive in the sense that it’s predicting behavior on new data that is not part of the training set. It’s posterior in that everything is conditioned on observed data \(y\).

The posterior predictive distribution for replications \(y^{\textrm{rep}}\) of the original data set \(y\) is defined by marginalizing over the model parameters \(\theta\), \[ p(y^{\textrm{rep}} \mid y) = \int p(y^{\textrm{rep}} \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta. \]
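Although this integral is rarely available in closed form, it is straightforward to approximate by Monte Carlo. Given posterior draws \(\theta^{(1)}, \ldots, \theta^{(M)}\) from \(p(\theta \mid y)\), simulating one replication \(y^{\textrm{rep}(m)} \sim p(y^{\textrm{rep}} \mid \theta^{(m)})\) per draw yields a sample from the posterior predictive distribution, because \[ p(y^{\textrm{rep}} \mid y) \approx \frac{1}{M} \sum_{m=1}^{M} p(y^{\textrm{rep}} \mid \theta^{(m)}). \] This is the computation the generated quantities block performs, once per iteration.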

As with other posterior predictive quantities, generating a replicated data set \(y^{\textrm{rep}}\) from the posterior predictive distribution is straightforward using the generated quantities block. Consider a simple regression model with parameters \(\theta = (\alpha, \beta, \sigma).\)

data {
  int<lower = 0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y ~ normal(alpha + beta * x, sigma);
}

To generate a replicated data set y_rep for this simple model, the following generated quantities block suffices.

generated quantities {
  array[N] real y_rep = normal_rng(alpha + beta * x, sigma);
}

The replicated data variable y_rep is declared to be the same size as the original data y, but as an array of reals rather than a vector in order to match the return type of the vectorized normal_rng. The random number generator is applied to the original predictors x along with the model parameters alpha, beta, and sigma, so each iteration produces one complete replicated data set. Posterior predictive sampling for posterior predictive checks differs from the usual posterior predictive sampling discussed in the chapter on posterior predictions in that it uses the original predictors \(x\). That is, the posterior predictions are replications of the original data rather than predictions for new data.
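For readers less familiar with vectorized random number generation, the following generated quantities block is a sketch of an equivalent, if more verbose, formulation that makes the element-wise simulation explicit.

generated quantities {
  array[N] real y_rep;
  for (n in 1:N) {
    // draw one replicated observation for each original predictor x[n]
    y_rep[n] = normal_rng(alpha + beta * x[n], sigma);
  }
}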