28.1 Evaluating posterior predictive densities

Given training data \((x, y)\) consisting of parallel sequences of predictors and observations and test data \((\tilde{x}, \tilde{y})\) of the same structure, the posterior predictive density is \[ p(\tilde{y} \mid \tilde{x}, x, y) = \int p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\theta, \]

where \(\theta\) is the vector of model parameters. This predictive density is the density of the test observations, conditioned on both the test predictors \(\tilde{x}\) and the training data \((x, y).\)

This integral may be calculated with Monte Carlo methods as usual, \[ p(\tilde{y} \mid \tilde{x}, x, y) \approx \frac{1}{M} \sum_{m = 1}^M p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \] where the \(\theta^{(m)} \sim p(\theta \mid x, y)\) are draws from the posterior given only the training data \((x, y).\)

To avoid underflow in calculations, it will be more stable to compute densities on the log scale. Taking the logarithm and pushing it through results in a stable computation, \[\begin{eqnarray*} \log p(\tilde{y} \mid \tilde{x}, x, y) & \approx & \log \frac{1}{M} \sum_{m = 1}^M p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \\[4pt] & = & -\log M + \log \sum_{m = 1}^M p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \\[4pt] & = & -\log M + \log \sum_{m = 1}^M \exp(\log p(\tilde{y} \mid \tilde{x}, \theta^{(m)})) \\[4pt] & = & -\log M + \textrm{log-sum-exp}_{m = 1}^M \log p(\tilde{y} \mid \tilde{x}, \theta^{(m)}) \end{eqnarray*}\] where the log sum of exponentials function is defined so as to make the above equation hold, \[ \textrm{log-sum-exp}_{m = 1}^M \, \mu_m = \log \sum_{m=1}^M \exp(\mu_m). \] The log sum of exponentials function can be implemented so as to avoid underflow and maintain high arithmetic precision as \[ \textrm{log-sum-exp}_{m = 1}^M \mu_m = \textrm{max}(\mu) + \sum_{m = 1}^M \exp(\mu_m - \textrm{max}(\mu)). \] Pulling the maximum out preserves all of its precision. By subtracting the maximum, the terms \(\mu_m - \textrm{max}(\mu) \leq 0\), and thus will not overflow.

28.1.1 Stan program

To evaluate the log predictive density of a model, it suffices to implement the log predictive density of the test data in the generated quantities block. The log sum of exponentials calculation must be done on the outside of Stan using the posterior draws of \(\log p(\tilde{y} \mid \tilde{x}, \theta^{(m)}).\)

Here is the code for evaluating the log posterior predictive density in a simple linear regression of the test data \(\tilde{y}\) given predictors \(\tilde{x}\) and training data \((x, y).\)

data {
  int<lower = 0> N;
  vector[N] y;
  vector[N] x;
  int<lower = 0> N_tilde;
  vector[N_tilde] x_tilde;
  vector[N_tilde] y_tilde;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  real log_p = normal_lpdf(y_tilde | alpha + beta * x_tilde, sigma);
}

Only the training data x and y are used in the model block. The test data y_tilde and test predictors x_tilde appear in only the generated quantities block. Thus the program is not cheating by using the test data during training. Although this model does not do so, it would be fair to use x_tilde in the model block—only the test observations y_tilde are unknown before they are predicted.

Given \(M\) posterior draws from Stan, the sequence log_p[1:M] will be available, so that the log posterior predictive density of the test data given training data and predictors is just log_sum_exp(log_p) - log(M).