28.1 Evaluating posterior predictive densities
Given training data \((x, y)\) consisting of parallel sequences of predictors and observations and test data \((\tilde{x}, \tilde{y})\) of the same structure, the posterior predictive density is \[ p(\tilde{y} \mid \tilde{x}, x, y) = \int p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\theta, \]
where \(\theta\) is the vector of model parameters. This predictive density is the density of the test observations, conditioned on both the test predictors \(\tilde{x}\) and the training data \((x, y).\)
This integral may be calculated with Monte Carlo methods as usual, \[ p(\tilde{y} \mid \tilde{x}, x, y) \approx \frac{1}{M} \sum_{m = 1}^M p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \] where the \(\theta^{(m)} \sim p(\theta \mid x, y)\) are draws from the posterior given only the training data \((x, y).\)
To avoid underflow in calculations, it will be more stable to compute densities on the log scale. Taking the logarithm and pushing it through results in a stable computation, \[\begin{eqnarray*} \log p(\tilde{y} \mid \tilde{x}, x, y) & \approx & \log \frac{1}{M} \sum_{m = 1}^M p(\tilde{y} \mid \tilde{x}, \theta^{(m)}) \\[4pt] & = & -\log M + \log \sum_{m = 1}^M p(\tilde{y} \mid \tilde{x}, \theta^{(m)}) \\[4pt] & = & -\log M + \log \sum_{m = 1}^M \exp(\log p(\tilde{y} \mid \tilde{x}, \theta^{(m)})) \\[4pt] & = & -\log M + \textrm{log-sum-exp}_{m = 1}^M \log p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \end{eqnarray*}\] where the log sum of exponentials function is defined so as to make the above equation hold, \[ \textrm{log-sum-exp}_{m = 1}^M \, \mu_m = \log \sum_{m=1}^M \exp(\mu_m). \] The log sum of exponentials function can be implemented so as to avoid underflow and maintain high arithmetic precision as \[ \textrm{log-sum-exp}_{m = 1}^M \mu_m = \textrm{max}(\mu) + \log \sum_{m = 1}^M \exp(\mu_m - \textrm{max}(\mu)). \] Pulling the maximum out preserves all of its precision. After subtracting the maximum, each term satisfies \(\mu_m - \textrm{max}(\mu) \leq 0\), so exponentiation cannot overflow.
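As a concrete sketch, the stable log-sum-exp calculation can be carried out on the posterior draws in a few lines of Python with NumPy; the function name log_sum_exp and the illustrative values in log_p below are assumptions for this example, standing in for the per-draw log densities \(\log p(\tilde{y} \mid \tilde{x}, \theta^{(m)}).\)

import numpy as np

def log_sum_exp(mu):
    # Stable log of the sum of exponentials: pull out the maximum so that
    # every exponentiated term is at most 1 and cannot overflow.
    max_mu = np.max(mu)
    return max_mu + np.log(np.sum(np.exp(mu - max_mu)))

# Illustrative per-draw values of log p(y_tilde | x_tilde, theta^(m));
# exponentiating these directly would underflow to zero.
log_p = np.array([-812.3, -810.7, -811.9, -813.4])
M = len(log_p)

# Monte Carlo estimate of log p(y_tilde | x_tilde, x, y).
lpd = log_sum_exp(log_p) - np.log(M)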
28.1.1 Stan program
To evaluate the log predictive density of a model, it suffices to implement the log predictive density of the test data in the generated quantities block. The log sum of exponentials calculation must be done outside of Stan using the posterior draws of \(\log p(\tilde{y} \mid \tilde{x}, \theta^{(m)}).\)
Here is the code for evaluating the log posterior predictive density in a simple linear regression of the test data \(\tilde{y}\) given predictors \(\tilde{x}\) and training data \((x, y).\)
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
  vector[N_tilde] y_tilde;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  real log_p = normal_lpdf(y_tilde | alpha + beta * x_tilde, sigma);
}
Only the training data x and y are used in the model block. The test data y_tilde and test predictors x_tilde appear only in the generated quantities block. Thus the program is not cheating by using the test data during training. Although this model does not do so, it would be fair to use x_tilde in the model block, because only the test observations y_tilde are unknown before they are predicted.
Given \(M\) posterior draws from Stan, the sequence log_p[1:M] will be available, so that the log posterior predictive density of the test data given the training data and predictors is just log_sum_exp(log_p) - log(M).
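For example, this post-processing step might be carried out in Python as follows. This is only a sketch: it assumes the program above is saved as eval_lpd.stan, that CmdStanPy and SciPy are installed, and it simulates illustrative data rather than using a real data set; scipy.special.logsumexp implements the same stable log-sum-exp calculation described above.

import numpy as np
from scipy.special import logsumexp
from cmdstanpy import CmdStanModel

# Simulated training and test data (illustrative only).
rng = np.random.default_rng(1234)
N, N_tilde = 100, 20
x = rng.normal(size=N)
x_tilde = rng.normal(size=N_tilde)
alpha, beta, sigma = 1.5, -0.5, 1.2
y = rng.normal(alpha + beta * x, sigma)
y_tilde = rng.normal(alpha + beta * x_tilde, sigma)

data = {"N": N, "y": y, "x": x,
        "N_tilde": N_tilde, "x_tilde": x_tilde, "y_tilde": y_tilde}

# Compile and fit the Stan program shown above (assumed file name).
model = CmdStanModel(stan_file="eval_lpd.stan")
fit = model.sample(data=data)

# log_p[m] = log p(y_tilde | x_tilde, theta^(m)) for each posterior draw m.
log_p = fit.stan_variable("log_p")
M = log_p.shape[0]

# Log posterior predictive density estimate: log_sum_exp(log_p) - log(M).
lpd = logsumexp(log_p) - np.log(M)
print("log posterior predictive density:", lpd)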