
26.5 Posterior prediction for regressions

26.5.1 Posterior predictive distributions for regressions

Consider a regression with a single predictor \(x_n\) for the training outcome \(y_n\) and \(\tilde{x}_n\) for the test outcome \(\tilde{y}_n.\) Without considering the parametric form of any of the distributions, the posterior predictive distribution for a general regression is \[\begin{eqnarray} p(\tilde{y} \mid \tilde{x}, y, x) & = & \int p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid y, x) \, \textrm{d}\theta \\[4pt] & \approx & \frac{1}{M} \sum_{m=1}^M \, p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \end{eqnarray}\] where \(\theta^{(m)} \sim p(\theta \mid x, y).\)
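
For example, for the Poisson regression with log link defined in the program below, the sampling distribution being averaged over the posterior draws is \[ p(\tilde{y}_n \mid \tilde{x}_n, \theta) = \textsf{Poisson}(\tilde{y}_n \mid \exp(\alpha + \beta \cdot \tilde{x}_n)), \] where \(\theta = (\alpha, \beta)\) are the regression coefficients.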

26.5.2 Stan program

The following program defines a Poisson regression with a single predictor. Both the training predictors \(x\) and the test predictors \(\tilde{x}\) are coded as data, as are their sizes. Among the outcomes, only the observed values \(y\) are coded as data. The predictive quantities \(\tilde{y}\) appear in the generated quantities block, where they are generated by random number generation.

data {
  int<lower=0> N;                  // number of training observations
  vector[N] x;                     // training predictors
  array[N] int<lower=0> y;         // training outcomes
  int<lower=0> N_tilde;            // number of test items
  vector[N_tilde] x_tilde;         // test predictors
}
parameters {
  real alpha;                      // intercept
  real beta;                       // slope
}
model {
  y ~ poisson_log(alpha + beta * x);   // likelihood on the log-rate scale
  { alpha, beta } ~ normal(0, 1);      // standard normal priors
}
generated quantities {
  // posterior predictive draws for the test predictors
  array[N_tilde] int<lower=0> y_tilde
    = poisson_log_rng(alpha + beta * x_tilde);
}

The Poisson distributions in both the model and generated quantities blocks are coded using the log rate as a parameter (poisson_log rather than poisson, with the suffix indicating that the rate is given on the log scale). The regression coefficients, an intercept alpha and slope beta, are given standard normal priors.
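
For comparison, the same likelihood could be written on the natural scale by exponentiating the linear predictor explicitly, as in the following line; the poisson_log form is preferable because it avoids the explicit exponentiation and is more numerically stable.

  y ~ poisson(exp(alpha + beta * x));  // equivalent, but less stable than poisson_log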

In the model block, the log rate for the Poisson is a linear function of the training data \(x\), whereas in the generated quantities block it is a function of the test data \(\tilde{x}\). Because the generated quantities block does not affect the posterior draws, the model fits \(\alpha\) and \(\beta\) using only the training data, reserving \(\tilde{x}\) to generate \(\tilde{y}.\)

The result from running Stan is a predictive sample \(\tilde{y}^{(1)}, \ldots, \tilde{y}^{(M)},\) where each \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y).\)

The mean of the posterior predictive distribution is the expected value \[\begin{eqnarray} \mathbb{E}[\tilde{y} \mid \tilde{x}, x, y] & = & \int \int \tilde{y} \cdot p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\theta \, \textrm{d}\tilde{y} \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \tilde{y}^{(m)}, \end{eqnarray}\] where the \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)\) are drawn from the posterior predictive distribution. Thus the posterior mean of y_tilde[n] after running Stan is the expected value of \(\tilde{y}_n\) conditioned on the training data \(x, y\) and predictor \(\tilde{x}_n.\) This is the Bayesian estimate for \(\tilde{y}\) with minimum expected squared error. The posterior draws can also be used to estimate quantiles, such as the median and the endpoints of any posterior intervals of interest for \(\tilde{y}\), as well as covariances among the \(\tilde{y}_n.\) The posterior draws \(\tilde{y}^{(m)}\) may also be used to estimate predictive event probabilities, such as \(\mbox{Pr}[\tilde{y}_1 > 0]\) or \(\mbox{Pr}\left[\prod_{n = 1}^{\tilde{N}} \tilde{y}_n > 1\right],\) as expectations of indicator functions.
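
As a sketch of how such event probabilities might be coded, the generated quantities block above could be extended with indicator variables whose posterior means are the Monte Carlo estimates of the corresponding probabilities; the indicator names here are illustrative and not part of the original program.

generated quantities {
  array[N_tilde] int<lower=0> y_tilde
    = poisson_log_rng(alpha + beta * x_tilde);
  // indicator for the event y_tilde[1] > 0; its posterior mean
  // estimates Pr[y_tilde[1] > 0 | x_tilde, x, y]
  int<lower=0, upper=1> first_positive = y_tilde[1] > 0;
  // indicator for the event prod_n y_tilde[n] > 1; its posterior mean
  // estimates Pr[prod_n y_tilde[n] > 1 | x_tilde, x, y]
  // (the integer product may overflow for large counts)
  int<lower=0, upper=1> product_gt_one = prod(y_tilde) > 1;
}

Averaging each indicator over the \(M\) posterior draws yields the estimate \(\frac{1}{M} \sum_{m=1}^M \textrm{I}\!\left[\tilde{y}^{(m)}_1 > 0\right]\) for the first event, and similarly for the product event.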

All of this can be carried out by running Stan only a single time to draw a single sample of \(M\) draws, \[ \tilde{y}^{(1)}, \ldots, \tilde{y}^{(M)} \sim p(\tilde{y} \mid \tilde{x}, x, y). \] It's only when moving to cross-validation that multiple runs are required.