
Posterior Predictive Sampling

The goal of inference is often posterior prediction, that is, evaluating or sampling from the posterior predictive distribution \(p(\tilde{y} \mid y),\) where \(y\) is observed data and \(\tilde{y}\) is yet-to-be-observed data. Often there are unmodeled predictors \(x\) and \(\tilde{x}\) for the observed data \(y\) and unobserved data \(\tilde{y}\). With predictors, the posterior predictive density is \(p(\tilde{y} \mid \tilde{x}, x, y).\) All of these variables may represent multivariate quantities.

This chapter explains how to sample from the posterior predictive distribution in Stan, including applications to posterior predictive simulation and calculating event probabilities. These techniques can be coded in Stan using random number generation in the generated quantities block. Further, a technique for fitting and performing inference in two stages is presented in the section on stand-alone generated quantities in Stan.

Posterior predictive distribution

Given a full Bayesian model \(p(y, \theta)\), the posterior predictive density for new data \(\tilde{y}\) given observed data \(y\) is \[ p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta. \] The product under the integral reduces to the joint posterior density \(p(\tilde{y}, \theta \mid y),\) so that the integral is simply marginalizing out the parameters \(\theta,\) leaving the predictive density \(p(\tilde{y} \mid y)\) of future observations given past observations.

Computing the posterior predictive distribution

The posterior predictive density (or mass) of a prediction \(\tilde{y}\) given observed data \(y\) can be computed using \(M\) Monte Carlo draws

\[ \theta^{(m)} \sim p(\theta \mid y) \] from the posterior as \[ p(\tilde{y} \mid y) \approx \frac{1}{M} \sum_{m = 1}^M p(\tilde{y} \mid \theta^{(m)}). \]

Computing directly using this formula will lead to underflow in many situations, but the log posterior predictive density, \(\log p(\tilde{y} \mid y),\) may be computed using the stable log sum of exponentials function as \[\begin{eqnarray*} \log p(\tilde{y} \mid y) & \approx & \log \frac{1}{M} \sum_{m = 1}^M p(\tilde{y} \mid \theta^{(m)}) \\[4pt] & = & - \log M + \textrm{log-sum-exp}_{m = 1}^M \log p(\tilde{y} \mid \theta^{(m)}), \end{eqnarray*}\] where \[ \textrm{log-sum-exp}_{m = 1}^M \, v_m = \log \sum_{m = 1}^M \exp v_m \] is used to maintain arithmetic precision. See the section on log sum of exponentials for more details.
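
As a minimal sketch of how the per-draw terms \(\log p(\tilde{y} \mid \theta^{(m)})\) can be computed, the following program assumes the simple Poisson model used later in this chapter, with the held-out data passed in as data; the names N_tilde, y_tilde_obs, and log_p_y_tilde are illustrative. The \(-\log M + \textrm{log-sum-exp}\) step over the saved values is then carried out after sampling.

data {
  int<lower=0> N;
  array[N] int<lower=0> y;
  int<lower=0> N_tilde;
  array[N_tilde] int<lower=0> y_tilde_obs;  // held-out observations
}
parameters {
  real<lower=0> lambda;
}
model {
  lambda ~ gamma(1, 1);
  y ~ poisson(lambda);
}
generated quantities {
  // log p(y_tilde_obs | lambda) for the current draw; after sampling M
  // draws, log p(y_tilde | y) is approximately
  // -log(M) + log_sum_exp of the saved log_p_y_tilde values.
  real log_p_y_tilde = poisson_lpmf(y_tilde_obs | lambda);
}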

Sampling from the posterior predictive distribution

Given draws from the posterior \(\theta^{(m)} \sim p(\theta \mid y),\) draws from the posterior predictive \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid y)\) can be generated by randomly generating from the sampling distribution with the parameter draw plugged in, \[ \tilde{y}^{(m)} \sim p(y \mid \theta^{(m)}). \]

Randomly drawing \(\tilde{y}\) from the data model is critical because there are two forms of uncertainty in posterior predictive quantities, aleatoric uncertainty and epistemic uncertainty. Epistemic uncertainty arises because \(\theta\) is unknown and estimated based only on a finite sample of data \(y\). Aleatoric uncertainty arises because even a known value of \(\theta\) leads to uncertainty about new \(\tilde{y}\) as described by the data model \(p(\tilde{y} \mid \theta)\). Both forms of uncertainty show up in the factored form of the posterior predictive distribution, \[ p(\tilde{y} \mid y) = \int \underbrace{p(\tilde{y} \mid \theta)}_{\begin{array}{l} \textrm{aleatoric} \\[-2pt] \textrm{uncertainty} \end{array}} \cdot \underbrace{p(\theta \mid y)}_{\begin{array}{l} \textrm{epistemic} \\[-2pt] \textrm{uncertainty} \end{array}} \, \textrm{d}\theta. \]

Posterior predictive simulation in Stan

Posterior predictive quantities can be coded in Stan using the generated quantities block.

Simple Poisson model

For example, consider a simple Poisson model for count data with a rate parameter \(\lambda > 0\) having a gamma-distributed prior, \[ \lambda \sim \textrm{gamma}(1, 1). \] The \(N\) observations \(y_1, \ldots, y_N\) are modeled as Poisson distributed, \[ y_n \sim \textrm{poisson}(\lambda). \]

Stan code

The following Stan program defines a variable for \(\tilde{y}\) by random number generation in the generated quantities block.

data {
  int<lower=0> N;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0> lambda;
}
model {
  lambda ~ gamma(1, 1);
  y ~ poisson(lambda);
}
generated quantities {
  int<lower=0> y_tilde = poisson_rng(lambda);
}

The random draw from the data model for \(\tilde{y}\) is coded using Stan’s Poisson random number generator in the generated quantities block. This accounts for the aleatoric component of the uncertainty; Stan’s posterior sampler will account for the epistemic uncertainty, generating a new \(\tilde{y}^{(m)} \sim p(y \mid \lambda^{(m)})\) for each posterior draw \(\lambda^{(m)} \sim p(\lambda \mid y).\)

The posterior draws \(\tilde{y}^{(m)}\) may be used to estimate the expected value of \(\tilde{y}\) or any of its quantiles or posterior intervals, as well as event probabilities involving \(\tilde{y}\). In general, \(\mathbb{E}[f(\tilde{y}, \theta) \mid y]\) may be evaluated as \[ \mathbb{E}[f(\tilde{y}, \theta) \mid y] \approx \frac{1}{M} \sum_{m=1}^M f(\tilde{y}^{(m)}, \theta^{(m)}), \] which is just the posterior mean of \(f(\tilde{y}, \theta).\) This quantity is computed by Stan if the value of \(f(\tilde{y}, \theta)\) is assigned to a variable in the generated quantities block. That is, if we have

generated quantities {
  real f_val = f(y_tilde, theta);  // f() is a placeholder for any function of interest
  // ...
}

where the value of \(f(\tilde{y}, \theta)\) is assigned to variable f_val, then the posterior mean of f_val will be the expectation \(\mathbb{E}[f(\tilde{y}, \theta) \mid y]\).
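
As a concrete (hypothetical) instance in the simple Poisson example, taking \(f(\tilde{y}, \lambda) = \tilde{y}^2\) yields a generated quantity whose posterior mean estimates \(\mathbb{E}[\tilde{y}^2 \mid y]\); the name y_tilde_sq is illustrative.

generated quantities {
  int<lower=0> y_tilde = poisson_rng(lambda);
  // posterior mean of y_tilde_sq estimates E[y_tilde^2 | y]
  real y_tilde_sq = square(y_tilde);
}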

Analytic posterior and posterior predictive

The gamma distribution is the conjugate prior distribution for the Poisson distribution, so the posterior density \(p(\lambda \mid y)\) will also follow a gamma distribution.

Because the posterior follows a gamma distribution and the sampling distribution is Poisson, the posterior predictive \(p(\tilde{y} \mid y)\) will follow a negative binomial distribution, because the negative binomial is defined as a compound gamma-Poisson. That is, \(y \sim \textrm{negative-binomial}(\alpha, \beta)\) if \(\lambda \sim \textrm{gamma}(\alpha, \beta)\) and \(y \sim \textrm{poisson}(\lambda).\) Rather than marginalizing out the rate parameter \(\lambda\) analytically as can be done to define the negative binomial probability mass function, the rate \(\lambda^{(m)} \sim p(\lambda \mid y)\) is sampled from the posterior and then used to generate a draw of \(\tilde{y}^{(m)} \sim p(y \mid \lambda^{(m)}).\)
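
For reference, with the \(\textrm{gamma}(1, 1)\) prior above, the conjugate update gives the posterior and posterior predictive distributions in closed form, \[ \lambda \mid y \sim \textrm{gamma}\!\left(1 + \textstyle\sum_{n=1}^N y_n, \ 1 + N\right), \qquad \tilde{y} \mid y \sim \textrm{negative-binomial}\!\left(1 + \textstyle\sum_{n=1}^N y_n, \ 1 + N\right), \] using the same compound gamma-Poisson parameterization of the negative binomial as above. The sampling-based approach in the Stan program recovers draws from this same distribution without requiring the analytic marginalization.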

Posterior prediction for regressions

Posterior predictive distributions for regressions

Consider a regression with a single predictor \(x_n\) for the training outcome \(y_n\) and \(\tilde{x}_n\) for the test outcome \(\tilde{y}_n.\) Without considering the parametric form of any of the distributions, the posterior predictive distribution for a general regression is \[\begin{eqnarray} p(\tilde{y} \mid \tilde{x}, y, x) & = & \int p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid y, x) \, \textrm{d}\theta \\[4pt] & \approx & \frac{1}{M} \sum_{m=1}^M \, p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \end{eqnarray}\] where \(\theta^{(m)} \sim p(\theta \mid x, y).\)

Stan program

The following program defines a Poisson regression with a single predictor. Both the training predictors \(x\) and the test predictors \(\tilde{x}\) are coded as data, as are their sizes. Only the observed outcomes \(y\) are coded as data. The predictive quantities \(\tilde{y}\) appear in the generated quantities block, where they are generated by random number generation.

data {
  int<lower=0> N;
  vector[N] x;
  array[N] int<lower=0> y;
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
}
parameters {
  real alpha;
  real beta;
}
model {
  y ~ poisson_log(alpha + beta * x);
  { alpha, beta } ~ normal(0, 1);
}
generated quantities {
  array[N_tilde] int<lower=0> y_tilde
    = poisson_log_rng(alpha + beta * x_tilde);
}

The Poisson distributions in both the model and generated quantities blocks are coded using the log rate as a parameter (that’s poisson_log vs. poisson, with the suffix indicating that the rate is given on the log scale). The regression coefficients, an intercept alpha and slope beta, are given standard normal priors.

In the model block, the log rate for the Poisson is a linear function of the training data \(x\), whereas in the generated quantities block it is a function of the test data \(\tilde{x}\). Because the generated quantities block does not affect the posterior draws, the model fits \(\alpha\) and \(\beta\) using only the training data, reserving \(\tilde{x}\) to generate \(\tilde{y}.\)

The result of running Stan is a posterior predictive sample \(\tilde{y}^{(1)}, \ldots, \tilde{y}^{(M)}\) where each \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y).\)

The mean of the posterior predictive distribution is the expected value \[\begin{align} \mathbb{E}[\tilde{y} \mid \tilde{x}, x, y] & = \int \int \tilde{y} \cdot p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\tilde{y} \, \textrm{d}\theta \\[4pt] & \approx \frac{1}{M} \sum_{m = 1}^M \tilde{y}^{(m)}, \end{align}\] where the \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)\) are drawn from the posterior predictive distribution. Thus the posterior mean of y_tilde[n] after running Stan is the expected value of \(\tilde{y}_n\) conditioned on the training data \(x, y\) and predictor \(\tilde{x}_n.\) This is the Bayesian estimate for \(\tilde{y}\) with minimum expected squared error. The posterior draws can also be used to estimate quantiles for the median and any posterior intervals of interest for \(\tilde{y}\), as well as the covariance of the \(\tilde{y}_n.\) The posterior draws \(\tilde{y}^{(m)}\) may also be used to estimate predictive event probabilities, such as \(\Pr[\tilde{y}_1 > 0]\) or \(\Pr[\prod_{n = 1}^{\tilde{N}} \tilde{y}_n > 1],\) as expectations of indicator functions.

All of this can be carried out by running Stan only a single time to draw a single sample of \(M\) draws, \[ \tilde{y}^{(1)}, \ldots, \tilde{y}^{(M)} \sim p(\tilde{y} \mid \tilde{x}, x, y). \] It is only when moving to cross-validation that multiple runs are required.

Estimating event probabilities

Event probabilities involving parameters, predictions, or both may be coded in the generated quantities block. For example, to evaluate \(\Pr[\lambda > 5 \mid y]\) in the simple Poisson example with only a rate parameter \(\lambda\), it suffices to define a generated quantity

generated quantities {
  int<lower=0, upper=1> lambda_gt_5 = lambda > 5;
  // ...
}

The value of the expression lambda > 5 is 1 if the condition is true and 0 otherwise. The posterior mean of this indicator variable is the event probability \[\begin{eqnarray*} \Pr[\lambda > 5 \mid y] & = & \int \textrm{I}(\lambda > 5) \cdot p(\lambda \mid y) \, \textrm{d}\lambda \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \textrm{I}(\lambda^{(m)} > 5), \end{eqnarray*}\] where each \(\lambda^{(m)} \sim p(\lambda \mid y)\) is distributed according to the posterior. In Stan, this is recovered as the posterior mean of the generated quantity lambda_gt_5.

In general, event probabilities may be expressed as expectations of indicator functions. For example, \[\begin{eqnarray*} \Pr[\lambda > 5 \mid y] & = & \mathbb{E}[\textrm{I}[\lambda > 5] \mid y] \\[4pt] & = & \int \textrm{I}(\lambda > 5) \cdot p(\lambda \mid y) \, \textrm{d}\lambda \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \textrm{I}(\lambda^{(m)} > 5). \end{eqnarray*}\] The last line above is the posterior mean of the indicator function as coded in Stan.

Event probabilities involving posterior predictive quantities \(\tilde{y}\) work exactly the same way as those for parameters. For example, if \(\tilde{y}_n\) is the prediction for the \(n\)-th unobserved outcome (such as the score of a team in a game or a level of expression of a protein in a cell), then \[\begin{eqnarray*} \Pr[\tilde{y}_3 > \tilde{y}_7 \mid \tilde{x}, x, y] & = & \mathbb{E}\!\left[I[\tilde{y}_3 > \tilde{y}_7] \mid \tilde{x}, x, y\right] \\[4pt] & = & \int \textrm{I}(\tilde{y}_3 > \tilde{y}_7) \cdot p(\tilde{y} \mid \tilde{x}, x, y) \, \textrm{d}\tilde{y} \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \textrm{I}(\tilde{y}^{(m)}_3 > \tilde{y}^{(m)}_7), \end{eqnarray*}\] where \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y).\)
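
In the Poisson regression example above, this event probability could be coded by adding an indicator variable to the generated quantities block (a sketch assuming \(\tilde{N} \geq 7\); the name y3_gt_y7 is illustrative).

generated quantities {
  array[N_tilde] int<lower=0> y_tilde
    = poisson_log_rng(alpha + beta * x_tilde);
  // indicator whose posterior mean estimates Pr[y_tilde[3] > y_tilde[7]]
  int<lower=0, upper=1> y3_gt_y7 = y_tilde[3] > y_tilde[7];
}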

Stand-alone generated quantities and ongoing prediction

Stan’s sampling algorithms take a Stan program representing a posterior \(p(\theta \mid y, x)\) along with actual data \(x\) and \(y\) to produce a set of draws \(\theta^{(1)}, \ldots, \theta^{(M)}\) from the posterior. Posterior predictive draws \(\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)\) can be generated by drawing \[ \tilde{y}^{(m)} \sim p(y \mid \tilde{x}, \theta^{(m)}) \] from the data model. Note that drawing \(\tilde{y}^{(m)}\) depends only on the new predictors \(\tilde{x}\) and the posterior draws \(\theta^{(m)}\). Most importantly, neither the original data nor the model density is required.

By saving the posterior draws, predictions for new data items \(\tilde{x}\) may be generated whenever needed. In Stan’s interfaces, this is done by writing a second Stan program that takes the original program’s parameters and the new predictors as input. For example, for the linear regression case, the program used to obtain the posterior draws declares the data and parameters and defines the model.

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ lognormal(0, 0.5);
}

A second program can be used to generate new observations. This follow-on program need only declare the parameters as they were originally defined. This may require defining constants in the data block such as sizes and hyperparameters that are involved in parameter size or constraint declarations. Then additional data is read in corresponding to predictors for new outcomes that have yet to be observed. There is no need to repeat the model or unneeded transformed parameters or generated quantities. The complete follow-on program for prediction just declares the predictors in the data, the original parameters, and then the predictions in the generated quantities block.

data {
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
generated quantities {
  array[N_tilde] real y_tilde
    = normal_rng(alpha + beta * x_tilde, sigma);
}

When running stand-alone generated quantities, the inputs required are the original draws for the parameters and any predictors corresponding to new predictions, and the output will be draws for \(\tilde{y}\) or derived quantities such as event probabilities.

Any posterior predictive quantities desired may be generated this way. For example, event probabilities are estimated in the usual way by defining indicator variables in the generated quantities block.
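
For instance, the stand-alone program above could be extended with an indicator whose posterior mean estimates \(\Pr[\tilde{y}_1 > 0 \mid \tilde{x}, x, y]\) (a sketch; the name y_tilde_1_gt_0 and the choice of threshold are illustrative).

generated quantities {
  array[N_tilde] real y_tilde
    = normal_rng(alpha + beta * x_tilde, sigma);
  // indicator whose posterior mean estimates Pr[y_tilde[1] > 0]
  int<lower=0, upper=1> y_tilde_1_gt_0 = y_tilde[1] > 0;
}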
