Stan User’s Guide

1.16 Applications of Pseudorandom Number Generation

The main application of pseudorandom number generators (PRNGs) is posterior inference, including prediction and posterior predictive checks. They can also be used for pure data simulation, which is like a posterior predictive check with no conditioning. See the function reference manual for a complete description of the syntax and usage of pseudorandom number generators.
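As a minimal sketch of pure data simulation, the following program has no parameters or model block and only draws outcomes from a normal distribution whose location and scale are supplied as data. The names mu, sigma, and y_sim are chosen here for illustration and do not come from the reference manual. A program like this would typically be run with Stan's fixed-parameter sampler, because there is nothing to estimate.

```stan
data {
  int<lower=0> N;          // number of outcomes to simulate
  real mu;                 // assumed location (illustrative name)
  real<lower=0> sigma;     // assumed scale (illustrative name)
}
generated quantities {
  vector[N] y_sim;         // one simulated data set per iteration
  for (n in 1:N)
    y_sim[n] = normal_rng(mu, sigma);
}
```

Each iteration produces an independent simulated data set of size N.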

Prediction

Consider predicting unobserved outcomes using linear regression. Given predictors \(x_1, \ldots, x_N\) and observed outcomes \(y_1,\ldots,y_N\), and assuming a standard linear regression with intercept \(\alpha\), slope \(\beta\), and error scale \(\sigma\), along with improper uniform priors, the posterior over the parameters given \(x\) and \(y\) is

\[ p(\alpha, \beta, \sigma \, | \, x, y) \propto \prod_{n=1}^N \mathsf{normal}(y_n \, | \, \alpha + \beta x_n, \sigma). \]

For this model, the posterior predictive inference for a new outcome \(\tilde{y}_n\) given a predictor \(\tilde{x}_n\), conditioned on the observed data \(x\) and \(y\), is

\[ p(\tilde{y}_n \, | \, \tilde{x}_n, x, y) = \int_{(\alpha,\beta,\sigma)} \mathsf{normal}(\tilde{y}_n \, | \, \alpha + \beta \tilde{x}_n, \sigma) \times p(\alpha, \beta, \sigma \, | \, x, y) \ \mathrm{d}(\alpha,\beta,\sigma). \]
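In practice this integral is approximated by simulation. As a sketch, assuming \(M\) posterior draws \(\alpha^{(i)}, \beta^{(i)}, \sigma^{(i)}\) for \(i = 1, \ldots, M\) (the superscript notation is introduced here for illustration), the predictive density is approximately the average of the sampling density over the draws,

\[ p(\tilde{y}_n \, | \, \tilde{x}_n, x, y) \approx \frac{1}{M} \sum_{i=1}^M \mathsf{normal}(\tilde{y}_n \, | \, \alpha^{(i)} + \beta^{(i)} \tilde{x}_n, \sigma^{(i)}). \]

Drawing one value of \(\tilde{y}_n\) from each term of the sum therefore yields draws from the posterior predictive distribution.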

To code the posterior predictive inference in Stan, a standard linear regression is combined with a random number generator in the generated quantities block.

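data {
  int<lower=0> N;              // number of observations
  vector[N] y;                 // observed outcomes
  vector[N] x;                 // observed predictors
  int<lower=0> N_tilde;        // number of new predictors
  vector[N_tilde] x_tilde;     // predictors for unobserved outcomes
}
parameters {
  real alpha;                  // intercept
  real beta;                   // slope
  real<lower=0> sigma;         // error scale
}
model {
  // no explicit priors, so alpha, beta, and sigma get improper uniform priors
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  vector[N_tilde] y_tilde;     // posterior predictive draws
  for (n in 1:N_tilde)
    y_tilde[n] = normal_rng(alpha + beta * x_tilde[n], sigma);
}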
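The loop in the generated quantities block can also be written without explicit indexing. This is a sketch that assumes a Stan version in which normal_rng accepts vector arguments and returns an array of reals; to_vector converts that array back to a vector so the declared type of y_tilde is unchanged.

```stan
generated quantities {
  // vectorized draw: a single normal_rng call covers all N_tilde predictions
  vector[N_tilde] y_tilde
    = to_vector(normal_rng(alpha + beta * x_tilde, sigma));
}
```

Both forms draw each element of y_tilde independently given the current values of alpha, beta, and sigma.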

Given observed predictors \(x\) and outcomes \(y\), y_tilde will be drawn according to \(p(\tilde{y} \, | \, \tilde{x}, x, y)\). This means that, for example, the posterior mean for y_tilde is the estimate of the outcome that minimizes expected squared error (conditioned on the data and model).
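As a worked illustration, assume \(M\) saved draws of y_tilde, written here as \(\tilde{y}_n^{(1)}, \ldots, \tilde{y}_n^{(M)}\) for the \(n\)-th new outcome. The posterior mean prediction, denoted \(\hat{y}_n\) for this example only, is then estimated by the sample average of the draws,

\[ \hat{y}_n = \mathbb{E}[\tilde{y}_n \, | \, \tilde{x}_n, x, y] \approx \frac{1}{M} \sum_{i=1}^M \tilde{y}_n^{(i)}. \]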

Posterior Predictive Checks

A good way to investigate the fit of a model to the data, a critical step in Bayesian data analysis, is to generate simulated data according to the parameters of the model. This is carried out with exactly the same procedure as before, except that the observed predictors \(x\) are used in place of new predictors \(\tilde{x}\) for unobserved outcomes. If the model fits the data well, the predictions for \(\tilde{y}\) based on \(x\) should resemble the observed data \(y\).

Coding posterior predictive checks in Stan requires only a slight modification of the prediction code, using \(x\) and \(N\) in place of \(\tilde{x}\) and \(\tilde{N}\):

generated quantities {
  vector[N] y_tilde;
  for (n in 1:N)
    y_tilde[n] = normal_rng(alpha + beta * x[n], sigma);
}

Gelman et al. (2013) recommend choosing several posterior draws \(\tilde{y}^{(1)}, \ldots, \tilde{y}^{(M)}\) and plotting each of them alongside the data \(y\) that were actually observed. If the model fits well, the simulated \(\tilde{y}\) will look like the actual data \(y\).
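A numerical summary can complement the plots. The following sketch compares the standard deviation of each replicated data set with that of the observed data. The choice of statistic and the name sd_gte are assumptions made for illustration and are not prescribed by the text above.

```stan
generated quantities {
  vector[N] y_tilde;                // replicated data set
  int<lower=0, upper=1> sd_gte;     // 1 if replicated sd >= observed sd
  for (n in 1:N)
    y_tilde[n] = normal_rng(alpha + beta * x[n], sigma);
  sd_gte = (sd(y_tilde) >= sd(y));
}
```

The posterior mean of sd_gte estimates the probability that a replicated data set is at least as dispersed as the observed one. Values near 0 or 1 suggest that the model misrepresents the spread of the data.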

References

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. London: Chapman & Hall/CRC Press.