1.16 Applications of pseudorandom number generation
The main application of pseudorandom number generators (PRNGs) is for posterior inference, including prediction and posterior predictive checks. They can also be used for pure data simulation, which is like a posterior predictive check with no conditioning on observed data. See the function reference manual for a complete description of the syntax and usage of pseudorandom number generators.
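For example, a pure data-simulation program can consist of nothing but data declarations and a generated quantities block. The following is a minimal sketch for the linear regression model used below, with hypothetical standard normal priors chosen only for illustration.
data {
  int<lower=0> N;
  vector[N] x;
}
generated quantities {
  // draw regression parameters from assumed priors (illustrative choices)
  real alpha = normal_rng(0, 1);
  real beta = normal_rng(0, 1);
  real sigma = abs(normal_rng(0, 1));
  vector[N] y_sim;
  // draw one simulated outcome per predictor from the likelihood
  for (n in 1:N) {
    y_sim[n] = normal_rng(alpha + beta * x[n], sigma);
  }
}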
Prediction
Consider predicting unobserved outcomes using linear regression. Given predictors \(x_1, \dotsc, x_N\) and observed outcomes \(y_1, \dotsc, y_N\), and assuming a standard linear regression with intercept \(\alpha\), slope \(\beta\), and error scale \(\sigma\), along with improper uniform priors, the posterior over the parameters given \(x\) and \(y\) is \[ p\left(\alpha, \beta, \sigma \mid x, y \right) \propto \prod_{n=1}^N \textsf{normal}\left(y_n \mid \alpha + \beta x_n, \sigma\right). \]
For this model, the posterior predictive inference for a new outcome \(\tilde{y}_n\) given a predictor \(\tilde{x}_n\), conditioned on the observed data \(x\) and \(y\), is \[ p\left(\tilde{y}_n \mid \tilde{x}_n, x, y\right) = \int_{(\alpha,\beta,\sigma)} \textsf{normal}\left(\tilde{y}_n \mid \alpha + \beta \tilde{x}_n, \sigma\right) \times p\left(\alpha, \beta, \sigma \mid x, y\right) \,\textrm{d}(\alpha,\beta,\sigma). \]
To code the posterior predictive inference in Stan, a standard linear regression is combined with a pseudorandom number generator call in the generated quantities block.
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  vector[N_tilde] y_tilde;
  for (n in 1:N_tilde) {
    y_tilde[n] = normal_rng(alpha + beta * x_tilde[n], sigma);
  }
}
Given observed predictors \(x\) and outcomes \(y\), y_tilde will be drawn according to \(p\left(\tilde{y} \mid \tilde{x}, y, x\right)\). This means that, for example, the posterior mean for y_tilde is the estimate of the outcome that minimizes expected squared error (conditioned on the data and model).
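Concretely, given \(M\) posterior draws \(\tilde{y}^{(1)}, \dotsc, \tilde{y}^{(M)}\) of y_tilde, this posterior mean is estimated by the Monte Carlo average \[ \mathbb{E}\left[\tilde{y}_n \mid \tilde{x}_n, x, y\right] \approx \frac{1}{M} \sum_{m=1}^M \tilde{y}_n^{(m)}. \]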
Posterior predictive checks
A good way to investigate the fit of a model to the data, a critical step in Bayesian data analysis, is to generate simulated data according to the posterior draws of the model's parameters. This is carried out with exactly the same procedure as before, except that the observed predictors \(x\) are used in place of the new predictors \(\tilde{x}\) for unobserved outcomes. If the model fits the data well, the replications \(\tilde{y}\) simulated from \(x\) should look like the observed data \(y\).
Coding posterior predictive checks in Stan requires only a slight modification of the prediction code, using \(x\) and \(N\) in place of \(\tilde{x}\) and \(\tilde{N}\):
generated quantities {
  vector[N] y_tilde;
  for (n in 1:N) {
    y_tilde[n] = normal_rng(alpha + beta * x[n], sigma);
  }
}
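In recent Stan versions the _rng functions are vectorized, so the loop can be collapsed into a single call; a sketch, assuming a Stan release that supports vectorized RNG functions (which return a real array, hence the to_vector conversion):
generated quantities {
  // one draw per observation, using the observed predictors x
  vector[N] y_tilde = to_vector(normal_rng(alpha + beta * x, sigma));
}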
Gelman et al. (2013) recommend choosing several posterior draws \(\tilde{y}^{(1)}, \dotsc, \tilde{y}^{(M)}\) and plotting each of them alongside the data \(y\) that was actually observed. If the model fits well, the simulated \(\tilde{y}\) will look like the actual data \(y\).
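Numerical summaries can complement the plots. As an illustrative sketch (the test statistic and the name sd_gt_obs are our own choices, not part of the model above), the generated quantities block can also record an indicator comparing a statistic of the replicated data to the same statistic of the observed data; the posterior mean of this indicator estimates a posterior predictive p-value.
generated quantities {
  vector[N] y_tilde;
  int<lower=0, upper=1> sd_gt_obs;
  for (n in 1:N) {
    y_tilde[n] = normal_rng(alpha + beta * x[n], sigma);
  }
  // indicator: is the replicated data more dispersed than the observed data?
  sd_gt_obs = sd(y_tilde) > sd(y);
}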