26.5 Posterior prediction for regressions
26.5.1 Posterior predictive distributions for regressions
Consider a regression with a single predictor $x_n$ for the training outcome $y_n$ and $\tilde{x}_n$ for the test outcome $\tilde{y}_n$. Without considering the parametric form of any of the distributions, the posterior predictive distribution for a general regression is
$$
p(\tilde{y} \mid \tilde{x}, y, x)
= \int p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid y, x) \, \textrm{d}\theta
\approx \frac{1}{M} \sum_{m=1}^M p(\tilde{y} \mid \tilde{x}, \theta^{(m)}),
$$
where $\theta^{(m)} \sim p(\theta \mid x, y)$.
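As a concrete illustration of the Monte Carlo approximation, the following Python sketch averages a Poisson probability mass function over draws of the regression coefficients, matching the Poisson regression defined in the next section. The draws here are simulated stand-ins; in practice `alpha_draws` and `beta_draws` would be extracted from a fitted model.

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical posterior draws of (alpha, beta); in practice, extract
# these from a fitted model instead of simulating them.
rng = np.random.default_rng(1234)
M = 4000
alpha_draws = rng.normal(0.3, 0.05, size=M)
beta_draws = rng.normal(0.7, 0.05, size=M)

def predictive_density(y_tilde, x_tilde):
    """Monte Carlo estimate of p(y_tilde | x_tilde, x, y) for a Poisson
    regression with a log link: average the Poisson pmf over the draws."""
    rates = np.exp(alpha_draws + beta_draws * x_tilde)
    return poisson.pmf(y_tilde, rates).mean()

print(predictive_density(y_tilde=2, x_tilde=1.5))
```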
26.5.2 Stan program
The following program defines a Poisson regression with a single predictor. These predictors are all coded as data, as are their sizes. Only the observed $y$ values are coded as data. The predictive quantities $\tilde{y}$ appear in the generated quantities block, where they are generated by random number generation.
```stan
data {
  int<lower=0> N;
  vector[N] x;
  array[N] int<lower=0> y;
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
}
parameters {
  real alpha;
  real beta;
}
model {
  y ~ poisson_log(alpha + beta * x);
  { alpha, beta } ~ normal(0, 1);
}
generated quantities {
  array[N_tilde] int<lower=0> y_tilde
    = poisson_log_rng(alpha + beta * x_tilde);
}
```
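One way to run this program is through a Stan interface such as CmdStanPy; the sketch below assumes the program is saved as `poisson_prediction.stan` (a hypothetical file name) and uses made-up data of the declared shapes.

```python
from cmdstanpy import CmdStanModel

# Hypothetical file name and data; any training set x, y and test
# predictors x_tilde of the declared shapes will do.
model = CmdStanModel(stan_file="poisson_prediction.stan")
data = {
    "N": 3, "x": [0.5, 1.0, 1.5], "y": [1, 2, 4],
    "N_tilde": 2, "x_tilde": [0.75, 1.25],
}
fit = model.sample(data=data, seed=1234)

# One row per posterior draw of (alpha, beta), one column per test
# point: an array of shape (num_draws, N_tilde).
y_tilde = fit.stan_variable("y_tilde")
print(y_tilde.shape)
```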
The Poisson distributions in both the model and generated quantities blocks are coded using the log rate as a parameter (that's `poisson_log` vs. `poisson`, with the suffixes defining the scale of the parameter). The regression coefficients, an intercept `alpha` and slope `beta`, are given standard normal priors.
In the model block, the log rate for the Poisson is a linear function of the training data $x$, whereas in the generated quantities block it is a function of the test data $\tilde{x}$. Because the generated quantities block does not affect the posterior draws, the model fits $\alpha$ and $\beta$ using only the training data, reserving $\tilde{x}$ to generate $\tilde{y}$.
The result from running Stan is a predictive sample $\tilde{y}^{(1)}, \ldots, \tilde{y}^{(M)}$, where each $\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)$.
The mean of the posterior predictive distribution is the expected value
$$
\mathbb{E}[\tilde{y} \mid \tilde{x}, x, y]
= \int \tilde{y} \cdot p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\theta
\approx \frac{1}{M} \sum_{m=1}^M \tilde{y}^{(m)},
$$
where the $\tilde{y}^{(m)} \sim p(\tilde{y} \mid \tilde{x}, x, y)$ are drawn from the posterior predictive distribution. Thus the posterior mean of `y_tilde[n]` after running Stan is the expected value of $\tilde{y}_n$ conditioned on the training data $x, y$ and predictor $\tilde{x}_n$. This is the Bayesian estimate for $\tilde{y}$ with minimum expected squared error. The posterior draws can also be used to estimate quantiles for the median and any posterior intervals of interest for $\tilde{y}$, as well as the covariance of the $\tilde{y}_n$.
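Assuming `y_tilde` is the $(M, \tilde{N})$ array of predictive draws extracted as in the earlier CmdStanPy sketch (a simulated stand-in is used below so the snippet runs on its own), these summaries reduce to simple NumPy computations:

```python
import numpy as np

# Stand-in for the (M, N_tilde) array of predictive draws; in practice
# use y_tilde = fit.stan_variable("y_tilde") from the sketch above.
rng = np.random.default_rng(1234)
y_tilde = rng.poisson(lam=3.0, size=(4000, 2))

posterior_mean = y_tilde.mean(axis=0)    # Bayes estimate with minimum expected squared error
posterior_median = np.median(y_tilde, axis=0)
interval_90 = np.quantile(y_tilde, [0.05, 0.95], axis=0)  # central 90% interval
covariance = np.cov(y_tilde, rowvar=False)  # covariance across the y_tilde[n]
```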
The posterior draws $\tilde{y}^{(m)}$ may also be used to estimate predictive event probabilities, such as $\Pr[\tilde{y}_1 > 0]$ or $\Pr\left[\prod_{n=1}^{\tilde{N}} \tilde{y}_n > 1\right]$, as expectations of indicator functions.
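Continuing with the same stand-in `y_tilde` array, the indicator-function estimates of these two event probabilities are one-liners:

```python
import numpy as np

# Stand-in predictive draws as above; replace with the draws from Stan.
rng = np.random.default_rng(1234)
y_tilde = rng.poisson(lam=3.0, size=(4000, 2))

# Event probabilities are expectations of indicators, estimated by
# averaging each indicator over the predictive draws.
pr_y1_positive = (y_tilde[:, 0] > 0).mean()          # Pr[y_tilde_1 > 0]
pr_product_gt_1 = (y_tilde.prod(axis=1) > 1).mean()  # Pr[prod_n y_tilde_n > 1]
print(pr_y1_positive, pr_product_gt_1)
```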
All of this can be carried out by running Stan only a single time to draw a single sample of $M$ draws, $\tilde{y}^{(1)}, \ldots, \tilde{y}^{(M)} \sim p(\tilde{y} \mid \tilde{x}, x, y)$. It is only when moving to cross-validation that multiple runs are required.