1.14 Prediction, forecasting, and backcasting
Stan models can be used for “predicting” the values of arbitrary model unknowns. When predictions are about the future, they’re called “forecasts;” when they are predictions about the past, as in climate reconstruction or cosmology, they are sometimes called “backcasts” (or “aftcasts” or “hindcasts” or “antecasts,” depending on the author’s feelings about the opposite of “fore”).
Programming predictions
As a simple example, the following linear regression provides the same
setup for estimating the coefficients beta
as in our very first
example, using y
for the N
observations and
x
for the N
predictor vectors. The model parameters and
model for observations are exactly the same as before.
To make predictions, we need to be given the number of predictions,
N_new
, and their predictor matrix, x_new
. The
predictions themselves are modeled as a parameter y_new
. The
model statement for the predictions is exactly the same as for the
observations, with the new outcome vector y_new
and prediction
matrix x_new
.
data {
int<lower=1> K;
int<lower=0> N;
matrix[N, K] x;
vector[N] y;
int<lower=0> N_new;
matrix[N_new, K] x_new;
}
parameters {
vector[K] beta;
real<lower=0> sigma;
vector[N_new] y_new; // predictions
}
model {
y ~ normal(x * beta, sigma); // observed model
y_new ~ normal(x_new * beta, sigma); // prediction model
}
Predictions as generated quantities
Where possible, the most efficient way to generate predictions is to use the generated quantities block. This provides proper Monte Carlo (not Markov chain Monte Carlo) inference, which can have a much higher effective sample size per iteration.
// ...data as above...
parameters {
vector[K] beta;
real<lower=0> sigma;
}
model {
y ~ normal(x * beta, sigma);
}
generated quantities {
vector[N_new] y_new;
for (n in 1:N_new) {
y_new[n] = normal_rng(x_new[n] * beta, sigma);
}
}
Now the data are just as before, but the parameter y_new
is now
declared as a generated quantity, and the prediction model is
removed from the model and replaced by a pseudo-random draw from a
normal distribution.
Overflow in generated quantities
It is possible for values to overflow or underflow in generated quantities. The problem is that if the result is NaN, then any constraints placed on the variables will be violated. It is possible to check a value assigned by an RNG and reject it if it overflows, but this is both inefficient and leads to biased posterior estimates. Instead, the conditions causing overflow, such as trying to generate a negative binomial random variate with a mean of \(2^{31}\), must be intercepted and dealt with. This is typically done by reparameterizing or reimplementing the random number generator using real values rather than integers, which are upper-bounded by \(2^{31} - 1\) in Stan.