31.2 Coding the bootstrap in Stan
The bootstrap procedure can be coded quite generally in Stan models. The
following code illustrates a Stan model coding the likelihood for a
simple linear regression. There is a parallel vector
x
of predictors in addition to outcomes y
. To allow a single
program to fit both the original data and random subsamples, the
variable resample
is set to 1 to resample and 0 to use the original data.
data {
int<lower = 0> N;
vector[N] x;
vector[N] y;
int<lower = 0, upper = 1> resample;
}
transformed data {
simplex[N] uniform = rep_vector(1.0 / N, N);
int<lower = 1, upper = N> boot_idxs[N];
for (n in 1:N)
boot_idxs[n] = resample ? categorical_rng(uniform) : n;
}
parameters {
real alpha;
real beta;
real<lower = 0> sigma;
}
model {
y[boot_idxs] ~ normal(alpha + beta * x[boot_idxs], sigma);
}
The model accepts data in the usual form for a linear regression as a
number of observations \(N\) with a size \(N\) vector \(x\) of predictors
and a size \(N\) vector of outcomes. The transformed data block
generates a set of indexes into the data that is the same size as the
data. This is done by independently sampling each entry of boot_idxs
from 1:N
, using a discrete uniform distribution coded as a
categorical random number generator with an equal chance for each
outcome. If resampling is not done, the array boot_idxs
is defined
to be the sequence 1:N
, because x == x[1:N]
and y = y[1:N]
.
For example, when resample == 1
, if \(N = 4,\) the value of
boot_idxs
might be {2, 1, 1, 3}
, resulting in a bootstrap sample
{y[2], y[1], y[1], y[3]}
with the first element repeated twice and
the fourth element not sampled at all.
The parameters are the usual regression coefficients for the intercept
alpha
, slope beta
, and error scale sigma
. The model uses the
bootstrap index variable boot_idx
to index the predictors as
x[boot_idx]
and outcomes as y[boot_idx]
. This generates a new
size-\(N\) vector whose entries are defined by x[boot_idx][n] = x[boot_idx[n]]
and similarly for y
. For example, if \(N = 4\) and
boot_idxs = {2, 1, 1, 3}
, then x[boot_idxs] = [x[2], x[1], x[1], x[3]]'
and y[boot_idxs] = [y[2], y[1], y[1], y[3]]'
. The predictor
and outcome vectors remain aligned, with both elements of the pair
x[1]
and y[1]
repeated twice.
With the model defined this way, if resample
is 1, the model is fit
to a bootstrap subsample of the data. If resample
is 0, the model
is fit to the original data as given. By running the bootstrap fit
multiple times, confidence intervals can be generated from quantiles
of the results.