31.2 Coding the bootstrap in Stan

This is an old version, view current version.

31.2 Coding the bootstrap in Stan

The bootstrap procedure can be coded quite generally in Stan models. The following code illustrates a Stan model coding the likelihood for a simple linear regression. There is a parallel vector x of predictors in addition to outcomes y. To allow a single program to fit both the original data and random subsamples, the variable resample is set to 1 to resample and 0 to use the original data.

data {
  int<lower = 0> N;
  vector[N] x;
  vector[N] y;
  int<lower = 0, upper = 1> resample;
}
transformed data {
  simplex[N] uniform = rep_vector(1.0 / N, N);
  int<lower = 1, upper = N> boot_idxs[N];
  for (n in 1:N)
    boot_idxs[n] = resample ? categorical_rng(uniform) : n;
}
parameters {
  real alpha;
  real beta;
  real<lower = 0> sigma;
}
model {
  y[boot_idxs] ~ normal(alpha + beta * x[boot_idxs], sigma);
}

The model accepts data in the usual form for a linear regression as a number of observations \(N\) with a size \(N\) vector \(x\) of predictors and a size \(N\) vector of outcomes. The transformed data block generates a set of indexes into the data that is the same size as the data. This is done by independently sampling each entry of boot_idxs from 1:N, using a discrete uniform distribution coded as a categorical random number generator with an equal chance for each outcome. If resampling is not done, the array boot_idxs is defined to be the sequence 1:N, because x == x[1:N] and y = y[1:N].

For example, when resample == 1, if \(N = 4,\) the value of boot_idxs might be {2, 1, 1, 3}, resulting in a bootstrap sample {y[2], y[1], y[1], y[3]} with the first element repeated twice and the fourth element not sampled at all.

The parameters are the usual regression coefficients for the intercept alpha, slope beta, and error scale sigma. The model uses the bootstrap index variable boot_idx to index the predictors as x[boot_idx] and outcomes as y[boot_idx]. This generates a new size-\(N\) vector whose entries are defined by x[boot_idx][n] = x[boot_idx[n]] and similarly for y. For example, if \(N = 4\) and boot_idxs = {2, 1, 1, 3}, then x[boot_idxs] = [x[2], x[1], x[1], x[3]]' and y[boot_idxs] = [y[2], y[1], y[1], y[3]]'. The predictor and outcome vectors remain aligned, with both elements of the pair x[1] and y[1] repeated twice.

With the model defined this way, if resample is 1, the model is fit to a bootstrap subsample of the data. If resample is 0, the model is fit to the original data as given. By running the bootstrap fit multiple times, confidence intervals can be generated from quantiles of the results.