3.3 Sliced Missing Data

This is an old version, view current version.

If the missing data are part of some larger data structure, then it can often be effectively reassembled using index arrays and slicing. Here’s an example for time-series data, where only some entries in the series are observed.

data {
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int<lower = 1, upper = N_obs + N_mis> ii_obs[N_obs];
  int<lower = 1, upper = N_obs + N_mis> ii_mis[N_mis];
  real y_obs[N_obs];
}
transformed data {
  int<lower = 0> N = N_obs + N_mis;
}
parameters {
  real y_mis[N_mis];
  real<lower=0> sigma;
}
transformed parameters {
  real y[N];
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
}
model {
  sigma ~ gamma(1, 1);
  y[1] ~ normal(0, 100);
  y[2:N] ~ normal(y[1:(N - 1)], sigma);
}

The index arrays ii_obs and ii_mis contain the indexes into the final array y of the observed data (coded as a data vector y_obs) and the missing data (coded as a parameter vector y_mis). See the time series chapter for further discussion of time-series model and specifically the autoregression section for an explanation of the vectorization for y as well as an explanation of how to convert this example to a full AR(1) model. To ensure y[1] has a proper posterior in case it is missing, we have given it an explicit, albeit broad, prior.

Another potential application would be filling the columns of a data matrix of predictors for which some predictors are missing; matrix columns can be accessed as vectors and assigned the same way, as in

x[N_obs_2, 2] = x_obs_2;
x[N_mis_2, 2] = x_mis_2;

where the relevant variables are all hard coded with index 2 because Stan doesn’t support ragged arrays. These could all be packed into a single array with more fiddly indexing that slices out vectors from longer vectors (see the ragged data structures section for a general discussion of coding ragged data structures in Stan).