3.3 Sliced missing data
If the missing data are part of some larger data structure, then that structure can often be effectively reassembled using index arrays and slicing. Here’s an example for time-series data, where only some entries in the series are observed.
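data {
  int<lower=0> N_obs;                                     // number of observed entries
  int<lower=0> N_mis;                                     // number of missing entries
  array[N_obs] int<lower=1, upper=N_obs + N_mis> ii_obs;  // positions of observed entries in y
  array[N_mis] int<lower=1, upper=N_obs + N_mis> ii_mis;  // positions of missing entries in y
  array[N_obs] real y_obs;                                // observed values
}
transformed data {
  int<lower=0> N = N_obs + N_mis;                         // length of the full series
}
parameters {
  array[N_mis] real y_mis;                                // missing values, estimated as parameters
  real<lower=0> sigma;
}
transformed parameters {
  array[N] real y;                                        // full series, reassembled by multi-indexing
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
}
model {
  sigma ~ gamma(1, 1);
  y[1] ~ normal(0, 100);                                  // broad prior in case y[1] is missing
  y[2:N] ~ normal(y[1:(N - 1)], sigma);                   // each entry centered on its predecessor
}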
The index arrays ii_obs and ii_mis contain the indexes into the final array y of the observed data (coded as a data array y_obs) and the missing data (coded as a parameter array y_mis). For example, if a series of length 5 is observed only at positions 1 and 3, then ii_obs holds 1 and 3 and ii_mis holds 2, 4, and 5. See the time series chapter for further discussion of time-series models, and specifically the autoregression section for an explanation of the vectorization for y as well as an explanation of how to convert this example to a full AR(1) model. To ensure y[1] has a proper posterior in case it is missing, we have given it an explicit, albeit broad, prior.
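As a rough illustration of the AR(1) conversion mentioned above, the following sketch adds an intercept and a slope to the model. The names alpha and beta and their priors are assumptions made for this example, not taken from the autoregression section. The series is declared as vectors rather than arrays so that the expression alpha + beta * y[1:(N - 1)] is defined by vector arithmetic, and the declarations use the array[N] form required by recent Stan releases.

```stan
data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  array[N_obs] int<lower=1, upper=N_obs + N_mis> ii_obs;
  array[N_mis] int<lower=1, upper=N_obs + N_mis> ii_mis;
  vector[N_obs] y_obs;            // vector instead of array, to allow arithmetic
}
transformed data {
  int<lower=0> N = N_obs + N_mis;
}
parameters {
  vector[N_mis] y_mis;
  real alpha;                     // intercept (name assumed for this sketch)
  real beta;                      // autoregression coefficient (name assumed)
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] y;
  y[ii_obs] = y_obs;              // multi-indexing works on vectors as well
  y[ii_mis] = y_mis;
}
model {
  sigma ~ gamma(1, 1);
  alpha ~ normal(0, 10);          // illustrative prior, an assumption
  beta ~ normal(0, 1);            // illustrative prior, an assumption
  y[1] ~ normal(0, 100);
  y[2:N] ~ normal(alpha + beta * y[1:(N - 1)], sigma);
}
```

Setting alpha to 0 and beta to 1 recovers the model given earlier, in which each entry is centered on its predecessor.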
Another potential application would be filling the columns of a data matrix of predictors for which some predictors are missing; matrix columns can be accessed as vectors and assigned the same way, as in
x[ii_obs_2, 2] = x_obs_2;
x[ii_mis_2, 2] = x_mis_2;
where the relevant variables are all hard coded with index 2 because Stan doesn’t support ragged arrays. These could all be packed into a single array with more fiddly indexing that slices out vectors from longer vectors (see the ragged data structures section for a general discussion of coding ragged data structures in Stan).
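One way the packing might look is sketched below, under several assumptions. All names here (x_in, N_mis, ii_mis, x_mis, pos) are hypothetical. Observed values are assumed to arrive already in place in the data matrix x_in, with any finite placeholder at the missing positions, so only the missing entries are packed. The normal(0, 1) distribution on the missing values stands in for whatever model of the predictors is appropriate.

```stan
data {
  int<lower=1> N;                                  // rows of the predictor matrix
  int<lower=1> K;                                  // columns of the predictor matrix
  matrix[N, K] x_in;                               // predictors; missing cells hold placeholders
  array[K] int<lower=0, upper=N> N_mis;            // number of missing entries in each column
  array[sum(N_mis)] int<lower=1, upper=N> ii_mis;  // missing row indexes, packed column by column
}
parameters {
  vector[sum(N_mis)] x_mis;                        // all missing values in one long vector
}
transformed parameters {
  matrix[N, K] x = x_in;
  {
    int pos = 1;                                   // start of column k's slice in the packed arrays
    for (k in 1:K) {
      // slice out column k's row indexes and values, then assign them together
      x[segment(ii_mis, pos, N_mis[k]), k] = segment(x_mis, pos, N_mis[k]);
      pos += N_mis[k];
    }
  }
}
model {
  x_mis ~ normal(0, 1);                            // placeholder model for the missing predictors
}
```

The running position pos plays the role that the hard-coded suffix 2 plays in the unpacked version: it locates each column's slice of the long index array and the long parameter vector.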