3.1 Missing data

This is an old version, view current version.

Stan treats variables declared in the data and transformed data blocks as known and the variables in the parameters block as unknown.

An example involving missing normal observations could be coded as follows.¹⁰

data {
  int<lower=0> N_obs;
  int<lower=0> N_mis;
  real y_obs[N_obs];
}
parameters {
  real mu;
  real<lower=0> sigma;
  real y_mis[N_mis];
}
model {
  y_obs ~ normal(mu, sigma);
  y_mis ~ normal(mu, sigma);
}

The number of observed and missing data points are coded as data with non-negative integer variables N_obs and N_mis. The observed data are provided as an array data variable y_obs. The missing data are coded as an array parameter, y_mis. The ordinary parameters being estimated, the location mu and scale sigma, are also coded as parameters. The model is vectorized on the observed and missing data; combining them in this case would be less efficient because the data observations would be promoted and have needless derivatives calculated.

A more meaningful estimation example would involve a regression of the observed and missing observations using predictors that were known for each and specified in the data block.↩