Multiple imputation
Missing data is common in applied data analysis. Ignoring this missingness can distort posterior inferences and reduce their precision, and modeling it can improve inference quality. There are several ways to model missing data. Chapter 18 in Gelman et al. (2013) offers an approximately Bayesian perspective, and the chapter Missing Data and Partially Known Parameters in the Reference Manual shows an approach that handles missing data as parameters.
Another way to model missing data is through multiple imputation, which replaces missing values with sampled values to obtain different versions of a complete data set. Then, the model of interest can be fitted to each of these complete versions separately. Combining the resulting draws from these different fits produces a sample that accounts for the uncertainty in the missing values.
Motivation and outline
Suppose that we have a data set \(x\) with columns \(x_{\cdot, 1}, \ldots, x_{\cdot, K}\) that make up covariates in a regression with quantities of interest \(\theta\). With the completely observed data, \(x^\text{comp}\), we could estimate the posterior distribution of \(\theta\) as \(p(\theta \mid x^\text{comp})\). But with missing data, our matrix is split into \(x^{\text{obs}}\) (the observed values of \(x\)) and \(x^{\text{mis}}\) (the missing values of \(x\)).
Fortunately, we can treat \(x^{\text{mis}}\) as additional, nuisance parameters that we can estimate along with \(\theta\) as \(p(\theta, x^{\text{mis}} \mid x^{\text{obs}})\). So, we can express the marginal distribution of \(\theta\) given only the observed values \(x^{\text{obs}}\) as \[\begin{align*} p(\theta \mid x^{\text{obs}}) =& \int p(\theta, x^{\text{mis}} \mid x^{\text{obs}}) \mathrm{d}x^{\text{mis}} \\ =& \int p(\theta \mid x^{\text{obs}}, x^{\text{mis}}) p(x^{\text{mis}} \mid x^{\text{obs}}) \mathrm{d}x^{\text{mis}} \\ =& \int p(\theta \mid x^{\text{imp}}) p(x^{\text{mis}} \mid x^{\text{obs}}) \mathrm{d}x^{\text{mis}}, \end{align*}\] where \(x^{\text{imp}}\) is a data set that includes imputed values of \(x^{\text{mis}}\).
The equations above show that we do not need to describe \(p(\theta \mid x^{\text{obs}})\) directly. Instead, we can first find a way to sample from \(x^{\text{mis}}\) based on \(x^{\text{obs}}\), and then use these samples to fit the model \(p(\theta \mid x^\text{imp})\), treating \(x^\text{imp}\) as if it was \(x^\text{comp}\).
Note that the model for \(p(x^{\text{mis}} \mid x^{\text{obs}})\) needs to be explicit. In a typical regression setting with covariates \(x\) and outcome \(y\), this model for \(x\) is not needed because inferences for the regression parameters are independent of the model of \(x\) when \(x\) is fully observed. But with missing data and multiple imputation, we need to generate values for \(x^{\text{mis}}\) based on \(x^{\text{obs}}\), which in turn needs an explicit model for \(x^{\text{mis}}\).
Thus, the general outline for multiple imputation with a number \(M\) of imputations is:
- Draw \(M\) random samples \(x^{\text{mis}}_1, \ldots, x^{\text{mis}}_M\) from the posterior predictive distribution \(p(x^{\text{mis}} \mid x^{\text{obs}})\).
- Use these samples to get \(M\) different complete data sets \(x_1^\text{imp}, \ldots, x_M^\text{imp}\).
- For each of the \(M\) imputed data sets, sample from the model of interest \(p(\theta \mid x_m^\text{imp})\), each time treating \(x_m^\text{imp}\) as if it was \(x^\text{comp}\).
- Combine the draws for \(\theta\) from all \(M\) fits.
Pseudocode for outline
Following the notation above, we can express the outline with the pseudocode below:
theta_estimates = []
for (m in 1:M) {
x_mis <- get_imputations(x_obs)
x_imp <- add_imputations(x_obs, x_mis)
imputation_fits <- my_model.fit(x_imp).draws()
theta_estimates <- theta_estimates.append(imputation_fits)
}Each of the functions in this pseudocode can encapsulate simple or complex procedures. get_imputations(), for example, may impute missing data by taking values directly from x_obs; or it may model a regression where an \(x_{\cdot, k}\) with missing values is the outcome and the \(x_{\cdot, j}\) without missing observations are predictors. Below we illustrate how to implement this second approach in Stan.
Imputing one variable in Stan
Imagine our data set is composed of two numerical, continuous variables \(x\) and \(y\), and one binary variable \(z\). We want to estimate the parameters that govern the conditional association \(p(y \mid x, z)\), but several values of \(x\) are missing. So, before we fit this model for \(y\), we impute the missing values in \(x\).
We first find all the fully observed data points and identify their values as \(x^{\text{obs}}\), \(y^{\text{obs}}\), and \(z^{\text{obs}}\). Then we use these data to fit the model \[ x^{\text{obs}} \sim \text{normal}(\gamma_0 + \gamma_1 y^{\text{obs}} + \gamma_2 z^{\text{obs}}, \lambda). \]
This model can then give us samples for the missing values \(x^{\text{mis}}\) conditional on the corresponding values of \(y\) and \(z\), which we call \(y^{\text{aux}}\) and \(z^{\text{aux}}\). The Stan code for this imputation is:
data {
int<lower=0> N_obs;
vector[N_obs] x_obs;
vector[N_obs] y_obs;
array[N_obs] int<lower=0, upper=1> z_obs;
int<lower=0> N_mis;
vector[N_mis] y_aux;
array[N_mis] int<lower=0, upper=1> z_aux;
}
parameters {
vector[3] gamma;
real<lower=0> lambda;
}
model {
gamma ~ normal(0, 1);
lambda ~ exponential(1);
x_obs ~ normal(gamma[1] + gamma[2] * y_obs + gamma[3] * z_obs,
lambda);
}
generated quantities {
array[N_mis] x_imp = normal_rng(gamma[1] + gamma[2] * y_aux[n]
+ gamma[3] * z_aux[n], lambda);
}The generated quantities block automatically samples \(\gamma_0\), \(\gamma_1\), \(\gamma_2\), and \(\lambda\) from their posterior distributions. So, the random draws of x_imp incorporate uncertainty from the estimated parameters and from the sampling variation in \(x^{\text{mis}}\).
Multiple posterior draws of \(x^{\text{mis}}\) give us multiple imputed data sets. With these data sets we can model \(p(y \mid \beta, \sigma, x, z)\) as \[ y \sim \text{normal}(\beta_0 + \beta_1 x^\text{comp} + \beta_2 z, \sigma), \] where \(x^\text{comp}\) contains observed and imputed values. The Stan code for this model is:
data {
int<lower=0> N;
vector[N] y;
vector[N] x;
vector[N] z;
}
parameters {
vector[3] beta;
real<lower=0> sigma;
}
model {
beta ~ normal(0, 1);
sigma ~ exponential(1);
y ~ normal(beta[1] + beta[2] * x + beta[3] * z, sigma);
}With multiple imputation, the model for \(p(y \mid \beta, \sigma, x, z)\) does not need to distinguish between observed and imputed values of \(x\). Instead, this model can treat all imputed data sets as complete because we can combine all the posterior draws from multiple fits.
Imputing two or more variables
A more general scenario involves an outcome variable \(y\) and several explanatory variables \(x_1, \ldots, x_K\), each of which can have missing values that we want to impute.
One solution is to do multiple imputation through chained equations, a procedure often called “MICE”1. The MICE procedure in this scenario is:
- Initialize missing values in all \(x_i\). For each variable \(x_i\) with missing entries, fill its missing values with random samples from its observed values (or use another simple initialization rule).
- Update each \(x_i\) given \(y\) and the other \(x\)’s. For \(i=1, \ldots, K\), fit a model for the observed \(x_i\) conditional on the current versions (with observed and imputed values) of \(y\) and \(x_{-k} = x_1, \ldots x_{k-1}, x_{k+1}, \ldots, x_{K}\). Use this model to draw impuations from the predictive distribution of the missing \(x_i\). Completing this step for all \(i = 1, \ldots, K\) constitutes a single imputation cycle.
- Warmup period. Repeat the imputation cycle in step 1 several times as warmup to let the imputations stabilize.
- Create M imputed datasets. After the warmup, record the current complete dataset as one imputed dataset. Then either restart from step 0 and repeat steps 1–3 until you have \(M\) imputed datasets; or do a single long run, i.e., continue iterating steps 1–2 and save the imputed dataset at \(M\) well-spaced iterations (e.g., every \(S\) cycles) to obtain \(M\) imputed datasets, without restarting from step 0 each time.
- Fit the target model \(p(y \mid \beta, \sigma, x_1, \ldots, x_K)\) separately to each of the \(M\) imputed datasets and save the posterior draws for the parameters of interest.
- Combine the draws (or other summaries) from all \(M\) fits.
Note that, as described here, the MICE procedure does not guarantee that the conditional distributions of all variables will be compatible. Compatibility means that there is a joint distribution for all the variables used in the imputation that can be decomposed as the conditional distributions we used. An incompatible imputation model will technically not sample from any well-defined probability distribution. But the consequences of this are not always serious. See section 6.4 in Carpenter et al. (2023) for a more detailed explanation of compatibility and of the related concept of congeniality.
Imputing two variables in Stan
Imagine that we again want to use numerical variables \(x\) and \(y\), and dichotomic variable \(z\) to estimate the parameters in \(p(y \mid \beta, \sigma, x, z)\). But now both \(x\) and \(z\) have missing values that we want to impute.
To apply the MICE procedure in this example, we can reuse the models for \(x\) and \(y\) that we defined above. We also need a new model to impute \(z^{\text{mis}}\), so we use the logistic regression \[ z^{\text{obs}} \sim \text{Bernoulli}( \text{logit}^{-1}( \alpha_0 + \alpha_1 y^{\text{obs}} + \alpha_2 x^{\text{obs}}) ). \] Here, \(z^{\text{obs}}\) contains only the completely observed values from our original data set, while \(x^{\text{obs}}\) and \(y^{\text{obs}}\) contain all the values that correspond to \(z^{\text{obs}}\). Thus, \(x^{\text{obs}}\) can include observed and imputed values, and \(y^{\text{obs}}\) need not be the same as in the model for \(x^{\text{obs}}\).
With this model we can sample values for \(z^{\text{mis}}\) conditional on its corresponding values \(y^{\text{aux}}\) and \(x^{\text{aux}}\).
The Stan code to impute \(z^{\text{mis}}\) is:
data {
int<lower=0> N_obs;
array[N_obs] int<lower=0, upper=1> z_obs;
vector[N_obs] y_obs;
vector[N_obs] x_obs;
int<lower=0> N_mis;
vector[N_mis] y_aux;
vector[N_mis] x_aux;
}
parameters {
vector[3] gamma;
}
model {
gamma ~ normal(0, 1);
z_obs ~ bernoulli_logit(gamma[1] + gamma[2] * y_obs
+ gamma[3] * x_obs);
}
generated quantities {
array[N_mis] int z_imp = bernoulli_logit_rng(gamma[1]
+ gamma[2] * y_aux
+ gamma[3] * x_aux);
}The pseudocode for this example of MICE, restarting the imputations after each imputation cycle, is shown below. Note that we do not need to initialize \(x^{\text{mis}}\) in step 0 because we can impute it directly with the model for \(x^{\text{obs}}\).
completed_datasets <- []
for (m in 1:M) {
data_m <- copy(data_orig)
data_m.z[missing_idz] <- random_sample(data_m.z[observed_idz])
for (cycle in 1:n_warm) {
stanmod_x <- build_model("model_for_x.stan")
fit_x <- stanmod_x.fit(
obs_data=data_m[observed_idx],
mis_data=data_m[missing_idx]
)
data_m.x[missing_idx] <- get_imputations(fit_x, "x_imp")
stanmod_z <- build_model("model_for_z.stan")
fit_z <- stanmod_z.fit(
obs_data=data_m[observed_idz],
mis_data=data_m[missing_idz]
)
data_m.z[missing_idz] <- get_imputations(fit_z, "z_imp")
}
completed_datasets.append(data_m)
}
all_draws <- []
for (dataset in completed_datasets) {
stanmod_y <- build_model("model_for_y.stan")
fit_y <- stanmod_y.fit(dataset)
all_draws.append(extract_draws(model_results))
}
all_results <- combine_results(all_draws)Combining posterior draws
With Stan’s MCMC sampler, we can treat posterior draw chains from imputed data sets as if they were chains based on complete data. There is one important difference. Multiple imputation expresses uncertainty in the missing values as consistent differences in the estimates obtained from different imputated data sets. This means that chains obtained from the same imputed data set should converge, but chains obtained from different data sets do not have to. So, we need not worry if diagnostics2 signal that the chains from different imputed data sets are not converging properly. See an example in Bürkner (2025).
Cut models
A full Bayesian probability model includes a feedback flow of information between all parameters and all data. Cut models separate some parts of this feedback flow so that different subsets of data influence only some parameters in the model (see Plummer (2015)).
Multiple imputation interrupts the flow of information from data to parameters. In our regression above, for example, the imputations influence the distribution of the parameters in \(p(y \mid x_1, \ldots, x_K)\), but these parameters do not influence the imputations. So, we could use multiple imputation to implement a cut model.
References
Footnotes
Section 5 of chapter 4 in van Buuren (2018) details the MICE procedure in a frequentist context.↩︎
Such as \(\hat{R}\). See Split R-hat for detecting non-stationarity in the Reference Manual.↩︎