4.3 Censored data

This is an old version, view current version.

Censoring hides values from points that are too large, too small, or both. Unlike with truncated data, the number of data points that were censored is known. The textbook example is the household scale which does not report values above 300 pounds.

Estimating censored values

One way to model censored data is to treat the censored data as missing data that is constrained to fall in the censored range of values. Since Stan does not allow unknown values in its arrays or matrices, the censored values must be represented explicitly, as in the following right-censored case.

data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  array[N_obs] real y_obs;
  real<lower=max(y_obs)> U;
}
parameters {
  array[N_cens] real<lower=U> y_cens;
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);
  y_cens ~ normal(mu, sigma);
}

Because the censored data array y_cens is declared to be a parameter, it will be sampled along with the location and scale parameters mu and sigma. Because the censored data array y_cens is declared to have values of type real<lower=U>, all imputed values for censored data will be greater than U. The imputed censored data affects the location and scale parameters through the last sampling statement in the model.

Integrating out censored values

Although it is wrong to ignore the censored values in estimating location and scale, it is not necessary to impute values. Instead, the values can be integrated out. Each censored data point has a probability of \[\begin{align*} \operatorname{Pr}[y > U] &= \int_U^{\infty} \textsf{normal}\left(y \mid \mu,\sigma \right) \,\textsf{d}y \\ &= 1 - \Phi\left(\frac{U - \mu}{\sigma}\right), \end{align*}\]

where \(\Phi()\) is the standard normal cumulative distribution function. With \(M\) censored observations, the total probability on the log scale is \[\begin{align*} \log \prod_{m=1}^M \operatorname{Pr}[y_m > U] &= \log \left( 1 - \Phi\left(\frac{U - \mu}{\sigma}\right)\right)^{M} \\ &= M \times \texttt{normal}\mathtt{\_}\texttt{lccdf}\left(U \mid \mu, \sigma \right), \end{align*}\]

where normal_lccdf is the log of complementary CDF (Stan provides <distr>_lccdf for each distribution implemented in Stan).

The following right-censored model assumes that the censoring point is known, so it is declared as data.

data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  array[N_obs] real y_obs;
  real<lower=max(y_obs)> U;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y_obs ~ normal(mu, sigma);
  target += N_cens * normal_lccdf(U | mu, sigma);
}

For the observed values in y_obs, the normal sampling model is used without truncation. The log probability is directly incremented using the calculated log cumulative normal probability of the censored data items.

For the left-censored data the CDF (normal_lcdf) has to be used instead of complementary CDF. If the censoring point variable (L) is unknown, its declaration should be moved from the data to the parameters block.

data {
  int<lower=0> N_obs;
  int<lower=0> N_cens;
  array[N_obs] real y_obs;
}
parameters {
  real<upper=min(y_obs)> L;
  real mu;
  real<lower=0> sigma;
}
model {
  L ~ normal(mu, sigma);
  y_obs ~ normal(mu, sigma);
  target += N_cens * normal_lcdf(L | mu, sigma);
}