4.3 Censored data
Censoring hides values from points that are too large, too small, or both. Unlike with truncated data, the number of data points that were censored is known. The textbook example is the household scale which does not report values above 300 pounds.
Estimating censored values
One way to model censored data is to treat the censored data as missing data that is constrained to fall in the censored range of values. Since Stan does not allow unknown values in its arrays or matrices, the censored values must be represented explicitly, as in the following right-censored case.
data {
int<lower=0> N_obs;
int<lower=0> N_cens;
array[N_obs] real y_obs;
real<lower=max(y_obs)> U;
}parameters {
array[N_cens] real<lower=U> y_cens;
real mu;
real<lower=0> sigma;
}model {
y_obs ~ normal(mu, sigma);
y_cens ~ normal(mu, sigma); }
Because the censored data array y_cens
is declared to be a parameter, it
will be sampled along with the location and scale parameters mu
and sigma
. Because the censored data array y_cens
is
declared to have values of type real<lower=U>
, all imputed values
for censored data will be greater than U
. The imputed censored
data affects the location and scale parameters through the last
sampling statement in the model.
Integrating out censored values
Although it is wrong to ignore the censored values in estimating location and scale, it is not necessary to impute values. Instead, the values can be integrated out. Each censored data point has a probability of \[\begin{align*} \operatorname{Pr}[y > U] &= \int_U^{\infty} \textsf{normal}\left(y \mid \mu,\sigma \right) \,\textsf{d}y \\ &= 1 - \Phi\left(\frac{U - \mu}{\sigma}\right), \end{align*}\]
where \(\Phi()\) is the standard normal cumulative distribution function. With \(M\) censored observations, the total probability on the log scale is \[\begin{align*} \log \prod_{m=1}^M \operatorname{Pr}[y_m > U] &= \log \left( 1 - \Phi\left(\frac{U - \mu}{\sigma}\right)\right)^{M} \\ &= M \times \texttt{normal}\mathtt{\_}\texttt{lccdf}\left(U \mid \mu, \sigma \right), \end{align*}\]
where normal_lccdf
is the log of complementary CDF
(Stan provides <distr>_lccdf
for each distribution
implemented in Stan).
The following right-censored model assumes that the censoring point is known, so it is declared as data.
data {
int<lower=0> N_obs;
int<lower=0> N_cens;
array[N_obs] real y_obs;
real<lower=max(y_obs)> U;
}parameters {
real mu;
real<lower=0> sigma;
}model {
y_obs ~ normal(mu, sigma);target += N_cens * normal_lccdf(U | mu, sigma);
}
For the observed values in y_obs
, the normal sampling model is
used without truncation. The log probability is directly incremented
using the calculated log cumulative normal probability of the censored
data items.
For the left-censored data the CDF (normal_lcdf
) has to be
used instead of complementary CDF. If the censoring point variable
(L
) is unknown, its declaration should be moved from the data
to the parameters block.
data {
int<lower=0> N_obs;
int<lower=0> N_cens;
array[N_obs] real y_obs;
}parameters {
real<upper=min(y_obs)> L;
real mu;
real<lower=0> sigma;
}model {
L ~ normal(mu, sigma);
y_obs ~ normal(mu, sigma);target += N_cens * normal_lcdf(L | mu, sigma);
}