4.2 Truncated data

This is an old version, view current version.

Truncated data are data for which measurements are only reported if they fall above a lower bound, below an upper bound, or between a lower and upper bound.

Truncated data may be modeled in Stan using truncated distributions. For example, suppose the truncated data are $y_n$ with an upper truncation point of $U = 300$ so that $y_n < 300$ . In Stan, this data can be modeled as following a truncated normal distribution for the observations as follows.

data {
  int<lower=0> N;
  real U;
  array[N] real<upper=U> y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma) T[ , U];
}

The model declares an upper bound U as data and constrains the data for y to respect the constraint; this will be checked when the data are loaded into the model before sampling begins.

This model implicitly uses an improper flat prior on the scale and location parameters; these could be given priors in the model using sampling statements.

Constraints and out-of-bounds returns

If the sampled variate in a truncated distribution lies outside of the truncation range, the probability is zero, so the log probability will evaluate to $-\infty$ . For instance, if variate y is sampled with the statement.

y ~ normal(mu, sigma) T[L, U];

then if any value inside y is less than the value of L or greater than the value of U, the sampling statement produces a zero-probability estimate. For user-defined truncation, this zeroing outside of truncation bounds must be handled explicitly.

To avoid variables straying outside of truncation bounds, appropriate constraints are required. For example, if y is a parameter in the above model, the declaration should constrain it to fall between the values of L and U.

parameters {
  array[N] real<lower=L, upper=U> y;
  // ...
}

If in the above model, L or U is a parameter and y is data, then L and U must be appropriately constrained so that all data are in range and the value of L is less than that of U (if they are equal, the parameter range collapses to a single point and the Hamiltonian dynamics used by the sampler break down). The following declarations ensure the bounds are well behaved.

parameters {
  real<upper=min(y)> L;           // L < y[n]
  real<lower=fmax(L, max(y))> U;  // L < U; y[n] < U

For pairs of real numbers, the function fmax is used rather than max.

Unknown truncation points

If the truncation points are unknown, they may be estimated as parameters. This can be done with a slight rearrangement of the variable declarations from the model in the previous section with known truncation points.

data {
  int<lower=1> N;
  array[N] real y;
}
parameters {
  real<upper=min(y)> L;
  real<lower=max(y)> U;
  real mu;
  real<lower=0> sigma;
}
model {
  L ~ // ...
  U ~ // ...
  y ~ normal(mu, sigma) T[L, U];
}

Here there is a lower truncation point L which is declared to be less than or equal to the minimum value of y. The upper truncation point U is declared to be larger than the maximum value of y. This declaration, although dependent on the data, only enforces the constraint that the data fall within the truncation bounds. With N declared as type int<lower=1>, there must be at least one data point. The constraint that L is less than U is enforced indirectly, based on the non-empty data.

The ellipses where the priors for the bounds L and U should go should be filled in with a an informative prior in order for this model to not concentrate L strongly around min(y) and U strongly around max(y).