4.2 Truncated Data
Truncated data are data for which measurements are only reported if they fall above a lower bound, below an upper bound, or between a lower and upper bound.
Truncated data may be modeled in Stan using truncated distributions. For example, suppose the truncated data are \(y_n\) with an upper truncation point of \(U = 300\), so that \(y_n < 300\). In Stan, these data can be modeled as following a truncated normal distribution as follows.
data {
  int<lower=0> N;
  real U;
  real<upper=U> y[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[,U];
}
The model declares an upper bound U as data and constrains
the data for y to respect the constraint; this will be checked
when the data are loaded into the model before sampling begins.
This model implicitly uses an improper flat prior on the scale and location parameters; these could be given priors in the model using sampling statements.
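For reference, the density implied by the truncation T[,U] can be written out explicitly. The notation here is introduced for illustration: \(\mathrm{Normal}(y_n \mid \mu, \sigma)\) is the untruncated normal density and \(\Phi\) is the standard normal cumulative distribution function.
\[
p(y_n \mid \mu, \sigma, U)
= \frac{\mathrm{Normal}(y_n \mid \mu, \sigma)}
       {\Phi\!\left(\frac{U - \mu}{\sigma}\right)}
\quad \text{for } y_n < U,
\]
and \(p(y_n \mid \mu, \sigma, U) = 0\) otherwise. The denominator is the probability mass the untruncated normal places below \(U\), so dividing by it renormalizes the density to integrate to one over the truncated range.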
Constraints and Out-of-Bounds Returns
If the sampled variate in a truncated distribution lies outside of
the truncation range, the probability is zero, so the log probability
will evaluate to \(-\infty\). For instance, suppose variate y is
sampled with the following statement.
for (n in 1:N)
  y[n] ~ normal(mu, sigma) T[L,U];
Then if the value of y[n] is less than the value of L
or greater than the value of U, the sampling statement evaluates
to zero probability. For user-defined truncation, this
zeroing outside of truncation bounds must be handled explicitly.
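As a rough sketch of what explicit handling might look like, the following model-block fragment writes the truncated normal contribution by hand. It assumes y, L, U, mu, and sigma are declared as elsewhere in this section and that the Stan version in use supports the target += statement; it illustrates the idea and is not necessarily how T[L,U] is implemented internally.
for (n in 1:N) {
  if (y[n] < L || y[n] > U) {
    // outside the truncation bounds: zero probability
    target += negative_infinity();
  } else {
    // normal log density, renormalized by the log of the
    // probability mass that normal(mu, sigma) places in [L, U]
    target += normal_lpdf(y[n] | mu, sigma)
              - log_diff_exp(normal_lcdf(U | mu, sigma),
                             normal_lcdf(L | mu, sigma));
  }
}
The subtraction on the log scale corresponds to dividing the density by the mass between L and U.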
To avoid variables straying outside of truncation bounds, appropriate
constraints are required. For example, if y is a parameter in
the above model, the declaration should constrain it to fall between
the values of L and U.
parameters {
real<lower=L,upper=U> y[N];
...
If in the above model, L or U is a parameter and
y is data, then L and U must be appropriately
constrained so that all data are in range and the value of L is
less than that of U (if they are equal, the parameter range
collapses to a single point and the Hamiltonian dynamics used by
the sampler break down). The following declarations ensure the bounds
are well behaved.
parameters {
real<upper=min(y)> L; // L < y[n]
real<lower=fmax(L, max(y))> U; // L < U; y[n] < U
For pairs of real numbers, the function fmax is used
rather than max.
Unknown Truncation Points
If the truncation points are unknown, they may be estimated as parameters. This can be done with a slight rearrangement of the variable declarations from the model in the previous section with known truncation points.
data {
  int<lower=1> N;
  real y[N];
}
parameters {
  real<upper=min(y)> L;
  real<lower=max(y)> U;
  real mu;
  real<lower=0> sigma;
}
model {
  L ~ ...;
  U ~ ...;
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[L,U];
}
Here there is a lower truncation point L which is declared to
be less than or equal to the minimum value of y. The upper
truncation point U is declared to be greater than or equal to the
maximum value of y. These declarations, although dependent on the data,
only enforce the constraint that the data fall within the truncation
bounds. With N declared as type int<lower=1>, there must be
at least one data point. The constraint that L is less than
U is therefore enforced indirectly: L lies at or below min(y), which is
no greater than max(y), which in turn lies at or below U.
The ellipses marking where the priors for the bounds L and U
should go should be filled in with informative priors; otherwise
the model will concentrate L strongly around
min(y) and U strongly around max(y).
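The pull of the bounds toward min(y) and max(y) can be seen from the form of the likelihood. Writing \(\Phi\) for the standard normal cumulative distribution function and \(\mathrm{Normal}(y_n \mid \mu, \sigma)\) for the untruncated density (notation introduced here for illustration), the likelihood for data satisfying \(L \leq y_n \leq U\) is
\[
p(y \mid \mu, \sigma, L, U)
= \prod_{n=1}^{N}
  \frac{\mathrm{Normal}(y_n \mid \mu, \sigma)}
       {\Phi\!\left(\frac{U - \mu}{\sigma}\right)
        - \Phi\!\left(\frac{L - \mu}{\sigma}\right)}.
\]
For fixed \(\mu\) and \(\sigma\), the denominator shrinks as L increases or U decreases, so every factor grows as the interval tightens. Because the same denominator appears in all \(N\) factors, the effect strengthens with more data, and without informative priors the likelihood favors the narrowest interval that still contains every observation.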