4.2 Truncated data
Truncated data are data for which measurements are only reported if they fall above a lower bound, below an upper bound, or between a lower and upper bound.
Truncated data may be modeled in Stan using truncated distributions. For example, suppose the truncated data are \(y_n\) with an upper truncation point of \(U = 300\) so that \(y_n < 300\). In Stan, these data can be modeled as following a truncated normal distribution as follows.
data {
  int<lower=0> N;
  real U;
  real<upper=U> y[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[,U];
}
The model declares an upper bound U as data and constrains the data for y to respect the constraint; this will be checked when the data are loaded into the model before sampling begins.
This model implicitly uses an improper flat prior on the scale and location parameters; these could be given priors in the model using sampling statements.
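As a minimal sketch of what adding such priors might look like, the program below repeats the model above with two sampling statements for mu and sigma. The prior families and scales are assumptions chosen only for illustration; they are not recommended by the text and should be replaced with values suited to the application.

```stan
data {
  int<lower=0> N;
  real U;
  real<upper=U> y[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // Illustrative priors only; the scales 100 and 25 are assumptions.
  mu ~ normal(0, 100);
  // sigma is declared with lower=0, so this acts as a half-Cauchy prior.
  sigma ~ cauchy(0, 25);
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[,U];
}
```

With these statements in place the posterior no longer relies on the improper flat priors.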
Constraints and out-of-bounds returns
If the sampled variate in a truncated distribution lies outside of the truncation range, the probability is zero, so the log probability will evaluate to \(-\infty\). For instance, if variate y is sampled with the statement
for (n in 1:N)
  y[n] ~ normal(mu, sigma) T[L,U];
then if the value of y[n] is less than the value of L or greater than the value of U, the sampling statement produces a zero-probability estimate. For user-defined truncation, this zeroing outside of truncation bounds must be handled explicitly.
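One way to handle truncation explicitly is sketched below, assuming known bounds L and U supplied as data. The log density is the untruncated normal log density minus the log of the probability mass between L and U, and values outside the bounds contribute \(-\infty\). The local variable name log_norm is hypothetical, and the array declarations follow the syntax used elsewhere in this section (newer Stan releases write array[N] real y instead).

```stan
data {
  int<lower=0> N;
  real L;
  real<lower=L> U;
  real y[N];          // deliberately unconstrained so the check below matters
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // log Pr[L < Y < U] for Y ~ normal(mu, sigma)
  real log_norm = log_diff_exp(normal_lcdf(U | mu, sigma),
                               normal_lcdf(L | mu, sigma));
  for (n in 1:N) {
    if (y[n] < L || y[n] > U)
      target += negative_infinity();   // zero probability outside the bounds
    else
      target += normal_lpdf(y[n] | mu, sigma) - log_norm;
  }
}
```

Because log_norm depends on mu and sigma but not on n, it is computed once and reused for every observation.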
To avoid variables straying outside of truncation bounds, appropriate constraints are required. For example, if y is a parameter in the above model, the declaration should constrain it to fall between the values of L and U.
parameters {
real<lower=L,upper=U> y[N];
...
If in the above model, L or U is a parameter and y is data, then L and U must be appropriately constrained so that all data are in range and the value of L is less than that of U (if they are equal, the parameter range collapses to a single point and the Hamiltonian dynamics used by the sampler break down). The following declarations ensure the bounds are well behaved.
parameters {
  real<upper=min(y)> L;           // L < y[n]
  real<lower=fmax(L, max(y))> U;  // L < U; y[n] < U
For pairs of real numbers, the function fmax is used rather than max.
Unknown truncation points
If the truncation points are unknown, they may be estimated as parameters. This can be done with a slight rearrangement of the variable declarations from the model in the previous section with known truncation points.
data {
  int<lower=1> N;
  real y[N];
}
parameters {
  real<upper = min(y)> L;
  real<lower = max(y)> U;
  real mu;
  real<lower=0> sigma;
}
model {
  L ~ ...;
  U ~ ...;
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[L,U];
}
Here there is a lower truncation point L, which is declared to be less than or equal to the minimum value of y. The upper truncation point U is declared to be greater than or equal to the maximum value of y. This declaration, although dependent on the data, only enforces the constraint that the data fall within the truncation bounds. With N declared as type int<lower=1>, there must be at least one data point. The constraint that L is less than U is enforced indirectly, based on the non-empty data.
The ellipses mark where the priors for the bounds L and U belong. They should be filled in with informative priors so that the model does not concentrate L strongly around min(y) and U strongly around max(y).
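As a hedged illustration of what such priors could look like, the sketch below passes prior locations and a common scale for the bounds as data. The names L_loc, U_loc, and bound_scale are hypothetical, and the choice of normal priors is an assumption made for the example rather than a recommendation from the text.

```stan
data {
  int<lower=1> N;
  real y[N];
  real L_loc;                 // hypothetical prior location for L
  real U_loc;                 // hypothetical prior location for U
  real<lower=0> bound_scale;  // hypothetical prior scale for both bounds
}
parameters {
  real<upper = min(y)> L;
  real<lower = max(y)> U;
  real mu;
  real<lower=0> sigma;
}
model {
  // Informative priors on the truncation points (illustrative choice).
  L ~ normal(L_loc, bound_scale);
  U ~ normal(U_loc, bound_scale);
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[L,U];
}
```

Because the hyperparameters are data, the declared bounds on L and U only change each prior by a constant normalizing factor, which does not affect sampling.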