4.2 Truncated Data
Truncated data are data for which measurements are only reported if they fall above a lower bound, below an upper bound, or between a lower and upper bound.
Truncated data may be modeled in Stan using truncated distributions. For example, suppose the truncated data are \(y_n\) with an upper truncation point of \(U = 300\) so that \(y_n < 300\). In Stan, these data can be modeled with a truncated normal distribution as follows.
data {
  int<lower=0> N;
  real U;
  real<upper=U> y[N];
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[,U];
}
The model declares an upper bound U as data and constrains the data for y to respect the constraint; this will be checked when the data are loaded into the model before sampling begins.
This model implicitly uses an improper flat prior on the scale and location parameters; these could be given priors in the model using sampling statements.
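For reference, the density implied by the truncated sampling statement can be written out as a worked equation. This is the standard definition of an upper-truncated normal density rather than anything specific to this model, and the symbol \(\Phi\) (the standard normal cumulative distribution function) is introduced here only for illustration.

\[
p(y_n \mid \mu, \sigma, U)
= \frac{\mathrm{Normal}(y_n \mid \mu, \sigma)}
       {\Phi\!\left(\frac{U - \mu}{\sigma}\right)}
\quad \text{for } y_n < U,
\]

and \(p(y_n \mid \mu, \sigma, U) = 0\) otherwise. Dividing by the normal probability mass below \(U\) renormalizes the density so that it integrates to one over the truncated range.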
Constraints and Out-of-Bounds Returns
If the sampled variate in a truncated distribution lies outside of the truncation range, the probability is zero, so the log probability will evaluate to \(-\infty\). For instance, if variate y is sampled with the statement
for (n in 1:N)
  y[n] ~ normal(mu, sigma) T[L,U];
then if the value of y[n] is less than the value of L or greater than the value of U, the sampling statement produces a zero-probability estimate. For user-defined truncation, this zeroing outside of truncation bounds must be handled explicitly.
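As a minimal sketch of what handling the bounds explicitly might look like, the program below defines a hypothetical user-defined density. The name trunc_normal_lpdf is an assumption made for illustration and is not a built-in Stan function. The function returns negative infinity outside the interval from L to U and otherwise renormalizes the normal density by the probability mass between the bounds. The sketch assumes that L and U are supplied as data with L strictly less than U, and it declares y as a vector so that it does not depend on the array declaration syntax.

```stan
functions {
  // Hypothetical helper (not built in): log density of a normal
  // distribution truncated to the interval [L, U]. Assumes L < U.
  real trunc_normal_lpdf(real y, real mu, real sigma, real L, real U) {
    // Outside the truncation bounds the density is zero,
    // so the log density is negative infinity.
    if (y < L || y > U) {
      return negative_infinity();
    }
    // Inside the bounds, divide by the normal mass in [L, U],
    // computed on the log scale.
    return normal_lpdf(y | mu, sigma)
           - log_diff_exp(normal_lcdf(U | mu, sigma),
                          normal_lcdf(L | mu, sigma));
  }
}
data {
  int<lower=0> N;
  real L;
  real<lower=L> U;
  vector<lower=L, upper=U>[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  for (n in 1:N) {
    target += trunc_normal_lpdf(y[n] | mu, sigma, L, U);
  }
}
```

Because the function name ends in _lpdf, it could equally be used in a sampling statement of the form y[n] ~ trunc_normal(mu, sigma, L, U). The explicit bounds check is what reproduces the zero-probability behavior that the built-in truncation syntax provides automatically.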
To avoid variables straying outside of truncation bounds, appropriate constraints are required. For example, if y is a parameter in the above model, the declaration should constrain it to fall between the values of L and U.
parameters {
  real<lower=L,upper=U> y[N];
  ...
If in the above model, L or U is a parameter and y is data, then L and U must be appropriately constrained so that all data are in range and the value of L is less than that of U (if they are equal, the parameter range collapses to a single point and the Hamiltonian dynamics used by the sampler break down). The following declarations ensure the bounds are well behaved.
parameters {
  real<upper=min(y)> L;           // L < y[n]
  real<lower=fmax(L, max(y))> U;  // L < U; y[n] < U
For pairs of real numbers, the function fmax is used rather than max.
Unknown Truncation Points
If the truncation points are unknown, they may be estimated as parameters. This can be done with a slight rearrangement of the variable declarations from the model in the previous section with known truncation points.
data {
  int<lower=1> N;
  real y[N];
}
parameters {
  real<upper = min(y)> L;
  real<lower = max(y)> U;
  real mu;
  real<lower=0> sigma;
}
model {
  L ~ ...;
  U ~ ...;
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[L,U];
}
Here there is a lower truncation point L, which is declared to be less than or equal to the minimum value of y. The upper truncation point U is declared to be greater than or equal to the maximum value of y. These declarations, although dependent on the data, only enforce the constraint that the data fall within the truncation bounds. With N declared as type int<lower=1>, there must be at least one data point. The constraint that L is less than U is enforced indirectly, based on the non-empty data, because at least one data point must lie between L and U.
The ellipses marking where the priors for the bounds L and U belong should be replaced with informative priors; otherwise the model will concentrate L strongly around min(y) and U strongly around max(y).
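The reason for this concentration is that each observation's density is divided by the normal probability mass between L and U, so a narrower interval always yields a higher likelihood. The following is a minimal sketch of one way the priors might be supplied, not a recommended choice. The normal priors and the hyperparameters L_loc, L_scale, U_loc, and U_scale are hypothetical names introduced here and would need to be set from genuine prior knowledge about the truncation points. As above, y is declared as a vector so the sketch does not depend on the array declaration syntax.

```stan
data {
  int<lower=1> N;
  vector[N] y;
  // Hypothetical prior hyperparameters supplied by the analyst.
  real L_loc;
  real<lower=0> L_scale;
  real U_loc;
  real<lower=0> U_scale;
}
parameters {
  real<upper=min(y)> L;   // lower truncation point, at or below all data
  real<lower=max(y)> U;   // upper truncation point, at or above all data
  real mu;
  real<lower=0> sigma;
}
model {
  // Assumed informative priors on the bounds; the declared
  // constraints restrict them to L <= min(y) and U >= max(y).
  L ~ normal(L_loc, L_scale);
  U ~ normal(U_loc, U_scale);
  for (n in 1:N)
    y[n] ~ normal(mu, sigma) T[L, U];
}
```

The location and scale parameters mu and sigma still receive improper flat priors in this sketch, as in the original model; they could be given priors in the same way.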