5.6 Zero-Inflated and Hurdle Models
Zero-inflated and hurdle models both provide mixtures of a Poisson and Bernoulli probability mass function to allow more flexibility in modeling the probability of a zero outcome. Zero-inflated models, as defined by Lambert (1992), add additional probability mass to the outcome of zero. Hurdle models, on the other hand, are formulated as pure mixtures of zero and non-zero outcomes.
Zero-inflated and hurdle models can be formulated for discrete distributions other than the Poisson. Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal distribution used as a regression coefficient prior.
Zero Inflation
Consider the following example for zero-inflated Poisson distributions. It uses a parameter theta for the mixing proportion: there is a probability \(\theta\) of drawing a zero and a probability \(1 - \theta\) of drawing from \(\mathsf{Poisson}(\lambda)\) (\(\theta\) is used for the mixing proportion because \(\lambda\) is the traditional notation for a Poisson mean parameter). The probability function is thus
\[
p(y_n|\theta,\lambda)
=
\begin{cases}
\theta + (1 - \theta) \cdot \mathsf{Poisson}(0|\lambda) & \mbox{if } y_n = 0, \mbox{ and} \\
(1 - \theta) \cdot \mathsf{Poisson}(y_n|\lambda) & \mbox{if } y_n > 0.
\end{cases}
\]
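Because \(\mathsf{Poisson}(0|\lambda) = \exp(-\lambda)\), taking logs turns the zero case into a log-sum-exp of the two component log probabilities,
\[
\log p(0|\theta,\lambda)
= \log\left(\theta + (1 - \theta) \cdot \exp(-\lambda)\right)
= \textrm{log\_sum\_exp}\left(\log \theta,\ \log(1 - \theta) - \lambda\right),
\]
which is the form used in the model block below.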
The log probability function can be implemented directly in Stan as follows.
data {
  int<lower=0> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> lambda;
}
model {
  for (n in 1:N) {
    if (y[n] == 0)
      target += log_sum_exp(bernoulli_lpmf(1 | theta),
                            bernoulli_lpmf(0 | theta)
                              + poisson_lpmf(y[n] | lambda));
    else
      target += bernoulli_lpmf(0 | theta)
                  + poisson_lpmf(y[n] | lambda);
  }
}
The log_sum_exp(lp1, lp2) function adds the log probabilities on the linear scale; it is defined to be equal to log(exp(lp1) + exp(lp2)), but is more arithmetically stable and faster.
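The same pattern can be expressed with Stan's log_mix function, which computes log_sum_exp(log(theta) + lp1, log1m(theta) + lp2) in a single call. As a minimal sketch, the loop body above could equivalently be written as follows.
if (y[n] == 0)
  // point mass at zero has log probability 0 in the first component
  target += log_mix(theta, 0, poisson_lpmf(y[n] | lambda));
else
  // log1m(theta) is bernoulli_lpmf(0 | theta)
  target += log1m(theta) + poisson_lpmf(y[n] | lambda);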
Optimizing the Zero-Inflated Poisson Model
The code given above to compute the zero-inflated Poisson redundantly calculates all of the Bernoulli terms and also poisson_lpmf(0 | lambda) every time the first condition body executes. The use of the redundant terms is conditioned on y, which is known when the data are read in. This allows the transformed data block to be used to compute some more convenient terms for expressing the log density each iteration.
The number of zero cases is computed and handled separately. Then the nonzero cases are collected into their own array for vectorization. The number of zeros is required to declare y_nonzero, so it must be computed in a function.
functions {
  int num_zeros(int[] y) {
    int sum = 0;
    for (n in 1:size(y))
      sum += (y[n] == 0);
    return sum;
  }
}
...
transformed data {
  int<lower = 0> N_zero = num_zeros(y);
  int<lower = 1> y_nonzero[N - N_zero];
  int N_nonzero = 0;
  for (n in 1:N) {
    if (y[n] == 0) continue;
    N_nonzero += 1;
    y_nonzero[N_nonzero] = y[n];
  }
}
...
model {
  ...
  target += N_zero
              * log_sum_exp(bernoulli_lpmf(1 | theta),
                            bernoulli_lpmf(0 | theta)
                              + poisson_lpmf(0 | lambda));
  target += N_nonzero * bernoulli_lpmf(0 | theta);
  target += poisson_lpmf(y_nonzero | lambda);
  ...
}
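This collapsed form computes the same log density as the loop because every zero outcome contributes an identical term; writing \(N_{\textrm{zero}}\) and \(N_{\textrm{nonzero}}\) for the counts, the log likelihood factors as
\[
\log p(y|\theta,\lambda)
= N_{\textrm{zero}} \cdot \log\left(\theta + (1 - \theta) \cdot \mathsf{Poisson}(0|\lambda)\right)
+ N_{\textrm{nonzero}} \cdot \log(1 - \theta)
+ \sum_{n : y_n \neq 0} \log \mathsf{Poisson}(y_n|\lambda),
\]
which is exactly the three target increments above.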
The boundary conditions of all zeros and no zero outcomes are handled appropriately; in the vectorized case, if y_nonzero is empty, N_nonzero will be zero, and the last two target increment terms will add zeros.
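For posterior predictive checks, replicated data can be drawn from the zero-inflated distribution in the generated quantities block. The following is a minimal sketch assuming the same data and parameters as above; the variable name y_rep is just an illustrative choice.
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    // zero from the inflation component with probability theta,
    // otherwise a draw from the Poisson component
    y_rep[n] = bernoulli_rng(theta) ? 0 : poisson_rng(lambda);
}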
Hurdle Models
The hurdle model is similar to the zero-inflated model, but more flexible in that the zero outcomes can be deflated as well as inflated. The probability mass function for the hurdle likelihood is defined by
\[ p(y|\theta,\lambda) = \begin{cases} \ \theta & \mbox{if } y = 0, \mbox{ and} \\ \ (1 - \theta) \ \frac{\displaystyle \mathsf{Poisson}(y | \lambda)} {\displaystyle 1 - \mathsf{PoissonCDF}(0 | \lambda)} & \mbox{if } y > 0, \end{cases} \]
where \(\mathsf{PoissonCDF}\) is the cumulative distribution function for the Poisson distribution. The hurdle model is even more straightforward to program in Stan, as it does not require an explicit mixture.
if (y[n] == 0)
  1 ~ bernoulli(theta);
else {
  0 ~ bernoulli(theta);
  y[n] ~ poisson(lambda) T[1, ];
}
The Bernoulli statements are just shorthand for adding \(\log \theta\) and \(\log (1 - \theta)\) to the log density. The T[1, ] after the Poisson indicates that it is truncated below at 1; see the truncation section for more about truncation and the Poisson regression section for the specifics of the Poisson CDF. The net effect is equivalent to the direct definition of the log likelihood.
if (y[n] == 0)
  target += log(theta);
else
  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
              - poisson_lccdf(0 | lambda);
Julian King pointed out that because \[ \log \left( 1 - \mathsf{PoissonCDF}(0 | \lambda) \right) \ = \ \log \left( 1 - \mathsf{Poisson}(0 | \lambda) \right) \ = \ \log(1 - \exp(-\lambda)) \] the CCDF in the else clause can be replaced with a simpler expression.
target += log1m(theta) + poisson_lpmf(y[n] | lambda)
            - log1m_exp(-lambda);
The resulting code is about 15% faster than the code with the CCDF.
This is an example where collecting counts ahead of time can also greatly speed up execution without changing the density. For data size \(N=200\) and parameters \(\theta=0.3\) and \(\lambda = 8\), the speedup is a factor of 10; it will be lower for smaller \(N\) and greater for larger \(N\); it will also be greater for larger \(\theta\).
To achieve this speedup, it helps to have a function to count the number of zero entries in an array of integers,
functions {
  int num_zero(int[] y) {
    int nz = 0;
    for (n in 1:size(y))
      if (y[n] == 0)
        nz += 1;
    return nz;
  }
}
Then a transformed data block can be used to store the sufficient statistics,
transformed data {
  int<lower=0, upper=N> N0 = num_zero(y);
  int<lower=0, upper=N> Ngt0 = N - N0;
  int<lower=1> y_nz[N - num_zero(y)];
  {
    int pos = 1;
    for (n in 1:N) {
      if (y[n] != 0) {
        y_nz[pos] = y[n];
        pos += 1;
      }
    }
  }
}
The model block can then be reduced to three statements.
model {
  N0 ~ binomial(N, theta);
  y_nz ~ poisson(lambda);
  target += -Ngt0 * log1m_exp(-lambda);
}
The first statement accounts for the Bernoulli contribution to both the zero and non-zero counts. The second line is the Poisson contribution from the non-zero counts, which is now vectorized. Finally, the normalization for the truncation is a single line, so that the expression for the log CCDF at 0 isn’t repeated. Also note that the negation is applied to the constant Ngt0; whenever possible, leave subexpressions constant, because then gradients need not be propagated until a non-constant term is encountered.
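As a check that the binomial statement carries the correct Bernoulli terms, note that
\[
\log \mathsf{binomial}(N_0|N,\theta)
= \log \binom{N}{N_0} + N_0 \log \theta + (N - N_0) \log(1 - \theta),
\]
where the binomial coefficient depends only on the data and is dropped by the \(\sim\) statement; what remains is exactly the sum of the per-observation Bernoulli log probabilities from the non-collapsed formulation.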