5.6 Zero-inflated and hurdle models

This is an old version, view current version.

5.6 Zero-inflated and hurdle models

Zero-inflated and hurdle models both provide mixtures of a Poisson and Bernoulli probability mass function to allow more flexibility in modeling the probability of a zero outcome. Zero-inflated models, as defined by Lambert (1992), add additional probability mass to the outcome of zero. Hurdle models, on the other hand, are formulated as pure mixtures of zero and non-zero outcomes.

Zero inflation and hurdle models can be formulated for discrete distributions other than the Poisson. Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal as a regression coefficient prior.

Zero inflation

Consider the following example for zero-inflated Poisson distributions. It uses a parameter theta here there is a probability \(\theta\) of drawing a zero, and a probability \(1 - \theta\) of drawing from \(\textsf{Poisson}(\lambda)\) (now \(\theta\) is being used for mixing proportions because \(\lambda\) is the traditional notation for a Poisson mean parameter). The probability function is thus \[ p(y_n \mid \theta,\lambda) = \begin{cases} \theta + (1 - \theta) \times \textsf{Poisson}(0 \mid \lambda) & \quad\text{if } y_n = 0, \text{ and}\\ (1-\theta) \times \textsf{Poisson}(y_n \mid \lambda) &\quad\text{if } y_n > 0. \end{cases} \]

The log probability function can be implemented directly in Stan as follows.

data {
  int<lower=0> N;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> lambda;
}
model {
  for (n in 1:N) {
    if (y[n] == 0) {
      target += log_sum_exp(bernoulli_lpmf(1 | theta),
                            bernoulli_lpmf(0 | theta)
                              + poisson_lpmf(y[n] | lambda));
    } else {
      target += bernoulli_lpmf(0 | theta)
                  + poisson_lpmf(y[n] | lambda);
    }
  }
}

The log_sum_exp(lp1,lp2) function adds the log probabilities on the linear scale; it is defined to be equal to log(exp(lp1) + exp(lp2)), but is more arithmetically stable and faster.

Optimizing the zero-inflated Poisson model

The code given above to compute the zero-inflated Poisson redundantly calculates all of the Bernoulli terms and also poisson_lpmf(0 | lambda) every time the first condition body executes. The use of the redundant terms is conditioned on y, which is known when the data are read in. This allows the transformed data block to be used to compute some more convenient terms for expressing the log density each iteration.

The number of zero cases is computed and handled separately. Then the nonzero cases are collected into their own array for vectorization. The number of zeros is required to declare y_nonzero, so it must be computed in a function.

functions {
  int num_zeros(array[] int y) {
    int sum = 0;
    for (n in 1:size(y)) {
      sum += (y[n] == 0);
    }
    return sum;
  }
}
// ...
transformed data {
  int<lower=0> N_zero = num_zeros(y);
  array[N - N_zero] int<lower=1> y_nonzero;
  int N_nonzero = 0;
  for (n in 1:N) {
    if (y[n] == 0) continue;
    N_nonzero += 1;
    y_nonzero[N_nonzero] = y[n];
  }
}
// ...
model {
  // ...
   target
     += N_zero
          * log_sum_exp(bernoulli_lpmf(1 | theta),
                        bernoulli_lpmf(0 | theta)
                          + poisson_lpmf(0 | lambda));
   target += N_nonzero * bernoulli_lpmf(0 | theta);
   target += poisson_lpmf(y_nonzero | lambda);
  // ...
}

The boundary conditions of all zeros and no zero outcomes is handled appropriately; in the vectorized case, if y_nonzero is empty, N_nonzero will be zero, and the last two target increment terms will add zeros.

Hurdle models

The hurdle model is similar to the zero-inflated model, but more flexible in that the zero outcomes can be deflated as well as inflated. The probability mass function for the hurdle likelihood is defined by \[ p(y\mid\theta,\lambda) = \begin{cases} \theta &\quad\text{if } y = 0, \text{ and}\\ (1 - \theta) \frac{\displaystyle \textsf{Poisson}(y \mid \lambda)} {\displaystyle 1 - \textsf{PoissonCDF}(0 \mid \lambda)} &\quad\text{if } y > 0, \end{cases} \]

where \(\textsf{PoissonCDF}\) is the cumulative distribution function for the Poisson distribution. The hurdle model is even more straightforward to program in Stan, as it does not require an explicit mixture.

if (y[n] == 0) {
  1 ~ bernoulli(theta);
} else {
  0 ~ bernoulli(theta);
  y[n] ~ poisson(lambda) T[1, ];
}

The Bernoulli statements are just shorthand for adding \(\log \theta\) and \(\log (1 - \theta)\) to the log density. The T[1,] after the Poisson indicates that it is truncated below at 1; see the truncation section for more about truncation and the Poisson regression section for the specifics of the Poisson CDF. The net effect is equivalent to the direct definition of the log likelihood.

if (y[n] == 0) {
  target += log(theta);
} else {
  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
            - poisson_lccdf(0 | lambda));
}

Julian King pointed out that because \[\begin{align*} \log \left( 1 - \textsf{PoissonCDF}(0 \mid \lambda) \right) &= \log \left( 1 - \textsf{Poisson}(0 \mid \lambda) \right) \\ &= \log(1 - \exp(-\lambda)) \end{align*}\] the CCDF in the else clause can be replaced with a simpler expression.

target += log1m(theta) + poisson_lpmf(y[n] | lambda)
          - log1m_exp(-lambda));

The resulting code is about 15% faster than the code with the CCDF.

This is an example where collecting counts ahead of time can also greatly speed up the execution speed without changing the density. For data size \(N=200\) and parameters \(\theta=0.3\) and \(\lambda = 8\), the speedup is a factor of 10; it will be lower for smaller \(N\) and greater for larger \(N\); it will also be greater for larger \(\theta\).

To achieve this speedup, it helps to have a function to count the number of non-zero entries in an array of integers,

functions {
  int num_zero(array[] int y) {
    int nz = 0;
    for (n in 1:size(y)) {
      if (y[n] == 0) {
        nz += 1;
      } 
    }
    return nz;
  }
}

Then a transformed data block can be used to store the sufficient statistics,

transformed data {
  int<lower=0, upper=N> N0 = num_zero(y);
  int<lower=0, upper=N> Ngt0 = N - N0;
  array[N - num_zero(y)] int<lower=1> y_nz;
  {
    int pos = 1;
    for (n in 1:N) {
      if (y[n] != 0) {
        y_nz[pos] = y[n];
        pos += 1;
      }
    }
  }
}

The model block is then reduced to three statements.

model {
  N0 ~ binomial(N, theta);
  y_nz ~ poisson(lambda);
  target += -Ngt0 * log1m_exp(-lambda);
}

The first statement accounts for the Bernoulli contribution to both the zero and non-zero counts. The second line is the Poisson contribution from the non-zero counts, which is now vectorized. Finally, the normalization for the truncation is a single line, so that the expression for the log CCDF at 0 isn’t repeated. Also note that the negation is applied to the constant Ngt0; whenever possible, leave subexpressions constant because then gradients need not be propagated until a non-constant term is encountered.

References

Lambert, Diane. 1992. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics 34 (1).