This is an old version, view current version.

5.6 Zero-inflated and hurdle models

Zero-inflated and hurdle models both provide mixtures of a Poisson and Bernoulli probability mass function to allow more flexibility in modeling the probability of a zero outcome. Zero-inflated models, as defined by Lambert (1992), add additional probability mass to the outcome of zero. Hurdle models, on the other hand, are formulated as pure mixtures of zero and non-zero outcomes.

Zero inflation and hurdle models can be formulated for discrete distributions other than the Poisson. Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal as a regression coefficient prior. Hurdle models can be formulated as combination of point mass at zero and continuous distribution for positive values.

Zero inflation

Consider the following example for zero-inflated Poisson distributions. There is a probability \(\theta\) of observing a zero, and a probability \(1 - \theta\) of observing a count with a \(\textsf{Poisson}(\lambda)\) distribution (now \(\theta\) is being used for mixing proportions because \(\lambda\) is the traditional notation for a Poisson mean parameter). Given the probability \(\theta\) and the intensity \(\lambda\), the distribution for \(y_n\) can be written as \[ y_n \sim \begin{cases} 0 & \quad\text{with probability } \theta, \text{ and}\\ \textsf{Poisson}(y_n \mid \lambda) & \quad\text{with probability } 1-\theta. \end{cases} \]

Stan does not support conditional sampling statements (with ~) conditional on some parameter, and we need to consider the corresponding likelihood \[ p(y_n \mid \theta,\lambda) = \begin{cases} \theta + (1 - \theta) \times \textsf{Poisson}(0 \mid \lambda) & \quad\text{if } y_n = 0, \text{ and}\\ (1-\theta) \times \textsf{Poisson}(y_n \mid \lambda) &\quad\text{if } y_n > 0. \end{cases} \] The log likelihood can be implemented directly in Stan (with target +=) as follows.

data {
  int<lower=0> N;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> lambda;
}
model {
  for (n in 1:N) {
    if (y[n] == 0) {
      target += log_sum_exp(log(theta),
                            log1m(theta)
                              + poisson_lpmf(y[n] | lambda));
    } else {
      target += log1m(theta)
                  + poisson_lpmf(y[n] | lambda);
    }
  }
}

The log1m(theta) computes log(1-theta), but is more computationally stable. The log_sum_exp(lp1,lp2) function adds the log probabilities on the linear scale; it is defined to be equal to log(exp(lp1) + exp(lp2)), but is more computationally stable and faster.

Optimizing the zero-inflated Poisson model

The code given above to compute the zero-inflated Poisson redundantly calculates all of the Bernoulli terms and also poisson_lpmf(0 | lambda) every time the first condition body executes. The use of the redundant terms is conditioned on y, which is known when the data are read in. This allows the transformed data block to be used to compute some more convenient terms for expressing the log density each iteration.

The number of zero cases is computed and handled separately. Then the nonzero cases are collected into their own array for vectorization. The number of zeros is required to declare y_nonzero, so it must be computed in a function.

functions {
  int num_zeros(array[] int y) {
    int sum = 0;
    for (n in 1:size(y)) {
      sum += (y[n] == 0);
    }
    return sum;
  }
}
// ...
transformed data {
  int<lower=0> N_zero = num_zeros(y);
  array[N - N_zero] int<lower=1> y_nonzero;
  int N_nonzero = 0;
  for (n in 1:N) {
    if (y[n] == 0) continue;
    N_nonzero += 1;
    y_nonzero[N_nonzero] = y[n];
  }
}
// ...
model {
  // ...
   target
     += N_zero
          * log_sum_exp(log(theta),
                        log1m(theta)
                          + poisson_lpmf(0 | lambda));
   target += N_nonzero * log1m(theta);
   target += poisson_lpmf(y_nonzero | lambda);
  // ...
}

The boundary conditions of all zeros and no zero outcomes is handled appropriately; in the vectorized case, if y_nonzero is empty, N_nonzero will be zero, and the last two target increment terms will add zeros.

Hurdle models

The hurdle model is similar to the zero-inflated model, but more flexible in that the zero outcomes can be deflated as well as inflated. Given the probability \(\theta\) and the intensity \(\lambda\), the distribution for \(y_n\) can be written as \[ y_n \sim \begin{cases} 0 & \quad\text{with probability } \theta, \text{and }\\ \textsf{Poisson}_{x\neq 0}(y_n \mid \lambda) & \quad\text{with probability } 1-\theta, \end{cases} \] Where \(\textsf{Poisson}_{x\neq 0}\) is a truncated Poisson distribution, truncated at \(0\).

The corresponding likelihood function for the hurdle model is defined by \[ p(y\mid\theta,\lambda) = \begin{cases} \theta &\quad\text{if } y = 0, \text{ and}\\ (1 - \theta) \frac{\displaystyle \textsf{Poisson}(y \mid \lambda)} {\displaystyle 1 - \textsf{PoissonCDF}(0 \mid \lambda)} &\quad\text{if } y > 0, \end{cases} \] where \(\textsf{PoissonCDF}\) is the cumulative distribution function for the Poisson distribution and and \(1 - \textsf{PoissonCDF}(0 \mid \lambda)\) is the relative normalization term for the truncated Poisson (truncated at \(0\)).

The hurdle model is even more straightforward to program in Stan, as it does not require an explicit mixture.

if (y[n] == 0) {
  target += log(theta);
} else {
  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
            - poisson_lccdf(0 | lambda));
}

Julian King pointed out that because \[\begin{align*} \log \left( 1 - \textsf{PoissonCDF}(0 \mid \lambda) \right) &= \log \left( 1 - \textsf{Poisson}(0 \mid \lambda) \right) \\ &= \log(1 - \exp(-\lambda)) \end{align*}\] the CCDF in the else clause can be replaced with a simpler expression.

target += log1m(theta) + poisson_lpmf(y[n] | lambda)
          - log1m_exp(-lambda));

The resulting code is about 15% faster than the code with the CCDF.

This is an example where collecting counts ahead of time can also greatly speed up the execution speed without changing the density. For data size \(N=200\) and parameters \(\theta=0.3\) and \(\lambda = 8\), the speedup is a factor of 10; it will be lower for smaller \(N\) and greater for larger \(N\); it will also be greater for larger \(\theta\).

To achieve this speedup, it helps to have a function to count the number of non-zero entries in an array of integers,

functions {
  int num_zero(array[] int y) {
    int nz = 0;
    for (n in 1:size(y)) {
      if (y[n] == 0) {
        nz += 1;
      } 
    }
    return nz;
  }
}

Then a transformed data block can be used to store the sufficient statistics,

transformed data {
  int<lower=0, upper=N> N0 = num_zero(y);
  int<lower=0, upper=N> Ngt0 = N - N0;
  array[N - num_zero(y)] int<lower=1> y_nz;
  {
    int pos = 1;
    for (n in 1:N) {
      if (y[n] != 0) {
        y_nz[pos] = y[n];
        pos += 1;
      }
    }
  }
}

The model block is then reduced to three statements.

model {
  N0 ~ binomial(N, theta);
  y_nz ~ poisson(lambda);
  target += -Ngt0 * log1m_exp(-lambda);
}

The first statement accounts for the Bernoulli contribution to both the zero and non-zero counts. The second line is the Poisson contribution from the non-zero counts, which is now vectorized. Finally, the normalization for the truncation is a single line, so that the expression for the log CCDF at 0 isn’t repeated. Also note that the negation is applied to the constant Ngt0; whenever possible, leave subexpressions constant because then gradients need not be propagated until a non-constant term is encountered.

References

Lambert, Diane. 1992. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics 34 (1).