Finite Mixtures
Finite mixture models of an outcome assume that the outcome is drawn from one of several distributions, the identity of which is controlled by a categorical mixing distribution. Mixture models typically have multimodal densities with modes near the modes of the mixture components. Mixture models may be parameterized in several ways, as described in the following sections. Mixture models may be used directly for modeling data with multimodal distributions, or they may be used as priors for other parameters.
Relation to clustering
Clustering models, as discussed in the clustering chapter, are just a particular class of mixture models that have been widely applied to clustering in the engineering and machine-learning literature. The normal mixture model discussed in this chapter reappears in multivariate form as the statistical basis for the $K$-means algorithm; the latent Dirichlet allocation model, usually applied to clustering problems, can be viewed as a mixed-membership multinomial mixture model.
Latent discrete parameterization
One way to parameterize a mixture model is with a latent categorical variable indicating which mixture component was responsible for the outcome. For example, consider $K$ normal distributions with locations $\mu_k \in \mathbb{R}$ and scales $\sigma_k \in (0, \infty)$, mixed in proportion $\lambda$, where $\lambda_k \geq 0$ and $\sum_{k=1}^K \lambda_k = 1$ (i.e., $\lambda$ lies in the unit $K$-simplex). For each outcome $y_n$, a latent variable $z_n \in \{1, \dots, K\}$ is drawn from a categorical distribution with parameter $\lambda$,
$$z_n \sim \textsf{categorical}(\lambda).$$
The variable $y_n$ is then distributed according to the parameters of mixture component $z_n$,
$$y_n \sim \textsf{normal}(\mu_{z[n]}, \sigma_{z[n]}).$$
This model is not directly supported by Stan because it involves discrete parameters $z_n$, but Stan can sample $\mu$ and $\sigma$ by summing out the $z$ parameter as described in the next section.
Summing out the responsibility parameter
To implement the normal mixture model outlined in the previous section in Stan, the discrete parameters can be summed out of the model. If $Y$ is a mixture of $K$ normal distributions with locations $\mu_k$ and scales $\sigma_k$ in proportions $\lambda$ in the unit $K$-simplex, then
$$p_Y(y \mid \lambda, \mu, \sigma) = \sum_{k=1}^K \lambda_k \, \textsf{normal}(y \mid \mu_k, \sigma_k).$$
Log sum of exponentials: linear sums on the log scale
The log sum of exponentials function is used to define mixtures on the log scale. It is defined for two inputs by
$$\texttt{log\_sum\_exp}(a, b) = \log\big(\exp(a) + \exp(b)\big).$$
If $a$ and $b$ are probabilities on the log scale, then $\exp(a) + \exp(b)$ is their sum on the linear scale, and the outer log converts the result back to the log scale; in other words, log_sum_exp performs linear addition on the log scale. The advantage of using Stan's built-in log_sum_exp function is that it can prevent underflow and overflow in the exponentiation, by calculating the result as
$$\texttt{log\_sum\_exp}(a, b) = \max(a, b) + \log\big(\exp(a - \max(a, b)) + \exp(b - \max(a, b))\big).$$
In this evaluation, one of the shifted terms is zero and the other is negative, which reduces the chance of overflow or underflow.
For example, the mixture of $\textsf{normal}(-1, 2)$ with proportion 0.3 and $\textsf{normal}(3, 1)$ with proportion 0.7 is implemented in Stan as follows.
parameters {
  real y;
}
model {
  target += log_sum_exp(log(0.3) + normal_lpdf(y | -1, 2),
                        log(0.7) + normal_lpdf(y | 3, 1));
}
The log probability term is derived by taking
$$
\begin{aligned}
\log p_Y(y \mid \lambda, \mu, \sigma)
&= \log\big(0.3 \times \textsf{normal}(y \mid -1, 2) + 0.7 \times \textsf{normal}(y \mid 3, 1)\big) \\
&= \log\big(\exp(\log(0.3 \times \textsf{normal}(y \mid -1, 2))) + \exp(\log(0.7 \times \textsf{normal}(y \mid 3, 1)))\big) \\
&= \texttt{log\_sum\_exp}\big(\log(0.3) + \log \textsf{normal}(y \mid -1, 2),\ \log(0.7) + \log \textsf{normal}(y \mid 3, 1)\big).
\end{aligned}
$$
Dropping uniform mixture ratios
If a two-component mixture has a mixing ratio of 0.5, then the mixing ratios can be dropped, because
log_half = log(0.5);
for (n in 1:N) {
  target += log_sum_exp(log_half + normal_lpdf(y[n] | mu[1], sigma[1]),
                        log_half + normal_lpdf(y[n] | mu[2], sigma[2]));
}
adds the same constant log_half to every component inside log_sum_exp; this constant does not affect the proportional density, so the above can be replaced with the more efficient version
for (n in 1:N) {
  target += log_sum_exp(normal_lpdf(y[n] | mu[1], sigma[1]),
                        normal_lpdf(y[n] | mu[2], sigma[2]));
}
The same result holds if there are $K$ components and the mixing simplex is symmetric, i.e., $\lambda = (1/K, \dots, 1/K)$.
The result follows from the identity
$$\texttt{log\_sum\_exp}(c + a, c + b) = c + \texttt{log\_sum\_exp}(a, b)$$
and the fact that adding a constant to the target log density does not change the posterior, because the density is only defined up to a proportionality constant.
Recovering posterior mixture proportions
The posterior $p(z_n \mid y_n, \mu, \sigma, \lambda)$ over the mixture indicator $z_n \in 1{:}K$ is often of interest, as $\Pr[z_n = k \mid y_n, \mu, \sigma, \lambda]$ is the posterior probability that observation $y_n$ was generated by mixture component $k$. By Bayes's rule,
$$\Pr[z_n = k \mid y_n, \mu, \sigma, \lambda] \propto p(y_n \mid z_n = k, \mu, \sigma) \cdot \Pr[z_n = k \mid \lambda] = \textsf{normal}(y_n \mid \mu_k, \sigma_k) \cdot \lambda_k.$$
The normalization can be done via summation, because $z_n$ takes on only finitely many values,
$$\Pr[z_n = k \mid y_n, \mu, \sigma, \lambda] = \frac{p(y_n \mid z_n = k, \mu, \sigma) \cdot \Pr[z_n = k \mid \lambda]}{\sum_{k'=1}^K p(y_n \mid z_n = k', \mu, \sigma) \cdot \Pr[z_n = k' \mid \lambda]}.$$
On the log scale, the normalized probability is computed as
$$\log \Pr[z_n = k \mid y_n, \mu, \sigma, \lambda] = \log p(y_n \mid z_n = k, \mu, \sigma) + \log \Pr[z_n = k \mid \lambda] - \log \sum_{k'=1}^K \exp\big(\log p(y_n \mid z_n = k', \mu, \sigma) + \log \Pr[z_n = k' \mid \lambda]\big).$$
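This can be coded up directly in the generated quantities block. The following is a minimal sketch, assuming the $K$-component normal mixture defined in the next section (data y and parameters theta, mu, and sigma); the variable name responsibility is illustrative and not part of the original program. The built-in softmax function performs the exponentiation and normalization in one numerically stable step.

generated quantities {
  // responsibility[n, k] = posterior probability that y[n] came from component k
  array[N] vector[K] responsibility;
  for (n in 1:N) {
    vector[K] lps = log(theta);
    for (k in 1:K) {
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    responsibility[n] = softmax(lps);  // equals exp(lps - log_sum_exp(lps))
  }
}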
Estimating parameters of a mixture
Given the scheme for representing mixtures, it may be applied in an estimation setting, where the locations, scales, and mixing proportions are unknown. Further generalizing to a number of mixture components specified as data yields the following model.
data {
  int<lower=1> K;          // number of mixture components
  int<lower=1> N;          // number of data points
  array[N] real y;         // observations
}
parameters {
  simplex[K] theta;          // mixing proportions
  ordered[K] mu;             // locations of mixture components
  vector<lower=0>[K] sigma;  // scales of mixture components
}
model {
  vector[K] log_theta = log(theta);  // cache log calculation
  sigma ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  for (n in 1:N) {
    vector[K] lps = log_theta;
    for (k in 1:K) {
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lps);
  }
}
The model involves K mixture components and N data points. The mixing proportion parameter theta is declared to be a unit K-simplex, whereas the component location parameter mu and scale parameter sigma are both declared to be K-vectors.
The location parameter mu is declared to be an ordered vector in order to identify the model. This will not affect inferences that do not depend on the ordering of the components as long as the prior for the components mu[k] is symmetric, as it is here (each component has an independent normal(0, 10) prior).
The values in the scale vector sigma are constrained to be non-negative and have a weakly informative prior given in the model, chosen to avoid zero values and thus collapsing components.
The model declares a local vector variable lps of size K and uses it to accumulate the log contributions from the mixture components. The main action is in the loop over data points n. For each such point, the log of theta[k] * normal(y[n] | mu[k], sigma[k]) is computed and stored in lps[k]. Then the log probability is incremented with the log sum of exponentials of those values.
Vectorizing mixtures
There is (currently) no way to vectorize mixture models at the observation level in Stan. This section is to warn users away from attempting to vectorize naively, as it results in a different model. A proper mixture at the observation level is defined as follows, where we assume that lambda, y[n], mu[1], mu[2], and sigma[1], sigma[2] are all scalars and lambda is between 0 and 1.
for (n in 1:N) {
  target += log_sum_exp(log(lambda)
                          + normal_lpdf(y[n] | mu[1], sigma[1]),
                        log1m(lambda)
                          + normal_lpdf(y[n] | mu[2], sigma[2]));
}
or equivalently
for (n in 1:N) {
  target += log_mix(lambda,
                    normal_lpdf(y[n] | mu[1], sigma[1]),
                    normal_lpdf(y[n] | mu[2], sigma[2]));
}
This definition assumes that each observation y[n] may have arisen from either of the mixture components.
Contrast the previous model with the following (erroneous) attempt to vectorize the model.
target += log_sum_exp(log(lambda)
                        + normal_lpdf(y | mu[1], sigma[1]),
                      log1m(lambda)
                        + normal_lpdf(y | mu[2], sigma[2]));
or equivalently,
target += log_mix(lambda,
                  normal_lpdf(y | mu[1], sigma[1]),
                  normal_lpdf(y | mu[2], sigma[2]));
This second definition implies that the entire sequence of observations $y_1, \dots, y_N$ comes from one component or the other, defining a different model than the observation-level mixture.
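To make the difference concrete, here is the likelihood each version implements (a worked comparison added for clarity, not part of the original text). The observation-level mixture multiplies per-observation mixtures, whereas the naive vectorization mixes the joint likelihoods of the entire data set:
$$
\begin{aligned}
\text{observation-level mixture:} \quad & \prod_{n=1}^N \Big( \lambda \, \textsf{normal}(y_n \mid \mu_1, \sigma_1) + (1 - \lambda) \, \textsf{normal}(y_n \mid \mu_2, \sigma_2) \Big), \\
\text{naive vectorization:} \quad & \lambda \prod_{n=1}^N \textsf{normal}(y_n \mid \mu_1, \sigma_1) + (1 - \lambda) \prod_{n=1}^N \textsf{normal}(y_n \mid \mu_2, \sigma_2).
\end{aligned}
$$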
Inferences supported by mixtures
In many mixture models, the mixture components are underlyingly exchangeable in the model and thus not identifiable. This arises if the parameters of the mixture components have exchangeable priors and the mixture ratio gets a uniform prior so that the parameters of the mixture components are also exchangeable in the likelihood.
We have finessed this basic problem by ordering the parameters. This will allow us in some cases to pick out mixture components either ahead of time or after fitting (e.g., male vs. female, or Democrat vs. Republican).
In other cases, we do not care about the actual identities of the mixture components and want to consider inferences that are independent of indexes. For example, we might only be interested in posterior predictions for new observations.
Mixtures with unidentifiable components
As an example, consider the normal mixture from the previous section, which provides an exchangeable prior on the pairs of parameters $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$.
The prior on the mixture ratio is uniform, $\lambda \sim \textsf{uniform}(0, 1)$, so the parameters of the mixture components are exchangeable in both the prior and the likelihood, and the component labels are not identified.
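A short way to see the symmetry (added here for clarity): the per-observation likelihood satisfies
$$
p(y_n \mid \mu_1, \sigma_1, \mu_2, \sigma_2, \lambda)
= \lambda \, \textsf{normal}(y_n \mid \mu_1, \sigma_1) + (1 - \lambda) \, \textsf{normal}(y_n \mid \mu_2, \sigma_2)
= p(y_n \mid \mu_2, \sigma_2, \mu_1, \sigma_1, 1 - \lambda),
$$
so the parameter settings $(\mu_1, \sigma_1, \mu_2, \sigma_2, \lambda)$ and $(\mu_2, \sigma_2, \mu_1, \sigma_1, 1 - \lambda)$ yield identical likelihoods; with exchangeable priors, they also have identical posterior densities, which is what produces label switching.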
Inference under label switching
In cases where the mixture components are not identifiable, it can be difficult to diagnose convergence of sampling or optimization algorithms because the labels will switch, or be permuted, in different MCMC chains or different optimization runs. Luckily, posterior inferences which do not refer to specific component labels are invariant under label switching and may be used directly. This subsection considers a pair of examples.
Posterior predictive distribution
The posterior predictive density for a new observation $\tilde{y}$ averages the likelihood over the posterior,
$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, \textrm{d}\theta,$$
and is invariant to how the mixture components are labeled. The normal mixture example from the previous section, with mixing proportion lambda, can compute the log predictive density of a sequence of new observations in the generated quantities block as follows.
data {
  int<lower=0> N_tilde;
  vector[N_tilde] y_tilde;
  // ...
}
generated quantities {
  vector[N_tilde] log_p_y_tilde;
  for (n in 1:N_tilde) {
    log_p_y_tilde[n]
      = log_mix(lambda,
                normal_lpdf(y_tilde[n] | mu[1], sigma[1]),
                normal_lpdf(y_tilde[n] | mu[2], sigma[2]));
  }
}
It is a bit of a bother afterwards, because the logarithm function isn't linear and hence doesn't distribute through averages (Jensen's inequality shows which way the inequality goes). The right thing to do is to apply log_sum_exp across the posterior draws of log_p_y_tilde. The average log predictive density is then given by subtracting the log of the number of posterior draws.
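In symbols, writing $\texttt{log\_p\_y\_tilde}^{(s)}[n]$ for the generated quantity in posterior draw $s$ out of $S$ draws, the estimated log predictive density of $\tilde{y}_n$ is the log of the mean of the exponentiated per-draw values (a restatement of the procedure above):
$$
\log p(\tilde{y}_n \mid y)
\approx \log \frac{1}{S} \sum_{s=1}^S \exp\big(\texttt{log\_p\_y\_tilde}^{(s)}[n]\big)
= \log \sum_{s=1}^{S} \exp\big(\texttt{log\_p\_y\_tilde}^{(s)}[n]\big) - \log S.
$$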
Clustering and similarity
Often a mixture model will be applied to a clustering problem and there might be two data items y[i] and y[j] for which there is a question of whether they arose from the same mixture component. If z[i] and z[j] are the discrete responsibility variables, the quantity of interest is the posterior probability of the event z[i] == z[j].
As with other event probabilities, this can be calculated in the generated quantities block either by sampling z[i] and z[j] and using the indicator function on their equality, or by computing the probability of the event directly from the component responsibilities.
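Here is a minimal sketch of the direct computation for the two-component mixture above, assuming two items of interest y_i and y_j are supplied as data; the names p_zi, p_zj, and same_component are illustrative and not part of the original program. The posterior mean of same_component estimates the probability that the two items arose from the same component.

generated quantities {
  real same_component;
  {
    // posterior probability that y_i (resp. y_j) came from component 1,
    // computed as lambda times the component-1 density, normalized via log_mix
    real p_zi = exp(log(lambda) + normal_lpdf(y_i | mu[1], sigma[1])
                    - log_mix(lambda,
                              normal_lpdf(y_i | mu[1], sigma[1]),
                              normal_lpdf(y_i | mu[2], sigma[2])));
    real p_zj = exp(log(lambda) + normal_lpdf(y_j | mu[1], sigma[1])
                    - log_mix(lambda,
                              normal_lpdf(y_j | mu[1], sigma[1]),
                              normal_lpdf(y_j | mu[2], sigma[2])));
    // both items from component 1 or both from component 2
    same_component = p_zi * p_zj + (1 - p_zi) * (1 - p_zj);
  }
}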
Zero-inflated and hurdle models
Zero-inflated and hurdle models both provide mixtures of a Poisson and Bernoulli probability mass function to allow more flexibility in modeling the probability of a zero outcome. Zero-inflated models, as defined by Lambert (1992), add additional probability mass to the outcome of zero. Hurdle models, on the other hand, are formulated as pure mixtures of zero and non-zero outcomes.
Zero inflation and hurdle models can be formulated for discrete distributions other than the Poisson. Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal distribution used as a regression-coefficient prior. Hurdle models, in contrast, can be formulated as a combination of a point mass at zero and a continuous distribution for positive values.
Zero inflation
Consider the following example for zero-inflated Poisson distributions. There is a probability $\theta$ of drawing a zero outright, and a probability $1 - \theta$ of drawing from a Poisson distribution with rate $\lambda$ (here $\theta$ is used for the mixing proportion because $\lambda$ is the traditional notation for the Poisson rate parameter). The probability function is thus
$$p(y_n \mid \theta, \lambda) = \theta \cdot I[y_n = 0] + (1 - \theta) \cdot \textsf{Poisson}(y_n \mid \lambda),$$
where $I[\cdot]$ is the indicator function.
Stan does not support distribution statements (with ~) that are conditional on a discrete parameter, so we work with the corresponding likelihood directly, incrementing the target log density (with target +=) as follows.
data {
  int<lower=0> N;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> lambda;
}
model {
  for (n in 1:N) {
    if (y[n] == 0) {
      target += log_sum_exp(log(theta),
                            log1m(theta)
                              + poisson_lpmf(y[n] | lambda));
    } else {
      target += log1m(theta)
                  + poisson_lpmf(y[n] | lambda);
    }
  }
}
The log1m(theta) term computes log(1 - theta), but is more computationally stable than computing it directly. The log_sum_exp(lp1, lp2) function adds the log probabilities on the linear scale; it is defined to be equal to log(exp(lp1) + exp(lp2)), but is more computationally stable and faster.
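As a stylistic alternative (not from the original text), the zero branch can equivalently be written with log_mix, which is a drop-in replacement for the two-argument log_sum_exp call above:

for (n in 1:N) {
  if (y[n] == 0) {
    // log(theta * 1 + (1 - theta) * Poisson(0 | lambda))
    target += log_mix(theta, 0, poisson_lpmf(0 | lambda));
  } else {
    target += log1m(theta) + poisson_lpmf(y[n] | lambda);
  }
}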
Optimizing the zero-inflated Poisson model
The code given above to compute the zero-inflated Poisson redundantly calculates all of the Bernoulli terms and also poisson_lpmf(0 | lambda)
every time the first condition body executes. The use of the redundant terms is conditioned on y
, which is known when the data are read in. This allows the transformed data block to be used to compute some more convenient terms for expressing the log density each iteration.
The number of zero cases is computed and handled separately. Then the nonzero cases are collected into their own array for vectorization. The number of zeros is required to declare y_nonzero
, so it must be computed in a function.
functions {
  int num_zeros(array[] int y) {
    int sum = 0;
    for (n in 1:size(y)) {
      sum += (y[n] == 0);
    }
    return sum;
  }
}
// ...
transformed data {
  int<lower=0> N_zero = num_zeros(y);
  array[N - N_zero] int<lower=1> y_nonzero;
  int N_nonzero = 0;
  for (n in 1:N) {
    if (y[n] == 0) continue;
    N_nonzero += 1;
    y_nonzero[N_nonzero] = y[n];
  }
}
// ...
model {
  // ...
  target
    += N_zero
         * log_sum_exp(log(theta),
                       log1m(theta)
                         + poisson_lpmf(0 | lambda));
  target += N_nonzero * log1m(theta);
  target += poisson_lpmf(y_nonzero | lambda);
  // ...
}
The boundary conditions of all zeros and no zero outcomes are handled appropriately; in the vectorized case, if y_nonzero is empty, N_nonzero will be zero, and the last two target increment terms will add zero.
Hurdle models
The hurdle model is similar to the zero-inflated model, but more flexible in that the zero outcomes can be deflated as well as inflated. Given a probability $\theta$ of a zero outcome and a Poisson rate $\lambda$ for the positive outcomes, the hurdle distribution is defined by
$$
p(y \mid \theta, \lambda) =
\begin{cases}
\theta & \text{if } y = 0, \\[4pt]
(1 - \theta) \, \dfrac{\textsf{Poisson}(y \mid \lambda)}{1 - \textsf{PoissonCDF}(0 \mid \lambda)} & \text{if } y > 0,
\end{cases}
$$
where PoissonCDF is the cumulative distribution function of the Poisson distribution. The corresponding log likelihood for observations $y_1, \dots, y_N$ is
$$
\log p(y \mid \theta, \lambda)
= \sum_{n \,:\, y_n = 0} \log \theta
+ \sum_{n \,:\, y_n > 0} \Big( \log(1 - \theta) + \log \textsf{Poisson}(y_n \mid \lambda) - \log\big(1 - \textsf{PoissonCDF}(0 \mid \lambda)\big) \Big).
$$
The hurdle model is even more straightforward to program in Stan, as it does not require an explicit mixture.
if (y[n] == 0) {
  target += log(theta);
} else {
  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
            - poisson_lccdf(0 | lambda);
}
Julian King pointed out that because
$$1 - \textsf{PoissonCDF}(0 \mid \lambda) = 1 - \textsf{Poisson}(0 \mid \lambda) = 1 - \exp(-\lambda),$$
the CCDF term in the else clause can be replaced with the simpler and more stable log1m_exp form:

  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
            - log1m_exp(-lambda);
The resulting code is about 15% faster than the code with the CCDF.
This is an example where collecting counts ahead of time can also greatly speed up execution without changing the density.
To achieve this speedup, it helps to have a function to count the number of zero entries in an array of integers,
functions {
  int num_zero(array[] int y) {
    int nz = 0;
    for (n in 1:size(y)) {
      if (y[n] == 0) {
        nz += 1;
      }
    }
    return nz;
  }
}
Then a transformed data block can be used to store the sufficient statistics,
transformed data {
  int<lower=0, upper=N> N0 = num_zero(y);
  int<lower=0, upper=N> Ngt0 = N - N0;
  array[N - num_zero(y)] int<lower=1> y_nz;
  {
    int pos = 1;
    for (n in 1:N) {
      if (y[n] != 0) {
        y_nz[pos] = y[n];
        pos += 1;
      }
    }
  }
}
The model block is then reduced to three statements.
model {
  N0 ~ binomial(N, theta);
  y_nz ~ poisson(lambda);
  target += -Ngt0 * log1m_exp(-lambda);
}
The first statement accounts for the Bernoulli contribution to both the zero and non-zero counts. The second line is the Poisson contribution from the non-zero counts, which is now vectorized. Finally, the normalization for the truncation is a single line, so that the expression for the log CCDF at 0 isn’t repeated. Also note that the negation is applied to the constant Ngt0
; whenever possible, leave subexpressions constant because then gradients need not be propagated until a non-constant term is encountered.
Priors and effective data size in mixture models
Suppose we have a two-component mixture model with mixing rate $\lambda \in (0, 1)$. Because the likelihood contribution of each mixture component is weighted by its mixing probability, the effective data size informing each component is also proportional to that probability: although there are $N$ observations in total, the two components are informed by roughly $\lambda N$ and $(1 - \lambda) N$ observations, respectively. As a consequence, the priors on the component parameters have relatively more influence for components with small mixing proportions than they would in a non-mixture model fit to all $N$ observations.
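As a hypothetical numerical illustration, with $N = 1000$ observations and a mixing rate of $\lambda = 0.05$, the smaller component is informed by roughly $\lambda N = 50$ observations, so its prior carries about as much weight as it would in a model fit to only 50 data points.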
Comparison to model averaging
In contrast to mixture models, which create mixtures at the observation level, model averaging creates mixtures over the posteriors of models separately fit with the entire data set. In this situation, the priors work as expected when fitting the models independently, with the posteriors being based on the complete observed data.
If different models are expected to account for different observations, we recommend building mixture models directly. If the models being mixed are similar, often a single expanded model will capture the features of both and may be used on its own for inferential purposes (estimation, decision making, prediction, etc.). For example, rather than fitting an intercept-only regression and a slope-only regression and averaging their predictions, even as a mixture model, we would recommend building a single regression with both a slope and an intercept. Model complexity, such as having more predictors than data points, can be tamed using appropriately regularizing priors. If computation becomes a bottleneck, the only recourse can be model averaging, which can be calculated after fitting each model independently (see Hoeting et al. (1999) and Gelman et al. (2013) for theoretical and computational details).
References
Footnotes
Imposing a constraint such as $\lambda < 0.5$ on the mixing proportion will resolve the symmetry, but fundamentally changes the model and its posterior inferences.