5.3 Summing out the Responsibility Parameter

This is an old version, view current version.

5.3 Summing out the Responsibility Parameter

To implement the normal mixture model outlined in the previous section in Stan, the discrete parameters can be summed out of the model. If \(Y\) is a mixture of \(K\) normal distributions with locations \(\mu_k\) and scales \(\sigma_k\) with mixing proportions \(\lambda\) in the unit \(K\)-simplex, then \[ p_Y(y | \lambda, \mu, \sigma) \ = \ \sum_{k=1}^K \lambda_k \, \mathsf{normal}(y \, | \, \mu_k, \sigma_k). \]

Log Sum of Exponentials: Linear Sums on the Log Scale

The log sum of exponentials function is used to define mixtures on the log scale. It is defined for two inputs by

\[ \mbox{log\_sum\_exp}(a, b) = \log (\exp(a) + \exp(b)). \]

If \(a\) and \(b\) are probabilities on the log scale, then \(\exp(a) + \exp(b)\) is their sum on the linear scale, and the outer log converts the result back to the log scale; to summarize, log_sum_exp does linear addition on the log scale. The reason to use Stan’s built-in log_sum_exp function is that it can prevent underflow and overflow in the exponentiation, by calculating the result as

\[ \log\left( \exp(a) + \exp(b)\right) = c + \log \left( \exp(a - c) + \exp(b - c) \right), \]

where \(c = \max(a, b)\). In this evaluation, one of the terms, \(a - c\) or \(b - c\), is zero and the other is negative, thus eliminating the possibility of overflow or underflow in the leading term while extracting the most arithmetic precision possible by pulling the \(\max(a, b)\) out of the log-exp round trip.

For example, the mixture of \(\mathsf{normal}(-1, 2)\) with \(\mathsf{normal}(3, 1)\), with mixing proportion \(\lambda = [0.3,0.7]^{\top}\), can be implemented in Stan as follows.

parameters {
  real y;
}
model {
  target += log_sum_exp(log(0.3) + normal_lpdf(y | -1, 2),
                        log(0.7) + normal_lpdf(y | 3, 1));
}

The log probability term is derived by taking

\[ \log p(y | \lambda,\mu,\sigma) = \log\!\left( 0.3 * \mathsf{normal}(y|-1,2) \, + \, 0.7 * \mathsf{normal}(y|3,1) \, \right) \\ = \log( \exp(\log(0.3 * \mathsf{normal}(y|-1,2))) + \exp(\log(0.7 * \mathsf{normal}(y|3,1))) ) \\ = \mbox{log\_sum\_exp}( \log(0.3) + \log \mathsf{normal}(y|-1,2), \log(0.7) + \log \mathsf{normal}(y|3,1) ). \]

Dropping uniform mixture ratios

If a two-component mixture has a mixing ratio of 0.5, then the mixing ratios can be dropped, because

log_half = log(0.5);
for (n in 1:N)
  target += log_sum_exp(neg_log_half
                        + normal_lpdf(y[n] | mu[1], sigma[1]),
                        log_half
            + normal_lpdf(y[n] | mu[2], sigma[2]));

then the \(\log 0.5\) term isn’t contributing to the proportional density, and the above can be replaced with the more efficient version

for (n in 1:N)
  target += log_sum_exp(normal_lpdf(y[n] | mu[1], sigma[1]),
                        normal_lpdf(y[n] | mu[2], sigma[2]));

The same result holds if there are \(K\) components and the mixing simplex \(\lambda\) is symmetric, i.e.,

\[ \lambda = \left( \frac{1}{K}, \ldots, \frac{1}{K} \right). \]

The result follows from the identity

\[ \mbox{log\_sum\_exp}(c + a, c + b) \ = \ c + \mbox{log\_sum\_exp(a, b)} \]

and the fact that adding a constant \(c\) to the log density accumulator has no effect because the log density is only specified up to an additive constant in the first place. There is nothing specific to the normal distribution here; constants may always be dropped from the target.

Recovering posterior mixture proportions

The posterior \(p(z_n \mid y_n, \mu, \sigma)\) over the mixture indicator \(z_n \in 1:K\) is often of interest as \(p(z_n = k \mid y, \mu, \sigma)\) is the posterior probability that that observation \(y_n\) was generated by mixture component \(k\). The posterior can be computed via Bayes’s rule, \[ \begin{array}{rcl} \mbox{Pr}(z_n = k \mid y_n, \mu, \sigma, \lambda) & \propto & p(y_n \mid z_n = k, \mu, \sigma) p(z_n = k \mid \lambda) \\ & = & \mathsf{normal}(y_n \mid \mu_k, \sigma_k) \cdot \lambda_k. \end{array} \]

The normalization can be done via summation, because \(z_n \in 1{:}K\) only takes on finitely many values. In detail, \[ p(z_n = k \mid y_n, \mu, \sigma, \lambda) \ = \ \frac{p(y_n \mid z_n = k, \mu, \sigma) \cdot p(z_n = k \mid \lambda)} {\sum_{k' = 1}^K p(y_n \mid z_n = k', \mu, \sigma) \cdot p(z_n = k' \mid \lambda)}. \]

On the log scale, the normalized probability is computed as \[ \begin{array}{l} \log \mbox{Pr}(z_n = k \mid y_n, \mu, \sigma, \lambda) \\ \mbox { } \ \ \ = \log p(y_n \mid z_n = k, \mu, \sigma) + \log \mbox{Pr}(z_n = k \mid \lambda) - \mathrm{log\_sum\_exp}_{k' = 1}^K \log p(y_n \mid z_n = k', \mu, \sigma) + \log p(z_n = k' \mid \lambda). \end{array} \] This can be coded up directly in Stan; the change-point model in the change point section provides an example.

Estimating Parameters of a Mixture

Given the scheme for representing mixtures, it may be moved to an estimation setting, where the locations, scales, and mixture components are unknown. Further generalizing to a number of mixture components specified as data yields the following model.

data {
  int<lower=1> K;          // number of mixture components
  int<lower=1> N;          // number of data points
  real y[N];               // observations
}
parameters {
  simplex[K] theta;          // mixing proportions
  ordered[K] mu;             // locations of mixture components
  vector<lower=0>[K] sigma;  // scales of mixture components
}
model {
  vector[K] log_theta = log(theta);  // cache log calculation
  sigma ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  for (n in 1:N) {
    vector[K] lps = log_theta;
    for (k in 1:K)
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    target += log_sum_exp(lps);
  }
}

The model involves K mixture components and N data points. The mixing proportion parameter theta is declared to be a unit \(K\)-simplex, whereas the component location parameter mu and scale parameter sigma are both defined to be K-vectors.

The location parameter mu is declared to be an ordered vector in order to identify the model. This will not affect inferences that do not depend on the ordering of the components as long as the prior for the components mu[k] is symmetric, as it is here (each component has an independent \(\mathsf{normal}(0, 10)\) prior). It would even be possible to include a hierarchical prior for the components.

The values in the scale array sigma are constrained to be non-negative, and have a weakly informative prior given in the model chosen to avoid zero values and thus collapsing components.

The model declares a local array variable lps to be size K and uses it to accumulate the log contributions from the mixture components. The main action is in the loop over data points n. For each such point, the log of \(\theta_k * \mathsf{normal}(y_n \, | \, \mu_k,\sigma_k)\) is calculated and added to the array lpps. Then the log probability is incremented with the log sum of exponentials of those values.