5.3 Summing out the Responsibility Parameter
To implement the normal mixture model outlined in the previous section in Stan, the discrete parameters can be summed out of the model. If \(Y\) is a mixture of \(K\) normal distributions with locations \(\mu_k\) and scales \(\sigma_k\) with mixing proportions \(\lambda\) in the unit \(K\)-simplex, then \[ p_Y(y | \lambda, \mu, \sigma) \ = \ \sum_{k=1}^K \lambda_k \, \mathsf{normal}(y \, | \, \mu_k, \sigma_k). \]
Log Sum of Exponentials: Linear Sums on the Log Scale
The log sum of exponentials function is used to define mixtures on the log scale. It is defined for two inputs by
\[ \mbox{log\_sum\_exp}(a, b) = \log (\exp(a) + \exp(b)). \]
If \(a\) and \(b\) are probabilities on the log scale, then \(\exp(a) + \exp(b)\) is their sum on the linear scale, and the outer log converts
the result back to the log scale; to summarize, log_sum_exp does
linear addition on the log scale. The reason to use Stan’s built-in
log_sum_exp
function is that it can prevent underflow and overflow
in the exponentiation, by calculating the result as
\[ \log\left( \exp(a) + \exp(b)\right) = c + \log \left( \exp(a - c) + \exp(b - c) \right), \]
where \(c = \max(a, b)\). In this evaluation, one of the terms, \(a - c\)
or \(b - c\), is zero and the other is negative, thus eliminating the
possibility of overflow or underflow in the leading term while
extracting the most arithmetic precision possible by pulling the
max(a, b)
out of the log-exp round trip.
For example, the mixture of \(\mathsf{normal}(-1, 2)\) with \(\mathsf{normal}(3, 1)\), with mixing proportion \(\lambda = [0.3,0.7]^{\top}\), can be implemented in Stan as follows.
parameters {
real y;
}
model {
target += log_sum_exp(log(0.3) + normal_lpdf(y | -1, 2),
log(0.7) + normal_lpdf(y | 3, 1));
}
The log probability term is derived by taking
\[ \log p(y | \lambda,\mu,\sigma) = \log\!\left( 0.3 * \mathsf{normal}(y|-1,2) \, + \, 0.7 * \mathsf{normal}(y|3,1) \, \right) \\ = \log( \exp(\log(0.3 * \mathsf{normal}(y|-1,2))) + \exp(\log(0.7 * \mathsf{normal}(y|3,1))) ) \\ = \mbox{log\_sum\_exp}( \log(0.3) + \log \mathsf{normal}(y|-1,2), \log(0.7) + \log \mathsf{normal}(y|3,1) ). \]
Dropping uniform mixture ratios
If a two-component mixture has a mixing ratio of 0.5, then the mixing ratios can be dropped, because
neg_log_half = -log(0.5);
for (n in 1:N)
target += log_sum_exp(neg_log_half
+ normal_lpdf(y[n] | mu[1], sigma[1]),
neg_log_half
+ normal_lpdf(y[n] | mu[2], sigma[2]));
then the \(-\log 0.5\) term isn’t contributing to the proportional density, and the above can be replaced with the more efficient version
for (n in 1:N)
target += log_sum_exp(normal_lpdf(y[n] | mu[1], sigma[1]),
normal_lpdf(y[n] | mu[2], sigma[2]));
The same result holds if there are \(K\) components and the mixing simplex \(\lambda\) is symmetric, i.e.,
\[ \lambda = \left( \frac{1}{K}, \ldots, \frac{1}{K} \right). \]
The result follows from the identity
\[ \mbox{log\_sum\_exp}(c + a, c + b) \ = \ c + \mbox{log\_sum\_exp(a, b)} \]
and the fact that adding a constant \(c\) to the log density accumulator has no effect because the log density is only specified up to an additive constant in the first place. There is nothing specific to the normal distribution here; constants may always be dropped from the target.
Recovering posterior mixture proportions
The posterior \(p(z_n \mid y_n, \mu, \sigma)\) over the mixture indicator \(z_n \in 1:K\) is often of interest as \(p(z_n = k \mid y, \mu, \sigma)\) is the posterior probability that that observation \(y_n\) was generated by mixture component \(k\). The posterior can be computed via Bayes’s rule, \[ \begin{array}{rcl} \mbox{Pr}(z_n = k \mid y_n, \mu, \sigma, \lambda) & \propto & p(y_n \mid z_n = k, \mu, \sigma) p(z_n = k \mid \lambda) \\ & = & \mathsf{normal}(y_n \mid \mu_k, \sigma_k) \cdot \lambda_k. \end{array} \]
The normalization can be done via summation, because \(z_n \in 1{:}K\) only takes on finitely many values. In detail, \[ p(z_n = k \mid y_n, \mu, \sigma, \lambda) \ = \ \frac{p(y_n \mid z_n = k, \mu, \sigma) \cdot p(z_n = k \mid \lambda)} {\sum_{k' = 1}^K p(y_n \mid z_n = k', \mu, \sigma) \cdot p(z_n = k' \mid \lambda)}. \]
On the log scale, the normalized probability is computed as \[ \begin{array}{l} \log \mbox{Pr}(z_n = k \mid y_n, \mu, \sigma, \lambda) \\ \mbox { } \ \ \ = \log p(y_n \mid z_n = k, \mu, \sigma) + \log \mbox{Pr}(z_n = k \mid \lambda) - \mathrm{log\_sum\_exp}_{k' = 1}^K \log p(y_n \mid z_n = k', \mu, \sigma) + \log p(z_n = k' \mid \lambda). \end{array} \] This can be coded up directly in Stan; the change-point model in the change point section provides an example.
Estimating Parameters of a Mixture
Given the scheme for representing mixtures, it may be moved to an estimation setting, where the locations, scales, and mixture components are unknown. Further generalizing to a number of mixture components specified as data yields the following model.
data {
int<lower=1> K; // number of mixture components
int<lower=1> N; // number of data points
real y[N]; // observations
}
parameters {
simplex[K] theta; // mixing proportions
ordered[K] mu; // locations of mixture components
vector<lower=0>[K] sigma; // scales of mixture components
}
model {
vector[K] log_theta = log(theta); // cache log calculation
sigma ~ lognormal(0, 2);
mu ~ normal(0, 10);
for (n in 1:N) {
vector[K] lps = log_theta;
for (k in 1:K)
lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
target += log_sum_exp(lps);
}
}
The model involves K
mixture components and N
data
points. The mixing proportion parameter theta
is declared to be
a unit \(K\)-simplex, whereas the component location parameter mu
and scale parameter sigma
are both defined to be
K
-vectors.
The location parameter mu
is declared to be an ordered vector
in order to identify the model. This will not affect inferences that
do not depend on the ordering of the components as long as the prior
for the components mu[k]
is symmetric, as it is here (each
component has an independent \(\mathsf{normal}(0, 10)\) prior). It
would even be possible to include a hierarchical prior for the components.
The values in the scale array sigma
are constrained to be
non-negative, and have a weakly informative prior given in the model
chosen to avoid zero values and thus collapsing components.
The model declares a local array variable lps
to be size
K
and uses it to accumulate the log contributions from the
mixture components. The main action is in the loop over data points
n
. For each such point, the log of \(\theta_k * \mathsf{normal}(y_n \, | \, \mu_k,\sigma_k)\) is calculated and added to the
array lpps
. Then the log probability is incremented with the log
sum of exponentials of those values.