5.5 Inferences supported by mixtures
In many mixture models, the mixture components are underlyingly exchangeable in the model and thus not identifiable. This arises when the parameters of the mixture components are given exchangeable priors and the mixture ratio is given a uniform prior, since the mixture components already enter the likelihood symmetrically.
We have finessed this basic problem by ordering the parameters. This will allow us in some cases to pick out mixture components either ahead of time or after fitting (e.g., male vs. female, or Democrat vs. Republican).
In other cases, we do not care about the actual identities of the mixture components and want to consider inferences that are independent of indexes. For example, we might only be interested in posterior predictions for new observations.
Mixtures with unidentifiable components
As an example, consider the normal mixture from the previous section, which provides an exchangeable prior on the pairs of parameters \((\mu_1, \sigma_1)\) and \((\mu_2, \sigma_2)\), \[\begin{align*} \mu_1, \mu_2 &\sim \textsf{normal}(0, 10) \\ \sigma_1, \sigma_2 &\sim \textsf{halfnormal}(0, 10). \end{align*}\]
The prior on the mixture ratio is uniform, \[ \lambda \sim \textsf{uniform}(0, 1), \] so that with the likelihood \[ p\left(y_n \mid \mu, \sigma\right) = \lambda \times \textsf{normal}\left(y_n \mid \mu_1, \sigma_1\right) + (1 - \lambda) \times \textsf{normal}\left(y_n \mid \mu_2, \sigma_2\right), \] the joint distribution \(p(y, \mu, \sigma, \lambda)\) is exchangeable in the parameters \((\mu_1, \sigma_1)\) and \((\mu_2, \sigma_2)\) with \(\lambda\) flipping to \(1 - \lambda\).12
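For concreteness, here is a minimal sketch of this unidentified program, with no ordering constraint on the locations; it assumes the same data declarations (N observations in the vector y) as the previous section, and its parameter names match the code later in this section.

data {
  int<lower = 0> N;
  vector[N] y;
}
parameters {
  vector[2] mu;                        // exchangeable prior on locations
  vector<lower = 0>[2] sigma;          // half-normal prior on scales
  real<lower = 0, upper = 1> lambda;   // implicit uniform(0, 1) prior
}
model {
  mu ~ normal(0, 10);
  sigma ~ normal(0, 10);
  for (n in 1:N)
    target += log_mix(lambda,
                      normal_lpdf(y[n] | mu[1], sigma[1]),
                      normal_lpdf(y[n] | mu[2], sigma[2]));
}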
Inference under label switching
In cases where the mixture components are not identifiable, it can be difficult to diagnose convergence of sampling or optimization algorithms because the labels will switch, or be permuted, in different MCMC chains or different optimization runs. Luckily, posterior inferences which do not refer to specific component labels are invariant under label switching and may be used directly. This subsection considers a pair of examples.
Predictive likelihood
The predictive likelihood for a new observation \(\tilde{y}\) averages the likelihood \(p(\tilde{y} \mid \theta)\) over the posterior for the complete parameter vector \(\theta\), \[ p(\tilde{y} \mid y) = \int_{\theta} p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, \textsf{d}\theta. \]
In the normal mixture example from the previous section, with \(\theta = (\mu, \sigma, \lambda)\), the likelihood \(p(\tilde{y} \mid \theta)\) returns the same density when the component labels are switched, so the predictive inference is sound. In Stan, that predictive inference can be carried out either by computing \(p(\tilde{y} \mid y)\), which is more statistically efficient in terms of effective sample size, or by simulating draws of \(\tilde{y}\), which is easier to plug into other inferences. Both approaches can be coded directly in the generated quantities block of the program. Here’s an example of the direct (non-sampling) approach.
data {
  int<lower = 0> N_tilde;
  vector[N_tilde] y_tilde;
  ...
generated quantities {
  // log p(y_tilde[n] | lambda, mu, sigma) for each held-out observation
  vector[N_tilde] log_p_y_tilde;
  for (n in 1:N_tilde)
    log_p_y_tilde[n]
      = log_mix(lambda,
                normal_lpdf(y_tilde[n] | mu[1], sigma[1]),
                normal_lpdf(y_tilde[n] | mu[2], sigma[2]));
}
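The alternative, sampling-based approach mentioned above instead generates draws of \(\tilde{y}\) with RNG calls. A minimal sketch, where z_sim and y_sim are illustrative names rather than part of the program above:

generated quantities {
  // simulate the component, then the observation;
  // z_sim is 1 with probability lambda and 2 otherwise
  int<lower = 1, upper = 2> z_sim = 2 - bernoulli_rng(lambda);
  real y_sim = normal_rng(mu[z_sim], sigma[z_sim]);
}

Each posterior draw then yields one simulated \(\tilde{y}\), and the collection of y_sim values is a sample from the posterior predictive distribution.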
For the direct approach, a little care is needed afterwards, because the logarithm isn’t linear and hence doesn’t distribute through averages (Jensen’s inequality shows which way the resulting inequality goes). The right thing to do is to apply log_sum_exp across the posterior draws of log_p_y_tilde[n] and then subtract the log of the number of posterior draws; this yields the log of the posterior mean density, which estimates the log predictive density \(\log p(\tilde{y}_n \mid y)\).
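Concretely, if \(\theta^{(1)}, \dots, \theta^{(M)}\) are the posterior draws, then for each held-out observation \[ \log p(\tilde{y}_n \mid y) \approx \log \frac{1}{M} \sum_{m=1}^M p\left(\tilde{y}_n \mid \theta^{(m)}\right) = \log \sum_{m=1}^M \exp\left(\log p\left(\tilde{y}_n \mid \theta^{(m)}\right)\right) - \log M, \] where the terms inside the exponential are the posterior draws of log_p_y_tilde[n] and the outer log-sum-exp should be evaluated stably (e.g., by subtracting the maximum before exponentiating).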
Clustering and similarity
Often a mixture model will be applied to a clustering problem, and there may be two data items \(y_i\) and \(y_j\) for which we want to know whether they arose from the same mixture component. If we take \(z_i\) and \(z_j\) to be the discrete variables indicating which component is responsible for each observation, then the quantity of interest is the event \(z_i = z_j\), whose probability is \[ \operatorname{Pr}[z_i = z_j \mid y] = \int_{\theta} \frac{\sum_{k=1}^2 p(z_i = k, z_j = k, y_i, y_j \mid \theta)} {\sum_{k=1}^2 \sum_{m=1}^2 p(z_i = k, z_j = m, y_i, y_j \mid \theta)} \, p(\theta \mid y) \, \textsf{d}\theta. \]
As with other event probabilities, this can be calculated in the generated quantities block either by sampling \(z_i\) and \(z_j\) and using the indicator function on their equality, or by computing the term inside the integral as a generated quantity. As with predictive likelihood, working in expectation is more statistically efficient than sampling.
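A sketch of the expectation-based version, assuming the indices i and j are passed in as data and that y, mu, sigma, and lambda are as in the programs above (same_component is an illustrative name):

generated quantities {
  // Pr[z_i = z_j | theta] for the current draw; averaging these values
  // over posterior draws estimates Pr[z_i = z_j | y]
  real same_component;
  {
    vector[2] lp_i = [log(lambda) + normal_lpdf(y[i] | mu[1], sigma[1]),
                      log1m(lambda) + normal_lpdf(y[i] | mu[2], sigma[2])]';
    vector[2] lp_j = [log(lambda) + normal_lpdf(y[j] | mu[1], sigma[1]),
                      log1m(lambda) + normal_lpdf(y[j] | mu[2], sigma[2])]';
    same_component = exp(log_sum_exp(lp_i + lp_j)
                         - log_sum_exp(lp_i) - log_sum_exp(lp_j));
  }
}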
12. Imposing a constraint such as \(\lambda < 0.5\) will resolve the symmetry, but fundamentally changes the model and its posterior inferences.