20.2 Label Switching in Mixture Models

Where collinearity in regression models can lead to infinitely many posterior maxima, swapping components in a mixture model leads to finitely many posterior maxima.

Mixture Models

Consider a normal mixture model with two location parameters $\mu_1$ and $\mu_2$, a shared scale $\sigma > 0$, a mixture ratio $\theta \in [0, 1]$, and likelihood
$$
p(y \mid \theta, \mu_1, \mu_2, \sigma)
= \prod_{n=1}^N \big(\theta \, \textsf{normal}(y_n \mid \mu_1, \sigma) + (1 - \theta) \, \textsf{normal}(y_n \mid \mu_2, \sigma)\big).
$$
The issue here is exchangeability of the mixture components, because
$$
p(\theta, \mu_1, \mu_2, \sigma \mid y) = p(1 - \theta, \mu_2, \mu_1, \sigma \mid y).
$$
The problem is exacerbated as the number of mixture components $K$ grows, as in clustering models, leading to $K!$ identical posterior maxima.
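This model can be written directly in Stan. The following is a minimal sketch; the priors shown are illustrative assumptions and not part of the likelihood stated above.

```stan
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real<lower=0, upper=1> theta;
  real mu1;
  real mu2;
  real<lower=0> sigma;
}
model {
  // Illustrative priors; they are exchangeable in mu1 and mu2,
  // so nothing here breaks the symmetry between the two components.
  theta ~ beta(2, 2);
  mu1 ~ normal(0, 10);
  mu2 ~ normal(0, 10);
  sigma ~ lognormal(0, 1);
  // Marginalize the discrete component membership out of the likelihood.
  for (n in 1:N) {
    target += log_mix(theta,
                      normal_lpdf(y[n] | mu1, sigma),
                      normal_lpdf(y[n] | mu2, sigma));
  }
}
```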

Convergence Monitoring and Effective Sample Size

The analysis of posterior convergence and effective sample size is also difficult for mixture models. For example, the $\hat{R}$ convergence statistic reported by Stan and the computation of effective sample size are both compromised by label switching. The problem is that the posterior mean, a key ingredient in these computations, is affected by label switching, resulting in a posterior mean for $\mu_1$ that is equal to that of $\mu_2$, and a posterior mean for $\theta$ that is always $1/2$, no matter what the data are.
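To see why the posterior mean of $\theta$ is pinned to $1/2$, substitute $\theta' = 1 - \theta$, apply the exchangeability identity above, and relabel the dummy variables $\mu_1$ and $\mu_2$ in the integral:
$$
\mathbb{E}[\theta \mid y]
= \int \theta \, p(\theta, \mu_1, \mu_2, \sigma \mid y) \, \mathrm{d}\theta \, \mathrm{d}\mu_1 \, \mathrm{d}\mu_2 \, \mathrm{d}\sigma
= \int (1 - \theta') \, p(\theta', \mu_1, \mu_2, \sigma \mid y) \, \mathrm{d}\theta' \, \mathrm{d}\mu_1 \, \mathrm{d}\mu_2 \, \mathrm{d}\sigma
= 1 - \mathbb{E}[\theta \mid y],
$$
so $\mathbb{E}[\theta \mid y] = 1/2$ regardless of the data. The same argument shows that the marginal posteriors of $\mu_1$ and $\mu_2$ are identical.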

Some Inferences are Invariant

In some sense, the index (or label) of a mixture component is irrelevant. Posterior predictive inferences can still be carried out without identifying mixture components. For example, the log probability of a new observation does not depend on the identities of the mixture components. The only sound Bayesian inferences in such models are those that are invariant to label switching. Posterior means for the parameters are meaningless because they are not invariant to label switching; for example, the posterior mean for $\theta$ in the two-component mixture model will always be $1/2$.
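Such an invariant quantity can be computed in a generated quantities block. The sketch below assumes a held-out observation y_tilde has been declared in the data block; swapping $\mu_1$ with $\mu_2$ while replacing $\theta$ with $1 - \theta$ leaves the result unchanged.

```stan
generated quantities {
  // Log predictive density of the held-out observation y_tilde.
  // Invariant to label switching because
  // log_mix(theta, a, b) == log_mix(1 - theta, b, a).
  real log_p_tilde = log_mix(theta,
                             normal_lpdf(y_tilde | mu1, sigma),
                             normal_lpdf(y_tilde | mu2, sigma));
}
```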

Highly Multimodal Posteriors

Theoretically, such multimodality should not present a problem for inference, because all of the integrals involved in posterior predictive inference are well behaved. The problem in practice is computation.

Being able to carry out such invariant inferences in practice is an altogether different matter. It is almost always intractable to find even a single posterior mode, much less balance the exploration of the neighborhoods of multiple local maxima according to their probability masses. In Gibbs sampling, it is unlikely for $\mu_1$ to move to a new mode when sampled conditioned on the current values of $\mu_2$ and $\theta$. For HMC and NUTS, the problem is that the sampler gets stuck in one of the two “bowls” around the modes and cannot gather enough energy from random momentum assignment to move from one mode to another.

Even with a proper posterior, all known sampling and inference techniques are notoriously ineffective when the number of modes grows super-exponentially as it does for mixture models with increasing numbers of components.

Hacks as Fixes

Several hacks (i.e., “tricks”) have been suggested and employed to deal with the problems posed by label switching in practice.

Parameter Ordering Constraints

One common strategy is to impose a constraint on the parameters that identifies the components. For instance, we might consider constraining $\mu_1 < \mu_2$ in the two-component normal mixture model discussed above. A problem that can arise from such an approach is when there is substantial probability mass for the opposite ordering $\mu_1 > \mu_2$. In these cases, the posterior is affected by the constraint and the true posterior uncertainty in $\mu_1$ and $\mu_2$ is not captured by the model with the constraint. In addition, standard approaches to posterior inference for event probabilities are compromised. For instance, attempting to use $M$ posterior samples to estimate $\Pr[\mu_1 > \mu_2]$ will fail, because the estimator
$$
\Pr[\mu_1 > \mu_2] \approx \frac{1}{M} \sum_{m=1}^M \mathrm{I}\big(\mu_1^{(m)} > \mu_2^{(m)}\big)
$$
will result in an estimate of $0$ because the posterior respects the constraint in the model.
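In Stan, such an ordering can be imposed with the built-in ordered type. The following sketch revises the parameters and model blocks from the earlier example; the priors remain illustrative assumptions.

```stan
parameters {
  real<lower=0, upper=1> theta;
  ordered[2] mu;               // enforces mu[1] < mu[2]
  real<lower=0> sigma;
}
model {
  // Illustrative priors; the elementwise normal prior on mu is implicitly
  // restricted to the ordered region mu[1] < mu[2].
  theta ~ beta(2, 2);
  mu ~ normal(0, 10);
  sigma ~ lognormal(0, 1);
  for (n in 1:N) {
    target += log_mix(theta,
                      normal_lpdf(y[n] | mu[1], sigma),
                      normal_lpdf(y[n] | mu[2], sigma));
  }
}
```

With this parameterization, an indicator such as mu[1] > mu[2] is identically zero in every draw, which is exactly the degenerate estimator behavior described above.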

Initialization around a Single Mode

Another common approach is to run a single chain or to initialize the parameters near realistic values.[1]

This can work better than the hard constraint approach if reasonable initial values can be found and the labels do not switch within a Markov chain. The result is that all chains are glued to a neighborhood of a particular mode in the posterior.


[1] Tempering methods may be viewed as automated ways to carry out such a search for modes, though most MCMC tempering methods continue to search for modes on an ongoing basis; see (Swendsen and Wang 1986; Neal 1996b).