Finite Mixtures
Finite mixture models of an outcome assume that the outcome is drawn from one of several distributions, the identity of which is controlled by a categorical mixing distribution. Mixture models typically have multimodal densities with modes near the modes of the mixture components. Mixture models may be parameterized in several ways, as described in the following sections. Mixture models may be used directly for modeling data with multimodal distributions, or they may be used as priors for other parameters.
Relation to clustering
Clustering models, as discussed in the clustering chapter, are just a particular class of mixture models that have been widely applied to clustering in the engineering and machine-learning literature. The normal mixture model discussed in this chapter reappears in multivariate form as the statistical basis for the $K$-means algorithm; the latent Dirichlet allocation model, usually applied to clustering problems, can be viewed as a mixed-membership multinomial mixture model.
Latent discrete parameterization
One way to parameterize a mixture model is with a latent categorical variable indicating which mixture component was responsible for the outcome. For example, consider $K$ normal distributions with locations $\mu_k \in \mathbb{R}$ and scales $\sigma_k \in (0, \infty)$, mixed in proportion $\lambda$, where $\lambda_k \geq 0$ and $\sum_{k=1}^K \lambda_k = 1$ (i.e., $\lambda$ lies in the unit $K$-simplex). For each outcome $y_n$, a latent variable $z_n \in \{1, \dots, K\}$ is drawn from a categorical distribution with parameter $\lambda$,
$$z_n \sim \textsf{categorical}(\lambda).$$
The variable $y_n$ is then distributed according to the parameters of mixture component $z_n$,
$$y_n \sim \textsf{normal}(\mu_{z[n]}, \sigma_{z[n]}).$$
This model is not directly supported by Stan because it involves discrete parameters $z_n$, but Stan can sample $\mu$ and $\sigma$ by summing out the $z$ parameter as described in the next section.
Summing out the responsibility parameter
To implement the normal mixture model outlined in the previous section in Stan, the discrete parameters can be summed out of the model. If $Y$ is a mixture of $K$ normal distributions with locations $\mu_k$ and scales $\sigma_k$ in proportions $\lambda$ in the unit $K$-simplex, then
$$p_Y(y \mid \lambda, \mu, \sigma) = \sum_{k=1}^K \lambda_k \, \textsf{normal}(y \mid \mu_k, \sigma_k).$$
Log sum of exponentials: linear sums on the log scale
The log sum of exponentials function is used to define mixtures on the log scale. It is defined for two inputs by
$$\texttt{log\_sum\_exp}(a, b) = \log\big(\exp(a) + \exp(b)\big).$$
If $a$ and $b$ are probabilities on the log scale, then $\exp(a) + \exp(b)$ is their sum on the linear scale, and the outer log converts the result back to the log scale; in other words, log_sum_exp performs linear addition on the log scale. The advantage of using Stan's built-in log_sum_exp function is that it can prevent underflow and overflow in the exponentiation, by calculating the result as
$$\texttt{log\_sum\_exp}(a, b) = \max(a, b) + \log\big(\exp(a - \max(a, b)) + \exp(b - \max(a, b))\big).$$
In this evaluation, one of the shifted terms is zero and the other is negative, which reduces the chance of overflow or underflow.
For example, the mixture of $\textsf{normal}(-1, 2)$ with proportion 0.3 and $\textsf{normal}(3, 1)$ with proportion 0.7 is implemented in Stan as follows.
parameters {
  real y;
}
model {
  target += log_sum_exp(log(0.3) + normal_lpdf(y | -1, 2),
                        log(0.7) + normal_lpdf(y | 3, 1));
}
The log probability term is derived by taking
$$
\begin{aligned}
\log p_Y(y \mid \lambda, \mu, \sigma)
&= \log\big(0.3 \times \textsf{normal}(y \mid -1, 2) + 0.7 \times \textsf{normal}(y \mid 3, 1)\big) \\
&= \log\big(\exp(\log(0.3 \times \textsf{normal}(y \mid -1, 2))) + \exp(\log(0.7 \times \textsf{normal}(y \mid 3, 1)))\big) \\
&= \texttt{log\_sum\_exp}\big(\log(0.3) + \log \textsf{normal}(y \mid -1, 2),\ \log(0.7) + \log \textsf{normal}(y \mid 3, 1)\big).
\end{aligned}
$$
Dropping uniform mixture ratios
If a two-component mixture has a mixing ratio of 0.5, then the mixing ratios can be dropped, because
log_half = log(0.5);
for (n in 1:N) {
  target += log_sum_exp(log_half + normal_lpdf(y[n] | mu[1], sigma[1]),
                        log_half + normal_lpdf(y[n] | mu[2], sigma[2]));
}
adds the same constant log_half to every component inside log_sum_exp; this constant does not affect the proportional density, so the above can be replaced with the more efficient version
for (n in 1:N) {
  target += log_sum_exp(normal_lpdf(y[n] | mu[1], sigma[1]),
                        normal_lpdf(y[n] | mu[2], sigma[2]));
}
The same result holds if there are $K$ components and the mixing simplex is symmetric, i.e., $\lambda = (1/K, \dots, 1/K)$.
The result follows from the identity
$$\texttt{log\_sum\_exp}(c + a, c + b) = c + \texttt{log\_sum\_exp}(a, b)$$
and the fact that adding a constant to the target log density does not change the posterior, because the density is only defined up to a proportionality constant.
Recovering posterior mixture proportions
The posterior $p(z_n \mid y_n, \mu, \sigma, \lambda)$ over the mixture indicator $z_n \in 1{:}K$ is often of interest, as $\Pr[z_n = k \mid y_n, \mu, \sigma, \lambda]$ is the posterior probability that observation $y_n$ was generated by mixture component $k$. By Bayes's rule,
$$\Pr[z_n = k \mid y_n, \mu, \sigma, \lambda] \propto p(y_n \mid z_n = k, \mu, \sigma) \cdot \Pr[z_n = k \mid \lambda] = \textsf{normal}(y_n \mid \mu_k, \sigma_k) \cdot \lambda_k.$$
The normalization can be done via summation, because $z_n$ takes on only finitely many values,
$$\Pr[z_n = k \mid y_n, \mu, \sigma, \lambda] = \frac{p(y_n \mid z_n = k, \mu, \sigma) \cdot \Pr[z_n = k \mid \lambda]}{\sum_{k'=1}^K p(y_n \mid z_n = k', \mu, \sigma) \cdot \Pr[z_n = k' \mid \lambda]}.$$
On the log scale, the normalized probability is computed as
$$\log \Pr[z_n = k \mid y_n, \mu, \sigma, \lambda] = \log p(y_n \mid z_n = k, \mu, \sigma) + \log \Pr[z_n = k \mid \lambda] - \log \sum_{k'=1}^K \exp\big(\log p(y_n \mid z_n = k', \mu, \sigma) + \log \Pr[z_n = k' \mid \lambda]\big).$$
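This can be coded up directly in the generated quantities block. The following is a minimal sketch, assuming the $K$-component normal mixture defined in the next section (data y and parameters theta, mu, and sigma); the variable name responsibility is illustrative and not part of the original program. The built-in softmax function performs the exponentiation and normalization in one numerically stable step.

generated quantities {
  // responsibility[n, k] = posterior probability that y[n] came from component k
  array[N] vector[K] responsibility;
  for (n in 1:N) {
    vector[K] lps = log(theta);
    for (k in 1:K) {
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    responsibility[n] = softmax(lps);  // equals exp(lps - log_sum_exp(lps))
  }
}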
Estimating parameters of a mixture
Given the scheme for representing mixtures, it may be applied in an estimation setting, where the locations, scales, and mixing proportions are unknown. Further generalizing to a number of mixture components specified as data yields the following model.
data {
  int<lower=1> K;          // number of mixture components
  int<lower=1> N;          // number of data points
  array[N] real y;         // observations
}
parameters {
  simplex[K] theta;          // mixing proportions
  ordered[K] mu;             // locations of mixture components
  vector<lower=0>[K] sigma;  // scales of mixture components
}
model {
  vector[K] log_theta = log(theta);  // cache log calculation
  sigma ~ lognormal(0, 2);
  mu ~ normal(0, 10);
  for (n in 1:N) {
    vector[K] lps = log_theta;
    for (k in 1:K) {
      lps[k] += normal_lpdf(y[n] | mu[k], sigma[k]);
    }
    target += log_sum_exp(lps);
  }
}
The model involves K mixture components and N data points. The mixing proportion parameter theta is declared to be a unit K-simplex, whereas the component location parameter mu and scale parameter sigma are both declared to be K-vectors.
The location parameter mu is declared to be an ordered vector in order to identify the model. This will not affect inferences that do not depend on the ordering of the components as long as the prior for the components mu[k] is symmetric, as it is here (each component has an independent normal(0, 10) prior).
The values in the scale vector sigma are constrained to be non-negative and have a weakly informative prior given in the model, chosen to avoid zero values and thus collapsing components.
The model declares a local vector variable lps of size K and uses it to accumulate the log contributions from the mixture components. The main action is in the loop over data points n. For each such point, the log of theta[k] * normal(y[n] | mu[k], sigma[k]) is computed and stored in lps[k]. Then the log probability is incremented with the log sum of exponentials of those values.
Vectorizing mixtures
There is (currently) no way to vectorize mixture models at the observation level in Stan. This section is to warn users away from attempting to vectorize naively, as it results in a different model. A proper mixture at the observation level is defined as follows, where we assume that lambda, y[n], mu[1], mu[2], and sigma[1], sigma[2] are all scalars and lambda is between 0 and 1.
for (n in 1:N) {
  target += log_sum_exp(log(lambda)
                          + normal_lpdf(y[n] | mu[1], sigma[1]),
                        log1m(lambda)
                          + normal_lpdf(y[n] | mu[2], sigma[2]));
}
or equivalently
for (n in 1:N) {
  target += log_mix(lambda,
                    normal_lpdf(y[n] | mu[1], sigma[1]),
                    normal_lpdf(y[n] | mu[2], sigma[2]));
}
This definition assumes that each observation y[n] may have arisen from either of the mixture components.
Contrast the previous model with the following (erroneous) attempt to vectorize the model.
target += log_sum_exp(log(lambda)
                        + normal_lpdf(y | mu[1], sigma[1]),
                      log1m(lambda)
                        + normal_lpdf(y | mu[2], sigma[2]));
or equivalently,
target += log_mix(lambda,
                  normal_lpdf(y | mu[1], sigma[1]),
                  normal_lpdf(y | mu[2], sigma[2]));
This second definition implies that the entire sequence of observations $y_1, \dots, y_N$ comes from one component or the other, defining a different model than the observation-level mixture.
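To make the difference concrete, here is the likelihood each version implements (a worked comparison added for clarity, not part of the original text). The observation-level mixture multiplies per-observation mixtures, whereas the naive vectorization mixes the joint likelihoods of the entire data set:
$$
\begin{aligned}
\text{observation-level mixture:} \quad & \prod_{n=1}^N \Big( \lambda \, \textsf{normal}(y_n \mid \mu_1, \sigma_1) + (1 - \lambda) \, \textsf{normal}(y_n \mid \mu_2, \sigma_2) \Big), \\
\text{naive vectorization:} \quad & \lambda \prod_{n=1}^N \textsf{normal}(y_n \mid \mu_1, \sigma_1) + (1 - \lambda) \prod_{n=1}^N \textsf{normal}(y_n \mid \mu_2, \sigma_2).
\end{aligned}
$$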
Inferences supported by mixtures
In many mixture models, the mixture components are underlyingly exchangeable in the model and thus not identifiable. This arises if the parameters of the mixture components have exchangeable priors and the mixture ratio gets a uniform prior so that the parameters of the mixture components are also exchangeable in the likelihood.
We have finessed this basic problem by ordering the parameters. This will allow us in some cases to pick out mixture components either ahead of time or after fitting (e.g., male vs. female, or Democrat vs. Republican).
In other cases, we do not care about the actual identities of the mixture components and want to consider inferences that are independent of indexes. For example, we might only be interested in posterior predictions for new observations.
Mixtures with unidentifiable components
As an example, consider the normal mixture from the previous section, which provides an exchangeable prior on the pairs of parameters $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$.
The prior on the mixture ratio is uniform, $\lambda \sim \textsf{uniform}(0, 1)$, so the parameters of the mixture components are exchangeable in both the prior and the likelihood, and the component labels are not identified.
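A short way to see the symmetry (added here for clarity): the per-observation likelihood satisfies
$$
p(y_n \mid \mu_1, \sigma_1, \mu_2, \sigma_2, \lambda)
= \lambda \, \textsf{normal}(y_n \mid \mu_1, \sigma_1) + (1 - \lambda) \, \textsf{normal}(y_n \mid \mu_2, \sigma_2)
= p(y_n \mid \mu_2, \sigma_2, \mu_1, \sigma_1, 1 - \lambda),
$$
so the parameter settings $(\mu_1, \sigma_1, \mu_2, \sigma_2, \lambda)$ and $(\mu_2, \sigma_2, \mu_1, \sigma_1, 1 - \lambda)$ yield identical likelihoods; with exchangeable priors, they also have identical posterior densities, which is what produces label switching.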
Inference under label switching
In cases where the mixture components are not identifiable, it can be difficult to diagnose convergence of sampling or optimization algorithms because the labels will switch, or be permuted, in different MCMC chains or different optimization runs. Luckily, posterior inferences which do not refer to specific component labels are invariant under label switching and may be used directly. This subsection considers a pair of examples.
Posterior predictive distribution
The posterior predictive density for a new observation $\tilde{y}$ averages the likelihood over the posterior,
$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, \textrm{d}\theta,$$
and is invariant to how the mixture components are labeled. The normal mixture example from the previous section, with mixing proportion lambda, can compute the log predictive density of a sequence of new observations in the generated quantities block as follows.
data {
  int<lower=0> N_tilde;
  vector[N_tilde] y_tilde;
  // ...
}
generated quantities {
  vector[N_tilde] log_p_y_tilde;
  for (n in 1:N_tilde) {
    log_p_y_tilde[n]
      = log_mix(lambda,
                normal_lpdf(y_tilde[n] | mu[1], sigma[1]),
                normal_lpdf(y_tilde[n] | mu[2], sigma[2]));
  }
}
It is a bit of a bother afterwards, because the logarithm function isn't linear and hence doesn't distribute through averages (Jensen's inequality shows which way the inequality goes). The right thing to do is to apply log_sum_exp across the posterior draws of log_p_y_tilde. The average log predictive density is then given by subtracting the log of the number of posterior draws.
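In symbols, writing $\texttt{log\_p\_y\_tilde}^{(s)}[n]$ for the generated quantity in posterior draw $s$ out of $S$ draws, the estimated log predictive density of $\tilde{y}_n$ is the log of the mean of the exponentiated per-draw values (a restatement of the procedure above):
$$
\log p(\tilde{y}_n \mid y)
\approx \log \frac{1}{S} \sum_{s=1}^S \exp\big(\texttt{log\_p\_y\_tilde}^{(s)}[n]\big)
= \log \sum_{s=1}^{S} \exp\big(\texttt{log\_p\_y\_tilde}^{(s)}[n]\big) - \log S.
$$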
Clustering and similarity
Often a mixture model will be applied to a clustering problem and there might be two data items y[i] and y[j] for which there is a question of whether they arose from the same mixture component. If z[i] and z[j] are the discrete responsibility variables, the quantity of interest is the posterior probability of the event z[i] == z[j].
As with other event probabilities, this can be calculated in the generated quantities block either by sampling z[i] and z[j] and using the indicator function on their equality, or by computing the probability of the event directly from the component responsibilities.
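Here is a minimal sketch of the direct computation for the two-component mixture above, assuming two items of interest y_i and y_j are supplied as data; the names p_zi, p_zj, and same_component are illustrative and not part of the original program. The posterior mean of same_component estimates the probability that the two items arose from the same component.

generated quantities {
  real same_component;
  {
    // posterior probability that y_i (resp. y_j) came from component 1,
    // computed as lambda times the component-1 density, normalized via log_mix
    real p_zi = exp(log(lambda) + normal_lpdf(y_i | mu[1], sigma[1])
                    - log_mix(lambda,
                              normal_lpdf(y_i | mu[1], sigma[1]),
                              normal_lpdf(y_i | mu[2], sigma[2])));
    real p_zj = exp(log(lambda) + normal_lpdf(y_j | mu[1], sigma[1])
                    - log_mix(lambda,
                              normal_lpdf(y_j | mu[1], sigma[1]),
                              normal_lpdf(y_j | mu[2], sigma[2])));
    // both items from component 1 or both from component 2
    same_component = p_zi * p_zj + (1 - p_zi) * (1 - p_zj);
  }
}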
Zero-inflated and hurdle models
Zero-inflated and hurdle models both provide mixtures of a Poisson and Bernoulli probability mass function to allow more flexibility in modeling the probability of a zero outcome. Zero-inflated models, as defined by Lambert (1992), add additional probability mass to the outcome of zero. Hurdle models, on the other hand, are formulated as pure mixtures of zero and non-zero outcomes.
Zero inflation and hurdle models can be formulated for discrete distributions other than the Poisson. Zero inflation does not work for continuous distributions in Stan because of issues with derivatives; in particular, there is no way to add a point mass to a continuous distribution, such as zero-inflating a normal distribution used as a regression-coefficient prior. Hurdle models, in contrast, can be formulated as a combination of a point mass at zero and a continuous distribution for positive values.
Zero inflation
Consider the following example for zero-inflated Poisson distributions. There is a probability $\theta$ of drawing a zero outright, and a probability $1 - \theta$ of drawing from a Poisson distribution with rate $\lambda$ (here $\theta$ is used for the mixing proportion because $\lambda$ is the traditional notation for the Poisson rate parameter). The probability function is thus
$$p(y_n \mid \theta, \lambda) = \theta \cdot I[y_n = 0] + (1 - \theta) \cdot \textsf{Poisson}(y_n \mid \lambda),$$
where $I[\cdot]$ is the indicator function.
Stan does not support distribution statements (with ~) that are conditional on a discrete parameter, so we work with the corresponding likelihood directly, incrementing the target log density (with target +=) as follows.
data {
  int<lower=0> N;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0, upper=1> theta;
  real<lower=0> lambda;
}
model {
  for (n in 1:N) {
    if (y[n] == 0) {
      target += log_sum_exp(log(theta),
                            log1m(theta)
                              + poisson_lpmf(y[n] | lambda));
    } else {
      target += log1m(theta)
                  + poisson_lpmf(y[n] | lambda);
    }
  }
}
The log1m(theta) term computes log(1 - theta), but is more computationally stable than computing it directly. The log_sum_exp(lp1, lp2) function adds the log probabilities on the linear scale; it is defined to be equal to log(exp(lp1) + exp(lp2)), but is more computationally stable and faster.
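As a stylistic alternative (not from the original text), the zero branch can equivalently be written with log_mix, which is a drop-in replacement for the two-argument log_sum_exp call above:

for (n in 1:N) {
  if (y[n] == 0) {
    // log(theta * 1 + (1 - theta) * Poisson(0 | lambda))
    target += log_mix(theta, 0, poisson_lpmf(0 | lambda));
  } else {
    target += log1m(theta) + poisson_lpmf(y[n] | lambda);
  }
}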
Optimizing the zero-inflated Poisson model
The code given above to compute the zero-inflated Poisson redundantly calculates all of the Bernoulli terms and also poisson_lpmf(0 | lambda)
every time the first condition body executes. The use of the redundant terms is conditioned on y
, which is known when the data are read in. This allows the transformed data block to be used to compute some more convenient terms for expressing the log density each iteration.
The number of zero cases is computed and handled separately. Then the nonzero cases are collected into their own array for vectorization. The number of zeros is required to declare y_nonzero
, so it must be computed in a function.
functions {
  int num_zeros(array[] int y) {
    int sum = 0;
    for (n in 1:size(y)) {
      sum += (y[n] == 0);
    }
    return sum;
  }
}
// ...
transformed data {
  int<lower=0> N_zero = num_zeros(y);
  array[N - N_zero] int<lower=1> y_nonzero;
  int N_nonzero = 0;
  for (n in 1:N) {
    if (y[n] == 0) continue;
    N_nonzero += 1;
    y_nonzero[N_nonzero] = y[n];
  }
}
// ...
model {
  // ...
  target
    += N_zero
         * log_sum_exp(log(theta),
                       log1m(theta)
                         + poisson_lpmf(0 | lambda));
  target += N_nonzero * log1m(theta);
  target += poisson_lpmf(y_nonzero | lambda);
  // ...
}
The boundary conditions of all zeros and no zero outcomes are handled appropriately; in the vectorized case, if y_nonzero is empty, N_nonzero will be zero, and the last two target increment terms will add zero.
Hurdle models
The hurdle model is similar to the zero-inflated model, but more flexible in that the zero outcomes can be deflated as well as inflated. Given a probability $\theta$ of a zero outcome and a Poisson rate $\lambda$ for the positive outcomes, the hurdle distribution is defined by
$$
p(y \mid \theta, \lambda) =
\begin{cases}
\theta & \text{if } y = 0, \\[4pt]
(1 - \theta) \, \dfrac{\textsf{Poisson}(y \mid \lambda)}{1 - \textsf{PoissonCDF}(0 \mid \lambda)} & \text{if } y > 0,
\end{cases}
$$
where PoissonCDF is the cumulative distribution function of the Poisson distribution. The corresponding log likelihood for observations $y_1, \dots, y_N$ is
$$
\log p(y \mid \theta, \lambda)
= \sum_{n \,:\, y_n = 0} \log \theta
+ \sum_{n \,:\, y_n > 0} \Big( \log(1 - \theta) + \log \textsf{Poisson}(y_n \mid \lambda) - \log\big(1 - \textsf{PoissonCDF}(0 \mid \lambda)\big) \Big).
$$
The hurdle model is even more straightforward to program in Stan, as it does not require an explicit mixture.
if (y[n] == 0) {
  target += log(theta);
} else {
  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
            - poisson_lccdf(0 | lambda);
}
Julian King pointed out that because
$$1 - \textsf{PoissonCDF}(0 \mid \lambda) = 1 - \textsf{Poisson}(0 \mid \lambda) = 1 - \exp(-\lambda),$$
the CCDF term in the else clause can be replaced with the simpler and more stable log1m_exp form:

  target += log1m(theta) + poisson_lpmf(y[n] | lambda)
            - log1m_exp(-lambda);
The resulting code is about 15% faster than the code with the CCDF.
This is an example where collecting counts ahead of time can also greatly speed up execution without changing the density.
To achieve this speedup, it helps to have a function to count the number of zero entries in an array of integers,
functions {
  int num_zero(array[] int y) {
    int nz = 0;
    for (n in 1:size(y)) {
      if (y[n] == 0) {
        nz += 1;
      }
    }
    return nz;
  }
}
Then a transformed data block can be used to store the sufficient statistics,
transformed data {
  int<lower=0, upper=N> N0 = num_zero(y);
  int<lower=0, upper=N> Ngt0 = N - N0;
  array[N - num_zero(y)] int<lower=1> y_nz;
  {
    int pos = 1;
    for (n in 1:N) {
      if (y[n] != 0) {
        y_nz[pos] = y[n];
        pos += 1;
      }
    }
  }
}
The model block is then reduced to three statements.
model {
  N0 ~ binomial(N, theta);
  y_nz ~ poisson(lambda);
  target += -Ngt0 * log1m_exp(-lambda);
}
The first statement accounts for the Bernoulli contribution to both the zero and non-zero counts. The second line is the Poisson contribution from the non-zero counts, which is now vectorized. Finally, the normalization for the truncation is a single line, so that the expression for the log CCDF at 0 isn’t repeated. Also note that the negation is applied to the constant Ngt0
; whenever possible, leave subexpressions constant because then gradients need not be propagated until a non-constant term is encountered.
Priors and effective data size in mixture models
Suppose we have a two-component mixture model with mixing rate $\lambda \in (0, 1)$. Because the likelihood contribution of each mixture component is weighted by its mixing probability, the effective data size informing each component is also proportional to that probability: although there are $N$ observations in total, the two components are informed by roughly $\lambda N$ and $(1 - \lambda) N$ observations, respectively. As a consequence, the priors on the component parameters have relatively more influence for components with small mixing proportions than they would in a non-mixture model fit to all $N$ observations.
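As a hypothetical numerical illustration, with $N = 1000$ observations and a mixing rate of $\lambda = 0.05$, the smaller component is informed by roughly $\lambda N = 50$ observations, so its prior carries about as much weight as it would in a model fit to only 50 data points.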
Comparison to model averaging
In contrast to mixture models, which create mixtures at the observation level, model averaging creates mixtures over the posteriors of models separately fit with the entire data set. In this situation, the priors work as expected when fitting the models independently, with the posteriors being based on the complete observed data.
If different models are expected to account for different observations, we recommend building mixture models directly. If the models being mixed are similar, often a single expanded model will capture the features of both and may be used on its own for inferential purposes (estimation, decision making, prediction, etc.). For example, rather than fitting an intercept-only regression and a slope-only regression and averaging their predictions, even as a mixture model, we would recommend building a single regression with both a slope and an intercept. Model complexity, such as having more predictors than data points, can be tamed using appropriately regularizing priors. If computation becomes a bottleneck, the only recourse can be model averaging, which can be calculated after fitting each model independently (see Hoeting et al. (1999) and Gelman et al. (2013) for theoretical and computational details).
References
Footnotes
Imposing a constraint such as $\lambda < 0.5$ on the mixing proportion will resolve the symmetry, but fundamentally changes the model and its posterior inferences.