Clustering Models

Unsupervised methods for organizing data into groups are collectively referred to as clustering. This chapter describes the implementation in Stan of two widely used statistical clustering models, soft $K$ -means and latent Dirichlet allocation (LDA). In addition, this chapter includes naive Bayesian classification, which can be viewed as a form of clustering which may be supervised. These models are typically expressed using discrete parameters for cluster assignments. Nevertheless, they can be implemented in Stan like any other mixture model by marginalizing out the discrete parameters (see the mixture modeling chapter).

Relation to finite mixture models

As mentioned in the clustering section, clustering models and finite mixture models are really just two sides of the same coin. The “soft” $K$ -means model described in the next section is a normal mixture model (with varying assumptions about covariance in higher dimensions leading to variants of $K$ -means). Latent Dirichlet allocation is a mixed-membership multinomial mixture.

Soft K-means

$K$ -means clustering is a method of clustering data represented as $D$ -dimensional vectors. Specifically, there will be $N$ items to be clustered, each represented as a vector $y_{n} \in R^{D}$ . In the “soft” version of $K$ -means, the assignments to clusters will be probabilistic.

Geometric hard K-means clustering

$K$ -means clustering is typically described geometrically in terms of the following algorithm, which assumes the number of clusters $K$ and data vectors $y$ as input.

For each $n$ in ${1, \dots, N}$ , randomly assign vector $y_{n}$ to a cluster in ${1, \dots, K}$ ;
Repeat
1. For each cluster $k$ in ${1, \dots, K}$ , compute the cluster centroid $μ_{k}$ by averaging the vectors assigned to that cluster;
2. For each $n$ in ${1, \dots, N}$ , reassign $y_{n}$ to the cluster $k$ for which the (Euclidean) distance from $y_{n}$ to $μ_{k}$ is smallest;
3. If no vectors changed cluster, return the cluster assignments.

This algorithm is guaranteed to terminate.

Soft K-means clustering

Soft $K$ -means clustering treats the cluster assignments as probability distributions over the clusters. Because of the connection between Euclidean distance and multivariate normal models with a fixed covariance, soft $K$ -means can be expressed (and coded in Stan) as a multivariate normal mixture model.

In the full generative model, each data point $n$ in ${1, \dots, N}$ is assigned a cluster $z_{n} \in {1, \dots, K}$ with symmetric uniform probability, $z_{n} \sim categorical (1 / K),$ where $1$ is the unit vector of $K$ dimensions, so that $1 / K$ is the symmetric $K$ -simplex. Thus the model assumes that each data point is drawn from a hard decision about cluster membership. The softness arises only from the uncertainty about which cluster generated a data point.

The data points themselves are generated from a multivariate normal distribution whose parameters are determined by the cluster assignment $z_{n}$ , $y_{n} \sim normal (μ_{z [n]}, Σ_{z [n]})$

The sample implementation in this section assumes a fixed unit covariance matrix shared by all clusters $k$ , $Σ_{k} = diag_matrix (1),$ so that the log multivariate normal can be implemented directly up to a proportion by $normal (y_{n} ∣ μ_{k}, diag_matrix (1)) \propto \exp (- \frac{1}{2} \sum_{d = 1}^{D} {(μ_{k, d} - y_{n, d})}^{2}) .$ The spatial perspective on $K$ -means arises by noting that the inner term is just half the negative Euclidean distance from the cluster mean $μ_{k}$ to the data point $y_{n}$ .

Stan implementation of soft K-means

Consider the following Stan program for implementing $K$ -means clustering.

data {
  int<lower=0> N;        // number of data points
  int<lower=1> D;        // number of dimensions
  int<lower=1> K;        // number of clusters
  array[N] vector[D] y;  // observations
}
transformed data {
  real<upper=0> neg_log_K;
  neg_log_K = -log(K);
}
parameters {
  array[K] vector[D] mu; // cluster means
}
transformed parameters {
  array[N, K] real<upper=0> soft_z; // log unnormalized clusters
  for (n in 1:N) {
    for (k in 1:K) {
      soft_z[n, k] = neg_log_K
                     - 0.5 * dot_self(mu[k] - y[n]);
    }
  }
}
model {
  // prior
  for (k in 1:K) {
    mu[k] ~ std_normal();
  }

  // likelihood
  for (n in 1:N) {
    target += log_sum_exp(soft_z[n]);
  }
}

There is an independent standard normal prior on the centroid parameters; this prior could be swapped with other priors, or even a hierarchical model to fit an overall problem scale and location.

The only parameter is mu, where mu[k] is the centroid for cluster $k$ . The transformed parameters soft_z[n] contain the log of the unnormalized cluster assignment probabilities. The vector soft_z[n] can be converted back to a normalized simplex using the softmax function (see the functions reference manual), either externally or within the model’s generated quantities block.

Generalizing soft K-means

The multivariate normal distribution with unit covariance matrix produces a log probability density proportional to Euclidean distance (i.e., $L_{2}$ distance). Other distributions relate to other geometries. For instance, replacing the normal distribution with the double exponential (Laplace) distribution produces a clustering model based on $L_{1}$ distance (i.e., Manhattan or taxicab distance).

Within the multivariate normal version of $K$ -means, replacing the unit covariance matrix with a shared covariance matrix amounts to working with distances defined in a space transformed by the inverse covariance matrix.

Although there is no global spatial analog, it is common to see soft $K$ -means specified with a per-cluster covariance matrix. In this situation, a hierarchical prior may be used for the covariance matrices.

The difficulty of Bayesian inference for clustering

Two problems make it pretty much impossible to perform full Bayesian inference for clustering models, the lack of parameter identifiability and the extreme multimodality of the posteriors. There is additional discussion related to the non-identifiability due to label switching in the label switching section.

Non-identifiability

Cluster assignments are not identified—permuting the cluster mean vectors mu leads to a model with identical likelihoods. For instance, permuting the first two indexes in mu and the first two indexes in each soft_z[n] leads to an identical likelihood (and prior).

The lack of identifiability means that the cluster parameters cannot be compared across multiple Markov chains. In fact, the only parameter in soft $K$ -means is not identified, leading to problems in monitoring convergence. Clusters can even fail to be identified within a single chain, with indices swapping if the chain is long enough or the data are not cleanly separated.

Multimodality

The other problem with clustering models is that their posteriors are highly multimodal. One form of multimodality is the non-identifiability leading to index swapping. But even without the index problems the posteriors are highly multimodal.

Bayesian inference fails in cases of high multimodality because there is no way to visit all of the modes in the posterior in appropriate proportions and thus no way to evaluate integrals involved in posterior predictive inference.

In light of these two problems, the advice often given in fitting clustering models is to try many different initializations and select the sample with the highest overall probability. It is also popular to use optimization-based point estimators such as expectation maximization or variational Bayes, which can be much more efficient than sampling-based approaches.

Naive Bayes classification and clustering

Naive Bayes is a kind of mixture model that can be used for classification or for clustering (or a mix of both), depending on which labels for items are observed.¹

Multinomial mixture models are referred to as “naive Bayes” because they are often applied to classification problems where the multinomial independence assumptions are clearly false.

Naive Bayes classification and clustering can be applied to any data with multinomial structure. A typical example of this is natural language text classification and clustering, which is used an example in what follows.

The observed data consists of a sequence of $M$ documents made up of bags of words drawn from a vocabulary of $V$ distinct words. A document $m$ has $N_{m}$ words, which are indexed as $w_{m, 1}, \dots, w_{m, N [m]} \in {1, \dots, V}$ . Despite the ordered indexing of words in a document, this order is not part of the model, which is clearly defective for natural human language data. A number of topics (or categories) $K$ is fixed.

The multinomial mixture model generates a single category $z_{m} \in {1, \dots, K}$ for each document $m \in {1, \dots, M}$ according to a categorical distribution, $z_{m} \sim categorical (θ) .$ The $K$ -simplex parameter $θ$ represents the prevalence of each category in the data.

Next, the words in each document are generated conditionally independently of each other and the words in other documents based on the category of the document, with word $n$ of document $m$ being generated as $w_{m, n} \sim categorical (ϕ_{z [m]}) .$ The parameter $ϕ_{z [m]}$ is a $V$ -simplex representing the probability of each word in the vocabulary in documents of category $z_{m}$ .

The parameters $θ$ and $ϕ$ are typically given symmetric Dirichlet priors. The prevalence $θ$ is sometimes fixed to produce equal probabilities for each category $k \in {1, \dots, K}$ .

Coding ragged arrays

The specification for naive Bayes in the previous sections have used a ragged array notation for the words $w$ . Because Stan does not support ragged arrays, the models are coded using an alternative strategy that provides an index for each word in a global list of words. The data is organized as follows, with the word arrays laid out in a column and each assigned to its document in a second column.

$\begin{array}{lll} n & w [n] & doc [n] \\ 1 & w_{1, 1} & 1 \\ 2 & w_{1, 2} & 1 \\ ⋮ & ⋮ & ⋮ \\ N_{1} & w_{1, N [1]} & 1 \\ N_{1} + 1 & w_{2, 1} & 2 \\ N_{1} + 2 & w_{2, 2} & 2 \\ ⋮ & ⋮ & ⋮ \\ N_{1} + N_{2} & w_{2, N [2]} & 2 \\ N_{1} + N_{2} + 1 & w_{3, 1} & 3 \\ ⋮ & ⋮ & ⋮ \\ N = \sum_{m = 1}^{M} N_{m} & w_{M, N [M]} & M \end{array}$

The relevant variables for the program are N, the total number of words in all the documents, the word array w, and the document identity array doc.

Estimation with category-labeled training data

A naive Bayes model for estimating the simplex parameters given training data with documents of known categories can be coded in Stan as follows

data {
  // training data
  int<lower=1> K;               // num topics
  int<lower=1> V;               // num words
  int<lower=0> M;               // num docs
  int<lower=0> N;               // total word instances
  array[M] int<lower=1, upper=K> z;    // topic for doc m
  array[N] int<lower=1, upper=V> w;    // word n
  array[N] int<lower=1, upper=M> doc;  // doc ID for word n
  // hyperparameters
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta;             // topic prevalence
  array[K] simplex[V] phi;      // word dist for topic k
}
model {
  theta ~ dirichlet(alpha);
  for (k in 1:K) {
    phi[k] ~ dirichlet(beta);
  }
  for (m in 1:M) {
    z[m] ~ categorical(theta);
  }
  for (n in 1:N) {
    w[n] ~ categorical(phi[z[doc[n]]]);
  }
}

The topic identifiers $z_{m}$ are declared as data and the latent category assignments are included as part of the likelihood function.

Estimation without category-labeled training data

Naive Bayes models can be used in an unsupervised fashion to cluster multinomial-structured data into a fixed number $K$ of categories. The data declaration includes the same variables as the model in the previous section excluding the topic labels z. Because z is discrete, it needs to be summed out of the model calculation. This is done for naive Bayes as for other mixture models. The parameters are the same up to the priors, but the likelihood is now computed as the marginal document probability

$\begin{aligned} \log & p (w_{m, 1}, \dots, w_{m, N_{m}} ∣ θ, ϕ) \\ = \log \sum_{k = 1}^{K} (categorical (k ∣ θ) \times \prod_{n = 1}^{N_{m}} categorical (w_{m, n} ∣ ϕ_{k})) \\ = \log \sum_{k = 1}^{K} \exp (\log categorical (k ∣ θ) + \sum_{n = 1}^{N_{m}} \log categorical (w_{m, n} ∣ ϕ_{k})) . \end{aligned}$

The last step shows how the log_sum_exp function can be used to stabilize the numerical calculation and return a result on the log scale.

model {
  array[M, K] real gamma;
  theta ~ dirichlet(alpha);
  for (k in 1:K) {
    phi[k] ~ dirichlet(beta);
  }
  for (m in 1:M) {
    for (k in 1:K) {
      gamma[m, k] = categorical_lpmf(k | theta);
    }
  }
  for (n in 1:N) {
    for (k in 1:K) {
      gamma[doc[n], k] = gamma[doc[n], k]
                         + categorical_lpmf(w[n] | phi[k]);
    }
  }
  for (m in 1:M) {
    target += log_sum_exp(gamma[m]);
  }
}

The local variable gamma[m, k] represents the value $γ_{m, k} = \log categorical (k ∣ θ) + \sum_{n = 1}^{N_{m}} \log categorical (w_{m, n} ∣ ϕ_{k}) .$

Given $γ$ , the posterior probability that document $m$ is assigned category $k$ is $Pr [z_{m} = k ∣ w, α, β] = \exp (γ_{m, k} - \log \sum_{k = 1}^{K} \exp (γ_{m, k})) .$

If the variable gamma were declared and defined in the transformed parameter block, its sampled values would be saved by Stan. The normalized posterior probabilities could also be defined as generated quantities.

Full Bayesian inference for naive Bayes

Full Bayesian posterior predictive inference for the naive Bayes model can be implemented in Stan by combining the models for labeled and unlabeled data. The estimands include both the model parameters and the posterior distribution over categories for the unlabeled data. The model is essentially a missing data model assuming the unknown category labels are missing completely at random; see Gelman et al. (2013) and Gelman and Hill (2007) for more information on missing data imputation. The model is also an instance of semisupervised learning because the unlabeled data contributes to the parameter estimations.

To specify a Stan model for performing full Bayesian inference, the model for labeled data is combined with the model for unlabeled data. A second document collection is declared as data, but without the category labels, leading to new variables M2 N2, w2, and doc2. The number of categories and number of words, as well as the hyperparameters are shared and only declared once. Similarly, there is only one set of parameters. Then the model contains a single set of statements for the prior, a set of statements for the labeled data, and a set of statements for the unlabeled data.

Prediction without model updates

An alternative to full Bayesian inference involves estimating a model using labeled data, then applying it to unlabeled data without updating the parameter estimates based on the unlabeled data. This behavior can be implemented by moving the definition of gamma for the unlabeled documents to the generated quantities block. Because the variables no longer contribute to the log probability, they no longer jointly contribute to the estimation of the model parameters.

Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) is a mixed-membership multinomial clustering model (Blei, Ng, and Jordan 2003) that generalizes naive Bayes. Using the topic and document terminology common in discussions of LDA, each document is modeled as having a mixture of topics, with each word drawn from a topic based on the mixing proportions.

The LDA Model

The basic model assumes each document is generated independently based on fixed hyperparameters. For document $m$ , the first step is to draw a topic distribution simplex $θ_{m}$ over the $K$ topics, $θ_{m} \sim Dirichlet (α) .$

The prior hyperparameter $α$ is fixed to a $K$ -vector of positive values. Each word in the document is generated independently conditional on the distribution $θ_{m}$ . First, a topic $z_{m, n} \in {1, \dots, K}$ is drawn for the word based on the document-specific topic-distribution, $z_{m, n} \sim categorical (θ_{m}) .$

Finally, the word $w_{m, n}$ is drawn according to the word distribution for topic $z_{m, n}$ , $w_{m, n} \sim categorical (ϕ_{z [m, n]}) .$ The distributions $ϕ_{k}$ over words for topic $k$ are also given a Dirichlet prior, $ϕ_{k} \sim Dirichlet (β)$

where $β$ is a fixed $V$ -vector of positive values.

Summing out the discrete parameters

Although Stan does not (yet) support discrete sampling, it is possible to calculate the marginal distribution over the continuous parameters by summing out the discrete parameters as in other mixture models. The marginal posterior of the topic and word variables is $\begin{aligned} p (θ, ϕ ∣ w, α, β) & \propto p (θ ∣ α) p (ϕ ∣ β) p (w ∣ θ, ϕ) \\ = \prod_{m = 1}^{M} p (θ_{m} ∣ α) \times \prod_{k = 1}^{K} p (ϕ_{k} ∣ β) \times \prod_{m = 1}^{M} \prod_{n = 1}^{M [n]} p (w_{m, n} ∣ θ_{m}, ϕ) . \end{aligned}$

The inner word-probability term is defined by summing out the topic assignments, $\begin{aligned} p (w_{m, n} ∣ θ_{m}, ϕ) & = \sum_{z = 1}^{K} p (z, w_{m, n} ∣ θ_{m}, ϕ) \\ = \sum_{z = 1}^{K} p (z ∣ θ_{m}) p (w_{m, n} ∣ ϕ_{z}) . \end{aligned}$

Plugging the distributions in and converting to the log scale provides a formula that can be implemented directly in Stan, $\begin{aligned} \log & p (θ, ϕ ∣ w, α, β) \\ = \sum_{m = 1}^{M} \log Dirichlet (θ_{m} ∣ α) + \sum_{k = 1}^{K} \log Dirichlet (ϕ_{k} ∣ β) \\ + \sum_{m = 1}^{M} \sum_{n = 1}^{N [m]} \log (\sum_{z = 1}^{K} categorical (z ∣ θ_{m}) \times categorical (w_{m, n} ∣ ϕ_{z})) \end{aligned}$

Implementation of LDA

Applying the marginal derived in the last section to the data structure described in this section leads to the following Stan program for LDA.

data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  array[N] int<lower=1, upper=V> w;    // word n
  array[N] int<lower=1, upper=M> doc;  // doc ID for word n
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  array[M] simplex[K] theta;    // topic dist for doc m
  array[K] simplex[V] phi;      // word dist for topic k
}
model {
  for (m in 1:M) {
    theta[m] ~ dirichlet(alpha);  // prior
  }
  for (k in 1:K) {
    phi[k] ~ dirichlet(beta);     // prior
  }
  for (n in 1:N) {
    array[K] real gamma;
    for (k in 1:K) {
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    }
    target += log_sum_exp(gamma);  // likelihood;
  }
}

As in the other mixture models, the log-sum-of-exponents function is used to stabilize the numerical arithmetic.

Correlated topic model

To account for correlations in the distribution of topics for documents, Blei and Lafferty (2007) introduced a variant of LDA in which the Dirichlet prior on the per-document topic distribution is replaced with a multivariate logistic normal distribution.

The authors treat the prior as a fixed hyperparameter. They use an $L_{1}$ -regularized estimate of covariance, which is equivalent to the maximum a posteriori estimate given a double-exponential prior. Stan does not (yet) support maximum a posteriori estimation, so the mean and covariance of the multivariate logistic normal must be specified as data.

Fixed hyperparameter correlated topic model

The Stan model in the previous section can be modified to implement the correlated topic model by replacing the Dirichlet topic prior alpha in the data declaration with the mean and covariance of the multivariate logistic normal prior.

data {
  // ... data as before without alpha ...
  vector[K] mu;          // topic mean
  cov_matrix[K] Sigma;   // topic covariance
}

Rather than drawing the simplex parameter theta from a Dirichlet, a parameter eta is drawn from a multivariate normal distribution and then transformed using softmax into a simplex.

parameters {
  array[K] simplex[V] phi;     // word dist for topic k
  array[M] vector[K] eta;      // topic dist for doc m
}
transformed parameters {
  array[M] simplex[K] theta;
  for (m in 1:M) {
    theta[m] = softmax(eta[m]);
  }
}
model {
  for (m in 1:M) {
    eta[m] ~ multi_normal(mu, Sigma);
  }
  // ... model as before w/o prior for theta ...
}

Full Bayes correlated topic model

By adding a prior for the mean and covariance, Stan supports full Bayesian inference for the correlated topic model. This requires moving the declarations of topic mean mu and covariance Sigma from the data block to the parameters block and providing them with priors in the model. A relatively efficient and interpretable prior for the covariance matrix Sigma may be encoded as follows.

// ... data block as before, but without alpha ...
parameters {
  vector[K] mu;              // topic mean
  corr_matrix[K] Omega;      // correlation matrix
  vector<lower=0>[K] sigma;  // scales
  array[M] vector[K] eta;    // logit topic dist for doc m
  array[K] simplex[V] phi;   // word dist for topic k
}
transformed parameters {
  // ... eta as above ...
  cov_matrix[K] Sigma;       // covariance matrix
  for (m in 1:K) {
    Sigma[m, m] = sigma[m] * sigma[m] * Omega[m, m];
  }
  for (m in 1:(K-1)) {
    for (n in (m+1):K) {
      Sigma[m, n] = sigma[m] * sigma[n] * Omega[m, n];
      Sigma[n, m] = Sigma[m, n];
    }
  }
}
model {
  mu ~ normal(0, 5);      // vectorized, diffuse
  Omega ~ lkj_corr(2.0);  // regularize to unit correlation
  sigma ~ cauchy(0, 5);   // half-Cauchy due to constraint
  // ... words sampled as above ...
}

The $LKJCorr$ distribution with shape $α > 0$ has support on correlation matrices (i.e., symmetric positive definite with unit diagonal). Its density is defined by $LkjCorr (Ω ∣ α) \propto \det (Ω)^{α - 1}$ With a scale of $α = 2$ , the weakly informative prior favors a unit correlation matrix. Thus the compound effect of this prior on the covariance matrix $Σ$ for the multivariate logistic normal is a slight concentration around diagonal covariance matrices with scales determined by the prior on sigma.

References

Blei, David M., and John D. Lafferty. 2007. “A Correlated Topic Model of Science.” The Annals of Applied Statistics 1 (1): 17–37.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.

Gelman, Andrew, J. B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third Edition. London: Chapman & Hall / CRC Press.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.

Footnotes

For clustering, the non-identifiability problems for all mixture models present a problem, whereas there is no such problem for classification. Despite the difficulties with full Bayesian inference for clustering, researchers continue to use it, often in an exploratory data analysis setting rather than for predictive modeling.↩︎