Clustering Models
Unsupervised methods for organizing data into groups are collectively referred to as clustering. This chapter describes the implementation in Stan of two widely used statistical clustering models, soft K-means and latent Dirichlet allocation (LDA), and also shows how naive Bayes classification and clustering fit into the same framework.
Relation to finite mixture models
As mentioned in the clustering section, clustering models and finite mixture models are really just two sides of the same coin. The "soft" K-means model described in the next section is a normal mixture model, and latent Dirichlet allocation extends mixture modeling to allow each item mixed membership in several clusters.
Soft K-means
Geometric hard K-means clustering
The traditional hard K-means algorithm operates on $N$ items, each represented as a vector $y_n \in \mathbb{R}^D$, as follows.

- For each $n$ in $1{:}N$, randomly assign vector $y_n$ to a cluster in $1{:}K$;
- Repeat
  - For each cluster $k$ in $1{:}K$, compute the cluster centroid $\mu_k$ by averaging the vectors assigned to that cluster;
  - For each $n$ in $1{:}N$, reassign $y_n$ to the cluster $k$ for which the (Euclidean) distance from $y_n$ to $\mu_k$ is smallest;
  - If no vectors changed cluster, return the cluster assignments.
This algorithm is guaranteed to terminate.
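As an illustration only (not part of any Stan program), the algorithm above can be sketched in pure Python; the two-blob test data and helper names are invented:

```python
import random

def hard_kmeans(y, K, rng):
    """Hard K-means on a list of D-dimensional points (lists of floats)."""
    N, D = len(y), len(y[0])
    z = [rng.randrange(K) for _ in range(N)]        # random initial assignment
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    while True:
        # Compute each cluster's centroid by averaging its assigned vectors
        # (re-seeding from a random point if a cluster has gone empty).
        mu = []
        for k in range(K):
            members = [y[n] for n in range(N) if z[n] == k]
            if not members:
                members = [y[rng.randrange(N)]]
            mu.append([sum(v[d] for v in members) / len(members)
                       for d in range(D)])
        # Reassign each point to the nearest centroid (Euclidean distance).
        z_new = [min(range(K), key=lambda k: sqdist(y[n], mu[k]))
                 for n in range(N)]
        if z_new == z:                               # no changes: terminate
            return z_new
        z = z_new

rng = random.Random(1)
blob = lambda c: [[rng.gauss(c, 0.1), rng.gauss(c, 0.1)] for _ in range(20)]
y = blob(0.0) + blob(5.0)                            # two well-separated blobs
z = hard_kmeans(y, K=2, rng=rng)
```

With two blobs this far apart, the algorithm recovers the blob structure: all points in a blob end up sharing one label.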
Soft K-means clustering
Soft K-means clustering treats the cluster assignments as probability distributions over the clusters. Because of the connection between Euclidean distance and multivariate normal models with a fixed covariance matrix, soft K-means can be expressed (and coded in Stan) as a multivariate normal mixture model.

In the full generative model, each data point $n$ in $1{:}N$ is assigned a cluster $z_n \in 1{:}K$ with symmetric uniform probability,
$$z_n \sim \textsf{categorical}(1/K).$$

The data points themselves are generated from a multivariate normal distribution whose parameters are determined by the cluster assignment $z_n$,
$$y_n \sim \textsf{normal}(\mu_{z_n}, \Sigma_{z_n}).$$

The sample implementation in this section assumes a fixed unit covariance matrix shared by all clusters $k$, $\Sigma_k = \mathrm{I}$, so that the log multivariate normal density is proportional (up to an additive constant) to negative half the squared Euclidean distance,
$$-\tfrac{1}{2} \sum_{d=1}^{D} (\mu_{k,d} - y_{n,d})^2.$$
Stan implementation of soft K-means
Consider the following Stan program for implementing soft K-means clustering.
data {
  int<lower=0> N;  // number of data points
  int<lower=1> D;  // number of dimensions
  int<lower=1> K;  // number of clusters
  array[N] vector[D] y;  // observations
}
transformed data {
  real<upper=0> neg_log_K;
  neg_log_K = -log(K);
}
parameters {
  array[K] vector[D] mu;  // cluster means
}
transformed parameters {
  array[N, K] real<upper=0> soft_z;  // log unnormalized clusters
  for (n in 1:N) {
    for (k in 1:K) {
      soft_z[n, k] = neg_log_K
                     - 0.5 * dot_self(mu[k] - y[n]);
    }
  }
}
model {
  // prior
  for (k in 1:K) {
    mu[k] ~ std_normal();
  }
  // likelihood
  for (n in 1:N) {
    target += log_sum_exp(soft_z[n]);
  }
}
There is an independent standard normal prior on the centroid parameters; this prior could be swapped with other priors, or even a hierarchical model to fit an overall problem scale and location.
The only parameter is mu, where mu[k] is the centroid for cluster k. The transformed parameters soft_z[n] contain the log of the unnormalized cluster assignment probabilities. The vector soft_z[n] can be converted back to a normalized simplex using the softmax function (see the functions reference manual), either externally or within the model's generated quantities block.
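For illustration, the conversion from a row of soft_z to a simplex is just the softmax, computed stably in pure Python (the row values below are invented):

```python
import math

def softmax(log_unnorm):
    """Map log unnormalized probabilities to a normalized simplex."""
    m = max(log_unnorm)                  # subtract the max to avoid overflow
    exps = [math.exp(v - m) for v in log_unnorm]
    total = sum(exps)
    return [e / total for e in exps]

soft_z_n = [-1.2, -0.3, -4.0]            # hypothetical row soft_z[n]
p = softmax(soft_z_n)
# p is a simplex: entries are positive, sum to 1, and preserve the ordering
```

Subtracting the maximum before exponentiating leaves the result unchanged (the factor cancels in the ratio) while keeping every exponent non-positive.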
Generalizing soft K-means
The multivariate normal distribution with unit covariance matrix produces a log probability density proportional to squared Euclidean distance (i.e., the $L_2$ norm). Other distributions relate to other geometries. For instance, replacing the normal distribution with the double exponential (Laplace) distribution produces a clustering model based on $L_1$ (i.e., taxicab or Manhattan) distance.

Within the multivariate normal version of K-means, replacing the unit covariance matrix with a shared covariance matrix amounts to working with distances defined in a space transformed by the inverse covariance matrix.

Although there is no global spatial analog, it is common to see soft K-means specified with a per-cluster covariance matrix; in this situation, a hierarchical prior may be used for the covariance matrices.
The difficulty of Bayesian inference for clustering
Two problems make it pretty much impossible to perform full Bayesian inference for clustering models: the lack of parameter identifiability and the extreme multimodality of the posteriors. There is additional discussion related to the non-identifiability due to label switching in the label switching section.
Non-identifiability
Cluster assignments are not identified—permuting the cluster mean vectors mu
leads to a model with identical likelihoods. For instance, permuting the first two indexes in mu
and the first two indexes in each soft_z[n]
leads to an identical likelihood (and prior).
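This invariance is easy to verify numerically. A pure-Python sketch (log_lik is an illustrative helper computing the soft K-means log likelihood up to a constant; the data and means are invented, and none of this is part of the Stan program):

```python
import math

def log_lik(y, mu):
    """Soft K-means log likelihood up to a constant: a uniform mixture of
    unit-covariance normals with means mu, evaluated at the points y."""
    K = len(mu)
    total = 0.0
    for yn in y:
        terms = [-math.log(K)
                 - 0.5 * sum((m - v) ** 2 for m, v in zip(muk, yn))
                 for muk in mu]
        hi = max(terms)                   # log-sum-exp for stability
        total += hi + math.log(sum(math.exp(t - hi) for t in terms))
    return total

y = [[0.1, 0.2], [4.9, 5.1], [0.0, -0.1]]   # invented data points
mu = [[0.0, 0.0], [5.0, 5.0]]               # invented cluster means
swapped = list(reversed(mu))                # permute the cluster labels
same = abs(log_lik(y, mu) - log_lik(y, swapped)) < 1e-12
# same is True: the likelihood cannot distinguish the two labelings
```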
The lack of identifiability means that the cluster parameters cannot be compared across multiple Markov chains. In fact, the only parameter in soft K-means, the vector of cluster means mu, is not identified, leading to problems in monitoring convergence. Clusters can even fail to be identified within a single chain, with indices swapping if the chain is long enough or the data are not cluster-distinctive enough.
Multimodality
The other problem with clustering models is that their posteriors are highly multimodal. One form of multimodality is the non-identifiability that leads to index swapping; but even without the index problems, the posteriors remain highly multimodal.
Bayesian inference fails in cases of high multimodality because there is no way to visit all of the modes in the posterior in appropriate proportions and thus no way to evaluate integrals involved in posterior predictive inference.
In light of these two problems, the advice often given in fitting clustering models is to try many different initializations and select the sample with the highest overall probability. It is also popular to use optimization-based point estimators such as expectation maximization or variational Bayes, which can be much more efficient than sampling-based approaches.
Naive Bayes classification and clustering
Naive Bayes is a kind of mixture model that can be used for classification or for clustering (or a mix of both), depending on which labels for items are observed.1
Multinomial mixture models are referred to as “naive Bayes” because they are often applied to classification problems where the multinomial independence assumptions are clearly false.
Naive Bayes classification and clustering can be applied to any data with multinomial structure. A typical example of this is natural language text classification and clustering, which is used as an example in what follows.
The observed data consists of a sequence of $M$ documents made up of bags of words drawn from a vocabulary of $V$ distinct words. A document $m$ has $N_m$ words, $w_{m,1}, \dots, w_{m,N_m}$, each of which is an index $w_{m,n} \in 1{:}V$ into the vocabulary.

The multinomial mixture model generates a single category $z_m \in 1{:}K$ for each document $m$ from a categorical distribution,
$$z_m \sim \textsf{categorical}(\theta),$$
where $\theta$ is a $K$-simplex of category prevalences.

Next, the words in each document are generated conditionally independently of each other and of the words in other documents based on the category of the document, with word $n$ of document $m$ being generated as
$$w_{m,n} \sim \textsf{categorical}(\phi_{z_m}),$$
where $\phi_k$ is a $V$-simplex giving the distribution over words for category $k$.

The parameters $\theta$ and $\phi_k$ are given Dirichlet priors,
$$\theta \sim \textsf{Dirichlet}(\alpha), \qquad \phi_k \sim \textsf{Dirichlet}(\beta),$$
where $\alpha$ and $\beta$ are fixed vectors of positive prior counts.
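As an illustration, the generative process for the multinomial mixture can be simulated in pure Python; all sizes and hyperparameter values below are invented, and dirichlet/categorical are small helper functions, not library calls:

```python
import random

rng = random.Random(0)

def dirichlet(alpha):
    """Draw a simplex from a Dirichlet via normalized Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def categorical(probs):
    """Draw a 0-based index with the given probabilities."""
    u, c = rng.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if u <= c:
            return i
    return len(probs) - 1

K, V, M, L = 2, 5, 3, 4        # categories, vocabulary, documents, words/doc
theta = dirichlet([1.0] * K)                     # theta ~ dirichlet(alpha)
phi = [dirichlet([1.0] * V) for _ in range(K)]   # phi[k] ~ dirichlet(beta)
z = [categorical(theta) for _ in range(M)]       # z[m] ~ categorical(theta)
docs = [[categorical(phi[z[m]]) for _ in range(L)] for m in range(M)]
```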
Coding ragged arrays
The specification for naive Bayes in the previous section used ragged array notation for the words $w_{m,n}$, with documents of varying lengths. Because Stan does not directly support ragged arrays, the models are coded by flattening all of the words into a single array and recording, for each word, the document it belongs to.
The relevant variables for the program are N, the total number of words in all the documents, the word array w, and the document identity array doc.
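The flattening itself is easy to do on the interface side. A pure-Python sketch (docs is a hypothetical ragged list of 1-based word indices, one sub-list per document):

```python
def flatten_docs(docs):
    """Convert a ragged list of documents (lists of 1-based word indices)
    into the flat arrays w and doc expected by the Stan program."""
    w, doc = [], []
    for m, words in enumerate(docs, start=1):   # Stan documents are 1-based
        for word in words:
            w.append(word)
            doc.append(m)
    return w, doc

docs = [[3, 1, 4], [1, 5], [2]]     # three documents of different lengths
w, doc = flatten_docs(docs)
# w   == [3, 1, 4, 1, 5, 2]
# doc == [1, 1, 1, 2, 2, 3]
```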
Estimation with category-labeled training data
A naive Bayes model for estimating the simplex parameters given training data with documents of known categories can be coded in Stan as follows.
data {
  // training data
  int<lower=1> K;  // num topics
  int<lower=1> V;  // num words
  int<lower=0> M;  // num docs
  int<lower=0> N;  // total word instances
  array[M] int<lower=1, upper=K> z;  // topic for doc m
  array[N] int<lower=1, upper=V> w;  // word n
  array[N] int<lower=1, upper=M> doc;  // doc ID for word n
  // hyperparameters
  vector<lower=0>[K] alpha;  // topic prior
  vector<lower=0>[V] beta;  // word prior
}
parameters {
  simplex[K] theta;  // topic prevalence
  array[K] simplex[V] phi;  // word dist for topic k
}
model {
  theta ~ dirichlet(alpha);
  for (k in 1:K) {
    phi[k] ~ dirichlet(beta);
  }
  for (m in 1:M) {
    z[m] ~ categorical(theta);
  }
  for (n in 1:N) {
    w[n] ~ categorical(phi[z[doc[n]]]);
  }
}
The topic identifiers z[m] are declared as data, so the category assignments are observed and contribute directly to the likelihood.
Estimation without category-labeled training data
Naive Bayes models can be used in an unsupervised fashion to cluster multinomial-structured data into a fixed number $K$ of categories via a latent category assignment z[m] for each document $m$. Because z is discrete, it needs to be summed out of the model calculation. This is done for naive Bayes as for other mixture models. The parameters are the same up to the priors, but the likelihood is now computed as the marginal document probability
$$
\begin{aligned}
\log p(w_{m,1:N_m} \mid \theta, \phi)
&= \log \sum_{k=1}^{K} \left( \textsf{categorical}(k \mid \theta)
   \prod_{n=1}^{N_m} \textsf{categorical}(w_{m,n} \mid \phi_k) \right) \\
&= \log \sum_{k=1}^{K} \exp \left( \log \textsf{categorical}(k \mid \theta)
   + \sum_{n=1}^{N_m} \log \textsf{categorical}(w_{m,n} \mid \phi_k) \right).
\end{aligned}
$$
The last step shows how the log_sum_exp
function can be used to stabilize the numerical calculation and return a result on the log scale.
model {
  array[M, K] real gamma;
  theta ~ dirichlet(alpha);
  for (k in 1:K) {
    phi[k] ~ dirichlet(beta);
  }
  for (m in 1:M) {
    for (k in 1:K) {
      gamma[m, k] = categorical_lpmf(k | theta);
    }
  }
  for (n in 1:N) {
    for (k in 1:K) {
      gamma[doc[n], k] = gamma[doc[n], k]
                         + categorical_lpmf(w[n] | phi[k]);
    }
  }
  for (m in 1:M) {
    target += log_sum_exp(gamma[m]);
  }
}
The local variable gamma[m, k] represents the value
$$\gamma_{m,k} = \log \textsf{categorical}(k \mid \theta)
  + \sum_{n \,:\, \mathrm{doc}[n] = m} \log \textsf{categorical}(w_n \mid \phi_k).$$

Given the values of gamma, the posterior probability that document $m$ is assigned category $k$ is
$$\Pr[z_m = k \mid w, \mathrm{doc}]
  = \exp\!\left( \gamma_{m,k} - \log \sum_{k'=1}^{K} \exp(\gamma_{m,k'}) \right).$$
If the variable gamma were declared and defined in the transformed parameters block, its sampled values would be saved by Stan. The normalized posterior probabilities could also be defined as generated quantities.
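For illustration, recovering the normalized posterior probabilities from a (hypothetical) row of gamma is a log-sum-exp normalization; a pure-Python sketch:

```python
import math

def log_sum_exp(xs):
    """log(sum(exp(xs))) computed stably by factoring out the maximum."""
    hi = max(xs)
    return hi + math.log(sum(math.exp(x - hi) for x in xs))

gamma_m = [-12.4, -10.1, -15.0]              # hypothetical row gamma[m]
norm = log_sum_exp(gamma_m)                  # log normalizing constant
post = [math.exp(g - norm) for g in gamma_m] # Pr[z_m = k | ...] as a simplex
# post sums to 1; the largest gamma entry yields the most probable category
```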
Full Bayesian inference for naive Bayes
Full Bayesian posterior predictive inference for the naive Bayes model can be implemented in Stan by combining the models for labeled and unlabeled data. The estimands include both the model parameters and the posterior distribution over categories for the unlabeled data. The model is essentially a missing data model assuming the unknown category labels are missing completely at random; see Gelman et al. (2013) and Gelman and Hill (2007) for more information on missing data imputation. The model is also an instance of semisupervised learning because the unlabeled data contributes to the parameter estimations.
To specify a Stan model for performing full Bayesian inference, the model for labeled data is combined with the model for unlabeled data. A second document collection is declared as data, but without the category labels, leading to new variables M2, N2, w2, and doc2. The number of categories and number of words, as well as the hyperparameters, are shared and only declared once. Similarly, there is only one set of parameters. Then the model contains a single set of statements for the prior, a set of statements for the labeled data, and a set of statements for the unlabeled data.
Prediction without model updates
An alternative to full Bayesian inference involves estimating a model using labeled data, then applying it to unlabeled data without updating the parameter estimates based on the unlabeled data. This behavior can be implemented by moving the definition of gamma
for the unlabeled documents to the generated quantities block. Because the variables no longer contribute to the log probability, they no longer jointly contribute to the estimation of the model parameters.
Latent Dirichlet allocation
Latent Dirichlet allocation (LDA) is a mixed-membership multinomial clustering model (Blei, Ng, and Jordan 2003) that generalizes naive Bayes. Using the topic and document terminology common in discussions of LDA, each document is modeled as having a mixture of topics, with each word drawn from a topic based on the mixing proportions.
The LDA Model
The basic model assumes each document is generated independently based on fixed hyperparameters. For document $m$, the first step is to draw a topic distribution simplex $\theta_m$ over the $K$ topics,
$$\theta_m \sim \textsf{Dirichlet}(\alpha).$$

The prior hyperparameter $\alpha$ is fixed to a $K$-vector of positive values. Each word in the document is generated independently conditional on the distribution $\theta_m$. First, a topic $z_{m,n} \in 1{:}K$ is drawn for the word based on the document-specific topic distribution,
$$z_{m,n} \sim \textsf{categorical}(\theta_m).$$

Finally, the word $w_{m,n}$ is drawn according to the word distribution for topic $z_{m,n}$,
$$w_{m,n} \sim \textsf{categorical}(\phi_{z_{m,n}}),$$

where the distributions $\phi_k$ over words for topic $k$ are also given a Dirichlet prior,
$$\phi_k \sim \textsf{Dirichlet}(\beta),$$
with $\beta$ a fixed $V$-vector of positive values.
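As with naive Bayes, the generative story can be simulated in pure Python for illustration; all sizes and hyperparameters below are invented, and dirichlet/categorical are small helper functions, not library calls:

```python
import random

rng = random.Random(7)

def dirichlet(alpha):
    """Draw a simplex from a Dirichlet via normalized Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def categorical(probs):
    """Draw a 0-based index with the given probabilities."""
    u, c = rng.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if u <= c:
            return i
    return len(probs) - 1

K, V, M, L = 2, 6, 3, 5        # topics, vocabulary, documents, words per doc
phi = [dirichlet([0.5] * V) for _ in range(K)]   # phi[k] ~ dirichlet(beta)
docs = []
for m in range(M):
    theta_m = dirichlet([1.0] * K)               # theta[m] ~ dirichlet(alpha)
    doc_words = []
    for _ in range(L):
        z_mn = categorical(theta_m)              # per-word topic assignment
        doc_words.append(categorical(phi[z_mn])) # word drawn from that topic
    docs.append(doc_words)
```

Note the contrast with the naive Bayes mixture: here each word gets its own topic draw, so a single document can mix topics.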
Summing out the discrete parameters
Although Stan does not (yet) support discrete sampling, it is possible to calculate the marginal distribution over the continuous parameters by summing out the discrete parameters as in other mixture models. The marginal posterior of the topic and word variables is
$$
p(\theta, \phi \mid w, \alpha, \beta)
\propto p(\theta \mid \alpha) \, p(\phi \mid \beta) \, p(w \mid \theta, \phi)
= \prod_{m=1}^{M} p(\theta_m \mid \alpha)
  \times \prod_{k=1}^{K} p(\phi_k \mid \beta)
  \times \prod_{m=1}^{M} \prod_{n=1}^{N_m} p(w_{m,n} \mid \theta_m, \phi).
$$

The inner word-probability term is defined by summing out the topic assignments,
$$
p(w_{m,n} \mid \theta_m, \phi)
= \sum_{z=1}^{K} p(z, w_{m,n} \mid \theta_m, \phi)
= \sum_{z=1}^{K} p(z \mid \theta_m) \, p(w_{m,n} \mid \phi_z).
$$

Plugging the distributions in and converting to the log scale provides a formula that can be implemented directly in Stan (up to an additive constant),
$$
\begin{aligned}
\log p(\theta, \phi \mid w, \alpha, \beta)
&= \sum_{m=1}^{M} \log \textsf{Dirichlet}(\theta_m \mid \alpha)
 + \sum_{k=1}^{K} \log \textsf{Dirichlet}(\phi_k \mid \beta) \\
&\quad + \sum_{m=1}^{M} \sum_{n=1}^{N_m} \log
   \left( \sum_{z=1}^{K} \textsf{categorical}(z \mid \theta_m) \,
          \textsf{categorical}(w_{m,n} \mid \phi_z) \right).
\end{aligned}
$$
Implementation of LDA
Applying the marginal derived in the last section to the data structure described in this section leads to the following Stan program for LDA.
data {
  int<lower=2> K;  // num topics
  int<lower=2> V;  // num words
  int<lower=1> M;  // num docs
  int<lower=1> N;  // total word instances
  array[N] int<lower=1, upper=V> w;  // word n
  array[N] int<lower=1, upper=M> doc;  // doc ID for word n
  vector<lower=0>[K] alpha;  // topic prior
  vector<lower=0>[V] beta;  // word prior
}
parameters {
  array[M] simplex[K] theta;  // topic dist for doc m
  array[K] simplex[V] phi;  // word dist for topic k
}
model {
  for (m in 1:M) {
    // prior
    theta[m] ~ dirichlet(alpha);
  }
  for (k in 1:K) {
    // prior
    phi[k] ~ dirichlet(beta);
  }
  for (n in 1:N) {
    array[K] real gamma;
    for (k in 1:K) {
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    }
    // likelihood
    target += log_sum_exp(gamma);
  }
}
As in the other mixture models, the log-sum-of-exponents function is used to stabilize the numerical arithmetic.
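The need for stabilization is easy to demonstrate; a pure-Python sketch with illustrative magnitudes:

```python
import math

def log_sum_exp(xs):
    """log(sum(exp(xs))) computed stably by factoring out the maximum."""
    hi = max(xs)
    return hi + math.log(sum(math.exp(x - hi) for x in xs))

gamma = [-1000.0, -1001.0, -1002.0]   # magnitudes typical of long documents
naive = [math.exp(g) for g in gamma]  # every term underflows to exactly 0.0
# log(sum(naive)) would be log(0), which is undefined; factoring out the
# maximum before exponentiating recovers the correct value.
stable = log_sum_exp(gamma)           # about -999.59
```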
References

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3: 993–1022.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third edition. London: Chapman & Hall/CRC Press.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.
Footnotes
For clustering, the non-identifiability problems for all mixture models present a problem, whereas there is no such problem for classification. Despite the difficulties with full Bayesian inference for clustering, researchers continue to use it, often in an exploratory data analysis setting rather than for predictive modeling.↩︎