9.2 Soft \(K\)-Means

\(K\)-means clustering is a method of clustering data represented as \(D\)-dimensional vectors. Specifically, there will be \(N\) items to be clustered, each represented as a vector \(y_n \in \mathbb{R}^D\). In the “soft” version of \(K\)-means, the assignments to clusters will be probabilistic.

Geometric Hard \(K\)-Means Clustering

\(K\)-means clustering is typically described geometrically in terms of the following algorithm, which assumes the number of clusters \(K\) and data vectors \(y\) as input.

  1. For each \(n\) in \(1:N\), randomly assign vector \(y_n\) to a cluster in \(1{:}K\);
  2. Repeat
    1. For each cluster \(k\) in \(1{:}K\), compute the cluster centroid \(\mu_k\) by averaging the vectors assigned to that cluster;
    2. For each \(n\) in \(1:N\), reassign \(y_n\) to the cluster \(k\) for which the (Euclidean) distance from \(y_n\) to \(\mu_k\) is smallest;
    3. If no vectors changed cluster, return the cluster assignments.

This algorithm is guaranteed to terminate.

Soft \(K\)-Means Clustering

Soft \(K\)-means clustering treats the cluster assignments as probability distributions over the clusters. Because of the connection between Euclidean distance and multivariate normal models with a fixed covariance, soft \(K\)-means can be expressed (and coded in Stan) as a multivariate normal mixture model.

In the full generative model, each data point \(n\) in \(1{:}N\) is assigned a cluster \(z_n \in 1{:}K\) with symmetric uniform probability,

\[ z_n \sim \mathsf{Categorical}({\bf 1}/K), \] where \({\bf 1}\) is the unit vector of \(K\) dimensions, so that \({\bf 1}/K\) is the symmetric \(K\)-simplex. Thus the model assumes that each data point is drawn from a hard decision about cluster membership. The softness arises only from the uncertainty about which cluster generated a data point.

The data points themselves are generated from a multivariate normal distribution whose parameters are determined by the cluster assignment \(z_n\), \[ y_n \sim \mathsf{normal}(\mu_{z[n]},\Sigma_{z[n]}) \]

The sample implementation in this section assumes a fixed unit covariance matrix shared by all clusters \(k\), \[ \Sigma_k = \mbox{diag\_matrix}({\bf 1}), \] so that the log multivariate normal can be implemented directly up to a proportion by \[ \mbox{normal}\left( y_n | \mu_k, \mbox{diag\_matrix}({\bf 1}) \right) \propto \exp \left (- \frac{1}{2} \sum_{d=1}^D \left( \mu_{k,d} - y_{n,d} \right)^2 \right). \] The spatial perspective on \(K\)-means arises by noting that the inner term is just half the negative Euclidean distance from the cluster mean \(\mu_k\) to the data point \(y_n\).

Stan Implementation of Soft \(K\)-Means

Consider the following Stan program for implementing \(K\)-means clustering.

data {
  int<lower=0> N;  // number of data points
  int<lower=1> D;  // number of dimensions
  int<lower=1> K;  // number of clusters
  vector[D] y[N];  // observations
}
transformed data {
  real<upper=0> neg_log_K;
  neg_log_K = -log(K);
}
parameters {
  vector[D] mu[K]; // cluster means
}
transformed parameters {
  real<upper=0> soft_z[N, K]; // log unnormalized clusters
  for (n in 1:N)
    for (k in 1:K)
      soft_z[n, k] = neg_log_K
                     - 0.5 * dot_self(mu[k] - y[n]);
}
model {
  // prior
  for (k in 1:K)
    mu[k] ~ std_normal();

  // likelihood
  for (n in 1:N)
    target += log_sum_exp(soft_z[n]));
}

There is an independent standard normal prior on the centroid parameters; this prior could be swapped with other priors, or even a hierarchical model to fit an overall problem scale and location.

The only parameter is mu, where mu[k] is the centroid for cluster \(k\). The transformed parameters soft_z[n] contain the log of the unnormalized cluster assignment probabilities. The vector soft_z[n] can be converted back to a normalized simplex using the softmax function (see the functions reference manual), either externally or within the model’s generated quantities block.

Generalizing Soft \(K\)-Means

The multivariate normal distribution with unit covariance matrix produces a log probability density proportional to Euclidean distance (i.e., \(L_2\) distance). Other distributions relate to other geometries. For instance, replacing the normal distribution with the double exponential (Laplace) distribution produces a clustering model based on \(L_1\) distance (i.e., Manhattan or taxicab distance).

Within the multivariate normal version of \(K\)-means, replacing the unit covariance matrix with a shared covariance matrix amounts to working with distances defined in a space transformed by the inverse covariance matrix.

Although there is no global spatial analog, it is common to see soft \(K\)-means specified with a per-cluster covariance matrix. In this situation, a hierarchical prior may be used for the covariance matrices.