
## 9.2 Soft K-Means

$$K$$-means clustering is a method of clustering data represented as $$D$$-dimensional vectors. Specifically, there will be $$N$$ items to be clustered, each represented as a vector $$y_n \in \mathbb{R}^D$$. In the “soft” version of $$K$$-means, the assignments to clusters will be probabilistic.

### 9.2.1 Geometric Hard K-Means Clustering

$$K$$-means clustering is typically described geometrically in terms of the following algorithm, which takes as input the number of clusters $$K$$ and the data vectors $$y$$.

1. For each $$n$$ in $$\{1,\dotsc,N\}$$, randomly assign vector $$y_n$$ to a cluster in $$\{1,\dotsc,K\}$$;
2. Repeat:
    1. For each cluster $$k$$ in $$\{1,\dotsc,K\}$$, compute the cluster centroid $$\mu_k$$ by averaging the vectors assigned to that cluster;
    2. For each $$n$$ in $$\{1,\dotsc,N\}$$, reassign $$y_n$$ to the cluster $$k$$ for which the (Euclidean) distance from $$y_n$$ to $$\mu_k$$ is smallest;
    3. If no vectors changed cluster, return the cluster assignments.

This algorithm is guaranteed to terminate, although it is only guaranteed to find a local optimum, which may depend on the initial random assignment.
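For concreteness, here is a minimal sketch of the algorithm written as a Stan function, so that it stays in the same language as the models below; the function name `hard_kmeans` and the initialization argument are hypothetical, and in practice hard $$K$$-means would usually be run outside of Stan.

```stan
functions {
  // One run of hard K-means (Lloyd's algorithm); returns final assignments.
  // z_init holds the random initial cluster assignments from step 1.
  int[] hard_kmeans(vector[] y, int[] z_init, int K) {
    int N = size(y);
    int D = rows(y[1]);
    int z[N] = z_init;
    int changed = 1;
    while (changed) {
      vector[D] mu[K];
      int count[K] = rep_array(0, K);
      // step 2.1: compute centroids by averaging the assigned vectors
      for (k in 1:K)
        mu[k] = rep_vector(0, D);
      for (n in 1:N) {
        mu[z[n]] += y[n];
        count[z[n]] += 1;
      }
      for (k in 1:K)
        if (count[k] > 0)
          mu[k] /= count[k];
      // step 2.2: reassign each point to its nearest centroid
      changed = 0;
      for (n in 1:N) {
        int best = 1;
        for (k in 2:K)
          if (dot_self(y[n] - mu[k]) < dot_self(y[n] - mu[best]))
            best = k;
        if (best != z[n]) {
          z[n] = best;
          changed = 1;  // step 2.3: terminate when nothing changes
        }
      }
    }
    return z;
  }
}
```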

### 9.2.2 Soft K-Means Clustering

Soft $$K$$-means clustering treats the cluster assignments as probability distributions over the clusters. Because of the connection between Euclidean distance and multivariate normal models with a fixed covariance, soft $$K$$-means can be expressed (and coded in Stan) as a multivariate normal mixture model.

In the full generative model, each data point $$n$$ in $$\{1,\dotsc,N\}$$ is assigned a cluster $$z_n \in \{1,\dotsc,K\}$$ with symmetric uniform probability, $$z_n \sim \textsf{categorical}({\bf 1}/K),$$ where $${\bf 1}$$ is the unit vector of $$K$$ dimensions, so that $${\bf 1}/K$$ is the symmetric $$K$$-simplex. Thus the model assumes that each data point is drawn from a hard decision about cluster membership. The softness arises only from the uncertainty about which cluster generated a data point.

The data points themselves are generated from a multivariate normal distribution whose parameters are determined by the cluster assignment $$z_n$$, $$y_n \sim \textsf{normal}(\mu_{z[n]},\Sigma_{z[n]}).$$
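To make the generative story concrete, the two sampling statements can be simulated directly in Stan's generated quantities block. This is only a sketch: it assumes the cluster means `mu` are supplied as data and uses unit covariance, matching the implementation below. Because the program has no parameters, it would be run with the fixed-parameter sampler.

```stan
data {
  int<lower=0> N;    // number of data points
  int<lower=1> D;    // number of dimensions
  int<lower=1> K;    // number of clusters
  vector[D] mu[K];   // cluster means, fixed for simulation
}
generated quantities {
  int<lower=1, upper=K> z[N];
  vector[D] y[N];
  for (n in 1:N) {
    // hard cluster assignment with symmetric uniform probability
    z[n] = categorical_rng(rep_vector(1.0 / K, K));
    // draw the data point around its cluster's mean, unit covariance
    for (d in 1:D)
      y[n, d] = normal_rng(mu[z[n], d], 1);
  }
}
```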

The sample implementation in this section assumes a fixed unit covariance matrix shared by all clusters $$k$$, $$\Sigma_k = \mathrm{diag\_matrix}({\bf 1}),$$ so that the log multivariate normal can be implemented directly up to a proportion by $$\textsf{normal}\left( y_n \mid \mu_k, \mathrm{diag\_matrix}({\bf 1}) \right) \propto \exp \left(- \frac{1}{2} \sum_{d=1}^D \left( \mu_{k,d} - y_{n,d} \right)^2 \right).$$ The spatial perspective on $$K$$-means arises by noting that the inner term is just half the negative squared Euclidean distance from the cluster mean $$\mu_k$$ to the data point $$y_n$$.
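To see where the proportionality comes from, note that with unit covariance the multivariate normal density factors into independent univariate normals, so that $$\log \textsf{normal}\left( y_n \mid \mu_k, \mathrm{diag\_matrix}({\bf 1}) \right) = -\frac{D}{2} \log(2\pi) - \frac{1}{2} \sum_{d=1}^D \left( \mu_{k,d} - y_{n,d} \right)^2,$$ where the first term does not depend on $$\mu_k$$ or $$y_n$$ and may therefore be dropped when working up to a proportion.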

### 9.2.3 Stan Implementation of Soft K-Means

Consider the following Stan program for implementing soft $$K$$-means clustering.

```stan
data {
  int<lower=0> N;  // number of data points
  int<lower=1> D;  // number of dimensions
  int<lower=1> K;  // number of clusters
  vector[D] y[N];  // observations
}
transformed data {
  real<upper=0> neg_log_K;
  neg_log_K = -log(K);
}
parameters {
  vector[D] mu[K];  // cluster means
}
transformed parameters {
  real<upper=0> soft_z[N, K];  // log unnormalized clusters
  for (n in 1:N)
    for (k in 1:K)
      soft_z[n, k] = neg_log_K
                     - 0.5 * dot_self(mu[k] - y[n]);
}
model {
  // prior
  for (k in 1:K)
    mu[k] ~ std_normal();

  // likelihood
  for (n in 1:N)
    target += log_sum_exp(soft_z[n]);
}
```

There is an independent standard normal prior on the centroid parameters; this prior could be replaced with another choice of prior, or even with a hierarchical model that fits an overall problem scale and location, as sketched below.
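For example, a hierarchical version might give the centroids a shared location and scale, themselves given priors. This is only a sketch, with hypothetical parameter names, assuming the rest of the program is unchanged.

```stan
parameters {
  vector[D] mu[K];      // cluster means
  vector[D] loc;        // shared location of the centroids
  real<lower=0> scale;  // shared scale of the centroids
}
model {
  loc ~ std_normal();
  scale ~ lognormal(0, 1);
  for (k in 1:K)
    mu[k] ~ normal(loc, scale);  // hierarchical prior replaces std_normal()
  // ... likelihood as before ...
}
```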

The only parameter is `mu`, where `mu[k]` is the centroid for cluster $$k$$. The transformed parameters `soft_z[n]` contain the logs of the unnormalized cluster assignment probabilities. Each `soft_z[n]` can be converted back to a normalized simplex using the softmax function (see the functions reference manual), either externally or within the model's generated quantities block, as sketched below.
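For example, a generated quantities block along the following lines would recover normalized membership probabilities; the name `prob_z` is hypothetical.

```stan
generated quantities {
  simplex[K] prob_z[N];  // normalized cluster membership probabilities
  for (n in 1:N)
    prob_z[n] = softmax(to_vector(soft_z[n]));
}
```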

### 9.2.4 Generalizing Soft K-Means

The multivariate normal distribution with unit covariance matrix produces a log probability density proportional to Euclidean distance (i.e., $$L_2$$ distance). Other distributions relate to other geometries. For instance, replacing the normal distribution with the double exponential (Laplace) distribution produces a clustering model based on $$L_1$$ distance (i.e., Manhattan or taxicab distance).
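Under a unit scale for the double exponential distribution, only the computation of `soft_z` changes relative to the program above; a minimal sketch, assuming the same data and parameter declarations:

```stan
transformed parameters {
  real<upper=0> soft_z[N, K];  // log unnormalized clusters
  for (n in 1:N)
    for (k in 1:K)
      // L1 (taxicab) distance replaces half the squared L2 distance
      soft_z[n, k] = neg_log_K - sum(fabs(mu[k] - y[n]));
}
```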

Within the multivariate normal version of $$K$$-means, replacing the unit covariance matrix with a shared covariance matrix amounts to working with distances defined in a space transformed by the inverse covariance matrix, that is, with Mahalanobis distance.
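A minimal sketch of this variant, assuming the shared covariance matrix `Sigma` is known and passed in as data; the name and the choice to precompute its inverse in the transformed data block are illustrative.

```stan
data {
  // ... as before, plus:
  cov_matrix[D] Sigma;  // covariance shared by all clusters
}
transformed data {
  matrix[D, D] inv_Sigma = inverse_spd(Sigma);
  real<upper=0> neg_log_K = -log(K);
}
transformed parameters {
  real<upper=0> soft_z[N, K];
  for (n in 1:N)
    for (k in 1:K)
      // quadratic form gives half the negative squared Mahalanobis distance
      soft_z[n, k] = neg_log_K
                     - 0.5 * quad_form(inv_Sigma, mu[k] - y[n]);
}
```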

Although there is no global spatial analog, it is common to see soft $$K$$-means specified with a per-cluster covariance matrix. In this situation, a hierarchical prior may be used for the covariance matrices.
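As a sketch of how that might look, the following parameterizes each cluster's covariance by a Cholesky-factored correlation matrix and a vector of scales, with a shared hyperparameter `tau` partially pooling the scales; all names and prior choices here are illustrative only. Because the covariances are now parameters, the mixture log density is computed directly in the model block rather than in transformed parameters.

```stan
parameters {
  vector[D] mu[K];                     // cluster means
  cholesky_factor_corr[D] L_Omega[K];  // per-cluster correlations
  vector<lower=0>[D] sigma[K];         // per-cluster scales
  real<lower=0> tau;                   // shared scale hyperparameter
}
model {
  tau ~ normal(0, 1);
  for (k in 1:K) {
    mu[k] ~ std_normal();
    L_Omega[k] ~ lkj_corr_cholesky(4);
    sigma[k] ~ lognormal(0, tau);      // hierarchical prior on the scales
  }
  for (n in 1:N) {
    vector[K] lps;
    for (k in 1:K)
      lps[k] = neg_log_K
               + multi_normal_cholesky_lpdf(y[n] | mu[k],
                     diag_pre_multiply(sigma[k], L_Omega[k]));
    target += log_sum_exp(lps);
  }
}
```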