9.2 Soft K-Means
K-means clustering is a method of clustering data represented as D-dimensional vectors. Specifically, there will be N items to be clustered, each represented as a vector y_n ∈ R^D. In the “soft” version of K-means, the assignments to clusters will be probabilistic.
Geometric Hard K-Means Clustering
K-means clustering is typically described geometrically in terms of the following algorithm, which takes as input the number of clusters K and the data vectors y.
- For each n in 1:N, randomly assign vector y_n to a cluster in 1:K.
- Repeat:
  - For each cluster k in 1:K, compute the cluster centroid μ_k by averaging the vectors assigned to that cluster.
  - For each n in 1:N, reassign y_n to the cluster k for which the (Euclidean) distance from y_n to μ_k is smallest.
  - If no vectors changed cluster, return the cluster assignments.
This algorithm is guaranteed to terminate.
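As a point of comparison with the probabilistic version below, the geometric algorithm can be sketched directly. This NumPy implementation is illustrative only; the function name, the optional initial assignment argument, and the tie-breaking and empty-cluster behavior are choices made here, not part of the algorithm statement above.

```python
import numpy as np

def hard_kmeans(y, K, z=None, seed=0):
    """Geometric hard K-means.

    y is an (N, D) array of data vectors and K the number of clusters;
    z is an optional initial assignment (random if omitted). Assumes no
    cluster becomes empty during the iteration.
    """
    rng = np.random.default_rng(seed)
    N = y.shape[0]
    if z is None:
        z = rng.integers(0, K, size=N)      # random initial assignments
    while True:
        # centroids: average of the vectors currently assigned to each cluster
        mu = np.stack([y[z == k].mean(axis=0) for k in range(K)])
        # reassign each y[n] to the cluster with the nearest centroid
        dists = np.linalg.norm(y[:, None, :] - mu[None, :, :], axis=2)
        z_new = dists.argmin(axis=1)
        if np.array_equal(z, z_new):        # no reassignments: terminate
            return z, mu
        z = z_new
```

Termination follows because each step can only decrease the within-cluster sum of squared distances and there are finitely many assignments.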
Soft K-Means Clustering
Soft K-means clustering treats the cluster assignments as probability distributions over the clusters. Because of the connection between Euclidean distance and multivariate normal models with a fixed covariance, soft K-means can be expressed (and coded in Stan) as a multivariate normal mixture model.
In the full generative model, each data point n in 1:N is assigned a cluster z_n ∈ 1:K with symmetric uniform probability,

z_n ∼ categorical((1/K) · 1_K),

where 1_K is the unit vector of K dimensions, so that (1/K) · 1_K is the symmetric K-simplex. Thus the model assumes that each data point is drawn from a hard decision about cluster membership. The softness arises only from the uncertainty about which cluster generated a data point.
The data points themselves are generated from a multivariate normal distribution whose parameters are determined by the cluster assignment z_n,

y_n ∼ normal(μ_{z[n]}, Σ_{z[n]}).
The sample implementation in this section assumes a fixed unit covariance matrix shared by all clusters k,

Σ_k = diag_matrix(1),

so that the multivariate normal density can be implemented directly, up to a proportionality constant, as

normal(y_n | μ_k, diag_matrix(1)) ∝ exp(−(1/2) Σ_{d=1}^{D} (μ_{k,d} − y_{n,d})²).

The spatial perspective on K-means arises by noting that the term inside the exponent is just half the negative squared Euclidean distance from the cluster mean μ_k to the data point y_n.
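The proportionality is harmless because any constant shared across clusters cancels when the log probabilities are normalized. A quick NumPy check on made-up values (the variable names here are illustrative): the softmax of the full log densities equals the softmax of the distance-only terms.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 3
mu = rng.normal(size=(K, D))                 # cluster means
y_n = rng.normal(size=D)                     # one data point

sq_dists = ((mu - y_n) ** 2).sum(axis=1)     # squared Euclidean distances

# full unit-covariance normal log density, including the normalizing constant
full_log_p = -0.5 * D * np.log(2 * np.pi) - 0.5 * sq_dists
# the distance-only term, as used for soft_z below
unnorm_log_p = -0.5 * sq_dists

def softmax(a):
    e = np.exp(a - a.max())                  # stabilized softmax
    return e / e.sum()

# the shared constant -D/2 * log(2*pi) cancels under normalization
assert np.allclose(softmax(full_log_p), softmax(unnorm_log_p))
```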
Stan Implementation of Soft K-Means
Consider the following Stan program for implementing K-means clustering.
data {
  int<lower=0> N;        // number of data points
  int<lower=1> D;        // number of dimensions
  int<lower=1> K;        // number of clusters
  vector[D] y[N];        // observations
}
transformed data {
  real<upper=0> neg_log_K;
  neg_log_K = -log(K);
}
parameters {
  vector[D] mu[K];       // cluster means
}
transformed parameters {
  real<upper=0> soft_z[N, K];  // log unnormalized cluster assignments
  for (n in 1:N)
    for (k in 1:K)
      soft_z[n, k] = neg_log_K
                     - 0.5 * dot_self(mu[k] - y[n]);
}
model {
  // prior
  for (k in 1:K)
    mu[k] ~ std_normal();

  // likelihood
  for (n in 1:N)
    target += log_sum_exp(soft_z[n]);
}
There is an independent standard normal prior on the centroid parameters; this prior could be swapped with other priors, or even a hierarchical model to fit an overall problem scale and location.
The only parameter is mu, where mu[k] is the centroid for cluster k. The transformed parameters soft_z[n] contain the log of the unnormalized cluster assignment probabilities. The vector soft_z[n] can be converted back to a normalized simplex using the softmax function (see the functions reference manual), either externally or within the model’s generated quantities block.
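As an external post-processing sketch, assuming the soft_z values for one draw have been extracted into an (N, K) NumPy array, the conversion is a row-wise softmax (the function name here is illustrative):

```python
import numpy as np

def softmax_rows(soft_z):
    """Convert log unnormalized assignment probabilities to simplexes, row-wise."""
    # subtract the row maximum before exponentiating for numerical stability
    a = soft_z - soft_z.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)
```

Each row of the result is a simplex: nonnegative entries summing to one, giving the cluster membership probabilities for that data point in that draw.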
Generalizing Soft K-Means
The multivariate normal distribution with unit covariance matrix produces a log probability density proportional to Euclidean distance (i.e., L2 distance). Other distributions relate to other geometries. For instance, replacing the normal distribution with the double exponential (Laplace) distribution produces a clustering model based on L1 distance (i.e., Manhattan or taxicab distance).
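Under a unit-scale double exponential model, the per-dimension log density is −log 2 − |μ_{k,d} − y_{n,d}|, so the constant again drops out and the distance term becomes an L1 distance. The two unnormalized assignment rules can be sketched side by side; this helper is illustrative, not part of the Stan program.

```python
import numpy as np

def soft_assign_logits(y_n, mus, neg_log_K, metric="l2"):
    """Unnormalized log cluster assignment probabilities for one data point.

    mus is a (K, D) array of cluster centers; metric selects the geometry.
    """
    diffs = mus - y_n
    if metric == "l2":    # unit-covariance normal: half squared Euclidean distance
        dist = 0.5 * (diffs ** 2).sum(axis=1)
    elif metric == "l1":  # unit-scale double exponential: Manhattan distance
        dist = np.abs(diffs).sum(axis=1)
    else:
        raise ValueError(metric)
    return neg_log_K - dist
```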
Within the multivariate normal version of K-means, replacing the unit covariance matrix with a shared covariance matrix amounts to working with distances defined in a space transformed by the inverse covariance matrix.
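This equivalence can be checked numerically: the squared Mahalanobis distance under a shared covariance Σ equals the squared Euclidean distance after transforming by the inverse of the Cholesky factor of Σ. An illustrative check on made-up values:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)   # a positive-definite shared covariance
mu = rng.normal(size=3)
y = rng.normal(size=3)

# squared Mahalanobis distance under Sigma
d2 = (y - mu) @ np.linalg.inv(Sigma) @ (y - mu)

# Euclidean distance in the space transformed by the inverse Cholesky factor
L = np.linalg.cholesky(Sigma)     # Sigma = L @ L.T
z = np.linalg.solve(L, y - mu)    # z = L^{-1} (y - mu)
assert np.isclose(d2, z @ z)
```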
Although there is no global spatial analog, it is common to see soft K-means specified with a per-cluster covariance matrix. In this situation, a hierarchical prior may be used for the covariance matrices.