9.5 Latent Dirichlet allocation
Latent Dirichlet allocation (LDA) is a mixed-membership multinomial clustering model (Blei, Ng, and Jordan 2003) that generalizes naive Bayes. Using the topic and document terminology common in discussions of LDA, each document is modeled as having a mixture of topics, with each word drawn from a topic based on the mixing proportions.
The LDA Model
The basic model assumes each document is generated independently based on fixed hyperparameters. For document \(m\), the first step is to draw a topic distribution simplex \(\theta_m\) over the \(K\) topics, \[ \theta_m \sim \textsf{Dirichlet}(\alpha). \]
The prior hyperparameter \(\alpha\) is fixed to a \(K\)-vector of positive values. Each word in the document is generated independently conditional on the distribution \(\theta_m\). First, a topic \(z_{m,n} \in \{1,\dotsc,K\}\) is drawn for the word based on the document-specific topic-distribution, \[ z_{m,n} \sim \textsf{categorical}(\theta_m). \]
Finally, the word \(w_{m,n}\) is drawn according to the word distribution for topic \(z_{m,n}\), \[ w_{m,n} \sim \textsf{categorical}(\phi_{z[m,n]}). \] The distributions \(\phi_k\) over words for topic \(k\) are also given a Dirichlet prior, \[ \phi_k \sim \textsf{Dirichlet}(\beta) \]
where \(\beta\) is a fixed \(V\)-vector of positive values.
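The generative process above can be sketched in plain Python as a way to simulate small test corpora. This is a minimal illustration rather than part of the Stan program: the function names, the `doc_lengths` argument, and the seed are assumptions introduced here, and Dirichlet draws are obtained by normalizing independent gamma draws.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one simplex from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an ID in 1..len(probs) with the given probabilities."""
    return rng.choices(range(1, len(probs) + 1), weights=probs, k=1)[0]


def simulate_lda(alpha, beta, doc_lengths, seed=0):
    """Simulate documents from the LDA generative process.

    alpha: K positive values; beta: V positive values;
    doc_lengths: number of words in each of the M documents (hypothetical input).
    Returns (theta, phi, z, w) using 1-based topic and word IDs.
    """
    rng = random.Random(seed)
    K = len(alpha)
    # phi_k ~ Dirichlet(beta), one word distribution per topic
    phi = [sample_dirichlet(beta, rng) for _ in range(K)]
    theta, z, w = [], [], []
    for length in doc_lengths:
        # theta_m ~ Dirichlet(alpha)
        theta_m = sample_dirichlet(alpha, rng)
        # z_{m,n} ~ categorical(theta_m)
        z_m = [sample_categorical(theta_m, rng) for _ in range(length)]
        # w_{m,n} ~ categorical(phi_{z[m,n]})
        w_m = [sample_categorical(phi[k - 1], rng) for k in z_m]
        theta.append(theta_m)
        z.append(z_m)
        w.append(w_m)
    return theta, phi, z, w


theta, phi, z, w = simulate_lda(
    alpha=[0.5, 0.5, 0.5], beta=[0.5] * 10, doc_lengths=[8, 12]
)
```

Here `w[m]` holds the simulated word IDs for one document, and `z[m]` holds the topic assignments that are unobserved in real data.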
Summing out the discrete parameters
Although Stan does not (yet) support discrete sampling, it is possible to calculate the marginal distribution over the continuous parameters by summing out the discrete parameters as in other mixture models. The marginal posterior of the topic and word distributions \(\theta\) and \(\phi\) is \[\begin{align*} p(\theta,\phi \mid w,\alpha,\beta) &\propto p(\theta \mid \alpha) \, p(\phi \mid \beta) \, p(w \mid \theta,\phi) \\ &= \prod_{m=1}^M p(\theta_m \mid \alpha) \times \prod_{k=1}^K p(\phi_k \mid \beta) \times \prod_{m=1}^M \prod_{n=1}^{N[m]} p(w_{m,n} \mid \theta_m,\phi), \end{align*}\] where \(N[m]\) is the number of words in document \(m\).
The inner word-probability term is defined by summing out the topic assignments, \[\begin{align*} p(w_{m,n} \mid \theta_m,\phi) &= \sum_{z=1}^K p(z,w_{m,n} \mid \theta_m,\phi) \\ &= \sum_{z=1}^K p(z \mid \theta_m) \, p(w_{m,n} \mid \phi_z). \end{align*}\]
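As a small numerical illustration of this sum, the following Python sketch evaluates the marginal word probability for one document with two topics and three words. The function name and the values of `theta_m` and `phi` are made up for the example.

```python
def marginal_word_prob(word, theta_m, phi):
    """p(w | theta_m, phi) = sum over topics z of theta_m[z] * phi[z][w].

    word is a 1-based word ID; theta_m and phi are 0-indexed Python lists.
    """
    return sum(theta_m[k] * phi[k][word - 1] for k in range(len(theta_m)))


theta_m = [0.25, 0.75]      # topic proportions for one document
phi = [[0.5, 0.5, 0.0],     # word distribution for topic 1
       [0.0, 0.5, 0.5]]     # word distribution for topic 2

probs = [marginal_word_prob(v, theta_m, phi) for v in (1, 2, 3)]
print(probs)  # [0.125, 0.5, 0.375]
```

The three marginal probabilities sum to one, as they must, because each is a mixture of the per-topic word probabilities weighted by the document's topic proportions.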
Plugging the distributions in and converting to the log scale provides a formula that can be implemented directly in Stan, \[\begin{align*} \log\, &p(\theta,\phi \mid w,\alpha,\beta) \\ &= \sum_{m=1}^M \log \textsf{Dirichlet}(\theta_m \mid \alpha) + \sum_{k=1}^K \log \textsf{Dirichlet}(\phi_k \mid \beta) \\ &\qquad + \sum_{m=1}^M \sum_{n=1}^{N[m]} \log \left( \sum_{z=1}^K \textsf{categorical}(z \mid \theta_m) \times \textsf{categorical}(w_{m,n} \mid \phi_z) \right) \end{align*}\]
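The logarithm of the inner sum can be evaluated without leaving the log scale. The Python sketch below is an illustration under stated assumptions, not Stan's implementation: it defines a stabilized log-sum-of-exponentials helper that factors out the largest term, and the function names and test values are introduced here for the example.

```python
import math


def log_sum_exp(xs):
    """Stable log(sum(exp(x) for x in xs)): factor out the maximum first."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf  # every term is exp(-inf) = 0
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_marginal_word_prob(word, log_theta_m, log_phi):
    """log p(w | theta_m, phi) from log-scale inputs; word is a 1-based ID."""
    terms = [log_theta_m[k] + log_phi[k][word - 1]
             for k in range(len(log_theta_m))]
    return log_sum_exp(terms)


# Moderate values: agrees with the direct sum 0.25 * 0.5 + 0.75 * 0.5 = 0.5.
log_theta_m = [math.log(0.25), math.log(0.75)]
log_phi = [[math.log(0.5)], [math.log(0.5)]]
value = log_marginal_word_prob(1, log_theta_m, log_phi)  # close to log(0.5)

# Deliberately extreme log-scale terms: exp(-800) underflows to 0.0 in double
# precision, so a direct sum would be 0, but the stabilized sum stays finite.
extreme = log_sum_exp([-800.0, -801.0])
```

Summing `log_marginal_word_prob` over all words in all documents gives the final term of the log posterior above.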
Implementation of LDA
Applying the marginal derived in the last section to a flat data layout, in which word instance \(n\) is stored as a word ID w[n] together with the ID doc[n] of the document it belongs to, leads to the following Stan program for LDA.
data {
  int<lower=2> K;                       // num topics
  int<lower=2> V;                       // num words
  int<lower=1> M;                       // num docs
  int<lower=1> N;                       // total word instances
  array[N] int<lower=1, upper=V> w;     // word n
  array[N] int<lower=1, upper=M> doc;   // doc ID for word n
  vector<lower=0>[K] alpha;             // topic prior
  vector<lower=0>[V] beta;              // word prior
}
parameters {
  array[M] simplex[K] theta;   // topic dist for doc m
  array[K] simplex[V] phi;     // word dist for topic k
}
model {
  for (m in 1:M) {
    theta[m] ~ dirichlet(alpha);   // prior
  }
  for (k in 1:K) {
    phi[k] ~ dirichlet(beta);      // prior
  }
  for (n in 1:N) {
    array[K] real gamma;           // log joint of topic k and word n
    for (k in 1:K) {
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    }
    target += log_sum_exp(gamma);  // likelihood
  }
}
As in the other mixture models, the log_sum_exp function is used to sum the per-topic terms on the log scale, which stabilizes the numerical arithmetic.
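The data block of the program above expects the corpus in flat form, with one entry of w and doc per word instance. A corpus held as one list of word IDs per document can be converted as in the following Python sketch. The function name, the toy corpus, and the hyperparameter values are assumptions made for illustration; only the dictionary keys come from the data block.

```python
def flatten_corpus(docs):
    """Convert documents (lists of 1-based word IDs) into flat w and doc arrays."""
    w, doc = [], []
    for m, words in enumerate(docs, start=1):  # 1-based document IDs
        for word in words:
            w.append(word)
            doc.append(m)
    return {"M": len(docs), "N": len(w), "w": w, "doc": doc}


docs = [[3, 1, 3], [2, 2], [1, 4, 4, 2]]  # made-up corpus with V = 4 words
data = flatten_corpus(docs)
data.update({"K": 2, "V": 4, "alpha": [0.5, 0.5], "beta": [0.1, 0.1, 0.1, 0.1]})

print(data["w"])    # [3, 1, 3, 2, 2, 1, 4, 4, 2]
print(data["doc"])  # [1, 1, 1, 2, 2, 3, 3, 3, 3]
```

The flat layout avoids ragged arrays, which is why the program loops once over all N word instances rather than over documents and then words.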