9.5 Latent Dirichlet allocation

This is an old version, view current version.

Latent Dirichlet allocation (LDA) is a mixed-membership multinomial clustering model (Blei, Ng, and Jordan 2003) that generalizes naive Bayes. Using the topic and document terminology common in discussions of LDA, each document is modeled as having a mixture of topics, with each word drawn from a topic based on the mixing proportions.

The LDA Model

The basic model assumes each document is generated independently based on fixed hyperparameters. For document \(m\), the first step is to draw a topic distribution simplex \(\theta_m\) over the \(K\) topics, \[ \theta_m \sim \textsf{Dirichlet}(\alpha). \]

The prior hyperparameter \(\alpha\) is fixed to a \(K\)-vector of positive values. Each word in the document is generated independently conditional on the distribution \(\theta_m\). First, a topic \(z_{m,n} \in \{1,\dotsc,K\}\) is drawn for the word based on the document-specific topic-distribution, \[ z_{m,n} \sim \textsf{categorical}(\theta_m). \]

Finally, the word \(w_{m,n}\) is drawn according to the word distribution for topic \(z_{m,n}\), \[ w_{m,n} \sim \textsf{categorical}(\phi_{z[m,n]}). \] The distributions \(\phi_k\) over words for topic \(k\) are also given a Dirichlet prior, \[ \phi_k \sim \textsf{Dirichlet}(\beta) \]

where \(\beta\) is a fixed \(V\)-vector of positive values.

Summing out the discrete parameters

Although Stan does not (yet) support discrete sampling, it is possible to calculate the marginal distribution over the continuous parameters by summing out the discrete parameters as in other mixture models. The marginal posterior of the topic and word variables is \[\begin{align*} p(\theta,\phi \mid w,\alpha,\beta) &\propto p(\theta \mid \alpha) \, p(\phi \mid \beta) \, p(w \mid \theta,\phi) \\ &= \prod_{m=1}^M p(\theta_m \mid \alpha) \times \prod_{k=1}^K p(\phi_k \mid \beta) \times \prod_{m=1}^M \prod_{n=1}^{M[n]} p(w_{m,n} \mid \theta_m,\phi). \end{align*}\]

The inner word-probability term is defined by summing out the topic assignments, \[\begin{align*} p(w_{m,n} \mid \theta_m,\phi) &= \sum_{z=1}^K p(z,w_{m,n} \mid \theta_m,\phi) \\ &= \sum_{z=1}^K p(z \mid \theta_m) \, p(w_{m,n} \mid \phi_z). \end{align*}\]

Plugging the distributions in and converting to the log scale provides a formula that can be implemented directly in Stan, \[\begin{align*} \log\, &p(\theta,\phi \mid w,\alpha,\beta) \\ &= \sum_{m=1}^M \log \textsf{Dirichlet}(\theta_m \mid \alpha) + \sum_{k=1}^K \log \textsf{Dirichlet}(\phi_k \mid \beta) \\ &\qquad + \sum_{m=1}^M \sum_{n=1}^{N[m]} \log \left( \sum_{z=1}^K \textsf{categorical}(z \mid \theta_m) \times \textsf{categorical}(w_{m,n} \mid \phi_z) \right) \end{align*}\]

Implementation of LDA

Applying the marginal derived in the last section to the data structure described in this section leads to the following Stan program for LDA.

data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;               // total word instances
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta[M];          // topic dist for doc m
  simplex[V] phi[K];            // word dist for topic k
}
model {
  for (m in 1:M)
    theta[m] ~ dirichlet(alpha);  // prior
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);     // prior
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    target += log_sum_exp(gamma);  // likelihood;
  }
}

As in the other mixture models, the log-sum-of-exponents function is used to stabilize the numerical arithmetic.

Correlated topic model

To account for correlations in the distribution of topics for documents, Blei and Lafferty (2007) introduced a variant of LDA in which the Dirichlet prior on the per-document topic distribution is replaced with a multivariate logistic normal distribution.

The authors treat the prior as a fixed hyperparameter. They use an \(L_1\)-regularized estimate of covariance, which is equivalent to the maximum a posteriori estimate given a double-exponential prior. Stan does not (yet) support maximum a posteriori estimation, so the mean and covariance of the multivariate logistic normal must be specified as data.

Fixed hyperparameter correlated topic model

The Stan model in the previous section can be modified to implement the correlated topic model by replacing the Dirichlet topic prior alpha in the data declaration with the mean and covariance of the multivariate logistic normal prior.

data {
  ... data as before without alpha ...
  vector[K] mu;          // topic mean
  cov_matrix[K] Sigma;   // topic covariance
}

Rather than drawing the simplex parameter theta from a Dirichlet, a parameter eta is drawn from a multivariate normal distribution and then transformed using softmax into a simplex.

parameters {
  simplex[V] phi[K];     // word dist for topic k
  vector[K] eta[M];      // topic dist for doc m
}
transformed parameters {
  simplex[K] theta[M];
  for (m in 1:M)
    theta[m] = softmax(eta[m]);
}
model {
  for (m in 1:M)
    eta[m] ~ multi_normal(mu, Sigma);
  ... model as before w/o prior for theta ...
}

Full Bayes correlated topic model

By adding a prior for the mean and covariance, Stan supports full Bayesian inference for the correlated topic model. This requires moving the declarations of topic mean mu and covariance Sigma from the data block to the parameters block and providing them with priors in the model. A relatively efficient and interpretable prior for the covariance matrix Sigma may be encoded as follows.

... data block as before, but without alpha ...
parameters {
  vector[K] mu;              // topic mean
  corr_matrix[K] Omega;      // correlation matrix
  vector<lower=0>[K] sigma;  // scales
  vector[K] eta[M];          // logit topic dist for doc m
  simplex[V] phi[K];         // word dist for topic k
}
transformed parameters {
  ... eta as above ...
  cov_matrix[K] Sigma;       // covariance matrix
  for (m in 1:K)
    Sigma[m, m] = sigma[m] * sigma[m] * Omega[m, m];
  for (m in 1:(K-1)) {
    for (n in (m+1):K) {
      Sigma[m, n] = sigma[m] * sigma[n] * Omega[m, n];
      Sigma[n, m] = Sigma[m, n];
    }
  }
}
model {
  mu ~ normal(0, 5);      // vectorized, diffuse
  Omega ~ lkj_corr(2.0);  // regularize to unit correlation
  sigma ~ cauchy(0, 5);   // half-Cauchy due to constraint
  ... words sampled as above ...
}

The \(\textsf{LKJCorr}\) distribution with shape \(\alpha > 0\) has support on correlation matrices (i.e., symmetric positive definite with unit diagonal). Its density is defined by \[ \mathsf{LkjCorr}(\Omega|\alpha) \propto \mathrm{det}(\Omega)^{\alpha - 1} \] With a scale of \(\alpha = 2\), the weakly informative prior favors a unit correlation matrix. Thus the compound effect of this prior on the covariance matrix \(\Sigma\) for the multivariate logistic normal is a slight concentration around diagonal covariance matrices with scales determined by the prior on sigma.

References

Blei, David M., and John D. Lafferty. 2007. “A Correlated Topic Model of Science.” The Annals of Applied Statistics 1 (1): 17–37.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.