9.4 Naive Bayes Classification and Clustering

This is an old version, view current version.

9.4 Naive Bayes Classification and Clustering

Naive Bayes is a kind of mixture model that can be used for classification or for clustering (or a mix of both), depending on which labels for items are observed.²²

Multinomial mixture models are referred to as “naive Bayes” because they are often applied to classification problems where the multinomial independence assumptions are clearly false.

Naive Bayes classification and clustering can be applied to any data with multinomial structure. A typical example of this is natural language text classification and clustering, which is used an example in what follows.

The observed data consists of a sequence of $M$ documents made up of bags of words drawn from a vocabulary of $V$ distinct words. A document $m$ has $N_m$ words, which are indexed as $w_{m,1}, \dotsc, w_{m,N[m]} \in \{1,\dotsc,V\}$ . Despite the ordered indexing of words in a document, this order is not part of the model, which is clearly defective for natural human language data. A number of topics (or categories) $K$ is fixed.

The multinomial mixture model generates a single category $z_m \in \{1,\dotsc,K\}$ for each document $m \in \{1,\dotsc,M\}$ according to a categorical distribution, $z_m \sim \textsf{categorical}(\theta).$ The $K$ -simplex parameter $\theta$ represents the prevalence of each category in the data.

Next, the words in each document are generated conditionally independently of each other and the words in other documents based on the category of the document, with word $n$ of document $m$ being generated as $w_{m,n} \sim \textsf{categorical}(\phi_{z[m]}).$ The parameter $\phi_{z[m]}$ is a $V$ -simplex representing the probability of each word in the vocabulary in documents of category $z_m$ .

The parameters $\theta$ and $\phi$ are typically given symmetric Dirichlet priors. The prevalence $\theta$ is sometimes fixed to produce equal probabilities for each category $k \in \{1,\dotsc,K\}$ .

Coding Ragged Arrays

The specification for naive Bayes in the previous sections have used a ragged array notation for the words $w$ . Because Stan does not support ragged arrays, the models are coded using an alternative strategy that provides an index for each word in a global list of words. The data is organized as follows, with the word arrays laid out in a column and each assigned to its document in a second column.

n	w[n]	doc[n]
1	$w_{1,1}$	1
2	$w_{1,2}$	1
$\vdots$	$\vdots$	$\vdots$
$N_1$	$w_{1,N[1]}$	1
$N_1 + 1$	$w_{2,1}$	2
$N_1 + 2$	$w_{2,2}$	2
$\vdots$	$\vdots$	$\vdots$
$N_1 + N_2$	$w_{2,N[2]}$	2
$N_1 + N_2 + 1$	$w_{3,1}$	3
$\vdots$	$\vdots$	$\vdots$
$N = \sum_{m=1}^M N_m$	$w_{M,N[M]}$	$M$

The relevant variables for the program are N, the total number of words in all the documents, the word array w, and the document identity array doc.

Estimation with Category-Labeled Training Data

A naive Bayes model for estimating the simplex parameters given training data with documents of known categories can be coded in Stan as follows

data {
  // training data
  int<lower=1> K;               // num topics
  int<lower=1> V;               // num words
  int<lower=0> M;               // num docs
  int<lower=0> N;               // total word instances
  int<lower=1,upper=K> z[M];    // topic for doc m
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc ID for word n
  // hyperparameters
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta;             // topic prevalence
  simplex[V] phi[K];            // word dist for topic k
}
model {
  theta ~ dirichlet(alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);
  for (m in 1:M)
    z[m] ~ categorical(theta);
  for (n in 1:N)
    w[n] ~ categorical(phi[z[doc[n]]]);
}

The topic identifiers $z_m$ are declared as data and the latent category assignments are included as part of the likelihood function.

Estimation without Category-Labeled Training Data

Naive Bayes models can be used in an unsupervised fashion to cluster multinomial-structured data into a fixed number $K$ of categories. The data declaration includes the same variables as the model in the previous section excluding the topic labels z. Because z is discrete, it needs to be summed out of the model calculation. This is done for naive Bayes as for other mixture models. The parameters are the same up to the priors, but the likelihood is now computed as the marginal document probability $\begin{align*} \log\, &p(w_{m,1},\dotsc,w_{m,N_m} \mid \theta,\phi) \\ &= \log \sum_{k=1}^K \left( \textsf{categorical}(k \mid \theta) \times \prod_{n=1}^{N_m} \textsf{categorical}(w_{m,n} \mid \phi_k) \right) \\ &= \log \sum_{k=1}^K \exp \left( \log \textsf{categorical}(k \mid \theta) + \sum_{n=1}^{N_m} \log \textsf{categorical}(w_{m,n} \mid \phi_k) \right). \end{align*}$

The last step shows how the log_sum_exp function can be used to stabilize the numerical calculation and return a result on the log scale.

model {
  real gamma[M, K];
  theta ~ dirichlet(alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);
  for (m in 1:M)
    for (k in 1:K)
      gamma[m, k] = categorical_lpmf(k \mid theta);
  for (n in 1:N)
    for (k in 1:K)
      gamma[doc[n], k] = gamma[doc[n], k]
                         + categorical_lpmf(w[n] \mid phi[k]);
  for (m in 1:M)
    target += log_sum_exp(gamma[m]);
}

The local variable gamma[m, k] represents the value $\gamma_{m,k} = \log \textsf{categorical}(k \mid \theta) + \sum_{n=1}^{N_m} \log \textsf{categorical}(w_{m,n} \mid \phi_k).$

Given $\gamma$ , the posterior probability that document $m$ is assigned category $k$ is $\mathrm{Pr}[z_m = k|w,\alpha,\beta] = \exp \left( \gamma_{m,k} - \log \sum_{k=1}^K \exp \left( \gamma_{m,k} \right) \right).$

If the variable gamma were declared and defined in the transformed parameter block, its sampled values would be saved by Stan. The normalized posterior probabilities could also be defined as generated quantities.

Full Bayesian Inference for Naive Bayes

Full Bayesian posterior predictive inference for the naive Bayes model can be implemented in Stan by combining the models for labeled and unlabeled data. The estimands include both the model parameters and the posterior distribution over categories for the unlabeled data. The model is essentially a missing data model assuming the unknown category labels are missing completely at random; see Gelman et al. (2013) and Gelman and Hill (2007) for more information on missing data imputation. The model is also an instance of semisupervised learning because the unlabeled data contributes to the parameter estimations.

To specify a Stan model for performing full Bayesian inference, the model for labeled data is combined with the model for unlabeled data. A second document collection is declared as data, but without the category labels, leading to new variables M2 N2, w2, and doc2. The number of categories and number of words, as well as the hyperparameters are shared and only declared once. Similarly, there is only one set of parameters. Then the model contains a single set of statements for the prior, a set of statements for the labeled data, and a set of statements for the unlabeled data.

Prediction without Model Updates

An alternative to full Bayesian inference involves estimating a model using labeled data, then applying it to unlabeled data without updating the parameter estimates based on the unlabeled data. This behavior can be implemented by moving the definition of gamma for the unlabeled documents to the generated quantities block. Because the variables no longer contribute to the log probability, they no longer jointly contribute to the estimation of the model parameters.

References

Gelman, Andrew, J. B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third Edition. London: Chapman & Hall / CRC Press.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.

For clustering, the non-identifiability problems for all mixture models present a problem, whereas there is no such problem for classification. Despite the difficulties with full Bayesian inference for clustering, researchers continue to use it, often in an exploratory data analysis setting rather than for predictive modeling.↩︎