## 9.4 Naive Bayes Classification and Clustering

Naive Bayes is a kind of mixture model that can be used for
classification or for clustering (or a mix of both), depending on
which labels for items are observed.

Multinomial mixture models are referred to as “naive Bayes” because they are often applied to classification problems where the multinomial independence assumptions are clearly false.

Naive Bayes classification and clustering can be applied to any data with multinomial structure. A typical example of this is natural language text classification and clustering, which is used as an example in what follows.

The observed data consists of a sequence of \(M\) documents made up of bags of words drawn from a vocabulary of \(V\) distinct words. A document \(m\) has \(N_m\) words, which are indexed as \(w_{m,1}, \ldots, w_{m,N[m]} \in 1{:}V\). Despite the ordered indexing of the words in a document, word order plays no role in the model, an assumption that is clearly deficient for natural human language data. A number of topics (or categories) \(K\) is fixed.

The multinomial mixture model generates a single category \(z_m \in 1{:}K\) for each document \(m \in 1{:}M\) according to a categorical distribution, \[ z_m \sim \mathsf{Categorical}(\theta). \] The \(K\)-simplex parameter \(\theta\) represents the prevalence of each category in the data.

Next, the words in each document are generated conditionally independently of each other and the words in other documents based on the category of the document, with word \(n\) of document \(m\) being generated as \[ w_{m,n} \sim \mathsf{Categorical}(\phi_{z[m]}). \] The parameter \(\phi_{z[m]}\) is a \(V\)-simplex representing the probability of each word in the vocabulary in documents of category \(z_m\).

The parameters \(\theta\) and \(\phi\) are typically given symmetric Dirichlet priors. The prevalence \(\theta\) is sometimes fixed to produce equal probabilities for each category \(k \in 1{:}K\).
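To make the generative story concrete, the following Python sketch simulates documents from the model. All sizes, document lengths, and simplex values here are illustrative assumptions, not part of the specification above.

```python
# Sketch of the naive Bayes generative process; all numbers are assumed.
import random

random.seed(0)

K, V, M = 2, 5, 3          # num topics, vocabulary size, num docs
theta = [0.6, 0.4]         # topic prevalence (K-simplex)
phi = [                    # per-topic word distributions (V-simplexes)
    [0.4, 0.3, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.3, 0.3],
]

docs = []
for m in range(M):
    # draw the document's single category: z_m ~ Categorical(theta)
    z_m = random.choices(range(K), weights=theta)[0]
    N_m = random.randint(3, 6)  # document length (not part of the model)
    # draw each word independently given the category:
    # w_{m,n} ~ Categorical(phi[z_m])
    words = random.choices(range(V), weights=phi[z_m], k=N_m)
    docs.append((z_m, words))

for z_m, words in docs:
    print(z_m, words)
```

Note that the words within a document are exchangeable: only their counts matter to the model, which is the bag-of-words assumption described above.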

### Coding Ragged Arrays

The specification for naive Bayes in the previous section has used a ragged array notation for the words \(w\). Because Stan does not support ragged arrays, the models are coded using an alternative strategy that provides an index for each word in a global list of words. The data is organized as follows, with the word instances laid out in one column and each assigned to its document in a second column.

| \(n\) | \(w[n]\) | \(doc[n]\) |
|---|---|---|
| 1 | \(w_{1,1}\) | 1 |
| 2 | \(w_{1,2}\) | 1 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(N_1\) | \(w_{1,N[1]}\) | 1 |
| \(N_1 + 1\) | \(w_{2,1}\) | 2 |
| \(N_1 + 2\) | \(w_{2,2}\) | 2 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(N_1 + N_2\) | \(w_{2,N[2]}\) | 2 |
| \(N_1 + N_2 + 1\) | \(w_{3,1}\) | 3 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(N = \sum_{m=1}^M N_m\) | \(w_{M,N[M]}\) | \(M\) |

The relevant variables for the program are `N`, the total number of words in all the documents, the word array `w`, and the document identity array `doc`.
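The flattening can be sketched in a few lines of Python; the toy documents below are assumed for illustration, and the indexing is kept 1-based to match the Stan programs.

```python
# Flatten ragged per-document word lists into the global arrays w and doc.
# The toy documents are assumed; word IDs and doc IDs are 1-based as in Stan.
ragged = [[3, 1, 3], [2, 2], [4, 1, 5, 5]]  # words of documents 1..M

w = []    # w[n]: the n-th word instance across all documents
doc = []  # doc[n]: the document that word instance n belongs to
for m, words in enumerate(ragged, start=1):
    for word in words:
        w.append(word)
        doc.append(m)

N = len(w)  # total word instances, N = sum over m of N_m
print(N, w, doc)
# → 9 [3, 1, 3, 2, 2, 4, 1, 5, 5] [1, 1, 1, 2, 2, 3, 3, 3, 3]
```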

### Estimation with Category-Labeled Training Data

A naive Bayes model for estimating the simplex parameters given training data with documents of known categories can be coded in Stan as follows:

```
data {
  // training data
  int<lower=1> K;                // num topics
  int<lower=1> V;                // num words
  int<lower=0> M;                // num docs
  int<lower=0> N;                // total word instances
  int<lower=1, upper=K> z[M];    // topic for doc m
  int<lower=1, upper=V> w[N];    // word n
  int<lower=1, upper=M> doc[N];  // doc ID for word n
  // hyperparameters
  vector<lower=0>[K] alpha;      // topic prior
  vector<lower=0>[V] beta;       // word prior
}
parameters {
  simplex[K] theta;    // topic prevalence
  simplex[V] phi[K];   // word dist for topic k
}
model {
  theta ~ dirichlet(alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);
  for (m in 1:M)
    z[m] ~ categorical(theta);
  for (n in 1:N)
    w[n] ~ categorical(phi[z[doc[n]]]);
}
```

The topic identifiers \(z_m\) are declared as data; because the category assignments are observed rather than latent, they are included directly as part of the likelihood function.
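Because the Dirichlet priors are conjugate to the categorical likelihood when the labels are observed, the posterior over \(\theta\) and \(\phi\) is available in closed form; the posterior means are smoothed count ratios. The following Python sketch computes them for assumed toy data, as a check on what the Stan model estimates.

```python
# Closed-form posterior means under the conjugate Dirichlet priors.
# All data and hyperparameter values below are assumed toy inputs.
K, V, M = 2, 3, 4
z = [1, 1, 2, 2]            # topic per document (1-based, as in Stan)
w = [1, 2, 1, 3, 3, 2, 3]   # flattened word instances
doc = [1, 1, 2, 2, 3, 3, 4] # document of each word instance
alpha = [1.0] * K           # symmetric topic prior
beta = [1.0] * V            # symmetric word prior

# theta | z ~ Dirichlet(alpha + topic counts)
topic_counts = [z.count(k + 1) for k in range(K)]
theta_mean = [(alpha[k] + topic_counts[k]) / (sum(alpha) + M)
              for k in range(K)]

# phi[k] | w, z ~ Dirichlet(beta + word counts within topic k)
word_counts = [[0] * V for _ in range(K)]
for n in range(len(w)):
    k = z[doc[n] - 1] - 1
    word_counts[k][w[n] - 1] += 1
phi_mean = [[(beta[v] + word_counts[k][v])
             / (sum(beta) + sum(word_counts[k]))
             for v in range(V)] for k in range(K)]

print(theta_mean)
print(phi_mean)
```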

### Estimation without Category-Labeled Training Data

Naive Bayes models can be used in an unsupervised fashion to cluster
multinomial-structured data into a fixed number \(K\) of categories.
The data declaration includes the same variables as the model in the
previous section, excluding the topic labels `z`. Because `z` is
discrete, it needs to be summed out of the model calculation. This is
done for naive Bayes as for other mixture models. The parameters are
the same up to the priors, but the likelihood is now computed as the
marginal document probability
\[
\begin{array}{l}
\log p(w_{m,1},\ldots,w_{m,N_m}|\theta,\phi)
\\[2pt]
\ \ \ = \
\log \sum_{k=1}^K
\left( \mathsf{Categorical}(k|\theta)
\times \prod_{n=1}^{N_m} \mathsf{Categorical}(w_{m,n}|\phi_k)
\right)
\\[6pt]
\ \ \ = \
\log \sum_{k=1}^K \exp \left(
\log \mathsf{Categorical}(k|\theta)
+ \sum_{n=1}^{N_m} \log \mathsf{Categorical}(w_{m,n}|\phi_k)
\right).
\end{array}
\]

The last step shows how the `log_sum_exp` function can be used to stabilize the numerical calculation and return a result on the log scale.

```
model {
  real gamma[M, K];
  theta ~ dirichlet(alpha);
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);
  for (m in 1:M)
    for (k in 1:K)
      gamma[m, k] = categorical_lpmf(k | theta);
  for (n in 1:N)
    for (k in 1:K)
      gamma[doc[n], k] = gamma[doc[n], k]
                         + categorical_lpmf(w[n] | phi[k]);
  for (m in 1:M)
    target += log_sum_exp(gamma[m]);
}
```
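The same marginalization can be mirrored outside Stan. The following Python sketch (with assumed toy data and parameter values) accumulates `gamma` exactly as the model block does and then folds out the category with a stable log-sum-exp.

```python
# Python mirror of the marginalization in the Stan model block.
# Data, theta, and phi are assumed toy values (K = 2 topics, V = 3 words).
import math

K, M = 2, 2
theta = [0.5, 0.5]
phi = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
w = [1, 1, 3, 3]    # flattened word instances (1-based)
doc = [1, 1, 2, 2]  # document of each word instance

def log_sum_exp(xs):
    # subtract the max before exponentiating to avoid underflow
    mx = max(xs)
    return mx + math.log(sum(math.exp(x - mx) for x in xs))

# gamma[m][k] starts at log Categorical(k | theta) ...
gamma = [[math.log(theta[k]) for k in range(K)] for _ in range(M)]
# ... and accumulates the log word probabilities for document m
for n in range(len(w)):
    for k in range(K):
        gamma[doc[n] - 1][k] += math.log(phi[k][w[n] - 1])

# marginal log likelihood: sum over documents of log_sum_exp(gamma[m])
log_lik = sum(log_sum_exp(gamma[m]) for m in range(M))
print(log_lik)  # ≈ -2.7726
```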

The local variable `gamma[m, k]` represents the value
\[
\gamma_{m,k} = \log \mathsf{Categorical}(k|\theta)
+ \sum_{n=1}^{N_m} \log \mathsf{Categorical}(w_{m,n}|\phi_k).
\]

Given \(\gamma\), the posterior probability that document \(m\) is assigned category \(k\) is \[ \mathrm{Pr}[z_m = k|w,\alpha,\beta] = \exp \left( \gamma_{m,k} - \log \sum_{k'=1}^K \exp \left( \gamma_{m,k'} \right) \right). \]
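This normalization is a softmax of a row of \(\gamma\). As a minimal sketch, with assumed values for one row:

```python
# Normalize a row of gamma into posterior category probabilities.
# The gamma values for this document are assumed for illustration.
import math

gamma_m = [-3.2, -1.1, -5.0]  # log joint scores for K = 3 categories

# log_sum_exp with the max subtracted for numerical stability
mx = max(gamma_m)
lse = mx + math.log(sum(math.exp(g - mx) for g in gamma_m))
probs = [math.exp(g - lse) for g in gamma_m]
print(probs)  # ≈ [0.107, 0.875, 0.018]
```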

If the variable `gamma` were declared and defined in the transformed
parameters block, its sampled values would be saved by Stan. The
normalized posterior probabilities could also be defined as generated
quantities.

### Full Bayesian Inference for Naive Bayes

Full Bayesian posterior predictive inference for the naive Bayes model can be implemented in Stan by combining the models for labeled and unlabeled data. The estimands include both the model parameters and the posterior distribution over categories for the unlabeled data. The model is essentially a missing data model that assumes the unknown category labels are missing completely at random; see Gelman et al. (2013) and Gelman and Hill (2007) for more information on missing data imputation. The model is also an instance of semisupervised learning because the unlabeled data contributes to the parameter estimates.

To specify a Stan model for performing full Bayesian inference, the
model for labeled data is combined with the model for unlabeled data.
A second document collection is declared as data, but without the
category labels, leading to new variables `M2`, `N2`, `w2`, and
`doc2`. The number of categories and number of words, as well as the
hyperparameters, are shared and only declared once. Similarly, there
is only one set of parameters. The model then contains a single set of
statements for the prior, a set of statements for the labeled data,
and a set of statements for the unlabeled data.

### Prediction without Model Updates

An alternative to full Bayesian inference involves estimating a model
using labeled data, then applying it to unlabeled data without
updating the parameter estimates based on the unlabeled data. This
behavior can be implemented by moving the definition of `gamma` for
the unlabeled documents to the generated quantities block. Because the
variables no longer contribute to the log probability, they no longer
contribute to the estimation of the model parameters.

For clustering, the non-identifiability problems that affect all mixture models pose a difficulty, whereas there is no such problem for classification. Despite the difficulties with full Bayesian inference for clustering, researchers continue to use it, often in an exploratory data analysis setting rather than for predictive modeling.