## 2.6 Hidden Markov Models

A hidden Markov model (HMM) generates a sequence of \(T\) output variables \(y_t\) conditioned on a parallel sequence of latent categorical state variables \(z_t \in \{1,\ldots, K\}\). These “hidden” state variables are assumed to form a Markov chain so that \(z_t\) is conditionally independent of other variables given \(z_{t-1}\). This Markov chain is parameterized by a transition matrix \(\theta\) where \(\theta_k\) is a \(K\)-simplex for \(k \in \{ 1, \dotsc, K \}\). The probability of transitioning to state \(z_t\) from state \(z_{t-1}\) is \[ z_t \sim \textsf{categorical}(\theta_{z[t-1]}). \] The output \(y_t\) at time \(t\) is generated conditionally independently based on the latent state \(z_t\).

This section describes HMMs with a simple categorical model for outputs \(y_t \in \{ 1, \dotsc, V \}\). The categorical distribution for latent state \(k\) is parameterized by a \(V\)-simplex \(\phi_k\). The observed output \(y_t\) at time \(t\) is generated based on the hidden state indicator \(z_t\) at time \(t\), \[ y_t \sim \textsf{categorical}(\phi_{z[t]}). \] In short, HMMs form a discrete mixture model where the mixture component indicators form a latent Markov chain.

### Supervised Parameter Estimation

In the situation where the hidden states are known, the following naive model can be used to fit the parameters \(\theta\) and \(\phi\).

```
data {
int<lower=1> K; // num categories
int<lower=1> V; // num words
int<lower=0> T; // num instances
int<lower=1,upper=V> w[T]; // words
int<lower=1,upper=K> z[T]; // categories
vector<lower=0>[K] alpha; // transit prior
vector<lower=0>[V] beta; // emit prior
}
parameters {
simplex[K] theta[K]; // transit probs
simplex[V] phi[K]; // emit probs
}
model {
for (k in 1:K)
theta[k] ~ dirichlet(alpha);
for (k in 1:K)
phi[k] ~ dirichlet(beta);
for (t in 1:T)
w[t] ~ categorical(phi[z[t]]);
for (t in 2:T)
z[t] ~ categorical(theta[z[t - 1]]);
}
```

Explicit Dirichlet priors have been provided for \(\theta_k\) and \(\phi_k\); dropping these two statements would implicitly take the prior to be uniform over all valid simplexes.

### Start-State and End-State Probabilities

Although workable, the above description of HMMs is incomplete because the start state \(z_1\) is not modeled (the index runs from 2 to \(T\)). If the data are conceived as a subsequence of a long-running process, the probability of \(z_1\) should be set to the stationary state probabilities in the Markov chain. In this case, there is no distinct end to the data, so there is no need to model the probability that the sequence ends at \(z_T\).

An alternative conception of HMMs is as models of finite-length sequences. For example, human language sentences have distinct starting distributions (usually a capital letter) and ending distributions (usually some kind of punctuation). The simplest way to model the sequence boundaries is to add a new latent state \(K+1\), generate the first state from a categorical distribution with parameter vector \(\theta_{K+1}\), and restrict the transitions so that a transition to state \(K+1\) is forced to occur at the end of the sentence and is prohibited elsewhere.

### Calculating Sufficient Statistics

The naive HMM estimation model presented above can be sped up dramatically by replacing the loops over categorical distributions with a single multinomial distribution.

The data are declared as before. The transformed data block computes the sufficient statistics for estimating the transition and emission matrices.

```
transformed data {
int<lower=0> trans[K, K];
int<lower=0> emit[K, V];
for (k1 in 1:K)
for (k2 in 1:K)
trans[k1, k2] = 0;
for (t in 2:T)
trans[z[t - 1], z[t]] += 1;
for (k in 1:K)
for (v in 1:V)
emit[k,v] = 0;
for (t in 1:T)
emit[z[t], w[t]] += 1;
}
```

The likelihood component of the model based on looping over the input is replaced with multinomials as follows.

```
model {
...
for (k in 1:K)
trans[k] ~ multinomial(theta[k]);
for (k in 1:K)
emit[k] ~ multinomial(phi[k]);
}
```

In a continuous HMM with normal emission probabilities could be sped up in the same way by computing sufficient statistics.

### Analytic Posterior

With the Dirichlet-multinomial HMM, the posterior can be computed analytically because the Dirichlet is the conjugate prior to the multinomial. The following example illustrates how a Stan model can define the posterior analytically. This is possible in the Stan language because the model only needs to define the conditional probability of the parameters given the data up to a proportion, which can be done by defining the (unnormalized) joint probability or the (unnormalized) conditional posterior, or anything in between.

The model has the same data and parameters as the previous models, but now computes the posterior Dirichlet parameters in the transformed data block.

```
transformed data {
vector<lower=0>[K] alpha_post[K];
vector<lower=0>[V] beta_post[K];
for (k in 1:K)
alpha_post[k] = alpha;
for (t in 2:T)
alpha_post[z[t-1], z[t]] += 1;
for (k in 1:K)
beta_post[k] = beta;
for (t in 1:T)
beta_post[z[t], w[t]] += 1;
}
```

The posterior can now be written analytically as follows.

```
model {
for (k in 1:K)
theta[k] ~ dirichlet(alpha_post[k]);
for (k in 1:K)
phi[k] ~ dirichlet(beta_post[k]);
}
```

### Semisupervised Estimation

HMMs can be estimated in a fully unsupervised fashion without any data for which latent states are known. The resulting posteriors are typically extremely multimodal. An intermediate solution is to use semisupervised estimation, which is based on a combination of supervised and unsupervised data. Implementing this estimation strategy in Stan requires calculating the probability of an output sequence with an unknown state sequence. This is a marginalization problem, and for HMMs, it is computed with the so-called forward algorithm.

In Stan, the forward algorithm is coded as follows. First, two additional data variable are declared for the unsupervised data.

```
data {
...
int<lower=1> T_unsup; // num unsupervised items
int<lower=1,upper=V> u[T_unsup]; // unsup words
...
```

The model for the supervised data does not change; the unsupervised data are handled with the following Stan implementation of the forward algorithm.

```
model {
...
real acc[K];
real gamma[T_unsup, K];
for (k in 1:K)
gamma[1, k] = log(phi[k, u[1]]);
for (t in 2:T_unsup) {
for (k in 1:K) {
for (j in 1:K)
acc[j] = gamma[t-1, j] + log(theta[j, k]) + log(phi[k, u[t]]);
gamma[t, k] = log_sum_exp(acc);
}
}
target += log_sum_exp(gamma[T_unsup]);
}
```

The forward values `gamma[t, k]`

are defined to be the log
marginal probability of the inputs `u[1],...,u[t]`

up to time
`t`

and the latent state being equal to `k`

at time
`t`

; the previous latent states are marginalized out. The first
row of `gamma`

is initialized by setting `gamma[1, k]`

equal
to the log probability of latent state `k`

generating the first
output `u[1]`

; as before, the probability of the first latent
state is not itself modeled. For each subsequent time `t`

and
output `j`

, the value `acc[j]`

is set to the probability of
the latent state at time `t-1`

being `j`

, plus the log
transition probability from state `j`

at time `t-1`

to state
`k`

at time `t`

, plus the log probability of the output
`u[t]`

being generated by state `k`

. The
`log_sum_exp`

operation just multiplies the probabilities for
each prior state `j`

on the log scale in an arithmetically stable
way.

The brackets provide the scope for the local variables `acc`

and
`gamma`

; these could have been declared earlier, but it is
clearer to keep their declaration near their use.

### Predictive Inference

Given the transition and emission parameters, \(\theta_{k, k'}\) and \(\phi_{k,v}\) and an observation sequence \(u_1, \dotsc, u_T \in \{ 1, \dotsc, V \}\), the Viterbi (dynamic programming) algorithm computes the state sequence which is most likely to have generated the observed output \(u\).

The Viterbi algorithm can be coded in Stan in the generated quantities
block as follows. The predictions here is the most likely state
sequence `y_star[1], ..., y_star[T_unsup]`

underlying the
array of observations `u[1], ..., u[T_unsup]`

. Because this
sequence is determined from the transition probabilities
`theta`

and emission probabilities `phi`

, it may be
different from sample to sample in the posterior.

```
generated quantities {
int<lower=1,upper=K> y_star[T_unsup];
real log_p_y_star;
{
int back_ptr[T_unsup, K];
real best_logp[T_unsup, K];
real best_total_logp;
for (k in 1:K)
best_logp[1, k] = log(phi[k, u[1]]);
for (t in 2:T_unsup) {
for (k in 1:K) {
best_logp[t, k] = negative_infinity();
for (j in 1:K) {
real logp;
logp = best_logp[t-1, j]
+ log(theta[j, k]) + log(phi[k, u[t]]);
if (logp > best_logp[t, k]) {
back_ptr[t, k] = j;
best_logp[t, k] = logp;
}
}
}
}
log_p_y_star = max(best_logp[T_unsup]);
for (k in 1:K)
if (best_logp[T_unsup, k] == log_p_y_star)
y_star[T_unsup] = k;
for (t in 1:(T_unsup - 1))
y_star[T_unsup - t] = back_ptr[T_unsup - t + 1,
y_star[T_unsup - t + 1]];
}
}
```

The bracketed block is used to make the three variables
`back_ptr`

, `best_logp`

, and `best_total_logp`

local so they will not be output. The variable `y_star`

will
hold the label sequence with the highest probability given the input
sequence `u`

. Unlike the forward algorithm, where the
intermediate quantities were total probability, here they consist of
the maximum probability `best_logp[t, k]`

for the sequence up to
time `t`

with final output category `k`

for time `t`

,
along with a backpointer to the source of the link. Following the
backpointers from the best final log probability for the final time
`t`

yields the optimal state sequence.

This inference can be run for the same unsupervised outputs `u`

as are used to fit the semisupervised model. The above code can be
found in the same model file as the unsupervised fit. This is the
Bayesian approach to inference, where the data being reasoned about is
used in a semisupervised way to train the model. It is not
“cheating” because the underlying states for `u`

are never
observed — they are just estimated along with all of the other
parameters.

If the outputs `u`

are not used for semisupervised estimation but
simply as the basis for prediction, the result is equivalent to what
is represented in the BUGS modeling language via the cut operation.
That is, the model is fit independently of `u`

, then those
parameters used to find the most likely state to have generated
`u`

.