Introduction

This case study shows how we can apply Bayesian inference to Hidden Markov Models (HMMs) using Stan to extract useful information from basketball player tracking data. Specifically we show how to tag drive events and how to determine defensive assignment. Before diving into basketball data we show how to fit an HMM in Stan using a simple example. This should help build some intuition for those who are unfamiliar with HMMs and will also show how to specify an HMM using Stan.

The code used to build this document is available in the following GitHub repository: https://github.com/imadmali/bball-hmm.

Simple HMM Example

HMMs enable you to model a series of observed values. The generative model produces an observation for each latent state, so you also have a series of states that corresponds to the series of observed values. The state series exhibits the Markov property so the value of the state at time \(t\) only depends on the value of the state at time \(t-1\). Often the states are hidden so the goal of inference using an HMM is to,

  1. Estimate the parameters that allow you to transition from one state to the next and (given the state) the parameters involved in generating the observation.
  2. Predict the most likely state sequence based on the observed data and the parameters estimated in (1).

The plot below outlines an HMM that we simulated. The model involves one sequence of observed outcomes generated from the normal distribution and two states. At each time step we are in one of the two states. Each of the states corresponds to a location parameter from the normal distribution. This means that the observed value is generated depending on which one of two location parameters is selected, which in turn depends on which state you are in. In more complicated data you could have multiple observations and many more states at each time step.

You can see how state 1 corresponds to smaller values of the outcome while state 2 corresponds to relatively higher values. In most real-world situations you do not know the state value at each time step because it is hidden. So in order to fit the model and infer the state sequence you first need to make an assumption about how many possible states there might be at each time step.

hmm_data <- readRDS("../data/hmm_example.RDS")
z <- hmm_data$z
y <- hmm_data$y
par(mfrow=c(2,1))
plot(hmm_data$z, type="s",
     main = "Hidden States",
     ylab = "State Value",
     xlab = "Time",
     ylim = c(0.5,2.5), yaxt = "n")
axis(2, 1:2, 1:2)
plot(hmm_data$y, type = "l",
     main = "Observed Output",
     ylab = "Observation Value",
     xlab = "Time")
y_plt <- hmm_data$y
y_plt[hmm_data$z==1] <- NA
lines(y_plt, lwd = 3)
legend("bottomright", c("State 1","State 2"), lty = c(1,1), lwd = c(1,3), cex = 0.8)

Suppose we don’t know how many states there are at each time step \(t\), but we assume that there are two states, \(z_t \in \{1,2\}\). The full generative model can be defined as,

\[ \begin{align*} y_t &\sim \mathcal{N}(\mu_{z_t}, 1) \\ z_t &\sim \mathrm{Categorical}(\boldsymbol{\theta}_{z_{t-1}}) \\ &\mbox{priors on } \mu_k \mbox{ and } \boldsymbol{\theta} \end{align*} \]

This says that the distribution of each observation \(y_t\) is normal with location parameter \(\mu_{z_t}\) and constant scale. Since \(z_t\) can take two values at each time step, there are two location parameters, \(\mu_{z_t} \in \{\mu_1,\mu_2\}\). The model also says that the state variable follows the categorical distribution parameterized by \(\boldsymbol{\theta}_{z_{t-1}}\), the row of the transition matrix corresponding to the previous state. Each such row contains two probabilities (one for each state that \(z_t\) can move to) and, by the properties of the categorical distribution, sums to 1. This means that \(\boldsymbol{\theta}\) is a \(2 \times 2\) stochastic matrix (also known as a transition matrix).
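
To make the generative model concrete, the following R snippet simulates a state and observation sequence from it. The specific parameter values here (the transition matrix theta_sim and the location parameters mu_sim) are hypothetical choices for illustration, not the values used to generate the data plotted above.

set.seed(123)
N_sim <- 100
mu_sim <- c(3, 9)                                     # hypothetical location parameters
theta_sim <- matrix(c(0.7, 0.3,                       # hypothetical transition matrix;
                      0.1, 0.9), 2, 2, byrow = TRUE)  # each row sums to 1
z_sim <- integer(N_sim)
y_sim <- numeric(N_sim)
z_sim[1] <- sample(1:2, 1)                            # uniform initial state
y_sim[1] <- rnorm(1, mu_sim[z_sim[1]], 1)
for (t in 2:N_sim) {
  z_sim[t] <- sample(1:2, 1, prob = theta_sim[z_sim[t-1], ])  # Markov transition
  y_sim[t] <- rnorm(1, mu_sim[z_sim[t]], 1)                   # emission given the state
}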

A graphical representation of the model is provided below.

The parameters that we want to estimate in this model are the transition probabilities (\(\boldsymbol{\theta}\)) and the parameters associated with the emission probabilities (\(\mu_1\),\(\mu_2\)).

We can interpret an element in the stochastic matrix \(\boldsymbol{\theta}\) as the probability of going from the row state to the column state from time \(t-1\) to time \(t\). So the diagonal elements give the probability of staying in the same state, and the off-diagonal elements give the probability of transitioning from one state to the next. Continuing with our example we have,
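
\[ \boldsymbol{\theta} = \begin{bmatrix} \theta_{1,1} & \theta_{1,2} \\ \theta_{2,1} & \theta_{2,2} \end{bmatrix} = \begin{bmatrix} p(z_t = 1 \mid z_{t-1} = 1) & p(z_t = 2 \mid z_{t-1} = 1) \\ p(z_t = 1 \mid z_{t-1} = 2) & p(z_t = 2 \mid z_{t-1} = 2) \end{bmatrix} \]

where each row sums to 1. (This simply writes out the generic form of the matrix; the estimated probabilities for this example are reported in the model fit below.)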

Since the data in this example are normally distributed the emission probabilities come from the normal distribution. The emission probabilities depend on the location and scale parameter associated with each state (i.e. the emission parameters). In our model we assume the scale parameter is known and only have to estimate the location parameters \(\mu_1\) and \(\mu_2\) for states 1 and 2.

In order to estimate the parameters we need to define the posterior distribution, which requires us to specify the likelihood of the data and the priors. The likelihood is the probability of observing that particular sequence of outcome values, marginalizing over the hidden state sequence. The forward algorithm computes this marginalization efficiently, \[ p(y | \theta, \mu) = \sum_{z} p(y,z | \theta, \mu) \]
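
Written out (with \(\alpha_t(k)\) denoting the joint probability of the first \(t\) observations and \(z_t = k\), a notation introduced here only for exposition), the forward recursion is

\[ \begin{align*} \alpha_1(k) &= p(z_1 = k)\, p(y_1 \mid \mu_k) \\ \alpha_t(k) &= p(y_t \mid \mu_k) \sum_{j=1}^{K} \alpha_{t-1}(j)\, \theta_{j,k} \\ p(y \mid \theta, \mu) &= \sum_{k=1}^{K} \alpha_N(k) \end{align*} \]

The Stan program later in this section implements this recursion on the log scale (the array called gamma), treating the initial state distribution as uniform.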

The priors are defined on the stochastic matrix and on the emission parameters. Since each row of the stochastic matrix sums to 1, a natural prior choice is the Dirichlet distribution on each row of the matrix.

We can then use Stan’s variant of the Hamiltonian Monte Carlo algorithm to estimate the stochastic matrix and emission parameters. Once we estimate the parameters we can determine the most probable state sequence that generated the sequence of observations. This can be computed with the Viterbi algorithm.
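
For completeness, the Viterbi algorithm replaces the sum in the forward recursion with a maximization. Writing \(\delta_t(k)\) for the probability of the most likely state path ending in state \(k\) at time \(t\) (again taking the initial state distribution as uniform, as the code below does),

\[ \begin{align*} \delta_1(k) &= p(y_1 \mid \mu_k) \\ \delta_t(k) &= p(y_t \mid \mu_k) \max_{j} \left[ \delta_{t-1}(j)\, \theta_{j,k} \right] \end{align*} \]

The most likely sequence is then recovered by taking \(z^*_N = \arg\max_k \delta_N(k)\) and backtracking through the states that achieved each maximum. This is what the best_logp and back_ptr variables compute (on the log scale) in the generated quantities block below.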

Specifying the Model

Below we represent the HMM in Stan (also available in models/hmm_example.stan). The model {} block specifies the priors and uses the forward algorithm to compute the marginal likelihood of the observed sequence, and the generated quantities {} block implements the Viterbi algorithm, which enables us to determine the most likely state sequence. This model was adapted from the Stan User’s Guide.

Notice that we have chosen to declare \(\mu\) as positive_ordered[K] mu instead of real mu[K]. This constraint enforces a strict ordering of the two \(\mu\) values associated with the two states. While in theory priors that separate the two location parameters should be enough to identify the model, in practice the sampler benefits from a parameterization that rules out label switching, so we enforce an ordering. If we don’t, the chains may each settle on a different labeling of the states and fail to converge (see Betancourt 2017 for more detail). An example of omitting the constraint is provided in models/hmm_example_bad.stan.

data {
  int<lower=0> N;
  int<lower=0> K;
  real y[N];
}

parameters {
  simplex[K] theta[K];
  // real mu[K];
  positive_ordered[K] mu;
}

model {
  // priors
  target += normal_lpdf(mu[1] | 3, 1);
  target += normal_lpdf(mu[2] | 10, 1);
  // forward algorithm
  {
  real acc[K];
  real gamma[N, K];
  for (k in 1:K)
    gamma[1, k] = normal_lpdf(y[1] | mu[k], 1);
  for (t in 2:N) {
    for (k in 1:K) {
      for (j in 1:K)
        acc[j] = gamma[t-1, j] + log(theta[j, k]) + normal_lpdf(y[t] | mu[k], 1);
      gamma[t, k] = log_sum_exp(acc);
    }
  }
  target += log_sum_exp(gamma[N]);
  }
}

generated quantities {
  int<lower=1,upper=K> z_star[N];
  real log_p_z_star;
  {
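    // Viterbi algorithm: recover the most likely state sequence given the sampled theta and mu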
    int back_ptr[N, K];
    real best_logp[N, K];
    for (k in 1:K)
      best_logp[1, k] = normal_lpdf(y[1] | mu[k], 1);
    for (t in 2:N) {
      for (k in 1:K) {
        best_logp[t, k] = negative_infinity();
        for (j in 1:K) {
          real logp;
          logp = best_logp[t-1, j] + log(theta[j, k]) + normal_lpdf(y[t] | mu[k], 1);
          if (logp > best_logp[t, k]) {
            back_ptr[t, k] = j;
            best_logp[t, k] = logp;
          }
        }
      }
    }
    log_p_z_star = max(best_logp[N]);
    for (k in 1:K)
      if (best_logp[N, k] == log_p_z_star)
        z_star[N] = k;
    for (t in 1:(N - 1))
      z_star[N - t] = back_ptr[N - t + 1, z_star[N - t + 1]];
  }
}

Fitting the Model

We fit the model to the data provided above in order to estimate the parameters theta and mu along with the hidden state sequence z_star.

# code available in hmm_example.R
library(rstan)      # provides stan()
library(bayesplot)  # provides mcmc_trace(), used below
stan_data <- list(N = length(hmm_data$y),
                  K = 2,
                  y = hmm_data$y)
hmm_fit <- stan("../models/hmm_example.stan", data = stan_data, iter = 1e3, chains = 4)

Post-estimation Validation

The post-estimation steps that we take to validate the model are,

  1. Diagnostics: Make sure that the chains for each parameter have converged. This can be evaluated by examining the R-hat and effective sample size values for each parameter.
  2. Predictions: Specifically we perform a posterior predictive check. Using the estimated parameters and the predicted state sequence we can simulate multiple output sequences and see if they line up with the observed output sequence. Additionally, since we know the true state values, we can check that the predicted state values line up with the true state values.

We go through these steps below.

Diagnostics

For the transition probabilities \(\boldsymbol{\theta}\) and the parameters associated with the emission probabilities \(\mu\) we have the following parameter estimates.

print(hmm_fit, pars = "z_star", include = FALSE, probs = c(0.05,0.95))
## Inference for Stan model: hmm_example.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##                 mean se_mean   sd      5%     95% n_eff Rhat
## theta[1,1]      0.66    0.00 0.10    0.50    0.81  1647    1
## theta[1,2]      0.34    0.00 0.10    0.19    0.50  1647    1
## theta[2,1]      0.07    0.00 0.03    0.03    0.12  1638    1
## theta[2,2]      0.93    0.00 0.03    0.88    0.97  1638    1
## mu[1]           3.03    0.01 0.23    2.67    3.40  1046    1
## mu[2]           8.83    0.00 0.11    8.64    9.01  2305    1
## log_p_z_star -166.18    0.04 1.34 -168.85 -164.60   882    1
## lp__         -170.20    0.05 1.39 -173.00 -168.57   772    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Oct  5 14:02:48 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

With Rhat values close to 1 and high enough effective sample sizes (n_eff) we can conclude that the chains have converged and the parameter estimates are reliable.
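
If you would rather check these diagnostics programmatically than by reading the table, something along the following lines works with rstan's summary method (the 1.1 threshold is a common rule of thumb, not a strict cutoff).

fit_summary <- summary(hmm_fit, pars = c("theta", "mu"))$summary
all(fit_summary[, "Rhat"] < 1.1)   # TRUE if no parameter shows signs of non-convergence
min(fit_summary[, "n_eff"])        # smallest effective sample size across these parameters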

We can also look at the traceplots to get an idea of what convergence among chains looks like.

mcmc_trace(as.array(hmm_fit), regex_pars = "^theta\\[|^mu\\[", facet_args = list(nrow = 2))

Below we have diagnostics from a situation where priors are defined but an ordering is not enforced (see hmm_example_bad_fit.R for the code). You can see that the parameter chains have not converged: the Rhat values are well above 1 and the effective sample sizes are very small, both indicators of poor convergence.

# code available in hmm_example_bad.R
hmm_bad_stan <- readRDS("../results/hmm_example_bad.RDS")
print(hmm_bad_stan$fit, pars = "z_star", include = FALSE, probs = c(0.05,0.95))
## Inference for Stan model: hmm_example_bad.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##                 mean se_mean    sd      5%     95% n_eff  Rhat
## theta[1,1]      0.80    0.09  0.14    0.55    0.96     2  2.18
## theta[1,2]      0.20    0.09  0.14    0.04    0.45     2  2.18
## theta[2,1]      0.20    0.09  0.15    0.04    0.47     3  2.13
## theta[2,2]      0.80    0.09  0.15    0.53    0.96     3  2.13
## mu[1]           5.88    2.02  2.86    2.73    8.88     2 16.61
## mu[2]           6.10    1.93  2.73    3.08    8.96     2 16.71
## log_p_z_star -166.91    0.42  1.86 -170.48 -164.72    20  1.07
## lp__         -192.63   13.78 19.54 -214.04 -171.64     2 15.08
## 
## Samples were drawn using NUTS(diag_e) at Tue Oct  1 23:22:14 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
mcmc_trace(as.array(hmm_bad_stan$fit), regex_pars = "^theta\\[|^mu\\[", facet_args = list(ncol = 2))

Posterior Predictions

Below we plot 100 predicted outcome sequences given the predicted states and emission parameter estimates. (These predictions could also be implemented in the generated quantities {} block of our Stan model.) Notice how these posterior predictions based on the data line up nicely with the observed output values. This is one indication that the data generation process was appropriately modeled.

# extract samples
samples <- as.matrix(hmm_fit)
theta <- samples[,grep("^theta",colnames(samples))]
mu <- samples[,grep("^mu",colnames(samples))]
z_star <- samples[,grep("^z_star",colnames(samples))]

# simulate observations for each iteration in the sample
y_hat <- list()
for (i in 1:nrow(samples)) {
  psi_seq <- sapply(z_star[i,], function(x){mu[i,x]})
  y_hat[[i]] <- rnorm(length(psi_seq), psi_seq, 1)
}

# plot
indxs <- sample(length(y_hat), 100, replace = FALSE)
plot(hmm_data$y, type = "n",
     main = "Observed vs Predicted Output",
     ylab = "Observation Value",
     xlab = "Time",
     ylim = c(0,11))
for (i in indxs) {
  lines(y_hat[[i]], col = "#ff668890")
}
lines(hmm_data$y, lwd = 2)
legend("bottomright", c("Observed","Predicted"), col = c("#000000","#ff668890"), lty = c(1,1), lwd = c(2,1), cex = 0.8)

Below we overlay the predicted states with the true states. Similar to the conclusion above, our predictions line up nicely with the true values. Note that we can only do this because we know the true value of the hidden states. In situations where the states are truly hidden this step is infeasible.

# visualization
par(mfrow=c(2,1))
plot(hmm_data$z, type="s",
     main = "Latent States",
     ylab = "State Value",
     xlab = "Time",
     ylim = c(0.5,2.5), yaxt = "n")
axis(2, 1:2, 1:2)
points(colMeans(z_star), cex = 0.5)
legend("bottomright", c("Actual","Predicted"), pch = c(NA,1), lty = c(1,NA), cex = 0.5)
plot(hmm_data$y, type = "l",
     main = "Observed Output",
     ylab = "Observation Value",
     xlab = "Time")
y_plt <- hmm_data$y
y_plt[hmm_data$z==1] <- NA
lines(y_plt, lwd = 3)
legend("bottomright", c("State 1","State 2"), lty = c(1,1), lwd = c(1,3), cex = 0.8)

We can now adapt the methodology discussed in this section to identify drive events and defensive assignments using basketball player tracking data.

Tagging Drive Events

For those who are unfamiliar with basketball, a drive occurs when an offensive player dribbles the ball towards the hoop for a shot attempt (often a layup). We can translate a drive into two types of events that are happening simultaneously over time until the shot attempt,

  • The player’s speed increases as they accelerate towards the hoop.
  • The player’s distance from the hoop decreases.

The video below illustrates what a drive event looks like in the player tracking data (see drive_data.R for code). This drive possession was attributed to Zach LaVine in the Minnesota Timberwolves v Boston Celtics game on 12/21/2015.

Pre-process Data

Using the player tracking data we can construct the speed and distance metrics associated with LaVine’s drive event. We define speed as distance over time and use Euclidean distance to determine the player’s distance from the hoop at each time step. Below we show these metrics along with the player tracking data (see data/pt_data_evt140_drive.R for code). Notice how LaVine decreases his distance from the basket and increases his speed as he drives to the hoop.
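
As a rough sketch of how these metrics can be computed from raw coordinates (this is not the repository's exact code, and the player and hoop coordinates below are made up for illustration): distance is the Euclidean distance from the player to the hoop, and speed is the distance traveled between consecutive frames divided by the frame interval.

# hypothetical player coordinates (feet) over five frames sampled at 25hz
px <- c(40.0, 39.2, 38.3, 37.2, 36.0)
py <- c(30.0, 29.6, 29.1, 28.5, 27.8)
hoop <- c(5.25, 25)  # assumed location of the attacking basket

dist_from_hoop <- sqrt((px - hoop[1])^2 + (py - hoop[2])^2)
speed <- c(NA, sqrt(diff(px)^2 + diff(py)^2) * 25)  # feet per second; undefined at t = 1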

It’s apparent that the speed metric is pretty noisy. This may be attributed to the fact that the data are captured at 25 hertz (one observation every 1/25th of a second) and that location is determined by a computer vision algorithm rather than a tracking chip attached to the player. There are many methods that can be used to smooth the data (e.g. splines). Here we use a basic rolling mean with a window of three time steps. In our example the data are not so noisy that they would affect the performance of our model, so the smoothing is mostly for aesthetics and ease of interpretation. If the noise in the series were more extreme then we might want to consider a better smoothing method before fitting our model.

drive_data <- readRDS("../data/evt140_0021500411.RDS")

lavine_speed_smooth <- rep(drive_data$game$lavine_speed[2],2)
for (i in 3:nrow(drive_data$game))
  lavine_speed_smooth[i] <- mean(drive_data$game$lavine_speed[(i-2):i], na.rm=TRUE)

par(mfrow = c(3,1))
plot(drive_data$game$lavine_dist, type = "l",
     main = "Distance from Hoop",
     xlab = "Time (25hz)", ylab = "Distance from Hoop")
plot(drive_data$game$lavine_speed, type = "l",
     main = "Raw Speed",
     xlab = "Time (25hz)", ylab = "Speed")
plot(lavine_speed_smooth, type = "l",
     main = "Smooth Speed",
     xlab = "Time (25hz)", ylab = "Speed")

Now that we have transformed the data appropriately we can specify and fit the model.

Specifying and Fitting the Model

Here our HMM needs to infer two hidden states, drive and none, using both the observed speed and distance sequences. One approach would be to model 1/speed so that both observed sequences trend in the same direction within the drive and non-drive states (Keshri et al 2017). In this case a drive would be defined by,

  • The player reduces their 1/speed (equivalent to increasing speed).
  • The player reduces the distance between himself and the basket.

If the data are transformed in this way, a natural modeling approach for the emission probabilities would be to use the exponential distribution. Unfortunately this poses two issues:

  1. Scale of the data: The distance and speed metrics are on very different scales. It would make sense to take the log of these data to put them on comparable scales, but that is incompatible with the exponential distribution, whose support is restricted to nonnegative values (the logged data can be negative).
  2. Computational instability: The 1/speed transformation produces values that are mostly close to 0. This makes it difficult to use distributions that concentrate most of their density near 0, such as the exponential distribution: it becomes hard to discriminate between the “drive” and “none” states, since modeling 1/speed with the exponential distribution attributes high probability to both states.

While the exponential distribution is mathematically more tractable, we argue that such a choice for the data generation process would be computationally unstable. Instead, we opt to use the normal distribution for both metrics, which lets us model 1/speed and distance on the log scale.
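
As a minimal sketch (not necessarily the repository's exact preprocessing), the two observed sequences fed to the drive model would then be constructed along these lines, using the smoothed speed and distance series computed above.

u <- log(1 / lavine_speed_smooth)      # log of 1/speed
v <- log(drive_data$game$lavine_dist)  # log of distance from the hoop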

With the normal distributions defined over the observed data, our HMM is defined as follows,

\[ \begin{align*} u_t &\sim \mathcal{N}(\phi_{z_t}, \tau) \\ v_t &\sim \mathcal{N}(\lambda_{z_t}, \rho) \\ z_t &\sim \mathrm{Categorical}(\boldsymbol{\theta}_{z_{t-1}}) \\ \boldsymbol{\theta}_k &\sim \mathrm{Dirichlet}(\boldsymbol{\alpha}_k) \end{align*} \]

where \(u_t\) is the log of 1/speed, \(v_t\) is the log of the distance from the hoop, \(z_t \in \{1,2\}\), \(k \in \{1,2\}\) indexes the rows of the transition matrix, \(\tau=\rho=0.1\), and \(\boldsymbol{\alpha} = [[4,2],[2,4]]\).

Below is a graphical representation of the process that we are trying to model.

Now we can fit the model to LaVine’s data for the single drive possession. We provide the model results and diagnostics below.

# code available in drive_1.R
drive_stan <- readRDS("../results/drive_1.RDS")
print(drive_stan$fit, pars = 'z_star', include = FALSE, probs = c(0.05,0.95))
## Inference for Stan model: drive_1.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##                  mean se_mean   sd       5%      95% n_eff Rhat
## theta[1,1]       0.97    0.00 0.02     0.93     0.99  2129 1.00
## theta[1,2]       0.03    0.00 0.02     0.01     0.07  2129 1.00
## theta[2,1]       0.01    0.00 0.01     0.00     0.02  1954 1.00
## theta[2,2]       0.99    0.00 0.01     0.98     1.00  1954 1.00
## phi[1]          -2.34    0.00 0.01    -2.36    -2.33  1501 1.00
## phi[2]          -0.74    0.00 0.01    -0.75    -0.74  2150 1.00
## lambda[1]        2.43    0.00 0.01     2.41     2.45  1321 1.00
## lambda[2]        3.54    0.00 0.01     3.53     3.55  2262 1.00
## log_p_z_star -6905.79    0.07 2.19 -6910.07 -6902.96  1026 1.00
## lp__         -6932.75    0.07 1.82 -6936.19 -6930.45   757 1.01
## 
## Samples were drawn using NUTS(diag_e) at Sun Sep 29 00:05:49 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
mcmc_trace(as.array(drive_stan$fit), regex_pars = "^theta\\[|^phi\\[|^lambda\\[", facet_args = list(nrow = 2))