
7.4 Data Coding and Diagnostic Accuracy Models

Although seemingly disparate tasks, the rating/coding/annotation of items with categories and diagnostic testing for disease or other conditions share several characteristics which allow their statistical properties to be modeled similarly.

Diagnostic Accuracy

Suppose you have diagnostic tests for a condition of varying sensitivity and specificity. Sensitivity is the probability a test returns positive when the patient has the condition and specificity is the probability that a test returns negative when the patient does not have the condition. For example, mammograms and puncture biopsy tests both test for the presence of breast cancer. Mammograms have high sensitivity and low specificity, meaning lots of false positives, whereas puncture biopsies are the opposite, with low sensitivity and high specificity, meaning lots of false negatives.
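The interaction of sensitivity, specificity, and prevalence can be made concrete with Bayes' rule. The following sketch computes the probability a patient has the condition given a positive test; the numeric values are illustrative assumptions, not figures from the text.

```python
# Sketch: positive predictive value from sensitivity, specificity,
# and prevalence via Bayes' rule. All numbers are assumed for illustration.
sens = 0.87   # P(test positive | condition present)
spec = 0.70   # P(test negative | condition absent)
prev = 0.01   # P(condition present) in the screened population

# Total probability of a positive test.
p_pos = sens * prev + (1 - spec) * (1 - prev)

# P(condition present | test positive).
ppv = sens * prev / p_pos
print(round(ppv, 3))  # -> 0.028
```

With low prevalence, even a fairly sensitive test yields mostly false positives, which is why a high-specificity follow-up test such as a biopsy is used to confirm.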

There are several estimands of interest in such studies. An epidemiological study may be interested in the prevalence of a kind of infection, such as malaria, in a population. A test development study might be interested in the diagnostic accuracy of a new test. A health care worker performing tests might be interested in the disease status of a particular patient.

Data Coding

Humans are often given the task of coding (equivalently rating or annotating) data. For example, journal or grant reviewers rate submissions, a political study may code campaign commercials as to whether they are attack ads or not, a natural language processing study might annotate Tweets as to whether they are positive or negative in overall sentiment, or a dentist looking at an X-ray classifies a patient as having a cavity or not. In all of these cases, the data coders play the role of the diagnostic tests and all of the same estimands are in play — data coder accuracy and bias, true categories of items being coded, or the prevalence of various categories of items in the data.

Noisy Categorical Measurement Model

In this section, only categorical ratings are considered, and the challenge in the modeling for Stan is to marginalize out the discrete parameters.

Dawid and Skene (1979) introduce a noisy-measurement model for coding and apply it in the epidemiological setting of coding what doctors say about patient histories; the same model can be used for diagnostic procedures.


The data for the model consists of \(J\) raters (diagnostic tests), \(I\) items (patients), and \(K\) categories (condition statuses) to annotate, with \(y_{i, j} \in 1{:}K\) being the rating provided by rater \(j\) for item \(i\). In a diagnostic test setting for a particular condition, the raters are diagnostic procedures and often \(K=2\), with values signaling the presence or absence of the condition.

It is relatively straightforward to extend Dawid and Skene’s model to deal with the situation where not every rater rates each item exactly once.

Model Parameters

The model is based on three parameters, the first of which is discrete:

  • \(z_i\) : a value in \(1{:}K\) indicating the true category of item \(i\),
  • \(\pi\) : a \(K\)-simplex for the prevalence of the \(K\) categories in the population, and
  • \(\theta_{j,k}\) : a \(K\)-simplex for the response of annotator \(j\) to an item of true category \(k\).

Noisy Measurement Model

The true category of an item is assumed to be generated by a simple categorical distribution based on item prevalence, \[ z_i \sim \mathsf{Categorical}(\pi). \]

The rating \(y_{i, j}\) provided for item \(i\) by rater \(j\) is modeled as a categorical response of rater \(j\) to an item of category \(z_i\),

\[ y_{i, j} \sim \mathsf{Categorical}(\theta_{j,z[i]}). \]
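The two sampling statements above define a complete generative process, which can be simulated directly. The sketch below draws true categories and ratings for a handful of items; the values of \(\pi\) and \(\theta\) are assumed toy values, not estimates from any data set.

```python
import numpy as np

# Sketch of the Dawid-Skene generative process with assumed parameters.
rng = np.random.default_rng(0)
I, J, K = 5, 3, 2
pi = np.array([0.7, 0.3])               # assumed category prevalence
theta = np.tile(np.array([[0.8, 0.2],   # theta[j, k] is rater j's response
                          [0.1, 0.9]]), # simplex for true category k
                (J, 1, 1))

z = rng.choice(K, size=I, p=pi)         # z_i ~ Categorical(pi)
y = np.empty((I, J), dtype=int)
for i in range(I):
    for j in range(J):
        # y[i, j] ~ Categorical(theta[j, z[i]])
        y[i, j] = rng.choice(K, p=theta[j, z[i]])
```

Here categories are 0-indexed for Python; in the Stan model they run over \(1{:}K\).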

Priors and Hierarchical Modeling

Dawid and Skene provided maximum likelihood estimates for \(\theta\) and \(\pi\), which allowed them to generate probability estimates for each \(z_i\).

To mimic Dawid and Skene’s maximum likelihood model, the parameters \(\theta_{j,k}\) and \(\pi\) can be given uniform priors over \(K\)-simplexes. It is straightforward to generalize to Dirichlet priors, \[ \pi \sim \mathsf{Dirichlet}(\alpha) \] and \[ \theta_{j,k} \sim \mathsf{Dirichlet}(\beta_k) \] with fixed hyperparameters \(\alpha\) (a vector) and \(\beta\) (a matrix or array of vectors). The prior for \(\theta_{j,k}\) must be allowed to vary in \(k\), so that, for instance, \(\beta_{k,k}\) is large enough to allow the prior to favor better-than-chance annotators over random or adversarial ones.

Because there are \(J\) coders, it would be natural to extend the model to include a hierarchical prior for \(\beta\) and to partially pool the estimates of coder accuracy and bias.

Marginalizing out the True Category

Because the true category parameter \(z\) is discrete, it must be marginalized out of the joint posterior in order to carry out sampling or maximum likelihood estimation in Stan. The joint posterior factors as \[ p(y, \theta, \pi) = p(y | \theta,\pi) \, p(\pi) \, p(\theta), \] where \(p(y | \theta,\pi)\) is derived by marginalizing \(z\) out of

\[ p(z, y | \theta, \pi) \ = \ \prod_{i=1}^I \left( \mathsf{Categorical}(z_i | \pi) \prod_{j=1}^J \mathsf{Categorical}(y_{i, j}|\theta_{j, z[i]}) \right). \]

This can be done item by item, with \[ p(y | \theta, \pi) \ = \ \prod_{i=1}^I \sum_{k=1}^K \left( \mathsf{Categorical}(k | \pi) \prod_{j=1}^J \mathsf{Categorical}(y_{i, j}|\theta_{j, k}) \right). \]

In the missing data model, only the observed labels would be used in the inner product.

Dawid and Skene (1979) derive exactly the same equation in their Equation (2.7), required for the E-step in their expectation maximization (EM) algorithm. Stan requires the marginalized probability function on the log scale, \[ \begin{array}{l} \mbox{ } \ \log p(y | \theta, \pi) \\[3pt] \mbox{ } \ = \ \sum_{i=1}^I \log \left( \sum_{k=1}^K \exp \left( \log \mathsf{Categorical}(k | \pi) + \sum_{j=1}^J \log \mathsf{Categorical}(y_{i, j}|\theta_{j, k}) \right) \right), \end{array} \] which can be directly coded using Stan’s built-in log_sum_exp function.
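The log-scale marginalization can be checked numerically against the linear-scale sum. The sketch below computes the marginal likelihood contribution of a single item with a hand-rolled log-sum-exp mirroring Stan's; \(\pi\), \(\theta\), and the ratings are assumed toy values.

```python
import math

# Sketch: marginalizing z out of the likelihood for one item, on the
# log scale. pi and theta are assumed toy values; categories and
# ratings are 0-indexed here for Python.
pi = [0.7, 0.3]
theta = [  # theta[j][k][rating]: rater j's response simplex for category k
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.9, 0.1], [0.2, 0.8]],
]
y_i = [0, 0]  # both raters assign the first category

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# log pi_k + sum_j log theta[j, k, y[i, j]], for each candidate category k.
lp = [math.log(pi[k])
      + sum(math.log(theta[j][k][y_i[j]]) for j in range(len(y_i)))
      for k in range(len(pi))]
log_p_y = log_sum_exp(lp)

# The same quantity summed on the linear scale.
direct = sum(pi[k] * math.prod(theta[j][k][y_i[j]] for j in range(len(y_i)))
             for k in range(len(pi)))
print(abs(math.exp(log_p_y) - direct) < 1e-12)  # -> True
```

Working on the log scale with log-sum-exp avoids the underflow that the linear-scale product would suffer with many raters or items.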

Stan Implementation

The Stan program for the Dawid and Skene (1979) model is provided below.

data {
  int<lower=2> K;
  int<lower=1> I;
  int<lower=1> J;

  int<lower=1, upper=K> y[I, J];

  vector<lower=0>[K] alpha;
  vector<lower=0>[K] beta[K];
}
parameters {
  simplex[K] pi;
  simplex[K] theta[J, K];
}
transformed parameters {
  vector[K] log_q_z[I];
  for (i in 1:I) {
    log_q_z[i] = log(pi);
    for (j in 1:J)
      for (k in 1:K)
        log_q_z[i, k] = log_q_z[i, k]
                         + log(theta[j, k, y[i, j]]);
  }
}
model {
  pi ~ dirichlet(alpha);
  for (j in 1:J)
    for (k in 1:K)
      theta[j, k] ~ dirichlet(beta[k]);

  for (i in 1:I)
    target += log_sum_exp(log_q_z[i]);
}


The model marginalizes out the discrete parameter \(z\), storing the unnormalized conditional probability \(\log q(z_i=k|\theta,\pi)\) in log_q_z[i, k].

The Stan model converges quickly and mixes well using NUTS starting at diffuse initial points, unlike the equivalent model implemented with Gibbs sampling over the discrete parameter. Reasonable weakly informative priors are \(\alpha_k = 3\) and \(\beta_{k,k} = 2.5 K\) and \(\beta_{k,k'} = 1\) if \(k \neq k'\). Taking \(\alpha\) and \(\beta_k\) to be unit vectors and applying optimization will produce the same answer as the expectation maximization (EM) algorithm of Dawid and Skene (1979).
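The weakly informative hyperparameters suggested above can be assembled directly as data for the Stan program. The sketch below constructs them for an assumed \(K = 2\).

```python
import numpy as np

# Sketch: building the suggested weakly informative hyperparameters,
# assuming K = 2 categories.
K = 2
alpha = np.full(K, 3.0)                 # alpha_k = 3 for every category

# beta[k, k] = 2.5 * K on the diagonal, beta[k, k'] = 1 off the diagonal,
# so the prior favors better-than-chance annotators.
beta = np.ones((K, K)) + np.eye(K) * (2.5 * K - 1.0)
print(beta)  # -> [[5. 1.]
             #     [1. 5.]]
```

The diagonal entries grow with \(K\) so that the prior mass concentrates on annotators who assign the correct category more often than chance regardless of the number of categories.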

Inference for the True Category

The quantity log_q_z[i] is defined as a transformed parameter. It encodes the (unnormalized) log of \(p(z_i | \theta, \pi)\). Each iteration provides a value conditioned on that iteration’s values for \(\theta\) and \(\pi\). Applying the softmax function to log_q_z[i] provides a simplex corresponding to the probability mass function of \(z_i\) in the posterior. These may be averaged across the iterations to provide the posterior probability distribution over each \(z_i\).
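The softmax-and-average step can be sketched in a few lines. The per-draw log_q_z values below are assumed toy numbers standing in for saved posterior draws from the Stan fit.

```python
import math

# Sketch: recovering the posterior distribution over z_i for one item
# from per-draw unnormalized log probabilities. The draws are assumed
# toy values, not output from an actual fit.
def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Unnormalized log q(z_i = k | theta, pi) from three posterior draws.
log_q_z_draws = [[-0.5, -2.0], [-0.4, -1.8], [-0.6, -2.2]]

# Softmax each draw into a simplex, then average across draws.
simplexes = [softmax(d) for d in log_q_z_draws]
p_z = [sum(s[k] for s in simplexes) / len(simplexes) for k in range(2)]
print(round(sum(p_z), 6))  # -> 1.0
```

Averaging the per-draw simplexes, rather than averaging the log values, is what integrates \(\theta\) and \(\pi\) out of \(p(z_i | \theta, \pi)\).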


Dawid, A. P., and A. M. Skene. 1979. “Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 28 (1): 20–28.