8.1 Sparse Data Structures

This is an old version, view current version.

Coding sparse data structures is as easy as moving from a matrix-like data structure to a database-like data structure. For example, consider the coding of sparse data for the IRT models discussed in the item-response model section. There are \(J\) students and \(K\) questions, and if every student answers every question, then it is practical to declare the data as a \(J \times K\) array of answers.

data {
  int<lower=1> J;
  int<lower=1> K;
  int<lower=0,upper=1> y[J, K];
  ...
model {
  for (j in 1:J)
    for (k in 1:K)
      y[j, k] ~ bernoulli_logit(delta[k] * (alpha[j] - beta[k]));
  ...

\[ y = \left[ \begin{array}{cccc} 0 & 1 & \mbox{NA} & 1 \\ 0 & \mbox{NA} & \mbox{NA} & 1 \\ \mbox{NA} & 0 & \mbox{NA} & \mbox{NA} \end{array} \right] \]

\(jj\)	\(kk\)	\(y\)
1	1	0
1	2	1
1	4	1
2	1	0
2	4	1
3	2	0

On the left is a definition of a sparse matrix \(y\) using the NA notation from R (which is not supported by Stan). On the right is a database-like encoding of the same sparse matrix \(y\) that can be used directly in Stan. The first two columns, \(jj\) and \(kk\), denote the indexes and the final column, \(y\), the value. For example, the fifth row of the database-like data structure on the right indicates that \(y_{2,4} = 1\).

When not every student is given every question, the dense array coding will no longer work, because Stan does not support undefined values. The sparse data example shows an example with \(J=3\) and \(K=4\), with missing responses shown as NA, as in R. There is no support within Stan for R’s NA values, so this data structure cannot be used directly. Instead, it must be converted to a “long form” as in a database, with columns indicating the \(j\) and \(k\) indexes along with the value. For instance, with \(jj\) and \(kk\) used for the indexes (following Gelman and Hill (2007)), the data structure can be coded as in the right-hand example in the example. This says that \(y_{1,1} = 0\), \(y_{1,2} = 1\), and so on, up to \(y_{3,2} = 1\), with all other entries undefined.

Letting \(N\) be the number of \(y\) that are defined, here \(N=6\), the data and model can be formulated as follows.

data {
  ...
  int<lower=1> N;
  int<lower=1,upper=J> jj[N];
  int<lower=1,upper=K> kk[N];
  int<lower=0,upper=1> y[N];
  ...
model {
  for (n in 1:N)
    y[n] ~ bernoulli_logit(delta[kk[n]]
                           * (alpha[jj[n]] - beta[kk[n]]));
  ...

In the situation where there are no missing values, the two model formulations produce exactly the same log posterior density.

References

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.