8.2 Ragged Data Structures

This is an old version, view current version.

Ragged arrays are arrays that are not rectangular, but have different sized entries. This kind of structure crops up when there are different numbers of observations per entry.

A general approach to dealing with ragged structure is to move to a full database-like data structure as discussed in the previous section. A more compact approach is possible with some indexing into a linear array.

For example, consider a data structure for three groups, each of which has a different number of observations.

\(y_1 = \left[1.3 \ \ 2.4 \ \ 0.9\right]\)

\(y_2 = \left[-1.8 \ \ -0.1\right]\)

\(y_3 = \left[12.9 \ \ 18.7 \ \ 42.9 \ \ 4.7\right]\)

\(z = [1.3 \ \ 2.4 \ \ 0.9 \ \ -1.8 \ \ -0.1 \ \ 12.9 \ \ 18.7 \ \ 42.9 \ \ 4.7]\)

\(s = \{ 3 \ \ 2 \ \ 4 \}\)

On the left is the definition of a ragged data structure \(y\) with three rows of different sizes (\(y_1\) is size 3, \(y_2\) size 2, and \(y_3\) size 4). On the right is an example of how to code the data in Stan, using a single vector \(y\) to hold all the values and a separate array of integers \(s\) to hold the group row sizes. In this example, \(y_1 = z_{1:3}\), \(y_2 = z_{4:5}\), and \(y_3 = z_{6:9}\).

Suppose the model is a simple varying intercept model, which, using vectorized notation, would yield a log-likelihood \[ \sum_{n=1}^3 \log \textsf{normal}(y_n \mid \mu_n, \sigma). \] There’s no direct way to encode this in Stan.

A full database type structure could be used, as in the sparse example, but this is inefficient, wasting space for unnecessary indices and not allowing vector-based density operations. A better way to code this data is as a single list of values, with a separate data structure indicating the sizes of each subarray. This is indicated on the right of the example. This coding uses a single array for the values and a separate array for the sizes of each row.

The model can then be coded up using slicing operations as follows.

data {
  int<lower=0> N;   // # observations
  int<lower=0> K;   // # of groups
  vector[N] y;      // observations
  int s[K];         // group sizes
  ...
model {
  int pos;
  pos = 1;
  for (k in 1:K) {
    segment(y, pos, s[k]) ~ normal(mu[k], sigma);
    pos = pos + s[k];
  }

This coding allows for efficient vectorization, which is worth the copy cost entailed by the segment() vector slicing operation.