Sparse and Ragged Data Structures
Stan does not directly support either sparse or ragged data structures, though both can be accommodated with some programming effort. The sparse matrices chapter introduces a special-purpose sparse matrix times dense vector multiplication, which should be used where applicable; this chapter covers more general data structures.
Sparse data structures
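Coding sparse data structures is as easy as moving from a matrix-like data structure to a database-like data structure. For example, consider the coding of sparse data for the IRT models discussed in the item-response model section. There are J students and K questions, and if every student answers every question, then it is practical to declare the data as a J × K array of answers.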
data {
  int<lower=1> J;                        // number of students
  int<lower=1> K;                        // number of questions
  array[J, K] int<lower=0, upper=1> y;   // y[j, k]: answer of student j to question k
  // ...
}
model {
  for (j in 1:J) {
    for (k in 1:K) {
      y[j, k] ~ bernoulli_logit(delta[k] * (alpha[j] - beta[k]));
    }
  }
  // ...
}
When not every student is given every question, the dense array coding will no longer work, because Stan does not support undefined values.
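The following missing data example has J = 3 students and K = 4 questions, with missing responses shown as NA, as in R.

$$
y = \begin{bmatrix}
0 & 1 & \mathrm{NA} & 1 \\
0 & \mathrm{NA} & \mathrm{NA} & 1 \\
\mathrm{NA} & 0 & \mathrm{NA} & \mathrm{NA}
\end{bmatrix}
$$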
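There is no support within Stan for R’s NA values, so this data structure cannot be used directly. Instead, it must be converted to a “long form” as in a database, with columns indicating the indices along with the value. With columns jj and kk used for the indices, the 2-D array y is recoded as the following table, which has one row for each defined array element.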
jj | kk | y |
---|---|---|
1 | 1 | 0 |
1 | 2 | 1 |
1 | 4 | 1 |
2 | 1 | 0 |
2 | 4 | 1 |
3 | 2 | 0 |
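Letting N be the number of entries of y that are defined, here N = 6, the data and model can be formulated as follows.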
data {
  // ...
  int<lower=1> N;
  array[N] int<lower=1, upper=J> jj;
  array[N] int<lower=1, upper=K> kk;
  array[N] int<lower=0, upper=1> y;
  // ...
}
model {
  for (n in 1:N) {
    y[n] ~ bernoulli_logit(delta[kk[n]]
                           * (alpha[jj[n]] - beta[kk[n]]));
  }
  // ...
}
In the situation where there are no missing values, the two model formulations produce exactly the same log posterior density.
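The long-form coding can also be written without the explicit loop, using multiple indexing with the integer arrays jj and kk. The following is a minimal sketch rather than a definitive implementation. It assumes that alpha, beta, and delta are declared as vectors (their declarations are elided above), reads them as student ability, question difficulty, and question discrimination, and uses placeholder priors that are not part of the original example.

```stan
data {
  int<lower=1> J;                       // number of students
  int<lower=1> K;                       // number of questions
  int<lower=1> N;                       // number of observed responses
  array[N] int<lower=1, upper=J> jj;    // student for response n
  array[N] int<lower=1, upper=K> kk;    // question for response n
  array[N] int<lower=0, upper=1> y;     // response n (1 = correct)
}
parameters {
  vector[J] alpha;                      // assumed: ability of student j
  vector[K] beta;                       // assumed: difficulty of question k
  vector<lower=0>[K] delta;             // assumed: discrimination of question k
}
model {
  // Placeholder priors, chosen only to make the sketch complete.
  alpha ~ std_normal();
  beta ~ normal(0, 5);
  delta ~ lognormal(0, 0.5);

  // delta[kk], alpha[jj], and beta[kk] are each vectors of size N,
  // so one statement covers all N observed responses.
  y ~ bernoulli_logit(delta[kk] .* (alpha[jj] - beta[kk]));
}
```

For the six-row table above, this program would be given N = 6 together with the three columns as the arrays jj, kk, and y.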
Ragged data structures
Ragged arrays are arrays that are not rectangular, but have entries of different sizes. This kind of structure crops up when there are different numbers of observations per entry.
A general approach to dealing with ragged structure is to move to a full database-like data structure as discussed in the previous section. A more compact approach is possible with some indexing into a linear array.
For example, consider a data structure for three groups, each of which has a different number of observations.
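On the left is the definition of a ragged data structure y with three rows of different sizes. On the right is an example of how to code the data in Stan, using a single vector y to hold all the values and a separate array of integers s to hold the group row sizes.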
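Suppose the model is a simple varying intercept model, which, using vectorized notation, would yield a log-likelihood

$$
\sum_{k=1}^{3} \log \mathrm{normal}(y_k \mid \mu_k, \sigma),
$$

where $y_k$ is the vector of observations for group $k$ and $\mu_k$ is the intercept for that group.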
A full database type structure could be used, as in the sparse example, but this is inefficient, wasting space for unnecessary indices and not allowing vector-based density operations. A better way to code this data is as a single list of values, with a separate data structure indicating the sizes of each subarray. This is indicated on the right of the example. This coding uses a single array for the values and a separate array for the sizes of each row.
The model can then be coded up using slicing operations as follows.
data {
  int<lower=0> N;   // # observations
  int<lower=0> K;   // # of groups
  vector[N] y;      // observations
  array[K] int s;   // group sizes
  // ...
}
model {
  int pos;          // start of the current group within y
  pos = 1;
  for (k in 1:K) {
    segment(y, pos, s[k]) ~ normal(mu[k], sigma);
    pos = pos + s[k];
  }
}
This coding allows for efficient vectorization, which is worth the copy cost entailed by the segment() vector slicing operation.
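Because the slicing relies on the group sizes adding up to the number of observations, it can help to check this when the data are read. The following sketch of a complete program makes several assumptions that are not in the original example: the declarations of mu and sigma (elided above), the lower bound on s, and the consistency check in the transformed data block. No priors are given, so mu and sigma receive implicit improper uniform priors over their declared supports.

```stan
data {
  int<lower=0> N;               // # observations
  int<lower=0> K;               // # of groups
  vector[N] y;                  // observations, group 1 first, then group 2, ...
  array[K] int<lower=0> s;      // group sizes (lower bound is an assumption)
}
transformed data {
  // Added check: the group sizes must account for every observation.
  if (sum(s) != N) {
    reject("group sizes s must sum to N; found sum(s) = ", sum(s));
  }
}
parameters {
  vector[K] mu;                 // assumed: one intercept per group
  real<lower=0> sigma;          // assumed: shared residual scale
}
model {
  int pos = 1;                  // start of the current group within y
  for (k in 1:K) {
    // Observations pos, ..., pos + s[k] - 1 belong to group k.
    segment(y, pos, s[k]) ~ normal(mu[k], sigma);
    pos += s[k];
  }
}
```

With group sizes s = {3, 2, 4}, for instance, the loop visits positions 1 to 3, 4 to 5, and 6 to 9 of y in turn.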