21 RDump Format for CmdStan

NOTE: Although the RDump format is still supported, I/O with JSON is faster and recommended. See the chapter on JSON for more details.

RDump format can be used to represent values for Stan variables. This format was introduced in SPLUS and is used in R, JAGS, and in BUGS (but with a different ordering).

A dump file is structured as a sequence of variable definitions. Each variable is defined in terms of its dimensionality and its values. There are three kinds of variable declarations: - scalars - sequences - general arrays

21.1 Creating dump files

Dump files can be created from R using RStan, via the rstan package function stan_rdump. Stan RDump files must be created via stan_rdump and not by R’s native dump function because R’s dump function uses a richer syntax than is supported by the underlying Stan i/o libraries.

21.2 Scalar variables

A simple scalar value can be thought of as having an empty list of dimensions. Its declaration in the dump format follows the SPLUS assignment syntax. For example, the following would constitute a valid dump file defining a single scalar variable y with value \(17.2\):

y <- 17.2

21.3 Sequence variables

One-dimensional arrays may be specified directly using the SPLUS sequence notation. The following example defines an integer-value and a real-valued sequence.

n <- c(1,2,3) y <- c(2.0,3.0,9.7)

Arrays are provided without a declaration of dimensionality because the reader just counts the number of entries to determine the size of the array.

Sequence variables may alternatively be represented with R’s colon-based notation. For instance, the first example above could equivalently be written as

n <- 1:3

The sequence denoted by 1:3 is of length \(3\), running from \(1\) to \(3\) inclusive. The colon notation allows sequences going from high to low. The following are equivalent:

n <- 2:-2
n <- c(2,1,0,-1,-2)

As a special case, a sequence of zeros can also be represented in the dump format by integer(x) and double(x), for type int and double, respectively. Here x is a non-negative integer to specify the length. If x is \(0\), it can be omitted. The following are some examples.

x1 <- integer()
x2 <- integer(0)
x3 <- integer(2)
y1 <- double()
y2 <- double(0)
y3 <- double(2)

21.4 Array variables

For more than one dimension, the dump format uses a dimensionality specification. For example, the following defines a \(2 \times 3\) array:

y <- structure(c(1,2,3,4,5,6), .Dim = c(2,3))

Data is stored column-major, thus the values for y will be:

y[1,1] = 1
y[1,2] = 3
y[1,3] = 5
y[2,1] = 2
y[2,2] = 4
y[2,3] = 6

The structure keyword just wraps a sequence of values and a dimensionality declaration, which is itself just a sequence of non-negative integer values. The product of the dimensions must equal the length of the array.

If the values happen to form a contiguous sequence of integers, they may be written with colon notation. Thus the example above is equivalent to the following.

y <- structure(1:6, .Dim = c(2,3))

Sequence notation can be used within any call to the generic c() function in R. In the above example, c(2,3) could be written as c(2:3).

The generalization of column-major indexing is last-index major indexing. Arrays of more than two dimensions are written in a last-index major form. For example,

z <- structure(1:24, .Dim = c(2,3,4))

produces a three-dimensional int (assignable to real) array z with values:

z[1,1,1] = 1
z[2,1,1] = 2
z[1,2,1] = 3
z[2,2,1] = 4
z[1,3,1] = 5
z[2,3,1] = 6
z[1,1,2] = 7
z[2,1,2] = 8
z[1,2,2] = 9
z[2,2,2] = 10
z[1,3,2] = 11
z[2,3,2] = 12
z[1,1,3] = 13
z[2,1,3] = 14
z[1,2,3] = 15
z[2,2,3] = 16
z[1,3,3] = 17
z[2,3,3] = 18
z[1,1,4] = 19
z[2,1,4] = 20
z[1,2,4] = 21
z[2,2,4] = 22
z[1,3,4] = 23
z[2,3,4] = 24

If the underlying 3-D array is stored as a 1-D array in last-index major format, the innermost array elements will be contiguous.

The sequence of values inside structure can also be integer(x) or double(x). In particular, if one or more dimensions is zero, integer() can be put inside structure. For instance, the following example is supported by the dump format.

y <- structure(integer(), .Dim = c(2,0))

21.5 Matrix- and vector-valued variables

The dump format for matrices and vectors, including arrays of matrices and vectors, is the same as that for arrays of the same shape.

21.5.1 Vector dump format

The following three declarations have the same dump format for their data.

real a[K];
vector[K] b;
row_vector[K] c;

21.5.2 Matrix dump format

The following declarations have the same dump format.

real a[M,N];
matrix[M,N] b;

21.5.3 Arrays of vectors and matrices

The key to understanding arrays is that the array indexing comes before any of the container indexing. That is, an array of vectors is just that: each array element is a vector. See the chapter on array and matrix types in the user’s guide section of the language manual for more information.

For the dump data format, the following declarations have the same arrangement.

real a[M,N];
matrix[M,N] b;
vector[N] c[M];
row_vector[N] d[M];

Similarly, the following also have the same dump format.

real a[P,M,N];
matrix[M,N] b[P];
vector[N] c[P,M];
row_vector[N] d[P,M];

21.6 Integer- and real-valued variables

There is no declaration in a dump file that distinguishes integer versus continuous values. If a value in a dump file’s definition of a variable contains a decimal point (e.g., \(132.3\)) or uses scientific notation (e.g., \(1.323e2\)), Stan assumes that the values are real.

For a single value, if there is no decimal point, it may be assigned to an int or real variable in Stan. An array value may only be assigned to an int array if there is no decimal point or scientific notation in any of the values. This convention is compatible with the way R writes data.

The following dump file declares an integer value for y.

y <- 2

This definition can be used for a Stan variable y declared as real or as int. Assigning an integer value to a real variable automatically promotes the integer value to a real value.

Integer values may optionally be followed by L or l, denoting long integer values. The following example, where the type is explicit, is equivalent to the above.

y <- 2L

The following dump file provides a real value for y.

y <- 2.0

Even though this is a round value, the occurrence of the decimal point in the value, \(2.0\), causes Stan to infer that y is real valued. This dump file may only be used for variables y declared as real in Stan.

21.6.1 Scientific notation

Numbers written in scientific notation may only be used for real values in Stan. R will write out the integer one million as \(1e+06\).

21.6.2 Infinite and not-a-number values

Stan’s reader supports infinite and not-a-number values for scalar quantities (see the section of the reference manual section of the language manual for more information on Stan’s numerical data types). Both infinite and not-a-number values are supported by Stan’s dump-format readers.

Value Preferred Form Alternative Forms
positive infinity Inf Infinity, infinity
negative infinity -Inf -Infinity, -infinity
not a number NaN

These strings are not case sensitive, so inf may also be used for positive infinity, or NAN for not-a-number.

21.7 Quoted variable names

In order to support JAGS data files, variables may be double quoted. For instance, the following definition is legal in a dump file.

"y" <- c(1,2,3) \end{Verbatim}

21.8 Line breaks

The line breaks in a dump file are required to be consistent with the way R reads in data. Both of the following declarations are legal.

y <- 2
y <-
3

Also following R, breaking before the assignment arrow are not allowed, so the following is invalid.

y
<- 2 # Syntax Error

Lines may also be broken in the middle of sequences declared using the c(...) notation., as well as between the comma following a sequence definition and the dimensionality declaration. For example, the following declaration of a \(2 \times 2 \times 3\) array is valid.

y <-
structure(c(1,2,3,
4,5,6,7,8,9,10,11,
12), .Dim = c(2,2,
3))

Because there are no decimal points in the values, the resulting dump file may be used for three-dimensional array variables declared as int or real.

21.9 BNF grammar for dump data

A more precise definition of the dump data format is provided by the following (mildly templated) Backus-Naur form grammar.

definition ::= name <- value optional_semicolon

name ::= char*     | ''' char* '''     | '"' char* '"'

value ::= value<int> | value<double>

value<T> ::= T       | seq<T>       | zero_array<T>       |
'structure' '(' seq<T> ',' ".Dim" '=' seq<int> ')'       | 'structure'
'(' zero_array<T> ',' ".Dim" '=' seq<int> ')'

seq<int> ::= int ':' int       | cseq<int>

zero_array<int> ::= "integer" '(' <non-negative int>? ')'

zero_array<real> ::= "double" '(' <non-negative int>? ')'

seq<real> ::= cseq<real>

cseq<T> ::= 'c' '(' vseq<T> ')'

vseq<T> ::= T      | T ',' vseq<T>

The template parameters T will be set to either int or real. Because Stan allows promotion of integer values to real values, an integer sequence specification in the dump data format may be assigned to either an integer- or real-based variable in Stan.