RDump Format for CmdStan
NOTE: Although the RDump format is still supported, I/O with JSON is faster and recommended. See the chapter on JSON for more details.
RDump format can be used to represent values for Stan variables. This format was introduced in SPLUS and is used in R, JAGS, and in BUGS (but with a different ordering).
A dump file is structured as a sequence of variable definitions. Each variable is defined in terms of its dimensionality and its values. There are three kinds of variable declarations: - scalars - sequences - general arrays
Creating dump files
Dump files can be created from R using RStan, via the rstan
package function stan_rdump
. Stan RDump files must be created via stan_rdump
and not by R’s native dump
function because R’s dump
function uses a richer syntax than is supported by the underlying Stan i/o libraries.
Scalar variables
A simple scalar value can be thought of as having an empty list of dimensions. Its declaration in the dump format follows the SPLUS assignment syntax. For example, the following would constitute a valid dump file defining a single scalar variable y
with value \(17.2\):
y <- 17.2
Sequence variables
One-dimensional arrays may be specified directly using the SPLUS sequence notation. The following example defines an integer-value and a real-valued sequence.
n <- c(1,2,3) y <- c(2.0,3.0,9.7)
Arrays are provided without a declaration of dimensionality because the reader just counts the number of entries to determine the size of the array.
Sequence variables may alternatively be represented with R’s colon-based notation. For instance, the first example above could equivalently be written as
n <- 1:3
The sequence denoted by 1:3
is of length \(3\), running from \(1\) to \(3\) inclusive. The colon notation allows sequences going from high to low. The following are equivalent:
n <- 2:-2
n <- c(2,1,0,-1,-2)
As a special case, a sequence of zeros can also be represented in the dump format by integer(x)
and double(x)
, for type int and double, respectively. Here x
is a non-negative integer to specify the length. If x
is \(0\), it can be omitted. The following are some examples.
x1 <- integer()
x2 <- integer(0)
x3 <- integer(2)
y1 <- double()
y2 <- double(0)
y3 <- double(2)
Array variables
For more than one dimension, the dump format uses a dimensionality specification. For example, the following defines a \(2 \times 3\) array:
y <- structure(c(1,2,3,4,5,6), .Dim = c(2,3))
Data is stored column-major, thus the values for y
will be:
y[1, 1] = 1
y[1, 2] = 3
y[1, 3] = 5
y[2, 1] = 2
y[2, 2] = 4
y[2, 3] = 6
The structure
keyword just wraps a sequence of values and a dimensionality declaration, which is itself just a sequence of non-negative integer values. The product of the dimensions must equal the length of the array.
If the values happen to form a contiguous sequence of integers, they may be written with colon notation. Thus the example above is equivalent to the following.
y <- structure(1:6, .Dim = c(2,3))
Sequence notation can be used within any call to the generic c()
function in R. In the above example, c(2,3) could be written as c(2:3)
.
The generalization of column-major indexing is last-index major indexing. Arrays of more than two dimensions are written in a last-index major form. For example,
z <- structure(1:24, .Dim = c(2,3,4))
produces a three-dimensional int
(assignable to real
) array z
with values:
z[1, 1, 1] = 1
z[2, 1, 1] = 2
z[1, 2, 1] = 3
z[2, 2, 1] = 4
z[1, 3, 1] = 5
z[2, 3, 1] = 6
z[1, 1, 2] = 7
z[2, 1, 2] = 8
z[1, 2, 2] = 9
z[2, 2, 2] = 10
z[1, 3, 2] = 11
z[2, 3, 2] = 12
z[1, 1, 3] = 13
z[2, 1, 3] = 14
z[1, 2, 3] = 15
z[2, 2, 3] = 16
z[1, 3, 3] = 17
z[2, 3, 3] = 18
z[1, 1, 4] = 19
z[2, 1, 4] = 20
z[1, 2, 4] = 21
z[2, 2, 4] = 22
z[1, 3, 4] = 23
z[2, 3, 4] = 24
If the underlying 3-D array is stored as a 1-D array in last-index major format, the innermost array elements will be contiguous.
The sequence of values inside structure
can also be integer(x)
or double(x)
. In particular, if one or more dimensions is zero, integer()
can be put inside structure
. For instance, the following example is supported by the dump format.
y <- structure(integer(), .Dim = c(2,0))
Matrix- and vector-valued variables
The dump format for matrices and vectors, including arrays of matrices and vectors, is the same as that for arrays of the same shape.
Vector dump format
The following three declarations have the same dump format for their data.
array[K] real a;
vector[K] b;
row_vector[K] c;
Matrix dump format
The following declarations have the same dump format.
array[M, N] real a;
matrix[M, N] b;
Arrays of vectors and matrices
The key to understanding arrays is that the array indexing comes before any of the container indexing. That is, an array of vectors is just that: each array element is a vector. See the chapter on array and matrix types in the user’s guide section of the language manual for more information.
For the dump data format, the following declarations have the same arrangement.
array[M, N] real a;
matrix[M, N] b;
array[M] vector[N] c;
array[M] row_vector[N] d;
Similarly, the following also have the same dump format.
array[P, M, N] real a;
array[P] matrix[M, N] b;
array[P, M] vector[N] c;
array[P, M] row_vector[N] d;
Complex-valued variables
At this time, there is no support for complex number input through the R dump format. As an alternative, the JSON input format supports complex numbers.
Integer- and real-valued variables
There is no declaration in a dump file that distinguishes integer versus continuous values. If a value in a dump file’s definition of a variable contains a decimal point (e.g., \(132.3\)) or uses scientific notation (e.g., \(1.323e2\)), Stan assumes that the values are real.
For a single value, if there is no decimal point, it may be assigned to an int
or real
variable in Stan. An array value may only be assigned to an int
array if there is no decimal point or scientific notation in any of the values. This convention is compatible with the way R writes data.
The following dump file declares an integer value for y
.
y <- 2
This definition can be used for a Stan variable y
declared as real
or as int
. Assigning an integer value to a real variable automatically promotes the integer value to a real value.
Integer values may optionally be followed by L
or l
, denoting long integer values. The following example, where the type is explicit, is equivalent to the above.
y <- 2L
The following dump file provides a real value for y
.
y <- 2.0
Even though this is a round value, the occurrence of the decimal point in the value, \(2.0\), causes Stan to infer that y
is real valued. This dump file may only be used for variables y
declared as real in Stan.
Scientific notation
Numbers written in scientific notation may only be used for real values in Stan. R will write out the integer one million as \(1e+06\).
Infinite and not-a-number values
Stan’s reader supports infinite and not-a-number values for scalar quantities (see the section of the reference manual section of the language manual for more information on Stan’s numerical data types). Both infinite and not-a-number values are supported by Stan’s dump-format readers.
Value | Preferred Form | Alternative Forms |
---|---|---|
positive infinity | Inf | Infinity, infinity |
negative infinity | -Inf | -Infinity, -infinity |
not a number | NaN |
These strings are not case sensitive, so inf
may also be used for positive infinity, or NAN
for not-a-number.
Quoted variable names
In order to support JAGS data files, variables may be double quoted. For instance, the following definition is legal in a dump file.
"y" <- c(1,2,3) \end{Verbatim}
Line breaks
The line breaks in a dump file are required to be consistent with the way R reads in data. Both of the following declarations are legal.
y <- 2
y <-
3
Also following R, breaking before the assignment arrow are not allowed, so the following is invalid.
y
<- 2 # Syntax Error
Lines may also be broken in the middle of sequences declared using the c(...)
notation., as well as between the comma following a sequence definition and the dimensionality declaration. For example, the following declaration of a \(2 \times 2 \times 3\) array is valid.
y <-
structure(c(1,2,3,
4,5,6,7,8,9,10,11,
12), .Dim = c(2,2,
3))
Because there are no decimal points in the values, the resulting dump file may be used for three-dimensional array variables declared as int
or real
.
BNF grammar for dump data
A more precise definition of the dump data format is provided by the following (mildly templated) Backus-Naur form grammar.
definition ::= name <- value optional_semicolon
name ::= char* | ''' char* ''' | '"' char* '"'
value ::= value<int> | value<double>
value<T> ::= T | seq<T> | zero_array<T> |
'structure' '(' seq<T> ',' ".Dim" '=' seq<int> ')' | 'structure'
'(' zero_array<T> ',' ".Dim" '=' seq<int> ')'
seq<int> ::= int ':' int | cseq<int>
zero_array<int> ::= "integer" '(' <non-negative int>? ')'
zero_array<real> ::= "double" '(' <non-negative int>? ')'
seq<real> ::= cseq<real>
cseq<T> ::= 'c' '(' vseq<T> ')'
vseq<T> ::= T | T ',' vseq<T>
The template parameters T
will be set to either int
or real
. Because Stan allows promotion of integer values to real values, an integer sequence specification in the dump data format may be assigned to either an integer- or real-based variable in Stan.