Language Syntax
This chapter defines the basic syntax of the Stan modeling language using a Backus-Naur form (BNF) grammar plus extra-grammatical constraints on function typing and operator precedence and associativity.
BNF grammars
Syntactic conventions
In the following BNF grammars, tokens are represented in ALLCAPS. Grammar non-terminals are surrounded by < and >. Square brackets ([A]) indicate optionality of A. A postfixed Kleene star (A*) indicates zero or more occurrences of A. Parentheses can be used to group symbols together in productions.
Finally, this grammar uses the concept of “parameterized nonterminals” as used in the parsing library Menhir. A rule like <list(x)> ::= x (COMMA x)* declares a generic list rule, which can later be applied to a specific symbol, for example as <list(<expression>)>.
The following representation is constructed directly from the OCaml reference parser using a tool called Obelisk. The raw output is available here.
Programs
<program> ::= [<function_block>] [<data_block>] [<transformed_data_block>]
[<parameters_block>] [<transformed_parameters_block>]
[<model_block>] [<generated_quantities_block>] EOF
<functions_only> ::= <function_def>* EOF
<function_block> ::= FUNCTIONBLOCK LBRACE <function_def>* RBRACE
<data_block> ::= DATABLOCK LBRACE <top_var_decl_no_assign>* RBRACE
<transformed_data_block> ::= TRANSFORMEDDATABLOCK LBRACE
<top_vardecl_or_statement>* RBRACE
<parameters_block> ::= PARAMETERSBLOCK LBRACE <top_var_decl_no_assign>*
RBRACE
<transformed_parameters_block> ::= TRANSFORMEDPARAMETERSBLOCK LBRACE
<top_vardecl_or_statement>* RBRACE
<model_block> ::= MODELBLOCK LBRACE <vardecl_or_statement>* RBRACE
<generated_quantities_block> ::= GENERATEDQUANTITIESBLOCK LBRACE
<top_vardecl_or_statement>* RBRACE
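As an illustration of the <program> production, the following sketch uses all seven blocks in the order the grammar requires. The model, the function name twice, and all variable names are invented for this example. Every block is optional, so any of them could be omitted.

```stan
functions {
  // a user-defined function (<function_def>)
  real twice(real x) {
    return 2 * x;
  }
}
data {
  int<lower=0> N;          // declarations only, no assignment allowed here
  vector[N] y;
}
transformed data {
  real y_mean = mean(y);   // declarations and statements are both allowed
}
parameters {
  real mu;
  real<lower=0> sigma;
}
transformed parameters {
  real mu_shifted = mu - y_mean;
}
model {
  mu ~ normal(0, 10);
  sigma ~ exponential(1);
  y ~ normal(mu, sigma);
}
generated quantities {
  real y_rep = normal_rng(mu, sigma);
}
```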
Function declarations and definitions
<function_def> ::= <return_type> <decl_identifier> LPAREN [<arg_decl> (COMMA
<arg_decl>)*] RPAREN <statement>
<return_type> ::= VOID
| <unsized_type>
<arg_decl> ::= [DATABLOCK] <unsized_type> <decl_identifier>
<unsized_type> ::= ARRAY <unsized_dims> <basic_type>
| ARRAY <unsized_dims> <unsized_tuple_type>
| <basic_type>
| <unsized_tuple_type>
<unsized_tuple_type> ::= TUPLE LPAREN <unsized_type> COMMA <unsized_type>
(COMMA <unsized_type>)* RPAREN
<basic_type> ::= INT
| REAL
| COMPLEX
| VECTOR
| ROWVECTOR
| MATRIX
| COMPLEXVECTOR
| COMPLEXROWVECTOR
| COMPLEXMATRIX
<unsized_dims> ::= LBRACK COMMA* RBRACK
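The productions above can be read against a few hypothetical function definitions. The names below are made up for illustration, and the tuple example assumes a Stan release that supports tuple types.

```stan
functions {
  // <return_type> is VOID; the argument list is empty
  void say_hello() {
    print("hello");
  }

  // <unsized_type> with <unsized_dims>: "array[,]" is a two-dimensional array
  real total(array[,] real x) {
    return sum(to_array_1d(x));
  }

  // <arg_decl> with the optional DATABLOCK qualifier ("data")
  real scaled(data real s, vector v) {
    return s * sum(v);
  }

  // <unsized_tuple_type> as a return type
  tuple(real, int) pair_up(real a, int b) {
    return (a, b);
  }
}
```

Note that argument and return types carry no sizes: vector rather than vector[N], and array[,] real rather than array[M, N] real.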
Variable declarations and compound definitions
<identifier> ::= IDENTIFIER
| TRUNCATE
<decl_identifier> ::= <identifier>
<no_assign> ::= UNREACHABLE
<optional_assignment(rhs)> ::= [ASSIGN rhs]
<id_and_optional_assignment(rhs)> ::= <decl_identifier>
<optional_assignment(rhs)>
<decl(type_rule, rhs)> ::= type_rule <decl_identifier> <dims>
<optional_assignment(rhs)> SEMICOLON
| <higher_type(type_rule)>
<id_and_optional_assignment(rhs)> (COMMA
<id_and_optional_assignment(rhs)>)* SEMICOLON
<higher_type(type_rule)> ::= <array_type(type_rule)>
| <tuple_type(type_rule)>
| type_rule
<array_type(type_rule)> ::= <arr_dims> type_rule
| <arr_dims> <tuple_type(type_rule)>
<tuple_type(type_rule)> ::= TUPLE LPAREN <higher_type(type_rule)> COMMA
<higher_type(type_rule)> (COMMA
<higher_type(type_rule)>)* RPAREN
<var_decl> ::= <decl(<sized_basic_type>, <expression>)>
<top_var_decl> ::= <decl(<top_var_type>, <expression>)>
<top_var_decl_no_assign> ::= <decl(<top_var_type>, <no_assign>)>
| SEMICOLON
<sized_basic_type> ::= INT
| REAL
| COMPLEX
| VECTOR LBRACK <expression> RBRACK
| ROWVECTOR LBRACK <expression> RBRACK
| MATRIX LBRACK <expression> COMMA <expression> RBRACK
| COMPLEXVECTOR LBRACK <expression> RBRACK
| COMPLEXROWVECTOR LBRACK <expression> RBRACK
| COMPLEXMATRIX LBRACK <expression> COMMA <expression>
RBRACK
<top_var_type> ::= INT [LABRACK <range> RABRACK]
| REAL <type_constraint>
| COMPLEX <type_constraint>
| VECTOR <type_constraint> LBRACK <expression> RBRACK
| ROWVECTOR <type_constraint> LBRACK <expression> RBRACK
| MATRIX <type_constraint> LBRACK <expression> COMMA
<expression> RBRACK
| COMPLEXVECTOR <type_constraint> LBRACK <expression> RBRACK
| COMPLEXROWVECTOR <type_constraint> LBRACK <expression>
RBRACK
| COMPLEXMATRIX <type_constraint> LBRACK <expression> COMMA
<expression> RBRACK
| ORDERED LBRACK <expression> RBRACK
| POSITIVEORDERED LBRACK <expression> RBRACK
| SIMPLEX LBRACK <expression> RBRACK
| UNITVECTOR LBRACK <expression> RBRACK
| CHOLESKYFACTORCORR LBRACK <expression> RBRACK
| CHOLESKYFACTORCOV LBRACK <expression> [COMMA <expression>]
RBRACK
| CORRMATRIX LBRACK <expression> RBRACK
| COVMATRIX LBRACK <expression> RBRACK
<type_constraint> ::= [LABRACK <range> RABRACK]
| LABRACK <offset_mult> RABRACK
<range> ::= LOWER ASSIGN <constr_expression> COMMA UPPER ASSIGN
<constr_expression>
| UPPER ASSIGN <constr_expression> COMMA LOWER ASSIGN
<constr_expression>
| LOWER ASSIGN <constr_expression>
| UPPER ASSIGN <constr_expression>
<offset_mult> ::= OFFSET ASSIGN <constr_expression> COMMA MULTIPLIER ASSIGN
<constr_expression>
| MULTIPLIER ASSIGN <constr_expression> COMMA OFFSET ASSIGN
<constr_expression>
| OFFSET ASSIGN <constr_expression>
| MULTIPLIER ASSIGN <constr_expression>
<arr_dims> ::= ARRAY LBRACK <expression> (COMMA <expression>)* RBRACK
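A short sketch may help connect these productions to concrete declarations. The variable names are placeholders. The data and parameters blocks use <top_var_decl_no_assign>, so no right-hand sides appear there, while transformed data permits compound declaration and definition.

```stan
data {
  int<lower=1> K;                        // INT with a <range>
  array[K] int<lower=0, upper=1> z;      // <arr_dims> followed by a constrained type
  matrix[K, K] X;                        // MATRIX LBRACK expr COMMA expr RBRACK
}
transformed data {
  array[K] real w = rep_array(1.0, K);   // declaration with an assignment
}
parameters {
  real<offset=1, multiplier=2> alpha;    // <offset_mult>
  vector<lower=0>[K] tau;                // <type_constraint> precedes the size
  simplex[K] theta;                      // SIMPLEX
  cholesky_factor_corr[K] L;             // CHOLESKYFACTORCORR
  cov_matrix[K] Sigma;                   // COVMATRIX
}
```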
Expressions
<expression> ::= <expression> QMARK <expression> COLON <expression>
| <expression> <infixOp> <expression>
| <prefixOp> <expression>
| <expression> <postfixOp>
| <common_expression>
<constr_expression> ::= <constr_expression> <arithmeticBinOp>
<constr_expression>
| <prefixOp> <constr_expression>
| <constr_expression> <postfixOp>
| <common_expression>
<common_expression> ::= <identifier>
| INTNUMERAL
| REALNUMERAL
| DOTNUMERAL
| IMAGNUMERAL
| LBRACE <expression> (COMMA <expression>)* RBRACE
| LBRACK [<expression> (COMMA <expression>)*] RBRACK
| <identifier> LPAREN [<expression> (COMMA
<expression>)*] RPAREN
| TARGET LPAREN RPAREN
| <identifier> LPAREN <expression> BAR [<expression>
(COMMA <expression>)*] RPAREN
| LPAREN <expression> COMMA <expression> (COMMA
<expression>)* RPAREN
| <common_expression> DOTNUMERAL
| <common_expression> LBRACK <indexes> RBRACK
| LPAREN <expression> RPAREN
<prefixOp> ::= BANG
| MINUS
| PLUS
<postfixOp> ::= TRANSPOSE
<infixOp> ::= <arithmeticBinOp>
| <logicalBinOp>
<arithmeticBinOp> ::= PLUS
| MINUS
| TIMES
| DIVIDE
| IDIVIDE
| MODULO
| LDIVIDE
| ELTTIMES
| ELTDIVIDE
| HAT
| ELTPOW
<logicalBinOp> ::= OR
| AND
| EQUALS
| NEQUALS
| LABRACK
| LEQ
| RABRACK
| GEQ
<indexes> ::= epsilon
| COLON
| <expression>
| <expression> COLON
| COLON <expression>
| <expression> COLON <expression>
| <indexes> COMMA <indexes>
<printables> ::= <expression>
| <string_literal>
| <printables> COMMA <printables>
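The expression forms above can be seen together in a small sketch. The variable names are invented, and the tuple lines assume a Stan release with tuple support.

```stan
transformed data {
  array[3] int a = {1, 2, 3};            // LBRACE ... RBRACE: array expression
  row_vector[3] r = [1.5, 2, 3];         // LBRACK ... RBRACK: row vector expression
  vector[3] v = r';                      // <postfixOp> TRANSPOSE
  real m = a[1] > 2 ? 1.0 : -1.0;        // QMARK ... COLON: conditional operator
  real lp = normal_lpdf(v[1] | 0, 1);    // function call using BAR
  tuple(int, real) t = (a[2], m);        // parenthesized tuple expression
  real second = t.2;                     // <common_expression> DOTNUMERAL
  array[2] int head = a[1:2];            // range index from <indexes>
}
```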
Statements
<statement> ::= <atomic_statement>
| <nested_statement>
<atomic_statement> ::= <common_expression> <assignment_op> <expression>
SEMICOLON
| <identifier> LPAREN [<expression> (COMMA
<expression>)*] RPAREN SEMICOLON
| <expression> TILDE <identifier> LPAREN [<expression>
(COMMA <expression>)*] RPAREN [<truncation>] SEMICOLON
| TARGET PLUSASSIGN <expression> SEMICOLON
| BREAK SEMICOLON
| CONTINUE SEMICOLON
| PRINT LPAREN <printables> RPAREN SEMICOLON
| REJECT LPAREN <printables> RPAREN SEMICOLON
| RETURN <expression> SEMICOLON
| RETURN SEMICOLON
| SEMICOLON
<assignment_op> ::= ASSIGN
| PLUSASSIGN
| MINUSASSIGN
| TIMESASSIGN
| DIVIDEASSIGN
| ELTTIMESASSIGN
| ELTDIVIDEASSIGN
<string_literal> ::= STRINGLITERAL
<truncation> ::= TRUNCATE LBRACK [<expression>] COMMA [<expression>] RBRACK
<nested_statement> ::= IF LPAREN <expression> RPAREN <vardecl_or_statement>
ELSE <vardecl_or_statement>
| IF LPAREN <expression> RPAREN <vardecl_or_statement>
| WHILE LPAREN <expression> RPAREN
<vardecl_or_statement>
| FOR LPAREN <identifier> IN <expression> COLON
<expression> RPAREN <vardecl_or_statement>
| FOR LPAREN <identifier> IN <expression> RPAREN
<vardecl_or_statement>
| PROFILE LPAREN <string_literal> RPAREN LBRACE
<vardecl_or_statement>* RBRACE
| LBRACE <vardecl_or_statement>* RBRACE
<vardecl_or_statement> ::= <statement>
| <var_decl>
<top_vardecl_or_statement> ::= <statement>
| <top_var_decl>
Tokenizing rules
Many of the tokens used in the BNF grammars follow obviously from their names: DATABLOCK is the literal string ‘data’, COMMA is a single ‘,’ character, etc. The literal representation of each operator is additionally provided in the operator precedence table.
A few tokens are not so obvious, and are defined here in regular expressions:
IDENTIFIER = [a-zA-Z] [a-zA-Z0-9_]*
STRINGLITERAL = ".*"
INTNUMERAL = [0-9]+ (_ [0-9]+)*
EXPLITERAL = [eE] [+-]? INTNUMERAL
REALNUMERAL = INTNUMERAL \. INTNUMERAL? EXPLITERAL?
| \. INTNUMERAL EXPLITERAL
| INTNUMERAL EXPLITERAL
IMAGNUMERAL = (REALNUMERAL | INTNUMERAL) i
DOTNUMERAL = \. INTNUMERAL
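For orientation, here is a small sketch of literals that match the token rules above. The variable names are arbitrary.

```stan
transformed data {
  int big = 1_000_000;       // INTNUMERAL: digit groups separated by underscores
  real small = 1.5e-3;       // REALNUMERAL: INTNUMERAL . INTNUMERAL EXPLITERAL
  real also_real = 2e3;      // REALNUMERAL: INTNUMERAL EXPLITERAL
  complex z = 3 + 2.5i;      // IMAGNUMERAL: a numeral followed by i
  print("big = ", big);      // STRINGLITERAL inside a print statement
}
```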
Extra-grammatical constraints
Type constraints
A well-formed Stan program must satisfy the type constraints imposed by functions and distributions. For example, the binomial distribution requires an integer total count parameter and an integer variate and, when truncated, integer truncation points. If these constraints are violated, the program will be rejected during compilation with an error message indicating the location of the problem.
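The binomial case can be sketched as follows. The variable names are hypothetical, and the commented-out lines show a use that would be rejected at compile time.

```stan
data {
  int<lower=0> N;                 // integer total count
  int<lower=0, upper=N> k;        // integer variate
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  k ~ binomial(N, theta);         // accepted: variate and total count are int
  // real r = 2.5;
  // r ~ binomial(N, theta);      // rejected: the variate must be an integer
}
```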
Operator precedence and associativity
In the Stan grammar provided in this chapter, the expression 1 + 2 * 3 has two parses. As described in the operator precedence table, Stan disambiguates between the meaning \(1 + (2 \times 3)\) and the meaning \((1 + 2) \times 3\) based on operator precedences and associativities.
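A brief sketch shows how precedence and associativity settle the parse in practice. The variable names are arbitrary.

```stan
transformed data {
  int a = 1 + 2 * 3;      // multiplication binds tighter: 1 + (2 * 3) = 7
  int b = (1 + 2) * 3;    // parentheses override precedence: 9
  int c = 10 - 4 - 3;     // subtraction is left associative: (10 - 4) - 3 = 3
  real d = 2 ^ 3 ^ 2;     // exponentiation is right associative: 2 ^ (3 ^ 2) = 512
}
```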
Typing of compound declaration and definition
In a compound variable declaration and definition, the type of the right-hand side expression must be assignable to the variable being declared. The assignability constraint restricts compound declarations and definitions to local variables and variables declared in the transformed data, transformed parameters, and generated quantities blocks.
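As a minimal sketch of the assignability rule, with invented variable names:

```stan
transformed data {
  real x = 3;               // accepted: an int expression is assignable to real
  vector[2] v = [1, 2]';    // accepted: a transposed row vector is a vector
  // int n = 2.5;           // rejected: real is not assignable to int
}
```

In the data and parameters blocks the grammar offers only <top_var_decl_no_assign>, so a declaration such as real x = 3; is not available there.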
Typing of array expressions
The types of expressions used for elements in array expressions ('{' expressions '}') must all be of the same type or a mixture of scalar (int, real and complex) types (in which case the result is promoted to be of the highest type on the int -> real -> complex hierarchy).
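The promotion rule can be illustrated with a short sketch. The variable names are made up.

```stan
transformed data {
  array[3] int a = {1, 2, 3};             // all int: result is array[] int
  array[3] real b = {1, 2.5, 3};          // int and real mixed: promoted to real
  array[3] complex c = {1, 2.5, 3i};      // int, real and complex: promoted to complex
  array[2, 2] int d = {{1, 2}, {3, 4}};   // nested array expressions of one type
}
```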
Forms of numbers
Integer literals longer than one digit may not start with 0 and real literals cannot consist of only a period or only an exponent.
Conditional arguments
Both the conditional if-then-else statement and the while-loop statement require the expression denoting the condition to be of a primitive type, either integer or real.
For loop containers
The for loop statement requires that we specify, in addition to the loop identifier, either a range consisting of two expressions each denoting an integer, separated by ‘:’, or a single expression denoting a container. The loop variable will be of type integer in the former case and of the contained type in the latter case. Furthermore, the loop variable must not already be in scope (i.e., there is no masking of variables).
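Both loop forms are shown in the sketch below, using hypothetical variable names.

```stan
transformed data {
  array[3] real xs = {1.0, 2.0, 3.0};
  real total = 0;

  // range form: i has type int
  for (i in 1:3) {
    total += xs[i];
  }

  // container form: x takes the contained type, here real
  for (x in xs) {
    total += x;
  }

  // for (total in 1:3) { }   // rejected: total is already in scope
}
```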
Print arguments
The arguments to a print statement cannot be void.
Only break and continue in loops
The break and continue statements may only be used within the body of a for-loop or while-loop.
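A small sketch, with invented names, shows break and continue inside a while-loop whose condition is an integer expression.

```stan
transformed data {
  int n = 0;
  int odd_sum = 0;
  while (1) {                    // integer condition; any nonzero value is true
    n += 1;
    if (n > 9) break;            // leave the loop entirely
    if (n % 2 == 0) continue;    // skip even values
    odd_sum += n;                // accumulates 1 + 3 + 5 + 7 + 9 = 25
  }
  // break;                      // rejected here: not inside a loop body
}
```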
Block-specific restrictions
Some constructs in the Stan language are only allowed in certain blocks or in certain kinds of user-defined functions.
PRNG functions
Functions ending in _rng may only be called in the transformed data and generated quantities blocks, and within the bodies of user-defined functions with names ending in _rng.
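The three permitted locations can be sketched as follows. The function noisy_rng and the variable names are hypothetical.

```stan
functions {
  // allowed: the function's own name ends in _rng
  real noisy_rng(real mu) {
    return normal_rng(mu, 1);
  }
}
transformed data {
  real jitter = uniform_rng(0, 1);    // allowed in transformed data
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 1);
  // real bad = normal_rng(0, 1);     // rejected: no _rng calls in the model block
}
generated quantities {
  real draw = noisy_rng(mu);          // allowed in generated quantities
}
```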
Unnormalized distributions
Unnormalized distributions (with suffixes _lupmf or _lupdf) may only be called in the model block, user-defined probability functions, or within the bodies of user-defined functions whose names end in _lp.
Incrementing and accessing target
target += statements can only be used inside of the model block or user-defined functions whose names end in _lp.
User-defined functions whose names end in _lp and the target() function can only be used in the model block, the transformed parameters block, and in the bodies of other user-defined functions whose names end in _lp.
Sampling statements (using ~) can only be used in the model block or in the bodies of user-defined functions whose names end in _lp.
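The restrictions on target, _lp functions, sampling statements and unnormalized distributions can be seen together in one sketch. The function prior_lp and the constant 1.5 are invented for illustration.

```stan
functions {
  // the _lp suffix permits sampling statements and target += in the body
  void prior_lp(real mu) {
    mu ~ normal(0, 5);
  }
}
parameters {
  real mu;
}
model {
  prior_lp(mu);                          // _lp function called from the model block
  target += normal_lupdf(1.5 | mu, 1);   // unnormalized density in the model block
  print("target = ", target());          // target() accessed in the model block
}
generated quantities {
  // prior_lp(mu);                       // rejected: _lp functions are not allowed here
}
```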
Probability function naming
A probability function literal must have one of the following suffixes: _lpdf, _lpmf, _lcdf, or _lccdf.
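As a sketch, a user-defined density with the _lpdf suffix can be called with the vertical bar notation. The name shifted_normal_lpdf is hypothetical.

```stan
functions {
  // the _lpdf suffix allows the "y | ..." calling syntax
  real shifted_normal_lpdf(real y, real mu) {
    return normal_lpdf(y | mu, 1);
  }
}
parameters {
  real y;
}
model {
  target += shifted_normal_lpdf(y | 0);
}
```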
Indexes
Standalone expressions used as indexes must denote either an integer (int) or an integer array (array[] int). Expressions participating in range indexes (e.g., a and b in a : b) must denote integers (int).
A second condition is that there not be more indexes provided than dimensions of the underlying expression (in general) or variable (on the left side of assignments) being indexed. A vector or row vector adds 1 to the array dimension and a matrix adds 2. That is, the type array[ , , ] matrix, a three-dimensional array of matrices, has five index positions: three for the array, one for the row of the matrix and one for the column.
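The five index positions can be made concrete with a short sketch. The sizes and variable names are arbitrary choices for this example.

```stan
transformed data {
  // a three-dimensional array of 5 x 6 matrices, filled with zeros
  array[2, 3, 4] matrix[5, 6] A = rep_array(rep_matrix(0, 5, 6), 2, 3, 4);

  matrix[5, 6] M = A[1, 2, 3];                 // three indexes: one matrix
  row_vector[6] r = A[1, 2, 3, 4];             // four indexes: one matrix row
  real x = A[1, 2, 3, 4, 5];                   // five indexes: one scalar

  array[2] int idx = {1, 3};
  array[2] matrix[5, 6] sub = A[1, idx, 4];    // array[] int used as an index

  // real bad = A[1, 2, 3, 4, 5, 6];           // rejected: six indexes, five positions
}
```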