Stan Program Style Guide

This chapter describes the preferred style for laying out Stan models. These are not rules of the language, but simply recommendations for laying out programs in a text editor. Although these recommendations may seem arbitrary, they are similar to those of many teams for many programming languages. Like rules for typesetting text, the goal is to achieve readability without wasting white space either vertically or horizontally. This is the style used in the Stan documentation, and should align with the auto-formatting ability of stanc3.

Choose a consistent style

The most important point of style is consistency. Consistent coding style makes it easier to read not only a single program, but multiple programs. So when departing from this style guide, the number one recommendation is to do so consistently.

Line length

Line lengths should not exceed 80 characters.¹

This is a typical recommendation for many programming language style guides because it makes it easier to lay out text edit windows side by side and to view the code on the web without wrapping, easier to view diffs from version control, etc. About the only thing that is sacrificed is laying out expressions on a single line.

File extensions

The recommended file extension for Stan model files is .stan. Files which contain only function definitions (intended for use with #include) should be given the .stanfunctions extension. A .stanfunctions file only includes the function definition and does not require the functions{} block wrapped around the function. A simple example of usage where the function is defined and saved in the file foo.stanfunctions:

real foo(real x, real y) {
  return sqrt(x * log(y));
}

The function foo can be accessed in the Stan program by including the path to the foo.stanfunctions file as:

functions {
  #include foo.stanfunctions;
}
// ...body...

For Stan data dump files, the recommended extension is .R, or more informatively, .data.R. For JSON output, the recommended extension is .json.

Variable naming

The recommended variable naming is to follow C/C++ naming conventions, in which variables are lowercase, with the underscore character (_) used as a separator. Thus it is preferred to use sigma_y, rather than the run together sigmay, camel-case sigmaY, or capitalized camel-case SigmaY. An exception is often made for terms appearing in mathematical expressions with standard names, like A for a matrix.

Another exception to the lowercasing recommendation, which follows the C/C++ conventions, is for size constants, for which the recommended form is a single uppercase letter. The reason for this is that it allows the loop variables to match. So loops over the indices of an $M \times N$ matrix $a$ would look as follows.

for (m in 1:M) {
  for (n in 1:N) {
     a[m, n] = ...
  }
}

Local variable scope

Declaring local variables in the block in which they are used aids in understanding programs because it cuts down on the amount of text scanning or memory required to reunite the declaration and definition.

The following Stan program corresponds to a direct translation of a BUGS model, which uses a different element of mu in each iteration.

model {
  array[N] real mu;
  for (n in 1:N) {
    mu[n] = alpha * x[n] + beta;
    y[n] ~ normal(mu[n],sigma);
  }
}

Because variables can be reused in Stan and because they should be declared locally for clarity, this model should be recoded as follows.

model {
  for (n in 1:N) {
    real mu;
    mu = alpha * x[n] + beta;
    y[n] ~ normal(mu,sigma);
  }
}

The local variable can be eliminated altogether, as follows.

model {
  for (n in 1:N) {
    y[n] ~ normal(alpha * x[n] + beta, sigma);
  }
}

There is unlikely to be any measurable efficiency difference between the last two implementations, but both should be a bit more efficient than the BUGS translation.

Scope of compound structures with componentwise assignment

In the case of local variables for compound structures, such as arrays, vectors, or matrices, if they are built up component by component rather than in large chunks, it can be more efficient to declare a local variable for the structure outside of the block in which it is used. This allows it to be allocated once and then reused.

model {
  vector[K] mu;
  for (n in 1:N) {
    for (k in 1:K) {
      mu[k] = // ...
    }
    y[n] ~ multi_normal(mu,Sigma);
}

In this case, the vector mu will be allocated outside of both loops, and used a total of N times.

Parentheses and brackets

Braces for single-statement blocks

Single-statement blocks can be rendered in several ways. The preferred style is fully bracketed with the statement appearing on its own line, as follows.

for (n in 1:N) {
  y[n] ~ normal(mu,1);
}

The use of loops and conditionals without brackets can be dangerous. For instance, consider this program.

for (n in 1:N)
  z[n] ~ normal(nu,1);
  y[n] ~ normal(mu,1);

Because Stan ignores whitespace and the parser completes a statement as eagerly as possible (just as in C++), the previous program is equivalent to the following program.

for (n in 1:N) {
  z[n] ~ normal(nu,1);
}
y[n] ~ normal(mu,1);

Therefore, one should prefer to use braces. The only exception is when nesting if-else clauses, where the else branch contains exactly one conditional. Then, it is preferred to place the following if on the same line, as in the following.

if (x) {
  // ...
} else if (y) {
  // ...
} else {
  // ...
}

Parentheses in nested operator expressions

The preferred style for operators minimizes parentheses. This reduces clutter in code that can actually make it harder to read expressions. For example, the expression a + b * c is preferred to the equivalent a + (b * c) or (a + (b * c)). The operator precedences and associativities follow those of pretty much every programming language including Fortran, C++, R, and Python; full details are provided in the reference manual.

Similarly, comparison operators can usually be written with minimal bracketing, with the form y[n] > 0 || x[n] != 0 preferred to the bracketed form (y[n] > 0) || (x[n] != 0).

No open brackets on own line

Vertical space is valuable as it controls how much of a program you can see. The preferred Stan style is with the opening brace appearing at the end of a line.

for (n in 1:N) {
  y[n] ~ normal(mu,1);
}

This also goes for parameters blocks, transformed data blocks, which should look as follows.

transformed parameters {
  real sigma;
  // ...
}

The exception to this rule is local blocks which only exist for scoping reasons. The opening brace of these blocks is not associated with any control flow or block structure, so it should appear on its own line.

Conditionals

While Stan supports the full C++-style conditional syntax, allowing real or integer values to act as conditions, real values should be avoided. For a real-valued x, one should use

if (x != 0) { ...

in place of

if (x) { ...

Beyond stylistic choices, one should be careful using real values in a conditional expression, as direct comparison can have unexpected results due to numerical accuracy.

Functions

Functions are laid out the same way as in languages such as Java and C++. For example,

real foo(real x, real y) {
  return sqrt(x * log(y));
}

The return type is flush left, the parentheses for the arguments are adjacent to the arguments and function name, and there is a space after the comma for arguments after the first. The open curly brace for the body is on the same line as the function name, following the layout of loops and conditionals. The body itself is indented; here we use two spaces. The close curly brace appears on its own line.

If function names or argument lists are long, they can be written as

matrix
function_to_do_some_hairy_algebra(matrix thingamabob,
                                  vector doohickey2) {
  // ...body...
}

The function starts a new line, under the type. The arguments are aligned under each other.

Function documentation should follow the Javadoc and Doxygen styles. Here’s an example repeated from the documenting functions section.

/**
 * Return a data matrix of specified size with rows
 * corresponding to items and the first column filled
 * with the value 1 to represent the intercept and the
 * remaining columns randomly filled with unit-normal draws.
 *
 * @param N Number of rows correspond to data items
 * @param K Number of predictors, counting the intercept, per
 *          item.
 * @return Simulated predictor matrix.
 */
matrix predictors_rng(int N, int K) {
  // ...
}

The open comment is /**, asterisks are aligned below the first asterisk of the open comment, and the end comment */ is also aligned on the asterisk. The tags @param and @return are used to label function arguments (i.e., parameters) and return values.

White space

Stan allows spaces between elements of a program. The white space characters allowed in Stan programs include the space (ASCII 0x20), line feed (ASCII 0x0A), carriage return (0x0D), and tab (0x09). Stan treats all whitespace characters interchangeably, with any sequence of whitespace characters being syntactically equivalent to a single space character. Nevertheless, effective use of whitespace is the key to good program layout.

Line breaks between statements and declarations

Each statement of a program should appear on its own line. Declaring multiple variables of the same type can be accomplished in a single statement with the syntax

real mu, sigma;

No tabs

Stan programs should not contain tab characters. Using tabs to layout a program is highly unportable because the number of spaces represented by a single tab character varies depending on which program is doing the rendering and how it is configured.

Two-character indents

Stan has standardized on two space characters of indentation, which is the standard convention for C/C++ code.

Space between `if`, `{` and condition

Use a space after ifs. For instance, use if (x < y) {..., not if(x < y){ ....

No space for function calls

There should not be space between a function name and the arguments it applies to. For instance, use normal(0, 1), not normal (0,1).

Spaces around operators

There should be spaces around binary operators. For instance, use y[1] = x, not y[1]=x, use (x + y) * z not (x+y)*z.

Unary operators are written without a space, such as in -x, !y.

No spaces in type constraints

Another exception to the above rule is when the assignment operator (=) is used inside a type constraint, such as

real<lower=0> x;

Spaces should still be used in arithmetic and following commas, as in

real<lower=0, upper=a * x + b> x;

Breaking expressions across lines

Sometimes expressions are too long to fit on a single line. In that case, the recommended form is to break before an operator,² aligning the operator to a term above to indicate scoping. For example, use the following form

vector[J] p_distance = Phi((distance_tolerance - overshot)
                           ./ ((x + overshot) * sigma_distance))
                       - Phi((-overshot)
                             ./ ((x + overshot) * sigma_distance));

Here, the elementwise division operator (./) is aligned to clearly signal the division is occurring inside the parethenesis, while the subtraction indicates it is between the function applications (Phi).

For functions with multiple arguments, break after a comma and line the next argument up underneath as follows.

y[n] ~ normal(alpha + beta * x + gamma * y,
              pow(tau,-0.5));

Spaces after commas

Commas should always be followed by spaces, including in function arguments, sequence literals, between variable declarations, etc.

For example,

normal(alpha * x[n] + beta, sigma);

is preferred over

normal(alpha * x[n] + beta,sigma);

Unix newlines

Wherever possible, Stan programs should use a single line feed character to separate lines. All of the Stan developers (so far, at least) work on Unix-like operating systems and using a standard newline makes the programs easier for us to read and share.

Platform specificity of newlines

Newlines are signaled in Unix-like operating systems such as Linux and Mac OS X with a single line-feed (LF) character (ASCII code point 0x0A). Newlines are signaled in Windows using two characters, a carriage return (CR) character (ASCII code point 0x0D) followed by a line-feed (LF) character.

Footnotes

Even 80 characters may be too many for rendering in print; for instance, in this manual, the number of code characters that fit on a line is about 65.↩︎
This is the usual convention in both typesetting and other programming languages. Neither R nor BUGS allows breaks before an operator because they allow newlines to signal the end of an expression or statement.↩︎