34.5 Optimization

This is an old version, view current version.

The stanc3 compiler can optimize the code of Stan model during compilation. The optimized model code behaves the same as unoptimized code, but it may be faster, more memory efficient, or more numerically stable.

This section introduces the available optimization options and describes their effect.

To print out a representation of the optimized Stan program, use the stanc3 command-line flag --debug-optimized-mir-pretty. To print an analagous representation of the Stan program prior to optimization, use the flag --debug-transformed-mir-pretty.

34.5.1 Optimization levels

To turn optimizations on, the user specifies the desired optimization level. The level specifies the set of optimizations to use. The chosen optimizations are used in a specific order, with some of them applied repeatedly.

Optimization levels are specified by the numbers 0 and 1 and the ‘experimental’ tag:

O0 No optimizations are applied.
O1 Optimizations that are simple, do not dramatically change the program, and are unlikely to noticeably slow down compile times are applied.
Oexperimental All optimizations are applied. Some of these are not thorougly tested and may not always improve a programs performance.

O0 is the default setting.

The levels include these optimizations:

O0 includes no optimizations.
O1 includes:
Oexperimental includes optimizations specified by O1 and also:

In addition, Oexperimental will apply more repetitions of the optimizations, which may increase compile times.

34.5.2 O1 Optimizations

34.5.2.1 Dead code elimination

Dead code is code that does not affect the behavior of the program. Code is not dead if it affects target, the value of any outside-observable variable like transformed parameters or generated quantities, or side effects such as print statements. Removing dead code can speed up a program by avoiding unnecessary computations.

Example Stan program:

model {
  int i;
  i = 5;
  for (j in 1:10);
  if (0) {
    print("Dead code");
  } else {
    print("Hi!");
  }
}

Compiler representation of program before dead code elimination (simplified from the output of --debug-transformed-mir-pretty):

log_prob {
  int i = 5;
  for(j in 1:10) {
    ;
  }
  if(0) {
    FnPrint__("Dead code");
  } else {
    FnPrint__("Hi!");
  }
}

Compiler representation of program after dead code elimination (simplified from the output of --debug-optimized-mir-pretty):

log_prob {
  int i;
  FnPrint__("Hi!");
}

34.5.2.2 Constant propagation

Constant propagation replaces uses of a variable which is known to have a constant value C with that constant C. This removes the overhead of looking up the variable, and also makes many other optimizations possible (such as static loop unrolling and partial evaluation).

Example Stan program:

transformed data {
  int n = 100;
  int a[n];
  for (i in 1:n) {
    a[i] = i;
  }
}

Compiler representation of program before constant propagation (simplified from the output of --debug-transformed-mir-pretty):

prepare_data {
  data int n = 100;
  data array[int, n] a;
  for(i in 1:n) {
    a[i] = i;
  }
}

Compiler representation of program after constant propagation (simplified from the output of --debug-optimized-mir-pretty):

prepare_data {
  data int n = 100;
  data array[int, 100] a;
  for(i in 1:100) {
    a[i] = i;
  }
}

34.5.2.3 Copy propagation

Copy propagation is similar to expression propagation, but only propagates variables rather than arbitrary expressions. This can reduce the complexity of the code for other optimizations such as expression propagation.

Example Stan program:

model {
  int i = 1;
  int j = i;
  int k = i + j;
}

Compiler representation of program before copy propagation (simplified from the output of --debug-transformed-mir-pretty):

log_prob {
    int i = 1;
    int j = i;
    int k = (i + j);
}

Compiler representation of program after copy propagation (simplified from the output of --debug-optimized-mir-pretty):

log_prob {
  int i = 1;
  int j = i;
  int k = (i + i);
}

34.5.2.4 Automatic-differentiation level optimization

Stan variables can have two auto-differentiation (AD) levels: AD or non-AD. AD variables carry gradient information with them, which allows Stan to calculate the log-density gradient, but they also have more overhead than non-AD variables. It is therefore inefficient for a variable to be AD unnecessarily. AD-level optimization sets every variable to be a floating point type unless its gradient is necessary.

Example Stan program:

data {
  real y;
}
model {
  real x = y + 1;
}

Compiler representation of program before AD-level optimization (simplified from the output of --debug-transformed-mir-pretty):

input_vars {
  real y;
}

log_prob {
  real x = (y + 1);
}

Compiler representation of program after AD-level optimization (simplified from the output of --debug-optimized-mir-pretty):

input_vars {
  real y;
}

log_prob {
  data real x = (y + 1);
}

34.5.2.5 Partial evaluation

Partial evaluation searches for expressions that we can replace with a faster, simpler, more memory efficient, or more numerically stable expression with the same meaning.

Example Stan program:

model {
  real a = 1 + 1;
  real b = log(1 - a);
  real c = a + b * 5;
}

Compiler representation of program before partial evaluation (simplified from the output of --debug-transformed-mir-pretty):

log_prob {
  real a = (1 + 1);
  real b = log((1 - a));
  real c = (a + (b * 5));
}

Compiler representation of program after partial evaluation (simplified from the output of --debug-optimized-mir-pretty):

log_prob {
  real a = 2;
  real b = log1m(a);
  real c = fma(b, 5, a);
}

34.5.3 Oexperimental Optimizations

34.5.3.1 One step loop unrolling

One step loop unrolling is similar to static loop unrolling. However, this optimization only ‘unrolls’ the first loop iteration, and can therefore work even when the total number of iterations is not predictable. This can speed up a program by providing more opportunities for further optimizations such as partial evaluation and lazy code motion.

Example Stan program:

data {
  int n;
}
transformed data {
  int x = 0;
  for (i in 1:n) {
    x += i;
  }
}

Compiler representation of program before one step static loop unrolling (simplified from the output of --debug-transformed-mir-pretty):

prepare_data {
  data int n = FnReadData__("n")[1];
  data int x = 0;
  for(i in 1:n) {
    x = (x + i);
  }
}

Compiler representation of program after one step static loop unrolling (simplified from the output of --debug-optimized-mir-pretty):

prepare_data {
  data int n = FnReadData__("n")[1];
  int x = 0;
  if((n >= 1)) {
    x = (x + 1);
    for(i in (1 + 1):n) {
      x = (x + i);
    }
  }
}

34.5.3.2 Expression propagation

Constant propagation replaces the uses of a variable which is known to have a constant value E with that constant E. This often results in recalculating the expression, but provides more opportunities for further optimizations such as partial evaluation. Expression propagation is always followed by lazy code motion to avoid unnecessarily recomputing expressions.

Example Stan program:

data {
  int m;
}
transformed data {
  int n = m+1;
  int a[n];
  for (i in 1:n-1) {
    a[i] = i;
  }
}

Compiler representation of program before expression propagation (simplified from the output of --debug-transformed-mir-pretty):

prepare_data {
  data int m = FnReadData__("m")[1];
  data int n = (m + 1);
  data array[int, n] a;
  for(i in 1:(n - 1)) {
    a[i] = i;
  }
}

Compiler representation of program after expression propagation (simplified from the output of --debug-optimized-mir-pretty):

prepare_data {
  data int m = FnReadData__("m")[1];
  data int n = (m + 1);
  data array[int, (m + 1)] a;
  for(i in 1:((m + 1) - 1)) {
    a[i] = i;
  }
}

34.5.3.3 Lazy code motion

Lazy code motion rearranges the statements and expressions in a program with the goals of:

Avoiding computing expressions more than once, and
Computing expressions as late as possible (to minimize the strain on the working memory set).

To accomplish these goals, lazy code motion will perform optimizations such as:

Moving a repeatedly calculated expression to its own variable (also referred to as common-subexpression elimination)
Moving an expression outside of a loop if it does not need to be in the loop (also referred to as loop-invariant code motion)

Lazy code motion can make some programs significantly more efficient by avoiding redundant or early computations.

Example Stan program:

model {
  real x;
  real y;
  real z;

  for (i in 1:10) {
    x = sqrt(10);
    y = sqrt(i);
  }
  z = sqrt(10);
}

Compiler representation of program before lazy code motion (simplified from the output of --debug-transformed-mir-pretty):

log_prob {
  real x;
  real y;
  real z;
  for(i in 1:10) {
    x = sqrt(10);
    y = sqrt(i);
  }
  z = sqrt(10);
}

Compiler representation of program after lazy code motion (simplified from the output of --debug-optimized-mir-pretty):

log_prob {
  data real lcm_sym4__;
  data real lcm_sym3__;
  real x;
  real y;
  lcm_sym4__ = sqrt(10);
  real z;
  for(i in 1:10) {
    x = lcm_sym4__;
    y = sqrt(i);
  }
  z = lcm_sym4__;
}

34.5.3.4 Function inlining

Function inlining replaces each function call to each user-defined function f with the body of f. It does this by copying the function body to the call site and doing appropriately renaming the argument variables. This optimization can speed up a program by avoiding the overhead of a function call and providing more opportunities for further optimizations (such as partial evaluation).

Example Stan program:

functions {
  int incr(int x) {
    int y = 1;
    return x + y;
  }
}
transformed data {
  int a = 2;
  int b = incr(a);
}

Compiler representation of program before function inlining (simplified from the output of --debug-transformed-mir-pretty):

functions {
  int incr(int x) {
    int y = 1;
    return (x + y);
  }
}

prepare_data {
  data int a = 2;
  data int b = incr(a);
}

Compiler representation of program after function inlining (simplified from the output of --debug-optimized-mir-pretty):

prepare_data {
  data int a;
  a = 2;
  data int b;
  data int inline_sym1__;
  data int inline_sym3__;
  inline_sym3__ = 0;
  for(inline_sym4__ in 1:1) {
    int inline_sym2__;
    inline_sym2__ = 1;
    inline_sym3__ = 1;
    inline_sym1__ = (a + inline_sym2__);
    break;
  }
  b = inline_sym1__;
}

In this code, the for loop and break is used to simulate the behavior of a return statement. The value to be returned is held in inline_sym1__. The flag variable inline_sym3__ indicates whether a return has occurred and is necessary to handle return statements nested inside loops within the function body.

34.5.3.5 Static loop unrolling

Static loop unrolling takes a loop with a predictable number of iterations X and replaces it by writing out the loop body X times. The loop index in each repeat is replaced with the appropriate constant. This optimization can speed up a program by avoiding the overhead of a loop and providing more opportunities for further optimizations (such as partial evaluation).

Example Stan program:

transformed data {
  int x = 0;
  for (i in 1:4) {
    x += i;
  }
}

Compiler representation of program before static loop unrolling (simplified from the output of --debug-transformed-mir-pretty):

prepare_data {
  data int x = 0;
  for(i in 1:4) {
    x = (x + i);
  }
}

Compiler representation of program after static loop unrolling (simplified from the output of --debug-optimized-mir-pretty):

prepare_data {
  data int x;
  x = 0;
  x = (x + 1);
  x = (x + 2);
  x = (x + 3);
  x = (x + 4);
}