22.8 Vectorization
Gradient Bottleneck
Stan spends the vast majority of its time computing the gradient of the log probability function, making gradients the obvious target for optimization. Stan’s gradient calculations with algorithmic differentiation require a template expression to be allocated and constructed for each subexpression of a Stan program involving parameters or transformed parameters.41 This section defines optimization strategies based on vectorizing these subexpressions to reduce the work done during algorithmic differentiation.
Vectorizing Summations
Because of the gradient bottleneck described in the previous section,
it is more efficient to collect a sequence of summands into a vector
or array and then apply the sum()
operation than it is to
continually increment a variable by assignment and addition. For
example, consider the following code snippet, where foo()
is
some operation that depends on n
.
for (n in 1:N)
total += foo(n,...);
This code has to create intermediate representations for each
of the N
summands.
A faster alternative is to copy the values into a vector, then
apply the sum()
operator, as in the following refactoring.
{
vector[N] summands;
for (n in 1:N)
summands[n] = foo(n,...);
total = sum(summands);
}
Syntactically, the replacement is a statement block delineated
by curly brackets ({
, }
), starting with the definition
of the local variable summands
.
Even though it involves extra work to allocate the summands
vector and copy N
values into it, the savings in
differentiation more than make up for it. Perhaps surprisingly,
it will also use substantially less memory overall than incrementing
total
within the loop.
Vectorization through Matrix Operations
The following program directly encodes a linear regression with fixed
unit noise using a two-dimensional array x
of predictors, an
array y
of outcomes, and an array beta
of regression
coefficients.
data {
int<lower=1> K;
int<lower=1> N;
real x[K, N];
real y[N];
}
parameters {
real beta[K];
}
model {
for (n in 1:N) {
real gamma = 0;
for (k in 1:K)
gamma += x[n, k] * beta[k];
y[n] ~ normal(gamma, 1);
}
}
The following model computes the same log probability function as the previous model, even supporting the same input files for data and initialization.
data {
int<lower=1> K;
int<lower=1> N;
vector[K] x[N];
real y[N];
}
parameters {
vector[K] beta;
}
model {
for (n in 1:N)
y[n] ~ normal(dot_product(x[n], beta), 1);
}
Although it produces equivalent results, the dot product should not be replaced with a transpose and multiply, as in
y[n] ~ normal(x[n]' * beta, 1);
The relative inefficiency of the transpose and multiply approach is that the transposition operator allocates a new vector into which the result of the transposition is copied. This consumes both time and memory.42
The inefficiency of transposition could itself be mitigated by reordering the product and pulling the transposition out of the loop, as follows.
...
transformed parameters {
row_vector[K] beta_t;
beta_t = beta';
}
model {
for (n in 1:N)
y[n] ~ normal(beta_t * x[n], 1);
}
The problem with transposition could be completely solved by directly
encoding the x
as a row vector, as in the
following example.
data {
...
row_vector[K] x[N];
...
}
parameters {
vector[K] beta;
}
model {
for (n in 1:N)
y[n] ~ normal(x[n] * beta, 1);
}
Declaring the data as a matrix and then computing all the predictors at once using matrix multiplication is more efficient still, as in the example discussed in the next section.
Having said all this, the most efficient way to code this model is with direct matrix multiplication, as in
data {
matrix[N, K] x;
vector[N] y;
}
parameters {
vector[K] beta;
}
model {
y ~ normal(x * beta, 1);
In general, encapsulated single operations that do the work of loops will be more efficient in their encapsulated forms. Rather than performing a sequence of row-vector/vector multiplications, it is better to encapsulate it as a single matrix/vector multiplication.
Vectorized Probability Functions
The final and most efficient version replaces the loops and transformed parameters by using the vectorized form of the normal probability function, as in the following example.
data {
int<lower=1> K;
int<lower=1> N;
matrix[N, K] x;
vector[N] y;
}
parameters {
vector[K] beta;
}
model {
y ~ normal(x * beta, 1);
}
The variables are all declared as either matrix or vector types.
The result of the matrix-vector multiplication x * beta
in the
model block is a vector of the same length as y
.
The probability function documentation in the function reference
manual indicates which of Stan’s probability functions support
vectorization; see the function reference manual for full details.
Vectorized probability functions accept either vector or scalar inputs
for all arguments, with the only restriction being that all vector
arguments are the same dimensionality. In the example above, y
is a
vector of size N
, x * beta
is a vector of size N
, and 1
is a
scalar.
Reshaping Data for Vectorization
Sometimes data does not arrive in a shape that is ideal for vectorization, but can be put into such shape with some munging (either inside Stan’s transformed data block or outside).
John Hall provided a simple example on the Stan users group. Simplifying notation a bit, the original model had a sampling statement in a loop, as follows.
for (n in 1:N)
y[n] ~ normal(mu[ii[n]], sigma);
The brute force vectorization would build up a mean vector and then vectorize all at once.
{
vector[N] mu_ii;
for (n in 1:N)
mu_ii[n] = mu[ii[n]];
y ~ normal(mu_ii, sigma);
If there aren’t many levels (values ii[n]
can take), then it
behooves us to reorganize the data by group in a case like this.
Rather than having a single observation vector y
, there are K of them.
And because Stan doesn’t support ragged arrays, it means K
declarations. For instance, with 5 levels, we have
y_1 ~ normal(mu[1], sigma);
...
y_5 ~ normal(mu[5], sigma);
This way, both the mu
and sigma
parameters are shared.
Which way works out to be more efficient will depend on the shape of
the data; if the sizes are small, the simple vectorization may be
faster, but for moderate to large sized groups, the full expansion
should be faster.