## 22.8 Vectorization

### Gradient Bottleneck

Stan spends the vast majority of its time computing the gradient of
the log probability function, making gradients the obvious target for
optimization. Stan’s gradient calculations with algorithmic
differentiation require a template expression to be allocated
and constructed for each subexpression of a Stan program involving
parameters or transformed parameters.^{41} This section defines optimization strategies based on vectorizing these subexpressions to reduce the work done during algorithmic differentiation.

### Vectorizing Summations

Because of the gradient bottleneck described in the previous section,
it is more efficient to collect a sequence of summands into a vector
or array and then apply the `sum()` operation than it is to
continually increment a variable by assignment and addition. For
example, consider the following code snippet, where `foo()` is
some operation that depends on `n`.

```
for (n in 1:N)
  total += foo(n,...);
```

This code has to create intermediate representations for each
of the `N` summands.

A faster alternative is to copy the values into a vector, then
apply the `sum()` operator, as in the following refactoring.

```
{
  vector[N] summands;
  for (n in 1:N)
    summands[n] = foo(n,...);
  total = sum(summands);
}
```

Syntactically, the replacement is a statement block delineated
by curly brackets (`{`, `}`), starting with the definition
of the local variable `summands`.

Even though it involves extra work to allocate the `summands`
vector and copy `N` values into it, the savings in
differentiation more than make up for it. Perhaps surprisingly,
it will also use substantially less memory overall than incrementing
`total` within the loop.
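As a concrete sketch of this refactoring, the following complete program assumes a hypothetical model in which the summand for observation `n` is `-0.5 * square(y[n] - mu)`, standing in for `foo(n,...)`. The names `y` and `mu` are illustrative assumptions and do not come from the snippet above. The summands are collected in a local vector and added to the log density with a single call to `sum()`.

```stan
data {
  int<lower=1> N;
  vector[N] y;       // assumed observations (illustrative)
}
parameters {
  real mu;           // assumed location parameter (illustrative)
}
model {
  // Collect the N summands first, then reduce them with one sum() call
  // instead of incrementing an accumulator inside the loop.
  vector[N] summands;
  for (n in 1:N)
    summands[n] = -0.5 * square(y[n] - mu);
  target += sum(summands);
}
```

Up to an additive constant, this is the log density contributed by `y ~ normal(mu, 1)`, so in a real program the vectorized sampling statement would usually be preferable; the loop is kept here only to show the collect-then-sum pattern.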

### Vectorization through Matrix Operations

The following program directly encodes a linear regression with fixed
unit noise using a two-dimensional array `x` of predictors, an
array `y` of outcomes, and an array `beta` of regression
coefficients.

```
data {
  int<lower=1> K;
  int<lower=1> N;
  real x[N, K];
  real y[N];
}
parameters {
  real beta[K];
}
model {
  for (n in 1:N) {
    real gamma = 0;
    for (k in 1:K)
      gamma += x[n, k] * beta[k];
    y[n] ~ normal(gamma, 1);
  }
}
```

The following model computes the same log probability function as the previous model, even supporting the same input files for data and initialization.

```
data {
  int<lower=1> K;
  int<lower=1> N;
  vector[K] x[N];
  real y[N];
}
parameters {
  vector[K] beta;
}
model {
  for (n in 1:N)
    y[n] ~ normal(dot_product(x[n], beta), 1);
}
```

Although it produces equivalent results, the dot product should not be replaced with a transpose and multiply, as in

` y[n] ~ normal(x[n]' * beta, 1);`

The transpose and multiply approach is less efficient because the
transposition operator allocates a new vector into which the
result of the transposition is copied. This consumes both time
and memory.^{42}

The inefficiency of transposition could itself be mitigated by reordering the product and pulling the transposition out of the loop, as follows.

```
...
transformed parameters {
  row_vector[K] beta_t;
  beta_t = beta';
}
model {
  for (n in 1:N)
    y[n] ~ normal(beta_t * x[n], 1);
}
```

The problem with transposition could be completely solved by directly
encoding the `x` as a row vector, as in the
following example.

```
data {
  ...
  row_vector[K] x[N];
  ...
}
parameters {
  vector[K] beta;
}
model {
  for (n in 1:N)
    y[n] ~ normal(x[n] * beta, 1);
}
```

Declaring the data as a matrix and then computing all the predictors at once using matrix multiplication is more efficient still, as in the example discussed in the next section.

Having said all this, the most efficient way to code this model is with direct matrix multiplication, as in

```
data {
  int<lower=1> K;
  int<lower=1> N;
  matrix[N, K] x;
  vector[N] y;
}
parameters {
  vector[K] beta;
}
model {
  y ~ normal(x * beta, 1);
}
```

In general, encapsulated single operations that do the work of loops will be more efficient in their encapsulated forms. Rather than performing a sequence of row-vector/vector multiplications, it is better to encapsulate it as a single matrix/vector multiplication.

### Vectorized Probability Functions

The final and most efficient version replaces the loops and transformed parameters by using the vectorized form of the normal probability function, as in the following example.

```
data {
  int<lower=1> K;
  int<lower=1> N;
  matrix[N, K] x;
  vector[N] y;
}
parameters {
  vector[K] beta;
}
model {
  y ~ normal(x * beta, 1);
}
```

The variables are all declared as either matrix or vector types.
The result of the matrix-vector multiplication `x * beta` in the
model block is a vector of the same length as `y`.

The probability function documentation in the function reference
manual indicates which of Stan’s probability functions support
vectorization. Vectorized probability functions accept either vector
or scalar inputs for all arguments, with the only restriction being
that all vector arguments have the same size. In the example above,
`y` is a vector of size `N`, `x * beta` is a vector of size `N`, and
`1` is a scalar.
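To illustrate how vector and scalar arguments can be mixed, the following sketch extends the regression above with a scalar noise scale `sigma`. The parameter `sigma` is an assumption added for illustration and is not part of the model in this section. The array declarations follow the syntax used elsewhere in this chapter.

```stan
data {
  int<lower=1> K;
  int<lower=1> N;
  matrix[N, K] x;
  vector[N] y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;   // assumed scalar noise scale (illustrative)
}
model {
  // Vector outcome, vector location, scalar scale:
  // the single value of sigma is reused for every element of y.
  y ~ normal(x * beta, sigma);
}
```

Declaring `sigma` as `vector<lower=0>[N]` instead would give each observation its own scale, which is allowed as long as its size matches that of `y` and `x * beta`.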

### Reshaping Data for Vectorization

Sometimes data does not arrive in a shape that is ideal for vectorization, but can be put into such shape with some munging (either inside Stan’s transformed data block or outside).

John Hall provided a simple example on the Stan users group. Simplifying notation a bit, the original model had a sampling statement in a loop, as follows.

```
for (n in 1:N)
  y[n] ~ normal(mu[ii[n]], sigma);
```

The brute force vectorization would build up a mean vector and then vectorize all at once.

```
{
  vector[N] mu_ii;
  for (n in 1:N)
    mu_ii[n] = mu[ii[n]];
  y ~ normal(mu_ii, sigma);
}
```

If there aren’t many levels (values `ii[n]` can take), then it
behooves us to reorganize the data by group in a case like this.
Rather than having a single observation vector `y`, there are K of them.
And because Stan doesn’t support ragged arrays, it means K
declarations. For instance, with 5 levels, we have

```
y_1 ~ normal(mu[1], sigma);
...
y_5 ~ normal(mu[5], sigma);
```

This way, both the `mu` and `sigma` parameters are shared.
Which way works out to be more efficient will depend on the shape of
the data; if the sizes are small, the simple vectorization may be
faster, but for moderate to large sized groups, the full expansion
should be faster.
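A minimal sketch of the full expansion for two levels might look as follows. It assumes the observations have already been split by group outside of Stan; the names `N_1`, `N_2`, `y_1`, and `y_2` are hypothetical and chosen only for illustration.

```stan
data {
  int<lower=0> N_1;      // assumed number of observations in group 1
  int<lower=0> N_2;      // assumed number of observations in group 2
  vector[N_1] y_1;       // observations for group 1
  vector[N_2] y_2;       // observations for group 2
}
parameters {
  vector[2] mu;          // one mean per group
  real<lower=0> sigma;   // noise scale shared across groups
}
model {
  // One vectorized statement per group; each uses a scalar mean,
  // so no intermediate mean vector has to be built.
  y_1 ~ normal(mu[1], sigma);
  y_2 ~ normal(mu[2], sigma);
}
```

Each group needs its own size and vector declaration, which is why this approach only scales to a small number of levels.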