Efficiency Tuning
This chapter provides a grab bag of techniques for optimizing Stan code, including vectorization, sufficient statistics, and conjugacy. At a coarse level, efficiency involves both the amount of time required for a computation and the amount of memory required. For practical applied statistical modeling, we are mainly concerned with reducing wall time (how long a program takes as measured by a clock on the wall) and keeping memory requirements within available bounds.
What is efficiency?
The standard algorithm analyses in computer science measure efficiency asymptotically as a function of problem size (such as data, number of parameters, etc.) and typically do not consider constant additive factors like startup times or multiplicative factors like speed of operations. In practice, the constant factors are important; if run time can be cut in half or more, that’s a huge gain. This chapter focuses on both the constant factors involved in efficiency (such as using built-in matrix operations as opposed to naive loops) and on asymptotic efficiency factors (such as using linear algorithms instead of quadratic algorithms in loops).
Efficiency for probabilistic models and algorithms
Stan programs express models which are intrinsically statistical in nature. The algorithms applied to these models may or may not themselves be probabilistic. For example, given an initial value for parameters (which may itself be given deterministically or generated randomly), Stan’s optimization algorithm (L-BFGS) for penalized maximum likelihood estimation is purely deterministic. Stan’s sampling algorithms are based on Markov chain Monte Carlo algorithms, which are probabilistic by nature at every step. Stan’s variational inference algorithm (ADVI) is probabilistic despite being an optimization algorithm; the randomization lies in a nested Monte Carlo calculation for an expected gradient.
With probabilistic algorithms, there will be variation in run times (and maybe memory usage) based on the randomization involved. For example, by starting too far out in the tail, iterative algorithms underneath the hood, such as the solvers for ordinary differential equations, may take different numbers of steps. Ideally this variation will be limited; when there is a lot of variation it can be a sign that there is a problem with the model’s parameterization in a Stan program or with initialization.
A well-behaved Stan program will have low variance between runs with different random initializations and differently seeded random number generators. But sometimes an algorithm can get stuck in one part of the posterior, typically due to high curvature. Such sticking almost always indicates the need to reparameterize the model. Just throwing away Markov chains with apparently poor behavior (slow, or stuck) can lead to bias in posterior estimates. This problem with getting stuck can often be overcome by lowering the initial step size to avoid getting stuck during adaptation and increasing the target acceptance rate in order to target a lower step size. This is because smaller step sizes allow Stan’s gradient-based algorithms to better follow the curvature in the density or penalized maximum likelihood being fit.
Statistical vs. computational efficiency
There is a difference between pure computational efficiency and statistical efficiency for Stan programs fit with sampling-based algorithms. Computational efficiency measures the amount of time or memory required for a given step in a calculation, such as an evaluation of a log posterior or penalized likelihood.
Statistical efficiency typically involves requiring fewer steps in algorithms by making the statistical formulation of a model better behaved. The typical way to do this is by applying a change of variables (i.e., reparameterization) so that sampling algorithms mix better or optimization algorithms require less adaptation.
Model conditioning and curvature
Because Stan’s algorithms (other than Riemannian Hamiltonian Monte Carlo) rely on step-based gradient-based approximations of the density (or penalized maximum likelihood) being fitted, posterior curvature not captured by this first-order approximation plays a central role in determining the statistical efficiency of Stan’s algorithms.
A second-order approximation to curvature is provided by the Hessian, the matrix of second derivatives of the log density
Condition number and adaptation
A good gauge of how difficult a problem the curvature presents is given by the condition number of the Hessian matrix
Unit scales without correlation
Ideally, all parameters should be programmed so that they have unit scale and so that posterior correlation is reduced; together, these properties mean that there is no rotation or scaling required for optimal performance of Stan’s algorithms. For Hamiltonian Monte Carlo, this implies a unit mass matrix, which requires no adaptation as it is where the algorithm initializes. Riemannian Hamiltonian Monte Carlo performs this conditioning on the fly at every step, but such conditioning is expensive computationally.
Varying curvature
In all but very simple models (such as multivariate normals), the Hessian will vary as
For all of Stan’s algorithms, the curvature along the path from the initial values of the parameters to the solution is relevant. For penalized maximum likelihood and variational inference, the solution of the iterative algorithm will be a single point, so this is all that matters. For sampling, the relevant “solution” is the typical set, which is the posterior volume where almost all draws from the posterior lies; thus, the typical set contains almost all of the posterior probability mass.
With sampling, the curvature may vary dramatically between the points on the path from the initialization point to the typical set and within the typical set. This is why adaptation needs to run long enough to visit enough points in the typical set to get a good first-order estimate of the curvature within the typical set. If adaptation is not run long enough, sampling within the typical set after adaptation will not be efficient. We generally recommend at least one hundred iterations after the typical set is reached (and the first effective draw is ready to be realized). Whether adaptation has run long enough can be measured by comparing the adaptation parameters derived from a set of diffuse initial parameter values.
Reparameterizing with a change of variables
Improving statistical efficiency is achieved by reparameterizing the model so that the same result may be calculated using a density or penalized maximum likelihood that is better conditioned. Again, see the example of reparameterizing Neal’s funnel for an example, and also the examples in the change of variables chapter.
One has to be careful in using change-of-variables reparameterizations when using maximum likelihood estimation, because they can change the result if the Jacobian term is inadvertently included in the revised likelihood model.
Well-specified models
Model misspecification, which roughly speaking means using a model that doesn’t match the data, can be a major source of slow code. This can be seen in cases where simulated data according to the model runs robustly and efficiently, whereas the real data for which it was intended runs slowly or may even have convergence and mixing issues. While some of the techniques recommended in the remaining sections of this chapter may mitigate the problem, the best remedy is a better model specification.
Counterintuitively, more complicated models often run faster than simpler models. One common pattern is with a group of parameters with a wide fixed prior such as normal(0, 1000)
). This can fit slowly due to the mismatch between prior and posterior (the prior has support for values in the hundreds or even thousands, whereas the posterior may be concentrated near zero). In such cases, replacing the fixed prior with a hierarchical prior such as normal(mu, sigma)
, where mu
and sigma
are new parameters with their own hyperpriors, can be beneficial.
Avoiding validation
Stan validates all of its data structure constraints. For example, consider a transformed parameter defined to be a covariance matrix and then used as a covariance parameter in the model block.
transformed parameters {
cov_matrix[K] Sigma;
// ...
// first validation
} model {
// second validation
y ~ multi_normal(mu, Sigma); // ...
}
Because Sigma
is declared to be a covariance matrix, it will be factored at the end of the transformed parameter block to ensure that it is positive definite. The multivariate normal log density function also validates that Sigma
is positive definite. This test is expensive, having cubic run time (i.e.,
The test may be avoided by simply declaring Sigma
to be a simple unconstrained matrix.
transformed parameters {
matrix[K, K] Sigma;
// ...
}model {
// only validation
y ~ multi_normal(mu, Sigma); }
Now the only validation is carried out by the multivariate normal.
Reparameterization
Stan’s sampler can be slow in sampling from distributions with difficult posterior geometries. One way to speed up such models is through reparameterization. In some cases, reparameterization can dramatically increase effective sample size for the same number of iterations or even make programs that would not converge well behaved.
Example: Neal’s funnel
In this section, we discuss a general transform from a centered to a non-centered parameterization (Papaspiliopoulos, Roberts, and Sköld 2007).1
This reparameterization is helpful when there is not much data, because it separates the hierarchical parameters and lower-level parameters in the prior.
Neal (2003) defines a distribution that exemplifies the difficulties of sampling from some hierarchical models. Neal’s example is fairly extreme, but can be trivially reparameterized in such a way as to make sampling straightforward. Neal’s example has support for
The probability contours are shaped like ten-dimensional funnels. The funnel’s neck is particularly sharp because of the exponential function applied to
The marginal density of Neal’s funnel for the upper-level variable
The funnel can be implemented directly in Stan as follows.
parameters {
real y;
vector[9] x;
}model {
0, 3);
y ~ normal(0, exp(y/2));
x ~ normal( }
When the model is expressed this way, Stan has trouble sampling from the neck of the funnel, where
4000 draws are taken from a run of Stan’s sampler with default settings. Both plots are restricted to the shown window of
In this particular instance, because the analytic form of the density from which samples are drawn is known, the model can be converted to the following more efficient form.
parameters {
real y_raw;
vector[9] x_raw;
}transformed parameters {
real y;
vector[9] x;
3.0 * y_raw;
y = 2) * x_raw;
x = exp(y/
}model {
// implies y ~ normal(0, 3)
y_raw ~ std_normal(); // implies x ~ normal(0, exp(y/2))
x_raw ~ std_normal(); }
In this second model, the parameters x_raw
and y_raw
are sampled as independent standard normals, which is easy for Stan. These are then transformed into samples from the funnel. In this case, the same transform may be used to define Monte Carlo samples directly based on independent standard normal samples; Markov chain Monte Carlo methods are not necessary. If such a reparameterization were used in Stan code, it is useful to provide a comment indicating what the distribution for the parameter implies for the distribution of the transformed parameter.
Reparameterizing the Cauchy
Sampling from heavy tailed distributions such as the Cauchy is difficult for Hamiltonian Monte Carlo, which operates within a Euclidean geometry.2
The practical problem is that tail of the Cauchy requires a relatively large step size compared to the trunk. With a small step size, the No-U-Turn sampler requires many steps when starting in the tail of the distribution; with a large step size, there will be too much rejection in the central portion of the distribution. This problem may be mitigated by defining the Cauchy-distributed variable as the transform of a uniformly distributed variable using the Cauchy inverse cumulative distribution function.
Suppose a random variable of interest
Consider a Stan program involving a Cauchy-distributed parameter beta
.
parameters {
real beta;
// ...
}model {
beta ~ cauchy(mu, tau);// ...
}
This declaration of beta
as a parameter may be replaced with a transformed parameter beta
defined in terms of a uniform-distributed parameter beta_unif
.
parameters {
real<lower=-pi() / 2, upper=pi() / 2> beta_unif;
// ...
}transformed parameters {
real beta;
// beta ~ cauchy(mu, tau)
beta = mu + tau * tan(beta_unif);
}model {
2, pi() / 2); // not necessary
beta_unif ~ uniform(-pi() / // ...
}
It is more convenient in Stan to transform a uniform variable on mu
and tau
, may be defined as data or may themselves be parameters. The variable beta
could also be defined as a local variable if it does not need to be included in the sampler’s output.
The uniform distribution on beta_unif
is defined explicitly in the model block, but it could be safely removed from the program without changing sampling behavior. This is because beta_unif
is between -pi() / 2
and pi() / 2
, but this condition is guaranteed by the constraints in the declaration of beta_unif
.
Reparameterizing a Student-t distribution
One thing that sometimes works when you’re having trouble with the heavy-tailedness of Student-t distributions is to use the gamma-mixture representation, which says that you can generate a Student-t distributed variable
Because
The marginal distribution of
To go one step further, instead of defining a
Now suppose
To translate into Stan notation, this reparameterization replaces
parameters {
real<lower=0> nu;
real beta;
// ...
}model {
0, 1);
beta ~ student_t(nu, // ...
}
with
parameters {
real<lower=0> nu;
real<lower=0> tau;
real alpha;
// ...
}transformed parameters {
real beta;
beta = alpha / sqrt(tau);// ...
}model {
real half_nu;
0.5 * nu;
half_nu =
tau ~ gamma(half_nu, half_nu);
alpha ~ std_normal();// ...
}
Although set to 0
here, in most cases, the lower bound for the degrees of freedom parameter nu
can be set to 1
or higher; when nu
is 1, the result is a Cauchy distribution with fat tails and as nu
approaches infinity, the Student-t distribution approaches a normal distribution. Thus the parameter nu
characterizes the heaviness of the tails of the model.
Hierarchical models and the non-centered parameterization
Unfortunately, the usual situation in applied Bayesian modeling involves complex geometries and interactions that are not known analytically. Nevertheless, reparameterization can still be effective for separating parameters.
Centered parameterization
For example, a vectorized hierarchical model might draw a vector of coefficients
parameters {
real mu_beta;
real<lower=0> sigma_beta;
vector[K] beta;
// ...
}model {
beta ~ normal(mu_beta, sigma_beta);// ...
}
Although not shown, a full model will have priors on both mu_beta
and sigma_beta
along with data modeled based on these coefficients. For instance, a standard binary logistic regression with data matrix x
and binary outcome vector y
would include a likelihood statement such as form y ~ bernoulli_logit(x * beta)
, leading to an analytically intractable posterior.
A hierarchical model such as the above will suffer from the same kind of inefficiencies as Neal’s funnel, because the values of beta
, mu_beta
and sigma_beta
are highly correlated in the posterior. The extremity of the correlation depends on the amount of data, with Neal’s funnel being the extreme with no data. In these cases, the non-centered parameterization, discussed in the next section, is preferable; when there is a lot of data, the centered parameterization is more efficient. See Betancourt and Girolami (2013) for more information on the effects of centering in hierarchical models fit with Hamiltonian Monte Carlo.
Non-centered parameterization
Sometimes the group-level effects do not constrain the hierarchical distribution tightly. Examples arise when there are not many groups, or when the inter-group variation is high. In such cases, hierarchical models can be made much more efficient by shifting the data’s correlation with the parameters to the hyperparameters. Similar to the funnel example, this will be much more efficient in terms of effective sample size when there is not much data (see Betancourt and Girolami (2013)), and in more extreme cases will be necessary to achieve convergence.
parameters {
vector[K] beta_raw;
// ...
}transformed parameters {
vector[K] beta;
// implies: beta ~ normal(mu_beta, sigma_beta)
beta = mu_beta + sigma_beta * beta_raw;
}model {
beta_raw ~ std_normal();// ...
}
Any priors defined for mu_beta
and sigma_beta
remain as defined in the original model.
Reparameterization of hierarchical models is not limited to the normal distribution, although the normal distribution is the best candidate for doing so. In general, any distribution of parameters in the location-scale family is a good candidate for reparameterization. Let
There are several univariate distributions in the location-scale family, such as the Student t distribution, including its special cases of the Cauchy distribution (with one degree of freedom) and the normal distribution (with infinite degrees of freedom). As shown above, if
Also, if
Multivariate reparameterizations
The benefits of reparameterization are not limited to univariate distributions. A parameter with a multivariate normal prior distribution is also an excellent candidate for reparameterization. Suppose you intend the prior for
data {
int<lower=2> K;
vector[K] mu;
cov_matrix[K] Sigma;
// ...
}parameters {
vector[K] beta;
// ...
}model {
beta ~ multi_normal(mu, Sigma);// ...
}
In this case mu
and Sigma
are fixed data, but they could be unknown parameters, in which case their priors would be unaffected by a reparameterization of beta
.
If
data {
int<lower=2> K;
vector[K] mu;
cov_matrix[K] Sigma;
// ...
}transformed data {
matrix[K, K] L;
L = cholesky_decompose(Sigma);
}parameters {
vector[K] alpha;
// ...
}transformed parameters {
vector[K] beta;
beta = mu + L * alpha;
}model {
alpha ~ std_normal();// implies: beta ~ multi_normal(mu, Sigma)
// ...
}
This reparameterization is more efficient for two reasons. First, it reduces dependence among the elements of alpha
and second, it avoids the need to invert Sigma
every time multi_normal
is evaluated.
The Cholesky factor is also useful when a covariance matrix is decomposed into a correlation matrix that is multiplied from both sides by a diagonal matrix of standard deviations, where either the standard deviations or the correlations are unknown parameters. The Cholesky factor of the covariance matrix is equal to the product of a diagonal matrix of standard deviations and the Cholesky factor of the correlation matrix. Furthermore, the product of a diagonal matrix of standard deviations and a vector is equal to the elementwise product between the standard deviations and that vector. Thus, if for example the correlation matrix Tau
were fixed data but the vector of standard deviations sigma
were unknown parameters, then a reparameterization of beta
in terms of alpha
could be implemented as follows.
data {
int<lower=2> K;
vector[K] mu;
corr_matrix[K] Tau;
// ...
}transformed data {
matrix[K, K] L;
L = cholesky_decompose(Tau);
}parameters {
vector[K] alpha;
vector<lower=0>[K] sigma;
// ...
}transformed parameters {
vector[K] beta;
// This equals mu + diag_matrix(sigma) * L * alpha;
beta = mu + sigma .* (L * alpha);
}model {
0, 5);
sigma ~ cauchy(
alpha ~ std_normal();// implies: beta ~ multi_normal(mu,
// diag_matrix(sigma) * L * L' * diag_matrix(sigma)))
// ...
}
This reparameterization of a multivariate normal distribution in terms of standard normal variates can be extended to other multivariate distributions that can be conceptualized as contaminations of the multivariate normal, such as the multivariate Student t and the skew multivariate normal distribution.
A Wishart distribution can also be reparameterized in terms of standard normal variates and chi-squared variates. Let
data {
int<lower=1> N;
int<lower=1> K;
int<lower=K + 2> nu
matrix[K, K] L; // Cholesky factor of scale matrix
vector[K] mu;
matrix[N, K] y;
// ...
}parameters {
vector<lower=0>[K] c;
vector[0.5 * K * (K - 1)] z;
// ...
}model {
matrix[K, K] A;
int count = 1;
for (j in 1:(K - 1)) {
for (i in (j + 1):K) {
A[i, j] = z[count];1;
count +=
}for (i in 1:(j - 1)) {
0.0;
A[i, j] =
}
A[j, j] = sqrt(c[j]);
}for (i in 1:(K - 1)) {
0;
A[i, K] =
}
A[K, K] = sqrt(c[K]);
for (i in 1:K) {
1);
c[i] ~ chi_square(nu - i +
}
z ~ std_normal();// implies: L * A * A' * L' ~ wishart(nu, L * L')
y ~ multi_normal_cholesky(mu, L * A);// ...
}
This reparameterization is more efficient for three reasons. First, it reduces dependence among the elements of z
and second, it avoids the need to invert the covariance matrix, wishart
is evaluated. Third, if multi_normal_cholesky
function, rather than passing multi_normal
.
If
data {
int<lower=1> K;
int<lower=K + 2> nu
matrix[K, K] L; // Cholesky factor of scale matrix
// ...
}transformed data {
matrix[K, K] eye;
matrix[K, K] L_inv;
for (j in 1:K) {
for (i in 1:K) {
0.0;
eye[i, j] =
}1.0;
eye[j, j] =
}
L_inv = mdivide_left_tri_low(L, eye);
}parameters {
vector<lower=0>[K] c;
vector[0.5 * K * (K - 1)] z;
// ...
}model {
matrix[K, K] A;
matrix[K, K] A_inv_L_inv;
int count;
1;
count = for (j in 1:(K - 1)) {
for (i in (j + 1):K) {
A[i, j] = z[count];1;
count +=
}for (i in 1:(j - 1)) {
0.0;
A[i, j] =
}
A[j, j] = sqrt(c[j]);
}for (i in 1:(K - 1)) {
0;
A[i, K] =
}
A[K, K] = sqrt(c[K]);
A_inv_L_inv = mdivide_left_tri_low(A, L_inv);for (i in 1:K) {
1);
c[i] ~ chi_square(nu - i +
}
// implies: crossprod(A_inv_L_inv) ~
z ~ std_normal(); // inv_wishart(nu, L_inv' * L_inv)
// ...
}
Another candidate for reparameterization is the Dirichlet distribution with all
data {
int<lower=1> beta;
// ...
}parameters {
array[K] vector[beta] z;
// ...
}transformed parameters {
simplex[K] rho;
for (k in 1:K) {
// sum-of-squares
rho[k] = dot_self(z[k]);
}
rho = rho / sum(rho);
}model {
for (k in 1:K) {
z[k] ~ std_normal();
}// implies: rho ~ dirichlet(0.5 * beta * ones)
// ...
}
Vectorization
Gradient bottleneck
Stan spends the vast majority of its time computing the gradient of the log probability function, making gradients the obvious target for optimization. Stan’s gradient calculations with algorithmic differentiation require a template expression to be allocated and constructed for each subexpression of a Stan program involving parameters or transformed parameters.3 This section defines optimization strategies based on vectorizing these subexpressions to reduce the work done during algorithmic differentiation.
Vectorizing summations
Because of the gradient bottleneck described in the previous section, it is more efficient to collect a sequence of summands into a vector or array and then apply the sum()
operation than it is to continually increment a variable by assignment and addition. For example, consider the following code snippet, where foo()
is some operation that depends on n
.
for (n in 1:N) {
total += foo(n,...); }
This code has to create intermediate representations for each of the N
summands.
A faster alternative is to copy the values into a vector, then apply the sum()
operator, as in the following refactoring.
{vector[N] summands;
for (n in 1:N) {
summands[n] = foo(n,...);
}
total = sum(summands); }
Syntactically, the replacement is a statement block delineated by curly brackets ({
, }
), starting with the definition of the local variable summands
.
Even though it involves extra work to allocate the summands
vector and copy N
values into it, the savings in differentiation more than make up for it. Perhaps surprisingly, it will also use substantially less memory overall than incrementing total
within the loop.
Vectorization through matrix operations
The following program directly encodes a linear regression with fixed unit noise using a two-dimensional array x
of predictors, an array y
of outcomes, and an array beta
of regression coefficients.
data {
int<lower=1> K;
int<lower=1> N;
array[K, N] real x;
array[N] real y;
}parameters {
array[K] real beta;
}model {
for (n in 1:N) {
real gamma = 0;
for (k in 1:K) {
gamma += x[n, k] * beta[k];
}1);
y[n] ~ normal(gamma,
} }
The following model computes the same log probability function as the previous model, even supporting the same input files for data and initialization.
data {
int<lower=1> K;
int<lower=1> N;
array[N] vector[K] x;
array[N] real y;
}parameters {
vector[K] beta;
}model {
for (n in 1:N) {
1);
y[n] ~ normal(dot_product(x[n], beta),
} }
Although it produces equivalent results, the dot product should not be replaced with a transpose and multiply, as in
1); y[n] ~ normal(x[n]' * beta,
The relative inefficiency of the transpose and multiply approach is that the transposition operator allocates a new vector into which the result of the transposition is copied. This consumes both time and memory.4
The inefficiency of transposition could itself be mitigated by reordering the product and pulling the transposition out of the loop, as follows.
// ...
transformed parameters {
row_vector[K] beta_t;
beta_t = beta';
}model {
for (n in 1:N) {
1);
y[n] ~ normal(beta_t * x[n],
} }
The problem with transposition could be completely solved by directly encoding the x
as a row vector, as in the following example.
data {
// ...
array[N] row_vector[K] x;
// ...
}parameters {
vector[K] beta;
}model {
for (n in 1:N) {
1);
y[n] ~ normal(x[n] * beta,
} }
Declaring the data as a matrix and then computing all the predictors at once using matrix multiplication is more efficient still, as in the example discussed in the next section.
Having said all this, the most efficient way to code this model is with direct matrix multiplication, as in
data {
matrix[N, K] x;
vector[N] y;
}parameters {
vector[K] beta;
}model {
1);
y ~ normal(x * beta, }
In general, encapsulated single operations that do the work of loops will be more efficient in their encapsulated forms. Rather than performing a sequence of row-vector/vector multiplications, it is better to encapsulate it as a single matrix/vector multiplication.
Vectorized probability functions
The final and most efficient version replaces the loops and transformed parameters by using the vectorized form of the normal probability function, as in the following example.
data {
int<lower=1> K;
int<lower=1> N;
matrix[N, K] x;
vector[N] y;
}parameters {
vector[K] beta;
}model {
1);
y ~ normal(x * beta, }
The variables are all declared as either matrix or vector types. The result of the matrix-vector multiplication x * beta
in the model block is a vector of the same length as y
.
The probability function documentation in the function reference manual indicates which of Stan’s probability functions support vectorization; see the function reference manual for full details. Vectorized probability functions accept either vector or scalar inputs for all arguments, with the only restriction being that all vector arguments are the same dimensionality. In the example above, y
is a vector of size N
, x * beta
is a vector of size N
, and 1
is a scalar.
Reshaping data for vectorization
Sometimes data does not arrive in a shape that is ideal for vectorization, but can be put into such shape with some munging (either inside Stan’s transformed data block or outside).
John Hall provided a simple example on the Stan users group. Simplifying notation a bit, the original model had a sampling statement in a loop, as follows.
for (n in 1:N) {
y[n] ~ normal(mu[ii[n]], sigma); }
The brute force vectorization would build up a mean vector and then vectorize all at once.
{vector[N] mu_ii;
for (n in 1:N) {
mu_ii[n] = mu[ii[n]];
}
y ~ normal(mu_ii, sigma); }
If there aren’t many levels (values ii[n]
can take), then it behooves us to reorganize the data by group in a case like this. Rather than having a single observation vector y
, there are K of them. And because Stan doesn’t support ragged arrays, it means K declarations. For instance, with 5 levels, we have
1], sigma);
y_1 ~ normal(mu[// ...
5], sigma); y_5 ~ normal(mu[
This way, both the mu
and sigma
parameters are shared. Which way works out to be more efficient will depend on the shape of the data; if the sizes are small, the simple vectorization may be faster, but for moderate to large sized groups, the full expansion should be faster.
Exploiting sufficient statistics
In some cases, models can be recoded to exploit sufficient statistics in estimation. This can lead to large efficiency gains compared to an expanded model. For example, consider the following Bernoulli sampling model.
data {
int<lower=0> N;
array[N] int<lower=0, upper=1> y;
real<lower=0> alpha;
real<lower=0> beta;
}parameters {
real<lower=0, upper=1> theta;
}model {
theta ~ beta(alpha, beta);for (n in 1:N) {
y[n] ~ bernoulli(theta);
} }
In this model, the sum of positive outcomes in y
is a sufficient statistic for the chance of success theta
. The model may be recoded using the binomial distribution as follows.
theta ~ beta(alpha, beta); sum(y) ~ binomial(N, theta);
Because truth is represented as one and falsehood as zero, the sum sum(y)
of a binary vector y
is equal to the number of positive outcomes out of a total of N
trials.
This can be generalized to other discrete cases (one wouldn’t expect continuous observations to be duplicated if they are random). Suppose there are only foo
can be coded as follows.
for (k in 1:K) {
target += f[k] * foo_lpmf(z[k] | ...);
}
where the ellipses are the parameters of the log probability mass function for distribution foo
(there’s no distribution called “foo”; this is just a placeholder for any discrete distribution name).
The resulting program looks like a “weighted” regression, but here the weights f[k]
are counts and thus sufficient statistics for the PMF and simply amount to an alternative, more efficient coding of the same likelihood. For efficiency, the frequencies f[k]
should be counted once in the transformed data block and stored.
Aggregating common subexpressions
If an expression is calculated once, the value should be saved and reused wherever possible. That is, rather than using exp(theta)
in multiple places, declare a local variable to store its value and reuse the local variable.
Another case that may not be so obvious is with two multilevel parameters, say a[ii[n]] + b[jj[n]]
. If a
and b
are small (i.e., do not have many levels), then a table a_b
of their sums can be created, with
matrix[size(a), size(b)] a_b;
for (i in 1:size(a)) {
for (j in 1:size(b)) {
a_b[i, j] = a[i] + b[j];
} }
Then the sum can be replaced with a_b[ii[n], jj[n]]
.
Exploiting conjugacy
Continuing the model from the previous section, the conjugacy of the beta prior and binomial sampling distribution allow the model to be further optimized to the following equivalent form.
theta ~ beta(alpha + sum(y), beta + N - sum(y));
To make the model even more efficient, a transformed data variable defined to be sum(y)
could be used in the place of sum(y)
.
Standardizing predictors and outputs
Stan programs will run faster if the input is standardized to have a zero sample mean and unit sample variance. This section illustrates the principle with a simple linear regression.
Suppose that
If either vector
To standardize a regression problem, the predictors and outcomes are standardized. This changes the scale of the variables, and hence changes the scale of the priors. Consider the following initial model.
data {
int<lower=0> N;
vector[N] y;
vector[N] x;
}parameters {
real alpha;
real beta;
real<lower=0> sigma;
}model {
// priors
0, 10);
alpha ~ normal(0, 10);
beta ~ normal(0, 5);
sigma ~ cauchy(// likelihood
for (n in 1:N) {
y[n] ~ normal(alpha + beta * x[n], sigma);
} }
The data block for the standardized model is identical. The standardized predictors and outputs are defined in the transformed data block.
data {
int<lower=0> N;
vector[N] y;
vector[N] x;
}transformed data {
vector[N] x_std;
vector[N] y_std;
x_std = (x - mean(x)) / sd(x);
y_std = (y - mean(y)) / sd(y);
}parameters {
real alpha_std;
real beta_std;
real<lower=0> sigma_std;
}model {
0, 10);
alpha_std ~ normal(0, 10);
beta_std ~ normal(0, 5);
sigma_std ~ cauchy(for (n in 1:N) {
y_std[n] ~ normal(alpha_std + beta_std * x_std[n],
sigma_std);
} }
The parameters are renamed to indicate that they aren’t the “natural” parameters, but the model is otherwise identical. In particular, the fairly diffuse priors on the coefficients and error scale are the same. These could have been transformed as well, but here they are left as is, because the scales make sense as diffuse priors for standardized data; the priors could be made more informative. For instance, because the outputs
The original regression
These recovered parameter values on the original scales can be calculated within Stan using a generated quantities block following the model block,
generated quantities {
real alpha;
real beta;
real<lower=0> sigma;
alpha = sd(y) * (alpha_std - beta_std * mean(x) / sd(x))
+ mean(y);
beta = beta_std * sd(y) / sd(x);
sigma = sd(y) * sigma_std; }
It is inefficient to compute all of the means and standard deviations every iteration; for more efficiency, these can be calculated once and stored as transformed data. Furthermore, the model sampling statement can be easily vectorized, for instance, in the transformed model, to
y_std ~ normal(alpha_std + beta_std * x_std, sigma_std);
Standard normal distribution
For many applications on the standard scale, normal distributions with location zero and scale one will be used. In these cases, it is more efficient to use
y ~ std_normal();
than to use
0, 1); y ~ normal(
because the subtraction of the location and division by the scale cancel, as does subtracting the log of the scale.
Using map-reduce
The map-reduce operation, even without multi-core MPI support, can be used to make programs more scalable and also more efficient. See the map-reduce chapter for more information on implementing map-reduce operations.
Map-reduce allows greater scalability because only the Jacobian of the mapped function for each shard is stored. The Jacobian consists of all of the derivatives of the outputs with respect to the parameters. During execution, the derivatives of the shard are evaluated using nested automatic differentiation. As often happens with modern CPUs, reduced memory overhead leads to increased memory locality and faster execution. The Jacobians are all computed with local memory and their outputs stored contiguously in memory.
References
Footnotes
This parameterization came to be known on our mailing lists as the “Matt trick” after Matt Hoffman, who independently came up with it while fitting hierarchical models in Stan.↩︎
Riemannian Manifold Hamiltonian Monte Carlo (RMHMC) overcomes this difficulty by simulating the Hamiltonian dynamics in a space with a position-dependent metric; see Girolami and Calderhead (2011) and Betancourt (2012).↩︎
Stan uses its own arena-based allocation, so allocation and deallocation are faster than with a raw call to
new
.↩︎Future versions of Stan may remove this inefficiency by more fully exploiting expression templates inside the Eigen C++ matrix library. This will require enhancing Eigen to deal with mixed-type arguments, such as the type
double
used for constants and the algorithmic differentiation typestan::math::var
used for variables.↩︎