Transitioning from BUGS

From the outside, Stan and BUGS¹ are similar—they use statistically-themed modeling languages (which are similar but with some differences; see below), they can be called from R, running some specified number of chains to some specified length, producing posterior simulations that can be assessed using standard convergence diagnostics. This is not a coincidence: in designing Stan: we wanted to keep many of the useful features of Bugs.

Some differences in how BUGS and Stan work

BUGS is interpreted, Stan is compiled

Stan is compiled in two steps, first a model is translated to templated C++ and then to a platform-specific executable. Stan, unlike BUGS, allows the user to directly program in C++, but we do not describe how to do this in this Stan manual (see the getting started with C++ section of https://mc-stan.org for more information on using Stan directly from C++).

BUGS performs MCMC updating one scalar parameter at a time, Stan uses HMC which moves in the entire space of all the parameters at each step

BUGS performs MCMC updating one scalar parameter at a time, (with some exceptions such as JAGS’s implementation of regression and generalized linear models and some conjugate multivariate parameters), using conditional distributions (Gibbs sampling) where possible and otherwise using adaptive rejection sampling, slice sampling, and Metropolis jumping. BUGS figures out the dependence structure of the joint distribution as specified in its modeling language and uses this information to compute only what it needs at each step. Stan moves in the entire space of all the parameters using Hamiltonian Monte Carlo (more precisely, the no-U-turn sampler), thus avoiding some difficulties that occur with one-dimension-at-a-time sampling in high dimensions but at the cost of requiring the computation of the entire log density at each step.

Differences in tuning during warmup

BUGS tunes its adaptive jumping (if necessary) during its warmup phase (traditionally referred to as “burn-in”). Stan uses its warmup phase to tune the no-U-turn sampler (NUTS).

The Stan language is directly executable, the BUGS modeling language is not

The BUGS modeling language is not directly executable. Rather, BUGS parses its model to determine the posterior density and then decides on a sampling scheme. In contrast, the statements in a Stan model are directly executable: they translate exactly into C++ code that is used to compute the log posterior density (which in turn is used to compute the gradient).

Differences in statement order

In BUGS, statements are executed according to the directed graphical model so that variables are always defined when needed. A side effect of the direct execution of Stan’s modeling language is that statements execute in the order in which they are written. For instance, the following Stan program, which sets mu before using it to sample y:

mu = a + b * x;
y ~ normal(mu, sigma);

translates to the following C++ code:

mu = a + b * x;
target += normal_lpdf(y | mu, sigma);

Contrast this with the following Stan program:

y ~ normal(mu, sigma);
mu = a + b * x;

This program is well formed, but is almost certainly a coding error, because it attempts to use mu before it is set. The direct translation to C++ code highlights the potential error of using mu in the first statement:

target += normal_lpdf(y | mu, sigma);
mu = a + b * x;

To trap these kinds of errors, variables are initialized to the special not-a-number (NaN) value. If NaN is passed to a log probability function, it will raise a domain exception, which will in turn be reported by the sampler. The sampler will reject the sample out of hand as if it had zero probability.

Stan computes the gradient of the log density, BUGS computes the log density but not its gradient

Stan uses its own C++ algorithmic differentiation packages to compute the gradient of the log density (up to a proportion). Gradients are required during the Hamiltonian dynamics simulations within the leapfrog algorithm of the Hamiltonian Monte Carlo and NUTS samplers.

Both BUGS and Stan are semi-automatic

Both BUGS and Stan are semi-automatic in that they run by themselves with no outside tuning required. Nevertheless, the user needs to pick the number of chains and number of iterations per chain. We usually pick 4 chains and start with 10 iterations per chain (to make sure there are no major bugs and to approximately check the timing), then go to 100, 1000, or more iterations as necessary. Compared to Gibbs or Metropolis, Hamiltonian Monte Carlo can take longer per iteration (as it typically takes many “leapfrog steps” within each iteration), but the iterations typically have lower autocorrelation. So Stan might work fine with 1000 iterations in an example where BUGS would require 100,000 for good mixing. We recommend monitoring potential scale reduction statistics (\(\hat{R}\)) and the effective sample size to judge when to stop (stopping when \(\hat{R}\) values do not counter-indicate convergence and when high enough effective sample size have been obtained).

Licensing

WinBUGS is closed source. OpenBUGS and JAGS are both licensed under the Gnu Public License (GPL), otherwise known as copyleft due to the restrictions it places on derivative works. Stan is licensed under the much more liberal new BSD license.

Interfaces

Like WinBUGS, OpenBUGS and JAGS, Stan can be run directly from the command line or through common analytics platforms like R, Python, Julia, MATLAB, Mathematica, and the command line.

Platforms

Like OpenBUGS and JAGS, Stan can be run on Linux, Mac, and Windows platforms.

Some differences in the modeling languages

The BUGS modeling language follows an R-like syntax in which line breaks are meaningful. Stan follows the rules of C, in which line breaks are equivalent to spaces, and each statement ends in a semicolon. For example:

y ~ normal(mu, sigma);

and

for (i in 1:n) y[i] ~ normal(mu, sigma);

Or, equivalently (recall that a line break is just another form of whitespace),

for (i in 1:n)
  y[i] ~ normal(mu, sigma);

and also equivalently,

for (i in 1:n) {
  y[i] ~ normal(mu, sigma);
}

There’s a semicolon after the model statement but not after the brackets indicating the body of the for loop.

In Stan, variables can have names constructed using letters, numbers, and the underscore (_) symbol, but nothing else (and a variable name cannot begin with a number). BUGS variables can also include the dot, or period (.) symbol.

In Stan, the second argument to the “normal” function is the standard deviation (i.e., the scale), not the variance (as in Bayesian Data Analysis) and not the inverse-variance (i.e., precision) (as in BUGS). Thus a normal with mean 1 and standard deviation 2 is normal(1,2), not normal(1,4) or normal(1,0.25).

Similarly, the second argument to the “multivariate normal” function is the covariance matrix and not the inverse covariance matrix (i.e., the precision matrix) (as in BUGS). The same is true for the “multivariate student” distribution.

The distributions have slightly different names:

BUGS	Stan
`dnorm`	`normal`
`dbinom`	`binomial`
`dpois`	`poisson`
…	…

Stan, unlike BUGS, allows intermediate quantities, in the form of local variables, to be reassigned. For example, the following is legal and meaningful (if possibly inefficient) Stan code.

{
  total = 0;
  for (i in 1:n) {
    theta[i] ~ normal(total, sigma);
    total = total + theta[i];
  }
}

In BUGS, the above model would not be legal because the variable total is defined more than once. But in Stan, the loop is executed in order, so total is overwritten in each step.

Stan uses explicit declarations. Variables are declared with base type integer or real, and vectors, matrices, and arrays have specified dimensions. When variables are bounded, we give that information also. For data and transformed parameters, the bounds are used for error checking. For parameters, the constraints are critical to sampling as they determine the geometry over which the Hamiltonian is simulated.

In Stan, variables can be declared as data, transformed data, parameters, transformed parameters, or generated quantities. They can also be declared as local variables within blocks. For more information, see the part of this manual devoted to the Stan programming language and examine at the example models.

Stan allows all sorts of tricks with vector and matrix operations which can make Stan models more compact. For example, arguments to probability functions may be vectorized,² allowing

for (i in 1:n) {
  y[i] ~ normal(mu[i], sigma[i]);
}

to be expressed more compactly as

y ~ normal(mu, sigma);

The vectorized form is also more efficient because Stan can unfold the computation of the chain rule during algorithmic differentiation.

Stan also allows for arrays of vectors and matrices. For example, in a hierarchical model might have a vector of K parameters for each of J groups; this can be declared using

array[J] vector[K] theta;

Then theta[j] is an expression denoting a K-vector and may be used in the code just like any other vector variable.

An alternative encoding would be with a two-dimensional array, as in

array[J, K] real theta;

The vector version can have some advantages, both in convenience and in computational speed for some operations.

A third encoding would use a matrix:

matrix[J, K] theta;

but in this case, theta[j] is a row vector, not a vector, and accessing it as a vector is less efficient than with an array of vectors. The transposition operator, as in theta[j]', may be used to convert the row vector theta[j] to a (column) vector. Column vector and row vector types are not interchangeable everywhere in Stan; see the function signature declarations in the programming language section of this manual.

Stan supports general conditional statements using a standard if-else syntax. For example, a zero-inflated (or -deflated) Poisson mixture model is defined using the if-else syntax as described in the zero inflation section.

Stan supports general while loops using a standard syntax. While loops give Stan full Turing equivalent computational power. They are useful for defining iterative functions with complex termination conditions. As an illustration of their syntax, the for-loop

model {
    // ...
    for (n in 1:N) {
        // ... do something with n ....
    }
}

may be recoded using the following while loop.

model {
    int n;
    // ...
    n = 1;
    while (n <= N) {
        // ... do something with n ...
        n = n + 1;
    }
}

Some differences in the statistical models that are allowed

Stan does not yet support declaration of discrete parameters. Discrete data variables are supported. Inference is supported for discrete parameters as described in the mixture and latent discrete parameters chapters of the manual.

Stan has some distributions on covariance matrices that do not exist in BUGS, including a uniform distribution over correlation matrices which may be rescaled, and the priors based on C-vines defined in Lewandowski, Kurowicka, and Joe (2009). In particular, the Lewandowski et al. prior allows the correlation matrix to be shrunk toward the unit matrix while the scales are given independent priors.

In BUGS you need to define all variables. In Stan, if you declare but don’t define a parameter it implicitly has a flat prior (on the scale in which the parameter is defined). For example, if you have a parameter p declared as

real<lower=0, upper=1> p;

and then have no distribution statement for p in the model block, then you are implicitly assigning a uniform \([0,1]\) prior on p.

On the other hand, if you have a parameter theta declared with

real theta;

and have no distribution statement for theta in the model block, then you are implicitly assigning an improper uniform prior on \((-\infty,\infty)\) to theta.

BUGS models are always proper (being constructed as a product of proper marginal and conditional densities). Stan models can be improper. Here is the simplest improper Stan model:

parameters {
  real theta;
}
model { }

Although parameters in Stan models may have improper priors, we do not want improper posterior distributions, as we are trying to use these distributions for Bayesian inference. There is no general way to check if a posterior distribution is improper. But if all the priors are proper, the posterior will be proper also.

Each statement in a Stan model is directly translated into the C++ code for computing the log posterior. Thus, for example, the following pair of statements is legal in a Stan model:

y ~ normal(0,1);
y ~ normal(2,3);

The second line here does not simply overwrite the first; rather, both statements contribute to the density function that is evaluated. The above two lines have the effect of including the product, \(\textsf{normal}(y \mid 0,1) * \textsf{normal}(y \mid 2,3)\), into the density function.

For a perhaps more confusing example, consider the following two lines in a Stan model:

x ~ normal(0.8 * y, sigma);
y ~ normal(0.8 * x, sigma);

At first, this might look like a joint normal distribution with a correlation of 0.8. But it is not. The above are not interpreted as conditional entities; rather, they are factors in the joint density. Multiplying them gives, \(\textsf{normal}(x \mid 0.8y,\sigma) \times \textsf{normal}(y \mid 0.8x,\sigma)\), which is what it is (you can work out the algebra) but it is not the joint distribution where the conditionals have regressions with slope 0.8.

With censoring and truncation, Stan uses the censored-data or truncated-data likelihood—this is not always done in BUGS. All of the approaches to censoring and truncation discussed in Gelman et al. (2013) and Gelman and Hill (2007) may be implemented in Stan directly as written.

Stan, like BUGS, can benefit from human intervention in the form of reparameterization.

Some differences when running from R

Stan can be set up from within R using two lines of code. Follow the instructions for running Stan from R on the Stan web site. You don’t need to separately download Stan and RStan. Installing RStan will automatically set up Stan.

In practice we typically run the same Stan model repeatedly. If you pass RStan the result of a previously fitted model the model will not need be recompiled. An example is given on the running Stan from R pages available from the Stan web site.

When you run Stan, it saves various conditions including starting values, some control variables for the tuning and running of the no-U-turn sampler, and the initial random seed. You can specify these values in the Stan call and thus achieve exact replication if desired. (This can be useful for debugging.)

When running BUGS from R, you need to send exactly the data that the model needs. When running RStan, you can include extra data, which can be helpful when playing around with models. For example, if you remove a variable x from the model, you can keep it in the data sent from R, thus allowing you to quickly alter the Stan model without having to also change the calling information in your R script.

As in R2WinBUGS and R2jags, after running the Stan model, you can quickly summarize using plot() and print(). You can access the simulations themselves using various extractor functions, as described in the RStan documentation.

Various information about the sampler, such as number of leapfrog steps, log probability, and step size, is available through extractor functions. These can be useful for understanding what is going wrong when the algorithm is slow to converge.

The Stan community

Stan, like WinBUGS, OpenBUGS, and JAGS, has an active community, which you can access via the user’s mailing list and the developer’s mailing list; see the Stan web site for information on subscribing and posting and to look at archives.

References

Gelman, Andrew, J. B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Third Edition. London: Chapman & Hall / CRC Press.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.

Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100: 1989–2001.

Footnotes

Except where otherwise noted, we use “BUGS” to refer to WinBUGS, OpenBUGS, and JAGS, indiscriminately.↩︎
Most distributions have been vectorized, but currently the truncated versions may not exist.↩︎