This is an old version, view current version.

1. Unlike in Python and R, which are interpreted, Stan is translated to C++ and compiled, so loops and assignment statements are fast. Vectorized code is faster in Stan because (a) the expression tree used to compute derivatives can be simplified, leading to fewer virtual function calls, and (b) computations that would be repeated in the looping version, such as log(sigma) in the above model, will be computed once and reused.↩︎

2. The Phi_approx function is a rescaled version of the inverse logit function, so while the scale is roughly the same $$\Phi$$, the tails do not match.↩︎

3. The Bernoulli-logit distribution builds in the log link function, taking $\mathsf{BernoulliLogit}(y \mid \alpha) = \mathsf{Bernoulli}(y \mid \mathrm{logit}^{-1}(\alpha)).$↩︎

4. Gelman and Hill (2007) treat the $$\delta$$ term equivalently as the location parameter in the distribution of student abilities.↩︎

5. The prior is named for Lewandowski, Kurowicka, and Joe, as it was derived by inverting the random correlation matrix generation strategy of Lewandowski, Kurowicka, and Joe (2009).↩︎

6. Thanks to Mike Lawrence for pointing this out in the GitHub issue for the manual.↩︎

7. The intercept in this model is $$\alpha / (1 - \beta)$$. An alternative parameterization in terms of an intercept $$\gamma$$ suggested Mark Scheuerell on GitHub is $$y_n \sim \mathsf{normal}(\gamma + \beta \cdot (y_{n-1} - \gamma), \sigma)$$.↩︎

8. In practice, it can be useful to remove the constraint to test whether a non-stationary set of coefficients provides a better fit to the data. It can also be useful to add a trend term to the model, because an unfitted trend will manifest as non-stationarity.↩︎

9. This subsection is a lightly edited comment of Jonathan Gilligan’s on GitHub; see ↩︎

10. A more meaningful estimation example would involve a regression of the observed and missing observations using predictors that were known for each and specified in the data block.↩︎

11. This is not the same as missing components of a multivariate predictor in a regression problem; in that case, you will need to represent the missing data as a parameter and impute missing values in order to feed them into the regression.↩︎

12. Imposing a constraint such as $$\theta < 0.5$$ will resolve the symmetry, but fundamentally changes the model and its posterior inferences.↩︎

13. Stan’s integer constraints are not powerful enough to express the constraint that $$\mathtt{r\_t[j]} \leq \mathtt{n\_t[j]}$$, but this constraint could be checked in the transformed data block.↩︎

14. When dividing two integers, the result type is an integer and rounding will ensue if the result is not exact. See the discussion of primitive arithmetic types in the reference manual for more information.↩︎

15. The computations are similar to those involved in expectation maximization (EM) algorithms Dempster, Laird, and Rubin (1977).↩︎

16. The source of the data is Jarrett (1979), which itself is a note correcting an earlier data collection.↩︎

17. The R counterpart, ifelse, is slightly different in that it is typically used in a vectorized situation. The conditional operator is not (yet) vectorized in Stan.↩︎

18. An alternative would be to compute this on the outside and feed it into the Stan model as preprocessed data. Yet another alternative encoding would be a sparse one recording only the capture events along with their time and identifying the individual captured.↩︎

19. Both functions return 0 if the individual represented by the input array was never captured. Individuals with no captures are not relevant for estimating the model because all probability statements are conditional on earlier captures. Typically they would be removed from the data, but the program allows them to be included even though they make not contribution to the log probability function.↩︎

20. Diagnostic procedures are often ordinal, as in stages of cancer in oncological diagnosis or the severity of a cavity in dental diagnosis. Dawid and Skene’s model may be used as is or naturally generalized for ordinal ratings using a latent continuous rating and cutpoints as in ordinal logistic regression.↩︎

21. In the subscript, $$z[i]$$ is written as $$z_i$$ to improve legibility.↩︎

22. For clustering, the non-identifiability problems for all mixture models present a problem, whereas there is no such problem for classification. Despite the difficulties with full Bayesian inference for clustering, researchers continue to use it, often in an exploratory data analysis setting rather than for predictive modeling.↩︎

23. Gaussian processes can be extended to covariance functions producing positive semi-definite matrices, but Stan does not support inference in the resulting models because the resulting distribution does not have unconstrained support.↩︎

24. This example is drawn from the documentation for the Boost Numeric Odeint library (Ahnert and Mulansky 2011), which Stan uses to implement the rk45 solver.↩︎

25. Not coincidentally, high curvature in the posterior of a general Stan model poses the same kind of problem for Euclidean Hamiltonian Monte Carlo (HMC) sampling. The reason is that HMC is based on the leapfrog algorithm, a gradient-based, stepwise numerical differential equation solver specialized for Hamiltonian systems with separable potential and kinetic energy terms.↩︎

26. The notable exception is Intel’s optimizing compilers under certain optimization settings.↩︎

27. The main problem with comments is that they can be misleading, either due to misunderstandings on the programmer’s part or because the program’s behavior is modified after the comment is written. The program always behaves the way the code is written, which is why refactoring complex code into understandable units is preferable to simply adding comments.↩︎

28. Just because this makes it possible to code a rejection sampler does not make it a good idea. Rejections break differentiability and the smooth exploration of the posterior. In Hamiltonian Monte Carlo, it can cause the sampler to be reduced to a diffusive random walk.↩︎

29. A range of built-in validation routines is coming to Stan soon! Alternatively, the reject statement can be used to check constraints on the simplex.↩︎

30. As of Stan 2.9.0, the only way a user-defined producer will raise an exception is if a function it calls (including sampling statements) raises an exception via the reject statement.↩︎

31. The original code and impetus for including this in the manual came from the Stan forums post ; by user lcomm, who also explained truncation above and below.↩︎

32. The problem is the (extremely!) light tails of the triangle distribution. The standard HMC and NUTS samplers can’t get into the corners of the triangle properly. Because the Stan code declares y to be of type real<lower=-1,upper=1>, the inverse logit transform is applied to the unconstrained variable and its log absolute derivative added to the log probability. The resulting distribution on the logit-transformed y is well behaved.↩︎

33. This example was raised by Richard McElreath on the Stan users group in a query about the difference in behavior between Gibbs sampling as used in BUGS and JAGS and the Hamiltonian Monte Carlo (HMC) and no-U-turn samplers (NUTS) used by Stan.↩︎

34. The marginal posterior $$p(\sigma|y)$$ for $$\sigma$$ is proper here as long as there are at least two distinct data points.↩︎

35. A Laplace prior (or an L1 regularizer for penalized maximum likelihood estimation) is not sufficient to remove this additive invariance. It provides shrinkage, but does not in and of itself identify the parameters because adding a constant to $$\lambda_1$$ and subtracting it from $$\lambda_2$$ results in the same value for the prior density.↩︎

36. Tempering methods may be viewed as automated ways to carry out such a search for modes, though most MCMC tempering methods continue to search for modes on an ongoing basis; see (Swendsen and Wang 1986; Neal 1996b).↩︎

37. This is in contrast to (penalized) maximum likelihood estimates, which are not parameterization invariant.↩︎

38. This example is for illustrative purposes only; the recommended way to implement the lognormal distribution in Stan is with the built-in lognormal probability function; see the functions reference manual for details.↩︎

39. This parameterization came to be known on our mailing lists as the “Matt trick” after Matt Hoffman, who independently came up with it while fitting hierarchical models in Stan.↩︎

40. Riemannian Manifold Hamiltonian Monte Carlo (RMHMC) overcomes this difficulty by simulating the Hamiltonian dynamics in a space with a position-dependent metric; see Girolami and Calderhead (2011) and Betancourt (2012).↩︎

41. Stan uses its own arena-based allocation, so allocation and deallocation are faster than with a raw call to new.↩︎

42. Future versions of Stan may remove this inefficiency by more fully exploiting expression templates inside the Eigen C++ matrix library. This will require enhancing Eigen to deal with mixed-type arguments, such as the type double used for constants and the algorithmic differentiation type stan::math::var used for variables.↩︎

43. The term “shard” is borrowed from databases, where it refers to a slice of the rows of a database. That is exactly what it is here if we think of rows of a dataframe. Stan’s shards are more general in that they need not correspond to rows of a dataframe.↩︎

44. This example is a simplified form of the model described in (Gelman and Hill 2007, Section 14.2)↩︎

45. This makes the strong assumption that each group has the same number of observations!↩︎

46. Even 80 characters may be too many for rendering in print; for instance, in this manual, the number of code characters that fit on a line is about 65.↩︎

47. This is the usual convention in both typesetting and other programming languages. Neither R nor BUGS allows breaks before an operator because they allow newlines to signal the end of an expression or statement.↩︎

48. Except where otherwise noted, we use “BUGS” to refer to WinBUGS, OpenBUGS, and JAGS, indiscriminately.↩︎

49. Most distributions have been vectorized, but currently the truncated versions may not exist and may not be vectorized.↩︎

### References

Ahnert, Karsten, and Mario Mulansky. 2011. “Odeint—Solving Ordinary Differential Equations in C++.” arXiv 1110.3397.

Betancourt, Michael. 2012. “A General Metric for Riemannian Manifold Hamiltonian Monte Carlo.” arXiv 1212.4693. http://arxiv.org/abs/1212.4693.

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 1–38.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.

Girolami, Mark, and Ben Calderhead. 2011. “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (2): 123–214.

Jarrett, R. G. 1979. “A Note on the Intervals Between Coal-Mining Disasters.” Biometrika 66 (1): 191–93.

Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100: 1989–2001.

Neal, Radford M. 1996b. “Sampling from Multimodal Distributions Using Tempered Transitions.” Statistics and Computing 6 (4): 353–66.

Swendsen, Robert H., and Jian-Sheng Wang. 1986. “Replica Monte Carlo Simulation of Spin Glasses.” Physical Review Letters 57: 2607–9.