References

Agrawal, Nikhar, Anton Bikineev, Paul Bristow, Marco Guazzone, Christopher Kormanyos, Hubert Holin, Bruno Lalande, et al. 2017. “Double-Exponential Quadrature.” https://www.boost.org/doc/libs/1_66_0/libs/math/doc/html/math_toolkit/double_exponential.html.

Aguilar, Omar, and Mike West. 2000. “Bayesian Dynamic Factor Models and Portfolio Allocation.” Journal of Business & Economic Statistics 18 (3): 338–57.

Ahnert, Karsten, and Mario Mulansky. 2011. “Odeint—Solving Ordinary Differential Equations in C++.” arXiv 1110.3397.

Albert, J. H., and S. Chib. 1993. “Bayesian Analysis of Binary and Polychotomous Response Data.” Journal of the American Statistical Association 88: 669–79.

Barnard, John, Robert McCulloch, and Xiao-Li Meng. 2000. “Modeling Covariance Matrices in Terms of Standard Deviations and Correlations, with Application to Shrinkage.” Statistica Sinica, 1281–1311.

Betancourt, Michael. 2012. “A General Metric for Riemannian Manifold Hamiltonian Monte Carlo.” arXiv 1212.4693. http://arxiv.org/abs/1212.4693.

Betancourt, Michael, and Mark Girolami. 2013. “Hamiltonian Monte Carlo for Hierarchical Models.” arXiv 1312.0906. http://arxiv.org/abs/1312.0906.

Blei, David M., and John D. Lafferty. 2007. “A Correlated Topic Model of Science.” The Annals of Applied Statistics 1 (1): 17–37.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.

Chung, Yeojin, Sophia Rabe-Hesketh, Vincent Dorie, Andrew Gelman, and Jingchen Liu. 2013. “A Nondegenerate Penalized Likelihood Estimator for Variance Parameters in Multilevel Models.” Psychometrika 78 (4): 685–709.

Clayton, D. G. 1992. “Models for the Analysis of Cohort and Case-Control Studies with Inaccurately Measured Exposures.” In Statistical Models for Longitudinal Studies of Exposure and Health, edited by James H. Dwyer, Manning Feinleib, Peter Lippert, and Hans Hoffmeister, 301–31. New York: Oxford University Press.

Cohen, Scott D., and Alan C. Hindmarsh. 1996. “CVODE, a Stiff/Nonstiff ODE Solver in C.” Computers in Physics 10 (2): 138–43.

Cormack, R. M. 1964. “Estimates of Survival from the Sighting of Marked Animals.” Biometrika 51 (3/4): 429–38.

Curtis, S. McKay. 2010. “BUGS Code for Item Response Theory.” Journal of Statistical Software 36 (1): 1–34.

Dawid, A. P., and A. M. Skene. 1979. “Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 28 (1): 20–28.

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 1–38.

Dormand, John R., and Peter J. Prince. 1980. “A Family of Embedded Runge-Kutta Formulae.” Journal of Computational and Applied Mathematics 6 (1): 19–26.

Engle, Robert F. 1982. “Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation.” Econometrica 50: 987–1008.

Fonnesbeck, Chris, Anand Patil, David Huard, and John Salvatier. 2013. PyMC User’s Guide.

Gelman, Andrew. 2004. “Parameterization and Bayesian Modeling.” Journal of the American Statistical Association 99: 537–45.

Gelman, Andrew, J. B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. London: Chapman & Hall/CRC Press.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.

Girolami, Mark, and Ben Calderhead. 2011. “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (2): 123–214.

Greene, William H. 2011. Econometric Analysis. 7th ed. Prentice-Hall.

Hoeting, Jennifer A., David Madigan, Adrian E. Raftery, and Chris T. Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science 14 (4): 382–417.

Hoffman, Matthew D., and Andrew Gelman. 2011. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” arXiv 1111.4246. http://arxiv.org/abs/1111.4246.

———. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15: 1593–1623. http://jmlr.org/papers/v15/hoffman14a.html.

Jarrett, R. G. 1979. “A Note on the Intervals Between Coal-Mining Disasters.” Biometrika 66 (1): 191–93.

Kim, Sangjoon, Neil Shephard, and Siddhartha Chib. 1998. “Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models.” Review of Economic Studies 65: 361–93.

Lambert, Diane. 1992. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics 34 (1): 1–14.

Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100: 1989–2001.

Lincoln, F. C. 1930. “Calculating Waterfowl Abundance on the Basis of Banding Returns.” United States Department of Agriculture Circular 118: 1–4.

Marsaglia, George. 1972. “Choosing a Point from the Surface of a Sphere.” The Annals of Mathematical Statistics 43 (2): 645–46.

Neal, Radford M. 1996a. Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. New York: Springer.

———. 1996b. “Sampling from Multimodal Distributions Using Tempered Transitions.” Statistics and Computing 6 (4): 353–66.

———. 1997. “Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification.” Technical Report 9702. University of Toronto, Department of Statistics.

———. 2003. “Slice Sampling.” Annals of Statistics 31 (3): 705–67.

Papaspiliopoulos, Omiros, Gareth O. Roberts, and Martin Sköld. 2007. “A General Framework for the Parametrization of Hierarchical Models.” Statistical Science 22 (1): 59–73.

Petersen, C. G. J. 1896. “The Yearly Immigration of Young Plaice into the Limfjord from the German Sea.” Report of the Danish Biological Station 6: 5–84.

Piironen, Juho, and Aki Vehtari. 2016. “Projection Predictive Model Selection for Gaussian Processes.” In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE.

Powell, Michael J. D. 1970. “A Hybrid Method for Nonlinear Equations.” In Numerical Methods for Nonlinear Algebraic Equations, edited by P. Rabinowitz. Gordon and Breach.

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press.

Richardson, Sylvia, and Walter R. Gilks. 1993. “A Bayesian Approach to Measurement Error Problems in Epidemiology Using Conditional Independence Models.” American Journal of Epidemiology 138 (6): 430–42.

Rubin, Donald B. 1981. “Estimation in Parallel Randomized Experiments.” Journal of Educational Statistics 6: 377–401.

Schofield, Matthew R. 2007. “Hierarchical Capture-Recapture Models.” PhD thesis, Department of Statistics, University of Otago, Dunedin.

Serban, Radu, and Alan C. Hindmarsh. 2005. “CVODES: The Sensitivity-Enabled ODE Solver in SUNDIALS.” In ASME 2005 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 257–69. American Society of Mechanical Engineers.

Smith, Teresa C., David J. Spiegelhalter, and Andrew Thomas. 1995. “Bayesian Approaches to Random-Effects Meta-Analysis: A Comparative Study.” Statistics in Medicine 14 (24): 2685–99.

Swendsen, Robert H., and Jian-Sheng Wang. 1986. “Replica Monte Carlo Simulation of Spin Glasses.” Physical Review Letters 57: 2607–9.

Warn, David E., S. G. Thompson, and David J. Spiegelhalter. 2002. “Bayesian Random Effects Meta-Analysis of Trials with Binary Outcomes: Methods for the Absolute Risk Difference and Relative Risk Scales.” Statistics in Medicine 21: 1601–23.

Zellner, Arnold. 1962. “An Efficient Method of Estimating Seemingly Unrelated Regression Equations and Tests for Aggregation Bias.” Journal of the American Statistical Association 57: 348–68.

Zhang, Hao. 2004. “Inconsistent Estimation and Asymptotically Equal Interpolations in Model-Based Geostatistics.” Journal of the American Statistical Association 99 (465): 250–61.

Zyczkowski, K., and H. J. Sommers. 2001. “Induced Measures in the Space of Mixed Quantum States.” Journal of Physics A: Mathematical and General 34 (35): 7111.


  1. Unlike in Python and R, which are interpreted, Stan is translated to C++ and compiled, so loops and assignment statements are fast. Vectorized code is faster in Stan because (a) the expression tree used to compute derivatives can be simplified, leading to fewer virtual function calls, and (b) computations that would be repeated in the looping version, such as log(sigma) in the above model, will be computed once and reused.↩︎
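
As a small sketch (variable names illustrative), compare the looped and vectorized forms of the same likelihood:

    // looped form: log(sigma) is recomputed on every iteration
    for (n in 1:N)
      y[n] ~ normal(mu, sigma);

    // vectorized form: a single sampling statement; log(sigma) is computed once
    y ~ normal(mu, sigma);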

  2. The Phi_approx function is a rescaled version of the inverse logit function, so while its scale roughly matches that of \(\Phi\), the tails do not match.↩︎
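
For reference, the underlying approximation composes the inverse logit with a cubic polynomial; to the best of our knowledge the implementation uses \[\texttt{Phi\_approx}(x) = \mathrm{logit}^{-1}\!\left(0.07056\,x^3 + 1.5976\,x\right).\]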

  3. The Bernoulli-logit distribution builds in the log link function, taking \[\mathsf{BernoulliLogit}(y \mid \alpha) = \mathsf{Bernoulli}(y \mid \mathrm{logit}^{-1}(\alpha)).\]↩︎
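
In Stan code, the following two sampling statements are therefore equivalent, with the first being more efficient and more numerically stable:

    y ~ bernoulli_logit(alpha);
    y ~ bernoulli(inv_logit(alpha));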

  4. Gelman and Hill (2007) treat the \(\delta\) term equivalently as the location parameter in the distribution of student abilities.↩︎

  5. The prior is named for Lewandowski, Kurowicka, and Joe, as it was derived by inverting the random correlation matrix generation strategy of Lewandowski, Kurowicka, and Joe (2009).↩︎

  6. Thanks to Mike Lawrence for pointing this out in the GitHub issue for the manual.↩︎

  7. The intercept in this model is \(\alpha / (1 - \beta)\). An alternative parameterization in terms of an intercept \(\gamma\), suggested by Mark Scheuerell on GitHub, is \(y_n \sim \mathsf{normal}(\gamma + \beta \cdot (y_{n-1} - \gamma), \sigma)\).↩︎
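
A minimal sketch of this parameterization in Stan (variable names illustrative):

    model {
      for (n in 2:N)
        y[n] ~ normal(gamma + beta * (y[n - 1] - gamma), sigma);
    }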

  8. In practice, it can be useful to remove the constraint to test whether a non-stationary set of coefficients provides a better fit to the data. It can also be useful to add a trend term to the model, because an unfitted trend will manifest as non-stationarity.↩︎

  9. This subsection is a lightly edited version of a comment by Jonathan Gilligan on GitHub.↩︎

  10. A more meaningful estimation example would involve a regression of the observed and missing observations using predictors that were known for each and specified in the data block.↩︎

  11. This is not the same as missing components of a multivariate predictor in a regression problem; in that case, you will need to represent the missing data as a parameter and impute missing values in order to feed them into the regression.↩︎

  12. Imposing a constraint such as \(\theta < 0.5\) will resolve the symmetry, but fundamentally changes the model and its posterior inferences.↩︎
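
As a sketch, such a constraint can be imposed directly in the declaration of the mixing proportion:

    parameters {
      real<lower=0, upper=0.5> theta;  // symmetry-breaking constraint
    }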

  13. Stan’s integer constraints are not powerful enough to express the constraint that \(\mathtt{r\_t[j]} \leq \mathtt{n\_t[j]}\), but this constraint could be checked in the transformed data block.↩︎
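
A sketch of such a check, assuming r_t and n_t are integer arrays of size J declared in the data block:

    transformed data {
      for (j in 1:J) {
        if (r_t[j] > n_t[j])
          reject("r_t[", j, "] = ", r_t[j], " > n_t[", j, "] = ", n_t[j]);
      }
    }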

  14. When dividing two integers, the result type is an integer, and rounding will ensue if the result is not exact; for example, 7 / 2 evaluates to 3. See the discussion of primitive arithmetic types in the reference manual for more information.↩︎

  15. The computations are similar to those involved in expectation maximization (EM) algorithms (Dempster, Laird, and Rubin 1977).↩︎

  16. The source of the data is Jarrett (1979), which itself is a note correcting an earlier data collection.↩︎

  17. The R counterpart, ifelse, is slightly different in that it is typically used in a vectorized situation. The conditional operator is not (yet) vectorized in Stan.↩︎
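
For example, assuming y has already been declared, the conditional operator selects between two scalar expressions:

    real x = (y > 0) ? 1.0 : 0.0;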

  18. An alternative would be to compute this on the outside and feed it into the Stan model as preprocessed data. Yet another alternative encoding would be a sparse one recording only the capture events along with their time and identifying the individual captured.↩︎

  19. Both functions return 0 if the individual represented by the input array was never captured. Individuals with no captures are not relevant for estimating the model because all probability statements are conditional on earlier captures. Typically they would be removed from the data, but the program allows them to be included even though they make no contribution to the log probability function.↩︎

  20. Diagnostic procedures are often ordinal, as in stages of cancer in oncological diagnosis or the severity of a cavity in dental diagnosis. Dawid and Skene’s model may be used as is or naturally generalized for ordinal ratings using a latent continuous rating and cutpoints as in ordinal logistic regression.↩︎

  21. In the subscript, \(z[i]\) is written as \(z_i\) to improve legibility.↩︎

  22. For clustering, the non-identifiability problems for all mixture models present a problem, whereas there is no such problem for classification. Despite the difficulties with full Bayesian inference for clustering, researchers continue to use it, often in an exploratory data analysis setting rather than for predictive modeling.↩︎

  23. Gaussian processes can be extended to covariance functions producing positive semi-definite matrices, but Stan does not support inference for such models because the resulting distribution does not have unconstrained support.↩︎

  24. This example is drawn from the documentation for the Boost Numeric Odeint library (Ahnert and Mulansky 2011), which Stan uses to implement the rk45 solver.↩︎

  25. Not coincidentally, high curvature in the posterior of a general Stan model poses the same kind of problem for Euclidean Hamiltonian Monte Carlo (HMC) sampling. The reason is that HMC is based on the leapfrog algorithm, a gradient-based, stepwise numerical differential equation solver specialized for Hamiltonian systems with separable potential and kinetic energy terms.↩︎
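
For concreteness, one leapfrog step of size \(\epsilon\) updates momentum \(p\) and position \(q\) using only gradients of the potential energy \(U(q)\) (with mass matrix \(M\)): \[p \leftarrow p - \tfrac{\epsilon}{2}\,\nabla U(q), \qquad q \leftarrow q + \epsilon\, M^{-1} p, \qquad p \leftarrow p - \tfrac{\epsilon}{2}\,\nabla U(q).\] Because each update is a local gradient step, regions of high curvature force very small step sizes.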

  26. The notable exception is Intel’s optimizing compilers under certain optimization settings.↩︎

  27. The main problem with comments is that they can be misleading, either due to misunderstandings on the programmer’s part or because the program’s behavior is modified after the comment is written. The program always behaves the way the code is written, which is why refactoring complex code into understandable units is preferable to simply adding comments.↩︎

  28. Just because this makes it possible to code a rejection sampler does not make it a good idea. Rejections break differentiability and the smooth exploration of the posterior. In Hamiltonian Monte Carlo, it can cause the sampler to be reduced to a diffusive random walk.↩︎

  29. A range of built-in validation routines is coming to Stan soon! Alternatively, the reject statement can be used to check constraints on the simplex.↩︎
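
As a sketch, a user-defined function (the name validate_simplex is hypothetical) can perform such a check:

    functions {
      void validate_simplex(vector theta) {
        if (fabs(sum(theta) - 1) > 1e-8)
          reject("theta must sum to 1; found sum(theta) = ", sum(theta));
        for (n in 1:rows(theta))
          if (theta[n] < 0)
            reject("theta[", n, "] = ", theta[n], " is negative");
      }
    }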

  30. As of Stan 2.9.0, the only way a user-defined function will raise an exception is if a function it calls (including sampling statements) raises an exception via the reject statement.↩︎

  31. The original code and impetus for including this in the manual came from a Stan forums post by user lcomm, who also explained truncation above and below.↩︎

  32. The problem is the (extremely!) light tails of the triangle distribution. The standard HMC and NUTS samplers can’t get into the corners of the triangle properly. Because the Stan code declares y to be of type real<lower=-1,upper=1>, the inverse logit transform is applied to the unconstrained variable and its log absolute derivative added to the log probability. The resulting distribution on the logit-transformed y is well behaved.↩︎
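
Concretely, for the declaration real<lower=-1,upper=1>, Stan samples an unconstrained \(z\) and sets \[y = -1 + 2\,\mathrm{logit}^{-1}(z),\] adding the log absolute derivative \(\log 2 + \log \mathrm{logit}^{-1}(z) + \log\left(1 - \mathrm{logit}^{-1}(z)\right)\) to the log density; this is the standard lower-upper bound transform described in the reference manual.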

  33. This example was raised by Richard McElreath on the Stan users group in a query about the difference in behavior between Gibbs sampling as used in BUGS and JAGS and the Hamiltonian Monte Carlo (HMC) and no-U-turn samplers (NUTS) used by Stan.↩︎

  34. The marginal posterior \(p(\sigma|y)\) for \(\sigma\) is proper here as long as there are at least two distinct data points.↩︎

  35. A Laplace prior (or an L1 regularizer for penalized maximum likelihood estimation) is not sufficient to remove this additive invariance. It provides shrinkage, but does not in and of itself identify the parameters because adding a constant to \(\lambda_1\) and subtracting it from \(\lambda_2\) results in the same value for the prior density.↩︎

  36. Tempering methods may be viewed as automated ways to carry out such a search for modes, though most MCMC tempering methods continue to search for modes on an ongoing basis; see (Swendsen and Wang 1986; Neal 1996b).↩︎

  37. This is in contrast to (penalized) maximum likelihood estimates, which are not parameterization invariant.↩︎

  38. This example is for illustrative purposes only; the recommended way to implement the lognormal distribution in Stan is with the built-in lognormal probability function; see the functions reference manual for details.↩︎
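
For illustration only, a manual implementation via the change of variables, assuming mu and sigma are declared elsewhere:

    parameters {
      real<lower=0> y;
    }
    model {
      // normal density on log(y) plus the log absolute Jacobian
      // of the log transform, which is -log(y)
      target += normal_lpdf(log(y) | mu, sigma) - log(y);
    }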

  39. This parameterization came to be known on our mailing lists as the “Matt trick” after Matt Hoffman, who independently came up with it while fitting hierarchical models in Stan.↩︎
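
A minimal sketch of the non-centered form for a hierarchical normal model (all names illustrative):

    parameters {
      real mu;               // population mean
      real<lower=0> tau;     // population scale
      vector[J] alpha_raw;   // standardized group-level effects
    }
    transformed parameters {
      vector[J] alpha = mu + tau * alpha_raw;  // implies alpha ~ normal(mu, tau)
    }
    model {
      alpha_raw ~ normal(0, 1);
    }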

  40. Riemannian Manifold Hamiltonian Monte Carlo (RMHMC) overcomes this difficulty by simulating the Hamiltonian dynamics in a space with a position-dependent metric; see Girolami and Calderhead (2011) and Betancourt (2012).↩︎

  41. Stan uses its own arena-based allocation, so allocation and deallocation are faster than with a raw call to new.↩︎

  42. Future versions of Stan may remove this inefficiency by more fully exploiting expression templates inside the Eigen C++ matrix library. This will require enhancing Eigen to deal with mixed-type arguments, such as the type double used for constants and the algorithmic differentiation type stan::math::var used for variables.↩︎

  43. The term “shard” is borrowed from databases, where it refers to a slice of the rows of a database. That is exactly what it is here if we think of rows of a dataframe. Stan’s shards are more general in that they need not correspond to rows of a dataframe.↩︎

  44. This example is a simplified form of the model described in (Gelman and Hill 2007, Section 14.2).↩︎

  45. This makes the strong assumption that each group has the same number of observations!↩︎

  46. Even 80 characters may be too many for rendering in print; for instance, in this manual, the number of code characters that fit on a line is about 65.↩︎

  47. This is the usual convention in both typesetting and other programming languages. Neither R nor BUGS allows breaks before an operator because they allow newlines to signal the end of an expression or statement.↩︎

  48. Except where otherwise noted, we use “BUGS” to refer to WinBUGS, OpenBUGS, and JAGS, indiscriminately.↩︎

  49. Most distributions have been vectorized, but currently the truncated versions may not exist and may not be vectorized.↩︎
