July 30, 2016

- You write the program in a (text) .stan file in the Stan language
- Stan's parser,
`stanc`

, does two things:- checks that program is syntactically valid and tells you if not
- writes a conceptually equivalent C++ source file to disk

- C++ compiler creates a binary file from the C++ source
- C++ is used due to operator overloading and templating
- C++ also facilitates autodifferentiation

- You execute the binary from R (can be concurrent with 2 – 4)
- You analyze the resulting draws from the posterior

- In Stan / C++, variables must first be declared with types
- In Stan / C++, statements are terminated with semi-colons
- Primitive scalar declarations:
`real x;`

or`int K;`

- Unknowns cannot be int. No derivatives and hence no HMC
- Can condition on integer data. No derivatives are needed

- Real declarations:
`vector[K] z;`

,`row_vector[K] zt;`

,`matrix[N,K] X;`

- Arrays are just holders of any other
*homogenous*objects, like an R list where are elements are restricted to be the same type and shape - Vectors and matrices cannot contain
*genuinely*integer data so use integer array declarations:`int y[N];`

or`int Y[N,P]`

;

**rstan**has a function called`lookup`

- Input the name of an R functionto find an analagous Stan function
- Input a regular expression to find all matching Stan functions

suppressPackageStartupMessages(library(rstan)) lookup("besselK")

## StanFunction Arguments ReturnType Page ## 290 modified_bessel_second_kind (int v, real z) real 342

`functions`

Block of a Stan Program- Stan permits users to define and use their own functions
- If used, must be defined in a leading
`functions`

block - Can only validate constraints inside user-defined functions
- Very useful for several reasons
- Easier to reuse across different .stan programs
- Makes subsequent chunks of code more readable
- Enables likelihoods with Ordinary Differential Equations
- Can be exported to R via
`expose_stan_functions`

- All functions, whether user-defined or build-in, must be called by argument position rather than by argument name, and there are no default arguments
- See cumprod.stan file

- Any primitive object can have lower and / or upper bounds if declared in the
`data`

,`transformed data`

,`parameters`

, or`transformed parameters`

blocks `int<lower=1> K; real<lower=-1,upper=1> rho;`

`vector<lower=0>[K] alpha;`

and similarly for a`matrix`

- Alternatively, a
`vector`

can be specialized as`unit_vector[K] x;`

implies \(\sum_{k=1}^K{x_k^2} = 1\)`simplex[K] x;`

implies \(x_k \geq 0 \forall k\) and \(\sum_{k=1}^K{x_k} = 1\)`ordered[K] x;`

implies \(x_i \leq x_j \forall i<j\)`positive_ordered[K] x;`

implies also \(0 \leq x_1\)

- Alternatively, a
`matrix`

can be specialized as`cov_matrix[K] Sigma`

or better`cholesky_factor_cov[K,K] L;`

`corr_matrix[K] Lambda`

or better`cholesky_factor_corr[K] L;`

`data`

Block of a Stan Program- Contains declarations for
*everything*being conditioned on in Bayes Rule - Each such object needs to be passed from R as a named list
- Can have comments in C++ style (
`// comment`

or`/* comment */`

) Whitespace is essentially irrelevant

data { int<lower=1> N; // number of observations int<lower=1> K; // number of predictors matrix[N, K] X; // design matrix vector[N] y; // outcomes real<lower=0> prior_mean; // hyperparameter }

`transformed data`

Block- Is executed only once before the iterations start
- Used to calculate needed deterministic functions of objects in the
`data`

block - Can use it to check that data was passed correctly from R
All declarations must come directly after the opening {

transformed data { vector[N] log_y; log_y = log(y); }

`parameters`

Block of a Stan Program- Declare everything whose posterior distribution is sought
- Cannot declare
`int`

parameters - Cannot do assignments within the
`parameters`

block *Must*specify the sample space of the parameters but lower and upper bounds are implicitly \(\pm\infty\) if unspecifiedparameters { vector[K] beta; real<lower=0> sigma_unscaled; // Jacobian handled automatically here }

`transformed parameters`

Block- Like
`transformed data`

but involves objects declared in the`parameters`

block and is evaluated each leapfrog step Constraints are validated and draws are stored

transformed parameters { real<lower=0> sigma; sigma = sigma_unscaled * prior_mean; }

`model`

Block of a Stan Program- Builds up a evaluation of the log-kernel function with the
`target`

keyword Can declare local objects at the top of the

`model`

block and then assign to them but draws are not storedmodel { vector[N] eta; eta = X * beta; target += normal_lpdf(log_y | eta, sigma); // likelihood of log(y) target += normal_lpdf(beta | 0, 5); // prior for each beta_k target += exponential_lpdf(sigma_unscaled | 1); // prior for sigma_unscaled }

Can increment

`target`

with user-defined functions or arbitrary expressions

`generated quantities`

Block- Only evaluated once per iteration
- Useful to declare and define objects of interest that do not go into the likelihood function
- Can reference any object declared in
`data`

,`transformed data`

,`parameters`

, or`transformed parameters`

blocks Can use pseduo-random number generation

generated quantities { vector[N] y_rep; // posterior beliefs about each y[n] for (n in 1:N) y_rep[n] = normal_rng(X[n,] * beta, sigma); }

state.x77 <- within(as.data.frame(state.x77), { # choose reasonable units Density <- Population / Area Income <- Income / 1000 Frost <- Frost / 100 }) X <- model.matrix(Murder ~ Density + Income + Illiteracy + Frost, data = state.x77) y <- state.x77$Murder data_block <- list(N = nrow(X), K = ncol(X), X = X, y = y, prior_mean = 5) options(mc.cores = parallel::detectCores()) post <- stan("regression.stan", data = data_block)

print(post, pars = 'y_rep', include = FALSE, digits = 2, probs = c(.25, .75))

## Inference for Stan model: regression. ## 4 chains, each with iter=2000; warmup=1000; thin=1; ## post-warmup draws per chain=1000, total post-warmup draws=4000. ## ## mean se_mean sd 25% 75% n_eff Rhat ## beta[1] 0.75 0.02 0.74 0.26 1.22 1476 1 ## beta[2] -0.60 0.01 0.33 -0.83 -0.39 2129 1 ## beta[3] 0.17 0.00 0.13 0.08 0.26 1666 1 ## beta[4] 0.57 0.00 0.16 0.46 0.68 1728 1 ## beta[5] -0.23 0.00 0.18 -0.35 -0.11 2090 1 ## sigma_unscaled 0.09 0.00 0.01 0.09 0.10 2368 1 ## sigma 0.47 0.00 0.05 0.44 0.51 2368 1 ## lp__ -47.99 0.05 1.82 -48.95 -46.64 1315 1 ## ## Samples were drawn using NUTS(diag_e) at Sat Jul 30 13:28:10 2016. ## For each parameter, n_eff is a crude measure of effective sample size, ## and Rhat is the potential scale reduction factor on split chains (at ## convergence, Rhat=1).

CA_rep <- extract(post, pars = 'y_rep')[[1]][,which(rownames(state.x77) == "California")] hist(state.x77["California", "Murder"] - exp(CA_rep), prob = TRUE, main = "Errors for California", las = 1)

- Hierchical models are essentially models with interaction terms between predictors and group-indicators with the additional provisions that:
- Group deviations are from a common mean rather than a baseline
- Include distributional assumptions for how the groups deviate

- Suppose there are \(J\) groups and \(N_j\) observations in the \(j\)-th group and \[y_{ij} \thicksim \mathcal{N}\left(\mu_{ij},\sigma\right) \forall i,j\] \[\mu_{ij} = \left(\alpha + a_j\right) + \left(\beta + b_j\right)x_{ij} = \alpha + \beta x_{ij} + a_j + b_j x_{ij} \forall i,j\] \[\begin{bmatrix}a_j\\ b_j \end{bmatrix} \thicksim \mathcal{N}_{2}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_a^2 & \rho \sigma_a \sigma_b \\ \rho \sigma_a \sigma_b & \sigma_b^2 \end{bmatrix}\right) \forall j\]
- Bayesians put priors on the
*common*parameters \(\sigma\), \(\alpha\), \(\beta\), \(\rho\), \(\sigma_a\), and \(\sigma_b\)

- For frequentists, \(a_j + b_j x_{ij}\) is part of the error term; thus
- Observations within group \(j\) aren't conditionally independent given \(\alpha\) & \(\beta\)
- Frequentists are willing to make distributional assumptions about \(a_j\) and \(b_j\) (invariably bivariate normal). Let \(\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_a^2 & \rho \sigma_a \sigma_b \\ \rho \sigma_a \sigma_b & \sigma_b^2 \end{bmatrix}\) and

- This particular integral happens to have a closed-form solution and one can choose \(\sigma\), \(\alpha\), \(\beta\), \(\rho\), \(\sigma_a\), and \(\sigma_b\) to maximize \(\sum_{j=1}^J \sum_{i=1}^{N_j} \ln z_{ij}\)
- But maximum likelihood is not a great estimator here so people penalize

- For frequentists, \(a_j\) and \(b_j\) are not parameters and thus cannot be
*estimated* - \(a_j\) and \(b_j\) can be
*predicted*from group \(j\)'s residuals implied by \(\widehat{\alpha}\) and \(\widehat{\beta}\) - Since \(a_j\) and \(b_j\) are not
*estimated*, you cannot construct standard errors - Thus, you cannot make frequentist inferences about \(a_j\) and / or \(b_j\)
- You can conceptualize standard errors for the estimator of the common parameters \(\sigma\), \(\alpha\), \(\beta\), \(\rho\), \(\sigma_a\), and \(\sigma_b\) but they are hard to calculate unless you treat the predictions of \(a_j\) and \(b_j\) as given
- To obtain a closed-form likelihood function to maximize, you have to assume normality both for the outcome (conditional on \(x_{ij}\)) and for \(a_j, b_j\). Otherwise, you get all of the computational difficulty with intractable integrals that Bayesians avoid with MCMC and none of the benefit in interpretation.
- The optimization process is not nearly as routine as it is for flat GLMs, and you often get corner solutions where \(\widehat{\boldsymbol{\Sigma}}\) is not positive definite

`stan_glmer`

Function- The
`[g]lmer`

functions in the**lme4**R package are very popular because people want to quickly estimate hierarchical models with a convenient syntax and interpret the results as if they were Bayesian - But you can slowly estimate hierarchical models using the same convenient syntax by using the
`stan_glmer`

function in the**rstanarm**R package and interpret the results in a genuinely Bayesian fashion - Eric will show an example of this

- You can write almost any posterior log-kernel (whose parameters are continuous) in the Stan language
- Whether Stan samples efficiently depends on how numerically stable your Stan program is
- But Stan will have no problem with a multivariate normal
- Stan has no problem with a lot of non-normal posteriors
- If Stan has problems, you can usually work around them
- If you cannot get around those problems, other software will usually suffer from them too without being as easy to diagnose

- Easy to include particular Stan program(s) in a package like
**rstanarm**does; see the`rstan.package.skeleton`

function in the**rstan**package