July 11, 2016

## Installation

• This is available from http://mc-stan.org/workshops/IMPS2016/
• If you have not installed the rstan R package yet, please follow the steps at the wiki because you need Xcode on a Mac or RTools on Windows
• Try to avoid installing packages in a directory with a space but can fix on Windows with
install.packages("http://win-builder.r-project.org/o5akk04AfsIm/rstan_2.10.1-1.zip",
repos = NULL, dependencies = TRUE)
• Please verify that a Stan program compiles and runs with
example("stan_model", package = "rstan", run.dontrun = TRUE)
• Also, verify that packageVersion("rstanarm") is 2.10.1; otherwise execute
install.packages("rstanarm", repos = "https://cloud.r-project.org", dependencies = TRUE)

## Outline

1. Why Bayes?
2. Break from 9:30 to 9:45
3. Bayesian process
4. Stan language
5. Lunch break from 12:00 to 1:30
6. Hierarchical models
7. Applications of Stan to IRT models
8. Break from 2:30 to 2:45
9. More applications of Stan to IRT models

## Obligatory Disclosure

• Ben is an employee of Columbia University, which has received several research grants to develop Stan
• Ben is also a cofounder of Stan Group (http://stan.fit), which provides support, consulting, etc. for businesses using Stan
• According to Columbia University policy, any such employee who has any equity stake in, a title (such as officer or director) with, or is expected to earn at least $$\5,000.00$$ per year from a private company is required to disclose these facts in presentations
• Daniel is neither an employee of Columbia University nor affiliated with Stan Group and thus is not implicated by the above

## What Is the Probability that Johnny Manziel Has Narcissistic Personality Disorder?

• For those of you who may not know, Johnny Manziel is a quarterback
• For the past couple of years, he has been more known for off-the-field problems than for playing quarterback
• Based on what we observe, pyschologists and non-psychologists have speculated about his mental state
• Where did these beliefs about the probability of a diagnosis come from?

## Different Perspectives on Probability

What is the paradigm? What is fixed? What is random? What proportion is important? What is the conclusion?
Randomization $${y_1, y_2, \dots, y_N}$$ Treatment assignment $$p$$-value for null: ATE $$= 0$$? ATE $$\neq 0$$
Frequentist $$Y$$, $$\boldsymbol{\theta}$$, $$N$$ Sample inclusion $$\theta \in$$ confidence intervals (plural) Something basically Bayesian
Supervised learning $${y_1, y_2, \dots, y_N}$$ Training / testing inclusion Correctly classified outcomes in testing data Some procedure predicts best
Bayesian $${y_1, y_2, \dots, y_N}$$, $$\boldsymbol{\theta}$$ Beliefs about $$\boldsymbol{\theta}$$ Posterior draws of $$\theta \in \left(a,b\right)$$ Decision or action

## Two Justifications for Bayes Rule

1. $$f\left(\mathbf{y}\right) \times f\left(\boldsymbol{\theta} | \mathbf{y}\right) = f\left(\boldsymbol{\theta}, \mathbf{y}\right) = f\left(\boldsymbol{\theta}\right) \times f\left(\mathbf{y} | \boldsymbol{\theta}\right) \implies f\left(\boldsymbol{\theta} | \mathbf{y}\right) = \frac{f\left(\boldsymbol{\theta}\right) \times f\left(\mathbf{y} | \boldsymbol{\theta}\right)}{f\left(\mathbf{y}\right)}$$ where $$\mathbf{y} = \{y_1, y_2 \dots y_N\}$$ and $$f\left(\cdot\right)$$ is a PDF so $$f\left(\cdot\right) \geq 0$$ & $$\int f\left(u\right)du = 1$$
• $$f\left(\boldsymbol{\theta}\right)$$ represents what someone believes about $$\boldsymbol{\theta}$$ prior to observing $$\mathbf{y}$$
• $$f\left(\boldsymbol{\theta} | \mathbf{y}\right)$$ represents what someone believes about $$\boldsymbol{\theta}$$ after observing $$\mathbf{y}$$
• $$f\left(\mathbf{y} | \boldsymbol{\theta}\right)$$ is the likelihood function, a function of $$\boldsymbol{\theta}$$ for an observed $$\mathbf{y}$$
• $$f\left(\mathbf{y}\right) = \int \cdots \int \int f\left(\boldsymbol{\theta}\right) f\left(\mathbf{y} | \boldsymbol{\theta}\right) d\theta_1 d\theta_2 \dots d\theta_K = \mathbb{E}_{\boldsymbol{\theta}}f\left(\mathbf{y} | \boldsymbol{\theta}\right)$$
2. $$f\left(\boldsymbol{\theta} | \mathbf{y}\right)$$ is the unique function that minimizes the sum of
• Penalty: Kullback-Leibler divergence to $$f\left(\boldsymbol{\theta}\right)$$
• Expected misfit: $$\mathbb{E}_{\boldsymbol{\theta}}\left[-\ln f\left(\mathbf{y} | \boldsymbol{\theta}\right)\right]$$

## Markov Chain Monte Carlo

• Even if $$f\left(\mathbf{y}\right)$$ could be calculated, you would have to do another K-dimensional integral to obtain something like $$\mathbb{E}\left[\theta_k | \mathbf{y}\right]$$
• So we draw randomly $$S$$ times from the posterior distribution — which does not require knowing $$f\left(\mathbf{y}\right)$$ — and estimate $$\mathbb{E}\left[\theta_k | \mathbf{y}\right]$$ with $$\frac{1}{S}\sum_{s=1}^S{\tilde{\theta}_k^{[s]}}$$
• There is no way to draw independently from most posterior distributions
• The price to be paid for relying on Markov Chain Monte Carlo (MCMC) to draw from a posterior distribution is that the draws are not independent
• The degree of dependence in a MCMC algorithm governs how badly $$\frac{1}{S}\sum_{s=1}^S{g\left(\widetilde{\boldsymbol{\theta}}^{[s]}\right)}$$ estimates $$\mathbb{E}g\left(\boldsymbol{\theta}\right) | \mathbf{y}$$ for finite $$S$$
• Effective Sample Size is a concept like that in complex survey design and is defined as the number of independent draws that would estimate a posterior mean with the same precision as the $$S$$ dependent draws you do have

## A Markov Process with Severe Dependence

par(mar = c(4,4,1,1) + .1, las = 1, bg = "lightgrey")
x <- sapply(1:6, FUN = function(i) arima.sim(model = list(ar = 0.9999999), n = 10^6))
matplot(x, type = "l", col = 1:6, lty = 1)
for (j in 1:ncol(x)) abline(h = mean(x[,j]), col = j, lty = 2)

## Bayesianism and the Crisis in Psychology

• Bayesianism cannot solve some aspects of the Crisis in Psychology, such as selective reporting of results, but it could help in some ways:
• Harder to find borderline results if skeptical priors are used originally
• Can talk about the probability that a research hypothesis is true
• Easier to replicate if informative priors are used in the replication
• Andrews and Baguley 2016: What is the probability that a null hypothesis is true given that a $$p$$-value is less than $$\alpha$$? $\Pr\left(H_0 | p \leq \alpha\right) = \frac{\alpha}{\alpha + \omega + \frac{\lambda}{1-\lambda}}$ where $$\omega$$ is the power of the test and $$\lambda$$ is the prior probability that the alternative hypothesis is true
• You have to introduce prior beliefs in order to make sense out of the Crisis

## Why Doesn't Everyone Use Bayesian Methods?

• There are very few useful analytical results
• Can't let 1 programmer write generic code that all paying researchers use
• Posterior distribution depends not just on the researcher's data but on the prior beliefs of the researcher, which must be encoded somehow
• To express your prior beliefs using probability distributions, you need to know the functional characteristics of lots of probability distributions
• Drawing from an entire probability distribution is a much more ambitious task than finding a optimal point and takes a lot longer
• Many researchers were frustrated by the BUGS family of software
• Harder to publish a Bayesian analysis in an applied journal

## What is Stan and How Does It Help?

• Includes a probabalistic programming language
• The rstanarm, brms, and rethinking R packages provide code to specify some statistical models — with a limited choice of prior distributions — that can be mapped into the Stan language
• Includes new Hamiltonian Monte Carlo (HMC) algorithms
• HMC is to MCMC as BFGS is to optimization
• HMC is aided by the gradient of the posterior distribution wrt $$\boldsymbol{\theta}$$
• Dependence between consecutive draws is minimal
• Includes a matrix and scalar math library that supports autodifferentiation
• Includes interfaces from R and other high-level software
• Includes (not Stan specific) post-estimation R functions of MCMC output
• Includes a large community of users and many developers

## Overview of Hamiltonian Monte Carlo

• Since the early 1990s, most MCMC uses Gibbs updates when feasible and falls back to something more general otherwise
• Gibbs entails drawing $$\theta_k$$ from its "full-conditional distribution": $$\theta_k | \boldsymbol{\theta}_{-k}, \mathbf{y}$$
• "Something more general" includes slice sampling, Metropolis-Hastings, etc., which is needed when the full-conditional distribution of $$\theta_k$$ is not known in closed form
• If Gibbs updates are feasible, they are easy to code and fast to execute but are statistically inefficient because the dependence between draws is high
• HMC differs from Gibbs in that all elements of $$\boldsymbol{\theta}$$ are updated simultaneously
• H stands for Hamiltonian, which is a physics framework for how a particle $$\left(\boldsymbol{\theta}\right)$$ moves through an unbounded frictionless space

## Example of Drawing from a Multivariate Normal

• $$\mathbf{y} \thicksim \mathcal{N}_{250}\left(\mathbf{0}, \boldsymbol{\Sigma}\right)$$ where $$\boldsymbol{\Sigma}$$ is ill-conditioned but focus on just two dimensions
• Do 1 million draws w/ Random Walk Metropolis & Gibbs, thinning by $$1000$$
• Do 1000 draws with the NUTS algorithm in Stan and 1000 independent draws

Comparison of MCMC Samplers

## Details of Hamiltonian Monte Carlo

• HMC augments the parameter space with a momentum vector $$\left(\boldsymbol{\phi}\right)$$ of size $$K$$
• $$\boldsymbol{\phi}$$ does not enter the likelihood for $$\mathbf{y}$$, so its marginal posterior distribution is the same as its prior distribution, which is multivariate normal with mean vector zero and a covariance matrix that is tuned during the warmup phase
• Given a draw of $$\boldsymbol{\phi}$$ from this multivariate normal distribution, the Hamiltonian equations tell us where $$\boldsymbol{\theta}$$ would move to in $$t$$ periods, depending on the posterior kernel in log-units $$\ln f\left(\boldsymbol{\theta}\right) + \ln f\left(\mathbf{y} | \boldsymbol{\theta}\right)$$
• We approximate the solution to the Hamiltonian equations numerically assuming discrete time
• Draw from the footprints of the discrete Hamiltonian path with a categorical distribution whose probabilities are proportional to the posterior kernel
• Stepsize and momentum are automatically tuned but can be adjusted by you
• Essentially, the only thing that can go wrong is numerical instability

## A Model for SAT Quantiatative Scores

data("sat.act", package = "psych") # requires psych R package
sat.act <- within(sat.act, { # choose reasonable codings
gender <- factor(gender, labels = c("male", "female"))
# education <- as.factor(education)
})
library(rstanarm)
options(mc.cores = parallel::detectCores())
post <- stan_lm(SATQ ~ gender + education + age,
data = sat.act, prior = R2(stop("put a number here")))
print(post, digits = 2)

## Results

## stan_lm(formula = SATQ ~ gender + education + age, data = sat.act,
##     prior = R2(0.25))
##
## Estimates:
## (Intercept)   640.38  13.97
## genderfemale  -42.04   8.96
## education       8.34   3.61
## age            -1.16   0.54
## sigma         113.63   3.12
## log-fit_ratio   0.00   0.03
## R2              0.04   0.01
##
## Sample avg. posterior predictive
## distribution of y (X = xbar):
## mean_PPD 610.32   5.87

## Diagnostics

• The defaults for Stan are 4 chains, each with 2000 iterations, of which 1000 are discarded as warmup
• For functions in the rstanarm package, that is almost certainly plenty for convergence and a large enough effective sample size
• That may not be the case for other Stan programs
• Any Stan program may yield a warning about "divergent transitions"
• Divergent transitions can often be avoided by increasing the adapt_delta tuning parameter
• But increasing the adapt_delta tuning parameter causes Stan to take smaller steps, in which case you may get a warning about hitting the maximum treedepth
• The max_tredepth tuning parameter can also be increased

mean(as.data.frame(post)$education > 0) # Pr(beta_{education} > 0) ## [1] 0.9885 round(posterior_interval(post, prob = 0.5), digits = 3) # endpoints of the IQRs ## 25% 75% ## (Intercept) 630.909 649.767 ## genderfemale -48.152 -36.066 ## education 5.947 10.839 ## age -1.527 -0.803 ## sigma 111.568 115.762 ## log-fit_ratio -0.016 0.021 ## R2 0.029 0.047 launch_shinystan(post) ## Model Comparison • The most important insight of supervised learning is that you will choose a model that overfits if you evaluate the models on the same data that you estimate the models with • Thus, supervised learning people partition "the" data (often randomly) into a training dataset (that is used to "train a model") and a testing dataset (that is used to evaluate how well models predict) • Nothing prevents you from doing that in a Bayesian context but holding out data makes your posterior distribution more diffuse • Bayesians usually condition on all the data and evaluate how well a model is expected to predict out of sample using "information criteria", which are all intended to select the model with the highest expected log predictive density (ELPD) for new data • This is easy to do with rstanarm using the loo and compare functions under the verifiable assumption that each observation could be omitted without having a drastic effect on the posterior distribution ## Using the loo Function (loo_1 <- loo(post)) ## Computed from 4000 by 687 log-likelihood matrix ## ## Estimate SE ## elpd_loo -4230.2 18.8 ## p_loo 4.9 0.4 ## looic 8460.4 37.5 ## ## All Pareto k estimates OK (k < 0.5) post2 <- stan_glm(SATQ ~ gender * education + age, data = sat.act, family = gaussian, prior = normal(0, 5), prior_intercept = student_t(df = 2)) compare(loo_1, loo(post2)) ## elpd_diff se ## -1.0 0.6 ## Stan Language ## Workflow for Stan via the rstan R Pakcage 1. You write the program in a (text) .stan file in the Stan language 2. Stan's parser, stanc, does two things: • checks that program is syntactically valid and tells you if not • writes a conceptually equivalent C++ source file to disk 3. C++ compiler creates a binary file from the C++ source 4. C++ is used due to operator overloading and templating • C++ also facilitates autodifferentiation 5. You execute the binary from R (can be concurrent with 2 – 4) 6. You analyze the resulting draws from the posterior ## Primitive Object Types in Stan • In Stan / C++, variables must first be declared with types • In Stan / C++, statements are terminated with semi-colons • Primitive scalar declarations: real x; or int K; • Unknowns cannot be int. No derivatives and hence no HMC • Can condition on integer data. No derivatives are needed • Real declarations: vector[K] z;, row_vector[K] zt;, matrix[N,K] X; • Arrays are just holders of any other homogenous objects, like an R list where are elements are restricted to be the same type and shape • Vectors and matrices cannot contain genuinely integer data so use integer array declarations: int y[N]; or int Y[N,P]; ## Builtin Functions in Stan • rstan has a function called lookup • Input the name of an R functionto find an analagous Stan function • Input a regular expression to find all matching Stan functions suppressPackageStartupMessages(library(rstan)) lookup("besselK") ## StanFunction Arguments ReturnType Page ## 290 modified_bessel_second_kind (int v, real z) real 342 ## Optional functions Block of a Stan Program • Stan permits users to define and use their own functions • If used, must be defined in a leading functions block • Can only validate constraints inside user-defined functions • Very useful for several reasons 1. Easier to reuse across different .stan programs 2. Makes subsequent chunks of code more readable 3. Enables likelihoods with Ordinary Differential Equations 4. Can be exported to R via expose_stan_functions • All functions, whether user-defined or build-in, must be called by argument position rather than by argument name, and there are no default arguments • See cumprod.stan file ## Constrained Object Declarations in Stan • Any primitive object can have lower and / or upper bounds if declared in the data, transformed data, parameters, or transformed parameters blocks • int<lower=1> K; real<lower=-1,upper=1> rho; • vector<lower=0>[K] alpha; and similarly for a matrix • Alternatively, a vector can be specialized as 1. unit_vector[K] x; implies $$\sum_{k=1}^K{x_k^2} = 1$$ 2. simplex[K] x; implies $$x_k \geq 0 \forall k$$ and $$\sum_{k=1}^K{x_k} = 1$$ 3. ordered[K] x; implies $$x_i \leq x_j \forall i<j$$ 4. positive_ordered[K] x; implies also $$0 \leq x_1$$ • Alternatively, a matrix can be specialized as 1. cov_matrix[K] Sigma or better cholesky_factor_cov[K,K] L; 2. corr_matrix[K] Lambda or better cholesky_factor_corr[K] L; ## Required data Block of a Stan Program • Contains declarations for everything being conditioned on in Bayes Rule • Each such object needs to be passed from R as a named list • Can have comments in C++ style (// comment or /* comment */) • Whitespace is essentially irrelevant data { int<lower=1> N; // number of observations int<lower=1> K; // number of predictors matrix[N, K] X; // design matrix vector[N] y; // outcomes real<lower=0> prior_scale; // hyperparameter } ## Optional transformed data Block • Is executed only once before the iterations start • Used to calculate needed deterministic functions of objects in the data block • Can use it to check that data was passed correctly from R • All declarations must come directly after the opening { transformed data { vector[N] log_y; log_y = log(y); } ## Required parameters Block of a Stan Program • Declare everything whose posterior distribution is sought • Cannot declare int parameters • Cannot do assignments within the parameters block • Must specify the sample space of the parameters but lower and upper bounds are implicitly $$\pm\infty$$ if unspecified parameters { vector[K] beta; real<lower=0> sigma_unscaled; // Jacobian handled automatically here } ## Optional transformed parameters Block • Like transformed data but involves objects declared in the parameters block and is evaluated each leapfrog step • Constraints are validated and draws are stored transformed parameters { real<lower=0> sigma; sigma = sigma_unscaled * prior_scale; } ## Required model Block of a Stan Program • Builds up a evaluation of the log-kernel function with the target keyword • Can declare local objects at the top of the model block and then assign to them but draws are not stored model { vector[N] eta; eta = X * beta; target += normal_lpdf(log_y | eta, sigma); // likelihood of log(y) target += normal_lpdf(beta | 0, 5); // prior for each beta_k target += exponential_lpdf(sigma_unscaled | 1); // prior for sigma_unscaled } • Can increment target with user-defined functions or arbitrary expressions ## Optional generated quantities Block • Only evaluated once per iteration • Useful to declare and define objects of interest that do not go into the likelihood function • Can reference any object declared in data, transformed data, parameters, or transformed parameters blocks • Can use pseduo-random number generation generated quantities { vector[N] y_rep; // posterior beliefs about each y[n] for (n in 1:N) y_rep[n] = normal_rng(X[n,] * beta, sigma); } ## Calling a Stan Program X <- model.matrix(SATQ ~ gender + education + age, data = sat.act) y <- sat.act$SATQ; y <- y[!is.na(y)]
data_block <- list(N = nrow(X), K = ncol(X), X = X, y = y, prior_scale = 5)
options(mc.cores = parallel::detectCores())
post <- stan("regression.stan", data = data_block)

## Results

print(post, pars = 'y_rep', include = FALSE, digits = 2, probs = c(.25, .75))
## Inference for Stan model: regression.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
##                 mean se_mean   sd   25%   75% n_eff Rhat
## beta[1]         6.44    0.00 0.03  6.42  6.46  2201    1
## beta[2]        -0.07    0.00 0.02 -0.08 -0.06  2500    1
## beta[3]         0.02    0.00 0.01  0.01  0.02  2604    1
## beta[4]         0.00    0.00 0.00  0.00  0.00  3428    1
## sigma_unscaled  0.04    0.00 0.00  0.04  0.04  2142    1
## sigma           0.21    0.00 0.01  0.21  0.21  2142    1
## lp__           84.63    0.04 1.52 83.87 85.75  1686    1
##
## Samples were drawn using NUTS(diag_e) at Mon Jul 11 22:34:07 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
##  The estimated Bayesian Fraction of Missing Information is a measure of
##  the efficiency of the sampler with values close to 1 being ideal.
##  For each chain, these estimates are
##  1.1 1 1 1.1

## Summary

• Using the model-fitting functions in the rstanarm package is easy
• stan_lm, stan_aov, stan_glm, stan_glm.nb, and stan_polr all have the same syntax and same likelihood as their frequentist counterparts
• Developers can add models to rstanarm or copy the build process of rstanarm into their own R packages to use Stan to estimate particular models
• The brms (on CRAN) and rethinking (on GitHub) packages are a bit different than rstanarm but permit estimation of an overlapping set of models w/ Stan
• Using Stan for Bayesian inference is sufficiently easy for most basic and some not-so-basic models that there should rarely be a reason to use frequentist tools to make Bayesian inferences
• But to take advantage of the full generality of Stan, you eventually have to start writing your own models in the Stan language

## Bayesian Perspective on Hierarchical Models

• Hierchical models are essentially models with interaction terms between predictors and group-indicators with the additional provisions that:
• Group deviations are from a common mean rather than a baseline
• Include distributional assumptions for how the groups deviate
• Suppose there are $$J$$ groups and $$N_j$$ observations in the $$j$$-th group and $y_{ij} \thicksim \mathcal{N}\left(\mu_{ij},\sigma\right) \forall i,j$ $\mu_{ij} = \left(\alpha + a_j\right) + \left(\beta + b_j\right)x_{ij} = \alpha + \beta x_{ij} + a_j + b_j x_{ij} \forall i,j$ $\begin{bmatrix}a_j\\ b_j \end{bmatrix} \thicksim \mathcal{N}_{2}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_a^2 & \rho \sigma_a \sigma_b \\ \rho \sigma_a \sigma_b & \sigma_b^2 \end{bmatrix}\right) \forall j$
• Bayesians put priors on the common parameters $$\sigma$$, $$\alpha$$, $$\beta$$, $$\rho$$, $$\sigma_a$$, and $$\sigma_b$$

## Frequentist Perspective on Hierarchical Models

• For frequentists, $$a_j + b_j x_{ij}$$ is part of the error term; thus
• Observations within group $$j$$ aren't conditionally independent given $$\alpha$$ & $$\beta$$
• Frequentists are willing to make distributional assumptions about $$a_j$$ and $$b_j$$ (invariably bivariate normal). Let $$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_a^2 & \rho \sigma_a \sigma_b \\ \rho \sigma_a \sigma_b & \sigma_b^2 \end{bmatrix}$$ and
$z_{ij} = \int\limits _{-\infty}^{\infty} \int\limits _{-\infty}^{\infty} { \frac{1}{\sigma \sqrt{2 \pi}}{ e^{-0.5 \left(\frac{y_{ij} - \alpha - \beta x_{ij} - a_j - b_j x_{ij}}{\sigma} \right)^2}} \frac{1}{\left|\boldsymbol{\Sigma}\right|}e^{-0.5 \begin{bmatrix} a_j \\ b_j \end{bmatrix}^\top \boldsymbol{\Sigma}^{-1} \begin{bmatrix} a_j \\ b_j \end{bmatrix}} \ da_j db_j}$
• This particular integral happens to have a closed-form solution and one can choose $$\sigma$$, $$\alpha$$, $$\beta$$, $$\rho$$, $$\sigma_a$$, and $$\sigma_b$$ to maximize $$\sum_{j=1}^J \sum_{i=1}^{N_j} \ln z_{ij}$$
• But maximum likelihood is not a great estimator here so people penalize

## Limitations of Frequentist Perspective

• For frequentists, $$a_j$$ and $$b_j$$ are not parameters and thus cannot be estimated
• $$a_j$$ and $$b_j$$ can be predicted from group $$j$$'s residuals implied by $$\widehat{\alpha}$$ and $$\widehat{\beta}$$
• Since $$a_j$$ and $$b_j$$ are not estimated, you cannot construct standard errors
• Thus, you cannot make frequentist inferences about $$a_j$$ and / or $$b_j$$
• You can conceptualize standard errors for the estimator of the common parameters $$\sigma$$, $$\alpha$$, $$\beta$$, $$\rho$$, $$\sigma_a$$, and $$\sigma_b$$ but they are hard to calculate unless you treat the predictions of $$a_j$$ and $$b_j$$ as given
• To obtain a closed-form likelihood function to maximize, you have to assume normality both for the outcome (conditional on $$x_{ij}$$) and for $$a_j, b_j$$. Otherwise, you get all of the computational difficulty with intractable integrals that Bayesians avoid with MCMC and none of the benefit in interpretation.
• The optimization process is not nearly as routine as it is for flat GLMs, and you often get corner solutions where $$\widehat{\boldsymbol{\Sigma}}$$ is not positive definite

## The stan_glmer Function

• The [g]lmer functions in the lme4 R package are very popular because people want to quickly estimate hierarchical models with a convenient syntax and interpret the results as if they were Bayesian
• But you can slowly estimate hierarchical models using the same convenient syntax by using the stan_glmer function in the rstanarm R package and interpret the results in a genuinely Bayesian fashion
post <- stan_glmer(Days ~ (1 | Age : Sex : Eth : Lrn), data = MASS::quine,
family = "neg_binomial_2")

## Results

summary(post, probs = c(.25, .75))
## stan_glmer(formula = Days ~ (1 | Age:Sex:Eth:Lrn), data = MASS::quine,
##     family = "neg_binomial_2")
##
## Family: neg_binomial_2 (log)
## Algorithm: sampling
## Posterior sample size: 4000
## Observations: 146
## Groups: Age:Sex:Eth:Lrn 28
##
## Estimates:
##                                            mean   sd     25%    75%
## (Intercept)                                 2.7    0.1    2.6    2.7
## b[(Intercept) Age:Sex:Eth:Lrn:F0:F:A:AL]    0.3    0.4    0.0    0.5
## b[(Intercept) Age:Sex:Eth:Lrn:F0:F:A:SL]   -0.3    0.5   -0.6    0.1
## b[(Intercept) Age:Sex:Eth:Lrn:F0:F:N:AL]    0.2    0.3    0.0    0.4
## b[(Intercept) Age:Sex:Eth:Lrn:F0:F:N:SL]    0.2    0.5   -0.1    0.5
## b[(Intercept) Age:Sex:Eth:Lrn:F0:M:A:AL]    0.0    0.3   -0.2    0.2
## b[(Intercept) Age:Sex:Eth:Lrn:F0:M:A:SL]   -0.2    0.4   -0.5    0.1
## b[(Intercept) Age:Sex:Eth:Lrn:F0:M:N:AL]   -0.6    0.4   -0.9   -0.4
## b[(Intercept) Age:Sex:Eth:Lrn:F0:M:N:SL]    0.5    0.4    0.2    0.7
## b[(Intercept) Age:Sex:Eth:Lrn:F1:F:A:AL]   -0.1    0.3   -0.3    0.1
## b[(Intercept) Age:Sex:Eth:Lrn:F1:F:A:SL]    0.4    0.3    0.2    0.6
## b[(Intercept) Age:Sex:Eth:Lrn:F1:F:N:AL]   -0.2    0.3   -0.4    0.1
## b[(Intercept) Age:Sex:Eth:Lrn:F1:F:N:SL]   -0.6    0.3   -0.8   -0.5
## b[(Intercept) Age:Sex:Eth:Lrn:F1:M:A:AL]   -0.1    0.4   -0.4    0.2
## b[(Intercept) Age:Sex:Eth:Lrn:F1:M:A:SL]   -0.2    0.4   -0.5    0.1
## b[(Intercept) Age:Sex:Eth:Lrn:F1:M:N:AL]   -0.5    0.5   -0.8   -0.1
## b[(Intercept) Age:Sex:Eth:Lrn:F1:M:N:SL]   -0.5    0.3   -0.8   -0.3
## b[(Intercept) Age:Sex:Eth:Lrn:F2:F:A:AL]   -0.3    0.5   -0.7    0.1
## b[(Intercept) Age:Sex:Eth:Lrn:F2:F:A:SL]    0.8    0.3    0.6    0.9
## b[(Intercept) Age:Sex:Eth:Lrn:F2:F:N:AL]   -0.4    0.6   -0.7    0.0
## b[(Intercept) Age:Sex:Eth:Lrn:F2:F:N:SL]   -0.6    0.3   -0.8   -0.4
## b[(Intercept) Age:Sex:Eth:Lrn:F2:M:A:AL]    0.5    0.3    0.3    0.7
## b[(Intercept) Age:Sex:Eth:Lrn:F2:M:A:SL]    0.7    0.3    0.4    0.9
## b[(Intercept) Age:Sex:Eth:Lrn:F2:M:N:AL]   -0.3    0.3   -0.5   -0.1
## b[(Intercept) Age:Sex:Eth:Lrn:F2:M:N:SL]    0.5    0.4    0.2    0.7
## b[(Intercept) Age:Sex:Eth:Lrn:F3:F:A:AL]    0.0    0.3   -0.1    0.2
## b[(Intercept) Age:Sex:Eth:Lrn:F3:F:N:AL]    0.0    0.3   -0.2    0.2
## b[(Intercept) Age:Sex:Eth:Lrn:F3:M:A:AL]    0.5    0.3    0.3    0.7
## b[(Intercept) Age:Sex:Eth:Lrn:F3:M:N:AL]    0.5    0.3    0.3    0.7
## overdispersion                              1.5    0.2    1.3    1.6
## mean_PPD                                   16.6    1.9   15.2   17.9
## log-posterior                            -585.9    5.8 -589.5 -581.8
##
## Diagnostics:
##                                          mcse Rhat n_eff
## (Intercept)                              0.0  1.0  1693
## b[(Intercept) Age:Sex:Eth:Lrn:F0:F:A:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F0:F:A:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F0:F:N:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F0:F:N:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F0:M:A:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F0:M:A:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F0:M:N:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F0:M:N:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F1:F:A:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F1:F:A:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F1:F:N:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F1:F:N:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F1:M:A:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F1:M:A:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F1:M:N:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F1:M:N:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F2:F:A:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F2:F:A:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F2:F:N:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F2:F:N:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F2:M:A:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F2:M:A:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F2:M:N:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F2:M:N:SL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F3:F:A:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F3:F:N:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F3:M:A:AL] 0.0  1.0  4000
## b[(Intercept) Age:Sex:Eth:Lrn:F3:M:N:AL] 0.0  1.0  4000
## overdispersion                           0.0  1.0  4000
## mean_PPD                                 0.0  1.0  4000
## log-posterior                            0.2  1.0  1050
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).