July 30, 2016

Installation

example("stan_model", package = "rstan", run.dontrun = TRUE)
  • Also, verify that packageVersion("rstanarm") is 2.11.1; otherwise execute
install.packages("rstanarm", repos = "https://cloud.r-project.org", dependencies = TRUE)
install.packages("rstanarm", repos = "https://cloud.r-project.org", type = "source")

Outline

  1. Introduction
  2. Hamiltonian Markov Chain Monte Carlo
  3. An rstanarm example
  4. Writing Stan programs
  5. Hierarchical models
  6. Case study with optimal book pricing

Obligatory Disclosure

  • Ben is an employee of Columbia University, which has received several research grants to develop Stan
  • Ben is also a cofounder of Stan Group (http://stan.fit), which provides support, consulting, etc. for businesses using Stan
  • According to Columbia University policy, any such employee who has any equity stake in, a title (such as officer or director) with, or is expected to earn at least \(\$5,000.00\) per year from a private company is required to disclose these facts in presentations
  • Eric is also a cofounder of Stan Group but not a Columbia employee and hence is not implicated by the above

Introduction

Who Is Using Stan?

  • Stan is used in academia, business, and government
  • Downloads
    • rstan:
    • rstanarm:
    • brms:
    • Lots of PyStan downloads
    • ~ 10,000+ manual downloads with each new release
  • Over 1,000 mailing list registrations
  • Stan is used for fitting climate models, clinical drug trials, genomics and cancer biology, population dynamics, psycholinguistics, social networks, finance and econometrics, professional sports, publishing, recommender systems, educational testing, and many more.

Stan in Pharmacometrics

Stan in Book Publishing

How Birth Year Influences Political Views

From the New York Times: A new model of presidential voting suggests President Obama’s approval rating — currently in the low 40s — will inform not only the 2016 election, but also the election in 2076. The model, by researchers at Catalist, the Democratic data firm, and Columbia University, uses hundreds of thousands of survey responses and new statistical software to estimate how people’s preferences change at different stages of their lives.

What Was the Probability of Brexit?

  • If I had asked you (a month ago) what the probability was that Britain would vote to leave the European Union, what would you have said?
  • If everyone could have supplied an answer to that question (a month ago), where did these beliefs about the probability of this event come from?

Different Perspectives on Probability

For each paradigm: what is fixed, what is random, what proportion is important, and what is the conclusion?

  • Randomization: fixed: \(\{y_1, y_2, \dots, y_N\}\); random: treatment assignment; important proportion: \(p\)-value for the null hypothesis ATE \(= 0\); conclusion: ATE \(\neq 0\)
  • Frequentist: fixed: \(Y\), \(\boldsymbol{\theta}\), \(N\); random: sample inclusion; important proportion: \(\theta \in\) confidence intervals (plural); conclusion: something basically Bayesian
  • Supervised learning: fixed: \(\{y_1, y_2, \dots, y_N\}\); random: training / testing inclusion; important proportion: correctly classified outcomes in the testing data; conclusion: some procedure predicts best
  • Bayesian: fixed: \(\{y_1, y_2, \dots, y_N\}\), \(\boldsymbol{\theta}\); random: beliefs about \(\boldsymbol{\theta}\); important proportion: posterior draws of \(\theta \in \left(a,b\right)\); conclusion: decision or action
  • Fiducial: fixed: \(\{y_1, y_2, \dots, y_N\}\), \(\boldsymbol{\theta}\); random: error term; important proportion: fiducial draws of \(\theta \in \left(a,b\right)\); conclusion: ?

Two Justifications for Bayes Rule

  1. \(f\left(\mathbf{y}\right) \times f\left(\boldsymbol{\theta} | \mathbf{y}\right) = f\left(\boldsymbol{\theta}, \mathbf{y}\right) = f\left(\boldsymbol{\theta}\right) \times f\left(\mathbf{y} | \boldsymbol{\theta}\right) \implies f\left(\boldsymbol{\theta} | \mathbf{y}\right) = \frac{f\left(\boldsymbol{\theta}\right) \times f\left(\mathbf{y} | \boldsymbol{\theta}\right)}{f\left(\mathbf{y}\right)}\) where \(\mathbf{y} = \{y_1, y_2 \dots y_N\}\) and \(f\left(\cdot\right)\) is a PDF so \(f\left(\cdot\right) \geq 0\) & \(\int f\left(u\right)du = 1\)
    • \(f\left(\boldsymbol{\theta}\right)\) represents what someone believes about \(\boldsymbol{\theta}\) prior to observing \(\mathbf{y}\)
    • \(f\left(\boldsymbol{\theta} | \mathbf{y}\right)\) represents what someone believes about \(\boldsymbol{\theta}\) after observing \(\mathbf{y}\)
    • \(f\left(\mathbf{y} | \boldsymbol{\theta}\right)\) is the likelihood function, a function of \(\boldsymbol{\theta}\) for an observed \(\mathbf{y}\)
    • \(f\left(\mathbf{y}\right) = \int \cdots \int \int f\left(\boldsymbol{\theta}\right) f\left(\mathbf{y} | \boldsymbol{\theta}\right) d\theta_1 d\theta_2 \dots d\theta_K = \mathbb{E}_{\boldsymbol{\theta}}f\left(\mathbf{y} | \boldsymbol{\theta}\right)\) (approximated numerically in the sketch after this list)
  2. \(f\left(\boldsymbol{\theta} | \mathbf{y}\right)\) is the unique function that minimizes the sum of
    • Penalty: Kullback-Leibler divergence to \(f\left(\boldsymbol{\theta}\right)\)
    • Expected misfit: \(\mathbb{E}_{\boldsymbol{\theta}}\left[-\ln f\left(\mathbf{y} | \boldsymbol{\theta}\right)\right]\)
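
As a toy illustration of the first justification, here is a minimal sketch (not from the slides) that approximates the integral defining \(f\left(\mathbf{y}\right)\) on a grid for a hypothetical binomial model with a Beta(2, 2) prior:

theta <- seq(from = 0.001, to = 0.999, length.out = 999)   # grid over the parameter space
width <- theta[2] - theta[1]                                # grid spacing for the Riemann sums
prior <- dbeta(theta, shape1 = 2, shape2 = 2)               # f(theta)
likelihood <- dbinom(7, size = 10, prob = theta)            # f(y | theta) for y = 7 successes in 10 trials
marginal <- sum(prior * likelihood) * width                 # f(y), the integral in the denominator
posterior <- prior * likelihood / marginal                  # f(theta | y) via Bayes' rule
sum(theta * posterior) * width                              # posterior mean, close to (2 + 7) / (2 + 2 + 10)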

ABC vs. WTH

Approximate Bayesian Computation

Repeat many times:

  1. Draw \(\tilde{\boldsymbol{\theta}}\) from the prior distribution
  2. Draw \(\tilde{y}_1 \dots \tilde{y}_N\) from the distribution whose PDF is \(f\left(\mathbf{y}|\boldsymbol{\theta}\right)\)
  3. If \(\tilde{y}_1 \dots \tilde{y}_N\) is "sufficiently close to" \(y_1 \dots y_N\), retain \(\tilde{\boldsymbol{\theta}}\); otherwise discard \(\tilde{\boldsymbol{\theta}}\)

Retained draws approach draws from the posterior distribution as "sufficiently close to" approaches "exactly the same as", although reaching that limit would require infinite time (see the sketch below)
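
A minimal sketch of this rejection scheme for a hypothetical normal model with known \(\sigma = 1\), a standard normal prior on the mean, and the sample mean as the closeness criterion:

set.seed(12345)
y <- rnorm(50, mean = 1, sd = 1)                     # observed data
epsilon <- 0.05                                      # tolerance defining "sufficiently close to"
draws <- replicate(100000, {
  theta_tilde <- rnorm(1, mean = 0, sd = 1)          # 1. draw from the prior
  y_tilde <- rnorm(50, mean = theta_tilde, sd = 1)   # 2. simulate data given theta_tilde
  if (abs(mean(y_tilde) - mean(y)) < epsilon) theta_tilde else NA # 3. retain or discard
})
draws <- draws[!is.na(draws)]                        # retained draws approximate the posterior
c(posterior_mean = mean(draws), accepted = length(draws))

Shrinking epsilon improves the approximation but lowers the acceptance rate, which is the tradeoff described above.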

What Typically Happens

Repeat as many times as necessary:

  1. Estimate \(\boldsymbol{\theta}\) by optimizing some function to obtain \(\widehat{\boldsymbol{\theta}}\)
  2. If \(\widehat{\boldsymbol{\theta}}\) is "sufficiently close to" the researcher's prior beliefs about \(\boldsymbol{\theta}\), retain \(\widehat{\boldsymbol{\theta}}\); otherwise discard \(\widehat{\boldsymbol{\theta}}\) and tweak the model and / or data

This is not even a valid frequentist procedure

Explains a lot of "The Crisis in Psychology"

Bayesian Workflow

Hamiltonian Markov Chain Monte Carlo

Markov Chain Monte Carlo

  • Even if \(f\left(\mathbf{y}\right)\) could be calculated, you would have to do another K-dimensional integral to obtain something like \(\mathbb{E}\left[\theta_k | \mathbf{y}\right]\)
  • So we draw randomly \(S\) times from the posterior distribution — which does not require knowing \(f\left(\mathbf{y}\right)\) — and estimate \(\mathbb{E}\left[\theta_k | \mathbf{y}\right]\) with \(\frac{1}{S}\sum_{s=1}^S{\tilde{\theta}_k^{[s]}}\)
  • There is no way to draw independently from most posterior distributions
  • The price to be paid for relying on Markov Chain Monte Carlo (MCMC) to draw from a posterior distribution is that the draws are not independent
  • The degree of dependence in an MCMC algorithm governs how badly \(\frac{1}{S}\sum_{s=1}^S{g\left(\tilde{\boldsymbol{\theta}}^{[s]}\right)}\) estimates \(\mathbb{E}\left[\left.g\left(\boldsymbol{\theta}\right)\right|\mathbf{y}\right]\) for finite \(S\)
  • Effective Sample Size is analogous to the concept in complex survey design: the number of independent draws that would estimate a posterior mean with the same precision as the \(S\) dependent draws you actually have

A Markov Process with Severe Dependence

par(mar = c(4, 4, 1, 1) + .1, las = 1, bg = "lightgrey")
# six chains from a nearly unit-root AR(1) process: consecutive draws are almost perfectly dependent
x <- sapply(1:6, FUN = function(i) arima.sim(model = list(ar = 0.9999999), n = 10^6))
matplot(x, type = "l", col = 1:6, lty = 1)                      # traceplots of the six chains
for (j in 1:ncol(x)) abline(h = mean(x[, j]), col = j, lty = 2) # each chain's mean is far from the true 0
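
A crude estimate of the Effective Sample Size from the previous slide can be computed from a chain's autocorrelations (Stan and the coda package use more careful estimators; this is only a sketch):

# Effective sample size of the first chain, roughly S / (1 + 2 * sum of positive autocorrelations)
rho <- acf(x[, 1], lag.max = 1000, plot = FALSE)$acf[-1]  # autocorrelations at lags 1, ..., 1000
nrow(x) / (1 + 2 * sum(rho[rho > 0]))                     # a tiny fraction of the 10^6 dependent draws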

What is Stan and How Does It Help?

  • Includes a probabilistic programming language
    • The rstanarm, brms, and rethinking R packages provide code to specify some statistical models — with a limited choice of prior distributions — that can be mapped into the Stan language
  • Includes new Hamiltonian Monte Carlo (HMC) algorithms
    • HMC is to MCMC as BFGS is to optimization
    • HMC is aided by the gradient of the posterior distribution wrt \(\boldsymbol{\theta}\)
    • Dependence between consecutive draws is minimal
  • Includes a matrix and scalar math library that supports autodifferentiation
  • Includes interfaces from R and other high-level software
  • Includes (not Stan-specific) post-estimation R functions for MCMC output
  • Includes a large community of users and many developers

Overview of Hamiltonian Monte Carlo

  • Since the early 1990s, most MCMC uses Gibbs updates when feasible and falls back to something more general otherwise
    • Gibbs entails drawing \(\theta_k\) from its "full-conditional distribution": \(\theta_k | \boldsymbol{\theta}_{-k}, \mathbf{y}\) (see the sketch after this list)
    • "Something more general" includes slice sampling, Metropolis-Hastings, etc., which is needed when the full-conditional distribution of \(\theta_k\) is not known in closed form
  • If Gibbs updates are feasible, they are easy to code and fast to execute but are statistically inefficient because the dependence between draws is high
  • HMC differs from Gibbs in that all elements of \(\boldsymbol{\theta}\) are updated simultaneously
  • H stands for Hamiltonian, which is a physics framework for how a particle \(\left(\boldsymbol{\theta}\right)\) moves through an unbounded frictionless space
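
As a concrete (non-Stan) illustration of Gibbs updates and their statistical inefficiency, consider a bivariate normal with correlation \(\rho\), whose full-conditional distributions are univariate normals:

rho <- 0.99                                  # high correlation makes consecutive draws strongly dependent
S <- 10000
theta <- matrix(0, nrow = S, ncol = 2)
for (s in 2:S) {
  # theta_1 | theta_2 is normal with mean rho * theta_2 and standard deviation sqrt(1 - rho^2)
  theta[s, 1] <- rnorm(1, mean = rho * theta[s - 1, 2], sd = sqrt(1 - rho^2))
  # theta_2 | theta_1 uses the just-updated value of theta_1
  theta[s, 2] <- rnorm(1, mean = rho * theta[s, 1], sd = sqrt(1 - rho^2))
}
acf(theta[, 1], lag.max = 1, plot = FALSE)$acf[2]  # lag-1 autocorrelation is about rho^2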

Example of Drawing from a Multivariate Normal

  • \(\mathbf{y} \thicksim \mathcal{N}_{250}\left(\mathbf{0}, \boldsymbol{\Sigma}\right)\) where \(\boldsymbol{\Sigma}\) is ill-conditioned but focus on just two dimensions
  • Do 1 million draws w/ Random Walk Metropolis & Gibbs, thinning by \(1000\) (see the sketch after this list)
  • Do 1000 draws with the NUTS algorithm in Stan and 1000 independent draws
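
A hedged two-dimensional sketch of the Random Walk Metropolis piece (the slides use \(250\) dimensions and an ill-conditioned \(\boldsymbol{\Sigma}\)), scaled down here and thinned to reduce, but not remove, the dependence between retained draws:

rho <- 0.9
log_kernel <- function(theta) {  # log kernel of a bivariate normal with unit variances and correlation rho
  -0.5 * (theta[1]^2 - 2 * rho * theta[1] * theta[2] + theta[2]^2) / (1 - rho^2)
}
S <- 10^5; thin <- 100                                 # scaled down from 10^6 draws thinned by 1000
theta <- c(0, 0)
draws <- matrix(NA_real_, nrow = S / thin, ncol = 2)
for (s in 1:S) {
  proposal <- theta + rnorm(2, mean = 0, sd = 0.3)     # random walk proposal
  if (log(runif(1)) < log_kernel(proposal) - log_kernel(theta)) theta <- proposal # accept or reject
  if (s %% thin == 0) draws[s / thin, ] <- theta       # keep every 100th draw
}
colMeans(draws)                                        # near c(0, 0), but far less precise than independent draws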

Comparison of MCMC Samplers

Details of Hamiltonian Monte Carlo

  • HMC augments the parameter space with a momentum vector \(\left(\boldsymbol{\phi}\right)\) of size \(K\)
  • \(\boldsymbol{\phi}\) does not enter the likelihood for \(\mathbf{y}\), so its marginal posterior distribution is the same as its prior distribution, which is multivariate normal with mean vector zero and a covariance matrix that is tuned during the warmup phase
  • Given a draw of \(\boldsymbol{\phi}\) from this multivariate normal distribution, the Hamiltonian equations tell us where \(\boldsymbol{\theta}\) would move to in \(t\) periods, depending on the posterior kernel in log-units \(\ln f\left(\boldsymbol{\theta}\right) + \ln f\left(\mathbf{y} | \boldsymbol{\theta}\right)\)
  • We approximate the solution to the Hamiltonian equations numerically, assuming discrete time (see the sketch after this list)
  • Draw from the footprints of the discrete Hamiltonian path with a categorical distribution whose probabilities are proportional to the posterior kernel
  • Stepsize and momentum are automatically tuned but can be adjusted by you
  • Essentially, the only thing that can go wrong is numerical instability
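
A minimal sketch of that discrete-time approximation for a single standard normal parameter, using the common leapfrog scheme (Stan's implementation, including how it chooses the number of steps, is far more sophisticated):

U <- function(theta) 0.5 * theta^2              # negative log posterior kernel of a standard normal
grad_U <- function(theta) theta                 # its gradient with respect to theta
leapfrog <- function(theta, phi, eps, L) {      # L discrete steps of size eps along the Hamiltonian path
  footprints <- numeric(L)
  for (l in 1:L) {
    phi <- phi - 0.5 * eps * grad_U(theta)      # half step for the momentum
    theta <- theta + eps * phi                  # full step for the position
    phi <- phi - 0.5 * eps * grad_U(theta)      # half step for the momentum
    footprints[l] <- theta
  }
  footprints                                    # candidate points from which the next draw is selected
}
leapfrog(theta = 1, phi = rnorm(1), eps = 0.1, L = 20)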

Hamiltonian Simulation: Wrong Step Sizes

Too Small

Too Big

Hamiltonian Simulation: Right Step Size

Example with the rstanarm package

A Model for State-level Murder Rates in 1977

state.x77 <- within(as.data.frame(state.x77), { # choose reasonable units
  Density <- Population / Area
  Income <- Income / 1000
  Frost <- Frost / 100
})
library(rstanarm)
options(mc.cores = parallel::detectCores())
post <- stan_lm(Murder ~ Density + Income + Illiteracy + Frost, 
                data = state.x77, prior = R2(stop("put a number here"))) # the Results slide uses R2(0.25, what = "median")
print(post, digits = 2)

Results

## stan_lm(formula = Murder ~ Density + Income + Illiteracy + Frost, 
##     data = state.x77, prior = R2(0.25, what = "median"))
## 
## Estimates:
##               Median MAD_SD
## (Intercept)   -0.47   3.90 
## Density       -3.68   1.66 
## Income         1.01   0.67 
## Illiteracy     3.79   0.88 
## Frost         -0.56   0.91 
## sigma          2.65   0.28 
## log-fit_ratio -0.01   0.09 
## R2             0.46   0.09 
## 
## Sample avg. posterior predictive 
## distribution of y (X = xbar):
##          Median MAD_SD
## mean_PPD 7.38   0.52

You Can Do Anything with the Draws

mean(as.data.frame(post)$Density < 0) # Pr(beta_{Density} < 0)
## [1] 0.98225
round(posterior_interval(post, prob = 0.5), digits = 3) # endpoints of the IQRs
##                  25%    75%
## (Intercept)   -3.019  2.238
## Density       -4.835 -2.589
## Income         0.562  1.461
## Illiteracy     3.211  4.399
## Frost         -1.146  0.088
## sigma          2.474  2.859
## log-fit_ratio -0.078  0.047
## R2             0.399  0.521
launch_shinystan(post)

Model Comparison

  • The most important insight of supervised learning is that you will choose a model that overfits if you evaluate the models on the same data that you estimate the models with
  • Thus, supervised learning people partition "the" data (often randomly) into a training dataset (that is used to "train a model") and a testing dataset (that is used to evaluate how well models predict)
  • Nothing prevents you from doing that in a Bayesian context but holding out data makes your posterior distribution more diffuse
  • Bayesians usually condition on all the data and evaluate how well a model is expected to predict out of sample using "information criteria", which are all intended to select the model with the highest expected log predictive density (ELPD) for new data
  • This is easy to do with rstanarm using the loo and compare functions under the verifiable assumption that each observation could be omitted without having a drastic effect on the posterior distribution

Using the loo Function

(loo_1 <- loo(post))
## Computed from 4000 by 50 log-likelihood matrix
## 
##          Estimate  SE
## elpd_loo   -121.3 4.5
## p_loo         4.7 1.1
## looic       242.6 8.9
## 
## All Pareto k estimates OK (k < 0.5)
post2 <- stan_glm(Murder ~ Density + Income + Illiteracy, data = state.x77, family = gaussian, 
                  prior = normal(0, 5), prior_intercept = student_t(df = 2))
compare(loo_1, loo(post2))
## elpd_diff        se 
##       1.3       0.9

Summary

  • Using the model-fitting functions in the rstanarm package is easy
  • stan_lm, stan_aov, stan_glm, stan_glm.nb, and stan_polr all have the same syntax and same likelihood as their frequentist counterparts
  • Developers can add models to rstanarm or copy the build process of rstanarm into their own R packages to use Stan to estimate particular models
  • The brms (on CRAN) and rethinking (on GitHub) packages are a bit different from rstanarm but permit estimation of an overlapping set of models w/ Stan
  • Using Stan for Bayesian inference is sufficiently easy for most basic and some not-so-basic models that there should rarely be a reason to use frequentist tools to make Bayesian inferences