# Introduction

This vignette illustrates the effects on posterior inference of pooling data (a.k.a. sharing strength) across units for repeated binary trial data. It provides R code to fit and check predictive models for three situations: (a) complete pooling, which assumes each unit is the same; (b) no pooling, which assumes the units are unrelated; and (c) partial pooling, where the similarity among the units is estimated. The note explains with working examples how to (i) fit the models using rstanarm and plot the results, (ii) estimate event probabilities, (iii) calculate posterior predictive densities to evaluate model predictions on held-out data, (iv) rank units by chance of success, (v) perform multiple comparisons in several settings, (vi) replicate new data for posterior $$p$$-values, and (vii) perform graphical posterior predictive checks.

The content of the vignette is based on Bob Carpenter’s Stan tutorial Hierarchical Partial Pooling for Repeated Binary Trials, but here we show how to fit the models and carry out predictions, model checking, and model comparison using rstanarm. Most of the text is taken from the original, with some additions and subtractions to make the content more useful for rstanarm users. The Stan code from the original tutorial has also been entirely removed, as rstanarm will fit all of the models in Stan without the user having to write the underlying Stan programs. The Stan code in the original document is a good reference for anyone interested in how these models are estimated “under-the-hood”, though the parameterizations used internally by rstanarm differ somewhat from those in the original.

# Repeated Binary Trials

Suppose that for each of $$N$$ units $$n \in 1{:}N$$, we observe $$y_n$$ successes out of $$K_n$$ trials. For example, the data may consist of

• rat tumor development, with $$y_n$$ rats developing tumors of $$K_n$$ total rats in experimental control group $$n \in 1{:}N$$ (Tarone 1982)

• surgical mortality, with $$y_n$$ surgical patients dying in $$K_n$$ surgeries for hospitals $$n \in 1{:}N$$ (Spiegelhalter et al. 1996)

• baseball batting ability, with $$y_n$$ hits in $$K_n$$ at bats for baseball players $$n \in 1{:}N$$ (Efron and Morris 1975; Carpenter 2009)

• machine learning system accuracy, with $$y_n$$ correct classifications out of $$K_n$$ examples for systems $$n \in 1{:}N$$ (ML conference proceedings; Kaggle competitions)

In this vignette we use the small baseball data set of Efron and Morris (1975), but we also provide the rat control data of Tarone (1982), the surgical mortality data of Spiegelhalter et al. (1996) and the extended baseball data set of Carpenter (2009).

### Baseball Hits (Efron and Morris 1975)

As a running example, we will use the data from Table 1 of Efron and Morris (1975), which is included in rstanarm under the name bball1970 (it was downloaded 24 Dec 2015 from here). It is drawn from the 1970 Major League Baseball season (from both leagues).

library(rstanarm)
data(bball1970)
bball <- bball1970
print(bball)
       Player AB Hits RemainingAB RemainingHits
1    Clemente 45   18         367           127
2    Robinson 45   17         426           127
3      Howard 45   16         521           144
4   Johnstone 45   15         275            61
5       Berry 45   14         418           114
6     Spencer 45   14         466           126
7   Kessinger 45   13         586           155
8    Alvarado 45   12         138            29
9       Santo 45   11         510           137
10    Swaboda 45   11         200            46
11 Petrocelli 45   10         538           142
12  Rodriguez 45   10         186            42
13      Scott 45   10         435           132
14      Unser 45   10         277            73
15   Williams 45   10         591           195
16 Campaneris 45    9         558           159
17     Munson 45    8         408           129
18      Alvis 45    7          70            14
# A few quantities we'll use throughout
N <- nrow(bball)
K <- bball$AB
y <- bball$Hits
K_new <- bball$RemainingAB
y_new <- bball$RemainingHits

The data separate the outcomes of the initial 45 at-bats from those of the rest of the season. After running this code, N is the number of units (players). Then for each unit n, K[n] is the number of initial trials (at-bats), y[n] is the number of initial successes (hits), K_new[n] is the remaining number of trials (remaining at-bats), and y_new[n] is the number of successes in the remaining trials (remaining hits).

The remaining data can be used to evaluate the predictive performance of our models conditioned on the observed data. That is, we will “train” on the first 45 at bats and see how well our various models do at predicting the rest of the season.

# Pooling

With complete pooling, each unit is assumed to have the same chance of success. With no pooling, each unit is assumed to have a completely unrelated chance of success. With partial pooling, each unit is assumed to have a different chance of success, but the data for all of the observed units informs the estimates for each unit.

Partial pooling is typically accomplished through hierarchical models. Hierarchical models directly model the population of units. From a population model perspective, no pooling corresponds to infinite population variance, whereas complete pooling corresponds to zero population variance.

In the following sections, all three types of pooling models will be fit for the baseball data.

# Fitting the Models

First we’ll create some useful objects to use throughout the rest of this vignette. One of them is a function batting_avg, which just formats a number to include three decimal places to the right of zero when printing, as is customary for batting averages.

batting_avg <- function(x) print(format(round(x, digits = 3), nsmall = 3), quote = FALSE)
player_avgs <- y / K # player avgs through 45 AB
tot_avg <- sum(y) / sum(K) # overall avg through 45 AB

cat("Player averages through 45 at-bats:\n")
batting_avg(player_avgs)
cat("Overall average through 45 at-bats:\n")
batting_avg(tot_avg)
Player averages through 45 at-bats:
[1] 0.400 0.378 0.356 0.333 0.311 0.311 0.289 0.267 0.244 0.244 0.222
[12] 0.222 0.222 0.222 0.222 0.200 0.178 0.156
Overall average through 45 at-bats:
[1] 0.265

## Complete Pooling

The complete pooling model assumes a single parameter $$\theta$$ representing the chance of success for all units (in this case players).

Assuming each player’s at-bats are independent Bernoulli trials, the probability distribution for each player’s number of hits $$y_n$$ is modeled as

$p(y_n \, | \, \theta) \ = \ \mathsf{Binomial}(y_n \, | \, K_n, \theta).$

When viewed as a function of $$\theta$$ for fixed $$y_n$$, this is called the likelihood function.

Assuming each player is independent leads to the complete data likelihood

$p(y \, | \, \theta) = \prod_{n=1}^N \mathsf{Binomial}(y_n \, | \, K_n, \theta).$

Using family=binomial("logit"), the stan_glm function in rstanarm will parameterize the model in terms of the log-odds $$\alpha$$, which are defined by the logit transform as

$\alpha = \mathrm{logit}(\theta) = \log \, \frac{\theta}{1 - \theta}.$

For example, $$\theta = 0.25$$ corresponds to odds of $$.25$$ to $$.75$$ (equivalently, $$1$$ to $$3$$), or log-odds of $$\log(0.25 / 0.75) \approx -1.1$$.
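This arithmetic is easy to check directly in R, where the logit and inverse-logit transforms are available as the quantile and distribution functions of the standard logistic distribution (qlogis and plogis). This is just a sanity check, not part of the model fitting:

```r
theta <- 0.25
alpha <- qlogis(theta)   # logit: log(0.25 / 0.75)
round(alpha, 1)          # -1.1
plogis(alpha)            # inverse logit recovers theta = 0.25
```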

The model is therefore

$p(y_n \, | \, K_n, \alpha) \ = \ \mathsf{Binomial}(y_n \, | \, K_n, \ \mathrm{logit}^{-1}(\alpha))$

The inverse logit function is the logistic sigmoid, from which logistic regression gets its name; it is also the standard logistic cumulative distribution function (CDF),

$\mathrm{logit}^{-1}(\alpha) = \frac{1}{1 + \exp(-\alpha)} = \theta.$

By construction, for any $$\alpha \in (-\infty, \infty)$$, $$\mathrm{logit}^{-1}(\alpha) \in (0, 1)$$; the sigmoid converts arbitrary log odds back to the probability scale.

We will use a normal distribution with mean $$-1$$ and standard deviation $$1$$ as the prior on the log-odds $$\alpha$$. This is a weakly informative prior that places about 95% of the prior probability in the interval $$(-3, 1)$$, which inverse-logit transforms to the interval $$(0.05, 0.73)$$. The prior median $$-1$$ corresponds to a $$0.27$$ chance of success. In fact, an even narrower prior could be motivated by substantive baseball knowledge.
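These prior intervals are easy to verify numerically. A quick base-R check, approximating the 95% interval as mean ± 2 standard deviations:

```r
# Normal(-1, 1) prior: ~95% of its mass lies in (-1 - 2, -1 + 2) = (-3, 1)
round(plogis(c(-3, 1)), 2)   # implied interval on the probability scale: 0.05 0.73
round(plogis(-1), 2)         # chance of success at the prior median: 0.27
```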

The figure below shows both this prior on $$\alpha$$ and the prior it implies on the probability $$\theta$$.

To fit the model we call stan_glm with the formula cbind(Hits, AB - Hits) ~ 1. The left-hand side of the formula specifies the binomial outcome by providing the number of successes (hits) and failures (at-bats) for each player, and the right-hand side indicates that we want an intercept-only model.

SEED <- 101
wi_prior <- normal(-1, 1)  # weakly informative prior on log-odds
fit_pool <- stan_glm(cbind(Hits, AB - Hits) ~ 1, data = bball, family = binomial("logit"),
                     prior_intercept = wi_prior, seed = SEED)

The summary function will compute all sorts of summary statistics from the fitted model, but here we’ll create a small function that will compute just a few posterior summary statistics that we’ll want for each of the models we estimate. The summary_stats function, defined below, will take a matrix of posterior draws as its input, apply an inverse-logit transformation (to convert from log-odds to probabilities) and then compute the median and 80% interval.

invlogit <- plogis  # function(x) 1/(1 + exp(-x))
summary_stats <- function(posterior) {
  x <- invlogit(posterior)  # log-odds -> probabilities
  t(apply(x, 2, quantile, probs = c(0.1, 0.5, 0.9)))
}

pool <- summary_stats(as.matrix(fit_pool))  # as.matrix extracts the posterior draws
pool <- matrix(pool,  # replicate to give each player the same estimates
               nrow(bball), ncol(pool), byrow = TRUE,
               dimnames = list(bball$Player, c("10%", "50%", "90%")))
batting_avg(pool)

             10%   50%   90%
Clemente   0.246 0.265 0.285
Robinson   0.246 0.265 0.285
Howard     0.246 0.265 0.285
Johnstone  0.246 0.265 0.285
Berry      0.246 0.265 0.285
Spencer    0.246 0.265 0.285
Kessinger  0.246 0.265 0.285
Alvarado   0.246 0.265 0.285
Santo      0.246 0.265 0.285
Swaboda    0.246 0.265 0.285
Petrocelli 0.246 0.265 0.285
Rodriguez  0.246 0.265 0.285
Scott      0.246 0.265 0.285
Unser      0.246 0.265 0.285
Williams   0.246 0.265 0.285
Campaneris 0.246 0.265 0.285
Munson     0.246 0.265 0.285
Alvis      0.246 0.265 0.285

With more data, such as from more players or from the rest of the season, the posterior approaches a delta function around the maximum likelihood estimate and the central posterior intervals will shrink. Nevertheless, even if we know a player’s chance of success exactly, there is a large amount of uncertainty in running $$K$$ binary trials with that chance of success; using a binomial model fundamentally bounds our prediction accuracy.

Although this model will be a good baseline for comparison, we have good reason to believe from a large amount of prior data (players with as many as 10,000 trials) that it is very unlikely that all baseball players have the same chance of success.

## No Pooling

A model with no pooling involves a separate chance-of-success parameter $$\theta_n \in [0,1]$$ for each player $$n$$, where the $$\theta_n$$ are assumed to be independent.
rstanarm will again parameterize the model in terms of the log-odds, $$\alpha_n = \mathrm{logit}(\theta_n)$$, so the likelihood uses the log-odds of success $$\alpha_n$$ for unit $$n$$ in modeling the number of successes $$y_n$$ as

$p(y_n \, | \, \alpha_n) = \mathsf{Binomial}(y_n \, | \, K_n, \mathrm{logit}^{-1}(\alpha_n)).$

Assuming the $$y_n$$ are independent (conditional on $$\theta$$), this leads to the total data likelihood

$p(y \, | \, \alpha) = \prod_{n=1}^N \mathsf{Binomial}(y_n \, | \, K_n, \mathrm{logit}^{-1}(\alpha_n)).$

To fit the model we need only tweak the formula used for the complete pooling model: drop the intercept and instead include the factor variable Player as the only predictor. This is equivalent to estimating a separate intercept on the log-odds scale for each player. We’ll also use the prior (rather than prior_intercept) argument, since from R’s perspective Player is a predictor rather than an intercept. Using the same weakly informative prior now means that each $$\alpha_n$$ gets an independent $$\mathsf{Normal}(-1, 1)$$ prior.

fit_nopool <- update(fit_pool, formula = . ~ 0 + Player, prior = wi_prior)
nopool <- summary_stats(as.matrix(fit_nopool))
rownames(nopool) <- as.character(bball$Player)
batting_avg(nopool)

parameters   10%   50%   90%
Clemente   0.303 0.388 0.482
Robinson   0.278 0.364 0.462
Howard     0.261 0.346 0.436
Johnstone  0.244 0.326 0.418
Berry      0.222 0.305 0.392
Spencer    0.227 0.305 0.395
Kessinger  0.209 0.285 0.369
Santo      0.174 0.244 0.327
Swaboda    0.174 0.244 0.325
Petrocelli 0.156 0.224 0.305
Rodriguez  0.156 0.225 0.304
Scott      0.153 0.224 0.306
Unser      0.155 0.223 0.304
Williams   0.156 0.225 0.307
Campaneris 0.140 0.204 0.284
Munson     0.122 0.185 0.259
Alvis      0.105 0.166 0.239

Each 80% interval is much wider than the estimated interval for the population in the complete pooling model; this is to be expected because there are only 45 observations informing each parameter here, as opposed to 810 in the complete pooling case. If the units had differing numbers of trials, the interval widths would also vary with the number of trials.

As the estimated chance of success approaches 0.5, the 80% intervals get wider. This is to be expected for chance-of-success parameters, because the variance of a binomial proportion is maximized when $$\theta = 0.5$$.
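A quick sanity check (base R only) confirms that the per-trial binomial variance $$\theta(1-\theta)$$ peaks at $$\theta = 0.5$$:

```r
# variance of a single binary trial is theta * (1 - theta)
theta <- c(0.1, 0.3, 0.5, 0.7, 0.9)
round(theta * (1 - theta), 2)   # 0.09 0.21 0.25 0.21 0.09 -- largest at theta = 0.5
```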

Based on our existing knowledge of baseball, the no-pooling model is almost certainly overestimating the high abilities and underestimating lower abilities (Ted Williams, 30 years prior to the year this data was collected, was the last player with a 40% observed success rate over a season, whereas 20% or less is too low for all but a few rare defensive specialists).

## Partial Pooling

Complete pooling provides estimated abilities that are too narrowly distributed for the units and removes any chance of modeling population variation. Estimating each chance of success separately without any pooling provides estimated abilities that are too broadly distributed for the units and hence too variable. Clearly some amount of pooling between these two extremes is called for. But how much?

A hierarchical model treats the players as belonging to a population of players. The properties of this population will be estimated along with player abilities, implicitly controlling the amount of pooling that is applied. The more variable the (estimate of the) population, the less pooling is applied. Mathematically, the hierarchical model places a prior on the abilities with parameters that are themselves estimated.

This model can be estimated using the stan_glmer function.

fit_partialpool <-
  stan_glmer(cbind(Hits, AB - Hits) ~ (1 | Player), data = bball, family = binomial("logit"),
             prior_intercept = wi_prior, seed = SEED)

Because stan_glmer (like glmer) estimates the varying intercepts for Player by estimating a single global intercept $$\alpha_0$$ and individual deviations from that intercept for each player $$\delta_n = \alpha_n - \alpha_0$$, to get the posterior distribution for each $$\alpha_n$$ we need to shift each of the posterior draws by the corresponding draw for the intercept. We can do this easily using the sweep function.

# shift each player's estimate by intercept (and then drop intercept)
shift_draws <- function(draws) {
  sweep(draws[, -1], MARGIN = 1, STATS = draws[, 1], FUN = "+")
}
alphas <- shift_draws(as.matrix(fit_partialpool))
partialpool <- summary_stats(alphas)
partialpool <- partialpool[-nrow(partialpool),]
rownames(partialpool) <- as.character(bball$Player)
batting_avg(partialpool)

parameters   10%   50%   90%
Clemente   0.249 0.282 0.347
Robinson   0.246 0.279 0.340
Howard     0.244 0.276 0.331
Johnstone  0.240 0.273 0.318
Berry      0.236 0.270 0.315
Spencer    0.236 0.270 0.316
Kessinger  0.233 0.267 0.309
Santo      0.222 0.261 0.298
Swaboda    0.220 0.260 0.299
Petrocelli 0.217 0.258 0.295
Rodriguez  0.218 0.258 0.293
Scott      0.217 0.259 0.292
Unser      0.218 0.259 0.295
Williams   0.217 0.259 0.295
Campaneris 0.212 0.256 0.290
Munson     0.203 0.252 0.287
Alvis      0.198 0.250 0.284

Here the estimates are less extreme than in the no-pooling case, which we should expect due to the partial pooling. It is also clear from the wide posteriors for the $$\theta_n$$ that there is considerable uncertainty in the estimates of chance-of-success on a unit-by-unit (player-by-player) basis.

## Observed vs. Estimated Chance of Success

Figure 5.4 from Gelman et al. (2013) plots the observed number of successes $$y_n$$ for the first $$K_n$$ trials versus the median and 80% intervals for the estimated chance-of-success parameters $$\theta_n$$ in the posterior. The following R code reproduces a similar plot for our data.

library(ggplot2)
models <- c("complete pooling", "no pooling", "partial pooling")
estimates <- rbind(pool, nopool, partialpool)
colnames(estimates) <- c("lb", "median", "ub")
plotdata <- data.frame(estimates,
                       observed = rep(player_avgs, times = length(models)),
                       model = rep(models, each = N),
                       row.names = NULL)

ggplot(plotdata, aes(x = observed, y = median, ymin = lb, ymax = ub)) +
geom_hline(yintercept = tot_avg, color = "lightpink", size = 0.75) +
geom_abline(intercept = 0, slope = 1, color = "skyblue") +
geom_linerange(color = "gray60", size = 0.75) +
geom_point(size = 2.5, shape = 21, fill = "gray30", color = "white", stroke = 0.2) +
facet_grid(. ~ model) +
coord_fixed() +
scale_x_continuous(breaks = c(0.2, 0.3, 0.4)) +
labs(x = "Observed Hits / AB", y = "Predicted chance of hit") +
ggtitle("Posterior Medians and 80% Intervals")