This vignette illustrates the effects on posterior inference of pooling data (a.k.a. sharing strength) across units for repeated binary trial data. It provides R code to fit and check predictive models for three situations: (a) complete pooling, which assumes each unit is the same, (b) no pooling, which assumes the units are unrelated, and (c) partial pooling, where the similarity among the units is estimated. The note explains with working examples how to (i) fit the models using rstanarm and plot the results, (ii) estimate event probabilities, (iii) evaluate posterior predictive densities to assess model predictions on held-out data, (iv) rank units by chance of success, (v) perform multiple comparisons in several settings, (vi) replicate new data for posterior \(p\)-values, and (vii) perform graphical posterior predictive checks.
The content of the vignette is based on Bob Carpenter’s Stan tutorial Hierarchical Partial Pooling for Repeated Binary Trials, but here we show how to fit the models and carry out predictions, model checking, and model comparison using rstanarm. Most of the text is taken from the original, with some additions and subtractions to make the content more useful for rstanarm users. The Stan code from the original tutorial has also been entirely removed, as rstanarm will fit all of the models in Stan without the user having to write the underlying Stan programs. The Stan code in the original document is a good reference for anyone interested in how these models are estimated “under the hood”, though the parameterizations used internally by rstanarm differ somewhat from those in the original.
Suppose that for each of \(N\) units \(n \in 1{:}N\), we observe \(y_n\) successes out of \(K_n\) trials. For example, the data may consist of
rat tumor development, with \(y_n\) rats developing tumors of \(K_n\) total rats in experimental control group \(n \in 1{:}N\) (Tarone 1982)
surgical mortality, with \(y_n\) surgical patients dying in \(K_n\) surgeries for hospitals \(n \in 1{:}N\) (Spiegelhalter et al. 1996)
baseball batting ability, with \(y_n\) hits in \(K_n\) at bats for baseball players \(n \in 1{:}N\) (Efron and Morris 1975; Carpenter 2009)
machine learning system accuracy, with \(y_n\) correct classifications out of \(K_n\) examples for systems \(n \in 1{:}N\) (ML conference proceedings; Kaggle competitions)
In this vignette we use the small baseball data set of Efron and Morris (1975), but we also provide the rat control data of Tarone (1982), the surgical mortality data of Spiegelhalter et al. (1996) and the extended baseball data set of Carpenter (2009).
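The other data sets can be loaded the same way; a quick sketch (assuming the data set names tumors, mortality, and bball2006 used in the rstanarm documentation):
data(tumors, package = "rstanarm")     # rat tumor data (Tarone 1982)
data(mortality, package = "rstanarm")  # surgical mortality data (Spiegelhalter et al. 1996)
data(bball2006, package = "rstanarm")  # extended baseball data (Carpenter 2009)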
As a running example, we will use the data from Table 1 of Efron and Morris (1975), which is included in rstanarm under the name bball1970 (it was downloaded 24 Dec 2015 from here). It is drawn from the 1970 Major League Baseball season (from both leagues).
library(rstanarm)
data(bball1970)
bball <- bball1970
print(bball)
Player AB Hits RemainingAB RemainingHits
1 Clemente 45 18 367 127
2 Robinson 45 17 426 127
3 Howard 45 16 521 144
4 Johnstone 45 15 275 61
5 Berry 45 14 418 114
6 Spencer 45 14 466 126
7 Kessinger 45 13 586 155
8 Alvarado 45 12 138 29
9 Santo 45 11 510 137
10 Swaboda 45 11 200 46
11 Petrocelli 45 10 538 142
12 Rodriguez 45 10 186 42
13 Scott 45 10 435 132
14 Unser 45 10 277 73
15 Williams 45 10 591 195
16 Campaneris 45 9 558 159
17 Munson 45 8 408 129
18 Alvis 45 7 70 14
# A few quantities we'll use throughout
N <- nrow(bball)
K <- bball$AB
y <- bball$Hits
K_new <- bball$RemainingAB
y_new <- bball$RemainingHits
The data separates each player’s results into the outcome of the initial 45 at-bats and the outcome for the rest of the season. After running this code, N is the number of units (players). Then for each unit n, K[n] is the number of initial trials (at-bats), y[n] is the number of initial successes (hits), K_new[n] is the remaining number of trials (remaining at-bats), and y_new[n] is the number of successes in the remaining trials (remaining hits).
The remaining data can be used to evaluate the predictive performance of our models conditioned on the observed data. That is, we will “train” on the first 45 at bats and see how well our various models do at predicting the rest of the season.
With complete pooling, each unit is assumed to have the same chance of success. With no pooling, each unit is assumed to have a completely unrelated chance of success. With partial pooling, each unit is assumed to have a different chance of success, but the data for all of the observed units informs the estimates for each unit.
Partial pooling is typically accomplished through hierarchical models. Hierarchical models directly model the population of units. From a population model perspective, no pooling corresponds to infinite population variance, whereas complete pooling corresponds to zero population variance.
In the following sections, all three types of pooling models will be fit for the baseball data.
First we’ll create some useful objects to use throughout the rest of this vignette. One of them is a function, batting_avg, which just formats a number to three decimal places when printing, as is customary for batting averages.
batting_avg <- function(x) print(format(round(x, digits = 3), nsmall = 3), quote = FALSE)
player_avgs <- y / K # player avgs through 45 AB
tot_avg <- sum(y) / sum(K) # overall avg through 45 AB
cat("Player averages through 45 at-bats:\n")
batting_avg(player_avgs)
cat("Overall average through 45 at-bats:\n")
batting_avg(tot_avg)
Player averages through 45 at-bats:
[1] 0.400 0.378 0.356 0.333 0.311 0.311 0.289 0.267 0.244 0.244 0.222
[12] 0.222 0.222 0.222 0.222 0.200 0.178 0.156
Overall average through 45 at-bats:
[1] 0.265
The complete pooling model assumes a single parameter \(\theta\) representing the chance of success for all units (in this case players).
Assuming each player’s at-bats are independent Bernoulli trials, the probability distribution for each player’s number of hits \(y_n\) is modeled as
\[ p(y_n \, | \, \theta) \ = \ \mathsf{Binomial}(y_n \, | \, K_n, \theta). \]
When viewed as a function of \(\theta\) for fixed \(y_n\), this is called the likelihood function.
Assuming each player is independent leads to the complete data likelihood
\[ p(y \, | \, \theta) = \prod_{n=1}^N \mathsf{Binomial}(y_n \, | \, K_n, \theta). \]
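As a quick aside (a small base-R sketch, not part of the original tutorial), this complete-data log likelihood can be evaluated directly for any candidate \(\theta\) once y and K are defined as above:
# complete-pooling log likelihood as a function of theta
pooled_loglik <- function(theta) sum(dbinom(y, size = K, prob = theta, log = TRUE))
pooled_loglik(sum(y) / sum(K))  # evaluated at the overall batting average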
Using family=binomial("logit"), the stan_glm function in rstanarm will parameterize the model in terms of the log-odds \(\alpha\), which are defined by the logit transform as
\[ \alpha = \mathrm{logit}(\theta) = \log \, \frac{\theta}{1 - \theta}. \]
For example, \(\theta = 0.25\) corresponds to odds of \(.25\) to \(.75\) (equivalently, \(1\) to \(3\)), or log-odds of \(\log(.25 / .75) = -1.1\).
The model is therefore
\[ p(y_n \, | \, K_n, \alpha) \ = \ \mathsf{Binomial}(y_n \, | \, K_n, \ \mathrm{logit}^{-1}(\alpha)). \]
The inverse logit function is the logistic sigmoid, from which logistic regression gets its name; it is also the standard logistic cumulative distribution function (CDF),
\[ \mathrm{logit}^{-1}(\alpha) = \frac{1}{1 + \exp(-\alpha)} = \theta. \]
By construction, for any \(\alpha \in (-\infty, \infty)\), \(\mathrm{logit}^{-1}(\alpha) \in (0, 1)\); the sigmoid converts arbitrary log odds back to the probability scale.
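In R, the logit and inverse logit are available as the base functions qlogis and plogis, so the example above can be checked directly:
qlogis(0.25)  # logit(0.25), approximately -1.1
plogis(-1.1)  # inverse logit, back to approximately 0.25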
We will use a normal distribution with mean \(-1\) and standard deviation \(1\) as the prior on the log-odds \(\alpha\). This is a weakly informative prior that places about 95% of the prior probability in the interval \((-3, 1)\), which inverse-logit transforms to the interval \((0.05, 0.73)\). The prior median \(-1\) corresponds to a \(0.27\) chance of success. In fact, an even narrower prior could be motivated by substantial knowledge of baseball.
The figure below shows both this prior on \(\alpha\) as well as the prior it implies on the probability \(\theta\).
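If you want to draw a similar figure yourself, one way (a base-R sketch, not necessarily the code used to make the vignette’s figure) is to simulate from the prior and transform the draws:
alpha_prior <- rnorm(1e5, mean = -1, sd = 1)  # Normal(-1, 1) prior on the log-odds
theta_prior <- plogis(alpha_prior)            # implied prior on the chance of success
par(mfrow = c(1, 2))
hist(alpha_prior, breaks = 50, main = "Prior on log-odds", xlab = "alpha")
hist(theta_prior, breaks = 50, main = "Implied prior on theta", xlab = "theta")
par(mfrow = c(1, 1))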
To fit the model we call stan_glm with the formula cbind(Hits, AB - Hits) ~ 1. The left-hand side of the formula specifies the binomial outcome by providing the number of successes (hits) and failures (at-bats without a hit) for each player, and the right-hand side indicates that we want an intercept-only model.
SEED <- 101
wi_prior <- normal(-1, 1) # weakly informative prior on log-odds
fit_pool <- stan_glm(cbind(Hits, AB - Hits) ~ 1, data = bball, family = binomial("logit"),
prior_intercept = wi_prior, seed = SEED)
The summary function will compute all sorts of summary statistics from the fitted model, but here we’ll create a small function that computes just a few posterior summary statistics that we’ll want for each of the models we estimate. The summary_stats function, defined below, takes a matrix of posterior draws as its input, applies an inverse-logit transformation (to convert from log-odds to probabilities), and then computes the median and 80% interval.
invlogit <- plogis # function(x) 1/(1 + exp(-x))
summary_stats <- function(posterior) {
x <- invlogit(posterior) # log-odds -> probabilities
t(apply(x, 2, quantile, probs = c(0.1, 0.5, 0.9)))
}
pool <- summary_stats(as.matrix(fit_pool)) # as.matrix extracts the posterior draws
pool <- matrix(pool, # replicate to give each player the same estimates
nrow(bball), ncol(pool), byrow = TRUE,
dimnames = list(bball$Player, c("10%", "50%", "90%")))
batting_avg(pool)
10% 50% 90%
Clemente 0.246 0.265 0.285
Robinson 0.246 0.265 0.285
Howard 0.246 0.265 0.285
Johnstone 0.246 0.265 0.285
Berry 0.246 0.265 0.285
Spencer 0.246 0.265 0.285
Kessinger 0.246 0.265 0.285
Alvarado 0.246 0.265 0.285
Santo 0.246 0.265 0.285
Swaboda 0.246 0.265 0.285
Petrocelli 0.246 0.265 0.285
Rodriguez 0.246 0.265 0.285
Scott 0.246 0.265 0.285
Unser 0.246 0.265 0.285
Williams 0.246 0.265 0.285
Campaneris 0.246 0.265 0.285
Munson 0.246 0.265 0.285
Alvis 0.246 0.265 0.285
With more data, such as from more players or from the rest of the season, the posterior approaches a delta function around the maximum likelihood estimate and the central posterior intervals will shrink. Nevertheless, even if we knew a player’s chance of success exactly, there would still be a large amount of uncertainty in the outcome of \(K\) binary trials with that chance of success; using a binomial model fundamentally bounds our prediction accuracy.
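For instance (a quick check, not part of the original tutorial), even for a player whose chance of success is known to be exactly the overall average of about 0.265, the central 80% interval for the observed batting average over 45 at-bats is still wide:
qbinom(c(0.1, 0.9), size = 45, prob = 0.265) / 45  # roughly 0.18 to 0.36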
Although this model will be a good baseline for comparison, we have good reason to believe from a large amount of prior data (players with as many as 10,000 trials) that it is very unlikely that all baseball players have the same chance of success.
A model with no pooling involves a separate chance-of-success parameter \(\theta_n \in [0,1]\) for each player \(n\), where the \(\theta_n\) are assumed to be independent.
rstanarm will again parameterize the model in terms of the log-odds, \(\alpha_n = \mathrm{logit}(\theta_n)\), so the likelihood then uses the log-odds of success \(\alpha_n\) for unit \(n\) in modeling the number of successes \(y_n\) as
\[ p(y_n \, | \, \alpha_n) = \mathsf{Binomial}(y_n \, | \, K_n, \mathrm{logit}^{-1}(\alpha_n)). \]
Assuming the \(y_n\) are independent (conditional on \(\theta\)), this leads to the total data likelihood
\[ p(y \, | \, \alpha) = \prod_{n=1}^N \mathsf{Binomial}(y_n \, | \, K_n, \mathrm{logit}^{-1}(\alpha_n)). \]
To fit the model we need only tweak the model formula used for the complete pooling model to drop the intercept and instead include as the only predictor the factor variable Player. This is equivalent to estimating a separate intercept on the log-odds scale for each player. We’ll also use the prior argument (rather than prior_intercept) since Player is considered a predictor rather than an intercept from R’s perspective. Using the same weakly informative prior now means that each \(\alpha_n\) gets a \(\mathsf{Normal}(-1, 1)\) prior, independent of the others.
fit_nopool <- update(fit_pool, formula = . ~ 0 + Player, prior = wi_prior)
nopool <- summary_stats(as.matrix(fit_nopool))
rownames(nopool) <- as.character(bball$Player)
batting_avg(nopool)
parameters 10% 50% 90%
Clemente 0.303 0.388 0.482
Robinson 0.278 0.364 0.462
Howard 0.261 0.346 0.436
Johnstone 0.244 0.326 0.418
Berry 0.222 0.305 0.392
Spencer 0.227 0.305 0.395
Kessinger 0.209 0.285 0.369
Alvarado 0.192 0.263 0.347
Santo 0.174 0.244 0.327
Swaboda 0.174 0.244 0.325
Petrocelli 0.156 0.224 0.305
Rodriguez 0.156 0.225 0.304
Scott 0.153 0.224 0.306
Unser 0.155 0.223 0.304
Williams 0.156 0.225 0.307
Campaneris 0.140 0.204 0.284
Munson 0.122 0.185 0.259
Alvis 0.105 0.166 0.239
Each 80% interval is much wider than the estimated interval for the population in the complete pooling model; this is to be expected—there are only 45 observations for each parameter here as opposed to 810 in the complete pooling case. If the units each had different numbers of trials, the intervals would also vary based on size.
As the estimated chance of success goes up toward 0.5, the 80% intervals get wider. This is to be expected for chance-of-success parameters, because the variance is maximized when \(\theta = 0.5\).
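A quick way to see this (a small check, not part of the original tutorial) is to tabulate the binomial variance \(K \theta (1 - \theta)\) over a grid of \(\theta\) values:
theta_grid <- seq(0.1, 0.9, by = 0.1)
round(45 * theta_grid * (1 - theta_grid), 1)  # largest at theta = 0.5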
Based on our existing knowledge of baseball, the no-pooling model is almost certainly overestimating the high abilities and underestimating lower abilities (Ted Williams, 30 years prior to the year this data was collected, was the last player with a 40% observed success rate over a season, whereas 20% or less is too low for all but a few rare defensive specialists).
Complete pooling provides estimated abilities that are too narrowly distributed for the units and removes any chance of modeling population variation. Estimating each chance of success separately without any pooling provides estimated abilities that are too broadly distributed for the units and hence too variable. Clearly some amount of pooling between these two extremes is called for. But how much?
A hierarchical model treats the players as belonging to a population of players. The properties of this population will be estimated along with player abilities, implicitly controlling the amount of pooling that is applied. The more variable the (estimate of the) population, the less pooling is applied. Mathematically, the hierarchical model places a prior on the abilities with parameters that are themselves estimated.
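In symbols, a typical formulation of this idea (a sketch; as noted earlier, the parameterization rstanarm uses internally differs somewhat) puts a common prior on the player-level log-odds,
\[ \alpha_n \sim \mathsf{Normal}(\mu, \sigma), \]
and then places priors on the population parameters \(\mu\) and \(\sigma\), which are estimated from the data along with the \(\alpha_n\).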
This model can be estimated using the stan_glmer function.
fit_partialpool <-
stan_glmer(cbind(Hits, AB - Hits) ~ (1 | Player), data = bball, family = binomial("logit"),
prior_intercept = wi_prior, seed = SEED)
Because stan_glmer (like glmer) estimates the varying intercepts for Player by estimating a single global intercept \(\alpha_0\) and individual deviations from that intercept for each player \(\delta_n = \alpha_n - \alpha_0\), to get the posterior distribution for each \(\alpha_n\) we need to shift each of the posterior draws by the corresponding draw for the intercept. We can do this easily using the sweep function.
# shift each player's estimate by intercept (and then drop intercept)
shift_draws <- function(draws) {
sweep(draws[, -1], MARGIN = 1, STATS = draws[, 1], FUN = "+")
}
alphas <- shift_draws(as.matrix(fit_partialpool))
partialpool <- summary_stats(alphas)
partialpool <- partialpool[-nrow(partialpool),] # drop the last row (the variance parameter, not a player)
rownames(partialpool) <- as.character(bball$Player)
batting_avg(partialpool)
parameters 10% 50% 90%
Clemente 0.249 0.282 0.347
Robinson 0.246 0.279 0.340
Howard 0.244 0.276 0.331
Johnstone 0.240 0.273 0.318
Berry 0.236 0.270 0.315
Spencer 0.236 0.270 0.316
Kessinger 0.233 0.267 0.309
Alvarado 0.228 0.264 0.305
Santo 0.222 0.261 0.298
Swaboda 0.220 0.260 0.299
Petrocelli 0.217 0.258 0.295
Rodriguez 0.218 0.258 0.293
Scott 0.217 0.259 0.292
Unser 0.218 0.259 0.295
Williams 0.217 0.259 0.295
Campaneris 0.212 0.256 0.290
Munson 0.203 0.252 0.287
Alvis 0.198 0.250 0.284
Here the estimates are less extreme than in the no-pooling case, which we should expect due to the partial pooling. It is also clear from the wide posteriors for the \(\theta_n\) that there is considerable uncertainty in the estimates of chance-of-success on a unit-by-unit (player-by-player) basis.
Figure 5.4 from Gelman et al. (2013) plots the observed number of successes \(y_n\) for the first \(K_n\) trials versus the median and 80% intervals for the estimated chance-of-success parameters \(\theta_n\) in the posterior. The following R code reproduces a similar plot for our data.
library(ggplot2)
models <- c("complete pooling", "no pooling", "partial pooling")
estimates <- rbind(pool, nopool, partialpool)
colnames(estimates) <- c("lb", "median", "ub")
plotdata <- data.frame(estimates,
observed = rep(player_avgs, times = length(models)),
model = rep(models, each = N),
row.names = NULL)
ggplot(plotdata, aes(x = observed, y = median, ymin = lb, ymax = ub)) +
geom_hline(yintercept = tot_avg, color = "lightpink", size = 0.75) +
geom_abline(intercept = 0, slope = 1, color = "skyblue") +
geom_linerange(color = "gray60", size = 0.75) +
geom_point(size = 2.5, shape = 21, fill = "gray30", color = "white", stroke = 0.2) +
facet_grid(. ~ model) +
coord_fixed() +
scale_x_continuous(breaks = c(0.2, 0.3, 0.4)) +
labs(x = "Observed Hits / AB", y = "Predicted chance of hit") +
ggtitle("Posterior Medians and 80% Intervals")