23.4 Example: Hierarchical Logistic Regression
Consider a hierarchical model of American presidential voting behavior based on state of residence.43
Each of the fifty states \(k \in 1{:}50\) will have its own slope \(\beta_k\) and intercept \(\alpha_k\) to model the log odds of voting for the Republican candidate as a function of income. Suppose there are \(N\) voters and with voter \(n \in 1{:}N\) being in state \(s[n]\) with income \(x_n\). The likelihood for the vote \(y_n \in \{ 0, 1 \}\) is
\[ y_n \sim \mathsf{Bernoulli} \left( \mathrm{logit}^{-1}\left( \alpha_{s[n]} + \beta_{s[n]} \, x_n \right) \right). \]
The slopes and intercepts get hierarchical priors,
\[ \alpha_k \sim \mathsf{normal}(\mu_{\alpha}, \sigma_{\alpha}) \\ \beta_k \sim \mathsf{normal}(\mu_{\beta}, \sigma_{\beta}) \]
Unmapped Implementation
This model can be coded up in Stan directly as follows.
data {
int<lower = 0> K;
int<lower = 0> N;
int<lower = 1, upper = K> kk[N];
vector[N] x;
int<lower = 0, upper = 1> y[N];
}
parameters {
matrix[K,2] beta;
vector[2] mu;
vector<lower=0>[2] sigma;
}
model {
mu ~ normal(0, 2);
sigma ~ normal(0, 2);
for (i in 1:2)
beta[ , i] ~ normal(mu[i], sigma[i]);
y ~ bernoulli_logit(beta[kk, 1] + beta[kk, 2] .* x);
}
For this model the vector of predictors x is coded as a vector,
corresponding to how it is used in the likelihood.
The priors for mu and sigma are vectorized. The priors
on the two components of beta (intercept and slope,
respectively) are stored in a \(K \times 2\) matrix.
The likelihood is also
vectorized using multi-indexing with index kk for the states
and elementwise multiplication (.*) for the income x.
The vectorized likelihood works out to the same thing as the following
less efficient looped form.
for (n in 1:N)
y[n] ~ bernoulli_logit(beta[kk[n], 1] + beta[kk[n], 2] * x[n]);
Mapped Implementation
The mapped version of the model will map over the states K.
This means the group-level parameters, real data, and integer-data
must be arrays of the same size.
The mapped implementation requires a function to be mapped. The
following function evaluates both the likelihood for the data observed
for a group as well as the prior for the group-specific parameters
(the name bl_glm derives from the fact that it’s a generalized
linear model with a Bernoulli likelihood and logistic link function).
functions {
vector bl_glm(vector mu_sigma, vector beta,
real[] x, int[] y) {
vector[2] mu = mu_sigma[1:2];
vector[2] sigma = mu_sigma[3:4];
real lp = normal_lpdf(beta | mu, sigma);
real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
return [lp + ll]';
}
}
The shared parameter mu_sigma contains the locations
(mu_sigma[1:2]) and scales (mu_sigma[3:4]) of the
priors, which are extracted in the first two lines of the program.
The variable lp is assigned the log density of the prior on
beta. The vector beta is of size two, as are the
vectors mu and sigma, so everything lines up for the
vectorization. Next, the variable ll is assigned to the log
likelihood contribution for the group. Here beta[1] is the
intercept of the regression and beta[2] the slope. The
predictor array x needs to be converted to a vector allow the
multiplication.
The data block is identical to that of the previous program, but repeated here for convenience. A transformed data block computes the data structures needed for the mapping by organizing the data into arrays indexed by group.
data {
int<lower = 0> K;
int<lower = 0> N;
int<lower = 1, upper = K> kk[N];
vector[N] x;
int<lower = 0, upper = 1> y[N];
}
transformed data {
int<lower = 0> J = N / K;
real x_r[K, J];
int<lower = 0, upper = 1> x_i[K, J];
{
int pos = 1;
for (k in 1:K) {
int end = pos + J - 1;
x_r[k] = to_array_1d(x[pos:end]);
x_i[k] = to_array_1d(y[pos:end]);
pos += J;
}
}
}
The integer J is set to the number of observations per group.44
The real data array x_r holds the predictors and the integer
data array x_i holds the outcomes. The grouped data arrays
are constructed by slicing the predictor vector x (and
converting it to an array) and slicing the outcome array y.
Given the transformed data with groupings, the parameters are the same
as the previous program. The model has the same priors for the
hyperparameters mu and sigma, but moves the prior for
beta and the likelihood to the mapped function.
parameters {
vector[2] beta[K];
vector[2] mu;
vector<lower=0>[2] sigma;
}
model {
mu ~ normal(0, 2);
sigma ~ normal(0, 2);
target += sum(map_rect(bl_glm, append_row(mu, sigma), beta, x_r, x_i));
}
The model as written here computes the priors for each group’s parameters along with the likelihood contribution for the group. An alternative mapping would leave the prior in the model block and only map the likelihood. In a serial setting this shouldn’t make much of a difference, but with parallelization, there is reduced communication (the prior’s parameters need not be transmitted) and also reduced parallelization with the version that leaves the prior in the model block.