23.4 Example: Hierarchical Logistic Regression
Consider a hierarchical model of American presidential voting behavior based on state of residence.44
Each of the fifty states \(k \in 1{:}50\) will have its own slope \(\beta_k\) and intercept \(\alpha_k\) to model the log odds of voting for the Republican candidate as a function of income. Suppose there are \(N\) voters, with voter \(n \in 1{:}N\) residing in state \(s[n]\) and having income \(x_n\). The likelihood for the vote \(y_n \in \{ 0, 1 \}\) is
\[ y_n \sim \mathsf{Bernoulli} \left( \mathrm{logit}^{-1}\left( \alpha_{s[n]} + \beta_{s[n]} \, x_n \right) \right). \]
The slopes and intercepts get hierarchical priors,
\[ \alpha_k \sim \mathsf{normal}(\mu_{\alpha}, \sigma_{\alpha}) \\ \beta_k \sim \mathsf{normal}(\mu_{\beta}, \sigma_{\beta}) \]
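To make the generative story concrete, here is a minimal simulation sketch in Python (not Stan). The hyperparameter values, income distribution, and the function name `simulate_votes` are invented for illustration; only the model structure comes from the text. Note that states are indexed from zero here, versus Stan's one-based indexing.

```python
import numpy as np

def inv_logit(u):
    """Logistic sigmoid: maps log odds to a probability."""
    return 1.0 / (1.0 + np.exp(-u))

def simulate_votes(K, N, rng):
    """Simulate votes from the hierarchical logistic model.

    Hyperprior values and the income distribution are made-up
    choices for illustration.
    """
    mu_alpha, sigma_alpha = 0.0, 1.0
    mu_beta, sigma_beta = 0.0, 0.5
    alpha = rng.normal(mu_alpha, sigma_alpha, size=K)  # state intercepts
    beta = rng.normal(mu_beta, sigma_beta, size=K)     # state slopes
    s = rng.integers(0, K, size=N)                     # state of each voter
    x = rng.normal(0.0, 1.0, size=N)                   # standardized income
    p = inv_logit(alpha[s] + beta[s] * x)              # Pr[vote Republican]
    y = rng.binomial(1, p)                             # observed votes
    return s, x, y

rng = np.random.default_rng(1234)
s, x, y = simulate_votes(50, 1000, rng)
```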
Unmapped Implementation
This model can be coded up in Stan directly as follows.
data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
parameters {
  matrix[K, 2] beta;
  vector[2] mu;
  vector<lower = 0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  for (i in 1:2)
    beta[ , i] ~ normal(mu[i], sigma[i]);
  y ~ bernoulli_logit(beta[kk, 1] + beta[kk, 2] .* x);
}
For this model the predictors x are coded as a vector, corresponding to how they are used in the likelihood. The priors for mu and sigma are vectorized. The intercepts and slopes for the states are stored in a \(K \times 2\) matrix beta, with a vectorized prior on each column. The likelihood is also vectorized, using multi-indexing with the index kk for the states and elementwise multiplication (.*) for the income x.
The vectorized likelihood works out to the same thing as the following less efficient looped form.

for (n in 1:N)
  y[n] ~ bernoulli_logit(beta[kk[n], 1] + beta[kk[n], 2] * x[n]);
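The equivalence of the multi-indexed, vectorized form and the loop can be checked with a quick NumPy sketch (an illustration outside Stan; the helper `bernoulli_logit_lpmf` is a hand-rolled stand-in for Stan's built-in, and the data are random):

```python
import numpy as np

def bernoulli_logit_lpmf(y, u):
    """Log Bernoulli mass with logit-scale parameter u:
    y*u - log(1 + exp(u))."""
    return y * u - np.log1p(np.exp(u))

rng = np.random.default_rng(0)
K, N = 5, 40
beta = rng.normal(size=(K, 2))   # column 0: intercepts, column 1: slopes
kk = rng.integers(0, K, size=N)  # state index per observation (zero-based)
x = rng.normal(size=N)
y = rng.integers(0, 2, size=N)

# Vectorized: gather rows of beta with the multi-index kk.
vec = bernoulli_logit_lpmf(y, beta[kk, 0] + beta[kk, 1] * x).sum()

# Looped: one observation at a time.
loop = sum(bernoulli_logit_lpmf(y[n], beta[kk[n], 0] + beta[kk[n], 1] * x[n])
           for n in range(N))
```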
Mapped Implementation
The mapped version of the model will map over the \(K\) states. This means the group-level parameters, real data, and integer data must all be arrays of the same size.
The mapped implementation requires a function to be mapped. The following function evaluates both the likelihood for the data observed for a group and the prior for the group-specific parameters (the name bl_glm derives from the fact that it's a generalized linear model with a Bernoulli likelihood and logistic link function).
functions {
  vector bl_glm(vector mu_sigma, vector beta,
                real[] x, int[] y) {
    vector[2] mu = mu_sigma[1:2];
    vector[2] sigma = mu_sigma[3:4];
    real lp = normal_lpdf(beta | mu, sigma);
    real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
    return [lp + ll]';
  }
}
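What this function computes can be mirrored in plain NumPy (an illustration, not Stan; the normal log density and Bernoulli-logit log mass are written out by hand, and the example inputs are invented):

```python
import numpy as np

def bl_glm(mu_sigma, beta, x, y):
    """Python analogue of the mapped Stan function bl_glm: the group's
    log prior plus log likelihood, returned as a length-1 array to match
    map_rect's requirement that each job return a vector."""
    mu, sigma = mu_sigma[0:2], mu_sigma[2:4]
    # Log density of the normal(mu, sigma) priors on (intercept, slope).
    lp = np.sum(-0.5 * ((beta - mu) / sigma) ** 2
                - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
    # Bernoulli-logit log likelihood for the group's observations.
    u = beta[0] + beta[1] * np.asarray(x)
    ll = np.sum(y * u - np.log1p(np.exp(u)))
    return np.array([lp + ll])

out = bl_glm(np.array([0.0, 0.0, 1.0, 1.0]),  # append_row(mu, sigma)
             np.array([0.2, -0.1]),           # (intercept, slope)
             np.array([1.0, -0.5]),           # group's incomes
             np.array([1, 0]))                # group's votes
```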
The shared parameter mu_sigma contains the locations (mu_sigma[1:2]) and scales (mu_sigma[3:4]) of the priors, which are extracted in the first two lines of the function. The variable lp is assigned the log density of the prior on beta. The vector beta is of size two, as are the vectors mu and sigma, so everything lines up for the vectorization. Next, the variable ll is assigned the log likelihood contribution for the group. Here beta[1] is the intercept of the regression and beta[2] the slope. The predictor array x needs to be converted to a vector to allow the multiplication.
The data block is identical to that of the previous program, but repeated here for convenience. A transformed data block computes the data structures needed for the mapping by organizing the data into arrays indexed by group.
data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
transformed data {
  int<lower = 0> J = N / K;
  real x_r[K, J];
  int<lower = 0, upper = 1> x_i[K, J];
  {
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = to_array_1d(y[pos:end]);
      pos += J;
    }
  }
}
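The slicing logic of the transformed data block can be sketched in Python (an illustration only; the function name `pack_by_group` is invented, and like the Stan code it assumes \(N\) is a multiple of \(K\) and the observations are sorted by group):

```python
import numpy as np

def pack_by_group(x, y, K):
    """Reshape flat per-voter data into K groups of J observations each,
    mirroring the transformed data block's pos/end bookkeeping."""
    N = len(x)
    J = N // K
    x_r = np.empty((K, J))
    x_i = np.empty((K, J), dtype=int)
    pos = 0
    for k in range(K):
        end = pos + J          # Python slices exclude the endpoint
        x_r[k] = x[pos:end]    # group k's predictors
        x_i[k] = y[pos:end]    # group k's outcomes
        pos += J
    return x_r, x_i

x = np.arange(12, dtype=float)
y = np.tile([0, 1], 6)
x_r, x_i = pack_by_group(x, y, K=3)
```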
The integer J is set to the number of observations per group.45 The real data array x_r holds the predictors and the integer data array x_i holds the outcomes. The grouped data arrays are constructed by slicing the predictor vector x (and converting it to an array) and slicing the outcome array y.
Given the transformed data with groupings, the model has the same priors for the hyperparameters mu and sigma as the previous program, but moves the prior for beta and the likelihood into the mapped function. The parameter beta is now declared as an array of \(K\) two-vectors rather than a \(K \times 2\) matrix, so that each group's parameters can be passed to the mapped function as a vector.
parameters {
  vector[2] beta[K];
  vector[2] mu;
  vector<lower = 0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm, append_row(mu, sigma), beta, x_r, x_i));
}
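The target increment can be understood from a serial Python sketch of map_rect's semantics: apply the function to each shard, concatenate the returned vectors, then reduce with sum. This is an illustration only; `shard_lp` is a hypothetical stand-in that computes just the likelihood (the prior could be folded in as bl_glm does), and the data are random.

```python
import numpy as np

def map_rect(f, shared, job_params, x_r, x_i):
    """Serial stand-in for Stan's map_rect: apply f to each shard and
    concatenate the returned vectors. A parallel implementation would
    farm the shards out to threads or MPI processes."""
    return np.concatenate([f(shared, job_params[k], x_r[k], x_i[k])
                           for k in range(len(job_params))])

def shard_lp(mu_sigma, beta, x, y):
    """Toy per-shard function: Bernoulli-logit log likelihood only,
    returned as a length-1 vector."""
    u = beta[0] + beta[1] * x
    return np.array([np.sum(y * u - np.log1p(np.exp(u)))])

rng = np.random.default_rng(7)
K, J = 3, 4
beta = rng.normal(size=(K, 2))        # per-group parameters
x_r = rng.normal(size=(K, J))         # real data shards
x_i = rng.integers(0, 2, size=(K, J)) # integer data shards
mu_sigma = np.zeros(4)                # shared parameters (unused by shard_lp)

contrib = map_rect(shard_lp, mu_sigma, beta, x_r, x_i)
target = contrib.sum()
```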
The model as written here computes the prior for each group's parameters along with the likelihood contribution for the group. An alternative mapping would leave the prior in the model block and map only the likelihood. In a serial setting this shouldn't make much of a difference, but with parallelization, leaving the prior in the model block reduces communication (the prior's parameters need not be transmitted to the workers) but also reduces the amount of computation that is parallelized.
45. This makes the strong assumption that each group has the same number of observations!