23.4 Example: Hierarchical Logistic Regression

This is an old version, view current version.

23.4 Example: Hierarchical Logistic Regression

Consider a hierarchical model of American presidential voting behavior based on state of residence.⁴³

Each of the fifty states $k \in 1{:}50$ will have its own slope $\beta_k$ and intercept $\alpha_k$ to model the log odds of voting for the Republican candidate as a function of income. Suppose there are $N$ voters and with voter $n \in 1{:}N$ being in state $s[n]$ with income $x_n$ . The likelihood for the vote $y_n \in \{ 0, 1 \}$ is

$y_n \sim \mathsf{Bernoulli} \left( \mathrm{logit}^{-1}\left( \alpha_{s[n]} + \beta_{s[n]} \, x_n \right) \right).$

The slopes and intercepts get hierarchical priors,

$\alpha_k \sim \mathsf{normal}(\mu_{\alpha}, \sigma_{\alpha}) \\ \beta_k \sim \mathsf{normal}(\mu_{\beta}, \sigma_{\beta})$

Unmapped Implementation

This model can be coded up in Stan directly as follows.

data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
parameters {
  matrix[K,2] beta;
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  for (i in 1:2)
    beta[ , i] ~ normal(mu[i], sigma[i]);
  y ~ bernoulli_logit(beta[kk, 1] + beta[kk, 2] .* x);
}

For this model the vector of predictors x is coded as a vector, corresponding to how it is used in the likelihood. The priors for mu and sigma are vectorized. The priors on the two components of beta (intercept and slope, respectively) are stored in a $K \times 2$ matrix.

The likelihood is also vectorized using multi-indexing with index kk for the states and elementwise multiplication (.*) for the income x. The vectorized likelihood works out to the same thing as the following less efficient looped form.

for (n in 1:N)
  y[n] ~ bernoulli_logit(beta[kk[n], 1] + beta[kk[n], 2] * x[n]);

Mapped Implementation

The mapped version of the model will map over the states K. This means the group-level parameters, real data, and integer-data must be arrays of the same size.

The mapped implementation requires a function to be mapped. The following function evaluates both the likelihood for the data observed for a group as well as the prior for the group-specific parameters (the name bl_glm derives from the fact that it’s a generalized linear model with a Bernoulli likelihood and logistic link function).

functions {
 vector bl_glm(vector mu_sigma, vector beta,
               real[] x, int[] y) {
   vector[2] mu = mu_sigma[1:2];
   vector[2] sigma = mu_sigma[3:4];
   real lp = normal_lpdf(beta | mu, sigma);
   real ll = bernoulli_logit_lpmf(y | beta[1] + beta[2] * to_vector(x));
   return [lp + ll]';
 }
}

The shared parameter mu_sigma contains the locations (mu_sigma[1:2]) and scales (mu_sigma[3:4]) of the priors, which are extracted in the first two lines of the program. The variable lp is assigned the log density of the prior on beta. The vector beta is of size two, as are the vectors mu and sigma, so everything lines up for the vectorization. Next, the variable ll is assigned to the log likelihood contribution for the group. Here beta[1] is the intercept of the regression and beta[2] the slope. The predictor array x needs to be converted to a vector allow the multiplication.

The data block is identical to that of the previous program, but repeated here for convenience. A transformed data block computes the data structures needed for the mapping by organizing the data into arrays indexed by group.

data {
  int<lower = 0> K;
  int<lower = 0> N;
  int<lower = 1, upper = K> kk[N];
  vector[N] x;
  int<lower = 0, upper = 1> y[N];
}
transformed data {
  int<lower = 0> J = N / K;
  real x_r[K, J];
  int<lower = 0, upper = 1> x_i[K, J];
  {
    int pos = 1;
    for (k in 1:K) {
      int end = pos + J - 1;
      x_r[k] = to_array_1d(x[pos:end]);
      x_i[k] = to_array_1d(y[pos:end]);
      pos += J;
    }
  }
}

The integer J is set to the number of observations per group.⁴⁴

The real data array x_r holds the predictors and the integer data array x_i holds the outcomes. The grouped data arrays are constructed by slicing the predictor vector x (and converting it to an array) and slicing the outcome array y.

Given the transformed data with groupings, the parameters are the same as the previous program. The model has the same priors for the hyperparameters mu and sigma, but moves the prior for beta and the likelihood to the mapped function.

parameters {
  vector[2] beta[K];
  vector[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 2);
  target += sum(map_rect(bl_glm, append_row(mu, sigma), beta, x_r, x_i));
}

The model as written here computes the priors for each group’s parameters along with the likelihood contribution for the group. An alternative mapping would leave the prior in the model block and only map the likelihood. In a serial setting this shouldn’t make much of a difference, but with parallelization, there is reduced communication (the prior’s parameters need not be transmitted) and also reduced parallelization with the version that leaves the prior in the model block.

This example is a simplified form of the model described in (Gelman and Hill 2007, Section 14.2)↩
This makes the strong assumption that each group has the same number of observations!↩