1.9 Hierarchical Logistic Regression
The simplest multilevel model is a hierarchical model in which the data are grouped into \(L\) distinct categories (or levels). An extreme approach would be to completely pool all the data and estimate a common vector of regression coefficients \(\beta\). At the other extreme, an approach with no pooling assigns each level \(l\) its own coefficient vector \(\beta_l\) that is estimated separately from the other levels. A hierarchical model is an intermediate solution where the degree of pooling is determined by the data and a prior on the amount of pooling.
Suppose each binary outcome \(y_n \in \{ 0, 1 \}\) has an associated level, \(ll_n \in \{ 1, \dotsc, L \}\). Each outcome will also have an associated predictor vector \(x_n \in \mathbb{R}^D\). Each level \(l\) gets its own coefficient vector \(\beta_l \in \mathbb{R}^D\). The hierarchical structure involves drawing the coefficients \(\beta_{l,d} \in \mathbb{R}\) from a prior that is also estimated with the data. This hierarchically estimated prior determines the amount of pooling. If the data in each level are similar, strong pooling will be reflected in low hierarchical variance. If the data in the levels are dissimilar, weaker pooling will be reflected in higher hierarchical variance.
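Written out, the hierarchical structure and likelihood just described amount to
\[
y_n \sim \textsf{Bernoulli}\left(\operatorname{logit}^{-1}\left(x_n \, \beta_{ll_n}\right)\right),
\qquad
\beta_{l,d} \sim \textsf{normal}\left(\mu_d, \sigma_d\right),
\]
where the hyperparameters \(\mu_d\) and \(\sigma_d\) giving the location and scale of the coefficient prior are themselves estimated from the data.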
The following model encodes a hierarchical logistic regression model with a hierarchical prior on the regression coefficients.
data {
  int<lower=1> D;                  // number of predictors
  int<lower=0> N;                  // number of observations
  int<lower=1> L;                  // number of levels
  int<lower=0, upper=1> y[N];      // binary outcomes
  int<lower=1, upper=L> ll[N];     // level of each observation
  row_vector[D] x[N];              // predictor vectors
}
parameters {
  real mu[D];                      // hierarchical prior locations
  real<lower=0> sigma[D];          // hierarchical prior scales
  vector[D] beta[L];               // per-level regression coefficients
}
model {
  for (d in 1:D) {
    mu[d] ~ normal(0, 100);
    for (l in 1:L)
      beta[l, d] ~ normal(mu[d], sigma[d]);
  }
  for (n in 1:N)
    y[n] ~ bernoulli(inv_logit(x[n] * beta[ll[n]]));
}
The standard deviation parameter sigma gets an implicit uniform prior on \((0,\infty)\) because of its declaration with a lower-bound constraint of zero. Stan allows improper priors as long as the posterior is proper. Nevertheless, it is usually helpful to have informative or at least weakly informative priors for all parameters; see the regression priors section for recommendations on priors for regression coefficients and scales.
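For instance, one weakly informative possibility (the scale of 5 here is illustrative, not a recommendation drawn from that section) is to give each scale a half-normal prior, which the lower-bound constraint on sigma produces automatically from a normal sampling statement:

  sigma ~ normal(0, 5);    // half-normal because of the lower=0 constraint; scale is illustrative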
Optimizing the Model
Where possible, vectorizing sampling statements leads to faster log probability and derivative evaluations. The speed boost is not because loops are eliminated, but because vectorization allows sharing subcomputations in the log probability and gradient calculations and because it reduces the size of the expression tree required for gradient calculations.
The first optimization vectorizes the for-loop over D as follows.
mu ~ normal(0, 100);
for (l in 1:L)
  beta[l] ~ normal(mu, sigma);
The declaration of beta as an array of vectors means that the expression beta[l] denotes a vector. Although beta could have been declared as a matrix, an array of vectors (or a two-dimensional array) is more efficient for accessing rows; see the indexing efficiency section for more information on the efficiency tradeoffs among arrays, vectors, and matrices.
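For comparison, a sketch of the matrix-based alternative (beta_mat is a hypothetical name) would look like the following; indexing beta_mat[ll[n]] extracts a row as a row_vector, which must then be transposed before it can be multiplied by the row vector x[n].

  parameters {
    matrix[L, D] beta_mat;   // hypothetical alternative to vector[D] beta[L]
  }
  model {
    // priors on mu, sigma, and beta_mat as before
    for (n in 1:N)
      y[n] ~ bernoulli_logit(x[n] * beta_mat[ll[n]]');   // row extraction plus transpose
  }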
This model can be further sped up, and at the same time made more arithmetically stable, by replacing the application of the inverse-logit inside the Bernoulli distribution with the logit-parameterized Bernoulli (see the note at the end of this section):
for (n in 1:N)
  y[n] ~ bernoulli_logit(x[n] * beta[ll[n]]);
Unlike in R or BUGS, loops, array access, and assignments are fast in Stan because they are translated directly to C++ code. In most cases, the cost of allocating and assigning to a container is more than made up for by the increased efficiency due to vectorizing the log probability and gradient calculations. Thus the following version is faster than the original formulation as a loop over a sampling statement.
{
  vector[N] x_beta_ll;
  for (n in 1:N)
    x_beta_ll[n] = x[n] * beta[ll[n]];
  y ~ bernoulli_logit(x_beta_ll);
}
The brackets introduce a new scope for the local variable x_beta_ll; alternatively, the variable may be declared at the top of the model block.
In some cases, such as the above, the local variable assignment leads to models that are less readable. The recommended practice in such cases is to develop and debug the more transparent version of the model first, and only then work on optimizations.
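For reference, once the simpler formulation has been debugged, assembling the optimizations above yields a model block along these lines (this simply combines the snippets already shown):

  model {
    vector[N] x_beta_ll;
    mu ~ normal(0, 100);
    for (l in 1:L)
      beta[l] ~ normal(mu, sigma);
    for (n in 1:N)
      x_beta_ll[n] = x[n] * beta[ll[n]];
    y ~ bernoulli_logit(x_beta_ll);
  }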
The Bernoulli-logit distribution builds in the logit link function, taking \[\texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right) = \texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right).\]