1.13 Multivariate priors for hierarchical models
In hierarchical regression models (and other situations), several individual-level variables may be assigned hierarchical priors. For example, a model with multiple varying intercepts and slopes within might assign them a multivariate prior.
As an example, the individuals might be people and the outcome income, with predictors such as education level and age, and the groups might be states or other geographic divisions. The effect of education level and age as well as an intercept might be allowed to vary by state. Furthermore, there might be state-level predictors, such as average state income and unemployment level.
Multivariate regression example
Andrew Gelman and Hill (2007, chap. 13, Chapter 17) provide a discussion of a hierarchical model with \(N\) individuals organized into \(J\) groups. Each individual has a predictor row vector \(x_n\) of size \(K\); to unify the notation, they assume that \(x_{n,1} = 1\) is a fixed “intercept” predictor. To encode group membership, they assume individual \(n\) belongs to group \(jj[n] \in \{ 1, \dotsc, J \}\). Each individual \(n\) also has an observed outcome \(y_n\) taking on real values.
Likelihood
The model is a linear regression with slope and intercept coefficients varying by group, so that \(\beta_j\) is the coefficient \(K\)-vector for group \(j\). The likelihood function for individual \(n\) is then just \[ y_n \sim \textsf{normal}(x_n \, \beta_{jj[n]}, \, \sigma) \quad\text{for}\quad n \in \{ 1, \dotsc, N \}. \]
Coefficient prior
Gelman and Hill model the coefficient vectors \(\beta_j\) as being drawn from a multivariate distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\), \[ \beta_j \sim \textsf{multivariate normal}(\mu_j, \, \Sigma) \quad\text{for}\quad j \in \{ 1, \dotsc, J \}. \]
Below, we discuss the full model of Gelman and Hill, which uses group-level predictors to model \(\mu\); for now, we assume \(\mu\) is a simple vector parameter.
Hyperpriors
For hierarchical modeling, the group-level mean vector \(\mu\) and covariance matrix \(\Sigma\) must themselves be given priors. The group-level mean vector can be given a reasonable weakly-informative prior for independent coefficients, such as \[ \mu_j \sim \textsf{normal}(0,5). \] If more is known about the expected coefficient values \(\beta_{j, k}\), this information can be incorporated into the prior for \(\mu_j\).
For the prior on the covariance matrix, Gelman and Hill suggest using a scaled inverse Wishart. That choice was motivated primarily by convenience as it is conjugate to the multivariate likelihood function and thus simplifies Gibbs sampling
In Stan, there is no restriction to conjugacy for multivariate priors, and we in fact recommend a slightly different approach. Like Gelman and Hill, we decompose our prior into a scale and a matrix, but are able to do so in a more natural way based on the actual variable scales and a correlation matrix. Specifically, we define \[ \Sigma = \texttt{diag}\mathtt{\_}\texttt{matrix}(\tau) \times \Omega \times \texttt{diag}\mathtt{\_}\texttt{matrix}(\tau), \] where \(\Omega\) is a correlation matrix and \(\tau\) is the vector of coefficient scales. This mapping from scale vector \(\tau\) and correlation matrix \(\Omega\) can be inverted, using \[ \tau_k = \sqrt{\Sigma_{k,k}} \quad\textsf{and}\quad \Omega_{i, j} = \frac{\Sigma_{i, j}}{\tau_i \, \tau_j}. \]
The components of the scale vector \(\tau\) can be given any reasonable prior for scales, but we recommend something weakly informative like a half-Cauchy distribution with a small scale, such as \[ \tau_k \sim \textsf{Cauchy}(0, 2.5) \quad\text{for}\quad k \in \{ 1, \dotsc, K \} \quad\text{constrained\ by}\quad \tau_k > 0. \] As for the prior means, if there is information about the scale of variation of coefficients across groups, it should be incorporated into the prior for \(\tau\). For large numbers of exchangeable coefficients, the components of \(\tau\) itself (perhaps excluding the intercept) may themselves be given a hierarchical prior.
Our final recommendation is to give the correlation matrix \(\Omega\) an LKJ prior with shape \(\eta \geq 1\),5
\[ \Omega \sim \textsf{LKJCorr}(\eta). \]
The LKJ correlation distribution is defined by \[ \textsf{LKJCorr}\left(\Sigma \mid \eta\right) \propto \operatorname{det}\left(\Sigma\right)^{\eta - 1}. \]
The basic behavior of the LKJ correlation distribution is similar to that of a beta distribution. For \(\eta = 1\), the result is a uniform distribution. Despite being the identity over correlation matrices, the marginal distribution over the entries in that matrix (i.e., the correlations) is not uniform between -1 and 1. Rather, it concentrates around zero as the dimensionality increases due to the complex constraints.
For \(\eta > 1\), the density increasingly concentrates mass around the unit matrix, i.e., favoring less correlation. For \(\eta < 1\), it increasingly concentrates mass in the other direction, i.e., favoring more correlation.
The LKJ prior may thus be used to control the expected amount of correlation among the parameters \(\beta_j\). For a discussion of decomposing a covariance prior into a prior on correlation matrices and an independent prior on scales, see Barnard, McCulloch, and Meng (2000).
Group-level predictors for prior mean
To complete Gelman and Hill’s model, suppose each group \(j \in \{ 1, \dotsc, J \}\) is supplied with an \(L\)-dimensional row-vector of group-level predictors \(u_j\). The prior mean for the \(\beta_j\) can then itself be modeled as a regression, using an \(L\)-dimensional coefficient vector \(\gamma\). The prior for the group-level coefficients then becomes \[ \beta_j \sim \textsf{multivariate normal}(u_j \, \gamma, \Sigma) \]
The group-level coefficients \(\gamma\) may themselves be given independent weakly informative priors, such as \[ \gamma_l \sim \textsf{normal}(0,5). \] As usual, information about the group-level means should be incorporated into this prior.
Coding the model in Stan
The Stan code for the full hierarchical model with multivariate priors on the group-level coefficients and group-level prior means follows its definition.
data {
int<lower=0> N; // num individuals
int<lower=1> K; // num ind predictors
int<lower=1> J; // num groups
int<lower=1> L; // num group predictors
array[N] int<lower=1, upper=J> jj; // group for individual
matrix[N, K] x; // individual predictors
array[J] row_vector[L] u; // group predictors
vector[N] y; // outcomes
}parameters {
corr_matrix[K] Omega; // prior correlation
vector<lower=0>[K] tau; // prior scale
matrix[L, K] gamma; // group coeffs
array[J] vector[K] beta; // indiv coeffs by group
real<lower=0> sigma; // prediction error scale
}model {
0, 2.5);
tau ~ cauchy(2);
Omega ~ lkj_corr(0, 5);
to_vector(gamma) ~ normal(
{array[J] row_vector[K] u_gamma;
for (j in 1:J) {
u_gamma[j] = u[j] * gamma;
}
beta ~ multi_normal(u_gamma, quad_form_diag(Omega, tau));
}for (n in 1:N) {
y[n] ~ normal(x[n] * beta[jj[n]], sigma);
} }
The hyperprior covariance matrix is defined implicitly through the a
quadratic form in the code because the correlation matrix Omega
and
scale vector tau
are more natural to inspect in the output; to
output Sigma
, define it as a transformed parameter. The function
quad_form_diag
is defined so that quad_form_diag(Sigma, tau)
is
equivalent to diag_matrix(tau) * Sigma * diag_matrix(tau)
, where
diag_matrix(tau)
returns the matrix with tau
on the diagonal and
zeroes off diagonal; the version using quad_form_diag
should be
faster. For details on these and other matrix arithmetic operators and
functions, see the function reference manual.
Optimization through vectorization
The code in the Stan program above can be sped up dramatically by replacing:
for (n in 1:N) {
y[n] ~ normal(x[n] * beta[jj[n]], sigma); }
with the vectorized form:
{vector[N] x_beta_jj;
for (n in 1:N) {
x_beta_jj[n] = x[n] * beta[jj[n]];
}
y ~ normal(x_beta_jj, sigma); }
The outer brackets create a local scope in which to define the
variable x_beta_jj
, which is then filled in a loop and used
to define a vectorized sampling statement. The reason this is such a
big win is that it allows us to take the log of sigma only once and it
greatly reduces the size of the resulting expression graph by packing
all of the work into a single density function.
Although it is tempting to redeclare beta
and include a revised
model block sampling statement,
parameters {
matrix[J, K] beta;
// ...
}model {
y ~ normal(rows_dot_product(x, beta[jj]), sigma);// ...
}
this fails because it breaks the vectorization of sampling for
beta
,6
beta ~ multi_normal(...);
which requires beta
to be an array of vectors. Both
vectorizations are important, so the best solution is to just use the
loop above, because rows_dot_product
cannot do much
optimization in and of itself because there are no shared computations.
The code in the Stan program above also builds up an array of vectors for the outcomes and for the multivariate normal, which provides a major speedup by reducing the number of linear systems that need to be solved and differentiated.
{matrix[K, K] Sigma_beta;
Sigma_beta = quad_form_diag(Omega, tau);for (j in 1:J) {
beta[j] ~ multi_normal((u[j] * gamma)', Sigma_beta);
} }
In this example, the covariance matrix Sigma_beta
is defined
as a local variable so as not to have to repeat the quadratic form
computation \(J\) times. This vectorization can be combined with the
Cholesky-factor optimization in the next section.
Optimization through Cholesky factorization
The multivariate normal density and LKJ prior on correlation matrices both require their matrix parameters to be factored. Vectorizing, as in the previous section, ensures this is only done once for each density. An even better solution, both in terms of efficiency and numerical stability, is to parameterize the model directly in terms of Cholesky factors of correlation matrices using the multivariate version of the non-centered parameterization. For the model in the previous section, the program fragment to replace the full matrix prior with an equivalent Cholesky factorized prior is as follows.
data {
matrix[L, J] u; // group predictors transposed
// ...
}parameters {
matrix[K, J] z;
cholesky_factor_corr[K] L_Omega;
matrix[K, L] gamma;
// ...
}transformed parameters {
matrix[K, J] beta;
beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
}model {
to_vector(z) ~ std_normal();2);
L_Omega ~ lkj_corr_cholesky(// ...
}
The data variable u
was originally an array of vectors, which
is efficient for access; here it is redeclared as a matrix in order to
use it in matrix arithmetic. Moreover, it is transposed, along with
gamma
and beta
, to minimize the number of transposition operations.
The new parameter L_Omega
is the Cholesky factor of the original
correlation matrix Omega
, so that
Omega = L_Omega * L_Omega'
The prior scale vector tau
is unchanged, and furthermore,
pre-multiplying the Cholesky factor by the scale produces the Cholesky
factor of the final covariance matrix,
Sigma_beta
= quad_form_diag(Omega, tau)
= diag_pre_multiply(tau, L_Omega) * diag_pre_multiply(tau, L_Omega)'
where the diagonal pre-multiply compound operation is defined by
diag_pre_multiply(a, b) = diag_matrix(a) * b
The new variable z
is declared as a matrix, the entries of
which are given independent standard normal priors; the to_vector
operation turns the matrix into a vector so that it can be used as a
vectorized argument to the univariate normal density. This results in every
column of z
being a \(K\)-variate normal random vector with the identity as
covariance matrix. Therefore, multiplying z
by the Cholesky factor of the
covariance matrix and adding the mean (u * gamma)'
produces a beta
distributed as in the original model, where the variance is,
letting \(L = \mathrm{diag}(\tau)\,\Omega_L\),
\[ \begin{aligned} \mathbb{V}[\beta] &= \mathbb{E}\big((L \, z)(L \, z)^\top) \\ &= \mathbb{E}\big((L \, z \, z^\top \, L^\top) \\ &= L \, \mathbb{E}(z \, z^\top) \, L^\top \\ &= L \, L^\top =(\mathrm{diag}(\tau)\,\Omega_L)\,(\mathrm{diag}(\tau)\,\Omega_L)^\top \\ &= \mathrm{diag}(\tau)\,\Omega\,\mathrm{diag}(\tau) \\ &= \Sigma. \end{aligned} \] Where we have used the linearity of expectations (line 2 to 3), the definition of \(\Omega = \Omega_L \, \Omega_L^\top\), and the fact that \(\mathbb{E}(z \, z^\top) = I\) since \(z \sim \mathcal{N}(0, I)\).
Omitting the remaining data declarations, which are the same as before
with the exception of u
, the optimized model is as follows.
parameters {
matrix[K, J] z;
cholesky_factor_corr[K] L_Omega;
vector<lower=0, upper=pi() / 2>[K] tau_unif; // prior scale
matrix[K, L] gamma; // group coeffs
real<lower=0> sigma; // prediction error scale
}transformed parameters {
vector<lower=0>[K] tau = 2.5 * tan(tau_unif);
matrix[K, J] beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
}model {
vector[N] mu;
for(n in 1:N) {
mu[n] = x[n, ] * beta[, jj[n]];
}
to_vector(z) ~ std_normal();2);
L_Omega ~ lkj_corr_cholesky(0, 5);
to_vector(gamma) ~ normal(
y ~ normal(mu, sigma); }
This model also reparameterizes the prior scale tau
to avoid
potential problems with the heavy tails of the Cauchy
distribution. The statement tau_unif ~ uniform(0, pi() / 2)
can be
omitted from the model block because Stan increments the log posterior
for parameters with uniform priors without it.
References
The prior is named for Lewandowski, Kurowicka, and Joe, as it was derived by inverting the random correlation matrix generation strategy of Lewandowski, Kurowicka, and Joe (2009).↩︎
Thanks to Mike Lawrence for pointing this out in the GitHub issue for the manual.↩︎