1.13 Multivariate priors for hierarchical models

This is an old version, view current version.

1.13 Multivariate priors for hierarchical models

In hierarchical regression models (and other situations), several individual-level variables may be assigned hierarchical priors. For example, a model with multiple varying intercepts and slopes within might assign them a multivariate prior.

As an example, the individuals might be people and the outcome income, with predictors such as education level and age, and the groups might be states or other geographic divisions. The effect of education level and age as well as an intercept might be allowed to vary by state. Furthermore, there might be state-level predictors, such as average state income and unemployment level.

Multivariate regression example

Andrew Gelman and Hill (2007, chap. 13, Chapter 17) provide a discussion of a hierarchical model with \(N\) individuals organized into \(J\) groups. Each individual has a predictor row vector \(x_n\) of size \(K\); to unify the notation, they assume that \(x_{n,1} = 1\) is a fixed “intercept” predictor. To encode group membership, they assume individual \(n\) belongs to group \(jj[n] \in \{ 1, \dotsc, J \}\). Each individual \(n\) also has an observed outcome \(y_n\) taking on real values.

Likelihood

The model is a linear regression with slope and intercept coefficients varying by group, so that \(\beta_j\) is the coefficient \(K\)-vector for group \(j\). The likelihood function for individual \(n\) is then just \[ y_n \sim \textsf{normal}(x_n \, \beta_{jj[n]}, \, \sigma) \quad\text{for}\quad n \in \{ 1, \dotsc, N \}. \]

Coefficient prior

Gelman and Hill model the coefficient vectors \(\beta_j\) as being drawn from a multivariate distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\), \[ \beta_j \sim \textsf{multivariate normal}(\mu_j, \, \Sigma) \quad\text{for}\quad j \in \{ 1, \dotsc, J \}. \]

Below, we discuss the full model of Gelman and Hill, which uses group-level predictors to model \(\mu\); for now, we assume \(\mu\) is a simple vector parameter.

Hyperpriors

For hierarchical modeling, the group-level mean vector \(\mu\) and covariance matrix \(\Sigma\) must themselves be given priors. The group-level mean vector can be given a reasonable weakly-informative prior for independent coefficients, such as \[ \mu_j \sim \textsf{normal}(0,5). \] If more is known about the expected coefficient values \(\beta_{j, k}\), this information can be incorporated into the prior for \(\mu_j\).

For the prior on the covariance matrix, Gelman and Hill suggest using a scaled inverse Wishart. That choice was motivated primarily by convenience as it is conjugate to the multivariate likelihood function and thus simplifies Gibbs sampling

In Stan, there is no restriction to conjugacy for multivariate priors, and we in fact recommend a slightly different approach. Like Gelman and Hill, we decompose our prior into a scale and a matrix, but are able to do so in a more natural way based on the actual variable scales and a correlation matrix. Specifically, we define \[ \Sigma = \texttt{diag}\mathtt{\_}\texttt{matrix}(\tau) \times \Omega \times \texttt{diag}\mathtt{\_}\texttt{matrix}(\tau), \] where \(\Omega\) is a correlation matrix and \(\tau\) is the vector of coefficient scales. This mapping from scale vector \(\tau\) and correlation matrix \(\Omega\) can be inverted, using \[ \tau_k = \sqrt{\Sigma_{k,k}} \quad\textsf{and}\quad \Omega_{i, j} = \frac{\Sigma_{i, j}}{\tau_i \, \tau_j}. \]

The components of the scale vector \(\tau\) can be given any reasonable prior for scales, but we recommend something weakly informative like a half-Cauchy distribution with a small scale, such as \[ \tau_k \sim \textsf{Cauchy}(0, 2.5) \quad\text{for}\quad k \in \{ 1, \dotsc, K \} \quad\text{constrained\ by}\quad \tau_k > 0. \] As for the prior means, if there is information about the scale of variation of coefficients across groups, it should be incorporated into the prior for \(\tau\). For large numbers of exchangeable coefficients, the components of \(\tau\) itself (perhaps excluding the intercept) may themselves be given a hierarchical prior.

Our final recommendation is to give the correlation matrix \(\Omega\) an LKJ prior with shape \(\eta \geq 1\),⁵

\[ \Omega \sim \textsf{LKJCorr}(\eta). \]

The LKJ correlation distribution is defined by \[ \textsf{LKJCorr}\left(\Sigma \mid \eta\right) \propto \operatorname{det}\left(\Sigma\right)^{\eta - 1}. \]

The basic behavior of the LKJ correlation distribution is similar to that of a beta distribution. For \(\eta = 1\), the result is a uniform distribution. Despite being the identity over correlation matrices, the marginal distribution over the entries in that matrix (i.e., the correlations) is not uniform between -1 and 1. Rather, it concentrates around zero as the dimensionality increases due to the complex constraints.

For \(\eta > 1\), the density increasingly concentrates mass around the unit matrix, i.e., favoring less correlation. For \(\eta < 1\), it increasingly concentrates mass in the other direction, i.e., favoring more correlation.

The LKJ prior may thus be used to control the expected amount of correlation among the parameters \(\beta_j\). For a discussion of decomposing a covariance prior into a prior on correlation matrices and an independent prior on scales, see Barnard, McCulloch, and Meng (2000).

Group-level predictors for prior mean

To complete Gelman and Hill’s model, suppose each group \(j \in \{ 1, \dotsc, J \}\) is supplied with an \(L\)-dimensional row-vector of group-level predictors \(u_j\). The prior mean for the \(\beta_j\) can then itself be modeled as a regression, using an \(L\)-dimensional coefficient vector \(\gamma\). The prior for the group-level coefficients then becomes \[ \beta_j \sim \textsf{multivariate normal}(u_j \, \gamma, \Sigma) \]

The group-level coefficients \(\gamma\) may themselves be given independent weakly informative priors, such as \[ \gamma_l \sim \textsf{normal}(0,5). \] As usual, information about the group-level means should be incorporated into this prior.

Coding the model in Stan

The Stan code for the full hierarchical model with multivariate priors on the group-level coefficients and group-level prior means follows its definition.

data {
  int<lower=0> N;              // num individuals
  int<lower=1> K;              // num ind predictors
  int<lower=1> J;              // num groups
  int<lower=1> L;              // num group predictors
  int<lower=1,upper=J> jj[N];  // group for individual
  matrix[N, K] x;              // individual predictors
  row_vector[L] u[J];          // group predictors
  vector[N] y;                 // outcomes
}
parameters {
  corr_matrix[K] Omega;        // prior correlation
  vector<lower=0>[K] tau;      // prior scale
  matrix[L, K] gamma;          // group coeffs
  vector[K] beta[J];           // indiv coeffs by group
  real<lower=0> sigma;         // prediction error scale
}
model {
  tau ~ cauchy(0, 2.5);
  Omega ~ lkj_corr(2);
  to_vector(gamma) ~ normal(0, 5);
  {
    row_vector[K] u_gamma[J];
    for (j in 1:J)
      u_gamma[j] = u[j] * gamma;
    beta ~ multi_normal(u_gamma, quad_form_diag(Omega, tau));
  }
  for (n in 1:N)
    y[n] ~ normal(x[n] * beta[jj[n]], sigma);
}

The hyperprior covariance matrix is defined implicitly through the a quadratic form in the code because the correlation matrix Omega and scale vector tau are more natural to inspect in the output; to output Sigma, define it as a transformed parameter. The function quad_form_diag is defined so that quad_form_diag(Sigma, tau) is equivalent to diag_matrix(tau) * Sigma * diag_matrix(tau), where diag_matrix(tau) returns the matrix with tau on the diagonal and zeroes off diagonal; the version using quad_form_diag should be faster. For details on these and other matrix arithmetic operators and functions, see the function reference manual.

Optimization through vectorization

The code in the Stan program above can be sped up dramatically by replacing:

for (n in 1:N)
  y[n] ~ normal(x[n] * beta[jj[n]], sigma);

with the vectorized form:

{
  vector[N] x_beta_jj;
  for (n in 1:N)
    x_beta_jj[n] = x[n] * beta[jj[n]];
  y ~ normal(x_beta_jj, sigma);
}

The outer brackets create a local scope in which to define the variable x_beta_jj, which is then filled in a loop and used to define a vectorized sampling statement. The reason this is such a big win is that it allows us to take the log of sigma only once and it greatly reduces the size of the resulting expression graph by packing all of the work into a single density function.

Although it is tempting to redeclare beta and include a revised model block sampling statement,

parameters {
  matrix[J, K] beta;
...
model {
  y ~ normal(rows_dot_product(x, beta[jj]), sigma);
  ...

this fails because it breaks the vectorization of sampling for beta,⁶

beta ~ multi_normal(...);

which requires beta to be an array of vectors. Both vectorizations are important, so the best solution is to just use the loop above, because rows_dot_product cannot do much optimization in and of itself because there are no shared computations.

The code in the Stan program above also builds up an array of vectors for the outcomes and for the multivariate normal, which provides a major speedup by reducing the number of linear systems that need to be solved and differentiated.

{
  matrix[K, K] Sigma_beta;
  Sigma_beta = quad_form_diag(Omega, tau);
  for (j in 1:J)
    beta[j] ~ multi_normal((u[j] * gamma)', Sigma_beta);
}

In this example, the covariance matrix Sigma_beta is defined as a local variable so as not to have to repeat the quadratic form computation \(J\) times. This vectorization can be combined with the Cholesky-factor optimization in the next section.

Optimization through Cholesky factorization

The multivariate normal density and LKJ prior on correlation matrices both require their matrix parameters to be factored. Vectorizing, as in the previous section, ensures this is only done once for each density. An even better solution, both in terms of efficiency and numerical stability, is to parameterize the model directly in terms of Cholesky factors of correlation matrices using the multivariate version of the non-centered parameterization. For the model in the previous section, the program fragment to replace the full matrix prior with an equivalent Cholesky factorized prior is as follows.

data {
  matrix[L, J] u;              // group predictors transposed
  ...
parameters {
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  matrix[K, L] gamma;
  ...
transformed parameters {
  matrix[K, J] beta;
  beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
}
model {
  to_vector(z) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(2);
  ...

The data variable u was originally an array of vectors, which is efficient for access; here it is redeclared as a matrix in order to use it in matrix arithmetic. Moreover, it is transposed, along with gamma and beta, to minimize the number of transposition operations. The new parameter L_Omega is the Cholesky factor of the original correlation matrix Omega, so that

Omega = L_Omega * L_Omega'

The prior scale vector tau is unchanged, and furthermore, pre-multiplying the Cholesky factor by the scale produces the Cholesky factor of the final covariance matrix,

Sigma_beta
  = quad_form_diag(Omega, tau)
  = diag_pre_multiply(tau, L_Omega) * diag_pre_multiply(tau, L_Omega)'

where the diagonal pre-multiply compound operation is defined by

diag_pre_multiply(a, b) = diag_matrix(a) * b

The new variable z is declared as a matrix, the entries of which are given independent standard normal priors; the to_vector operation turns the matrix into a vector so that it can be used as a vectorized argument to the univariate normal density. This results in every column of z being a \(K\)-variate normal random vector with the identity as covariance matrix. Therefore, multiplying z by the Cholesky factor of the covariance matrix and adding the mean (u * gamma)' produces a beta distributed as in the original model, where the variance is, letting \(L = \mathrm{diag}(\tau)\,\Omega_L\),

\[ \begin{aligned} \mathbb{V}[\beta] &= \mathbb{E}\big((L \, z)(L \, z)^\top) \\ &= \mathbb{E}\big((L \, z \, z^\top \, L^\top) \\ &= L \, \mathbb{E}(z \, z^\top) \, L^\top \\ &= L \, L^\top =(\mathrm{diag}(\tau)\,\Omega_L)\,(\mathrm{diag}(\tau)\,\Omega_L)^\top \\ &= \mathrm{diag}(\tau)\,\Omega\,\mathrm{diag}(\tau) \\ &= \Sigma. \end{aligned} \] Where we have used the linearity of expectations (line 2 to 3), the definition of \(\Omega = \Omega_L \, \Omega_L^\top\), and the fact that \(\mathbb{E}(z \, z^\top) = I\) since \(z \sim \mathcal{N}(0, I)\).

Omitting the remaining data declarations, which are the same as before with the exception of u, the optimized model is as follows.

parameters {
  matrix[K, J] z;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0,upper=pi()/2>[K] tau_unif;  // prior scale
  matrix[K, L] gamma;                        // group coeffs
  real<lower=0> sigma;                       // prediction error scale
}
transformed parameters {
  vector<lower=0>[K] tau = 2.5 * tan(tau_unif);
  matrix[K, J] beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
}
model {
  vector[N] mu;
  for(n in 1:N) {
    mu[n] = x[n, ] * beta[, jj[n]];
  }
  to_vector(z) ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(2);
  to_vector(gamma) ~ normal(0, 5);
  y ~ normal(mu, sigma);
}

This model also reparameterizes the prior scale tau to avoid potential problems with the heavy tails of the Cauchy distribution. The statement tau_unif ~ uniform(0,pi()/2) can be omitted from the model block because Stan increments the log posterior for parameters with uniform priors without it.

References

Barnard, John, Robert McCulloch, and Xiao-Li Meng. 2000. “Modeling Covariance Matrices in Terms of Standard Deviations and Correlations, with Application to Shrinkage.” Statistica Sinica, 1281–311.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.

Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100: 1989–2001.

The prior is named for Lewandowski, Kurowicka, and Joe, as it was derived by inverting the random correlation matrix generation strategy of Lewandowski, Kurowicka, and Joe (2009).↩︎
Thanks to Mike Lawrence for pointing this out in the GitHub issue for the manual.↩︎