This document details sparse exact conditional autoregressive (CAR) models in Stan as an extension of previous work on approximate sparse CAR models in Stan. Sparse representations seem to give order of magnitude efficiency gains, scaling better for large spatial data sets.

CAR priors for spatial random effects

Conditional autoregressive (CAR) models are popular as prior distributions for spatial random effects with areal spatial data. If we have a random quantity \(\phi = (\phi_1, \phi_2, ..., \phi_n)'\) at \(n\) areal locations, the CAR model is often expressed via full conditional distributions:

\[\phi_i \mid \phi_j, j \neq i \sim N(\alpha \sum_{j = 1}^n b_{ij} \phi_j, \tau_i^{-1})\]

where \(\tau_i\) is a spatially varying precision parameter, and \(b_{ii} = 0\).

By Brook’s Lemma, the joint distribution of \(\phi\) is then:

\[\phi \sim N(0, [D_\tau (I - \alpha B)]^{-1}).\]

If we assume the following:

then the CAR prior specification simplifies to:

\[\phi \sim N(0, [\tau (D - \alpha W)]^{-1}).\]

The \(\alpha\) parameter ensures propriety of the joint distrbution of \(\phi\) as long as \(| \alpha | < 1\) (Gelfand & Vounatsou 2003). However, \(\alpha\) is often taken as 1, leading to the IAR specification which creates a singular precision matrix and an improper prior distribution.

A Poisson specification

Suppose we have aggregated count data \(y_1, y_2, ..., y_n\) at \(n\) locations, and we expect that neighboring locations will have similar counts. With a Poisson likelihood:

\[y_i \sim \text{Poisson}(\text{exp}(X_{i} \beta + \phi_i + \log(\text{offset}_i)))\]

where \(X_i\) is a design vector (the \(i^{th}\) row from a design matrix), \(\beta\) is a vector of coefficients, \(\phi_i\) is a spatial adjustment, and \(\log(\text{offset}_i)\) accounts for differences in expected values or exposures at the spatial units (popular choices include area for physical processes, or population size for disease applications).

If we specify a proper CAR prior for \(\phi\), then we have that \(\phi \sim \text{N}(0, [\tau (D - \alpha W)]^{-1})\) where \(\tau (D - \alpha W)\) is the precision matrix \(\Sigma^{-1}\). A complete Bayesian specification would include priors for the remaining parameters \(\alpha\), \(\tau\), and \(\beta\), such that our posterior distribution is:

\[p(\phi, \beta, \alpha, \tau \mid y) \propto p(y \mid \beta, \phi) p(\phi \mid \alpha, \tau) p(\alpha) p(\tau) p(\beta)\]

Example: Scottish lip cancer data

To demonstrate this approach we’ll use the Scottish lip cancer data example (some documentation here). This data set includes observed lip cancer case counts at 56 spatial units in Scotland, with an expected number of cases to be used as an offset, and an area-specific continuous covariate that represents the proportion of the population employed in agriculture, fishing, or forestry. The model structure is identical to the Poisson model outlined above.

Let’s start by loading packages and data, specifying the number of MCMC iterations and chains.

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
source('data/scotland_lip_cancer.RData')

# Define MCMC parameters 
niter <- 1E4   # definitely overkill, but good for comparison
nchains <- 4

To fit the full model, we’ll pull objects loaded with our Scotland lip cancer data. I’ll use model.matrix to generate a design matrix, centering and scaling the continuous covariate x to reduce correlation between the intercept and slope estimates.

W <- A # adjacency matrix
scaled_x <- c(scale(x))
X <- model.matrix(~scaled_x)
  
full_d <- list(n = nrow(X),         # number of observations
               p = ncol(X),         # number of coefficients
               X = X,               # design matrix
               y = O,               # observed number of cases
               log_offset = log(E), # log(expected) num. cases
               W = W)               # adjacency matrix

Stan implementation: CAR with multi_normal_prec

Our model statement mirrors the structure outlined above, with explicit normal and gamma priors on \(\beta\) and \(\tau\) respectively, and a \(\text{Uniform}(0, 1)\) prior for \(\alpha\). The prior on \(\phi\) is specified via the multi_normal_prec function, passing in \(\tau (D - \alpha W)\) as the precision matrix.

data {
  int<lower = 1> n;
  int<lower = 1> p;
  matrix[n, p] X;
  int<lower = 0> y[n];
  vector[n] log_offset;
  matrix<lower = 0, upper = 1>[n, n] W;
}
transformed data{
  vector[n] zeros;
  matrix<lower = 0>[n, n] D;
  {
    vector[n] W_rowsums;
    for (i in 1:n) {
      W_rowsums[i] = sum(W[i, ]);
    }
    D = diag_matrix(W_rowsums);
  }
  zeros = rep_vector(0, n);
}
parameters {
  vector[p] beta;
  vector[n] phi;
  real<lower = 0> tau;
  real<lower = 0, upper = 1> alpha;
}
model {
  phi ~ multi_normal_prec(zeros, tau * (D - alpha * W));
  beta ~ normal(0, 1);
  tau ~ gamma(2, 2);
  y ~ poisson_log(X * beta + phi + log_offset);
}

Fitting the model with rstan:

full_fit <- stan('stan/car_prec.stan', data = full_d, 
                 iter = niter, chains = nchains, verbose = FALSE)
print(full_fit, pars = c('beta', 'tau', 'alpha', 'lp__'))
## Inference for Stan model: car_prec.
## 4 chains, each with iter=10000; warmup=5000; thin=1; 
## post-warmup draws per chain=5000, total post-warmup draws=20000.
## 
##           mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## beta[1]   0.00    0.01 0.27  -0.54  -0.15  -0.01   0.15   0.58   452    1
## beta[2]   0.27    0.00 0.09   0.09   0.21   0.27   0.33   0.45  4239    1
## tau       1.66    0.01 0.49   0.88   1.30   1.59   1.94   2.80  6790    1
## alpha     0.93    0.00 0.06   0.77   0.91   0.95   0.98   0.99  3647    1
## lp__    821.11    0.09 6.66 807.04 816.83 821.48 825.75 833.27  4982    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Sep 23 14:48:27 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
# visualize results 
to_plot <- c('beta', 'tau', 'alpha', 'phi[1]', 'phi[2]', 'phi[3]', 'lp__')
traceplot(full_fit, pars = to_plot)

A more efficient sparse representation

Although we could specify our multivariate normal prior for \(\phi\) directly in Stan via multi_normal_prec, as we did above, in this case we will accrue computational efficiency gains by manually specifying \(p(\phi \mid \tau, \alpha)\) directly via the log probability accumulator. The log probability of \(\phi\) is:

\[\log(p(\phi \mid \tau, \alpha)) = - \frac{n}{2} \log(2 \pi) + \frac{1}{2} \log(\text{det}( \Sigma^{-1})) - \frac{1}{2} \phi^T \Sigma^{-1} \phi\]

In Stan, we only need the log posterior up to an additive constant so we can drop the first term. Then, substituting \(\tau (D - \alpha W)\) for \(\Sigma^{-1}\):

\[\frac{1}{2} \log(\text{det}(\tau (D - \alpha W))) - \frac{1}{2} \phi^T \Sigma^{-1} \phi\]

\[ = \frac{1}{2} \log(\tau ^ n \text{det}(D - \alpha W)) - \frac{1}{2} \phi^T \Sigma^{-1} \phi\]

\[ = \frac{n}{2} \log(\tau) + \frac{1}{2} \log(\text{det}(D - \alpha W)) - \frac{1}{2} \phi^T \Sigma^{-1} \phi\]

There are two ways that we can accrue computational efficiency gains:

  1. Sparse representations of \(\Sigma^{-1}\) to expedite computation of \(\phi^T \Sigma^{-1} \phi\) (this work was done by Kyle foreman previously, e.g., https://groups.google.com/d/topic/stan-users/M7T7EIlyhoo/discussion).

  2. Efficient computation of the determinant. Jin, Carlin, and Banerjee (2005) show that:

\[\text{det}(D - \alpha W) \propto \prod_{i = 1}^n (1 - \alpha \lambda_i)\]

where \(\lambda_1, ..., \lambda_n\) are the eigenvalues of \(D^{-\frac{1}{2}} W D^{-\frac{1}{2}}\), which can be computed ahead of time and passed in as data. Because we only need the log posterior up to an additive constant, we can use this result which is proportional up to some multiplicative constant \(c\):

\[\frac{n}{2} \log(\tau) + \frac{1}{2} \log(c \prod_{i = 1}^n (1 - \alpha \lambda_i)) - \frac{1}{2} \phi^T \Sigma^{-1} \phi\]

\[= \frac{n}{2} \log(\tau) + \frac{1}{2} \log(c) + \frac{1}{2} \log(\prod_{i = 1}^n (1 - \alpha \lambda_i)) - \frac{1}{2} \phi^T \Sigma^{-1} \phi\]

Again dropping additive constants:

\[\frac{n}{2} \log(\tau) + \frac{1}{2} \log(\prod_{i = 1}^n (1 - \alpha \lambda_i)) - \frac{1}{2} \phi^T \Sigma^{-1} \phi\]

\[= \frac{n}{2} \log(\tau) + \frac{1}{2} \sum_{i = 1}^n \log(1 - \alpha \lambda_i) - \frac{1}{2} \phi^T \Sigma^{-1} \phi\]

Stan implementation: sparse CAR

In the Stan model statement’s transformed data block, we compute \(\lambda_1, ..., \lambda_n\) (the eigenvalues of \(D^{-\frac{1}{2}} W D^{-\frac{1}{2}}\)), and generate a sparse representation for W (Wsparse), which is assumed to be symmetric, such that the adjacency relationships can be represented in a two column matrix where each row is an adjacency relationship between two sites.

The Stan model statement for the sparse implementation never constructs the precision matrix, and does not call any of the multi_normal* functions. Instead, we use define a sparse_car_lpdf() function and use it in the model block.

functions {
  /**
  * Return the log probability of a proper conditional autoregressive (CAR) prior 
  * with a sparse representation for the adjacency matrix
  *
  * @param phi Vector containing the parameters with a CAR prior
  * @param tau Precision parameter for the CAR prior (real)
  * @param alpha Dependence (usually spatial) parameter for the CAR prior (real)
  * @param W_sparse Sparse representation of adjacency matrix (int array)
  * @param n Length of phi (int)
  * @param W_n Number of adjacent pairs (int)
  * @param D_sparse Number of neighbors for each location (vector)
  * @param lambda Eigenvalues of D^{-1/2}*W*D^{-1/2} (vector)
  *
  * @return Log probability density of CAR prior up to additive constant
  */
  real sparse_car_lpdf(vector phi, real tau, real alpha, 
    int[,] W_sparse, vector D_sparse, vector lambda, int n, int W_n) {
      row_vector[n] phit_D; // phi' * D
      row_vector[n] phit_W; // phi' * W
      vector[n] ldet_terms;
    
      phit_D = (phi .* D_sparse)';
      phit_W = rep_row_vector(0, n);
      for (i in 1:W_n) {
        phit_W[W_sparse[i, 1]] = phit_W[W_sparse[i, 1]] + phi[W_sparse[i, 2]];
        phit_W[W_sparse[i, 2]] = phit_W[W_sparse[i, 2]] + phi[W_sparse[i, 1]];
      }
    
      for (i in 1:n) ldet_terms[i] = log1m(alpha * lambda[i]);
      return 0.5 * (n * log(tau)
                    + sum(ldet_terms)
                    - tau * (phit_D * phi - alpha * (phit_W * phi)));
  }
}
data {
  int<lower = 1> n;
  int<lower = 1> p;
  matrix[n, p] X;
  int<lower = 0> y[n];
  vector[n] log_offset;
  matrix<lower = 0, upper = 1>[n, n] W; // adjacency matrix
  int W_n;                // number of adjacent region pairs
}
transformed data {
  int W_sparse[W_n, 2];   // adjacency pairs
  vector[n] D_sparse;     // diagonal of D (number of neigbors for each site)
  vector[n] lambda;       // eigenvalues of invsqrtD * W * invsqrtD
  
  { // generate sparse representation for W
  int counter;
  counter = 1;
  // loop over upper triangular part of W to identify neighbor pairs
    for (i in 1:(n - 1)) {
      for (j in (i + 1):n) {
        if (W[i, j] == 1) {
          W_sparse[counter, 1] = i;
          W_sparse[counter, 2] = j;
          counter = counter + 1;
        }
      }
    }
  }
  for (i in 1:n) D_sparse[i] = sum(W[i]);
  {
    vector[n] invsqrtD;  
    for (i in 1:n) {
      invsqrtD[i] = 1 / sqrt(D_sparse[i]);
    }
    lambda = eigenvalues_sym(quad_form(W, diag_matrix(invsqrtD)));
  }
}
parameters {
  vector[p] beta;
  vector[n] phi;
  real<lower = 0> tau;
  real<lower = 0, upper = 1> alpha;
}
model {
  phi ~ sparse_car(tau, alpha, W_sparse, D_sparse, lambda, n, W_n);
  beta ~ normal(0, 1);
  tau ~ gamma(2, 2);
  y ~ poisson_log(X * beta + phi + log_offset);
}

Fitting the model:

sp_d <- list(n = nrow(X),         # number of observations
             p = ncol(X),         # number of coefficients
             X = X,               # design matrix
             y = O,               # observed number of cases
             log_offset = log(E), # log(expected) num. cases
             W_n = sum(W) / 2,    # number of neighbor pairs
             W = W)               # adjacency matrix

sp_fit <- stan('stan/car_sparse.stan', data = sp_d, 
               iter = niter, chains = nchains, verbose = FALSE)

print(sp_fit, pars = c('beta', 'tau', 'alpha', 'lp__'))
## Inference for Stan model: car_sparse.
## 4 chains, each with iter=10000; warmup=5000; thin=1; 
## post-warmup draws per chain=5000, total post-warmup draws=20000.
## 
##           mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## beta[1]   0.00    0.02 0.29  -0.60  -0.16   0.00   0.15   0.55   300 1.02
## beta[2]   0.27    0.00 0.09   0.09   0.21   0.27   0.34   0.45  4582 1.00
## tau       1.64    0.01 0.49   0.86   1.29   1.58   1.92   2.79  6068 1.00
## alpha     0.93    0.00 0.06   0.77   0.91   0.95   0.97   0.99  2869 1.00
## lp__    782.95    0.10 6.66 769.16 778.59 783.23 787.60 795.16  4817 1.00
## 
## Samples were drawn using NUTS(diag_e) at Fri Sep 23 14:48:41 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
traceplot(sp_fit, pars = to_plot)