NNGP Based Models

Nearest Neighbor Gaussian Processes (NNGP) based models is a family of highly scalable Gaussian processes based models. In brief, NNGP extends the Vecchia’s approximation (Vecchia (1988)) to a process using conditional independence given information from neighboring locations. In this section, I will briefly review response and latent NNGP models. For more details and applications of NNGP, please refer to (A. Datta, Banerjee, Finley, and Gelfand 2016) and (A. Datta, Banerjee, Finley, Hamm, et al. 2016).

Spatial regression model

We envision a spatial regression model at any location \(s\)

\[\begin{equation}\label{eq: spatial_regression_model} y(s) = m_{\theta}(s) + w(s) + \epsilon(s)\;,\quad \epsilon(s) \stackrel{iid}{\sim} \mathcal{N}(0,\tau^2) \;, \end{equation}\]

where, usually, \(m_{\theta}(s) = x(s)^{\top}\beta\) and \(w(s)\) is a latent spatial process capturing spatial dependence. Let \(S\) be the set of \(n\) observed locations. If we model the process \(w(s)\) with a Gaussian process \(w(s) \sim GP(0, C_\theta(\cdot, \cdot))\), then a customary Bayesian hierarchical models for observations on \(S = \{s_1, \ldots, s_N\}\) can be constructed as

\[\begin{equation} \label{eq: latent_model} \mathcal{p}(\theta) \; \times \; \mathcal{N}(w(S)\;|\;0, C_\theta(S,S)) \; \times \; \mathcal{N}(y(S)\;|\; m_\theta(S) + w(S), \tau^2 I_n) \end{equation}\]

Another implementation of GP is to marginalize over the latent process \(w(s)\) and construct the outcome process \(y(s)\) directly with a GP. By integrating out \(w(s)\), we have a more parsimonious model whose parameters set collapses from \(\{\theta, w\}\) to \(\{\theta\}\). In a Bayesian setting, \(\theta\) will be sampled from its posterior distribution

\[\begin{equation}\label{eq: response model} \mathcal{p}(\theta \;|\; y(S)) \; \propto \; \mathcal{p}(\theta) \; \times \; \mathcal{N} (y(S)\;|\;m_{\theta}(S), C_\theta(S, S) + \tau^2 I_n) \end{equation}\]

To distinguish these two models, we shall call the former as a latent GP model, and the latter as a response GP model. These two models are referred as the latent variable GP and the marginal likelihood GP in Stan reference manual (Stan Development Team (2017)).

Latent and Response NNGP models

Nearest neighbor Gaussian process (NNGP) provides an alternative to the Gaussian process in the models discussed in the preceding subsection. The likelihoods of two models basing on NNGP derived from the original Gaussian process coincide with the Vecchia’s approximation (Vecchia (1988)) of the original models. In particular, a latent NNGP model has a posterior distribution proportional to

\[\begin{equation}\label{eq: latent NNGP model} \mathcal{p} (\theta) \; \times \; \mathcal{N}(w(S) \;|\; 0, C^*) \; \times \; \mathcal{N}(y(S) \;|\; m_\theta(S)+w(S), \tau^2 I_n) \;, \end{equation}\]

where \(C^{*-1} = (I-A^*)^\top D^{*-1}(I-A^*)\) is the precision matrix of the latent process \(w(s)\) over \(S\). Here \(A^*\) is a sparse and strictly lower triangular matrix with at most \(M\)(\(M \ll N\)) non-zero entries in each row, and \(D^*\) is a diagonal matrix. One can readily calculate the determinant of \(C^*\) by the product of the diagonal elements in \(D^*\). The likelihood of \(w(S)\) based on precision matrix \(C^{*-1}\) serves as a good approximation to the likelihood of \(w(s)\) in \eqref{eq: latent_model}, while the storage and computational burden of the former is linear in \(N\).

A response NNGP model yields posterior distribution:

\[\begin{equation} \label{eq:response_NNGP} \mathcal{p}(\theta \;|\; y(S)) \; \propto \; \mathcal{p}(\theta) \; \times \; \mathcal{N}(y(S)\;|\;m_{\theta}(S), \{C_\theta + \tau^2 I\}^*) \;, \end{equation}\]

where \(\{C_\theta + \tau^2 I\}^{*-1} = (I-A)^\top D^{-1}(I-A)\), analogous to the latent NNGP model, can be treated as an approximation of \(\{C_\theta(S, S)+\tau^2I_n\}^{-1}\). Notice that although one can obtain the response GP model by integrating out the latent process in a latent GP model, the corresponding NNGP model doesn’t have this property.

Construction of \(A\), \(D\)

The details of Matrices \(A\), \(D\) and two models can be found in Finley et al. (2017). Here we use the response NNGP model to show how to construct matrix \(A\) and \(D\). Let \(N(s_i)\) be at most \(M\) closest points to \(s_i\) among the locations indexed less than \(i\). The \(i\) th row (\(i > 1\)) of \(A\) has nonzero entries on positions indexed by \(N(s_i)\), and the nonzero entries are calculated by

\[\begin{equation} \label{eq: A_construct} A(i, N(s_i)) = C_\theta(s_i, N(s_i))(C_\theta(N(s_i), N(s_i)) + \tau^2I)^{-1} \end{equation}\]

And the \(i\) th element on the diagonal of \(D\) satisfies

\[\begin{equation} \label{eq: D_construct} D(i, i) = C_\theta(s_i, s_i) + \tau^2 - C_\theta(s_i, N(s_i))(C_\theta(N(s_i), N(s_i)) + \tau^2I)^{-1}C_\theta(N(s_i), s_i) \end{equation}\]

These equations are derived from the distribution of \(E[y(s_i) \; | \; y(N(s_i))]\). The nonzero entries in each row of \(A\) are precisely the weights obtained by predicting \(y(s_i)\), or “kriging”, based upon the values of \(y(s)\) at neighboring locations, i.e., \(N(s_i)\). And the diagonal elements in \(D\) are the variance of \(y(s_i)\) conditioning on its’ neighbors in the “past” \(y(N(s_i))\).

Code NNGP Based Model in Stan

In this section, I will use a simulation data to show how to code NNGP based models efficiently in Stan.

Intro of simulation data

We generated response \(Y\) along with a covariate \(x\) at \(n = 500\) randomly sited locations in a unit square domain by the following model:

\[\begin{equation} y(s) = \beta_0 + x(s)\beta_1 + w(s) + \epsilon(s), \hspace{1cm} \epsilon(s) \sim N(0, \tau^2) \;, \end{equation}\]

where the zero-centered spatial latent process \(w(s)\) were sampled from a Gaussian process with a covariance function \(C_\theta\) specified by exponential:

\[\begin{equation}\label{exp_K} C_\theta(s_i, s_j) = \sigma^2 \exp(-\phi||s_i-s_j||)\;, \hspace{0.5cm} s_i,s_j \in S \end{equation}\]

The predictor \(x\) were generated from \(N(0, 1)\). The setting of parameters is listed in the code.

rmvn <- function(N, mu = 0, V = matrix(1)){
  P <- length(mu)
  if(any(is.na(match(dim(V), P))))
    stop("Dimension problem!")
  D <- chol(V)
  t(matrix(rnorm(N * P), ncol = P) %*% D + rep(mu, rep(N, P)))
}

set.seed(1234)
N <- 500
coords <- cbind(runif(N), runif(N))
X <- as.matrix(cbind(1, rnorm(N)))
B <- as.matrix(c(1, 5))
sigma.sq <- 2
tau.sq <- 0.1
phi <- 3 / 0.5

D <- as.matrix(dist(coords))
R <- exp(-phi*D)
w <- rmvn(1, rep(0, N), sigma.sq*R)
Y <- rnorm(N, X %*% B + w, sqrt(tau.sq))

Data block in Stan

The following block shows the elements needed for NNGP based models

  data {
      int<lower=1> N;
      int<lower=1> M;
      int<lower=1> P;
      vector[N] Y;
      matrix[N, P + 1] X;
      int NN_ind[N - 1, M];
      matrix[N - 1, M] NN_dist;
      matrix[N - 1, (M * (M - 1) / 2)] NN_distM;
  }

Here the design matrix X contains an initial column of 1s, P is the number of regression coefficients, and M is the number of nearest neighbors (maximum number of elements in each row of \(n \times n\) sparse matrix \(A\)). Notice that we provide three matrices NN_ind, NN_dist and NN_distM:

NN_ind is a two-dimensional array of indices whose \(i - 1\)th row shows at most \(M\) closest points to \(s_i\) among the locations indexed less than \(i\).

NN_dist is a matrix whose \(i - 1\)th row contains the distance of \(i\)th location to its selected neighbors.

NN_dist is a matrix whose \(i - 1\)th row contains the strictly lower triangular part of the distance matrix of the selected neighbors of \(i\)th location.

These three matrices are required for constructing the sparse lower triangular matrix \(A\), and the diagonal matrix \(D\). Since they are fixed across the MCMC updates, we recommend user to provide them in the data segment. Next, I will show how to efficiently generate the matrices listed above.

Build neighbor index

File “NNmatrix.R” provides a wrapper function in R that uses package spNNGP, which has a fast algorithm of building neighbor index, to generate the required matrices in the Stan data segment. Here n.omp.threads indicates the number of threads to use for parallel processing.

source("NNmatrix.R")
M = 6                 # Number of Nearest Neighbors
NN.matrix <- NNMatrix(coords = coords, n.neighbors = M, n.omp.threads = 2)
str(NN.matrix)

Check Neighbors (for fun)

We can use function Check_Neighbors in “NNmatrix.R” for checking the nearest neighbor index. It is important to point out that NNMatrix sorts coordinates on the first column before building the neighbor index. Thus we should use the sorted response and design matrix instead of the raw data in the data block.

Check_Neighbors(NN.matrix$coords.ord, n.neighbors = M, NN.matrix, ind = 200)

Parameter and model block in Stan

We assign Gaussian priors for \(\sigma\) and \(\tau\), an inverse gamma prior for \(\phi\) and a Gaussian prior for \(\beta = \{\beta_0, \beta_1\}\). The following is the parameter and model block for a latent NNGP model.

  parameters{
      vector[P + 1] beta;
      real<lower = 0> sigma;
      real<lower = 0> tau;
      real<lower = 0> phi;
      vector[N] w;
  }

  transformed parameters {
      real sigmasq = sigma^2;
      real tausq = tau^2;
  }

  model{
      beta ~ multi_normal_cholesky(uB, L_VB);
      phi ~ inv_gamma(ap, bp);
      sigma ~ normal(0, ss);
      tau ~ normal(0, st);
      w ~ nngp_w(sigmasq, phi, NN_dist, NN_distM, NN_ind, N, M);
      Y ~ normal(X * beta + w, tau);
  }

A small modification will make the code work for a response NNGP model:

  parameters{
      vector[P + 1] beta;
      real<lower = 0> sigma;
      real<lower = 0> tau;
      real<lower = 0> phi;
  }

  transformed parameters {
      real sigmasq = sigma^2;
      real tausq = tau^2;
  }

  model{
      beta ~ multi_normal_cholesky(uB, L_VB);
      phi ~ inv_gamma(ap, bp);
      sigma ~ normal(0, ss);
      tau ~ normal(0, st);
      Y ~ nngp(X * beta, sigmasq, tausq, phi, NN_dist, NN_distM, NN_ind, N, M);
  }

Here, the user-defined functions nngp_w and nngp will be given in the simulation study section.

User-defined likelihood function for NNGP models

The hardest part in coding NNGP in Stan is the user-defined likelihood, specifically, the function nngp_w and nngp in the last subsection. Here we use nngp to illustrate the main idea of coding NNGP likelihood.

The log-likelihood of \(y(S)\) in \eqref{eq:response_NNGP} is given by:

\[ -{1 \over 2}\sum_{i = 1}^N\log{D_{ii}} - {1 \over 2}{(y(S) - X(S)^\top\beta)^\top (I-A)^T D^{-1} (I-A)(y(S) - X(S)^\top\beta)} \]

In the code below, vector U saves the results of \((I-A)(y(S) - X(S)^\top\beta)\), and vector V saves all the diagonal elements of Matrix \(D\) scaled by \(\sigma^2\). With this notation, the log-likelihood can be simplified as

\[\begin{equation} \label{eq: ll_code} -{1 \over 2}\{\sum_{i = 1}^N\log{V_{i}} +N \log{(\sigma^2)}+ {1\over \sigma^2} U^\top(U \oslash V)\} \;, \end{equation}\]

where all the elements in the likelihood are vectors.

In the calculation of vector \(U = (I-A)(y(S) - X(S)^\top\beta)\), since we know that matrix \(A\) has at most \(M\) nonzero elements and the index of nonzero elements is given in NN_ind, there is no need for saving the \(n \times n\) matrix \(I-A\). Instead, we use a for loop to calculate \(U\). Within each iteration, we first use NN_dist and NN_distM along with the updated parameter to obtain \(A(i, N(s_i))\) by \eqref{eq: D_construct} and \(D(i, i)\) by \eqref{eq: A_construct}, then use NN_ind and \(y(S) - X(S)^\top\beta\) to calculate the \(i\) th element of \(U\). The flops required in each iteration is in the order of \(M^3\).

  functions{
      real nngp_lpdf(vector Y, vector X_beta, real sigmasq, real tausq,
                     real phi, matrix NN_dist, matrix NN_distM, int[,] NN_ind,
                     int N, int M){

          vector[N] V;
          vector[N] YXb = Y - X_beta;
          vector[N] U = YXb;
          real kappa_p_1 = tausq / sigmasq + 1;
          int dim;
          int h;

          for (i in 2:N) {
              matrix[ i < (M + 1) ? (i - 1) : M, i < (M + 1) ? (i - 1): M]
              iNNdistM;
              matrix[ i < (M + 1) ? (i - 1) : M, i < (M + 1) ? (i - 1): M]
              iNNCholL;
              vector[ i < (M + 1) ? (i - 1) : M] iNNcorr;
              vector[ i < (M + 1) ? (i - 1) : M] v;
              row_vector[i < (M + 1) ? (i - 1) : M] v2;
              dim = (i < (M + 1))? (i - 1) : M;

              if(dim == 1){iNNdistM[1, 1] = kappa_p_1;}
              else{
                  h = 0;
                  for (j in 1:(dim - 1)){
                      for (k in (j + 1):dim){
                          h = h + 1;
                          iNNdistM[j, k] = exp(- phi * NN_distM[(i - 1), h]);
                          iNNdistM[k, j] = iNNdistM[j, k];
                      }
                  }
                  for(j in 1:dim){
                      iNNdistM[j, j] = kappa_p_1;
                  }
              }

              iNNCholL = cholesky_decompose(iNNdistM);
              for (j in 1: dim){
                  iNNcorr[j] = exp(- phi * NN_dist[(i - 1), j]);
              }

             v = mdivide_left_tri_low(iNNCholL, iNNcorr);

             V[i] = kappa_p_1 - dot_self(v);

             v2 = mdivide_right_tri_low(v', iNNCholL);

             for (j in 1:dim){
                  U[i] = U[i] - v2[j] * YXb[NN_ind[(i - 1), j]];
              }
          }
          V[1] = kappa_p_1;
          return - 0.5 * ( 1 / sigmasq * dot_product(U, (U ./ V)) +
                          sum(log(V)) + N * log(sigmasq));
      }
  }

Simulation study

Now let’s run the NNGP based models for the simulation data in the last section. First set parameters of priors:

P = 1                  # number of regression coefficients
uB = rep(0, P + 1)     # mean vector in the Gaussian prior of beta
VB = diag(P + 1)*1000  # covariance matrix in the Gaussian prior of beta
ss = 3 * sqrt(2)       # scale parameter in the normal prior of sigma 
st = 3 * sqrt(0.1)     # scale parameter in the normal prior of tau     
ap = 3; bp = 0.5       # shape and scale parameter in the inv-gamma prior of phi 

Response NNGP models in Stan

The following chunk is the R code for running response NNGP models. We use the response and design matrix sorted by the order from NNMatrix instead of the raw Y and X in the data block.

library(rstan)
options(mc.cores = parallel::detectCores())
data <- list(N = N, M = M, P = P, 
             Y = Y[NN.matrix$ord], X = X[NN.matrix$ord, ],    # sorted Y and X
             NN_ind = NN.matrix$NN_ind, 
             NN_dist = NN.matrix$NN_dist, 
             NN_distM = NN.matrix$NN_distM,  
             uB = uB, VB = VB, ss = ss, st = st, ap = ap, bp = bp)

myinits <-list(list(beta = c(1, 5), sigma = 1, tau = 0.5, phi = 12), 
               list(beta = c(5, 5), sigma = 1.5, tau = 0.2, phi = 5), 
               list(beta = c(0, 0), sigma = 2.5, tau = 0.1, phi = 9))

parameters <- c("beta", "sigmasq", "tausq", "phi")
samples <- stan(
  file = "nngp_response.stan",
  data = data,
  init = myinits,
  pars = parameters,
  iter = 600, 
  chains = 3,
  thin = 1,
  seed = 123
)

The full Stan program for the response NNGP model is in the file “nngp_response.stan”.

writeLines(readLines('nngp_response.stan'))
  /* Response NNGP model */

  functions{
      real nngp_lpdf(vector Y, vector X_beta, real sigmasq, real tausq,
                     real phi, matrix NN_dist, matrix NN_distM, int[,] NN_ind,
                     int N, int M){

          vector[N] V;
          vector[N] YXb = Y - X_beta;
          vector[N] U = YXb;
          real kappa_p_1 = tausq / sigmasq + 1;
          int dim;
          int h;

          for (i in 2:N) {
              matrix[ i < (M + 1) ? (i - 1) : M, i < (M + 1) ? (i - 1): M]
              iNNdistM;
              matrix[ i < (M + 1) ? (i - 1) : M, i < (M + 1) ? (i - 1): M]
              iNNCholL;
              vector[ i < (M + 1) ? (i - 1) : M] iNNcorr;
              vector[ i < (M + 1) ? (i - 1) : M] v;
              row_vector[i < (M + 1) ? (i - 1) : M] v2;
              dim = (i < (M + 1))? (i - 1) : M;

              if(dim == 1){iNNdistM[1, 1] = kappa_p_1;}
              else{
                  h = 0;
                  for (j in 1:(dim - 1)){
                      for (k in (j + 1):dim){
                          h = h + 1;
                          iNNdistM[j, k] = exp(- phi * NN_distM[(i - 1), h]);
                          iNNdistM[k, j] = iNNdistM[j, k];
                      }
                  }
                  for(j in 1:dim){
                      iNNdistM[j, j] = kappa_p_1;
                  }
              }

              iNNCholL = cholesky_decompose(iNNdistM);
              iNNcorr = to_vector(exp(- phi * NN_dist[(i - 1), 1: dim]));

             v = mdivide_left_tri_low(iNNCholL, iNNcorr);

             V[i] = kappa_p_1 - dot_self(v);

             v2 = mdivide_right_tri_low(v', iNNCholL);

             U[i] = U[i] - v2 * YXb[NN_ind[(i - 1), 1:dim]];
          }
          V[1] = kappa_p_1;
          return - 0.5 * ( 1 / sigmasq * dot_product(U, (U ./ V)) +
                          sum(log(V)) + N * log(sigmasq));
      }
  }

  data {
      int<lower=1> N;
      int<lower=1> M;
      int<lower=1> P;
      vector[N] Y;
      matrix[N, P + 1] X;
      int NN_ind[N - 1, M];
      matrix[N - 1, M] NN_dist;
      matrix[N - 1, (M * (M - 1) / 2)] NN_distM;
      vector[P + 1] uB;
      matrix[P + 1, P + 1] VB;
      real ss;
      real st;
      real ap;
      real bp;
  }

  transformed data {
      cholesky_factor_cov[P + 1] L_VB;
      L_VB = cholesky_decompose(VB);
  }

  parameters{
      vector[P + 1] beta;
      real<lower = 0> sigma;
      real<lower = 0> tau;
      real<lower = 0> phi;
  }

  transformed parameters {
      real sigmasq = square(sigma);
      real tausq = square(tau);
  }

  model{
      beta ~ multi_normal_cholesky(uB, L_VB);
      phi ~ gamma(ap, bp);
      sigma ~ normal(0, ss);
      tau ~ normal(0, st);
      Y ~ nngp(X * beta, sigmasq, tausq, phi, NN_dist, NN_distM, NN_ind, N, M);
  }

Latent NNGP model for simulation study

The following chunk is the R code for running the latent NNGP model:

options(mc.cores = parallel::detectCores())
data <- list(N = N, M = M, P = P, 
             Y = Y[NN.matrix$ord], X = X[NN.matrix$ord, ],    # sorted Y and X
             NN_ind = NN.matrix$NN_ind, 
             NN_dist = NN.matrix$NN_dist, 
             NN_distM = NN.matrix$NN_distM, 
             uB = uB, VB = VB, ss = ss, st = st, ap = ap, bp = bp)

myinits <-list(list(beta = c(1, 5), sigma = 1, tau = 0.5, phi = 12, 
                    w_b1 = rep(0, N)), 
               list(beta = c(5, 5), sigma = 1.5, tau = 0.2, phi = 5, 
                    w_b1 = rep(0.1, N)), 
               list(beta = c(0, 0), sigma = 2.5, tau = 0.1, phi = 9 ,
                    w_b1 = rep(0, N)))

parameters <- c("beta", "sigmasq", "tausq", "phi", "w")
samples_w <- stan(
  file = "nngp_latent.stan",
  data = data,
  init = myinits,
  pars = parameters,
  iter = 600, 
  chains = 3,
  thin = 1,
  seed = 123
)

The full Stan program for the latent NNGP model is in the file “nngp_latent.stan”.

writeLines(readLines('nngp_latent.stan'))
  /* Latent NNGP model*/

  functions{
      real nng