10.3 Fitting a Gaussian Process
GP with a normal outcome
The full generative model for a GP with a normal outcome, \(y \in \mathbb{R}^N\), with inputs \(x \in \mathbb{R}^N\), for a finite \(N\):
\[ \begin{aligned} \rho & \sim \mathsf{InvGamma}(5, 5) \\ \alpha & \sim \mathsf{normal}(0, 1) \\ \sigma & \sim \mathsf{normal}(0, 1) \\ f & \sim \mathsf{multivariate\ normal}\left(0, K(x | \alpha, \rho)\right) \\ y_i & \sim \mathsf{normal}(f_i, \sigma) \, \forall i \in \{1, \dots, N\} \end{aligned} \]
With a normal outcome, it is possible to integrate out the Gaussian process \(f\), yielding the more parsimonious model:
\[ \begin{aligned} \rho & \sim \mathsf{InvGamma}(5, 5) \\ \alpha & \sim \mathsf{normal}(0, 1) \\ \sigma & \sim \mathsf{normal}(0, 1) \\ y & \sim \mathsf{multivariate\ normal} \left(0, K(x | \alpha, \rho) + \mathbf{I}_N \sigma^2\right) \\ \end{aligned} \]
It can be more computationally efficient when dealing with a normal outcome to integrate out the Gaussian process, because this yields a lower-dimensional parameter space over which to do inference. We’ll fit both models in Stan. The former model will be referred to as the latent variable GP, while the latter will be called the marginal likelihood GP.
The hyperparameters controlling the covariance function of a Gaussian process can be fit by assigning them priors, like we have in the generative models above, and then computing the posterior distribution of the hyperparameters given observed data. The priors on the parameters should be defined based on prior knowledge of the scale of the output values (\(\alpha\)), the scale of the output noise (\(\sigma\)), and the scale at which distances are measured among inputs (\(\rho\)). See the Gaussian process priors section for more information about how to specify appropriate priors for the hyperparameters.
The Stan program implementing the marginal likelihood GP is shown below. The
program is similar to the Stan programs that implement the simulation GPs
above, but because we are doing inference on the hyperparameters, we need to
calculate the covariance matrix K
in the model block, rather than
the transformed data block.
data {
int<lower=1> N;
real x[N];
vector[N] y;
}
transformed data {
vector[N] mu = rep_vector(0, N);
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
}
model {
matrix[N, N] L_K;
matrix[N, N] K = cov_exp_quad(x, alpha, rho);
real sq_sigma = square(sigma);
// diagonal elements
for (n in 1:N)
K[n, n] = K[n, n] + sq_sigma;
L_K = cholesky_decompose(K);
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
sigma ~ std_normal();
y ~ multi_normal_cholesky(mu, L_K);
}
The data block now declares a vector y
of observed values y[n]
for inputs x[n]
. The transformed data block now only defines the mean
vector to be zero. The three hyperparameters are defined as parameters
constrained to be non-negative. The computation of the covariance matrix
K
is now in the model block because it involves unknown parameters and
thus can’t simply be precomputed as transformed data. The rest of the model
consists of the priors for the hyperparameters and the multivariate
Cholesky-parameterized normal likelihood, only now the value y
is known
and the covariance matrix K
is an unknown dependent on the
hyperparameters, allowing us to learn the hyperparameters.
We have used the Cholesky parameterized multivariate normal rather
than the standard parameterization because it allows us to the
cholesky_decompose
function which has been optimized for both small
and large matrices. When working with small matrices the differences
in computational speed between the two approaches will not be
noticeable, but for larger matrices (\(N \gtrsim 100\)) the Cholesky
decomposition version will be faster.
Hamiltonian Monte Carlo sampling is fast and effective for hyperparameter inference in this model Neal (1997). If the posterior is well-concentrated for the hyperparameters the Stan implementation will fit hyperparameters in models with a few hundred data points in seconds.
Latent variable GP
We can also explicitly code the latent variable formulation of a GP in Stan. This will be useful for when the outcome is not normal. We’ll need to add a small positive term, \(\delta\) to the diagonal of the covariance matrix in order to ensure that our covariance matrix remains positive definite.
data {
int<lower=1> N;
real x[N];
vector[N] y;
}
transformed data {
real delta = 1e-9;
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
vector[N] eta;
}
model {
vector[N] f;
{
matrix[N, N] L_K;
matrix[N, N] K = cov_exp_quad(x, alpha, rho);
// diagonal elements
for (n in 1:N)
K[n, n] = K[n, n] + delta;
L_K = cholesky_decompose(K);
f = L_K * eta;
}
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
sigma ~ std_normal();
eta ~ std_normal();
y ~ normal(f, sigma);
}
Two differences between the latent variable GP and the marginal likelihood GP
are worth noting. The first is that we have augmented our parameter block with
a new parameter vector of length \(N\) called \(`eta`\). This is used in the model
block to generate a multivariate normal vector called \(f\), corresponding to the
latent GP. We put a \(\mathsf{normal}(0,1)\) prior on eta
like we did in the
Cholesky-parameterized GP in the simulation section. The second difference is
that our likelihood is now univariate, though we could code \(N\) likelihood
terms as one \(N\)-dimensional multivariate normal with an identity covariance
matrix multiplied by \(\sigma^2\). However, it is more efficient to use the
vectorized statement as shown above.
Discrete outcomes with Gaussian Processes
Gaussian processes can be generalized the same way as standard linear models by introducing a link function. This allows them to be used as discrete data models.
Poisson GP
If we want to model count data, we can remove the \(\sigma\) parameter, and use
poisson_log
, which implements a log link, for our likelihood rather
than normal
. We can also add an overall mean parameter, \(a\), which
will account for the marginal expected value for \(y\). We do this because we
cannot center count data like we would for normally distributed data.
data {
...
int<lower=0> y[N];
...
}
...
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real a;
vector[N] eta;
}
model {
...
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
a ~ std_normal();
eta ~ std_normal();
y ~ poisson_log(a + f);
}
Logistic Gaussian Process Regression
For binary classification problems, the observed outputs \(z_n \in \{ 0,1 \}\) are binary. These outputs are modeled using a Gaussian process with (unobserved) outputs \(y_n\) through the logistic link, \[ z_n \sim \mathsf{Bernoulli}(\mbox{logit}^{-1}(y_n)), \] or in other words, \[ \mbox{Pr}[z_n = 1] = \mbox{logit}^{-1}(y_n). \]
We can extend our latent variable GP Stan program to deal with classification problems. Below \(a\) is the bias term, which can help account for imbalanced classes in the training data:
data {
...
int<lower=0, upper=1> z[N];
...
}
...
model {
...
y ~ bernoulli_logit(a + f);
}
Automatic Relevance Determination
If we have multivariate inputs \(x \in \mathbb{R}^D\), the squared exponential covariance function can be further generalized by fitting a scale parameter \(\rho_d\) for each dimension \(d\), \[ k(x | \alpha, \vec{\rho}, \sigma)_{i, j} = \alpha^2 \exp \left(-\dfrac{1}{2} \sum_{d=1}^D \dfrac{1}{\rho_d^2} (x_{i,d} - x_{j,d})^2 \right) + \delta_{i, j}\sigma^2. \] The estimation of \(\rho\) was termed “automatic relevance determination” in Neal (1996a), but this is misleading, because the magnitude the scale of the posterior for each \(\rho_d\) is dependent on the scaling of the input data along dimension \(d\). Moreover, the scale of the parameters \(\rho_d\) measures non-linearity along the \(d\)-th dimension, rather than “relevance” Piironen and Vehtari (2016).
A priori, the closer \(\rho_d\) is to zero, the more nonlinear the conditional mean in dimension \(d\) is. A posteriori, the actual dependencies between \(x\) and \(y\) play a role. With one covariate \(x_1\) having a linear effect and another covariate \(x_2\) having a nonlinear effect, it is possible that \(\rho_1 > \rho_2\) even if the predictive relevance of \(x_1\) is higher (Rasmussen and Williams 2006, 80). The collection of \(\rho_d\) (or \(1/\rho_d\)) parameters can also be modeled hierarchically.
The implementation of automatic relevance determination in Stan is
straightforward, though it currently requires the user to directly code the
covariance matrix. We’ll write a function to generate the Cholesky of the
covariance matrix called L_cov_exp_quad_ARD
.
functions {
matrix L_cov_exp_quad_ARD(vector[] x,
real alpha,
vector rho,
real delta) {
int N = size(x);
matrix[N, N] K;
real sq_alpha = square(alpha);
for (i in 1:(N-1)) {
K[i, i] = sq_alpha + delta;
for (j in (i + 1):N) {
K[i, j] = sq_alpha
* exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));
K[j, i] = K[i, j];
}
}
K[N, N] = sq_alpha + delta;
return cholesky_decompose(K);
}
}
data {
int<lower=1> N;
int<lower=1> D;
vector[D] x[N];
vector[N] y;
}
transformed data {
real delta = 1e-9;
}
parameters {
vector<lower=0>[D] rho;
real<lower=0> alpha;
real<lower=0> sigma;
vector[N] eta;
}
model {
vector[N] f;
{
matrix[N, N] L_K = L_cov_exp_quad_ARD(x, alpha, rho, delta);
f = L_K * eta;
}
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
sigma ~ std_normal();
eta ~ std_normal();
y ~ normal(f, sigma);
}
10.3.1 Priors for Gaussian Process Parameters
Formulating priors for GP hyperparameters requires the analyst to consider the inherent statistical properties of a GP, the GP’s purpose in the model, and the numerical issues that may arise in Stan when estimating a GP.
Perhaps most importantly, the parameters \(\rho\) and \(\alpha\) are weakly identified Zhang (2004). The ratio of the two parameters is well-identified, but in practice we put independent priors on the two hyperparameters because these two quantities are more interpretable than their ratio.
Priors for length-scale
GPs are a flexible class of priors and, as such, can represent a wide spectrum of functions. For length scales below the minimum spacing of the covariates the GP likelihood plateaus. Unless regularized by a prior, this flat likelihood induces considerable posterior mass at small length scales where the observation variance drops to zero and the functions supported by the GP being to exactly interpolate between the input data. The resulting posterior not only significantly overfits to the input data, it also becomes hard to accurately sample using Euclidean HMC.
We may wish to put further soft constraints on the length-scale, but these are dependent on how the GP is used in our statistical model.
If our model consists of only the GP, i.e.:
\[ \begin{aligned} f & \sim \mathsf{multivariate\ normal}\left(0, K(x | \alpha, \rho)\right) \\ y_i & \sim \mathsf{normal}(f_i, \sigma) \, \forall i \in \{1, \dots, N\} \\ & x \in \mathbb{R}^{N \times D}, \, f \in \mathbb{R}^N \end{aligned} \]
we likely don’t need constraints beyond penalizing small
length-scales. We’d like to allow the GP prior to represent both
high-frequency and low-frequency functions, so our prior should put
non-negligible mass on both sets of functions. In this case, an
inverse gamma, inv_gamma_lpdf
in Stan’s language, will work
well as it has a sharp left tail that puts negligible mass on
infinitesimal length-scales, but a generous right tail, allowing for
large length-scales. Inverse gamma priors will avoid infinitesimal length-scales
because the density is zero at zero, so the posterior for length-scale will be
pushed away from zero. An inverse gamma distribution is one of many
zero-avoiding or boundary-avoiding distributions. See
for more on boundary-avoiding priors.
If we’re using the GP as a component in a larger model that includes an overall mean and fixed effects for the same variables we’re using as the domain for the GP, i.e.:
\[ \begin{aligned} f & \sim \mathsf{multivariate\ normal}\left(0, K(x | \alpha, \rho)\right) \\ y_i & \sim \mathsf{normal}(\beta_0 + x_i \beta_{[1:D]} + f_i, \sigma) \, \forall i \in \{1, \dots, N\} \\ & x_i^T, \beta_{[1:D]} \in \mathbb{R}^D,\, x \in \mathbb{R}^{N \times D},\, f \in \mathbb{R}^N \end{aligned} \]
we’ll likely want to constrain large length-scales as well. A length scale that is larger than the scale of the data yields a GP posterior that is practically linear (with respect to the particular covariate) and increasing the length scale has little impact on the likelihood. This will introduce nonidentifiability in our model, as both the fixed effects and the GP will explain similar variation. In order to limit the amount of overlap between the GP and the linear regression, we should use a prior with a sharper right tail to limit the GP to higher-frequency functions. We can use a generalized inverse Gaussian distribution:
\[ \begin{aligned} f(x | a, b, p) & = \dfrac{(a/b)^{p/2}}{2K_p(\sqrt{ab})} x^{p - 1}\exp(-(ax + b / x)/2) \\ & x, a, b \in \mathbb{R}^{+}, \, p \in \mathbb{Z} \end{aligned} \]
which has an inverse gamma left tail if \(p \leq 0\) and an inverse Gaussian right tail. This has not yet been implemented in Stan’s math library, but it is possible to implement as a user defined function:
functions {
real generalized_inverse_gaussian_lpdf(real x, int p,
real a, real b) {
return p * 0.5 * log(a / b)
- log(2 * modified_bessel_second_kind(p, sqrt(a * b)))
+ (p - 1) * log(x)
- (a * x + b / x) * 0.5;
}
}
data {
...
If we have high-frequency covariates in our fixed effects, we may wish to further regularize the GP away from high-frequency functions, which means we’ll need to penalize smaller length-scales. Luckily, we have a useful way of thinking about how length-scale affects the frequency of the functions supported the GP. If we were to repeatedly draw from a zero-mean GP with a length-scale of \(\rho\) in a fixed-domain \([0,T]\), we would get a distribution for the number of times each draw of the GP crossed the zero axis. The expectation of this random variable, the number of zero crossings, is \(T / \pi \rho\). You can see that as \(\rho\) decreases, the expectation of the number of upcrossings increases as the GP is representing higher-frequency functions. Thus, this is a good statistic to keep in mind when setting a lower-bound for our prior on length-scale in the presence of high-frequency covariates. However, this statistic is only valid for one-dimensional inputs.
Priors for marginal standard deviation
The parameter \(\alpha\) corresponds to how much of the variation is explained by the regression function and has a similar role to the prior variance for linear model weights. This means the prior can be the same as used in linear models, such as a half-\(t\) prior on \(\alpha\).
A half-\(t\) or half-Gaussian prior on alpha also has the benefit of putting nontrivial prior mass around zero. This allows the GP support the zero functions and allows the possibility that the GP won’t contribute to the conditional mean of the total output.
Predictive Inference with a Gaussian Process
Suppose for a given sequence of inputs \(x\) that the corresponding outputs \(y\) are observed. Given a new sequence of inputs \(\tilde{x}\), the posterior predictive distribution of their labels is computed by sampling outputs \(\tilde{y}\) according to \[ p(\tilde{y}|\tilde{x},x,y) \ = \ \frac{p(\tilde{y}, y|\tilde{x},x)} {p(y|x)} \ \propto \ p(\tilde{y}, y|\tilde{x},x). \]
A direct implementation in Stan defines a model in terms of the joint distribution of the observed \(y\) and unobserved \(\tilde{y}\).
data {
int<lower=1> N1;
real x1[N1];
vector[N1] y1;
int<lower=1> N2;
real x2[N2];
}
transformed data {
real delta = 1e-9;
int<lower=1> N = N1 + N2;
real x[N];
for (n1 in 1:N1) x[n1] = x1[n1];
for (n2 in 1:N2) x[N1 + n2] = x2[n2];
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
vector[N] eta;
}
transformed parameters {
vector[N] f;
{
matrix[N, N] L_K;
matrix[N, N] K = cov_exp_quad(x, alpha, rho);
// diagonal elements
for (n in 1:N)
K[n, n] = K[n, n] + delta;
L_K = cholesky_decompose(K);
f = L_K * eta;
}
}
model {
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
sigma ~ std_normal();
eta ~ std_normal();
y1 ~ normal(f[1:N1], sigma);
}
generated quantities {
vector[N2] y2;
for (n2 in 1:N2)
y2[n2] = normal_rng(f[N1 + n2], sigma);
}
The input vectors x1
and x2
are declared as data, as is the
observed output vector y1
. The unknown output vector y2
, which
corresponds to input vector x2
, is declared in the generated quantities
block and will be sampled when the model is executed.
A transformed data block is used to combine the input vectors
x1
and x2
into a single vector x
.
The model block declares and defines a local variable for the combined output
vector f
, which consists of the concatenation of the conditional mean
for known outputs y1
and unknown outputs y2
. Thus the
combined output vector f
is aligned with the combined
input vector x
. All that is left is to define the univariate
normal sampling statement for y
.
The generated quantities block defines the quantity y2
. We generate
y2
by sampling N2
univariate normals with each mean corresponding
to the appropriate element in f
.
Predictive Inference in non-Gaussian GPs
We can do predictive inference in non-Gaussian GPs in much the same way as we do with Gaussian GPs.
Consider the following full model for prediction using logistic Gaussian process regression.
data {
int<lower=1> N1;
real x1[N1];
int<lower=0, upper=1> z1[N1];
int<lower=1> N2;
real x2[N2];
}
transformed data {
real delta = 1e-9;
int<lower=1> N = N1 + N2;
real x[N];
for (n1 in 1:N1) x[n1] = x1[n1];
for (n2 in 1:N2) x[N1 + n2] = x2[n2];
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real a;
vector[N] eta;
}
transformed parameters {
vector[N] f;
{
matrix[N, N] L_K;
matrix[N, N] K = cov_exp_quad(x, alpha, rho);
// diagonal elements
for (n in 1:N)
K[n, n] = K[n, n] + delta;
L_K = cholesky_decompose(K);
f = L_K * eta;
}
}
model {
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
a ~ std_normal();
eta ~ std_normal();
z1 ~ bernoulli_logit(a + f[1:N1]);
}
generated quantities {
int z2[N2];
for (n2 in 1:N2)
z2[n2] = bernoulli_logit_rng(a + f[N1 + n2]);
}
Analytical Form of Joint Predictive Inference
Bayesian predictive inference for Gaussian processes with Gaussian observations can be sped up by deriving the posterior analytically, then directly sampling from it.
Jumping straight to the result, \[ p(\tilde{y}|\tilde{x},y,x) = \mathsf{normal}(K^{\top}\Sigma^{-1}y,\ \Omega - K^{\top}\Sigma^{-1}K), \] where \(\Sigma = K(x | \alpha, \rho, \sigma)\) is the result of applying the covariance function to the inputs \(x\) with observed outputs \(y\), \(\Omega = K(\tilde{x} | \alpha, \rho)\) is the result of applying the covariance function to the inputs \(\tilde{x}\) for which predictions are to be inferred, and \(K\) is the matrix of covariances between inputs \(x\) and \(\tilde{x}\), which in the case of the exponentiated quadratic covariance function would be
\[ K(x | \alpha, \rho)_{i, j} = \eta^2 \exp(-\dfrac{1}{2 \rho^2} \sum_{d=1}^D (x_{i,d} - \tilde{x}_{j,d})^2). \]
There is no noise term including \(\sigma^2\) because the indexes of elements in \(x\) and \(\tilde{x}\) are never the same.
This Stan code below uses the analytic form of the posterior and provides
sampling of the resulting multivariate normal through the Cholesky
decomposition. The data declaration is the same as for the latent variable
example, but we’ve defined a function called gp_pred_rng
which will
generate a draw from the posterior predictive mean conditioned on observed data
y1
. The code uses a Cholesky decomposition in triangular solves in order
to cut down on the the number of matrix-matrix multiplications when computing
the conditional mean and the conditional covariance of \(p(\tilde{y})\).
functions {
vector gp_pred_rng(real[] x2,
vector y1,
real[] x1,
real alpha,
real rho,
real sigma,
real delta) {
int N1 = rows(y1);
int N2 = size(x2);
vector[N2] f2;
{
matrix[N1, N1] L_K;
vector[N1] K_div_y1;
matrix[N1, N2] k_x1_x2;
matrix[N1, N2] v_pred;
vector[N2] f2_mu;
matrix[N2, N2] cov_f2;
matrix[N2, N2] diag_delta;
matrix[N1, N1] K;
K = cov_exp_quad(x1, alpha, rho);
for (n in 1:N1)
K[n, n] = K[n,n] + square(sigma);
L_K = cholesky_decompose(K);
K_div_y1 = mdivide_left_tri_low(L_K, y1);
K_div_y1 = mdivide_right_tri_low(K_div_y1', L_K)';
k_x1_x2 = cov_exp_quad(x1, x2, alpha, rho);
f2_mu = (k_x1_x2' * K_div_y1);
v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
cov_f2 = cov_exp_quad(x2, alpha, rho) - v_pred' * v_pred;
diag_delta = diag_matrix(rep_vector(delta, N2));
f2 = multi_normal_rng(f2_mu, cov_f2 + diag_delta);
}
return f2;
}
}
data {
int<lower=1> N1;
real x1[N1];
vector[N1] y1;
int<lower=1> N2;
real x2[N2];
}
transformed data {
vector[N1] mu = rep_vector(0, N1);
real delta = 1e-9;
}
parameters {
real<lower=0> rho;
real<lower=0> alpha;
real<lower=0> sigma;
}
model {
matrix[N1, N1] L_K;
{
matrix[N1, N1] K = cov_exp_quad(x1, alpha, rho);
real sq_sigma = square(sigma);
// diagonal elements
for (n1 in 1:N1)
K[n1, n1] = K[n1, n1] + sq_sigma;
L_K = cholesky_decompose(K);
}
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
sigma ~ std_normal();
y1 ~ multi_normal_cholesky(mu, L_K);
}
generated quantities {
vector[N2] f2;
vector[N2] y2;
f2 = gp_pred_rng(x2, y1, x1, alpha, rho, sigma, delta);
for (n2 in 1:N2)
y2[n2] = normal_rng(f2[n2], sigma);
}
Multiple-output Gaussian processes
Suppose we have observations \(y_i \in \mathbb{R}^M\) observed at \(x_i \in \mathbb{R}^K\). One can model the data like so: \[ \begin{aligned} y_i & \sim \mathsf{multivariate\ normal}(f(x_i), \mathbf{I}_M \sigma^2) \\ f(x) & \sim \mathsf{GP}(m(x), K(x | \theta, \phi)) \\ K(x & | \theta) \in \mathbb{R}^{M \times M}, \, f(x), \, m(x) \in \mathbb{R}^M \end{aligned} \] where the \(K(x, x^\prime | \theta, \phi)_{[m, m^\prime]}\) entry defines the covariance between \(f_m(x)\) and \(f_{m^\prime}(x^\prime)(x)\). This construction of Gaussian processes allows us to learn the covariance between the output dimensions of \(f(x)\). If we parameterize our kernel \(K\): \[ \begin{aligned} K(x, x^\prime | \theta, \phi)_{[m, m^\prime]} = k(x, x^\prime | \theta) k(m, m^\prime | \phi) \end{aligned} \] then our finite dimensional generative model for the above is: \[ \begin{aligned} f & \sim \mathsf{Matrixnormalal}(m(x), K(x | \alpha, \rho), C(\phi)) \\ y_{i, m} & \sim \mathsf{normal}(f_{i,m}, \sigma) \\ f & \in \mathbb{R}^{N \times M} \end{aligned} \] where \(K(x | \alpha, \rho)\) is the exponentiated quadratic kernel we’ve used throughout this chapter, and \(C(\phi)\) is a positive-definite matrix, parameterized by some vector \(\phi\).
The matrix normal distribution has two covariance matrices: \(K(x | \alpha, \rho)\) to encode column covariance, and \(C(\phi)\) to define row covariance. The salient features of the matrix normal are that the rows of the matrix \(f\) are distributed: \[ \begin{aligned} f_{[n,]} \sim \mathsf{multivariate\ normal}(m(x)_{[n,]}, K(x | \alpha, \rho)_{[n,n]} C(\phi)) \end{aligned} \] and that the columns of the matrix \(f\) are distributed: \[ \begin{aligned} f_{[,m]} \sim \mathsf{multivariate\ normal}(m(x)_{[,m]}, K(x | \alpha, \rho) C(\phi)_{[m,m]}) \end{aligned} \] This also means means that \(\mathbb{E}\left[f^T f\right]\) is equal to \(\text{trace}(K(x | \alpha, \rho)) * C\), whereas \(\mathbb{E}\left[ff^T\right]\) is \(\text{trace}(C) * K(x | \alpha, \rho)\). We can derive this using properties of expectation and the matrix normal density.
We should set \(\alpha\) to \(1.0\) because the parameter is not identified unless we constrain \(\text{trace}(C) = 1\). Otherwise, we can multiply \(\alpha\) by a scalar \(d\) and \(C\) by \(1/d\) and our likelihood will not change.
We can generate a random variable \(f\) from a matrix normal density in \(\mathbb{R}^{N \times M}\) using the following algorithm:
\[ \begin{aligned} \eta_{i,j} & \sim \mathsf{normal}(0, 1) \, \forall i,j \\ f & = L_{K(x | 1.0, \rho)} \, \eta \, L_C(\phi)^T \\ f & \sim \mathsf{MatrixNormal}(0, K(x | 1.0, \rho), C(\phi)) \\ \eta & \in \mathbb{R}^{N \times M} \\ L_C(\phi) & = \text{cholesky\_decompose}(C(\phi)) \\ L_{K(x | 1.0, \rho)} & = \text{cholesky\_decompose}(K(x | 1.0, \rho)) \end{aligned} \]
This can be implemented in Stan using a latent-variable GP formulation. We’ve used \(\mathsf{LkjCorr}\) for \(C(\phi)\), but any positive-definite matrix will do.
data {
int<lower=1> N;
int<lower=1> D;
real x[N];
matrix[N, D] y;
}
transformed data {
real delta = 1e-9;
}
parameters {
real<lower=0> rho;
vector<lower=0>[D] alpha;
real<lower=0> sigma;
cholesky_factor_corr[D] L_Omega;
matrix[N, D] eta;
}
model {
matrix[N, D] f;
{
matrix[N, N] K = cov_exp_quad(x, 1.0, rho);
matrix[N, N] L_K;
// diagonal elements
for (n in 1:N)
K[n, n] = K[n, n] + delta;
L_K = cholesky_decompose(K);
f = L_K * eta
* diag_pre_multiply(alpha, L_Omega)';
}
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
sigma ~ std_normal();
L_Omega ~ lkj_corr_cholesky(3);
to_vector(eta) ~ std_normal();
to_vector(y) ~ normal(to_vector(f), sigma);
}
generated quantities {
matrix[D, D] Omega;
Omega = L_Omega * L_Omega';
}
References
Neal, Radford M. 1997. “Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification.” 9702. University of Toronto, Department of Statistics.
Neal, Radford M. 1996a. Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. New York: Springer.
Piironen, Juho, and Aki Vehtari. 2016. “Projection Predictive Model Selection for Gaussian Processes.” In Machine Learning for Signal Processing (Mlsp), 2016 Ieee 26th International Workshop on. IEEE.
Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press.
Zhang, Hao. 2004. “Inconsistent Estimation and Asymptotically Equal Interpolations in Model-Based Geostatistics.” Journal of the American Statistical Association 99 (465). Taylor & Francis: 250–61.