Gaussian Processes
Gaussian processes are continuous stochastic processes and thus may be interpreted as providing a probability distribution over functions. A probability distribution over continuous functions may be viewed, roughly, as an uncountably infinite collection of random variables, one for each valid input. The generality of the supported functions makes Gaussian processes popular choices for priors in general multivariate (non-linear) regression problems.
The defining feature of a Gaussian process is that the joint distribution of the function’s value at a finite number of input points is a multivariate normal distribution. This makes it tractable to both fit models from finite amounts of observed data and make predictions for finitely many new data points.
Unlike a simple multivariate normal distribution, which is parameterized by a mean vector and covariance matrix, a Gaussian process is parameterized by a mean function and covariance function. The mean and covariance functions apply to vectors of inputs and return a mean vector and covariance matrix which provide the mean and covariance of the outputs corresponding to those input points in the functions drawn from the process.
Gaussian processes can be encoded in Stan by implementing their mean and covariance functions or by using the specialized covariance functions outlined below, and plugging the result into the Gaussian model.
This form of model is straightforward and may be used for simulation, model fitting, or posterior predictive inference. A more efficient Stan implementation for the GP with a normally distributed outcome marginalizes over the latent Gaussian process, and applies a Cholesky-factor reparameterization of the Gaussian to compute the likelihood and the posterior predictive distribution analytically.
After defining Gaussian processes, this chapter covers the basic implementations for simulation, hyperparameter estimation, and posterior predictive inference for univariate regressions, multivariate regressions, and multivariate logistic regressions. Gaussian processes are general, and by necessity this chapter only touches on some basic models. For more information, see Rasmussen and Williams (2006).
Note that fitting Gaussian processes as described below, using exact inference via the Cholesky factor of the covariance matrix, scales cubically with the size of the data. Due to how Stan's autodiff is implemented, Stan is also slower than specialized Gaussian process software. As a result, exact inference based on a Cholesky factorization of the covariance matrix is likely to be too slow for practical purposes in Stan once the number of observations grows into the thousands.
Gaussian process regression
The data for a multivariate Gaussian process regression consist of a series of $N$ inputs $x_1, \ldots, x_N \in \mathbb{R}^D$ paired with outputs $y_1, \ldots, y_N \in \mathbb{R}$. The defining feature of Gaussian processes is that the probability of a finite number of outputs $y$ conditioned on their inputs $x$ is Gaussian,
$$
y \sim \textsf{multivariate normal}\bigl(m(x), K(x \mid \theta)\bigr),
$$
where $m(x)$ is an $N$-vector and $K(x \mid \theta)$ is an $N \times N$ covariance matrix. The mean function $m : \mathbb{R}^{N \times D} \to \mathbb{R}^N$ can be anything, but the covariance function $K : \mathbb{R}^{N \times D} \to \mathbb{R}^{N \times N}$ must produce a positive-definite matrix for any input $x$.[1]

A popular covariance function, which will be used in the implementations later in this chapter, is the exponentiated quadratic function,
$$
K(x \mid \alpha, \rho, \sigma)_{i, j}
  = \alpha^2 \exp\!\left( -\frac{1}{2\rho^2} \sum_{d=1}^D (x_{i,d} - x_{j,d})^2 \right)
  + \delta_{i, j}\,\sigma^2,
$$
where $\alpha$, $\rho$, and $\sigma$ are hyperparameters defining the covariance function, and $\delta_{i, j}$ is the Kronecker delta, equal to 1 if $i = j$ and 0 otherwise; note that this test is between the indexes $i$ and $j$, not between the values $x_i$ and $x_j$.

The addition of $\sigma^2$ on the diagonal is important to ensure the positive definiteness of the resulting matrix in the case of two identical inputs $x_i = x_j$. In statistical terms, $\sigma$ is the scale of the noise term in the regression.

The hyperparameter $\rho$ is the length-scale, and corresponds to the frequency of the functions represented by the Gaussian process prior with respect to the domain: values of $\rho$ closer to zero lead the GP to represent higher-frequency functions, whereas larger values lead to lower-frequency functions. The hyperparameter $\alpha$ is the marginal standard deviation; it controls the magnitude of the range of the function represented by the GP.

The only term in the squared exponential covariance function involving the inputs $x_i$ and $x_j$ is their vector difference, $x_i - x_j$. This produces a process with stationary covariance in the sense that if an input vector $x$ is translated by a vector $\epsilon$ to $x + \epsilon$, the covariance at any pair of outputs is unchanged, because $K(x \mid \theta) = K(x + \epsilon \mid \theta)$.

The summation involved is just the squared Euclidean distance between $x_i$ and $x_j$ (i.e., the $L_2$ norm of their difference, $x_i - x_j$). This results in support for smooth functions in the process. The amount of variation in the function is controlled by the free hyperparameters $\alpha$, $\rho$, and $\sigma$.

Changing the notion of distance from Euclidean to taxicab distance (i.e., an $L_1$ norm) changes the support to functions which are continuous but not smooth.
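As a concrete sketch of that variation (not one of the implementations used later in this chapter), a covariance function based on taxicab distance could be coded as a user-defined Stan function. The function name l1_exp_cov, its arguments, and the choice to divide the raw $L_1$ distance by rho are our own illustrative choices, and no noise or jitter term is added here:

functions {
  // Sketch of a covariance function that uses L1 (taxicab) distance in
  // place of the squared Euclidean distance of the exponentiated
  // quadratic kernel; each dimension contributes an exponential
  // (Ornstein-Uhlenbeck style) factor, so the product remains a valid
  // positive-definite kernel.
  matrix l1_exp_cov(array[] vector x, real alpha, real rho) {
    int N = size(x);
    matrix[N, N] K;
    real sq_alpha = square(alpha);
    for (i in 1:N) {
      K[i, i] = sq_alpha;
      for (j in (i + 1):N) {
        real d = 0;
        for (k in 1:rows(x[i])) {
          d += abs(x[i][k] - x[j][k]);  // accumulate taxicab distance
        }
        K[i, j] = sq_alpha * exp(-d / rho);
        K[j, i] = K[i, j];
      }
    }
    return K;
  }
}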
Simulating from a Gaussian process
It is simplest to start with a Stan model that does nothing more than simulate draws of functions from a Gaussian process; in practical terms, the model draws values $y_n = f(x_n)$ for finitely many input points $x_n$.

The Stan model defines the mean and covariance functions in a transformed data block and then samples outputs $y$ in the model using a multivariate normal distribution. To make the model concrete, the squared exponential covariance function described in the previous section will be used, with the hyperparameters set to unit marginal standard deviation and unit length-scale and a noise variance of 0.1, and the mean function defined to always return the zero vector. Consider the following implementation of a Gaussian process simulator.
data {
  int<lower=1> N;
  array[N] real x;
}
transformed data {
  matrix[N, N] K;
  vector[N] mu = rep_vector(0, N);
  for (i in 1:(N - 1)) {
    K[i, i] = 1 + 0.1;
    for (j in (i + 1):N) {
      K[i, j] = exp(-0.5 * square(x[i] - x[j]));
      K[j, i] = K[i, j];
    }
  }
  K[N, N] = 1 + 0.1;
}
parameters {
  vector[N] y;
}
model {
  y ~ multi_normal(mu, K);
}
The above model can also be written more compactly using the specialized covariance function that implements the exponentiated quadratic kernel.
data {
  int<lower=1> N;
  array[N] real x;
}
transformed data {
  matrix[N, N] K = gp_exp_quad_cov(x, 1.0, 1.0);
  vector[N] mu = rep_vector(0, N);
  for (n in 1:N) {
    K[n, n] = K[n, n] + 0.1;
  }
}
parameters {
  vector[N] y;
}
model {
  y ~ multi_normal(mu, K);
}
The input data are just the vector of inputs x and its size N. Such a model can be used with values of x evenly spaced over some interval in order to plot sample draws of functions from a Gaussian process.
Multivariate inputs
Only the input data needs to change in moving from a univariate model to a multivariate model.
The only lines that change from the univariate model above are as follows.
data {
  int<lower=1> N;
  int<lower=1> D;
  array[N] vector[D] x;
}
transformed data {
  // ...
}
The data are now declared as an array of vectors instead of an array of scalars; the dimensionality D is also declared.
In the remainder of the chapter, univariate models will be used for simplicity, but any of the models could be changed to multivariate in the same way as the simple sampling model. The only extra computational overhead from a multivariate model is in the distance calculation.
Cholesky factored and transformed implementation
A more efficient implementation of the simulation model can be coded in Stan by relocating, rescaling, and rotating an isotropic standard normal variate. Suppose $\eta$ is an isotropic standard normal variate, $\eta \sim \textsf{normal}(0, \mathbf{I}_N)$, where $\mathbf{I}_N$ is the $N \times N$ identity matrix. Let $L$ be the Cholesky decomposition of $K(x \mid \theta)$, i.e., the lower-triangular matrix such that $L L^\top = K(x \mid \theta)$. Then the transformed variable $\mu + L\eta$ has the intended target distribution, $\mu + L\eta \sim \textsf{multivariate normal}\bigl(\mu, K(x \mid \theta)\bigr)$.
This transform can be applied directly to Gaussian process simulation.
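A one-line check of this transform (a standard identity, stated here for completeness) uses the fact that an affine transformation of a multivariate normal variate is multivariate normal:
$$
\mathbb{E}[\mu + L\eta] = \mu,
\qquad
\operatorname{Cov}(\mu + L\eta)
  = L \operatorname{Cov}(\eta) L^\top
  = L \mathbf{I}_N L^\top
  = L L^\top
  = K(x \mid \theta),
$$
so mu + L * eta has exactly the mean and covariance of the target distribution.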
This model has the same data declarations for N and x, and the same transformed data definitions of mu and K as the previous model, with the addition of a transformed data variable for the Cholesky decomposition. The parameters change to the raw parameters sampled from an isotropic standard normal, and the actual samples are defined as generated quantities.
// ...
transformed data {
  matrix[N, N] L;
  // ...
  L = cholesky_decompose(K);
}
parameters {
  vector[N] eta;
}
model {
  eta ~ std_normal();
}
generated quantities {
  vector[N] y;
  y = mu + L * eta;
}
The Cholesky decomposition is only computed once, after the data are loaded and the covariance matrix K computed. The isotropic normal distribution for eta is specified as a vectorized univariate distribution for efficiency; this specifies that each eta[n] has an independent standard normal distribution. The sampled vector y is then defined as a generated quantity using a direct encoding of the transform described above.
Fitting a Gaussian process
GP with a normal outcome
The full generative model for a GP with a normal outcome, $y \in \mathbb{R}^N$, with inputs $x$, for a latent random function $f$, is
$$
\begin{aligned}
\rho &\sim \textsf{InvGamma}(5, 5) \\
\alpha &\sim \textsf{normal}(0, 1) \\
\sigma &\sim \textsf{normal}(0, 1) \\
f &\sim \textsf{multivariate normal}\bigl(0, K(x \mid \alpha, \rho)\bigr) \\
y_i &\sim \textsf{normal}(f_i, \sigma) \quad \forall i \in \{1, \ldots, N\}.
\end{aligned}
$$
With a normal outcome it is possible to integrate out the latent function $f$, yielding the more parsimonious model
$$
\begin{aligned}
\rho &\sim \textsf{InvGamma}(5, 5) \\
\alpha &\sim \textsf{normal}(0, 1) \\
\sigma &\sim \textsf{normal}(0, 1) \\
y &\sim \textsf{multivariate normal}\bigl(0, K(x \mid \alpha, \rho) + \mathbf{I}_N \sigma^2\bigr).
\end{aligned}
$$
It can be more computationally efficient when dealing with a normal outcome to integrate out the Gaussian process, because this yields a lower-dimensional parameter space over which to do inference. We’ll fit both models in Stan. The former model will be referred to as the latent variable GP, while the latter will be called the marginal likelihood GP.
The hyperparameters controlling the covariance function of a Gaussian process can be fit by assigning them priors, like we have in the generative models above, and then computing the posterior distribution of the hyperparameters given observed data. The priors on the parameters should be defined based on prior knowledge of the scale of the output values ($y$), the scale of the output noise ($\sigma$), and the scale at which distances are measured among the inputs ($x$); see the section on priors for Gaussian process parameters below for more detail.
The Stan program implementing the marginal likelihood GP is shown below. The program is similar to the Stan programs that implement the simulation GPs above, but because we are doing inference on the hyperparameters, we need to calculate the covariance matrix K in the model block, rather than the transformed data block.
data {
  int<lower=1> N;
  array[N] real x;
  vector[N] y;
}
transformed data {
  vector[N] mu = rep_vector(0, N);
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
model {
  matrix[N, N] L_K;
  matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);
  real sq_sigma = square(sigma);

  // diagonal elements
  for (n in 1:N) {
    K[n, n] = K[n, n] + sq_sigma;
  }

  L_K = cholesky_decompose(K);

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();

  y ~ multi_normal_cholesky(mu, L_K);
}
The data block declares a vector y of observed values y[n] for inputs x[n]. The transformed data block now only defines the mean vector to be zero. The three hyperparameters are defined as parameters constrained to be non-negative. The computation of the covariance matrix K is now in the model block because it involves unknown parameters and thus can't simply be precomputed as transformed data. The rest of the model consists of the priors for the hyperparameters and the multivariate Cholesky-parameterized normal distribution, only now the value y is known and the covariance matrix K is an unknown dependent on the hyperparameters, allowing us to learn the hyperparameters.
We have used the Cholesky-parameterized multivariate normal rather than the standard parameterization because it allows us to use the cholesky_decompose function, which has been optimized for both small and large matrices. When working with small matrices the difference in computational speed between the two approaches will not be noticeable, but for larger matrices the Cholesky-parameterized version will be noticeably faster.
Hamiltonian Monte Carlo sampling is fast and effective for hyperparameter inference in this model (Neal 1997). If the posterior is well-concentrated for the hyperparameters the Stan implementation will fit hyperparameters in models with a few hundred data points in seconds.
Latent variable GP
We can also explicitly code the latent variable formulation of a GP in Stan. This will be useful when the outcome is not normal. We'll need to add a small positive term, $\delta$, to the diagonal of the covariance matrix in order to ensure that it remains positive definite.
data {
  int<lower=1> N;
  array[N] real x;
  vector[N] y;
}
transformed data {
  real delta = 1e-9;
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
model {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);

    // diagonal elements
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }

    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  eta ~ std_normal();

  y ~ normal(f, sigma);
}
Two differences between the latent variable GP and the marginal likelihood GP are worth noting. The first is that we have augmented our parameter block with a new parameter vector of length N called eta. This is used in the model block to generate a multivariate normal vector called f, just as in the Cholesky-parameterized GP in the simulation section. The second difference is that although we could code the distribution statement for y as a multivariate normal with a diagonal covariance matrix, the observations are independent conditional on f, so the vectorized univariate statement y ~ normal(f, sigma) is equivalent and more efficient.
Discrete outcomes with Gaussian processes
Gaussian processes can be generalized the same way as standard linear models by introducing a link function. This allows them to be used as discrete data models.
Poisson GP
If we want to model count data, we can remove the $\sigma$ parameter and use poisson_log, which implements a Poisson distribution with a log link function, rather than normal. We can also add an overall mean parameter, $a$, which accounts for the marginal expected value of the counts; we do this because we cannot center count data the way we can center continuous data.
data {
  // ...
  array[N] int<lower=0> y;
  // ...
}
// ...
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real a;
  vector[N] eta;
}
model {
  // ...
  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  a ~ std_normal();
  eta ~ std_normal();

  y ~ poisson_log(a + f);
}
Logistic Gaussian process regression
For binary classification problems, the observed outputs $z_n \in \{0, 1\}$ are binary. These outputs are modeled using a Gaussian process with (unobserved) outputs $y_n$ through the logistic link,
$$
z_n \sim \textsf{Bernoulli}\bigl(\operatorname{logit}^{-1}(y_n)\bigr),
$$
or, in other words, $\Pr[z_n = 1] = \operatorname{logit}^{-1}(y_n)$.
We can extend our latent variable GP Stan program to deal with classification problems. Below, a is the bias term, which can help account for imbalanced classes in the training data:
data {
  // ...
  array[N] int<lower=0, upper=1> z;
  // ...
}
// ...
model {
  // ...
  z ~ bernoulli_logit(a + f);
}
Automatic relevance determination
If we have multivariate inputs $x \in \mathbb{R}^D$, the squared exponential covariance function can be further generalized by fitting a separate length-scale $\rho_d$ for each dimension $d$,
$$
K(x \mid \alpha, \vec{\rho}, \sigma)_{i, j}
  = \alpha^2 \exp\!\left( -\frac{1}{2} \sum_{d=1}^D \frac{1}{\rho_d^2} (x_{i,d} - x_{j,d})^2 \right)
  + \delta_{i, j}\,\sigma^2.
$$
The estimation of $\vec{\rho}$ is often called "automatic relevance determination," but this name is somewhat misleading: the scale of the posterior for each $\rho_d$ depends on how the input data are scaled along dimension $d$, and $\rho_d$ measures the degree of non-linearity along that dimension rather than its "relevance."

A priori, the closer $\rho_d$ is to zero, the more non-linear the conditional mean of the output is with respect to the $d$-th input dimension.
The implementation of automatic relevance determination in Stan is straightforward, though it currently requires the user to directly code the covariance matrix. We'll write a function to generate the Cholesky factor of the covariance matrix, called L_gp_exp_quad_cov_ARD.
functions {
  matrix L_gp_exp_quad_cov_ARD(array[] vector x,
                               real alpha,
                               vector rho,
                               real delta) {
    int N = size(x);
    matrix[N, N] K;
    real sq_alpha = square(alpha);
    for (i in 1:(N-1)) {
      K[i, i] = sq_alpha + delta;
      for (j in (i + 1):N) {
        K[i, j] = sq_alpha
                  * exp(-0.5 * dot_self((x[i] - x[j]) ./ rho));
        K[j, i] = K[i, j];
      }
    }
    K[N, N] = sq_alpha + delta;
    return cholesky_decompose(K);
  }
}
data {
  int<lower=1> N;
  int<lower=1> D;
  array[N] vector[D] x;
  vector[N] y;
}
transformed data {
  real delta = 1e-9;
}
parameters {
  vector<lower=0>[D] rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
model {
  vector[N] f;
  {
    matrix[N, N] L_K = L_gp_exp_quad_cov_ARD(x, alpha, rho, delta);
    f = L_K * eta;
  }

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  eta ~ std_normal();

  y ~ normal(f, sigma);
}
Priors for Gaussian process parameters
Formulating priors for GP hyperparameters requires the analyst to consider the inherent statistical properties of a GP, the GP’s purpose in the model, and the numerical issues that may arise in Stan when estimating a GP.
Perhaps most importantly, the length-scale $\rho$ and the marginal standard deviation $\alpha$ are only weakly identified on their own; their ratio is better identified, but in practice we place independent priors on the two hyperparameters because they are individually more interpretable than their ratio.
Priors for length-scale
GPs are a flexible class of priors and, as such, can represent a wide spectrum of functions. For length scales below the minimum spacing of the covariates the GP likelihood plateaus. Unless regularized by a prior, this flat likelihood induces considerable posterior mass at small length scales where the observation variance drops to zero and the functions supported by the GP begin to exactly interpolate between the input data. The resulting posterior not only significantly overfits to the input data, it also becomes hard to accurately sample using Euclidean HMC.
We may wish to put further soft constraints on the length-scale, but these are dependent on how the GP is used in our statistical model.
If our model consists of only the GP, i.e.:
$$
\begin{aligned}
f &\sim \textsf{multivariate normal}\bigl(0, K(x \mid \alpha, \rho)\bigr) \\
y_i &\sim \textsf{normal}(f_i, \sigma) \quad \forall i \in \{1, \ldots, N\},
\end{aligned}
$$
we likely don’t need constraints beyond penalizing small length-scales. We’d like to allow the GP prior to represent both high-frequency and low-frequency functions, so our prior should put non-negligible mass on both sets of functions. In this case, an inverse gamma, inv_gamma_lpdf
in Stan’s language, will work well as it has a sharp left tail that puts negligible mass on infinitesimal length-scales, but a generous right tail, allowing for large length-scales. Inverse gamma priors will avoid infinitesimal length-scales because the density is zero at zero, so the posterior for length-scale will be pushed away from zero. An inverse gamma distribution is one of many zero-avoiding or boundary-avoiding distributions.2.
If we're using the GP as a component in a larger model that includes an overall mean and fixed effects for the same variables we're using as the domain for the GP, i.e.:
$$
\begin{aligned}
f &\sim \textsf{multivariate normal}\bigl(0, K(x \mid \alpha, \rho)\bigr) \\
y_i &\sim \textsf{normal}\bigl(\beta_0 + x_i \beta + f_i, \sigma\bigr) \quad \forall i \in \{1, \ldots, N\},
\end{aligned}
$$
we'll likely want to constrain large length-scales as well. A length-scale that is larger than the scale of the data yields a GP posterior that is practically linear (with respect to the particular covariate), and increasing the length-scale further has little impact on the likelihood. This will introduce nonidentifiability into our model, as both the fixed effects and the GP will explain similar variation. In order to limit the amount of overlap between the GP and the linear regression, we should use a prior with a sharper right tail to limit the GP to higher-frequency functions. We can use a generalized inverse Gaussian distribution,
$$
f(x \mid a, b, p)
  = \frac{(a/b)^{p/2}}{2 K_p\!\left(\sqrt{a b}\right)}\, x^{p - 1}
    \exp\!\left( -\frac{a x + b / x}{2} \right),
$$
which has an inverse gamma left tail if $p \leq 0$ and an inverse Gaussian right tail. This distribution is not in Stan's math library, but it can be implemented as a user-defined function:
functions {
  real generalized_inverse_gaussian_lpdf(real x, int p,
                                         real a, real b) {
    return p * 0.5 * log(a / b)
           - log(2 * modified_bessel_second_kind(p, sqrt(a * b)))
           + (p - 1) * log(x)
           - (a * x + b / x) * 0.5;
  }
}
data {
  // ...
}
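With that function available in the functions block, the length-scale prior can then be written as an ordinary distribution statement; the particular values of p, a, and b below are placeholders, not recommendations:

model {
  // hypothetical hyperparameter values; choose them based on the scale
  // at which distances among the covariates are measured
  rho ~ generalized_inverse_gaussian(2, 5, 5);
  // ...
}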
If we have high-frequency covariates in our fixed effects, we may wish to further regularize the GP away from high-frequency functions, which means we'll need to penalize smaller length-scales. Luckily, we have a useful way of thinking about how the length-scale affects the frequency of the functions supported by the GP: if we were to repeatedly draw from a zero-mean GP with length-scale $\rho$ over the unit interval, the expected number of times each draw crosses zero is inversely proportional to $\rho$, so smaller length-scales correspond to higher-frequency functions.
Priors for marginal standard deviation
The parameter $\alpha$ corresponds to how much of the variation in the output is explained by the regression function, and it plays a role similar to that of the prior variance on the coefficients of a linear model. This means the prior can be chosen in much the same way as for linear model coefficients, for example a half-$t$ or half-normal prior on $\alpha$.

A half-$t$ or half-normal prior also puts nontrivial prior mass near zero. This allows the GP to be effectively switched off, admitting the possibility that the GP contributes little or nothing to the conditional mean of the output.
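In Stan, a half-$t$ prior arises from combining a declaration constrained to be non-negative with a Student-$t$ distribution statement; a minimal sketch, where the degrees of freedom and scale are illustrative choices only:

parameters {
  real<lower=0> alpha;
  // ...
}
model {
  // the <lower=0> constraint truncates the Student-t at zero,
  // yielding a half-t prior on alpha
  alpha ~ student_t(4, 0, 1);
  // ...
}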
Predictive inference with a Gaussian process
Suppose for a given sequence of inputs $x$ that the corresponding outputs $y$ are observed. Given a new sequence of inputs $\tilde{x}$, the posterior predictive distribution of their outputs is computed by sampling $\tilde{y}$ according to
$$
p(\tilde{y} \mid \tilde{x}, x, y)
  = \frac{p(\tilde{y}, y \mid \tilde{x}, x)}{p(y \mid x)}
  \propto p(\tilde{y}, y \mid \tilde{x}, x).
$$
A direct implementation in Stan defines a model in terms of the joint distribution of the observed $y$ and the unobserved $\tilde{y}$.
data {
  int<lower=1> N1;
  array[N1] real x1;
  vector[N1] y1;
  int<lower=1> N2;
  array[N2] real x2;
}
transformed data {
  real delta = 1e-9;
  int<lower=1> N = N1 + N2;
  array[N] real x;
  for (n1 in 1:N1) {
    x[n1] = x1[n1];
  }
  for (n2 in 1:N2) {
    x[N1 + n2] = x2[n2];
  }
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);

    // diagonal elements
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }

    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
}
model {
  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  eta ~ std_normal();

  y1 ~ normal(f[1:N1], sigma);
}
generated quantities {
  vector[N2] y2;
  for (n2 in 1:N2) {
    y2[n2] = normal_rng(f[N1 + n2], sigma);
  }
}
The input vectors x1 and x2 are declared as data, as is the observed output vector y1. The unknown output vector y2, which corresponds to input vector x2, is declared in the generated quantities block and will be sampled when the model is executed.

A transformed data block is used to combine the input vectors x1 and x2 into a single vector x.
The transformed parameters block declares and defines the combined latent function vector f, which contains the function values both for the inputs x1 corresponding to the known outputs y1 and for the inputs x2 corresponding to the unknown outputs y2. Thus the combined vector f is aligned with the combined input vector x. All that is left is to define the univariate normal distribution statement for y1, which applies to the first N1 elements of f.
The generated quantities block defines the quantity y2. We generate y2 by randomly generating N2 values from univariate normals, with each mean corresponding to the appropriate element in f.
Predictive inference in non-Gaussian GPs
We can do predictive inference in non-Gaussian GPs in much the same way as we do with Gaussian GPs.
Consider the following full model for prediction using logistic Gaussian process regression.
data {
  int<lower=1> N1;
  array[N1] real x1;
  array[N1] int<lower=0, upper=1> z1;
  int<lower=1> N2;
  array[N2] real x2;
}
transformed data {
  real delta = 1e-9;
  int<lower=1> N = N1 + N2;
  array[N] real x;
  for (n1 in 1:N1) {
    x[n1] = x1[n1];
  }
  for (n2 in 1:N2) {
    x[N1 + n2] = x2[n2];
  }
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real a;
  vector[N] eta;
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);

    // diagonal elements
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }

    L_K = cholesky_decompose(K);
    f = L_K * eta;
  }
}
model {
  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  a ~ std_normal();
  eta ~ std_normal();

  z1 ~ bernoulli_logit(a + f[1:N1]);
}
generated quantities {
  array[N2] int z2;
  for (n2 in 1:N2) {
    z2[n2] = bernoulli_logit_rng(a + f[N1 + n2]);
  }
}
Analytical form of joint predictive inference
Bayesian predictive inference for Gaussian processes with Gaussian observations can be sped up by deriving the posterior analytically, then directly sampling from it.
Jumping straight to the result, the posterior distribution of the latent function values $\tilde{f}$ at the prediction inputs $\tilde{x}$ is
$$
p(\tilde{f} \mid \tilde{x}, x, y)
  = \textsf{normal}\bigl(K^\top \Sigma^{-1} y,\; \Omega - K^\top \Sigma^{-1} K\bigr),
$$
where $\Sigma = K(x \mid \alpha, \rho, \sigma)$ is the result of applying the covariance function (including the noise term) to the observed inputs $x$, $\Omega = K(\tilde{x} \mid \alpha, \rho)$ is the result of applying the noise-free covariance function to the prediction inputs $\tilde{x}$, and $K$ is the matrix of covariances between the observed inputs $x$ and the prediction inputs $\tilde{x}$, also computed without a noise term.

There is no noise term including $\sigma^2$ in $\Omega$ or $K$ because this is the posterior for the noise-free function values; observation noise is added afterwards when simulating $\tilde{y}$, as in the generated quantities block of the program below.
The Stan code below uses the analytic form of the posterior and provides sampling from the resulting multivariate normal through its Cholesky decomposition. The data declaration is the same as for the latent variable example, but we've defined a function called gp_pred_rng which generates a draw from the posterior predictive distribution of the function values conditioned on the observed data y1. The code uses a Cholesky decomposition and triangular solves in order to cut down on the number of matrix-matrix multiplications when computing the conditional mean and the conditional covariance of the predictions.
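The identities behind those triangular solves, written with the Cholesky factor $L$ of $\Sigma$ (so $\Sigma = L L^\top$), are standard linear algebra, restated here to connect the formulas to the code:
$$
\Sigma^{-1} y = L^{-\top}\bigl(L^{-1} y\bigr),
\qquad
K^\top \Sigma^{-1} K = v^\top v
\quad \text{where } v = L^{-1} K,
$$
so the calls to mdivide_left_tri_low and mdivide_right_tri_low perform the two triangular solves for $\Sigma^{-1} y$, and v_pred' * v_pred gives the reduction in covariance.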
functions {
  vector gp_pred_rng(array[] real x2,
                     vector y1,
                     array[] real x1,
                     real alpha,
                     real rho,
                     real sigma,
                     real delta) {
    int N1 = rows(y1);
    int N2 = size(x2);
    vector[N2] f2;
    {
      matrix[N1, N1] L_K;
      vector[N1] K_div_y1;
      matrix[N1, N2] k_x1_x2;
      matrix[N1, N2] v_pred;
      vector[N2] f2_mu;
      matrix[N2, N2] cov_f2;
      matrix[N2, N2] diag_delta;
      matrix[N1, N1] K;
      K = gp_exp_quad_cov(x1, alpha, rho);
      for (n in 1:N1) {
        K[n, n] = K[n, n] + square(sigma);
      }
      L_K = cholesky_decompose(K);
      K_div_y1 = mdivide_left_tri_low(L_K, y1);
      K_div_y1 = mdivide_right_tri_low(K_div_y1', L_K)';
      k_x1_x2 = gp_exp_quad_cov(x1, x2, alpha, rho);
      f2_mu = (k_x1_x2' * K_div_y1);
      v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
      cov_f2 = gp_exp_quad_cov(x2, alpha, rho) - v_pred' * v_pred;
      diag_delta = diag_matrix(rep_vector(delta, N2));

      f2 = multi_normal_rng(f2_mu, cov_f2 + diag_delta);
    }
    return f2;
  }
}
data {
  int<lower=1> N1;
  array[N1] real x1;
  vector[N1] y1;
  int<lower=1> N2;
  array[N2] real x2;
}
transformed data {
  vector[N1] mu = rep_vector(0, N1);
  real delta = 1e-9;
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
model {
  matrix[N1, N1] L_K;
  {
    matrix[N1, N1] K = gp_exp_quad_cov(x1, alpha, rho);
    real sq_sigma = square(sigma);

    // diagonal elements
    for (n1 in 1:N1) {
      K[n1, n1] = K[n1, n1] + sq_sigma;
    }

    L_K = cholesky_decompose(K);
  }

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();

  y1 ~ multi_normal_cholesky(mu, L_K);
}
generated quantities {
  vector[N2] f2;
  vector[N2] y2;
  f2 = gp_pred_rng(x2, y1, x1, alpha, rho, sigma, delta);
  for (n2 in 1:N2) {
    y2[n2] = normal_rng(f2[n2], sigma);
  }
}
Multiple-output Gaussian processes
Suppose we have observations $y_i \in \mathbb{R}^D$ observed at inputs $x_i$, collected into a matrix $y \in \mathbb{R}^{N \times D}$, and we want to model the $D$ outputs jointly, learning the covariance among the output dimensions as well as the covariance over the inputs. A natural prior for the latent function values $f \in \mathbb{R}^{N \times D}$ is a matrix normal distribution,
$$
f \sim \textsf{matrix normal}\bigl(0,\; K(x \mid \alpha, \rho),\; C(\phi)\bigr),
\qquad
y_{i, d} \sim \textsf{normal}(f_{i, d}, \sigma),
$$
where $K(x \mid \alpha, \rho)$ is the exponentiated quadratic kernel used throughout this chapter and $C(\phi)$ is a $D \times D$ positive-definite matrix.

The matrix normal distribution has two covariance matrices: $K(x \mid \alpha, \rho) \in \mathbb{R}^{N \times N}$ encodes covariance across the rows of $f$ (that is, across input points), while $C(\phi) \in \mathbb{R}^{D \times D}$ encodes covariance across the columns (that is, across output dimensions). Equivalently, $\operatorname{vec}(f) \sim \textsf{multivariate normal}\bigl(0, C(\phi) \otimes K(x \mid \alpha, \rho)\bigr)$.

We should set the kernel's marginal standard deviation $\alpha$ to 1.0, because it is not identified jointly with the overall scale of $C(\phi)$: multiplying the kernel by a constant and dividing $C(\phi)$ by the same constant leaves the likelihood unchanged. In the program below the per-output scales are instead carried by the vector alpha.

We can generate a random variable $f$ with this matrix normal distribution as follows: draw $\eta_{i, j} \sim \textsf{normal}(0, 1)$ independently for all $i, j$, compute the Cholesky factors $L_K = \texttt{cholesky\_decompose}\bigl(K(x \mid 1.0, \rho)\bigr)$ and $L_C = \texttt{cholesky\_decompose}\bigl(C(\phi)\bigr)$, and set $f = L_K\, \eta\, L_C^\top$.
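A short justification of this construction (our own derivation, using the identity $\operatorname{vec}(A X B) = (B^\top \otimes A)\operatorname{vec}(X)$ and the fact that $\eta$ has independent standard normal entries):
$$
\operatorname{Cov}\bigl(\operatorname{vec}(L_K\, \eta\, L_C^\top)\bigr)
  = \bigl(L_C \otimes L_K\bigr)\bigl(L_C \otimes L_K\bigr)^\top
  = \bigl(L_C L_C^\top\bigr) \otimes \bigl(L_K L_K^\top\bigr)
  = C(\phi) \otimes K(x \mid 1.0, \rho),
$$
which is exactly the covariance of $\operatorname{vec}(f)$ under the matrix normal distribution above.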
This can be implemented in Stan using a latent-variable GP formulation. We've used an LKJ prior on the Cholesky factor L_Omega of the correlation matrix among the outputs, together with a vector of per-output scales alpha, so that the column covariance is $C(\phi) = \operatorname{diag}(\texttt{alpha})\,\Omega\,\operatorname{diag}(\texttt{alpha})$.
data {
  int<lower=1> N;
  int<lower=1> D;
  array[N] real x;
  matrix[N, D] y;
}
transformed data {
  real delta = 1e-9;
}
parameters {
  real<lower=0> rho;
  vector<lower=0>[D] alpha;
  real<lower=0> sigma;
  cholesky_factor_corr[D] L_Omega;
  matrix[N, D] eta;
}
model {
  matrix[N, D] f;
  {
    matrix[N, N] K = gp_exp_quad_cov(x, 1.0, rho);
    matrix[N, N] L_K;

    // diagonal elements
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }

    L_K = cholesky_decompose(K);
    f = L_K * eta * diag_pre_multiply(alpha, L_Omega)';
  }

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  L_Omega ~ lkj_corr_cholesky(3);
  to_vector(eta) ~ std_normal();

  to_vector(y) ~ normal(to_vector(f), sigma);
}
generated quantities {
  matrix[D, D] Omega;
  Omega = L_Omega * L_Omega';
}
References

Neal, Radford M. 1997. "Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification." arXiv:physics/9701026.

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press.
Footnotes
[1] Gaussian processes can be extended to covariance functions producing positive semi-definite matrices, but Stan does not support inference in the resulting models because the resulting distribution does not have unconstrained support.

[2] A boundary-avoiding prior is just one where the limit of the density is zero at the boundary, the result of which is estimates that are pushed away from the boundary.