
10.1 Gaussian Process Regression

The data for a multivariate Gaussian process regression consists of a series of $N$ inputs $x_1, \ldots, x_N \in \mathbb{R}^D$ paired with outputs $y_1, \ldots, y_N \in \mathbb{R}$. The defining feature of Gaussian processes is that the probability of a finite number of outputs $y$ conditioned on their inputs $x$ is Gaussian:
$$y \sim \textsf{multivariate normal}(m(x), K(x \mid \theta)),$$
where $m(x)$ is an $N$-vector and $K(x \mid \theta)$ is an $N \times N$ covariance matrix. The mean function $m : \mathbb{R}^{N \times D} \to \mathbb{R}^N$ can be anything, but the covariance function $K : \mathbb{R}^{N \times D} \to \mathbb{R}^{N \times N}$ must produce a positive-definite matrix for any input $x$.¹

A popular covariance function, which will be used in the implementations later in this chapter, is an exponentiated quadratic function,
$$K(x \mid \alpha, \rho, \sigma)_{i,j} = \alpha^2 \exp\left(-\frac{1}{2\rho^2} \sum_{d=1}^{D} (x_{i,d} - x_{j,d})^2\right) + \delta_{i,j}\sigma^2,$$
where $\alpha$, $\rho$, and $\sigma$ are hyperparameters defining the covariance function and where $\delta_{i,j}$ is the Kronecker delta function with value 1 if $i = j$ and value 0 otherwise; this test is between the indexes $i$ and $j$, not between the values $x_i$ and $x_j$. This kernel is obtained through a convolution of two independent Gaussian processes, $f_1$ and $f_2$, with kernels
$$K_1(x \mid \alpha, \rho)_{i,j} = \alpha^2 \exp\left(-\frac{1}{2\rho^2} \sum_{d=1}^{D} (x_{i,d} - x_{j,d})^2\right)$$
and
$$K_2(x \mid \sigma)_{i,j} = \delta_{i,j}\sigma^2.$$
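The exponentiated quadratic covariance above can be sketched in Python with NumPy. This is an illustrative helper, not one of the Stan implementations referred to in this chapter; the function name is hypothetical.

```python
import numpy as np

def exp_quad_kernel(x, alpha, rho, sigma):
    """Exponentiated quadratic covariance with noise on the diagonal.

    x: (N, D) array of inputs; alpha, rho, sigma are the hyperparameters
    from the text (marginal SD, length-scale, noise SD).
    """
    # Squared Euclidean distances between all pairs of rows of x.
    diff = x[:, None, :] - x[None, :, :]      # shape (N, N, D)
    sq_dist = np.sum(diff ** 2, axis=-1)      # shape (N, N)
    K = alpha ** 2 * np.exp(-0.5 * sq_dist / rho ** 2)
    # Kronecker delta term: noise variance added only where i == j.
    K += sigma ** 2 * np.eye(len(x))
    return K
```

With $\sigma > 0$ the resulting matrix is positive definite even when two inputs coincide, so a Cholesky factorization succeeds.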

The addition of $\sigma^2$ on the diagonal is important to ensure the positive definiteness of the resulting matrix in the case of two identical inputs $x_i = x_j$. In statistical terms, $\sigma$ is the scale of the noise term in the regression.

The hyperparameter $\rho$ is the length-scale, and corresponds to the frequency of the functions represented by the Gaussian process prior with respect to the domain. Values of $\rho$ closer to zero lead the GP to represent high-frequency functions, whereas larger values of $\rho$ lead to low-frequency functions. The hyperparameter $\alpha$ is the marginal standard deviation. It controls the magnitude of the range of the function represented by the GP. If you were to take many draws from the $f_1$ prior at a single input $x$ for a fixed value of $\alpha$, the standard deviation of those draws would be approximately $\alpha$.
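The claim about $\alpha$ can be checked numerically. At a single input $x$, $K_1(x \mid \alpha, \rho)$ is the $1 \times 1$ matrix $[\alpha^2]$ regardless of $\rho$, so draws from the $f_1$ prior at that point have standard deviation $\alpha$. A sketch (not code from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.0
# K1(x | alpha, rho) evaluated at a single input is [[alpha^2]].
K1 = np.array([[alpha ** 2]])
# Many draws from the f1 prior at that one input.
draws = rng.multivariate_normal(mean=[0.0], cov=K1, size=100_000)
print(draws.std())  # approximately alpha = 2.0
```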

The only term in the squared exponential covariance function involving the inputs $x_i$ and $x_j$ is their vector difference, $x_i - x_j$. This produces a process with stationary covariance in the sense that if an input vector $x$ is translated by a vector $\epsilon$ to $x + \epsilon$, the covariance at any pair of outputs is unchanged, because $K(x \mid \theta) = K(x + \epsilon \mid \theta)$.
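Stationarity is easy to verify numerically: translating every input by the same vector leaves the covariance matrix unchanged. A sketch with an illustrative helper (not from the chapter):

```python
import numpy as np

def exp_quad(x, alpha, rho):
    # Noise-free exponentiated quadratic kernel over rows of x.
    diff = x[:, None, :] - x[None, :, :]
    return alpha ** 2 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / rho ** 2)

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))          # five inputs in R^3
eps = np.array([0.7, -1.2, 3.4])     # an arbitrary translation vector
# The kernel depends only on differences x_i - x_j, which the
# translation cancels, so the two matrices are identical.
assert np.allclose(exp_quad(x, 1.0, 2.0), exp_quad(x + eps, 1.0, 2.0))
```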

The summation involved is just the squared Euclidean distance between $x_i$ and $x_j$ (i.e., the squared $L^2$ norm of their difference, $\lVert x_i - x_j \rVert^2$). This results in support for smooth functions in the process. The amount of variation in the function is controlled by the free hyperparameters $\alpha$, $\rho$, and $\sigma$.

Changing the notion of distance from Euclidean to taxicab distance (i.e., an $L^1$ norm) changes the support to functions which are continuous but not smooth.
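As an illustration of this change of distance (an assumption on my part, not a kernel defined in this chapter), one common form replaces the squared Euclidean distance with the taxicab distance $d_1$, giving $\alpha^2 \exp(-d_1/\rho)$; in one dimension this is the Ornstein-Uhlenbeck kernel, whose draws are continuous but nowhere smooth.

```python
import numpy as np

def exp_l1_kernel(x, alpha, rho):
    """Hypothetical helper: covariance built on taxicab (L1) distance,
    alpha^2 * exp(-d1 / rho). In 1-D this is the Ornstein-Uhlenbeck
    kernel; its sample paths are continuous but not differentiable."""
    d1 = np.sum(np.abs(x[:, None, :] - x[None, :, :]), axis=-1)
    return alpha ** 2 * np.exp(-d1 / rho)

x = np.linspace(0.0, 1.0, 6).reshape(-1, 1)
K = exp_l1_kernel(x, alpha=1.0, rho=0.5)
# A small jitter guards against numerical issues in the factorization.
np.linalg.cholesky(K + 1e-9 * np.eye(len(x)))
```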


  1. Gaussian processes can be extended to covariance functions producing positive semi-definite matrices, but Stan does not support inference in such models because the resulting distribution does not have unconstrained support.