10.1 Gaussian Process Regression

This is an old version, view current version.

The data for a multivariate Gaussian process regression consists of a series of $N$ inputs $x_1,\ldots,x_N \in \mathbb{R}^D$ paired with outputs $y_1,\ldots,y_N \in \mathbb{R}$ . The defining feature of Gaussian processes is that the probability of a finite number of outputs $y$ conditioned on their inputs $x$ is Gaussian: $y \sim \mathsf{multivariate\ normal}(m(x), K(x | \theta)),$ where $m(x)$ is an $N$ -vector and $K(x | \theta)$ is an $N \times N$ covariance matrix. The mean function $m : \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{N}$ can be anything, but the covariance function $K : \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{N \times N}$ must produce a positive-definite matrix for any input $x$ .²²

A popular covariance function, which will be used in the implementations later in this chapter, is an exponentiated quadratic function, $K(x | \alpha, \rho, \sigma)_{i, j} = \alpha^2 \exp \left( - \dfrac{1}{2 \rho^2} \sum_{d=1}^D (x_{i,d} - x_{j,d})^2 \right) + \delta_{i, j} \sigma^2,$ where $\alpha$ , $\rho$ , and $\sigma$ are hyperparameters defining the covariance function and where $\delta_{i, j}$ is the Kronecker delta function with value 1 if $i = j$ and value 0 otherwise; this test is between the indexes $i$ and $j$ , not between values $x_i$ and $x_j$ . This kernel is obtained through a convolution of two independent Gaussian processes, $f_1$ and $f_2$ , with kernels $K_1(x | \alpha, \rho)_{i, j} = \alpha^2 \exp \left( - \dfrac{1}{2 \rho^2} \sum_{d=1}^D (x_{i,d} - x_{j,d})^2 \right)$ and $K_2(x | \sigma)_{i, j} = \delta_{i, j} \sigma^2,$

The addition of $\sigma^2$ on the diagonal is important to ensure the positive definiteness of the resulting matrix in the case of two identical inputs $x_i = x_j$ . In statistical terms, $\sigma$ is the scale of the noise term in the regression.

The hyperparameter $\rho$ is the length-scale, and corresponds to the frequency of the functions represented by the Gaussian process prior with respect to the domain. Values of $\rho$ closer to zero lead the GP to represent high-frequency functions, whereas larger values of $\rho$ lead to low-frequency functions. The hyperparameter $\alpha$ is the marginal standard deviation. It controls the magnitude of the range of the function represented by the GP. If you were to take the standard deviation of many draws from the GP $f_1$ prior at a single input $x$ conditional on one value of $\alpha$ one would recover $\alpha$ .

The only term in the squared exponential covariance function involving the inputs $x_i$ and $x_j$ is their vector difference, $x_i - x_j$ . This produces a process with stationary covariance in the sense that if an input vector $x$ is translated by a vector $\epsilon$ to $x + \epsilon$ , the covariance at any pair of outputs is unchanged, because $K(x | \theta) = K(x + \epsilon| \theta)$ .

The summation involved is just the squared Euclidean distance between $x_i$ and $x_j$ (i.e., the $L_2$ norm of their difference, $x_i - x_j$ ). This results in support for smooth functions in the process. The amount of variation in the function is controlled by the free hyperparameters $\alpha$ , $\rho$ , and $\sigma$ .

Changing the notion of distance from Euclidean to taxicab distance (i.e., an $L_1$ norm) changes the support to functions which are continuous but not smooth.

Gaussian processes can be extended to covariance functions producing positive semi-definite matrices, but Stan does not support inference in the resulting models because the resulting distribution does not have unconstrained support.↩