## 10.1 Gaussian Process Regression

The data for a multivariate Gaussian process regression consists of a
series of \(N\) inputs \(x_1,\ldots,x_N \in \mathbb{R}^D\) paired with outputs
\(y_1,\ldots,y_N \in \mathbb{R}\). The defining feature of Gaussian
processes is that the probability of a finite number of outputs \(y\)
conditioned on their inputs \(x\) is Gaussian:
\[
y \sim \mathsf{multivariate\ normal}(m(x), K(x | \theta)),
\]
where \(m(x)\) is an \(N\)-vector and \(K(x | \theta)\) is an \(N \times N\)
covariance matrix. The mean function \(m : \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{N}\) can be anything, but the covariance function
\(K : \mathbb{R}^{N \times D} \rightarrow \mathbb{R}^{N \times N}\) must produce
a positive-definite matrix for any input \(x\).^{22}

A popular covariance function, which will be used in the implementations later in this chapter, is an exponentiated quadratic function,
\[
K(x | \alpha, \rho, \sigma)_{i, j} = \alpha^2 \exp \left( - \dfrac{1}{2 \rho^2} \sum_{d=1}^D (x_{i,d} - x_{j,d})^2 \right) + \delta_{i, j} \sigma^2,
\]
where \(\alpha\), \(\rho\), and \(\sigma\) are hyperparameters defining the covariance function and where \(\delta_{i, j}\) is the Kronecker delta function with value 1 if \(i = j\) and value 0 otherwise; this test is between the indexes \(i\) and \(j\), not between values \(x_i\) and \(x_j\). This kernel arises as the covariance of the sum of two independent Gaussian processes, \(f_1\) and \(f_2\) (equivalently, a convolution of their distributions), with kernels
\[
K_1(x | \alpha, \rho)_{i, j} = \alpha^2 \exp \left( - \dfrac{1}{2 \rho^2} \sum_{d=1}^D (x_{i,d} - x_{j,d})^2 \right)
\]
and
\[
K_2(x | \sigma)_{i, j} = \delta_{i, j} \sigma^2.
\]
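As a minimal sketch of the formula above, the covariance matrix can be filled in entry by entry. The function name `exp_quad_cov` and the sample inputs are illustrative choices rather than anything defined by Stan; plain Python is used only to make the arithmetic explicit.

```python
import math

def exp_quad_cov(x, alpha, rho, sigma):
    """Return K(x | alpha, rho, sigma) as a list of N rows for N inputs in R^D."""
    n = len(x)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between input vectors x_i and x_j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(x[i], x[j]))
            K[i][j] = alpha ** 2 * math.exp(-sq_dist / (2 * rho ** 2))
            if i == j:  # Kronecker delta compares indexes, not input values
                K[i][j] += sigma ** 2
    return K

# The second and third inputs are identical but have different indexes.
x = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
K = exp_quad_cov(x, alpha=2.0, rho=1.0, sigma=0.5)
```

In this example the entry for the two identical inputs, `K[1][2]`, equals \(\alpha^2\) with no \(\sigma^2\) term, while every diagonal entry equals \(\alpha^2 + \sigma^2\).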

The addition of \(\sigma^2\) on the diagonal is important to ensure the positive definiteness of the resulting matrix in the case of two identical inputs \(x_i = x_j\). Without it, identical inputs produce identical rows in \(K_1\), which makes the matrix singular. In statistical terms, \(\sigma\) is the scale of the noise term in the regression.
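The effect of the diagonal term can be checked numerically. The sketch below builds the \(2 \times 2\) covariance matrix for two identical inputs and attempts a Cholesky factorization, which succeeds for a symmetric matrix only when it is positive definite. The helper names `cov_identical_pair` and `is_positive_definite` are hypothetical, introduced here for illustration.

```python
import math

def cov_identical_pair(alpha, sigma):
    """K for two identical inputs: the exponent is 0, so off-diagonals equal alpha^2."""
    a2, s2 = alpha ** 2, sigma ** 2
    return [[a2 + s2, a2], [a2, a2 + s2]]

def is_positive_definite(A):
    """Attempt a Cholesky factorization of symmetric A; fail on a non-positive pivot."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

# sigma = 0 leaves two identical rows, so the factorization hits a zero pivot.
without_noise = is_positive_definite(cov_identical_pair(alpha=1.0, sigma=0.0))
# A small positive sigma restores positive definiteness.
with_noise = is_positive_definite(cov_identical_pair(alpha=1.0, sigma=0.1))
```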

The hyperparameter \(\rho\) is the *length-scale*, and corresponds to the
frequency of the functions represented by the Gaussian process prior with
respect to the domain. Values of \(\rho\) closer to zero lead the GP to represent
high-frequency functions, whereas larger values of \(\rho\) lead to low-frequency
functions. The hyperparameter \(\alpha\) is the *marginal standard
deviation*. It controls the magnitude of the range of the function represented
by the GP. If you were to take the standard deviation of many draws from the GP
\(f_1\) prior at a single input \(x\), conditional on one value of \(\alpha\), you
would recover \(\alpha\).
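Both hyperparameters can be illustrated with a few lines of arithmetic. The sketch below computes the prior correlation under \(K_1\) between outputs whose inputs are one unit apart for a small and a large \(\rho\), then estimates the marginal standard deviation at a single input by simulation. The function name, the seed, the number of draws, and the zero prior mean are assumptions made for this example.

```python
import math
import random
import statistics

def unit_distance_correlation(rho):
    """Correlation under K_1 between outputs whose inputs are one unit apart."""
    return math.exp(-1.0 / (2.0 * rho ** 2))

# Small rho: nearby outputs are nearly uncorrelated, so functions can vary quickly.
short_scale = unit_distance_correlation(0.1)
# Large rho: nearby outputs are almost perfectly correlated, so functions vary slowly.
long_scale = unit_distance_correlation(10.0)

# At a single input, the f_1 prior is normal with variance K_1[i][i] = alpha^2.
# A zero mean is assumed here; the standard deviation does not depend on the mean.
alpha = 2.0
rng = random.Random(0)  # fixed seed for reproducibility
draws = [rng.gauss(0.0, alpha) for _ in range(20000)]
empirical_sd = statistics.pstdev(draws)  # close to alpha, up to Monte Carlo error
```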

The only term in the exponentiated quadratic covariance function involving the inputs \(x_i\) and \(x_j\) is their vector difference, \(x_i - x_j\). This produces a process with stationary covariance in the sense that if every input vector \(x_i\) is translated by the same vector \(\epsilon\) to \(x_i + \epsilon\), the covariance between any pair of outputs is unchanged, because \(K(x | \theta) = K(x + \epsilon| \theta)\).
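Stationarity can be verified directly by shifting every input by the same vector and comparing the two covariance matrices. This is a sketch with arbitrary example values; `exp_quad_cov` is an illustrative helper, not a Stan function.

```python
import math

def exp_quad_cov(x, alpha, rho, sigma):
    """Exponentiated quadratic covariance matrix for a list of input vectors."""
    def entry(i, j):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x[i], x[j]))
        noise = sigma ** 2 if i == j else 0.0
        return alpha ** 2 * math.exp(-sq_dist / (2 * rho ** 2)) + noise
    n = len(x)
    return [[entry(i, j) for j in range(n)] for i in range(n)]

x = [[0.3, -1.0], [1.7, 0.5], [2.0, 2.0]]
eps = [5.25, -3.0]  # the same shift is applied to every input vector
x_shifted = [[a + e for a, e in zip(row, eps)] for row in x]

K = exp_quad_cov(x, alpha=1.5, rho=0.8, sigma=0.2)
K_shifted = exp_quad_cov(x_shifted, alpha=1.5, rho=0.8, sigma=0.2)

# Entries agree up to floating-point rounding in the shifted differences.
max_abs_diff = max(
    abs(K[i][j] - K_shifted[i][j]) for i in range(3) for j in range(3)
)
```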

The summation involved is just the squared Euclidean distance between \(x_i\) and \(x_j\) (i.e., the squared \(L_2\) norm of their difference, \(x_i - x_j\)). This results in support for smooth functions in the process. The amount of variation in the function is controlled by the free hyperparameters \(\alpha\), \(\rho\), and \(\sigma\).

Changing the notion of distance from Euclidean to taxicab distance (i.e., an \(L_1\) norm) changes the support to functions which are continuous but not smooth.

^{22} Gaussian processes can be extended to covariance functions producing positive semi-definite matrices, but Stan does not support inference in the resulting models because the resulting distribution does not have unconstrained support.