1.2 The QR Reparameterization
In the previous example, the linear predictor can be written as \(\eta = x \beta\), where \(\eta\) is an \(N\)-vector of predictions, \(x\) is an \(N \times K\) matrix, and \(\beta\) is a \(K\)-vector of coefficients. Presuming \(N \geq K\), we can exploit the fact that any design matrix \(x\) can be decomposed using the thin QR decomposition into a matrix \(Q\) with orthonormal columns and an upper-triangular matrix \(R\), i.e., \(x = Q R\).
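To fix notation, the thin QR decomposition of a full-column-rank \(N \times K\) matrix \(x\) has the form
\[
x = Q R, \qquad Q \in \mathbb{R}^{N \times K}, \quad Q^\top Q = I_K, \qquad R \in \mathbb{R}^{K \times K} \ \text{upper triangular},
\]
so the columns of \(Q\) are orthonormal and \(R\) is invertible, which is what allows \(\beta\) to be recovered from the reparameterized coefficients below.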
The functions qr_thin_Q and qr_thin_R implement the thin QR decomposition, which is to be preferred to the fat QR decomposition that would be obtained by using qr_Q and qr_R, as the latter would more easily run out of memory (see the Stan Functions Reference for more information on the qr_thin_Q and qr_thin_R functions). In practice, it is best to write \(x = Q^\ast R^\ast\), where \(Q^\ast = Q \sqrt{N - 1}\) and \(R^\ast = \frac{1}{\sqrt{N - 1}} R\). Thus, we can equivalently write \(\eta = x \beta = Q R \beta = Q^\ast R^\ast \beta\). If we let \(\theta = R^\ast \beta\), then we have \(\eta = Q^\ast \theta\) and \(\beta = R^{\ast^{-1}} \theta\). In that case, the previous Stan program becomes
data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
transformed data {
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  // thin and scale the QR decomposition
  Q_ast = qr_thin_Q(x) * sqrt(N - 1);
  R_ast = qr_thin_R(x) / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}
parameters {
  real alpha;           // intercept
  vector[K] theta;      // coefficients on Q_ast
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(Q_ast * theta + alpha, sigma);  // likelihood
}
generated quantities {
  vector[K] beta;
  beta = R_ast_inverse * theta;  // coefficients on x
}
Since this Stan program generates equivalent predictions for \(y\) and the same posterior distribution for \(\alpha\), \(\beta\), and \(\sigma\) as the previous Stan program, many wonder why the version with this QR reparameterization performs so much better in practice, often in terms of both wall time and effective sample size. The reasoning is threefold:
1. The columns of \(Q^\ast\) are orthogonal, whereas the columns of \(x\) generally are not. Thus, it is easier for a Markov chain to move around in \(\theta\)-space than in \(\beta\)-space.
2. The columns of \(Q^\ast\) have the same scale, whereas the columns of \(x\) generally do not. Thus, a Hamiltonian Monte Carlo algorithm can move around the parameter space with a smaller number of larger steps.
3. Since the covariance matrix for the columns of \(Q^\ast\) is an identity matrix (see the short derivation after this list), \(\theta\) typically has a reasonable scale if the units of \(y\) are also reasonable. This also helps HMC move efficiently without compromising numerical accuracy.
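The scaling by \(\sqrt{N - 1}\) is what makes the last two points precise: because the columns of \(Q\) are orthonormal,
\[
Q^{\ast\top} Q^\ast = (N - 1)\, Q^\top Q = (N - 1)\, I_K,
\qquad\text{so}\qquad
\frac{1}{N - 1}\, Q^{\ast\top} Q^\ast = I_K,
\]
and \(\frac{1}{N - 1} Q^{\ast\top} Q^\ast\) is exactly the sample covariance matrix of the columns of \(Q^\ast\) whenever those columns have mean zero, which holds if the columns of \(x\) are centered as suggested below.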
Consequently, this QR reparameterization is recommended for linear and generalized linear models in Stan whenever \(K > 1\) and you do not have an informative prior on the location of \(\beta\). It can also be worthwhile to subtract the mean from each column of \(x\) before obtaining the QR decomposition, which does not affect the posterior distribution of \(\theta\) or \(\beta\) but does affect \(\alpha\) and allows you to interpret \(\alpha\) as the expectation of \(y\) in a linear model.
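A sketch of that centering step (a variation on the transformed data block above; the variable x_centered is introduced here for illustration and is not part of the original program):
transformed data {
  matrix[N, K] x_centered;    // predictors with column means removed
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  for (k in 1:K) {
    x_centered[, k] = x[, k] - mean(x[, k]);
  }
  // thin and scale the QR decomposition of the centered predictors
  Q_ast = qr_thin_Q(x_centered) * sqrt(N - 1);
  R_ast = qr_thin_R(x_centered) / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}
The remaining blocks (parameters, model, and generated quantities) are unchanged; only the interpretation of \(\alpha\) shifts to the expected value of \(y\) at the column means of the predictors.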