1.2 The QR Reparameterization

In the previous example, the linear predictor can be written as \(\eta = x \beta\), where \(\eta\) is a \(N\)-vector of predictions, \(x\) is a \(N \times K\) matrix, and \(\beta\) is a \(K\)-vector of coefficients. Presuming \(N \geq K\), we can exploit the fact that any design matrix, \(x\) can be decomposed using the thin QR decomposition into an orthogonal matrix \(Q\) and an upper-triangular matrix \(R\), i.e. \(x = Q R\). See the reference manual for a definition of the QR-decoposition. The functions qr_Q and qr_R implement the fat QR decomposition so here we thin it by including only \(K\) columns in \(Q\) and \(K\) rows in \(R\) (see the reference manual for more information on the qr_Q and qr_R functions). Also, in practice, it is best to write \(x = Q^\ast R^\ast\) where \(Q^\ast = Q * \sqrt{n - 1}\) and \(R^\ast = \frac{1}{\sqrt{n - 1}} R\). Thus, we can equivalently write \(\eta = x \beta = Q R \beta = Q^\ast R^\ast \beta\). If we let \(\theta = R^\ast \beta\), then we have \(\eta = Q^\ast \theta\) and \(\beta = R^{\ast^{-1}} \theta\). In that case, the previous Stan program becomes

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
transformed data {
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  // thin and scale the QR decomposition
  Q_ast = qr_Q(x)[, 1:K] * sqrt(N - 1);
  R_ast = qr_R(x)[1:K, ] / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}
parameters {
  real alpha;           // intercept
  vector[K] theta;      // coefficients on Q_ast
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(Q_ast * theta + alpha, sigma);  // likelihood
}
generated quantities {
  vector[K] beta;
  beta = R_ast_inverse * theta; // coefficients on x
}

Since this Stan program generates equivalent predictions for \(y\) and the same posterior distribution for \(\alpha\), \(\beta\), and \(\sigma\) as the previous Stan program, many wonder why the version with this QR reparameterization performs so much better in practice, often both in terms of wall time and in terms of effective sample size. The reasoning is threefold:

  1. The columns of \(Q^\ast\) are orthogonal whereas the columns of \(x\) generally are not. Thus, it is easier for a Markov Chain to move around in \(\theta\)-space than in \(\beta\)-space.

  2. The columns of \(Q^\ast\) have the same scale whereas the columns of \(x\) generally do not. Thus, a Hamiltonian Monte Carlo algorithm can move around the parameter space with a smaller number of larger steps

  3. Since the covariance matrix for the columns of \(Q\ast\) is an identity matrix, \(\theta\) typically has a reasonable scale if the units of \(y\) are also reasonable. This also helps HMC move efficiently without compromising numerical accuracy.

Consequently, this QR reparameterization is recommended for linear and generalized linear models in Stan whenever \(K > 1\) and you do not have an informative prior on the location of \(\beta\). It can also be worthwhile to subtract the mean from each column of \(x\) before obtaining the QR decomposition, which does not affect the posterior distribution of \(\theta\) or \(\beta\) but does affect \(\alpha\) and allows you to interpret \(\alpha\) as the expectation of \(y\) in a linear model.