1.1 Linear Regression

The simplest linear regression model is the following, with a single predictor and a slope and intercept coefficient, and normally distributed noise. This model can be written using standard regression notation as

\[ y_n = \alpha + \beta x_n + \epsilon_n \ \ \ \mbox{where} \ \ \ \epsilon_n \sim \mathsf{normal}(0,\sigma). \]

This is equivalent to the following sampling involving the residual,

\[ y_n - (\alpha + \beta x_n) \sim \mathsf{normal}(0,\sigma), \]

and reducing still further, to

\[ y_n \sim \mathsf{normal}(\alpha + \beta x_n, \, \sigma). \]

This latter form of the model is coded in Stan as follows.

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}

There are N observations, each with predictor x[n] and outcome y[n]. The intercept and slope parameters are alpha and beta. The model assumes a normally distributed noise term with scale sigma. This model has improper priors for the two regression coefficients.

Matrix Notation and Vectorization

The sampling statement in the previous model is vectorized, with

  y ~ normal(alpha + beta * x, sigma);

providing the same model as the unvectorized version,

  for (n in 1:N)
    y[n] ~ normal(alpha + beta * x[n], sigma);

In addition to being more concise, the vectorized form is much faster.¹
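As a worked expansion under the model's definitions, both forms contribute the same sum of normal log densities:

\[ \sum_{n=1}^N \log \mathsf{normal}(y_n \mid \alpha + \beta x_n, \, \sigma) = -\frac{N}{2} \log(2\pi) - N \log \sigma - \frac{1}{2\sigma^2} \sum_{n=1}^N \left( y_n - (\alpha + \beta x_n) \right)^2. \]

Terms such as \(\log \sigma\) are shared by all \(N\) observations, so a vectorized evaluation can compute them once rather than once per loop iteration. This is likely one contributor to the speed difference, though not necessarily the only one.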

In general, Stan allows the arguments to distributions such as normal to be vectors. If any of the other arguments are vectors or arrays, they have to be the same size. If any of the other arguments is a scalar, it is reused for each vector entry. See the vectorization section for more information on vectorization of probability functions.
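A minimal sketch of the size-matching rule, assuming a hypothetical data vector tau of known per-observation noise scales (tau is not part of the model above and is introduced only for illustration). Here the outcome, the location, and the scale are all vectors of size N, so they are matched element by element.

```stan
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
  vector<lower=0>[N] tau;  // hypothetical: known noise scale for each observation
}
parameters {
  real alpha;
  real beta;
}
model {
  // y, alpha + beta * x, and tau all have size N;
  // entry n of each is used for observation n
  y ~ normal(alpha + beta * x, tau);
}
```

Replacing tau with a scalar such as sigma recovers the earlier model, with the single scale reused for every entry.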

The other reason this works is that Stan’s arithmetic operators are overloaded to perform matrix arithmetic on matrices. In this case, because x is of type vector and beta of type real, the expression beta * x is of type vector. Because Stan supports vectorization, a regression model with more than one predictor can be written directly using matrix notation.

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(x * beta + alpha, sigma);  // likelihood
}

The constraint lower=0 in the declaration of sigma constrains the value to be greater than or equal to 0. With no prior in the model block, the effect is an improper prior on non-negative real numbers. Although a more informative prior may be added, improper priors are acceptable as long as they lead to proper posteriors.
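As a sketch of how more informative priors could be added, the following variant places proper priors on all three parameters. The specific prior families and the values 10 and 1 are arbitrary placeholders chosen for illustration, not recommendations; suitable choices depend on the scale of the data.

```stan
data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  // illustrative priors only; the scales are assumptions
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);     // vectorized over all K coefficients
  sigma ~ exponential(1);   // proper prior on the non-negative scale
  y ~ normal(x * beta + alpha, sigma);  // likelihood
}
```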

In the model above, x is an \(N \times K\) matrix of predictors and beta a \(K\)-vector of coefficients, so x * beta is an \(N\)-vector of predictions, one for each of the \(N\) data items. These predictions line up with the outcomes in the \(N\)-vector y, so the entire model may be written using matrix arithmetic as shown. It would be possible to include a column of ones in the data matrix x and remove the alpha parameter.

The sampling statement in the model above is just a more efficient, vector-based approach to coding the model with a loop, as in the following statistically equivalent model.

model {
  for (n in 1:N)
    y[n] ~ normal(x[n] * beta + alpha, sigma);
}

With Stan’s matrix indexing scheme, x[n] picks out row n of the matrix x; because beta is a column vector, the product x[n] * beta is a scalar of type real.
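Writing \(x_{n,k}\) for the entry in row \(n\) and column \(k\) of x, the product can be expanded elementwise as

\[ x_n \, \beta = \sum_{k=1}^K x_{n,k} \, \beta_k, \]

so the location for observation \(n\) is this weighted sum of its \(K\) predictor values, plus any intercept term included in the model.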

Intercepts as Inputs

In the model formulation

  y ~ normal(x * beta, sigma);

there is no longer an intercept coefficient alpha. Instead, we have assumed that the first column of the input matrix x is a column of 1 values. This way, beta[1] plays the role of the intercept. If the intercept gets a different prior than the slope terms, then it would be clearer to break it out. The explicit form with the intercept variable singled out is also slightly more efficient because there is one fewer multiplication per observation; it should not make much of a difference to speed, though, so the choice should be based on clarity.
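One way to set this up without changing the input data is to prepend the column of ones inside the program. The sketch below assumes the data are supplied as an \(N \times K\) matrix without an intercept column; the name x1 is hypothetical and introduced only for this example.

```stan
data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors (excluding the intercept)
  matrix[N, K] x;   // predictor matrix without a column of ones
  vector[N] y;      // outcome vector
}
transformed data {
  // x1 is x with a leading column of ones, so it has K + 1 columns
  matrix[N, K + 1] x1 = append_col(rep_vector(1, N), x);
}
parameters {
  vector[K + 1] beta;   // beta[1] plays the role of the intercept
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(x1 * beta, sigma);  // no separate alpha term
}
```

Because the augmented matrix is built in the transformed data block, it is constructed once rather than at every evaluation of the model block.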