1.1 Linear Regression
The simplest linear regression model is the following, with a single predictor and a slope and intercept coefficient, and normally distributed noise. This model can be written using standard regression notation as \[ y_n = \alpha + \beta x_n + \epsilon_n \quad\text{where}\quad \epsilon_n \sim \operatorname{normal}(0,\sigma). \]
This is equivalent to the following sampling involving the residual, \[ y_n - (\alpha + \beta X_n) \sim \operatorname{normal}(0,\sigma), \] and reducing still further, to \[ y_n \sim \operatorname{normal}(\alpha + \beta X_n, \, \sigma). \]
This latter form of the model is coded in Stan as follows.
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
y ~ normal(alpha + beta * x, sigma);
}
There are N
observations and or each observation, \(n \in N\), we have predictor x[n]
and outcome y[n]
. The intercept and slope parameters are alpha
and beta
. The model assumes a normally distributed noise term with scale sigma
. This model has improper priors for the two regression coefficients.
Matrix Notation and Vectorization
The sampling statement in the previous model is vectorized, with
y ~ normal(alpha + beta * x, sigma);
providing the same model as the unvectorized version,
for (n in 1:N)
y[n] ~ normal(alpha + beta * x[n], sigma);
In addition to being more concise, the vectorized form is much faster.1
In general, Stan allows the arguments to distributions such as normal
to be vectors. If any of the other arguments are vectors or arrays, they have to be the same size. If any of the other arguments is a scalar, it is reused for each vector entry. See the vectorization section for more information on vectorization of probability functions.
The other reason this works is that Stan’s arithmetic operators are overloaded to perform matrix arithmetic on matrices. In this case, because x
is of type vector
and beta
of type real
, the expression beta * x
is of type vector
. Because Stan supports vectorization, a regression model with more than one predictor can be written directly using matrix notation.
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
y ~ normal(x * beta + alpha, sigma); // likelihood
}
The constraint lower=0
in the declaration of sigma
constrains the value to be greater than or equal to 0. With no prior in the model block, the effect is an improper prior on non-negative real numbers. Although a more informative prior may be added, improper priors are acceptable as long as they lead to proper posteriors.
In the model above, x
is an \(N \times K\) matrix of predictors and beta
a \(K\)-vector of coefficients, so x * beta
is an \(N\)-vector of predictions, one for each of the \(N\) data items. These predictions line up with the outcomes in the \(N\)-vector y
, so the entire model may be written using matrix arithmetic as shown. It would be possible to include a column of ones in the data matrix x
to remove the alpha
parameter.
The sampling statement in the model above is just a more efficient, vector-based approach to coding the model with a loop, as in the following statistically equivalent model.
model {
for (n in 1:N)
y[n] ~ normal(x[n] * beta, sigma);
}
With Stan’s matrix indexing scheme, x[n]
picks out row n
of the matrix x
; because beta
is a column vector, the product x[n] * beta
is a scalar of type real
.
Intercepts as Inputs
In the model formulation
y ~ normal(x * beta, sigma);
there is no longer an intercept coefficient alpha
. Instead, we have assumed that the first column of the input matrix x
is a column of 1 values. This way, beta[1]
plays the role of the intercept. If the intercept gets a different prior than the slope terms, then it would be clearer to break it out. It is also slightly more efficient in its explicit form with the intercept variable singled out because there’s one fewer multiplications; it should not make that much of a difference to speed, though, so the choice should be based on clarity.
Unlike in Python and R, which are interpreted, Stan is translated to C++ and compiled, so loops and assignment statements are fast. Vectorized code is faster in Stan because (a) the expression tree used to compute derivatives can be simplified, leading to fewer virtual function calls, and (b) computations that would be repeated in the looping version, such as
log(sigma)
in the above model, will be computed once and reused.↩