1.1 Linear Regression
The simplest linear regression model is the following, with a single predictor and a slope and intercept coefficient, and normally distributed noise. This model can be written using standard regression notation as
\[ y_n = \alpha + \beta x_n + \epsilon_n \ \ \ \mbox{where} \ \ \ \epsilon_n \sim \mathsf{normal}(0,\sigma). \] This is equivalent to the following sampling involving the residual, \[ y_n - (\alpha + \beta X_n) \sim \mathsf{normal}(0,\sigma), \] and reducing still further, to \[ y_n \sim \mathsf{normal}(\alpha + \beta X_n, \, \sigma). \]
This latter form of the model is coded in Stan as follows.
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
y ~ normal(alpha + beta * x, sigma);
}
There are N
observations, each with predictor x[n]
and
outcome y[n]
. The intercept and slope parameters are
alpha
and beta
. The model assumes a normally
distributed noise term with scale sigma
. This model has
improper priors for the two regression coefficients.
Matrix Notation and Vectorization
The sampling statement in the previous model is vectorized, with
y ~ normal(alpha + beta * x, sigma);
providing the same model as the unvectorized version,
for (n in 1:N)
y[n] ~ normal(alpha + beta * x[n], sigma);
In addition to being more concise, the vectorized form is much faster.1
In general, Stan allows the arguments to distributions such as
normal
to be vectors. If any of the other arguments are vectors or
arrays, they have to be the same size. If any of the other arguments
is a scalar, it is reused for each vector entry. See the
vectorization section for more information on
vectorization of probability functions.
The other reason this works is that Stan’s arithmetic operators are
overloaded to perform matrix arithmetic on matrices. In this case,
because x
is of type vector
and beta
of type
real
, the expression beta * x
is of type vector
.
Because Stan supports vectorization, a regression model with more than
one predictor can be written directly using matrix notation.
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
y ~ normal(x * beta + alpha, sigma); // likelihood
}
The constraint lower=0
in the declaration of sigma
constrains the value to be greater than or equal to 0. With no prior
in the model block, the effect is an improper prior on non-negative
real numbers. Although a more informative prior may be added, improper
priors are acceptable as long as they lead to proper posteriors.
In the model above, x
is an \(N \times K\) matrix of predictors
and beta
a \(K\)-vector of coefficients, so x * beta
is an
\(N\)-vector of predictions, one for each of the \(N\) data items. These
predictions line up with the outcomes in the \(N\)-vector y
, so
the entire model may be written using matrix arithmetic as shown. It
would be possible to include a column of ones the data matrix x
and
remove the alpha
parameter.
The sampling statement in the model above is just a more efficient, vector-based approach to coding the model with a loop, as in the following statistically equivalent model.
model {
for (n in 1:N)
y[n] ~ normal(x[n] * beta, sigma);
}
With Stan’s matrix indexing scheme, x[n]
picks out row n
of the matrix x
; because beta
is a column vector,
the product x[n] * beta
is a scalar of type real
.
Intercepts as Inputs
In the model formulation
y ~ normal(x * beta, sigma);
there is no longer an intercept coefficient alpha
. Instead, we
have assumed that the first column of the input matrix x
is a
column of 1 values. This way, beta[1]
plays the role of the
intercept. If the intercept gets a different prior than the slope
terms, then it would be clearer to break it out. It is also slightly
more efficient in its explicit form with the intercept variable
singled out because there’s one fewer multiplications; it should not
make that much of a difference to speed, though, so the choice should
be based on clarity.
Unlike in Python and R, which are interpreted, Stan is translated to C++ and compiled, so loops and assignment statements are fast. Vectorized code is faster in Stan because (a) the expression tree used to compute derivatives can be simplified, leading to fewer virtual function calls, and (b) computations that would be repeated in the looping version, such as
log(sigma)
in the above model, will be computed once and reused.↩︎