## 1.1 Linear Regression

The simplest linear regression model is the following, with a single predictor and a slope and intercept coefficient, and normally distributed noise. This model can be written using standard regression notation as \[ y_n = \alpha + \beta x_n + \epsilon_n \quad\text{where}\quad \epsilon_n \sim \operatorname{normal}(0,\sigma). \]

This is equivalent to the following sampling involving the residual, \[ y_n - (\alpha + \beta X_n) \sim \operatorname{normal}(0,\sigma), \] and reducing still further, to \[ y_n \sim \operatorname{normal}(\alpha + \beta X_n, \, \sigma). \]

This latter form of the model is coded in Stan as follows.

```
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
y ~ normal(alpha + beta * x, sigma);
}
```

There are `N`

observations and or each observation, \(n \in N\), we have predictor `x[n]`

and outcome `y[n]`

. The intercept and slope parameters are `alpha`

and `beta`

. The model assumes a normally distributed noise term with scale `sigma`

. This model has improper priors for the two regression coefficients.

### Matrix Notation and Vectorization

The sampling statement in the previous model is vectorized, with

`y ~ normal(alpha + beta * x, sigma);`

providing the same model as the unvectorized version,

```
for (n in 1:N)
y[n] ~ normal(alpha + beta * x[n], sigma);
```

In addition to being more concise, the vectorized form is much faster.^{1}

In general, Stan allows the arguments to distributions such as `normal`

to be vectors. If any of the other arguments are vectors or arrays, they have to be the same size. If any of the other arguments is a scalar, it is reused for each vector entry. See the vectorization section for more information on vectorization of probability functions.

The other reason this works is that Stan’s arithmetic operators are overloaded to perform matrix arithmetic on matrices. In this case, because `x`

is of type `vector`

and `beta`

of type `real`

, the expression `beta * x`

is of type `vector`

. Because Stan supports vectorization, a regression model with more than one predictor can be written directly using matrix notation.

```
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
y ~ normal(x * beta + alpha, sigma); // likelihood
}
```

The constraint `lower=0`

in the declaration of `sigma`

constrains the value to be greater than or equal to 0. With no prior in the model block, the effect is an improper prior on non-negative real numbers. Although a more informative prior may be added, improper priors are acceptable as long as they lead to proper posteriors.

In the model above, `x`

is an \(N \times K\) matrix of predictors and `beta`

a \(K\)-vector of coefficients, so `x * beta`

is an \(N\)-vector of predictions, one for each of the \(N\) data items. These predictions line up with the outcomes in the \(N\)-vector `y`

, so the entire model may be written using matrix arithmetic as shown. It would be possible to include a column of ones in the data matrix `x`

to remove the `alpha`

parameter.

The sampling statement in the model above is just a more efficient, vector-based approach to coding the model with a loop, as in the following statistically equivalent model.

```
model {
for (n in 1:N)
y[n] ~ normal(x[n] * beta, sigma);
}
```

With Stan’s matrix indexing scheme, `x[n]`

picks out row `n`

of the matrix `x`

; because `beta`

is a column vector, the product `x[n] * beta`

is a scalar of type `real`

.

#### Intercepts as Inputs

In the model formulation

`y ~ normal(x * beta, sigma);`

there is no longer an intercept coefficient `alpha`

. Instead, we have assumed that the first column of the input matrix `x`

is a column of 1 values. This way, `beta[1]`

plays the role of the intercept. If the intercept gets a different prior than the slope terms, then it would be clearer to break it out. It is also slightly more efficient in its explicit form with the intercept variable singled out because there’s one fewer multiplications; it should not make that much of a difference to speed, though, so the choice should be based on clarity.

Unlike in Python and R, which are interpreted, Stan is translated to C++ and compiled, so loops and assignment statements are fast. Vectorized code is faster in Stan because (a) the expression tree used to compute derivatives can be simplified, leading to fewer virtual function calls, and (b) computations that would be repeated in the looping version, such as

`log(sigma)`

in the above model, will be computed once and reused.↩