Regression Models
Stan supports regression models from simple linear regressions to multilevel generalized linear models.
Linear regression
The simplest linear regression model is the following, with a single predictor and a slope and intercept coefficient, and normally distributed noise. This model can be written using standard regression notation as
This is equivalent to the following sampling involving the residual,
This latter form of the model is coded in Stan as follows.
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}parameters {
real alpha;
real beta;
real<lower=0> sigma;
}model {
y ~ normal(alpha + beta * x, sigma); }
There are N
observations and for each observation, x[n]
and outcome y[n]
. The intercept and slope parameters are alpha
and beta
. The model assumes a normally distributed noise term with scale sigma
. This model has improper priors for the two regression coefficients.
Matrix notation and vectorization
The distribution statement in the previous model is vectorized, with
y ~ normal(alpha + beta * x, sigma);
providing the same model as the unvectorized version,
for (n in 1:N) {
y[n] ~ normal(alpha + beta * x[n], sigma); }
In addition to being more concise, the vectorized form is much faster.1
In general, Stan allows the arguments to distributions such as normal
to be vectors. If any of the other arguments are vectors or arrays, they have to be the same size. If any of the other arguments is a scalar, it is reused for each vector entry.
The other reason this works is that Stan’s arithmetic operators are overloaded to perform matrix arithmetic on matrices. In this case, because x
is of type vector
and beta
of type real
, the expression beta * x
is of type vector
. Because Stan supports vectorization, a regression model with more than one predictor can be written directly using matrix notation.
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}model {
// data model
y ~ normal(x * beta + alpha, sigma); }
The constraint lower=0
in the declaration of sigma
constrains the value to be greater than or equal to 0. With no prior in the model block, the effect is an improper prior on non-negative real numbers. Although a more informative prior may be added, improper priors are acceptable as long as they lead to proper posteriors.
In the model above, x
is an beta
a x * beta
is an y
, so the entire model may be written using matrix arithmetic as shown. It would be possible to include a column of ones in the data matrix x
to remove the alpha
parameter.
The distribution statement in the model above is just a more efficient, vector-based approach to coding the model with a loop, as in the following statistically equivalent model.
model {
for (n in 1:N) {
y[n] ~ normal(x[n] * beta, sigma);
} }
With Stan’s matrix indexing scheme, x[n]
picks out row n
of the matrix x
; because beta
is a column vector, the product x[n] * beta
is a scalar of type real
.
Intercepts as inputs
In the model formulation
y ~ normal(x * beta, sigma);
there is no longer an intercept coefficient alpha
. Instead, we have assumed that the first column of the input matrix x
is a column of 1 values. This way, beta[1]
plays the role of the intercept. If the intercept gets a different prior than the slope terms, then it would be clearer to break it out. It is also slightly more efficient in its explicit form with the intercept variable singled out because there’s one fewer multiplications; it should not make that much of a difference to speed, though, so the choice should be based on clarity.
The QR reparameterization
In the previous example, the linear predictor can be written as
The functions qr_thin_Q
and qr_thin_R
implement the thin QR decomposition, which is to be preferred to the fat QR decomposition that would be obtained by using qr_Q
and qr_R
, as the latter would more easily run out of memory (see the Stan Functions Reference for more information on the qr_thin_Q
and qr_thin_R
functions). In practice, it is best to write
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}transformed data {
matrix[N, K] Q_ast;
matrix[K, K] R_ast;
matrix[K, K] R_ast_inverse;
// thin and scale the QR decomposition
1);
Q_ast = qr_thin_Q(x) * sqrt(N - 1);
R_ast = qr_thin_R(x) / sqrt(N -
R_ast_inverse = inverse(R_ast);
}parameters {
real alpha; // intercept
vector[K] theta; // coefficients on Q_ast
real<lower=0> sigma; // error scale
}model {
// data model
y ~ normal(Q_ast * theta + alpha, sigma);
}generated quantities {
vector[K] beta;
// coefficients on x
beta = R_ast_inverse * theta; }
Since this Stan program generates equivalent predictions for
The columns of
are orthogonal whereas the columns of generally are not. Thus, it is easier for a Markov Chain to move around in -space than in -space.The columns of
have the same scale whereas the columns of generally do not. Thus, a Hamiltonian Monte Carlo algorithm can move around the parameter space with a smaller number of larger stepsSince the covariance matrix for the columns of
is an identity matrix, typically has a reasonable scale if the units of are also reasonable. This also helps HMC move efficiently without compromising numerical accuracy.
Consequently, this QR reparameterization is recommended for linear and generalized linear models in Stan whenever
Priors for coefficients and scales
See our general discussion of priors for tips on priors for parameters in regression models.
Later sections discuss univariate hierarchical priors and multivariate hierarchical priors, as well as priors used to identify models.
However, as described in QR-reparameterization section, if you do not have an informative prior on the location of the regression coefficients, then you are better off reparameterizing your model so that the regression coefficients are a generated quantity. In that case, it usually does not matter much what prior is used on on the reparameterized regression coefficients and almost any weakly informative prior that scales with the outcome will do.
Robust noise models
The standard approach to linear regression is to model the noise term
data {
// ...
real<lower=0> nu;
}// ...
model {
y ~ student_t(nu, alpha + beta * x, sigma); }
The degrees of freedom constant nu
is specified as data.
Logistic and probit regression
For binary outcomes, either of the closely related logistic or probit regression models may be used. These generalized linear models vary only in the link function they use to map linear predictions in
A logistic regression model with one predictor and an intercept is coded as follows.
data {
int<lower=0> N;
vector[N] x;
array[N] int<lower=0, upper=1> y;
}parameters {
real alpha;
real beta;
}model {
y ~ bernoulli_logit(alpha + beta * x); }
The noise parameter is built into the Bernoulli formulation here rather than specified directly.
Logistic regression is a kind of generalized linear model with binary outcomes and the log odds (logit) link function, defined by
The inverse of the link function appears in the model:
The model formulation above uses the logit-parameterized version of the Bernoulli distribution, which is defined by
The formulation is also vectorized in the sense that alpha
and beta
are scalars and x
is a vector, so that alpha + beta * x
is a vector. The vectorized formulation is equivalent to the less efficient version
for (n in 1:N) {
y[n] ~ bernoulli_logit(alpha + beta * x[n]); }
Expanding out the Bernoulli logit, the model is equivalent to the more explicit, but less efficient and less arithmetically stable
for (n in 1:N) {
y[n] ~ bernoulli(inv_logit(alpha + beta * x[n])); }
Other link functions may be used in the same way. For example, probit regression uses the cumulative normal distribution function, which is typically written as
The cumulative standard normal distribution function Phi
. The probit regression model may be coded in Stan by replacing the logistic model’s distribution statement with the following.
y[n] ~ bernoulli(Phi(alpha + beta * x[n]));
A fast approximation to the cumulative standard normal distribution function Phi_approx
.2 The approximate probit regression model may be coded with the following.
y[n] ~ bernoulli(Phi_approx(alpha + beta * x[n]));
Multi-logit regression
Multiple outcome forms of logistic regression can be coded directly in Stan. For instance, suppose there are
data {
int K;
int N;
int D;
array[N] int y;
matrix[N, D] x;
}parameters {
matrix[D, K] beta;
}model {
matrix[N, K] x_beta = x * beta;
0, 5);
to_vector(beta) ~ normal(
for (n in 1:N) {
y[n] ~ categorical_logit(x_beta[n]');
} }
where x_beta[n]'
is the transpose of x_beta[n]
. The prior on beta
is coded in vectorized form. As of Stan 2.18, the categorical-logit distribution is not vectorized for parameter arguments, so the loop is required. The matrix multiplication is pulled out to define a local variable for all of the predictors for efficiency. Like the Bernoulli-logit, the categorical-logit distribution applies softmax internally to convert an arbitrary vector to a simplex,
The categorical distribution with log-odds (logit) scaled parameters used above is equivalent to writing
y[n] ~ categorical(softmax(x[n] * beta));
Constraints on data declarations
The data block in the above model is defined without constraints on sizes K
, N
, and D
or on the outcome array y
. Constraints on data declarations provide error checking at the point data are read (or transformed data are defined), which is before sampling begins. Constraints on data declarations also make the model author’s intentions more explicit, which can help with readability. The above model’s declarations could be tightened to
int<lower=2> K;
int<lower=0> N;
int<lower=1> D;
array[N] int<lower=1, upper=K> y;
These constraints arise because the number of categories, K
, must be at least two in order for a categorical model to be useful. The number of data items, N
, can be zero, but not negative; unlike R, Stan’s for-loops always move forward, so that a loop extent of 1:N
when N
is equal to zero ensures the loop’s body will not be executed. The number of predictors, D
, must be at least one in order for beta * x[n]
to produce an appropriate argument for softmax()
. The categorical outcomes y[n]
must be between 1
and K
in order for the discrete sampling to be well defined.
Constraints on data declarations are optional. Constraints on parameters declared in the parameters
block, on the other hand, are not optional—they are required to ensure support for all parameter values satisfying their constraints. Constraints on transformed data, transformed parameters, and generated quantities are also optional.
Identifiability
Because softmax is invariant under adding a constant to each component of its input, the model is typically only identified if there is a suitable prior on the coefficients.
An alternative is to use
parameters {
matrix[D, K - 1] beta_raw;
}
and then these are transformed to parameters to use in the model. First, a transformed data block is added before the parameters block to define a vector of zero values,
transformed data {
vector[D] zeros = rep_vector(0, D);
}
which can then be appended to beta_raw
to produce the coefficient matrix beta
,
transformed parameters {
matrix[D, K] beta = append_col(beta_raw, zeros);
}
The rep_vector(0, D)
call creates a column vector of size D
with all entries set to zero. The derived matrix beta
is then defined to be the result of appending the vector zeros
as a new column at the end of beta_raw
; the vector zeros
is defined as transformed data so that it doesn’t need to be constructed from scratch each time it is used.
This is not the same model as using
Parameterizing centered vectors
When there are varying effects in a regression, the resulting likelihood is not identified unless further steps are taken. For example, we might have a global intercept
The traditional approach to identifying such a model is to pin the first varing effect to zero, i.e.,
In a Bayesian setting, a proper prior on each of the
An alternative identification strategy that allows a symmetric prior is to enforce a sum-to-zero constraint on the varying effects, i.e.,
A parameter vector constrained to sum to zero may also be used to identify a multi-logit regression parameter vector (see the multi-logit section for details), or may be used for ability or difficulty parameters (but not both) in an IRT model (see the item-response model section for details).
Built-in sum-to-zero vector
As of Stan 2.36, there is a built in sum_to_zero_vector
type, which can be used as follows.
parameters {
sum_to_zero_vector[K] beta;
// ...
}
This produces a vector of size K
such that sum(beta) = 0
. In the unconstrained representation requires only K - 1
values because the last is determined by the first K - 1
.
Placing a prior on beta
in this parameterization, for example,
0, 1); beta ~ normal(
leads to a subtly different posterior than what you would get with the same prior on an unconstrained size-K
vector. As explained below, the variance is reduced.
The sum-to-zero constraint can be implemented naively by setting the last element to the negative sum of the first elements, i.e.,
The transform used in Stan eliminates these correlations by constructing an orthogonal basis and applying it to the zero-sum-constraint; Seyboldt (2024) provides an explanation. The Stan Reference Manual provides the details in the chapter on transforms. Although any orthogonal basis can be used, Stan uses the inverse isometric log transform because it is convenient to describe and the transform simplifies to efficient scalar operations rather than more expensive matrix operations.
Marginal distribution of sum-to-zero components
On the Stan forums, Aaron Goodman provided the following code to produce a prior with standard normal marginals on the components of beta
,
model {
0, inv(sqrt(1 - inv(K))));
beta ~ normal(// ...
}
The scale component can be multiplied by sigma
to produce a normal(0, sigma)
prior marginally.
To generate distributions with marginals other than standard normal, the resulting beta
may be scaled by some factor sigma
and translated to some new location mu
.
Soft centering
Adding a prior such as
Ordered logistic and probit regression
Ordered regression for an outcome
Ordered logistic regression
The ordered logistic model can be coded in Stan using the ordered
data type for the cutpoints and the built-in ordered_logistic
distribution.
data {
int<lower=2> K;
int<lower=0> N;
int<lower=1> D;
array[N] int<lower=1, upper=K> y;
array[N] row_vector[D] x;
}parameters {
vector[D] beta;
ordered[K - 1] c;
}model {
for (n in 1:N) {
y[n] ~ ordered_logistic(x[n] * beta, c);
} }
The vector of cutpoints c
is declared as ordered[K - 1]
, which guarantees that c[k]
is less than c[k + 1]
.
If the cutpoints were assigned independent priors, the constraint effectively truncates the joint prior to support over points that satisfy the ordering constraint. Luckily, Stan does not need to compute the effect of the constraint on the normalizing term because the probability is needed only up to a proportion.
Ordered probit
An ordered probit model could be coded in exactly the same way by swapping the cumulative logistic (inv_logit
) for the cumulative normal (Phi
).
data {
int<lower=2> K;
int<lower=0> N;
int<lower=1> D;
array[N] int<lower=1, upper=K> y;
array[N] row_vector[D] x;
}parameters {
vector[D] beta;
ordered[K - 1] c;
}model {
vector[K] theta;
for (n in 1:N) {
real eta;
eta = x[n] * beta;1] = 1 - Phi(eta - c[1]);
theta[for (k in 2:(K - 1)) {
1]) - Phi(eta - c[k]);
theta[k] = Phi(eta - c[k -
}1]);
theta[K] = Phi(eta - c[K -
y[n] ~ categorical(theta);
} }
The logistic model could also be coded this way by replacing Phi
with inv_logit
, though the built-in encoding based on the softmax transform is more efficient and more numerically stable. A small efficiency gain could be achieved by computing the values Phi(eta - c[k])
once and storing them for re-use.
Hierarchical regression
The simplest multilevel model is a hierarchical model in which the data are grouped into
Suppose each binary outcome
The following model encodes a hierarchical logistic regression model with a hierarchical prior on the regression coefficients.
data {
int<lower=1> D;
int<lower=0> N;
int<lower=1> L;
array[N] int<lower=0, upper=1> y;
array[N] int<lower=1, upper=L> ll;
array[N] row_vector[D] x;
}parameters {
array[D] real mu;
array[D] real<lower=0> sigma;
array[L] vector[D] beta;
}model {
for (d in 1:D) {
0, 100);
mu[d] ~ normal(for (l in 1:L) {
beta[l, d] ~ normal(mu[d], sigma[d]);
}
}for (n in 1:N) {
y[n] ~ bernoulli(inv_logit(x[n] * beta[ll[n]]));
} }
The standard deviation parameter sigma
gets an implicit uniform prior on
Optimizing the model
Where possible, vectorizing distribution statements leads to faster log probability and derivative evaluations. The speed boost is not because loops are eliminated, but because vectorization allows sharing subcomputations in the log probability and gradient calculations and because it reduces the size of the expression tree required for gradient calculations.
The first optimization vectorizes the for-loop over D
as
0, 100);
mu ~ normal(for (l in 1:L) {
beta[l] ~ normal(mu, sigma); }
The declaration of beta
as an array of vectors means that the expression beta[l]
denotes a vector. Although beta
could have been declared as a matrix, an array of vectors (or a two-dimensional array) is more efficient for accessing rows; see the indexing efficiency section for more information on the efficiency tradeoffs among arrays, vectors, and matrices.
This model can be further sped up and at the same time made more arithmetically stable by replacing the application of inverse-logit inside the Bernoulli distribution with the logit-parameterized Bernoulli,3
for (n in 1:N) {
y[n] ~ bernoulli_logit(x[n] * beta[ll[n]]); }
Unlike in R or BUGS, loops, array access and assignments are fast in Stan because they are translated directly to C++. In most cases, the cost of allocating and assigning to a container is more than made up for by the increased efficiency due to vectorizing the log probability and gradient calculations. Thus the following version is faster than the original formulation as a loop over a distribution statement.
{vector[N] x_beta_ll;
for (n in 1:N) {
x_beta_ll[n] = x[n] * beta[ll[n]];
}
y ~ bernoulli_logit(x_beta_ll); }
The brackets introduce a new scope for the local variable x_beta_ll
; alternatively, the variable may be declared at the top of the model block.
In some cases, such as the above, the local variable assignment leads to models that are less readable. The recommended practice in such cases is to first develop and debug the more transparent version of the model and only work on optimizations when the simpler formulation has been debugged.
Hierarchical priors
Priors on priors, also known as “hyperpriors,” should be treated the same way as priors on lower-level parameters in that as much prior information as is available should be brought to bear. Because hyperpriors often apply to only a handful of lower-level parameters, care must be taken to ensure the posterior is both proper and not overly sensitive either statistically or computationally to wide tails in the priors.
Boundary-avoiding priors for MLE in hierarchical models
The fundamental problem with maximum likelihood estimation (MLE) in the hierarchical model setting is that as the hierarchical variance drops and the values cluster around the hierarchical mean, the overall density grows without bound. As an illustration, consider a simple hierarchical linear regression (with fixed prior mean) of
In this case, as
There is obviously no MLE estimate for
Item-response theory models
Item-response theory (IRT) models the situation in which a number of students each answer one or more of a group of test questions. The model is based on parameters for the ability of the students, the difficulty of the questions, and in more articulated models, the discriminativeness of the questions and the probability of guessing correctly; see Gelman and Hill (2007, pps. 314–320) for a textbook introduction to hierarchical IRT models and Curtis (2010) for encodings of a range of IRT models in BUGS.
Data declaration with missingness
The data provided for an IRT model may be declared as follows to account for the fact that not every student is required to answer every question.
data {
int<lower=1> J; // number of students
int<lower=1> K; // number of questions
int<lower=1> N; // number of observations
array[N] int<lower=1, upper=J> jj; // student for observation n
array[N] int<lower=1, upper=K> kk; // question for observation n
array[N] int<lower=0, upper=1> y; // correctness for observation n
}
This declares a total of N
student-question pairs in the data set, where each n
in 1:N
indexes a binary observation y[n]
of the correctness of the answer of student jj[n]
on question kk[n]
.
The prior hyperparameters will be hard coded in the rest of this section for simplicity, though they could be coded as data in Stan for more flexibility.
1PL (Rasch) model
The 1PL item-response model, also known as the Rasch model, has one parameter (1P) for questions and uses the logistic link function (L).
The model parameters are declared as follows.
parameters {
real delta; // mean student ability
array[J] real alpha; // ability of student j - mean ability
array[K] real beta; // difficulty of question k
}
The parameter alpha[J]
is the ability coefficient for student j
and beta[k]
is the difficulty coefficient for question k
. The non-standard parameterization used here also includes an intercept term delta
, which represents the average student’s response to the average question.4
The model itself is as follows.
model {
// informative true prior
alpha ~ std_normal(); // informative true prior
beta ~ std_normal(); 0.75, 1); // informative true prior
delta ~ normal(for (n in 1:N) {
y[n] ~ bernoulli_logit(alpha[jj[n]] - beta[kk[n]] + delta);
} }
This model uses the logit-parameterized Bernoulli distribution, where
The key to understanding it is the term inside the bernoulli_logit
distribution, from which it follows that
The model suffers from additive identifiability issues without the priors. For example, adding a term
For testing purposes, the IRT 1PL model distributed with Stan uses informative priors that match the actual data generation process used to simulate the data in R (the simulation code is supplied in the same directory as the models). This is unrealistic for most practical applications, but allows Stan’s inferences to be validated. A simple sensitivity analysis with fatter priors shows that the posterior is fairly sensitive to the prior even with 400 students and 100 questions and only 25% missingness at random. For real applications, the priors should be fit hierarchically along with the other parameters, as described in the next section.
Multilevel 2PL model
The simple 1PL model described in the previous section is generalized in this section with the addition of a discrimination parameter to model how noisy a question is and by adding multilevel priors for the question difficulty and discrimination parameters. The model parameters are declared as follows.
parameters {
real mu_beta; // mean question difficulty
vector[J] alpha; // ability for j - mean
vector[K] beta; // difficulty for k
vector<lower=0>[K] gamma; // discrimination of k
real<lower=0> sigma_beta; // scale of difficulties
real<lower=0> sigma_gamma; // scale of log discrimination
}
The parameters should be clearer after the model definition.
model {
alpha ~ std_normal();0, sigma_beta);
beta ~ normal(0, sigma_gamma);
gamma ~ lognormal(0, 5);
mu_beta ~ cauchy(0, 5);
sigma_beta ~ cauchy(0, 5);
sigma_gamma ~ cauchy(
y ~ bernoulli_logit(gamma[kk] .* (alpha[jj] - (beta[kk] + mu_beta))); }
The std_normal
function is used here, defined by
The distribution statement is also vectorized using elementwise multiplication; it is equivalent to
for (n in 1:N) {
y[n] ~ bernoulli_logit(gamma[kk[n]]
* (alpha[jj[n]] - (beta[kk[n]] + mu_beta)); }
The 2PL model is similar to the 1PL model, with the additional parameter gamma[k]
modeling how discriminative question k
is. If gamma[k]
is greater than 1, responses are more attenuated with less chance of getting a question right at random. The parameter gamma[k]
is constrained to be positive, which prohibits there being questions that are easier for students of lesser ability; such questions are not unheard of, but they tend to be eliminated from most testing situations where an IRT model would be applied.
The model is parameterized here with student abilities alpha
being given a standard normal prior. This is to identify both the scale and the location of the parameters, both of which would be unidentified otherwise; see the problematic posteriors chapter for further discussion of identifiability. The difficulty and discrimination parameters beta
and gamma
then have varying scales given hierarchically in this model. They could also be given weakly informative non-hierarchical priors, such as
0, 5);
beta ~ normal(0, 2); gamma ~ lognormal(
The point is that the alpha
determines the scale and location and beta
and gamma
are allowed to float.
The beta
parameter is here given a non-centered parameterization, with parameter mu_beta
serving as the mean beta
location. An alternative would’ve been to take:
beta ~ normal(mu_beta, sigma_beta);
and
y[n] ~ bernoulli_logit(gamma[kk[n]] * (alpha[jj[n]] - beta[kk[n]]));
Non-centered parameterizations tend to be more efficient in hierarchical models; see the reparameterization section for more information on non-centered reparameterizations.
The intercept term mu_beta
can’t itself be modeled hierarchically, so it is given a weakly informative sigma_beta
, and sigma_gamma
, are given half-Cauchy priors. As mentioned earlier, the scale and location for alpha
are fixed to ensure identifiability. The truncation in the half-Cauchy prior is implicit; explicit truncation is not necessary because the log probability need only be calculated up to a proportion and the scale variables are constrained to
Priors for identifiability
Location and scale invariance
One application of (hierarchical) priors is to identify the scale and/or location of a group of parameters. For example, in the IRT models discussed in the previous section, there is both a location and scale non-identifiability. With uniform priors, the posteriors will float in terms of both scale and location. See the collinearity section for a simple example of the problems this poses for estimation.
The non-identifiability is resolved by providing a standard normal (i.e.,
Collinearity
Another case in which priors can help provide identifiability is in the case of collinearity in a linear regression. In linear regression, if two predictors are collinear (i.e, one is a linear function of the other), then their coefficients will have a correlation of 1 (or -1) in the posterior. This leads to non-identifiability. By placing normal priors on the coefficients, the maximum likelihood solution of two duplicated predictors (trivially collinear) will be half the value than would be obtained by only including one.
Separability
In a logistic regression, if a predictor is positive in cases of 1 outcomes and negative in cases of 0 outcomes, then the maximum likelihood estimate for the coefficient for that predictor diverges to infinity. This divergence can be controlled by providing a prior for the coefficient, which will “shrink” the estimate back toward zero and thus identify the model in the posterior.
Similar problems arise for sampling with improper flat priors. The sampler will try to draw large values. By providing a prior, the posterior will be concentrated around finite values, leading to well-behaved sampling.
Multivariate priors for hierarchical models
In hierarchical regression models (and other situations), several individual-level variables may be assigned hierarchical priors. For example, a model with multiple varying intercepts and slopes within might assign them a multivariate prior.
As an example, the individuals might be people and the outcome income, with predictors such as education level and age, and the groups might be states or other geographic divisions. The effect of education level and age as well as an intercept might be allowed to vary by state. Furthermore, there might be state-level predictors, such as average state income and unemployment level.
Multivariate regression example
Gelman and Hill (2007, chap. 13, Chapter 17) provide a discussion of a hierarchical model with
Data model
The model is a linear regression with slope and intercept coefficients varying by group, so that
Coefficient prior
Gelman and Hill model the coefficient vectors
Below, we discuss the full model of Gelman and Hill, which uses group-level predictors to model
Hyperpriors
For hierarchical modeling, the group-level mean vector
For the prior on the covariance matrix, Gelman and Hill suggest using a scaled inverse Wishart. That choice was motivated primarily by convenience as it is conjugate to the multivariate likelihood function and thus simplifies Gibbs sampling
In Stan, there is no restriction to conjugacy for multivariate priors, and we in fact recommend a slightly different approach. Like Gelman and Hill, we decompose our prior into a scale and a matrix, but are able to do so in a more natural way based on the actual variable scales and a correlation matrix. Specifically, we define
The components of the scale vector
Our final recommendation is to give the correlation matrix
The LKJ correlation distribution is defined by
The basic behavior of the LKJ correlation distribution is similar to that of a beta distribution. For
For
The LKJ prior may thus be used to control the expected amount of correlation among the parameters
Group-level predictors for prior mean
To complete Gelman and Hill’s model, suppose each group
The group-level coefficients
Coding the model in Stan
The Stan code for the full hierarchical model with multivariate priors on the group-level coefficients and group-level prior means follows its definition.
data {
int<lower=0> N; // num individuals
int<lower=1> K; // num ind predictors
int<lower=1> J; // num groups
int<lower=1> L; // num group predictors
array[N] int<lower=1, upper=J> jj; // group for individual
matrix[N, K] x; // individual predictors
array[J] row_vector[L] u; // group predictors
vector[N] y; // outcomes
}parameters {
corr_matrix[K] Omega; // prior correlation
vector<lower=0>[K] tau; // prior scale
matrix[L, K] gamma; // group coeffs
array[J] vector[K] beta; // indiv coeffs by group
real<lower=0> sigma; // prediction error scale
}model {
0, 2.5);
tau ~ cauchy(2);
Omega ~ lkj_corr(0, 5);
to_vector(gamma) ~ normal(
{array[J] row_vector[K] u_gamma;
for (j in 1:J) {
u_gamma[j] = u[j] * gamma;
}
beta ~ multi_normal(u_gamma, quad_form_diag(Omega, tau));
}for (n in 1:N) {
y[n] ~ normal(x[n] * beta[jj[n]], sigma);
} }
The hyperprior covariance matrix is defined implicitly through the quadratic form in the code because the correlation matrix Omega
and scale vector tau
are more natural to inspect in the output; to output Sigma
, define it as a transformed parameter. The function quad_form_diag
is defined so that quad_form_diag(Sigma, tau)
is equivalent to diag_matrix(tau) * Sigma * diag_matrix(tau)
, where diag_matrix(tau)
returns the matrix with tau
on the diagonal and zeroes off diagonal; the version using quad_form_diag
should be faster. For details on these and other matrix arithmetic operators and functions, see the function reference manual.
Optimization through vectorization
The code in the Stan program above can be sped up dramatically by replacing the the distribution statement inside the for loop:
for (n in 1:N) {
y[n] ~ normal(x[n] * beta[jj[n]], sigma); }
with the vectorized distribution statement:
{vector[N] x_beta_jj;
for (n in 1:N) {
x_beta_jj[n] = x[n] * beta[jj[n]];
}
y ~ normal(x_beta_jj, sigma); }
The outer brackets create a local scope in which to define the variable x_beta_jj
, which is then filled in a loop and used to define a vectorized distribution statement. The reason this is such a big win is that it allows us to take the log of sigma only once and it greatly reduces the size of the resulting expression graph by packing all of the work into a single distribution function.
Although it is tempting to redeclare beta
and include a revised model block distribution statement,
parameters {
matrix[J, K] beta;
// ...
}model {
y ~ normal(rows_dot_product(x, beta[jj]), sigma);// ...
}
this fails because it breaks the vectorization for beta
,6
beta ~ multi_normal(...);
which requires beta
to be an array of vectors. Both vectorizations are important, so the best solution is to just use the loop above, because rows_dot_product
cannot do much optimization in and of itself because there are no shared computations.
The code in the Stan program above also builds up an array of vectors for the outcomes and for the multivariate normal, which provides a major speedup by reducing the number of linear systems that need to be solved and differentiated.
{matrix[K, K] Sigma_beta;
Sigma_beta = quad_form_diag(Omega, tau);for (j in 1:J) {
beta[j] ~ multi_normal((u[j] * gamma)', Sigma_beta);
} }
In this example, the covariance matrix Sigma_beta
is defined as a local variable so as not to have to repeat the quadratic form computation
Optimization through Cholesky factorization
The multivariate normal density and LKJ prior on correlation matrices both require their matrix parameters to be factored. Vectorizing, as in the previous section, ensures this is only done once for each density. An even better solution, both in terms of efficiency and numerical stability, is to parameterize the model directly in terms of Cholesky factors of correlation matrices using the multivariate version of the non-centered parameterization. For the model in the previous section, the program fragment to replace the full matrix prior with an equivalent Cholesky factorized prior is as follows.
data {
matrix[L, J] u; // group predictors transposed
// ...
}parameters {
matrix[K, J] z;
cholesky_factor_corr[K] L_Omega;
matrix[K, L] gamma;
// ...
}transformed parameters {
matrix[K, J] beta;
beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
}model {
to_vector(z) ~ std_normal();2);
L_Omega ~ lkj_corr_cholesky(// ...
}
The data variable u
was originally an array of vectors, which is efficient for access; here it is redeclared as a matrix in order to use it in matrix arithmetic. Moreover, it is transposed, along with gamma
and beta
, to minimize the number of transposition operations. The new parameter L_Omega
is the Cholesky factor of the original correlation matrix Omega
, so that
Omega = L_Omega * L_Omega'
The prior scale vector tau
is unchanged, and furthermore, pre-multiplying the Cholesky factor by the scale produces the Cholesky factor of the final covariance matrix,
Sigma_beta
= quad_form_diag(Omega, tau)
= diag_pre_multiply(tau, L_Omega) * diag_pre_multiply(tau, L_Omega)'
where the diagonal pre-multiply compound operation is defined by
diag_pre_multiply(a, b) = diag_matrix(a) * b
The new variable z
is declared as a matrix, the entries of which are given independent standard normal priors; the to_vector
operation turns the matrix into a vector so that it can be used as a vectorized argument to the univariate normal density. This results in every column of z
being a z
by the Cholesky factor of the covariance matrix and adding the mean (u * gamma)'
produces a beta
distributed as in the original model, where the variance is, letting
Omitting the remaining data declarations, which are the same as before with the exception of u
, the optimized model is as follows.
parameters {
matrix[K, J] z;
cholesky_factor_corr[K] L_Omega;
vector<lower=0, upper=pi() / 2>[K] tau_unif; // prior scale
matrix[K, L] gamma; // group coeffs
real<lower=0> sigma; // prediction error scale
}transformed parameters {
vector<lower=0>[K] tau = 2.5 * tan(tau_unif);
matrix[K, J] beta = gamma * u + diag_pre_multiply(tau, L_Omega) * z;
}model {
vector[N] mu;
for(n in 1:N) {
mu[n] = x[n, ] * beta[, jj[n]];
}
to_vector(z) ~ std_normal();2);
L_Omega ~ lkj_corr_cholesky(0, 5);
to_vector(gamma) ~ normal(
y ~ normal(mu, sigma); }
This model also reparameterizes the prior scale tau
to avoid potential problems with the heavy tails of the Cauchy distribution. The statement tau_unif ~ uniform(0, pi() / 2)
can be omitted from the model block because Stan increments the log posterior for parameters with uniform priors without it.
Prediction, forecasting, and backcasting
Stan models can be used for “predicting” the values of arbitrary model unknowns. When predictions are about the future, they’re called “forecasts;” when they are predictions about the past, as in climate reconstruction or cosmology, they are sometimes called “backcasts” (or “aftcasts” or “hindcasts” or “antecasts,” depending on the author’s feelings about the opposite of “fore”).
Programming predictions
As a simple example, the following linear regression provides the same setup for estimating the coefficients beta
as in our very first example, using y
for the N
observations and x
for the N
predictor vectors. The model parameters and model for observations are exactly the same as before.
To make predictions, we need to be given the number of predictions, N_new
, and their predictor matrix, x_new
. The predictions themselves are modeled as a parameter y_new
. The model statement for the predictions is exactly the same as for the observations, with the new outcome vector y_new
and prediction matrix x_new
.
data {
int<lower=1> K;
int<lower=0> N;
matrix[N, K] x;
vector[N] y;
int<lower=0> N_new;
matrix[N_new, K] x_new;
}parameters {
vector[K] beta;
real<lower=0> sigma;
vector[N_new] y_new; // predictions
}model {
// observed model
y ~ normal(x * beta, sigma);
// prediction model
y_new ~ normal(x_new * beta, sigma); }
Predictions as generated quantities
Where possible, the most efficient way to generate predictions is to use the generated quantities block. This provides proper Monte Carlo (not Markov chain Monte Carlo) inference, which can have a much higher effective sample size per iteration.
// ...data as above...
parameters {
vector[K] beta;
real<lower=0> sigma;
}model {
y ~ normal(x * beta, sigma);
}generated quantities {
vector[N_new] y_new;
for (n in 1:N_new) {
y_new[n] = normal_rng(x_new[n] * beta, sigma);
} }
Now the data are just as before, but the parameter y_new
is now declared as a generated quantity, and the prediction model is removed from the model and replaced by a pseudo-random draw from a normal distribution.
Overflow in generated quantities
It is possible for values to overflow or underflow in generated quantities. The problem is that if the result is NaN, then any constraints placed on the variables will be violated. It is possible to check a value assigned by an RNG and reject it if it overflows, but this is both inefficient and leads to biased posterior estimates. Instead, the conditions causing overflow, such as trying to generate a negative binomial random variate with a mean of
Multivariate outcomes
Most regressions are set up to model univariate observations (be they scalar, boolean, categorical, ordinal, or count). Even multinomial regressions are just repeated categorical regressions. In contrast, this section discusses regression when each observed value is multivariate. To relate multiple outcomes in a regression setting, their error terms are provided with covariance structure.
This section considers two cases, seemingly unrelated regressions for continuous multivariate quantities and multivariate probit regression for boolean multivariate quantities.
Multivariate probit regression
The multivariate probit model generates sequences of boolean variables by applying a step function to the output of a seemingly unrelated regression.
The observations
These are then put through the step function to produce a
Unlike in the seemingly unrelated regressions case, here the covariance matrix
Multivariate probit regression can be coded in Stan using the trick introduced by Albert and Chib (1993), where the underlying continuous value vectors
First, we introduce a sum function for two-dimensional arrays of integers; this is going to help us calculate how many total 1 values there are in
functions {
int sum2d(array[,] int a) {
int s = 0;
for (i in 1:size(a)) {
s += sum(a[i]);
}return s;
} }
The function is trivial, but it’s not a built-in for Stan and it’s easier to understand the rest of the model if it’s pulled into its own function so as not to create a distraction.
The data declaration block is much like for the seemingly unrelated regressions, but the observations y
are now integers constrained to be 0 or 1.
data {
int<lower=1> K;
int<lower=1> D;
int<lower=0> N;
array[N, D] int<lower=0, upper=1> y;
array[N] vector[K] x;
}
After declaring the data, there is a rather involved transformed data block whose sole purpose is to sort the data array y
into positive and negative components, keeping track of indexes so that z
can be easily reassembled in the transformed parameters block.
transformed data {
int<lower=0> N_pos;
array[sum2d(y)] int<lower=1, upper=N> n_pos;
array[size(n_pos)] int<lower=1, upper=D> d_pos;
int<lower=0> N_neg;
array[(N * D) - size(n_pos)] int<lower=1, upper=N> n_neg;
array[size(n_neg)] int<lower=1, upper=D> d_neg;
N_pos = size(n_pos);
N_neg = size(n_neg);
{int i;
int j;
1;
i = 1;
j = for (n in 1:N) {
for (d in 1:D) {
if (y[n, d] == 1) {
n_pos[i] = n;
d_pos[i] = d;1;
i += else {
}
n_neg[j] = n;
d_neg[j] = d;1;
j +=
}
}
}
} }
The variables N_pos
and N_neg
are set to the number of true (1) and number of false (0) observations in y
. The loop then fills in the sequence of indexes for the positive and negative values in four arrays.
The parameters are declared as follows.
parameters {
matrix[D, K] beta;
cholesky_factor_corr[D] L_Omega;
vector<lower=0>[N_pos] z_pos;
vector<upper=0>[N_neg] z_neg;
}
These include the regression coefficients beta
and the Cholesky factor of the correlation matrix, L_Omega
. This time there is no scaling because the covariance matrix has unit scale (i.e., it is a correlation matrix; see above).
The critical part of the parameter declaration is that the latent real value
transformed parameters {
array[N] vector[D] z;
for (n in 1:N_pos) {
z[n_pos[n], d_pos[n]] = z_pos[n];
}for (n in 1:N_neg) {
z[n_neg[n], d_neg[n]] = z_neg[n];
} }
At this point, the model is simple, pretty much recreating the seemingly unrelated regression.
model {
4);
L_Omega ~ lkj_corr_cholesky(0, 5);
to_vector(beta) ~ normal(
{array[N] vector[D] beta_x;
for (n in 1:N) {
beta_x[n] = beta * x[n];
}
z ~ multi_normal_cholesky(beta_x, L_Omega);
} }
This simple form of model is made possible by the Albert and Chib-style constraints on z
.
Finally, the correlation matrix itself can be put back together in the generated quantities block if desired.
generated quantities {
corr_matrix[D] Omega;
Omega = multiply_lower_tri_self_transpose(L_Omega); }
The same could be done for the seemingly unrelated regressions in the previous section.
Applications of pseudorandom number generation
The main application of pseudorandom number generator (PRNGs) is for posterior inference, including prediction and posterior predictive checks. They can also be used for pure data simulation, which is like a posterior predictive check with no conditioning. See the function reference manual for a complete description of the syntax and usage of pseudorandom number generators.
Prediction
Consider predicting unobserved outcomes using linear regression. Given predictors
For this model, the posterior predictive inference for a new outcome
To code the posterior predictive inference in Stan, a standard linear regression is combined with a random number in the generated quantities block.
data {
int<lower=0> N;
vector[N] y;
vector[N] x;
int<lower=0> N_tilde;
vector[N_tilde] x_tilde;
}parameters {
real alpha;
real beta;
real<lower=0> sigma;
}model {
y ~ normal(alpha + beta * x, sigma);
}generated quantities {
vector[N_tilde] y_tilde;
for (n in 1:N_tilde) {
y_tilde[n] = normal_rng(alpha + beta * x_tilde[n], sigma);
} }
Given observed predictors y_tilde
will be drawn according to y_tilde
is the estimate of the outcome that minimizes expected square error (conditioned on the data and model).
Posterior predictive checks
A good way to investigate the fit of a model to the data, a critical step in Bayesian data analysis, is to generate simulated data according to the parameters of the model. This is carried out with exactly the same procedure as before, only the observed data predictors
To code posterior predictive checks in Stan requires only a slight modification of the prediction code to use
generated quantities {
vector[N] y_tilde;
for (n in 1:N) {
y_tilde[n] = normal_rng(alpha + beta * x[n], sigma);
} }
Gelman et al. (2013) recommend choosing several posterior draws
References
Footnotes
Unlike in Python and R, which are interpreted, Stan is translated to C++ and compiled, so loops and assignment statements are fast. Vectorized code is faster in Stan because (a) the expression tree used to compute derivatives can be simplified, leading to fewer virtual function calls, and (b) computations that would be repeated in the looping version, such as
log(sigma)
in the above model, will be computed once and reused.↩︎The
Phi_approx
function is a rescaled version of the inverse logit function, so while the scale is roughly the same , the tails do not match.↩︎The Bernoulli-logit distribution builds in the log link function, taking
↩︎Gelman and Hill (2007) treat the
term equivalently as the location parameter in the distribution of student abilities.↩︎The prior is named for Lewandowski, Kurowicka, and Joe, as it was derived by inverting the random correlation matrix generation strategy of Lewandowski, Kurowicka, and Joe (2009).↩︎
Thanks to Mike Lawrence for pointing this out in the GitHub issue for the manual.↩︎