1.6 Multi-Logit Regression
Multiple outcome forms of logistic regression can be coded directly in Stan. For instance, suppose there are \(K\) possible outcomes for each output variable \(y_n\). Also suppose that there is a \(D\)-dimensional vector \(x_n\) of predictors for \(y_n\). The multi-logit model with \(\textsf{normal}(0,5)\) priors on the coefficients is coded as follows.
data {
int K;
int N;
int D;
int y[N];
matrix[N, D] x;
}
parameters {
matrix[D, K] beta;
}
model {
matrix[N, K] x_beta = x * beta;
to_vector(beta) ~ normal(0, 5);
for (n in 1:N)
y[n] ~ categorical_logit(x_beta[n]');
}
The prior on beta
is coded in vectorized form. As of Stan 2.18, the
categorical-logit distribution is not vectorized for parameter
arguments, so the loop is required. The matrix multiplication is
pulled out to define a local variable for all of the predictors for
efficiency. Like the Bernoulli-logit, the categorical-logit
distribution applies softmax internally to convert an arbitrary vector
to a simplex,
\[
\texttt{categorical}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right)
=
\texttt{categorical}\left(y \mid \texttt{softmax}(\alpha)\right),
\]
where
\[
\texttt{softmax}(u) = \exp(u) / \operatorname{sum}\left(\exp(u)\right).
\]
The categorical distribution with log-odds (logit) scaled parameters used above is equivalent to writing
y[n] ~ categorical(softmax(x[n] * beta));
Constraints on Data Declarations
The data block in the above model is defined without constraints on
sizes K
, N
, and D
or on the outcome array
y
. Constraints on data declarations provide error checking at
the point data are read (or transformed data are defined), which is
before sampling begins. Constraints on data declarations also make
the model author’s intentions more explicit, which can help with
readability. The above model’s declarations could be tightened to
int<lower = 2> K;
int<lower = 0> N;
int<lower = 1> D;
int<lower = 1, upper = K> y[N];
These constraints arise because the number of categories, K
,
must be at least two in order for a categorical model to be useful.
The number of data items, N
, can be zero, but not negative;
unlike R, Stan’s for-loops always move forward, so that a loop extent
of 1:N
when N
is equal to zero ensures the loop’s body
will not be executed. The number of predictors, D
, must be at
least one in order for beta * x[n]
to produce an
appropriate argument for softmax()
. The categorical outcomes
y[n]
must be between 1
and K
in order for the
discrete sampling to be well defined.
Constraints on data declarations are optional. Constraints on
parameters declared in the parameters
block, on the other hand,
are not optional—they are required to ensure support for all
parameter values satisfying their constraints. Constraints on
transformed data, transformed parameters, and generated quantities are
also optional.
Identifiability
Because softmax is invariant under adding a constant to each component of its input, the model is typically only identified if there is a suitable prior on the coefficients.
An alternative is to use \((K-1)\)-vectors by fixing one of them to be zero. The partially known parameters section discusses how to mix constants and parameters in a vector. In the multi-logit case, the parameter block would be redefined to use \((K - 1)\)-vectors
parameters {
matrix[K - 1, D] beta_raw;
}
and then these are transformed to parameters to use in the model. First, a transformed data block is added before the parameters block to define a row vector of zero values,
transformed data {
row_vector[D] zeros = rep_row_vector(0, D);
}
which can then be appended to beta_row
to produce the
coefficient matrix beta
,
transformed parameters {
matrix[K, D] beta;
beta = append_row(beta_raw, zeros);
}
The rep_row_vector(0, D)
call creates a row vector of size D
with
all entries set to zero. The derived matrix beta
is then defined to
be the result of appending the row-vector zeros
as a new row at the
end of beta_raw
; the row vector zeros
is defined as transformed
data so that it doesn’t need to be constructed from scratch each time
it is used.
This is not the same model as using \(K\)-vectors as parameters, because now the prior only applies to \((K-1)\)-vectors. In practice, this will cause the maximum likelihood solutions to be different and also the posteriors to be slightly different when taking priors centered around zero, as is typical for regression coefficients.