1.6 Multi-Logit Regression

Multiple outcome forms of logistic regression can be coded directly in Stan. For instance, suppose there are \(K\) possible outcomes for each output variable \(y_n\). Also suppose that there is a \(D\)-dimensional vector \(x_n\) of predictors for \(y_n\). The multi-logit model with \(\mathsf{normal}(0,5)\) priors on the coefficients is coded as follows.

data {
  int K;
  int N;
  int D;
  int y[N];
  matrix[N, D] x;
}
parameters {
  matrix[D, K] beta;
}
model {
  matrix[N, K] x_beta = x * beta;

  to_vector(beta) ~ normal(0, 2);

  for (n in 1:N)
    y[n] ~ categorical_logit(x_beta[n]);
}

The prior on beta is coded in vectorized form. As of Stan 2.18, the categorical-logit distribution is not vectorized for parameter arguments, so the loop is required. The matrix multiplication is pulled out to define a local variable for all of the predictors for efficiency. Like the Bernoulli-logit, the categorical-logit distribution applies softmax internally to convert an arbitrary vector to a simplex,

\[ \mathsf{CategoricalLogit}(y \mid \alpha) \ = \ \mathsf{Categorical}(y \mid \mathsf{softmax}(\alpha)), \]

where

\[ \mathrm{softmax}(u) = \mathrm{exp}(u) / \mathrm{sum}(\mathrm{exp}(u)). \]

The categorical distribution with log-odds (logit) scaled parameters used above is equivalent to writing

    y[n] ~ categorical(softmax(x[n] * beta));

Constraints on Data Declarations

The data block in the above model is defined without constraints on sizes K, N, and D or on the outcome array y. Constraints on data declarations provide error checking at the point data are read (or transformed data are defined), which is before sampling begins. Constraints on data declarations also make the model author’s intentions more explicit, which can help with readability. The above model’s declarations could be tightened to

  int<lower = 2> K;
  int<lower = 0> N;
  int<lower = 1> D;
  int<lower = 1, upper = K> y[N];

These constraints arise because the number of categories, K, must be at least two in order for a categorical model to be useful. The number of data items, N, can be zero, but not negative; unlike R, Stan’s for-loops always move forward, so that a loop extent of 1:N when N is equal to zero ensures the loop’s body will not be executed. The number of predictors, D, must be at least one in order for beta * x[n] to produce an appropriate argument for softmax(). The categorical outcomes y[n] must be between 1 and K in order for the discrete sampling to be well defined.

Constraints on data declarations are optional. Constraints on parameters declared in the parameters block, on the other hand, are not optional—they are required to ensure support for all parameter values satisfying their constraints. Constraints on transformed data, transformed parameters, and generated quantities are also optional.

Identifiability

Because softmax is invariant under adding a constant to each component of its input, the model is typically only identified if there is a suitable prior on the coefficients.

An alternative is to use \((K-1)\)-vectors by fixing one of them to be zero. The partially known parameters section discusses how to mix constants and parameters in a vector. In the multi-logit case, the parameter block would be redefined to use \((K - 1)\)-vectors

parameters {
  matrix[K - 1, D] beta_raw;
}

and then these are transformed to parameters to use in the model. First, a transformed data block is added before the parameters block to define a row vector of zero values,

transformed data {
  row_vector[D] zeros = rep_row_vector(0, D);
}

which can then be appended to beta_row to produce the coefficient matrix beta,

transformed parameters {
  matrix[K, D] beta;
  beta = append_row(beta_raw, zeros);
}

The rep_row_vector(0, D) call creates a row vector of size D with all entries set to zero. The derived matrix beta is then defined to be the result of appending the row-vector zeros as a new row at the end of beta_raw; the row vector zeros is defined as transformed data so that it doesn’t need to be constructed from scratch each time it is used.

This is not the same model as using \(K\)-vectors as parameters, because now the prior only applies to \((K-1)\)-vectors. In practice, this will cause the maximum likelihood solutions to be different and also the posteriors to be slightly different when taking priors centered around zero, as is typical for regression coefficients.