Stan User’s Guide

28.5 Multilevel regression and poststratification

With large numbers of demographic features, each cell may have very few items in it with which to estimate regression coefficients. For example, even in a national-level poll of 10,000 respondents, if they are divided by the 50 states, that’s only 200 respondents per state on average. When data sizes are small, parameter estimation can be stabilized and sharpened by providing hierarchical priors. With hierarchical priors, the data determines the amount of partial pooling among the groups. The only drawback is that if the number of groups is small, it can be hard to fit these models without strong hyperpriors.
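To make the sparsity concrete, the arithmetic can be carried one step further. Taking the category counts used in the example model in this section (4 age groups, 2 sexes, 5 income levels, and 50 states) and crossing all four features gives \[ 4 \times 2 \times 5 \times 50 = 2000 \ \textrm{cells}, \qquad \frac{10{,}000}{2000} = 5 \ \textrm{respondents per cell on average}. \] Real polls are not spread evenly across cells, so many cells will have fewer respondents than this average, and some may have none.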

The model introduced in the previous section had likelihood \[ y_n \sim \textrm{bernoulli}(\textrm{logit}^{-1}( \alpha + \beta_{\textrm{age}[n]} + \gamma_{\textrm{sex}[n]} + \delta_{\textrm{income}[n]} + \epsilon_{\textrm{state}[n]} )). \] The overall intercept can be given a broad fixed prior, \[ \alpha \sim \textrm{normal}(0, 5). \] The other regression parameters can be given hierarchical priors, \[\begin{eqnarray*} \beta_{1:4} & \sim & \textrm{normal}(0, \sigma^{\beta}) \\[2pt] \gamma_{1:2} & \sim & \textrm{normal}(0, \sigma^{\gamma}) \\[2pt] \delta_{1:5} & \sim & \textrm{normal}(0, \sigma^{\delta}) \\[2pt] \epsilon_{1:50} & \sim & \textrm{normal}(0, \sigma^{\epsilon}) \end{eqnarray*}\]

The hyperparameters for scale of variation within a group can be given simple standard hyperpriors, \[ \sigma^{\beta}, \sigma^{\gamma}, \sigma^{\delta}, \sigma^{\epsilon} \sim \textrm{normal}(0, 1). \] The scales of these fixed hyperpriors need to be determined on a problem-by-problem basis, though ideally they will be close to standard (mean zero, unit variance).
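The likelihood, priors, and hyperpriors above can be collected into one Stan program. The following is a minimal sketch rather than a definitive implementation. The data names (`N`, `age`, `sex`, `income`, `state`, `y`) and the spelled-out scale names (`sigma_beta` and so on) are assumptions chosen to mirror the math, the `array[N] int` declarations assume a recent Stan release, and the half-normal hyperpriors arise from the `lower=0` bounds.

```stan
data {
  int<lower=0> N;                          // number of respondents
  array[N] int<lower=1, upper=4> age;      // age group of respondent n
  array[N] int<lower=1, upper=2> sex;      // sex category of respondent n
  array[N] int<lower=1, upper=5> income;   // income level of respondent n
  array[N] int<lower=1, upper=50> state;   // state of respondent n
  array[N] int<lower=0, upper=1> y;        // binary response
}
parameters {
  real alpha;                              // overall intercept
  real<lower=0> sigma_beta;                // hierarchical scales
  real<lower=0> sigma_gamma;
  real<lower=0> sigma_delta;
  real<lower=0> sigma_epsilon;
  vector<multiplier=sigma_beta>[4] beta;   // age effects
  vector<multiplier=sigma_gamma>[2] gamma; // sex effects
  vector<multiplier=sigma_delta>[5] delta; // income effects
  vector<multiplier=sigma_epsilon>[50] epsilon;  // state effects
}
model {
  // likelihood on the log odds scale
  y ~ bernoulli_logit(alpha + beta[age] + gamma[sex]
                      + delta[income] + epsilon[state]);
  // broad fixed prior on the intercept
  alpha ~ normal(0, 5);
  // hierarchical priors
  beta ~ normal(0, sigma_beta);
  gamma ~ normal(0, sigma_gamma);
  delta ~ normal(0, sigma_delta);
  epsilon ~ normal(0, sigma_epsilon);
  // half-normal hyperpriors because the scales are constrained positive
  sigma_beta ~ normal(0, 1);
  sigma_gamma ~ normal(0, 1);
  sigma_delta ~ normal(0, 1);
  sigma_epsilon ~ normal(0, 1);
}
```

The `multiplier` declarations give each coefficient vector a non-centered parameterization, which tends to sample better when the data in each group are sparse.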

28.5.1 Dealing with small partitions and non-identifiability

The multilevel structure of the models used for multilevel regression and poststratification consists of a sum of intercepts that vary by demographic feature. This immediately introduces non-identifiability. A constant added to each state coefficient and subtracted from each age coefficient leads to exactly the same likelihood.
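The invariance can be written out directly. For any constant \(c\) (a symbol introduced here for illustration), shifting the age and state coefficients in opposite directions leaves the linear predictor for every respondent unchanged, \[ \alpha + (\beta_{\textrm{age}[n]} - c) + \gamma_{\textrm{sex}[n]} + \delta_{\textrm{income}[n]} + (\epsilon_{\textrm{state}[n]} + c) = \alpha + \beta_{\textrm{age}[n]} + \gamma_{\textrm{sex}[n]} + \delta_{\textrm{income}[n]} + \epsilon_{\textrm{state}[n]}. \] Because the likelihood depends on the parameters only through this sum, it cannot distinguish among values of \(c\). The same argument applies to any pair of the varying intercepts, and to each of them paired with \(\alpha.\)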

This non-identifiability is mitigated only by the (hierarchical) priors. When demographic partitions are small, as they are with several categories in the example, it can be more computationally tractable to enforce a sum-to-zero constraint on the coefficients. Any nonzero mean of the coefficients will then necessarily be absorbed into the intercept, which is why the intercept typically gets a broader prior even with standardized data. With a sum-to-zero constraint, coefficients for binary features will be negations of each other. For example, because there are only two sex categories, \(\gamma_2 = -\gamma_1.\)
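As a worked illustration of how the intercept absorbs the remainder, take the four age coefficients and let \(\bar{\beta} = \frac{1}{4} \sum_{k=1}^{4} \beta_k\) be their mean (notation introduced here, not in the original model). Defining \[ \beta'_k = \beta_k - \bar{\beta}, \qquad \alpha' = \alpha + \bar{\beta} \] gives \(\sum_{k=1}^{4} \beta'_k = 0\) and \(\alpha' + \beta'_k = \alpha + \beta_k\) for every \(k,\) so the likelihood is unchanged. Applying the same recentering to the two sex coefficients yields \[ \gamma'_1 = \frac{\gamma_1 - \gamma_2}{2}, \qquad \gamma'_2 = \frac{\gamma_2 - \gamma_1}{2} = -\gamma'_1. \]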

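A hard sum-to-zero constraint can be implemented by declaring K - 1 free parameters and defining the final coefficient as the negated sum of the others.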

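parameters {
  vector[K - 1] alpha_raw;  // K - 1 free coefficients
  // ...
}
transformed parameters {
  // the final coefficient is the negated sum, so sum(alpha) is zero
  vector<multiplier = sigma_alpha>[K] alpha
    = append_row(alpha_raw, -sum(alpha_raw));
  // ...
}
model {
  alpha ~ normal(0, sigma_alpha);
}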

This prior is hard to interpret in that there are K normal distributions, but only K - 1 free parameters. An alternative is to put the prior only on alpha_raw, but that is also difficult to interpret.
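The alternative of placing the prior only on `alpha_raw` might look like the following sketch. It assumes that `K` is supplied as data and that `sigma_alpha` is a parameter with a half-normal hyperprior. Neither declaration appears in the fragment above, and the likelihood is omitted.

```stan
data {
  int<lower=2> K;                    // number of categories (assumed data)
}
parameters {
  real<lower=0> sigma_alpha;         // hierarchical scale (assumed parameter)
  vector[K - 1] alpha_raw;           // K - 1 free coefficients
}
transformed parameters {
  // the final coefficient is the negated sum, so sum(alpha) is zero
  vector[K] alpha = append_row(alpha_raw, -sum(alpha_raw));
}
model {
  sigma_alpha ~ normal(0, 1);        // half-normal due to the lower bound
  alpha_raw ~ normal(0, sigma_alpha);  // prior on the free coefficients only
}
```

One reason this version is difficult to interpret is that the coefficients are no longer treated symmetrically. Conditional on `sigma_alpha`, each of the first K - 1 coefficients has prior standard deviation \(\sigma_{\alpha},\) whereas the last, being the negated sum of K - 1 independent normal variates, has prior standard deviation \(\sqrt{K - 1} \, \sigma_{\alpha}.\)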

Soft constraints can be more computationally tractable. They are also simpler to implement.

parameters {
  vector<multiplier = sigma_alpha>[K] alpha;  // all K coefficients are free
  // ...
}
model {
  alpha ~ normal(0, sigma_alpha);
  sum(alpha) ~ normal(0, 0.001);  // soft constraint pulling the sum to zero
}

This leaves the regular prior, but adds a second prior that concentrates the sum near zero. The scale of the second prior will need to be established on a problem- and data-set-specific basis so that it doesn’t shrink the estimates beyond the shrinkage of the hierarchical scale parameters.
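Because the scale of the soft constraint is problem specific, one option is to pass it in as data so that it can be varied without recompiling. The sketch below assumes this approach. The name `sum_scale` is hypothetical, `K` and the hyperprior on `sigma_alpha` are assumptions, and the likelihood is omitted.

```stan
data {
  int<lower=2> K;                    // number of categories (assumed data)
  real<lower=0> sum_scale;           // hypothetical: soft-constraint scale,
                                     // must be strictly positive when run
}
parameters {
  real<lower=0> sigma_alpha;         // hierarchical scale
  vector<multiplier=sigma_alpha>[K] alpha;
}
model {
  sigma_alpha ~ normal(0, 1);        // half-normal due to the lower bound
  alpha ~ normal(0, sigma_alpha);    // regular hierarchical prior
  sum(alpha) ~ normal(0, sum_scale); // soft sum-to-zero constraint
}
```

Fitting with a few values of `sum_scale` and comparing the posterior for `alpha` and `sigma_alpha` is one way to check that the soft constraint is not adding shrinkage beyond that of the hierarchical prior.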

Note that in the hierarchical model, when there are only two coefficients, they should have the same absolute value but opposite signs. Any other difference could be combined into the overall intercept \(\alpha.\) Even with a wide prior on the intercept, the hyperprior on \(\sigma^{\gamma}\) may not be strong enough to enforce that, leading to a weak form of non-identifiability in the posterior. Enforcing a (hard or soft) sum-to-zero constraint can help mitigate non-identifiability. Whatever prior is chosen, prior predictive checks can help diagnose problems with it.
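A prior predictive check for the two-category case could be sketched as a Stan program that has only a generated quantities block and is run with the fixed-parameter sampler. This is an illustration under stated assumptions, not a prescribed workflow: it uses the unconstrained prior for the sex coefficients, and the names `z`, `gamma_sum`, and `theta` are hypothetical.

```stan
generated quantities {
  real alpha = normal_rng(0, 5);             // intercept from its fixed prior
  real z = normal_rng(0, 1);
  real sigma_gamma = (z < 0) ? -z : z;       // half-normal(0, 1) draw
  vector[2] gamma;                           // unconstrained sex coefficients
  vector[2] theta;                           // implied response probabilities
  real gamma_sum;                            // zero under a hard constraint
  for (k in 1:2) {
    gamma[k] = sigma_gamma * normal_rng(0, 1);  // normal(0, sigma_gamma) draw
    theta[k] = inv_logit(alpha + gamma[k]);
  }
  gamma_sum = sum(gamma);
}
```

The spread of `gamma_sum` across draws shows how far the unconstrained prior lets the two coefficients drift from being negations of each other, and the draws of `theta` show which response probabilities the prior considers plausible for each category.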

None of this work to manage identifiability in multilevel regressions has anything to do with the poststratification; it’s just required to fit a large multilevel regression with multiple discrete categories. Having multiple intercepts always leads to weak non-identifiability, even with the priors on the intercepts all centered at zero.