Problematic Posteriors
Mathematically speaking, with a proper posterior, one can do Bayesian inference and that’s that. There is not even a need to require a finite variance or even a finite mean—all that’s needed is a finite integral. Nevertheless, modeling is a tricky business and even experienced modelers sometimes code models that lead to improper posteriors. Furthermore, some posteriors are mathematically sound, but ill-behaved in practice. This chapter discusses issues in models that create problematic posterior inferences, either in general for Bayesian inference or in practice for Stan.
Collinearity of predictors in regressions
This section discusses problems related to the classical notion of identifiability, which lead to ridges in the posterior density and wreak havoc with both sampling and inference.
Examples of collinearity
Redundant intercepts
The first example of collinearity is an artificial example involving redundant intercept parameters.1
Suppose there are observations $y_n$ for $n \in 1{:}N$, two intercept parameters $\lambda_1$ and $\lambda_2$, a scale parameter $\sigma > 0$, and the data model
$$
y_n \sim \textsf{normal}(\lambda_1 + \lambda_2, \sigma).
$$
For any constant $q$, the sampling density for $y$ does not change if $q$ is added to $\lambda_1$ and subtracted from $\lambda_2$; that is,
$$
p(y \mid \lambda_1, \lambda_2, \sigma) = p(y \mid \lambda_1 + q, \lambda_2 - q, \sigma).
$$
The consequence is that an improper uniform prior $p(\lambda_1, \lambda_2, \sigma) \propto 1$ leads to an improper posterior.
The marginal posterior $p(\lambda_1, \lambda_2 \mid y)$ is improper: the density is constant along the direction in which $\lambda_1 + \lambda_2$ does not change, so it cannot be normalized no matter how much data is observed.
The impropriety shows up visually as a ridge in the posterior density, as illustrated in the left-hand plot. The ridge for this model is along the line where $\lambda_1 + \lambda_2$ is constant.
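To make the impropriety concrete, here is a short derivation (an added worked step, not quoted from the original), writing the likelihood as a function of the sum $\lambda_1 + \lambda_2$ alone:
$$
p(y \mid \lambda_1, \lambda_2, \sigma)
  = \prod_{n=1}^{N} \textsf{normal}(y_n \mid \lambda_1 + \lambda_2, \sigma)
  = g(\lambda_1 + \lambda_2, \sigma),
$$
so that for any fixed $\sigma$,
$$
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(\lambda_1 + \lambda_2, \sigma) \, d\lambda_1 \, d\lambda_2
  = \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} g(s, \sigma) \, ds \right) d\lambda_2
  = \infty,
$$
because the inner integral (after substituting $s = \lambda_1 + \lambda_2$ for fixed $\lambda_2$) is a finite positive constant, which is then integrated over the unbounded range of $\lambda_2$. Under the flat prior, the posterior is proportional to this likelihood and therefore cannot be normalized.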
Contrast this model with a simple regression with a single intercept parameter $\mu$ and data model $y_n \sim \textsf{normal}(\mu, \sigma)$. Even with an improper prior, the posterior is proper as long as there are at least two distinct data points.2
Ability and difficulty in IRT models
Consider an item-response theory (IRT) model for students $j \in 1{:}J$ with abilities $\alpha_j$ and test questions $i \in 1{:}I$ with difficulties $\beta_i$. The observed data are an $I \times J$ array with entries $y_{i, j} \in \{0, 1\}$, coded such that $y_{i, j} = 1$ indicates student $j$ answered question $i$ correctly. The data model is
$$
y_{i, j} \sim \textsf{Bernoulli}\big(\operatorname{logit}^{-1}(\alpha_j - \beta_i)\big).
$$
For any constant $c$, the probability of $y$ is unchanged if $c$ is added to every ability $\alpha_j$ and to every difficulty $\beta_i$, because the data model depends only on the differences $\alpha_j - \beta_i$.
This leads to a multivariate version of the ridge displayed by the regression with two intercepts discussed above.
General collinear regression predictors
The general form of the collinearity problem arises when predictors for a regression are collinear. For example, consider a linear regression data model
$$
y_n \sim \textsf{normal}(x_n \beta, \sigma)
$$
for an $N \times K$ predictor matrix $x$, a $K$-vector of coefficients $\beta$, and a scale $\sigma > 0$.
Now suppose that column $k$ of the predictor matrix is a multiple of column $k'$, i.e., there is some constant $c$ such that $x_{n, k} = c \, x_{n, k'}$ for all $n$. In this case, the coefficients $\beta_k$ and $\beta_{k'}$ can covary without changing the predictions: adding $d$ to $\beta_k$ while subtracting $c \, d$ from $\beta_{k'}$ leaves every prediction $x_n \beta$, and hence the likelihood, unchanged.
Even if columns of the predictor matrix are not exactly collinear as discussed above, they cause similar problems for inference if they are nearly collinear.
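As a quick check of the exact-collinearity claim (an added worked step, not part of the original text), write out the contribution of the two columns to the linear predictor:
$$
\beta_k x_{n, k} + \beta_{k'} x_{n, k'}
  = \beta_k \, c \, x_{n, k'} + \beta_{k'} x_{n, k'}
  = (c \, \beta_k + \beta_{k'}) \, x_{n, k'},
$$
so any pair $(\beta_k, \beta_{k'})$ with the same value of $c \, \beta_k + \beta_{k'}$ produces the same linear predictor $x_n \beta$; in particular, replacing $(\beta_k, \beta_{k'})$ with $(\beta_k + d, \; \beta_{k'} - c\,d)$ changes nothing in the likelihood.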
Multiplicative issues with discrimination in IRT
Consider adding a discrimination parameter $\delta_i$ for each question to the IRT model, with data model
$$
y_{i, j} \sim \textsf{Bernoulli}\big(\operatorname{logit}^{-1}(\delta_i (\alpha_j - \beta_i))\big).
$$
For any constant $c \neq 0$, multiplying every $\delta_i$ by $c$ and dividing every $\alpha_j$ and $\beta_i$ by $c$ produces the same likelihood, so the discrimination parameters introduce a multiplicative invariance on top of the additive one discussed above.
Softmax with $K$ vs. $K - 1$ parameters
In order to parameterize a $K$-simplex (i.e., a $K$-vector with non-negative values that sum to one), only $K - 1$ parameters are necessary, because the $K$th component is determined by the others: if $\theta$ is a $K$-simplex, then
$$
\theta_K = 1 - \sum_{k=1}^{K-1} \theta_k.
$$
The softmax function maps a $K$-vector $\alpha$ of linear predictors to a $K$-simplex $\theta = \operatorname{softmax}(\alpha)$ by defining
$$
\operatorname{softmax}(\alpha)_k = \frac{\exp(\alpha_k)}{\sum_{k'=1}^{K} \exp(\alpha_{k'})}.
$$
The softmax function is many-to-one, which leads to a lack of identifiability of the unconstrained parameters $\alpha$: adding the same constant to every component of $\alpha$ produces exactly the same simplex $\theta$.
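The invariance is easy to verify directly (an added worked step): for any constant $c$,
$$
\operatorname{softmax}(\alpha + c\,\mathbf{1})_k
  = \frac{\exp(\alpha_k + c)}{\sum_{k'=1}^{K} \exp(\alpha_{k'} + c)}
  = \frac{\exp(c) \exp(\alpha_k)}{\exp(c) \sum_{k'=1}^{K} \exp(\alpha_{k'})}
  = \operatorname{softmax}(\alpha)_k,
$$
so an entire line of unconstrained vectors maps to the same simplex.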
Mitigating the invariances
All of the examples discussed in the previous section allow translation or scaling of parameters while leaving the data probability density invariant. These problems can be mitigated in several ways.
Removing redundant parameters or predictors
In the case of the multiple intercepts, $\lambda_1$ and $\lambda_2$, the simplest solution is to remove the redundant intercept, resulting in a model with a single intercept parameter $\mu$ and data model $y_n \sim \textsf{normal}(\mu, \sigma)$. The same solution works for collinear regression predictors: remove the redundant column from the predictor matrix.
Pinning parameters
The IRT model without a discrimination parameter can be fixed by pinning one of its parameters to a fixed value, typically 0. For example, the first student ability $\alpha_1$ can be fixed to 0. All other student abilities are then interpreted relative to student 1, and the question difficulties are likewise interpretable relative to student 1’s ability.
This solution is not sufficient to deal with the multiplicative invariance introduced by the question discrimination parameters $\delta_i$. To solve that problem, one of the discrimination parameters, say $\delta_1$, must also be pinned to a constant, typically 1. With one ability and one discrimination pinned, the likelihood is identified.
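One way to implement the pinning in Stan is to declare only the free parameters and prepend the pinned values in the transformed parameters block. The following is a minimal sketch for the IRT data layout described above; the names alpha_free and delta_free, the positivity constraint on the remaining discriminations, and the absence of priors are illustrative choices, not from the original text.
data {
  int<lower=1> I;                        // number of questions
  int<lower=1> J;                        // number of students
  array[I, J] int<lower=0, upper=1> y;   // y[i, j] = 1 if student j answered question i correctly
}
parameters {
  vector[J - 1] alpha_free;              // abilities of students 2..J
  vector[I] beta;                        // question difficulties
  vector<lower=0>[I - 1] delta_free;     // discriminations of questions 2..I
}
transformed parameters {
  vector[J] alpha = append_row(0.0, alpha_free);   // pin alpha[1] = 0
  vector[I] delta = append_row(1.0, delta_free);   // pin delta[1] = 1
}
model {
  for (i in 1:I) {
    y[i] ~ bernoulli_logit(delta[i] * (alpha - beta[i]));
  }
}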
The many-to-one nature of $\operatorname{softmax}(\alpha)$ can be mitigated the same way, by pinning one component, say $\alpha_K$, to 0. The resulting mapping from the remaining $K - 1$ unconstrained parameters to the $K$-simplex is one-to-one. A Stan function that builds a simplex from a $(K-1)$-vector in this way can be written as follows.
vector softmax_id(vector alpha) {
  vector[num_elements(alpha) + 1] alphac1;
  for (k in 1:num_elements(alpha)) {
    alphac1[k] = alpha[k];
  }
  alphac1[num_elements(alphac1)] = 0;
  return softmax(alphac1);
}
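For context, here is a hedged sketch of how such a function might be used in a program, assuming the number of categories $K$ is declared in the data block; the surrounding program structure is illustrative, not from the original text.
functions {
  // softmax_id as defined above
  vector softmax_id(vector alpha) {
    vector[num_elements(alpha) + 1] alphac1;
    for (k in 1:num_elements(alpha)) {
      alphac1[k] = alpha[k];
    }
    alphac1[num_elements(alphac1)] = 0;
    return softmax(alphac1);
  }
}
data {
  int<lower=2> K;                          // number of categories
}
parameters {
  vector[K - 1] alpha;                     // unconstrained, identified parameters
}
transformed parameters {
  simplex[K] theta = softmax_id(alpha);    // K-simplex with the K-th component pinned to 0 on the logit scale
}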
Adding priors
So far, the models have been discussed as if the priors on the parameters were improper uniform priors.
A more general Bayesian solution to these invariance problems is to impose proper priors on the parameters. This approach can be used to solve problems arising from either additive or multiplicative invariance.
For example, normal priors on the multiple intercepts, $\lambda_1, \lambda_2 \sim \textsf{normal}(0, 10)$, render the posterior proper.
The following plots show the posteriors for the two-intercept parameterization without a prior, the two-intercept parameterization with a standard normal prior, and the one-intercept reparameterization without a prior. For all three cases, the posterior is plotted for 100 data points drawn from a standard normal.
The two-intercept parameterization leads to an improper posterior with a ridge extending infinitely to the northwest and southeast.
Adding a standard normal prior for the intercepts results in a proper posterior.
The single intercept parameterization with no prior also has a proper posterior.
The addition of a prior to the two intercepts model is shown in the second plot; the final plot shows the result of reparameterizing to a single intercept.
An alternative strategy for identifying a $K$-simplex parameterization $\theta = \operatorname{softmax}(\alpha)$ in terms of an unconstrained $K$-vector $\alpha$ is to place a prior on the components of $\alpha$ with a fixed location (that is, specifically avoiding hierarchical priors with varying location). Unlike the approach of pinning $\alpha_K = 0$, the prior-based approach models the $K$ outcomes symmetrically rather than modeling $K - 1$ outcomes relative to the $K$th. The pinned parameterization is usually more efficient statistically, however, because it does not carry the extra, prior-constrained degree of freedom.
Vague, strongly informative, and weakly informative priors
Care must be used when adding a prior to resolve invariances. If the prior is taken to be too broad (i.e., too vague), the resolution is in theory only, and samplers will still struggle.
Ideally, a realistic prior will be formulated based on substantive knowledge of the problem being modeled. Such a prior can be chosen to have the appropriate strength based on prior knowledge. A strongly informative prior makes sense if there is strong prior information.
When there is not strong prior information, a weakly informative prior strikes the proper balance between controlling computational inference without dominating the data in the posterior. In most problems, the modeler will have at least some notion of the expected scale of the estimates and be able to choose a prior for identification purposes that does not dominate the data, but provides sufficient computational control on the posterior.
Priors can also be used in the same way to control the additive invariance of the IRT model. A typical approach is to place a strong prior on the student ability parameters $\alpha$ to fix the location (for example, a standard normal prior), and a weaker prior on the question difficulty parameters $\beta$.
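As an illustration, a prior-identified version of the IRT model (without discrimination parameters) might look like the sketch below; the particular prior scales of 1 and 10 are illustrative choices, not prescribed by the original text.
data {
  int<lower=1> I;                        // number of questions
  int<lower=1> J;                        // number of students
  array[I, J] int<lower=0, upper=1> y;   // correctness indicators
}
parameters {
  vector[J] alpha;                       // student abilities
  vector[I] beta;                        // question difficulties
}
model {
  alpha ~ std_normal();                  // strong prior fixes the location of the abilities
  beta ~ normal(0, 10);                  // weaker prior on the difficulties
  for (i in 1:I) {
    y[i] ~ bernoulli_logit(alpha - beta[i]);
  }
}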
Label switching in mixture models
Where collinearity in regression models can lead to infinitely many posterior maxima, swapping components in a mixture model leads to finitely many posterior maxima.
Mixture models
Consider a normal mixture model with two location parameters $\mu_1$ and $\mu_2$, a shared scale $\sigma > 0$, a mixture ratio $\theta \in [0, 1]$, and data model
$$
p(y \mid \theta, \mu_1, \mu_2, \sigma)
  = \prod_{n=1}^{N}
    \big( \theta \, \textsf{normal}(y_n \mid \mu_1, \sigma)
          + (1 - \theta) \, \textsf{normal}(y_n \mid \mu_2, \sigma) \big).
$$
The issue here is the exchangeability of the mixture components: the model is invariant under swapping $(\mu_1, \theta)$ with $(\mu_2, 1 - \theta)$, so the posterior has (at least) two maxima that differ only in which component carries which label.
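The following is a minimal Stan sketch of this two-component mixture with flat (improper) priors on the locations; it is an illustration of the model just described, not a program from the original text. Swapping (mu[1], theta) with (mu[2], 1 - theta) leaves the log density unchanged.
data {
  int<lower=0> N;
  array[N] real y;
}
parameters {
  real<lower=0, upper=1> theta;   // mixing proportion for component 1
  array[2] real mu;               // component locations (exchangeable labels)
  real<lower=0> sigma;            // shared scale
}
model {
  for (n in 1:N) {
    // marginalize over the discrete component indicator
    target += log_mix(theta,
                      normal_lpdf(y[n] | mu[1], sigma),
                      normal_lpdf(y[n] | mu[2], sigma));
  }
}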
Convergence monitoring and effective sample size
The analysis of posterior convergence and effective sample size is also difficult for mixture models. For example, the $\hat{R}$ convergence statistic reported by Stan and the computation of effective sample size are both compromised by label switching: if different chains (or different portions of a single chain) settle into different labelings, the component-specific parameters $\mu_1$ and $\mu_2$ will appear not to have converged, even though each chain may be exploring its own mode perfectly well.
Some inferences are invariant
In some sense, the index (or label) of a mixture component is irrelevant. Posterior predictive inferences can still be carried out without identifying mixture components. For example, the log probability of a new observation does not depend on the identities of the mixture components. The only sound Bayesian inferences in such models are those that are invariant to label switching. Posterior means for the component-specific parameters are meaningless because they are not invariant to label switching; for example, the posterior mean for $\theta$ in the two-component mixture model above will always be 1/2, no matter what the data are, assuming the sampler manages to explore both modes.
Highly multimodal posteriors
Theoretically, the multimodality induced by label switching should not present a problem for inference, because all of the integrals involved in posterior predictive inference will be well behaved. The problem in practice is computation.
Being able to carry out such invariant inferences in practice is an altogether different matter. It is almost always intractable to find even a single posterior mode, much less balance the exploration of the neighborhoods of multiple local maxima according to their probability masses. In Gibbs sampling, it is unlikely for $\mu_1$ to move to a new mode when sampled conditioned on the current values of $\mu_2$ and $\theta$. For Hamiltonian Monte Carlo and NUTS, the problem is that the sampler gets stuck near one of the modes and rarely, if ever, crosses the region of low probability mass separating them.
Even with a proper posterior, all known sampling and inference techniques are notoriously ineffective when the number of modes grows super-exponentially, as it does for mixture models: with $K$ exchangeable components, every mode has $K!$ label-permuted copies.
Hacks as fixes
Several hacks (i.e., “tricks”) have been suggested and employed to deal with the problems posed by label switching in practice.
Parameter ordering constraints
One common strategy is to impose a constraint on the parameters that identifies the components. For instance, we might consider constraining $\mu_1 < \mu_2$ in the two-component normal mixture model discussed above. If the components are well separated, this simply selects one of the two symmetric modes. But if there is substantial posterior mass near $\mu_1 = \mu_2$, the constraint distorts the posterior rather than merely relabeling it, and the resulting inferences do not reflect the model’s actual posterior uncertainty.
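In Stan, such an ordering constraint can be imposed with the ordered type. The following parameters block for the mixture sketch above is illustrative (only the declaration of mu changes):
parameters {
  real<lower=0, upper=1> theta;
  ordered[2] mu;            // enforces mu[1] < mu[2], breaking the label symmetry
  real<lower=0> sigma;
}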
Initialization around a single mode
Another common approach is to run a single chain or to initialize the parameters near realistic values.4
This can work better than the hard constraint approach if reasonable initial values can be found and the labels do not switch within a Markov chain. The result is that all chains are glued to a neighborhood of a particular mode in the posterior.
Component collapsing in mixture models
It is possible for two mixture components in a mixture model to collapse to the same values during sampling or optimization. For example, a mixture of $K$ normals might devolve into a mixture of $K - 1$ normals because two of its components end up with identical (or nearly identical) locations and scales.
This will typically happen early in sampling due to initialization in MCMC or optimization, or arise from random movement during MCMC. Once the parameters of two components match for a given draw, it can be hard for the sampler to escape, because there can be a trough of low-density mass between the current parameter values and ones in which the components are again distinct.
It may help to use a smaller step size during warmup or a stronger prior on each mixture component’s membership responsibility. A more extreme measure is to include additional mixture components to deal with the possibility that some of them may collapse.
In general, it is difficult to recover exactly the right number of mixture components as that number grows beyond a handful.
Posteriors with unbounded densities
In some cases, the posterior density grows without bound as parameters approach certain poles or boundaries. In such cases, there are no posterior modes and numerical stability issues can arise as sampled parameters approach constraint boundaries.
Mixture models with varying scales
One such example is a binary mixture model with scales varying by component, $\sigma_1$ and $\sigma_2$ for locations $\mu_1$ and $\mu_2$. In this situation, the density grows without bound as $\sigma_1 \to 0$ with $\mu_1 = y_n$ for some data point $y_n$; that is, one of the mixture components concentrates all of its mass on a single data item.
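To see why (an added worked step, not part of the original), consider the likelihood contribution of the data point $y_n$ on which the first component concentrates:
$$
\theta \, \textsf{normal}(y_n \mid \mu_1, \sigma_1)
  + (1 - \theta) \, \textsf{normal}(y_n \mid \mu_2, \sigma_2)
  \;\geq\;
  \theta \, \frac{1}{\sqrt{2 \pi} \, \sigma_1}
  \quad \text{when } \mu_1 = y_n,
$$
which grows without bound as $\sigma_1 \to 0$, while the factors for the remaining data points stay bounded away from zero thanks to the second component. The product, and hence the posterior density under a flat prior, is therefore unbounded.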
Beta-binomial models with skewed data and weak priors
Another example of unbounded densities arises with a posterior such as $\textsf{beta}(\phi \mid 0.5, 0.5)$, which can arise if seemingly weak beta priors are used for groups that have no data. This density is unbounded as $\phi \to 0$ and as $\phi \to 1$. Similarly, a Bernoulli data model coupled with a “weak” beta prior leads to a posterior
$$
p(\phi \mid y)
  \propto \textsf{beta}(\phi \mid 0.5, 0.5)
    \times \prod_{n=1}^{N} \textsf{Bernoulli}(y_n \mid \phi)
  = \textsf{beta}\Big(\phi \,\Big|\, 0.5 + \textstyle\sum_{n=1}^{N} y_n,\;
        0.5 + N - \sum_{n=1}^{N} y_n\Big).
$$
If the observed data are highly skewed, say all successes, this posterior is unbounded at the corresponding boundary. It is nevertheless proper, and posterior means are well defined even though there is no posterior mode.
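As a concrete illustration (the specific count $N = 9$ is our choice, made for easy arithmetic): with $N = 9$ observations all equal to 1, the posterior is
$$
p(\phi \mid y) = \textsf{beta}(\phi \mid 9.5, 0.5),
$$
which grows without bound as $\phi \to 1$ yet is a proper distribution; its posterior mean is
$$
\mathbb{E}[\phi \mid y] = \frac{9.5}{9.5 + 0.5} = 0.95,
$$
so inference proceeds without difficulty despite the absence of a posterior mode.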
Constrained vs. unconstrained scales
Stan does not sample directly on the constrained $(0, 1)$ scale for this problem, so it does not evaluate the density arbitrarily close to the boundary on that scale. Rather, the constrained value $\phi$ is log-odds (logit) transformed to the unconstrained scale $(-\infty, \infty)$, which pushes the boundaries at 0 and 1 out to $-\infty$ and $+\infty$; the Jacobian adjustment that Stan automatically applies ensures the density on the unconstrained scale is proper and, for this example, bounded.
There are two problems that still arise, though. The first is that if the posterior mass for $\phi$ is concentrated near one of the boundaries, the unconstrained (logit-scale) parameter must sweep out very long paths, which can dominate the U-turn condition of the no-U-turn sampler. The second is that the inverse transform from the unconstrained scale back to $(0, 1)$ can round to exactly 0 or 1 in floating point even though the unconstrained value is finite, producing infinite log density values; this is one reason the logit-scale parameterizations of the Bernoulli and binomial distributions (e.g., bernoulli_logit) are recommended where applicable.
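For reference, here is a sketch of the standard logit change of variables for a $(0, 1)$-constrained parameter (ordinary calculus, not quoted from the original): with unconstrained value $x$ and $\phi = \operatorname{logit}^{-1}(x)$,
$$
\frac{d\phi}{dx} = \operatorname{logit}^{-1}(x)\,\big(1 - \operatorname{logit}^{-1}(x)\big),
$$
so the log density on the unconstrained scale is
$$
\log p\big(\operatorname{logit}^{-1}(x)\big)
  + \log \operatorname{logit}^{-1}(x)
  + \log\big(1 - \operatorname{logit}^{-1}(x)\big).
$$
For the $\textsf{beta}(\phi \mid 9.5, 0.5)$ posterior above, the added Jacobian terms remove the unboundedness at $\phi = 1$, so the density on the unconstrained scale is bounded.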
Posteriors with unbounded parameters
In some cases, the posterior density will not grow without bound, but parameters will grow without bound with gradually increasing density values. Like the models discussed in the previous section that have densities that grow without bound, such models also have no posterior modes.
Separability in logistic regression
Consider a logistic regression model with $N$ observed outcomes $y_n \in \{0, 1\}$, an $N \times K$ matrix $x$ of predictors, a $K$-dimensional coefficient vector $\beta$, and data model
$$
y_n \sim \textsf{Bernoulli}\big(\operatorname{logit}^{-1}(x_n \beta)\big).
$$
Now suppose that column $k$ of the predictor matrix is such that $x_{n, k} > 0$ if and only if $y_n = 1$, a condition known as separability. In this case, predictive accuracy on the observed data continues to improve as $\beta_k \to \infty$, because for cases with $y_n = 1$, $x_n \beta \to \infty$ and hence $\operatorname{logit}^{-1}(x_n \beta) \to 1$.
With separability, there is no maximum to the likelihood and hence no maximum likelihood estimate. From the Bayesian perspective, the posterior under an improper flat prior is improper, and therefore the marginal posterior mean for $\beta_k$ is also not defined. The usual solution in this situation is to include a proper prior for $\beta$, which ensures a proper posterior.
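A minimal Stan sketch of that remedy follows; the weakly informative normal(0, 2.5) scale is an illustrative choice, not a recommendation taken from the original text.
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  vector[K] beta;
}
model {
  beta ~ normal(0, 2.5);            // proper prior keeps the posterior proper even under separation
  y ~ bernoulli_logit(x * beta);    // logit-scale parameterization avoids underflow in the link
}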
Uniform posteriors
Suppose your model includes a parameter $\psi$ that is defined on $[0, 1]$ and is given a flat prior $\textsf{uniform}(\psi \mid 0, 1)$. Now if the data do not tell us anything about $\psi$, the posterior is also $\textsf{uniform}(\psi \mid 0, 1)$.
Although there is no maximum likelihood estimate for $\psi$, the posterior is uniform over a closed, bounded interval and hence proper. Unlike the improper flat posteriors discussed above, a uniform posterior over a bounded interval poses no particular problem for Stan’s samplers.
Sampling difficulties with problematic priors
With an improper posterior, it is theoretically impossible to properly explore the posterior. However, Gibbs sampling as performed by BUGS and JAGS, although still unable to properly sample from such an improper posterior, behaves differently in practice than the Hamiltonian Monte Carlo sampling performed by Stan when faced with an example such as the two intercept model discussed in the collinearity section and illustrated in the non-identifiable density plot.
Gibbs sampling
Gibbs sampling, as performed by BUGS and JAGS, may appear to be efficient and well behaved for this unidentified model, but as discussed in the previous subsection, will not actually explore the posterior properly.
Consider what happens with initial values $\lambda_1^{(0)}, \lambda_2^{(0)}, \sigma^{(0)}$. A Gibbs sampler updates each parameter in turn, conditioning on the current values of the others and the data: it draws $\lambda_1^{(m)}$ from $p(\lambda_1 \mid \lambda_2^{(m-1)}, \sigma^{(m-1)}, y)$, then $\lambda_2^{(m)}$ from $p(\lambda_2 \mid \lambda_1^{(m)}, \sigma^{(m-1)}, y)$, and finally $\sigma^{(m)}$ from $p(\sigma \mid \lambda_1^{(m)}, \lambda_2^{(m)}, y)$.
Now consider the draw for $\lambda_1$ (the draw for $\lambda_2$ is symmetric). Conditioned on the current values of $\lambda_2$ and $\sigma$, the distribution from which the next $\lambda_1$ is drawn is proper and tightly concentrated, so each update moves $\lambda_1$ only a small amount. Gibbs sampling therefore runs quickly and produces seemingly reasonable inference for the sum $\lambda_1 + \lambda_2$, but it never explores the full extent of the improper posterior for $\lambda_1$ and $\lambda_2$ individually; it merely takes a slow random walk along the ridge away from wherever it was initialized.
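To see why the conditional draw is so constrained (a short derivation under the flat prior, not quoted from the original), note that as a function of $\lambda_1$ with $\lambda_2$ and $\sigma$ held fixed,
$$
p(\lambda_1 \mid \lambda_2, \sigma, y)
  \propto \prod_{n=1}^{N} \textsf{normal}(y_n \mid \lambda_1 + \lambda_2, \sigma)
  \propto \textsf{normal}\!\left(\lambda_1 \,\middle|\, \bar{y} - \lambda_2,\ \frac{\sigma}{\sqrt{N}}\right),
$$
where $\bar{y}$ is the sample mean. Each Gibbs update therefore keeps $\lambda_1 + \lambda_2$ within roughly $\sigma / \sqrt{N}$ of $\bar{y}$, so the chain drifts only slowly along the ridge rather than exploring its unbounded extent.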
Hamiltonian Monte Carlo sampling
Hamiltonian Monte Carlo (HMC), as performed by Stan, is much more efficient at exploring posteriors in models where parameters are correlated in the posterior. In this particular example, the Hamiltonian dynamics (i.e., the motion of a fictitious particle given random momentum in the field defined by the negative log posterior) runs up and down along the valley defined by the potential energy (ridges in log posteriors correspond to valleys in potential energy). In practice, even with random initial momenta for $\lambda_1$ and $\lambda_2$, the gradient of the log posterior adjusts for their correlation, and the simulation runs $\lambda_1$ and $\lambda_2$ in opposite directions along the valley corresponding to the ridge in the posterior log density, moving further and further out along the ridge.
No-U-turn sampling
Stan’s default no-U-turn sampler (NUTS) is even more efficient at exploring the posterior (see Hoffman and Gelman 2014). NUTS simulates the motion of the fictitious particle representing the parameter values until it makes a U-turn; here it will be defeated in most cases, as the particle just moves down the potential energy valley indefinitely without making a U-turn. What happens in practice is that the maximum number of leapfrog steps in the simulation will be hit in many of the iterations, causing a large number of log probability and gradient evaluations (roughly 1000, i.e., $2^{10}$, if the max tree depth is set to 10, as in the default). Thus sampling will appear to be slow. This is indicative of an improper posterior, not a bug in the NUTS algorithm or its implementation. It is simply not possible to sample from an improper posterior! Thus the behavior of HMC in general and NUTS in particular should be reassuring in that it will clearly fail in cases of improper posteriors, resulting in a clean diagnostic of sweeping out large paths in the posterior.
Here are results of Stan runs with default parameters fit to $N = 100$ data points drawn from a standard normal, $y_n \sim \textsf{normal}(0, 1)$.
Two Scale Parameters, Improper Prior
Inference for Stan model: improper_stan
Warmup took (2.7, 2.6, 2.9, 2.9) seconds, 11 seconds total
Sampling took (3.4, 3.7, 3.6, 3.4) seconds, 14 seconds total
Mean MCSE StdDev 5% 95% N_Eff N_Eff/s R_hat
lp__ -5.3e+01 7.0e-02 8.5e-01 -5.5e+01 -5.3e+01 150 11 1.0
n_leapfrog__ 1.4e+03 1.7e+01 9.2e+02 3.0e+00 2.0e+03 2987 212 1.0
lambda1 1.3e+03 1.9e+03 2.7e+03 -2.3e+03 6.0e+03 2.1 0.15 5.2
lambda2 -1.3e+03 1.9e+03 2.7e+03 -6.0e+03 2.3e+03 2.1 0.15 5.2
sigma 1.0e+00 8.5e-03 6.2e-02 9.5e-01 1.2e+00 54 3.9 1.1
mu 1.6e-01 1.9e-03 1.0e-01 -8.3e-03 3.3e-01 2966 211 1.0
Two Scale Parameters, Weak Prior
Warmup took (0.40, 0.44, 0.40, 0.36) seconds, 1.6 seconds total
Sampling took (0.47, 0.40, 0.47, 0.39) seconds, 1.7 seconds total
Mean MCSE StdDev 5% 95% N_Eff N_Eff/s R_hat
lp__ -54 4.9e-02 1.3e+00 -5.7e+01 -53 728 421 1.0
n_leapfrog__ 157 2.8e+00 1.5e+02 3.0e+00 511 3085 1784 1.0
lambda1 0.31 2.8e-01 7.1e+00 -1.2e+01 12 638 369 1.0
lambda2 -0.14 2.8e-01 7.1e+00 -1.2e+01 12 638 369 1.0
sigma 1.0 2.6e-03 8.0e-02 9.2e-01 1.2 939 543 1.0
mu 0.16 1.8e-03 1.0e-01 -8.1e-03 0.33 3289 1902 1.0
One Scale Parameter, Improper Prior
Warmup took (0.011, 0.012, 0.011, 0.011) seconds, 0.044 seconds total
Sampling took (0.017, 0.020, 0.020, 0.019) seconds, 0.077 seconds total
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -54 2.5e-02 0.91 -5.5e+01 -53 -53 1318 17198 1.0
n_leapfrog__ 3.2 2.7e-01 1.7 1.0e+00 3.0 7.0 39 507 1.0
mu 0.17 2.1e-03 0.10 -3.8e-03 0.17 0.33 2408 31417 1.0
sigma 1.0 1.6e-03 0.071 9.3e-01 1.0 1.2 2094 27321 1.0
On the top is the non-identified model with improper uniform priors and data model $y_n \sim \textsf{normal}(\lambda_1 + \lambda_2, \sigma)$.
In the middle is the same data model as on the top, plus priors $\lambda_k \sim \textsf{normal}(0, 10)$.
On the bottom is an identified model with an improper prior and data model $y_n \sim \textsf{normal}(\mu, \sigma)$. All three models estimate $\mu$ at roughly 0.16 with low Monte Carlo standard error but a posterior standard deviation of about 0.1; the true value $\mu = 0$ is within the 90% posterior intervals in all three models.
Examples: fits in Stan
To illustrate the issues with sampling from non-identified and only weakly identified models, we fit three models with increasing degrees of identification of their parameters. The posteriors for these models are illustrated in the non-identifiable density plot. The first model is the unidentified model with two location parameters and no priors discussed in the collinearity section.
data {
  int N;
  array[N] real y;
}
parameters {
  real lambda1;
  real lambda2;
  real<lower=0> sigma;
}
transformed parameters {
  real mu;
  mu = lambda1 + lambda2;
}
model {
  y ~ normal(mu, sigma);
}
The second adds priors for lambda1 and lambda2 to the model block of the previous model.
lambda1 ~ normal(0, 10);
lambda2 ~ normal(0, 10);
The third involves a single location parameter, but no priors.
data {
int N;
array[N] real y;
}parameters {
real mu;
real<lower=0> sigma;
}model {
y ~ normal(mu, sigma); }
All three of the example models were fit in Stan 2.1.0 with default parameters (1000 warmup iterations, 1000 sampling iterations, NUTS sampler with max tree depth of 10). The results are shown in the non-identified fits figure. The key statistics from these outputs are the following.
As indicated by the R_hat column, all parameters have converged other than $\lambda_1$ and $\lambda_2$ in the non-identified model.
The average number of leapfrog steps is roughly 3 in the identified model, 150 in the model identified by a weak prior, and 1400 in the non-identified model.
The number of effective samples per second for $\mu$ is roughly 31,000 in the identified model, 1,900 in the model identified with weakly informative priors, and 200 in the non-identified model; the results for $\sigma$ show the same pattern.
In the non-identified model, the 95% interval for $\lambda_1$ is (-2300, 6000), whereas it is only (-12, 12) in the model identified with weakly informative priors.
In all three models, the simulated values of $\mu = 0$ and $\sigma = 1$ are well within the posterior 90% intervals.
The first two points, lack of convergence and hitting the maximum number of leapfrog steps (equivalently, the maximum tree depth), are indicative of improper posteriors. Thus rather than covering up the problem with poor sampling, as may happen with Gibbs samplers, Hamiltonian Monte Carlo tries to explore the posterior, and its failure is a clear indication that something is amiss in the model.
References
Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (1): 1593–1623.
Neal, Radford M. 1996. “Sampling from Multimodal Distributions Using Tempered Transitions.” Statistics and Computing 6: 353–66.
Swendsen, Robert H., and Jian-Sheng Wang. 1986. “Replica Monte Carlo Simulation of Spin-Glasses.” Physical Review Letters 57: 2607–9.
Footnotes
This example was raised by Richard McElreath on the Stan users group in a query about the difference in behavior between Gibbs sampling as used in BUGS and JAGS and the Hamiltonian Monte Carlo (HMC) and no-U-turn samplers (NUTS) used by Stan.↩︎
The marginal posterior $p(\mu, \sigma \mid y)$ is proper here as long as there are at least two distinct data points.↩︎
A Laplace prior (or an L1 regularizer for penalized maximum likelihood estimation) is not sufficient to remove this additive invariance. It provides shrinkage, but does not in and of itself identify the parameters, because adding a constant to $\lambda_1$ and subtracting it from $\lambda_2$ results in the same value for the prior density.↩︎
Tempering methods may be viewed as automated ways to carry out such a search for modes, though most MCMC tempering methods continue to search for modes on an ongoing basis; see (Swendsen and Wang 1986; Neal 1996).↩︎