1.10 Hierarchical Priors

Priors on priors, also known as “hyperpriors,” should be treated the same way as priors on lower-level parameters: as much prior information as is available should be brought to bear. Because hyperpriors often apply to only a handful of lower-level parameters, care must be taken to ensure the posterior is both proper and not overly sensitive, either statistically or computationally, to wide tails in the priors.

1.10.1 Boundary-Avoiding Priors for MLE in Hierarchical Models

The fundamental problem with maximum likelihood estimation (MLE) in the hierarchical model setting is that as the hierarchical variance drops and the values cluster around the hierarchical mean, the overall density grows without bound. As an illustration, consider a simple hierarchical linear regression (with fixed prior mean) of \(y_n \in \mathbb{R}\) on \(x_n \in \mathbb{R}^K\), formulated as

\[ \begin{array}{rcl} y_n & \sim & \mathsf{normal}(x_n \beta, \sigma) \\[3pt] \beta_k & \sim & \mathsf{normal}(0,\tau) \\[3pt] \tau & \sim & \mathsf{Cauchy}(0,2.5) \end{array} \]

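In this case, as \(\tau \rightarrow 0\) and \(\beta_k \rightarrow 0\), the posterior density \[ p(\beta,\tau,\sigma|y,x) \propto p(y|x,\beta,\sigma) \, p(\beta|\tau) \, p(\tau) \] grows without bound, because the density of \(\mathsf{normal}(0,\tau)\) evaluated at \(\beta_k = 0\) is proportional to \(1/\tau\). See the plot of Neal’s funnel density, which has similar behavior.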
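The divergence can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the toy data `x` and `y`, and the choice \(\sigma = 1\) are all assumptions made for illustration, and the Cauchy prior is treated as truncated to \(\tau > 0\). It evaluates the unnormalized log density at \(\beta = 0\) for a decreasing sequence of \(\tau\) values.

```python
import math


def normal_lpdf(y, mu, sigma):
    """Log density of normal(mu, sigma), with sigma a standard deviation."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)


def half_cauchy_lpdf(tau, scale):
    """Log density of Cauchy(0, scale) truncated to tau > 0 (assumed here)."""
    return math.log(2.0 / (math.pi * scale)) - math.log1p((tau / scale) ** 2)


def joint_log_density(beta, tau, sigma, x, y, cauchy_scale=2.5):
    """Unnormalized log posterior: log p(y | x, beta, sigma) + log p(beta | tau) + log p(tau)."""
    lp = half_cauchy_lpdf(tau, cauchy_scale)
    lp += sum(normal_lpdf(b, 0.0, tau) for b in beta)
    for x_n, y_n in zip(x, y):
        mu_n = sum(x_nk * b for x_nk, b in zip(x_n, beta))
        lp += normal_lpdf(y_n, mu_n, sigma)
    return lp


# Hypothetical toy data with K = 2 predictors (made up for this sketch).
x = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.3]]
y = [0.8, -1.1, 2.4, 0.1]

beta_zero = [0.0, 0.0]
taus = [1.0, 1e-2, 1e-4, 1e-8]
log_densities = [joint_log_density(beta_zero, t, 1.0, x, y) for t in taus]

for t, lp in zip(taus, log_densities):
    print(f"tau = {t:g}: log density = {lp:.2f}")
```

With \(\beta\) held at zero the likelihood term is constant, so the only terms that change are the priors. Each coefficient contributes \(-\log \tau\), and the printed log density therefore keeps increasing as \(\tau\) shrinks.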

There is obviously no MLE for \(\beta,\tau,\sigma\) in such a case, and therefore the model must be modified if posterior modes are to be used for inference. The approach recommended by Chung et al. (2013) is to use a gamma distribution as a prior, such as

\[ \tau \sim \mathsf{Gamma}(2, 1/A), \]

for a reasonably large value of \(A\), such as \(A = 10\).
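To see what such a prior does near the boundary, the sketch below evaluates the \(\mathsf{Gamma}(2, 1/A)\) log density, assuming the second argument is a rate (inverse scale), so that the density is \(s \, e^{-s/A} / A^2\) for a scale parameter \(s > 0\). The function name `gamma2_lpdf`, the variable `s`, and the evaluation grid are hypothetical choices made for illustration only.

```python
import math


def gamma2_lpdf(s, A):
    """Log density of Gamma(shape=2, rate=1/A) at s > 0 (rate parameterization assumed)."""
    return math.log(s) - 2.0 * math.log(A) - s / A


A = 10.0

# The log density falls toward -infinity as s approaches the boundary at zero.
near_zero = [gamma2_lpdf(s, A) for s in (1e-2, 1e-4, 1e-8)]

# Close to zero the density is nearly linear in s: doubling s roughly doubles it.
ratio = math.exp(gamma2_lpdf(2e-4, A) - gamma2_lpdf(1e-4, A))

# Coarse grid search for the prior mode; analytically it is (shape - 1) / rate = A.
grid = [0.1 * i for i in range(1, 501)]
mode = max(grid, key=lambda s: gamma2_lpdf(s, A))

print("log density near zero:", [round(v, 2) for v in near_zero])
print("density ratio when doubling s near zero:", round(ratio, 4))
print("grid estimate of the prior mode:", round(mode, 1))
```

The prior density vanishes at zero and rises approximately linearly away from it. A large \(A\) keeps the prior nearly flat over moderate values of the scale, since the prior mode sits at \(A\).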

References

Chung, Yeojin, Sophia Rabe-Hesketh, Vincent Dorie, Andrew Gelman, and Jingchen Liu. 2013. “A Nondegenerate Penalized Likelihood Estimator for Variance Parameters in Multilevel Models.” Psychometrika 78 (4): 685–709.