This is an old version, view current version.

26.5 Example of prior predictive checks

Suppose we have a model for a football (aka soccer) league where there are \(J\) teams. Each team has a scoring rate \(\lambda_j\) and in each game will be assumed to score \(\textrm{poisson}(\lambda_j)\) points. Yes, this model completely ignores defense. Suppose the modeler does not want to “put their thumb on the scale” and would rather “let the data speak for themselves” and so uses a prior with very wide tails, because it seems uninformative, such as the widely deployed \[ \lambda_j \sim \textrm{gamma}(\epsilon_1, \epsilon_2). \] This is not just a manufactured example; The BUGS Book recommends setting \(\epsilon = (0.5, 0.00001)\), which corresponds to a Jeffreys prior for a Poisson rate parameter prior (Lunn et al. 2012, 85).

Suppose the league plays a round-robin tournament wherein every team plays every other team. The following Stan model generates random team abilities and the results of such a round-robin tournament, which may be used to perform prior predictive checks.

data {
  int<lower = 0> J;
  real<lower = 0> epsilon[2];
generated quantities {
  real<lower = 0> lambda[J];
  int y[J, J];
  for (j in 1:J) lambda[j] = gamma_rng(epsilon[1], epsilon[2]);
  for (i in 1:J)
    for (j in 1:J)
      y[i, j] = poisson_rng(lambda[i]) - poisson_rng(lambda[j]);

In this simulation, teams play each other twice and play themselves once. This could be made more realistic by controlling the combinatorics to only generate a single result for each pair of teams, of which there are \(\binom{J}{2} = \frac{J \cdot (J - 1)}{2}.\)

Using the \(\textrm{gamma}(0.5, 0.00001)\) reference prior on team abilities, the following are the first 20 simulated point differences for the match between the first two teams, \(y^{(1:20)}_{1, 2}\).

2597 -26000   5725  22496   1270   1072   4502  -2809   -302   4987
7513   7527  -3268 -12374   3828   -158 -29889   2986  -1392     66

That’s some pretty highly scoring football games being simulated; all but one has a score differential greater than 100! In other words, this \(\textrm{gamma}(0.5, 0.00001)\) prior is putting around 95% of its weight on score differentials above 100. Given that two teams combined rarely score 10 points, this prior is way out of line with prior knowledge about football matches; it is not only consistent with outcomes that have never occurred in the history of the sport, it puts most of the prior probability mass there.

The posterior predictive distribution can be strongly affected by the prior when there is not much observed data and substantial prior mass is concentrated around infeasible values (Gelman 2006).

Just as with posterior predictive distributions, any statistics of the generated data may be evaluated. Here, the focus was on score difference between a single pair of teams, but it could’ve been on maximums, minimums, averages, variances, etc.

In this textbook example, the prior is univariate and directly related to the expected number of points scored, and could thus be directly inspected for consistency with prior knowledge about scoring rates in football. There will not be the same kind of direct connection when the prior and sampling distributions are multivariate. In these more challenging situations, prior predictive checks are an easy way to get a handle on the implications of a prior in terms of what it says the data is going to look like; for a more complex application involving spatially heterogeneous air pollution concentration, see (Gabry et al. 2019).

Prior predictive checks can also be compared with the data, but one should not expect them to be calibrated in the same way as posterior predictive checks. That would require guessing the posterior and encoding it in the prior. The goal is make sure the prior is not so wide that it will pull probability mass away from feasible values.


Gabry, Jonah, Aki Vehtari, Måns Magnusson, Yuling Yao, Andrew Gelman, Paul-Christian Bürkner, Ben Goodrich, and Juho Piironen. 2019. “loo: Efficient Leave-One-Out Cross-Validation and WAIC for Bayesian Models.” The Comprehensive R Network 2 (2).

Gelman, A. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models.” Bayesian Analysis 1 (3): 515–34.

Lunn, David, Christopher Jackson, Nicky Best, Andrew Thomas, and David Spiegelhalter. 2012. The BUGS Book: A Practical Introduction to Bayesian Analysis. CRC Press/Chapman & Hall.