20.2 Reparameterizations

Reparameterizations may be implemented directly using the transformed parameters block or just in the model block.

Beta and Dirichlet Priors

The beta and Dirichlet distributions may both be reparameterized from a vector of counts to use a mean and total count.

Beta Distribution

For example, the Beta distribution is parameterized by two positive count parameters \(\alpha, \beta > 0\). The following example illustrates a hierarchical Stan model with a vector of parameters theta are drawn i.i.d. for a Beta distribution whose parameters are themselves drawn from a hyperprior distribution.

parameters {
  real<lower = 0> alpha;
  real<lower = 0> beta;
  ...
model {
  alpha ~ ...
  beta ~ ...
  for (n in 1:N)
    theta[n] ~ beta(alpha, beta);
  ...

It is often more natural to specify hyperpriors in terms of transformed parameters. In the case of the Beta, the obvious choice for reparameterization is in terms of a mean parameter \[ \phi = \alpha / (\alpha + \beta) \] and total count parameter \[ \lambda = \alpha + \beta. \] Following @[GelmanEtAl:2013, Chapter 5] the mean gets a uniform prior and the count parameter a Pareto prior with \(p(\lambda) \propto \lambda^{-2.5}\).

parameters {
  real<lower=0,upper=1> phi;
  real<lower=0.1> lambda;
  ...
transformed parameters {
  real<lower=0> alpha = lambda * phi;
  real<lower=0> beta = lambda * (1 - phi);
  ...
model {
  phi ~ beta(1, 1); // uniform on phi, could drop
  lambda ~ pareto(0.1, 1.5);
  for (n in 1:N)
    theta[n] ~ beta(alpha, beta);
  ...

The new parameters, phi and lambda, are declared in the parameters block and the parameters for the Beta distribution, alpha and beta, are declared and defined in the transformed parameters block. And If their values are not of interest, they could instead be defined as local variables in the model as follows.

model {
  real alpha = lambda * phi
  real beta = lambda * (1 - phi);
...
  for (n in 1:N)
    theta[n] ~ beta(alpha, beta);
...
}

With vectorization, this could be expressed more compactly and efficiently as follows.

model {
  theta ~ beta(lambda * phi, lambda * (1 - phi));
...
}

If the variables alpha and beta are of interest, they can be defined in the transformed parameter block and then used in the model.

Jacobians not Necessary

Because the transformed parameters are being used, rather than given a distribution, there is no need to apply a Jacobian adjustment for the transform. For example, in the beta distribution example, alpha and beta have the correct posterior distribution.

Dirichlet Priors

The same thing can be done with a Dirichlet, replacing the mean for the Beta, which is a probability value, with a simplex. Assume there are \(K > 0\) dimensions being considered (\(K=1\) is trivial and \(K=2\) reduces to the beta distribution case). The traditional prior is

parameters {
  vector[K] alpha;
  simplex[K] theta[N];
  ...
model {
  alpha ~ ...;
  for (n in 1:N)
    theta[n] ~ dirichlet(alpha);
}

This provides essentially \(K\) degrees of freedom, one for each dimension of alpha, and it is not obvious how to specify a reasonable prior for alpha.

An alternative coding is to use the mean, which is a simplex, and a total count.

parameters {
  simplex[K] phi;
  real<lower=0> kappa;
  simplex[K] theta[N];
  ...
transformed parameters {
  vector[K] alpha = kappa * phi;
  ...
}
model {
  phi ~ ...;
  kappa ~ ...;
  for (n in 1:N)
    theta[n] ~ dirichlet(alpha);

Now it is much easier to formulate priors, because phi is the expected value of theta and kappa (minus K) is the strength of the prior mean measured in number of prior observations.

Transforming Unconstrained Priors: Probit and Logit

If the variable \(u\) has a \(\mathsf{Uniform}(0, 1)\) distribution, then \(\mbox{logit}(u)\) is distributed as \(\mathsf{Logistic}(0, 1)\). This is because inverse logit is the cumulative distribution function (cdf) for the logistic distribution, so that the logit function itself is the inverse cdf and thus maps a uniform draw in \((0, 1)\) to a logistically-distributed quantity.

Things work the same way for the probit case: if \(u\) has a \(\mathsf{Uniform}(0, 1)\) distribution, then \(\Phi^{-1}(u)\) has a \(\mathsf{normal}(0, 1)\) distribution. The other way around, if \(v\) has a \(\mathsf{normal}(0, 1)\) distribution, then \(\Phi(v)\) has a \(\mathsf{Uniform}(0, 1)\) distribution.

In order to use the probit and logistic as priors on variables constrained to \((0, 1)\), create an unconstrained variable and transform it appropriately. For comparison, the following Stan program fragment declares a \((0, 1)\)-constrained parameter theta and gives it a beta prior, then uses it as a parameter in a distribution (here using foo as a placeholder).

parameters {
  real<lower = 0, upper = 1> theta;
...
model {
  theta ~ beta(a, b);
  ...
  y ~ foo(theta);
...

If the variables a and b are one, then this imposes a uniform distribution theta. If a and b are both less than one, then the density on theta has a U shape, whereas if they are both greater than one, the density of theta has an inverted-U or more bell-like shape.

Roughly the same result can be achieved with unbounded parameters that are probit or inverse-logit-transformed. For example,

parameters {
  real theta_raw;
...
transformed parameters {
  real<lower = 0, upper = 1> theta = inv_logit(theta_raw);
...
model {
  theta_raw ~ logistic(mu, sigma);
  ...
  y ~ foo(theta);
...

In this model, an unconstrained parameter theta_raw gets a logistic prior, and then the transformed parameter theta is defined to be the inverse logit of theta_raw. In this parameterization, inv_logit(mu) is the mean of the implied prior on theta. The prior distribution on theta will be flat if sigma is one and mu is zero, and will be U-shaped if sigma is larger than one and bell shaped if sigma is less than one.

When moving from a variable in \((0, 1)\) to a simplex, the same trick may be performed using the softmax function, which is a multinomial generalization of the inverse logit function. First, consider a simplex parameter with a Dirichlet prior.

parameters {
  simplex[K] theta;
...
model {
  theta ~ dirichlet(a);
  ...
  y ~ foo(theta);

Now a is a vector with K rows, but it has the same shape properties as the pair a and b for a beta; the beta distribution is just the distribution of the first component of a Dirichlet with parameter vector \([a b]^{\top}\). To formulate an unconstrained prior, the exact same strategy works as for the beta.

parameters {
  vector[K] theta_raw;
...
transformed parameters {
  simplex[K] theta = softmax(theta_raw);
...
model {
  theta_raw ~ multi_normal_cholesky(mu, L_Sigma);

The multivariate normal is used for convenience and efficiency with its Cholesky-factor parameterization. Now the mean is controlled by softmax(mu), but we have additional control of covariance through L_Sigma at the expense of having on the order of \(K^2\) parameters in the prior rather than order \(K\). If no covariance is desired, the number of parameters can be reduced back to \(K\) using a vectorized normal distribution as follows.

  theta_raw ~ normal(mu, sigma);

where either or both of mu and sigma can be vectors.