## 21.2 Reparameterizations
Reparameterizations may be implemented directly using the transformed parameters block or just in the model block.
### Beta and Dirichlet Priors
The beta and Dirichlet distributions may both be reparameterized from a vector of counts to use a mean and total count.
#### Beta Distribution
For example, the beta distribution is parameterized by two positive count parameters \(\alpha, \beta > 0\). The following example illustrates a hierarchical Stan model with a vector of parameters `theta` that are drawn i.i.d. from a beta distribution whose parameters are themselves drawn from a hyperprior distribution.
```stan
parameters {
  real<lower=0> alpha;
  real<lower=0> beta;
  ...
model {
  alpha ~ ...
  beta ~ ...
  for (n in 1:N)
    theta[n] ~ beta(alpha, beta);
  ...
```
It is often more natural to specify hyperpriors in terms of transformed parameters. In the case of the beta, the obvious choice for reparameterization is in terms of a mean parameter \(\phi = \alpha / (\alpha + \beta)\) and a total count parameter \(\lambda = \alpha + \beta\). Following [@GelmanEtAl:2013, Chapter 5], the mean gets a uniform prior and the count parameter a Pareto prior with \(p(\lambda) \propto \lambda^{-2.5}\).
```stan
parameters {
  real<lower=0, upper=1> phi;
  real<lower=0.1> lambda;
  ...
transformed parameters {
  real<lower=0> alpha = lambda * phi;
  real<lower=0> beta = lambda * (1 - phi);
  ...
model {
  phi ~ beta(1, 1); // uniform on phi, could drop
  lambda ~ pareto(0.1, 1.5);
  for (n in 1:N)
    theta[n] ~ beta(alpha, beta);
  ...
```
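As a quick sanity check on this reparameterization, the mean of a \(\mathsf{Beta}(\alpha, \beta)\) variate is \(\alpha / (\alpha + \beta)\), so mapping \((\phi, \lambda)\) to \(\alpha = \lambda\phi\) and \(\beta = \lambda(1 - \phi)\) recovers mean \(\phi\) and total count \(\lambda\). The following Python sketch (illustrative only, not part of the Stan program) verifies the round trip:

```python
# Sanity check of the mean/total-count reparameterization of the beta:
# phi = alpha / (alpha + beta) and lambda = alpha + beta, so
# alpha = lambda * phi and beta = lambda * (1 - phi).

def beta_from_mean_count(phi, lam):
    """Map (mean, total count) back to the usual (alpha, beta)."""
    return lam * phi, lam * (1 - phi)

phi, lam = 0.3, 20.0
alpha, beta = beta_from_mean_count(phi, lam)

# The mean of Beta(alpha, beta) is alpha / (alpha + beta) = phi,
# and the total count alpha + beta = lambda.
assert abs(alpha / (alpha + beta) - phi) < 1e-9
assert abs((alpha + beta) - lam) < 1e-9
```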
The new parameters, `phi` and `lambda`, are declared in the parameters block and the parameters for the beta distribution, `alpha` and `beta`, are declared and defined in the transformed parameters block. If their values are not of interest, they could instead be defined as local variables in the model as follows.
```stan
model {
  real alpha = lambda * phi;
  real beta = lambda * (1 - phi);
  ...
  for (n in 1:N)
    theta[n] ~ beta(alpha, beta);
  ...
}
```
With vectorization, this could be expressed more compactly and efficiently as follows.
```stan
model {
  theta ~ beta(lambda * phi, lambda * (1 - phi));
  ...
}
```
If the variables `alpha` and `beta` are of interest, they can be defined in the transformed parameters block and then used in the model.
#### Jacobians Not Necessary
Because the transformed parameters are being used, rather than given a distribution, there is no need to apply a Jacobian adjustment for the transform. For example, in the beta distribution example, `alpha` and `beta` have the correct posterior distribution.
#### Dirichlet Priors
The same thing can be done with a Dirichlet, replacing the mean for the Beta, which is a probability value, with a simplex. Assume there are \(K > 0\) dimensions being considered (\(K=1\) is trivial and \(K=2\) reduces to the beta distribution case). The traditional prior is
```stan
parameters {
  vector<lower=0>[K] alpha;
  simplex[K] theta[N];
  ...
model {
  alpha ~ ...;
  for (n in 1:N)
    theta[n] ~ dirichlet(alpha);
}
```
This provides essentially \(K\) degrees of freedom, one for each dimension of `alpha`, and it is not obvious how to specify a reasonable prior for `alpha`.
An alternative coding is to use the mean, which is a simplex, and a total count.
```stan
parameters {
  simplex[K] phi;
  real<lower=0> kappa;
  simplex[K] theta[N];
  ...
transformed parameters {
  vector[K] alpha = kappa * phi;
  ...
}
model {
  phi ~ ...;
  kappa ~ ...;
  for (n in 1:N)
    theta[n] ~ dirichlet(alpha);
}
```
Now it is much easier to formulate priors, because `phi` is the expected value of `theta` and `kappa` (minus `K`) is the strength of the prior mean measured in number of prior observations.
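The same round trip can be checked numerically for the Dirichlet: with \(\alpha = \kappa\phi\), the Dirichlet mean \(\alpha / \sum_k \alpha_k\) recovers \(\phi\) and \(\sum_k \alpha_k\) recovers \(\kappa\). A Python sketch (illustrative only, not part of the Stan program):

```python
# Sanity check of the mean/count parameterization of the Dirichlet:
# alpha = kappa * phi for a simplex phi, so the Dirichlet mean
# alpha / sum(alpha) is phi and sum(alpha) is kappa.

def dirichlet_from_mean_count(phi, kappa):
    return [kappa * p for p in phi]

phi = [0.2, 0.3, 0.5]   # a simplex: non-negative entries summing to one
kappa = 10.0
alpha = dirichlet_from_mean_count(phi, kappa)

total = sum(alpha)
mean = [a / total for a in alpha]
assert abs(total - kappa) < 1e-9
assert all(abs(m - p) < 1e-9 for m, p in zip(mean, phi))
```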
### Transforming Unconstrained Priors: Probit and Logit
If the variable \(u\) has a \(\mathsf{Uniform}(0, 1)\) distribution, then \(\mbox{logit}(u)\) is distributed as \(\mathsf{Logistic}(0, 1)\). This is because inverse logit is the cumulative distribution function (cdf) for the logistic distribution, so that the logit function itself is the inverse cdf and thus maps a uniform draw in \((0, 1)\) to a logistically-distributed quantity.
Things work the same way for the probit case: if \(u\) has a \(\mathsf{Uniform}(0, 1)\) distribution, then \(\Phi^{-1}(u)\) has a \(\mathsf{Normal}(0, 1)\) distribution. The other way around, if \(v\) has a \(\mathsf{Normal}(0, 1)\) distribution, then \(\Phi(v)\) has a \(\mathsf{Uniform}(0, 1)\) distribution.
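The logit case can be checked by simulation: the empirical distribution of \(\mbox{logit}(u)\) for uniform draws \(u\) should match the \(\mathsf{Logistic}(0, 1)\) cdf, which is the inverse logit. A Python sketch (illustrative only, with a fixed seed so the check is reproducible):

```python
import math
import random

def logit(u):
    return math.log(u / (1 - u))

def logistic_cdf(x):
    # The inverse logit is the Logistic(0, 1) cumulative distribution.
    return 1 / (1 + math.exp(-x))

random.seed(0)
draws = [logit(random.random()) for _ in range(100_000)]

# Empirical cdf of logit(u) should match the Logistic(0, 1) cdf.
for x in (-2.0, 0.0, 1.5):
    empirical = sum(d <= x for d in draws) / len(draws)
    assert abs(empirical - logistic_cdf(x)) < 0.01
```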
In order to use the probit and logistic as priors on variables constrained to \((0, 1)\), create an unconstrained variable and transform it appropriately. For comparison, the following Stan program fragment declares a \((0, 1)\)-constrained parameter `theta` and gives it a beta prior, then uses it as a parameter in a distribution (here using `foo` as a placeholder).
```stan
parameters {
  real<lower=0, upper=1> theta;
  ...
model {
  theta ~ beta(a, b);
  ...
  y ~ foo(theta);
  ...
```
If the variables `a` and `b` are one, then this imposes a uniform distribution on `theta`. If `a` and `b` are both less than one, then the density on `theta` has a U shape, whereas if they are both greater than one, the density of `theta` has an inverted-U or more bell-like shape.
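These shape claims follow directly from the beta density \(p(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}\); the following Python sketch (illustrative only) evaluates the density at a few points to confirm them:

```python
import math

def beta_pdf(theta, a, b):
    # Beta density with the normalizing constant B(a, b)^{-1}.
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)

# a = b < 1: U shape, density higher near the edges than the middle.
assert beta_pdf(0.05, 0.5, 0.5) > beta_pdf(0.5, 0.5, 0.5)
# a = b > 1: inverted-U (bell-like), density peaks in the middle.
assert beta_pdf(0.5, 2.0, 2.0) > beta_pdf(0.05, 2.0, 2.0)
# a = b = 1: uniform, the density is one everywhere.
assert abs(beta_pdf(0.3, 1.0, 1.0) - 1.0) < 1e-12
```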
Roughly the same result can be achieved with unbounded parameters that are probit or inverse-logit-transformed. For example,
```stan
parameters {
  real theta_raw;
  ...
transformed parameters {
  real<lower=0, upper=1> theta = inv_logit(theta_raw);
  ...
model {
  theta_raw ~ logistic(mu, sigma);
  ...
  y ~ foo(theta);
  ...
```
In this model, an unconstrained parameter `theta_raw` gets a logistic prior, and then the transformed parameter `theta` is defined to be the inverse logit of `theta_raw`. In this parameterization, `inv_logit(mu)` is the median of the implied prior on `theta`. The prior distribution on `theta` will be flat if `sigma` is one and `mu` is zero, U-shaped if `sigma` is larger than one, and bell shaped if `sigma` is less than one.
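The flat case can be verified by a change of variables: the implied density of `theta` is the logistic density evaluated at `logit(theta)` times the Jacobian \(1 / (\theta(1-\theta))\), which works out to exactly one when `mu` is zero and `sigma` is one. A Python sketch (illustrative only):

```python
import math

def logit(t):
    return math.log(t / (1 - t))

def logistic_pdf(x, mu=0.0, sigma=1.0):
    z = math.exp(-(x - mu) / sigma)
    return z / (sigma * (1 + z) ** 2)

def implied_density(theta, mu=0.0, sigma=1.0):
    # Change of variables: theta = inv_logit(x) with x ~ Logistic(mu, sigma),
    # so p(theta) = logistic_pdf(logit(theta)) * |d logit / d theta|.
    return logistic_pdf(logit(theta), mu, sigma) / (theta * (1 - theta))

# mu = 0, sigma = 1: the implied prior on theta is exactly flat.
for theta in (0.1, 0.25, 0.5, 0.9):
    assert abs(implied_density(theta) - 1.0) < 1e-12

# sigma > 1 pushes mass toward the edges (U shape);
# sigma < 1 pushes mass toward the middle (bell shape).
assert implied_density(0.05, sigma=2.0) > implied_density(0.5, sigma=2.0)
assert implied_density(0.5, sigma=0.5) > implied_density(0.05, sigma=0.5)
```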
When moving from a variable in \((0, 1)\) to a simplex, the same trick may be performed using the softmax function, which is a multinomial generalization of the inverse logit function. First, consider a simplex parameter with a Dirichlet prior.
```stan
parameters {
  simplex[K] theta;
  ...
model {
  theta ~ dirichlet(a);
  ...
  y ~ foo(theta);
```
Now `a` is a vector with `K` rows, but it has the same shape properties as the pair `a` and `b` for a beta; the beta distribution is just the distribution of the first component of a Dirichlet with parameter vector \([a \ b]^{\top}\). To formulate an unconstrained prior, the exact same strategy works as for the beta.
```stan
parameters {
  vector[K] theta_raw;
  ...
transformed parameters {
  simplex[K] theta = softmax(theta_raw);
  ...
model {
  theta_raw ~ multi_normal_cholesky(mu, L_Sigma);
```
The multivariate normal is used for convenience and efficiency with its Cholesky-factor parameterization. Now the mean is controlled by `softmax(mu)`, but we have additional control of covariance through `L_Sigma` at the expense of having on the order of \(K^2\) parameters in the prior rather than order \(K\). If no covariance is desired, the number of parameters can be reduced back to \(K\) using a vectorized normal distribution as follows.
```stan
theta_raw ~ normal(mu, sigma);
```
where either or both of `mu` and `sigma` can be vectors.
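As a final check, the softmax construction can be sketched in Python (illustrative only): the result is always a simplex, and for \(K = 2\) the first component reduces to the inverse logit of the difference, matching the scalar case above.

```python
import math

def softmax(v):
    m = max(v)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

theta = softmax([0.5, -1.0, 2.0])
assert all(t > 0 for t in theta)        # components are positive
assert abs(sum(theta) - 1.0) < 1e-12    # and sum to one: a simplex

# For K = 2, softmax reduces to the inverse logit of the difference:
# softmax([a, b])[0] = exp(a) / (exp(a) + exp(b)) = inv_logit(a - b).
a, b = 1.3, -0.4
assert abs(softmax([a, b])[0] - inv_logit(a - b)) < 1e-12
```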