## 21.2 Reparameterizations

Reparameterizations may be implemented directly using the transformed parameters block or just in the model block.

### Beta and Dirichlet Priors

The beta and Dirichlet distributions may both be reparameterized from a vector of counts to use a mean and total count.

#### Beta Distribution

For example, the Beta distribution is parameterized by two positive
count parameters \(\alpha, \beta > 0\). The following example
illustrates a hierarchical Stan model in which the elements of a
vector of parameters `theta` are drawn i.i.d. from a Beta distribution
whose parameters are themselves drawn from a hyperprior distribution.

```
parameters {
  real<lower = 0> alpha;
  real<lower = 0> beta;
  ...
model {
  alpha ~ ...
  beta ~ ...
  for (n in 1:N)
    theta[n] ~ beta(alpha, beta);
  ...
```

It is often more natural to specify hyperpriors in terms of transformed parameters. In the case of the Beta, the obvious choice for reparameterization is in terms of a mean parameter \[ \phi = \alpha / (\alpha + \beta) \] and total count parameter \[ \lambda = \alpha + \beta. \] Following Gelman et al. (2013, Chapter 5), the mean gets a uniform prior and the count parameter a Pareto prior with \(p(\lambda) \propto \lambda^{-2.5}\).

```
parameters {
  real<lower=0,upper=1> phi;
  real<lower=0.1> lambda;
  ...
transformed parameters {
  real<lower=0> alpha = lambda * phi;
  real<lower=0> beta = lambda * (1 - phi);
  ...
model {
  phi ~ beta(1, 1); // uniform on phi, could drop
  lambda ~ pareto(0.1, 1.5);
  for (n in 1:N)
    theta[n] ~ beta(alpha, beta);
  ...
```

The new parameters, `phi` and `lambda`, are declared in the
parameters block and the parameters for the Beta distribution,
`alpha` and `beta`, are declared and defined in the
transformed parameters block. If their values are not of interest,
they could instead be defined as local variables in the model block as
follows.

```
model {
  real alpha = lambda * phi;
  real beta = lambda * (1 - phi);
  ...
  for (n in 1:N)
    theta[n] ~ beta(alpha, beta);
  ...
}
```

With vectorization, this could be expressed more compactly and efficiently as follows.

```
model {
  theta ~ beta(lambda * phi, lambda * (1 - phi));
  ...
}
```

If the variables `alpha` and `beta` are of interest, they
can be defined in the transformed parameters block and then used in the
model.
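The fragments above can be assembled into a complete program. The following is a minimal sketch, not a definitive implementation: the data variables `trials` and `y` and the binomial likelihood are assumptions introduced only so that `theta` has something to explain, and the `array[N]` declarations assume a Stan version that supports that syntax.

```stan
data {
  int<lower=0> N;                    // number of groups
  array[N] int<lower=0> trials;      // hypothetical: trials per group
  array[N] int<lower=0> y;           // hypothetical: successes per group
}
parameters {
  real<lower=0, upper=1> phi;        // prior mean of theta
  real<lower=0.1> lambda;            // prior total count
  vector<lower=0, upper=1>[N] theta; // group-level success probabilities
}
transformed parameters {
  // Saved in the output because they are transformed parameters.
  real<lower=0> alpha = lambda * phi;
  real<lower=0> beta = lambda * (1 - phi);
}
model {
  phi ~ beta(1, 1);                  // uniform on phi, could drop
  lambda ~ pareto(0.1, 1.5);         // density proportional to lambda^(-2.5)
  theta ~ beta(alpha, beta);         // vectorized over all N elements
  y ~ binomial(trials, theta);       // assumed likelihood, for illustration
}
```

Moving the two definitions of `alpha` and `beta` into the model block as local variables leaves the posterior unchanged and only drops them from the output.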

#### Jacobians not Necessary

Because the transformed parameters are being used as arguments to a
distribution, rather than being given a distribution themselves, there
is no need to apply a Jacobian adjustment for the transform. For
example, in the beta distribution example, `alpha` and `beta`
have the correct posterior distribution.

#### Dirichlet Priors

The same thing can be done with a Dirichlet, replacing the mean for the Beta, which is a probability value, with a simplex. Assume there are \(K > 0\) dimensions being considered (\(K=1\) is trivial and \(K=2\) reduces to the beta distribution case). The traditional prior is

```
parameters {
  vector<lower=0>[K] alpha;
  simplex[K] theta[N];
  ...
model {
  alpha ~ ...;
  for (n in 1:N)
    theta[n] ~ dirichlet(alpha);
}
```

This provides essentially \(K\) degrees of freedom, one for each
dimension of `alpha`, and it is not obvious how to specify a
reasonable prior for `alpha`.

An alternative coding is to use the mean, which is a simplex, and a total count.

```
parameters {
  simplex[K] phi;
  real<lower=0> kappa;
  simplex[K] theta[N];
  ...
transformed parameters {
  vector[K] alpha = kappa * phi;
  ...
}
model {
  phi ~ ...;
  kappa ~ ...;
  for (n in 1:N)
    theta[n] ~ dirichlet(alpha);
```

Now it is much easier to formulate priors, because `phi` is the
expected value of `theta` and `kappa` (minus `K`) is
the strength of the prior mean measured in number of prior observations.
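As a short check of why `phi` is the prior mean, recall that a Dirichlet with parameter vector \(\alpha\) has mean \(\alpha_k / \sum_{j} \alpha_j\) in component \(k\). Writing \(\kappa\) for `kappa` and \(\phi\) for `phi`, and using the fact that the simplex \(\phi\) sums to one,

\[ \sum_{j=1}^K \alpha_j = \kappa \sum_{j=1}^K \phi_j = \kappa \]

and therefore

\[ \mathbb{E}[\theta_k] = \frac{\alpha_k}{\sum_{j=1}^K \alpha_j} = \frac{\kappa \, \phi_k}{\kappa} = \phi_k. \]

So `kappa` controls only the total count and leaves the mean untouched, which is what allows the two to be given separate priors.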

### Transforming Unconstrained Priors: Probit and Logit

If the variable \(u\) has a \(\mathsf{Uniform}(0, 1)\) distribution, then \(\mbox{logit}(u)\) is distributed as \(\mathsf{Logistic}(0, 1)\). This is because inverse logit is the cumulative distribution function (CDF) for the logistic distribution, so that the logit function itself is the inverse CDF and thus maps a uniform draw in \((0, 1)\) to a logistically-distributed quantity.

Things work the same way for the probit case: if \(u\) has a \(\mathsf{Uniform}(0, 1)\) distribution, then \(\Phi^{-1}(u)\) has a \(\mathsf{normal}(0, 1)\) distribution. The other way around, if \(v\) has a \(\mathsf{normal}(0, 1)\) distribution, then \(\Phi(v)\) has a \(\mathsf{Uniform}(0, 1)\) distribution.
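These relationships can be checked by simulation. The following sketch has no parameters or model block and uses only random number generators in generated quantities, so it is meant to be run with the fixed-parameter sampler; the variable names are illustrative and not part of the original text.

```stan
generated quantities {
  real u = uniform_rng(0, 1);   // u is Uniform(0, 1)
  real x_logit = logit(u);      // draws follow Logistic(0, 1)
  real x_probit = inv_Phi(u);   // draws follow normal(0, 1)

  real v = normal_rng(0, 1);    // v is normal(0, 1)
  real u_back = Phi(v);         // draws follow Uniform(0, 1)
}
```

Comparing histograms of `x_logit` with draws from `logistic_rng(0, 1)`, and of `x_probit` with draws from `normal_rng(0, 1)`, should show matching distributions up to Monte Carlo error.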

In order to use the probit and logistic as priors on variables
constrained to \((0, 1)\), create an unconstrained variable and
transform it appropriately. For comparison, the following Stan
program fragment declares a \((0, 1)\)-constrained parameter
`theta` and gives it a beta prior, then uses it as a parameter in
a distribution (here using `foo` as a placeholder).

```
parameters {
  real<lower = 0, upper = 1> theta;
  ...
model {
  theta ~ beta(a, b);
  ...
  y ~ foo(theta);
  ...
```

If the variables `a` and `b` are one, then this imposes
a uniform distribution on `theta`. If `a` and `b` are
both less than one, then the density on `theta` has a U shape,
whereas if they are both greater than one, the density of `theta`
has an inverted-U or more bell-like shape.

Roughly the same result can be achieved with unbounded parameters that are probit or inverse-logit-transformed. For example,

```
parameters {
  real theta_raw;
  ...
transformed parameters {
  real<lower = 0, upper = 1> theta = inv_logit(theta_raw);
  ...
model {
  theta_raw ~ logistic(mu, sigma);
  ...
  y ~ foo(theta);
  ...
```

In this model, an unconstrained parameter `theta_raw` gets a
logistic prior, and then the transformed parameter `theta` is
defined to be the inverse logit of `theta_raw`. In this
parameterization, `inv_logit(mu)` is the median of the implied
prior on `theta` (and also its mean when `mu` is zero, by
symmetry). The prior distribution on `theta` will be
flat if `sigma` is one and `mu` is zero, and will be
U-shaped if `sigma` is larger than one and bell shaped if
`sigma` is less than one.
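The flat case can be verified with a change of variables. Writing \(\theta\) for `theta`, the \(\mathsf{Logistic}(0, 1)\) density at \(x = \mbox{logit}(\theta)\) is

\[ \frac{e^{-x}}{(1 + e^{-x})^2} = \mbox{logit}^{-1}(x) \, \bigl(1 - \mbox{logit}^{-1}(x)\bigr) = \theta \, (1 - \theta), \]

and the derivative of the transform is

\[ \frac{d}{d\theta} \mbox{logit}(\theta) = \frac{1}{\theta \, (1 - \theta)}. \]

Multiplying the two gives the implied density on \(\theta\),

\[ p(\theta) = \theta \, (1 - \theta) \cdot \frac{1}{\theta \, (1 - \theta)} = 1, \]

which is the \(\mathsf{Uniform}(0, 1)\) density.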

When moving from a variable in \((0, 1)\) to a simplex, the same trick may be performed using the softmax function, which is a multinomial generalization of the inverse logit function. First, consider a simplex parameter with a Dirichlet prior.

```
parameters {
  simplex[K] theta;
  ...
model {
  theta ~ dirichlet(a);
  ...
  y ~ foo(theta);
```

Now `a` is a vector with `K` rows, but it has the same shape
properties as the pair `a` and `b` for a beta; the beta
distribution is just the distribution of the first component of a
Dirichlet with parameter vector \([a \ b]^{\top}\). To formulate an
unconstrained prior, the exact same strategy works as for the beta.

```
parameters {
  vector[K] theta_raw;
  ...
transformed parameters {
  simplex[K] theta = softmax(theta_raw);
  ...
model {
  theta_raw ~ multi_normal_cholesky(mu, L_Sigma);
```

The multivariate normal is used for convenience and efficiency with
its Cholesky-factor parameterization. Now the mean is controlled by
`softmax(mu)`, but we have additional control of covariance
through `L_Sigma` at the expense of having on the order of \(K^2\)
parameters in the prior rather than order \(K\). If no covariance is
desired, the number of parameters can be reduced back to \(K\) using a
vectorized normal distribution as follows.

`theta_raw ~ normal(mu, sigma);`

where either or both of `mu` and `sigma` can be vectors.
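Putting the independent-normal version together gives a program along the following lines. This is a sketch under stated assumptions: `mu` and `sigma` are supplied as data, the count vector `y` is hypothetical, and a multinomial likelihood stands in for the placeholder `foo`. The `array[K]` declaration assumes a Stan version that supports that syntax.

```stan
data {
  int<lower=2> K;
  vector[K] mu;                 // assumed: prior locations passed as data
  vector<lower=0>[K] sigma;     // assumed: prior scales passed as data
  array[K] int<lower=0> y;      // hypothetical: observed category counts
}
parameters {
  vector[K] theta_raw;          // unconstrained
}
transformed parameters {
  simplex[K] theta = softmax(theta_raw);
}
model {
  theta_raw ~ normal(mu, sigma);  // prior on the unconstrained scale
  y ~ multinomial(theta);         // stands in for foo(theta)
}
```

As in the scalar case, the prior is placed on `theta_raw` and `theta` is only used, so no Jacobian adjustment is required.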