8.2 Statistical Variable Taxonomy
Statistical Variable Taxonomy Table. Variables of the kind indicated in the left column must be declared in one of the blocks listed in the right column.
variable kind | declaration block |
---|---|
constants | data, transformed data |
unmodeled data | data, transformed data |
modeled data | data, transformed data |
missing data | parameters, transformed parameters |
modeled parameters | parameters, transformed parameters |
unmodeled parameters | data, transformed data |
derived quantities | transformed data, transformed parameters, generated quantities |
loop indices | loop statement |
Page 366 of Gelman and Hill (2007) provides a taxonomy of the kinds of variables used in Bayesian models. The table above reproduces Gelman and Hill’s taxonomy, adds a missing-data kind, and lists the Stan blocks in which each kind of variable is declared and defined.
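The missing-data row of the table can be illustrated with a minimal sketch. The variable names (N_obs, N_mis, y_obs, y_mis, mu, sigma) are hypothetical, and the sketch assumes the array[N] declaration syntax of recent Stan releases. Observed values are declared as data, while missing values are declared as parameters and given the same distribution.

```stan
data {
  int<lower=0> N_obs;          // unmodeled data: number of observed values
  int<lower=0> N_mis;          // unmodeled data: number of missing values
  array[N_obs] real y_obs;     // modeled data: observed values
}
parameters {
  real mu;                     // modeled parameter
  real<lower=0> sigma;         // modeled parameter
  array[N_mis] real y_mis;     // missing data, declared as parameters
}
model {
  y_obs ~ normal(mu, sigma);   // observed values inform mu and sigma
  y_mis ~ normal(mu, sigma);   // missing values are estimated like parameters
}
```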
Constants can be built into a model as literals, data variables, or transformed data variables. If specified as data variables, their values must be included in data files. If specified as transformed data variables, they cannot be used to specify the sizes of elements in the data block, because the data block is read before the transformed data block is executed.
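The three ways of specifying a constant can be sketched as follows. The names M, sigma, zeros, and mu are hypothetical, and the sketch assumes the array[N] declaration syntax of recent Stan releases.

```stan
data {
  int<lower=0> N;              // constant read from the data file
  array[N] real y;             // N may size y because N is a data variable
}
transformed data {
  real<lower=0> sigma = 2.0;   // constant fixed in the program itself
  int<lower=0> M = 2 * N;      // constant computed after the data are read
  vector[M] zeros = rep_vector(0, M);  // M may size transformed data
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 10);          // 0 and 10 are literal constants
  y ~ normal(mu, sigma);
}
```

Here M could not be used to size y, since M does not yet exist when the data block is read.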
The following program illustrates the various kinds of variables, listing the kind of each variable next to its declaration.
```stan
data {
  int<lower=0> N;              // unmodeled data
  array[N] real y;             // modeled data
  real mu_mu;                  // config. unmodeled param
  real<lower=0> sigma_mu;      // config. unmodeled param
}
transformed data {
  real<lower=0> alpha;         // const. unmodeled param
  real<lower=0> beta;          // const. unmodeled param
  alpha = 0.1;
  beta = 0.1;
}
parameters {
  real mu_y;                   // modeled param
  real<lower=0> tau_y;         // modeled param
}
transformed parameters {
  real<lower=0> sigma_y;       // derived quantity (param)
  sigma_y = pow(tau_y, -0.5);
}
model {
  tau_y ~ gamma(alpha, beta);
  mu_y ~ normal(mu_mu, sigma_mu);
  for (n in 1:N) {
    y[n] ~ normal(mu_y, sigma_y);
  }
}
generated quantities {
  real variance_y;             // derived quantity (transform)
  variance_y = sigma_y * sigma_y;
}
```
In this example, y is a modeled data array of size N. Although it is declared in the data block, and thus must have a known value before the program is run, it is modeled as if it were generated randomly as described by the model.
The variable N is a typical example of unmodeled data. It is used to indicate a size that is not part of the model itself.
The other variables declared in the data and transformed data blocks are examples of unmodeled parameters, also known as hyperparameters. Unmodeled parameters are parameters of probability densities that are not themselves modeled probabilistically. In Stan, unmodeled parameters that appear in the data block can be set separately for each run of the model as part of the data that is read. In the model above, mu_mu and sigma_mu are configurable unmodeled parameters.
Unmodeled parameters that are hard coded in the model must be declared in the transformed data block. For example, the unmodeled parameters alpha and beta are both hard coded to the value 0.1. To make such variables configurable from data supplied to the program at run time, they must instead be declared in the data block, like the variables mu_mu and sigma_mu.
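As a sketch of that change, the program could be rearranged so that alpha and beta are read with the rest of the data instead of being fixed at 0.1. This variant is an illustration, not part of the original program. It assumes the array[N] declaration syntax of recent Stan releases, and it drops the transformed parameters and generated quantities blocks for brevity.

```stan
data {
  int<lower=0> N;              // unmodeled data
  array[N] real y;             // modeled data
  real mu_mu;                  // config. unmodeled param
  real<lower=0> sigma_mu;      // config. unmodeled param
  real<lower=0> alpha;         // now configurable: read from the data file
  real<lower=0> beta;          // now configurable: read from the data file
}
parameters {
  real mu_y;                   // modeled param
  real<lower=0> tau_y;         // modeled param
}
model {
  tau_y ~ gamma(alpha, beta);
  mu_y ~ normal(mu_mu, sigma_mu);
  y ~ normal(mu_y, pow(tau_y, -0.5));  // precision converted to scale
}
```

With this version, every data file must supply values for alpha and beta.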
This program declares two modeled parameters, mu_y and tau_y. These are the location and precision used in the normal model of the values in y. The heart of the model is sampling the values of these parameters from their posterior distribution.
The modeled parameter tau_y is transformed from a precision to a scale parameter and assigned to the variable sigma_y in the transformed parameters block. Thus the variable sigma_y is considered a derived quantity: its value is entirely determined by the values of other variables.
The generated quantities block defines a value variance_y, which is a transform of the scale (deviation) parameter sigma_y. It is defined in the generated quantities block because it is not used in the model. Making it a generated quantity allows it to be monitored for convergence; because it is a non-linear transform, it will have different autocorrelation, and hence different convergence properties, than the deviation itself.
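The relationships among the precision, scale, and variance in the program can be written out as follows, where $\tau_y$ and $\sigma_y$ stand for the program variables tau_y and sigma_y (notation introduced here for illustration):

```latex
\[
\sigma_y = \tau_y^{-1/2},
\qquad
\mathrm{variance}_y = \sigma_y^{2} = \frac{1}{\tau_y}.
\]
```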
The generated quantities block can also be used to generate replicated data for model checking, using the random number generators that Stan provides for its distributions.
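A minimal sketch of such replicated data, assuming a Stan release that provides normal_rng and the array[N] declaration syntax, might look like the following. The name y_rep is hypothetical, and alpha and beta are written as the literal 0.1 to keep the sketch short.

```stan
data {
  int<lower=0> N;
  array[N] real y;
  real mu_mu;
  real<lower=0> sigma_mu;
}
parameters {
  real mu_y;
  real<lower=0> tau_y;
}
transformed parameters {
  real<lower=0> sigma_y = pow(tau_y, -0.5);
}
model {
  tau_y ~ gamma(0.1, 0.1);
  mu_y ~ normal(mu_mu, sigma_mu);
  y ~ normal(mu_y, sigma_y);
}
generated quantities {
  array[N] real y_rep;         // replicated data, one draw per observation
  for (n in 1:N) {
    y_rep[n] = normal_rng(mu_y, sigma_y);
  }
}
```

Each posterior draw then carries a simulated data set y_rep that can be compared with the observed y.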
Finally, the variable n is used as a loop index in the model block.
References
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.