8.2 Statistical Variable Taxonomy
Statistical Variable Taxonomy Table. Variables of the kind indicated in the left column must be declared in one of the blocks listed in the right column.
variable kind | declaration block |
---|---|
constants | data, transformed data |
unmodeled data | data, transformed data |
modeled data | data, transformed data |
missing data | parameters, transformed parameters |
modeled parameters | parameters, transformed parameters |
unmodeled parameters | data, transformed data |
derived quantities | transformed data, transformed parameters, generated quantities |
loop indices | loop statement |
Page 366 of Gelman and Hill (2007) provides a taxonomy of the kinds of variables used in Bayesian models. The table above contains Gelman and Hill's taxonomy, extended with a missing-data kind, together with the corresponding locations of declarations and definitions in Stan.
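The missing-data row of the table can be illustrated with a minimal sketch in which the unobserved values are declared as parameters and given the same sampling distribution as the observed values. The names N_obs, N_mis, y_obs, y_mis, mu, and sigma are hypothetical and are not part of the program shown later in this section. Priors are omitted for brevity, which leaves them implicitly flat, and this is only one way such a model might be arranged.

```stan
data {
  int<lower=0> N_obs;        // unmodeled data: number of observed values
  int<lower=0> N_mis;        // unmodeled data: number of missing values
  vector[N_obs] y_obs;       // modeled data: the observed values
}
parameters {
  real mu;                   // modeled parameter
  real<lower=0> sigma;       // modeled parameter
  vector[N_mis] y_mis;       // missing data, declared as parameters
}
model {
  y_obs ~ normal(mu, sigma);
  y_mis ~ normal(mu, sigma); // missing values share the sampling distribution
}
```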
Constants can be built into a model as literals, data variables, or transformed data variables. If they are specified as data variables, their values must be included in the data files. If they are specified as transformed data variables, they cannot be used to specify the sizes of elements in the data block.
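The three ways of writing a constant can be sketched side by side. In this hypothetical fragment, N is a constant read from the data file, prior_scale is a constant defined in the transformed data block, and 0 and 1.0 are literals. The variable names are assumptions made for illustration. The restriction on sizes follows from the order of evaluation: the data block is read before the transformed data block runs, so nothing defined in transformed data is available when data declarations are sized.

```stan
data {
  int<lower=0> N;            // constant read from the data file
  vector[N] y;               // sized by N, which is declared earlier in data
}
transformed data {
  real<lower=0> prior_scale; // constant defined in the program itself
  prior_scale = 10;          // defined too late to size anything in data
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, prior_scale); // 0 is a literal, prior_scale a named constant
  y ~ normal(mu, 1.0);         // 1.0 is a literal
}
```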
The following program illustrates the various kinds of variables, listing the kind of each variable next to its declaration.
```stan
data {
  int<lower=0> N;           // unmodeled data
  real y[N];                // modeled data
  real mu_mu;               // config. unmodeled param
  real<lower=0> sigma_mu;   // config. unmodeled param
}
transformed data {
  real<lower=0> alpha;      // const. unmodeled param
  real<lower=0> beta;       // const. unmodeled param
  alpha = 0.1;
  beta = 0.1;
}
parameters {
  real mu_y;                // modeled param
  real<lower=0> tau_y;      // modeled param
}
transformed parameters {
  real<lower=0> sigma_y;    // derived quantity (param)
  sigma_y = pow(tau_y, -0.5);
}
model {
  tau_y ~ gamma(alpha, beta);
  mu_y ~ normal(mu_mu, sigma_mu);
  for (n in 1:N)
    y[n] ~ normal(mu_y, sigma_y);
}
generated quantities {
  real variance_y;          // derived quantity (transform)
  variance_y = sigma_y * sigma_y;
}
```
In this example, y is an array of modeled data with N elements. Although it is declared in the data block, and thus must have a known value before the program may be run, it is modeled as if it were generated randomly as described by the model.
The variable N is a typical example of unmodeled data. It is used to indicate a size that is not part of the model itself.
The other variables declared in the data and transformed data blocks are examples of unmodeled parameters, also known as hyperparameters. Unmodeled parameters are parameters to probability densities that are not themselves modeled probabilistically. In Stan, unmodeled parameters that appear in the data block may be specified on a per-execution basis as part of the data that is read. In the above model, mu_mu and sigma_mu are configurable unmodeled parameters.
Unmodeled parameters that are hard coded in the model must be declared in the transformed data block. For example, the unmodeled parameters alpha and beta are both hard coded to the value 0.1. To allow such variables to be configurable based on data supplied to the program at run time, they must be declared in the data block, like the variables mu_mu and sigma_mu.
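As a sketch of that change, the variant below moves alpha and beta from the transformed data block into the data block, so their values are read with the rest of the data instead of being fixed at 0.1. It is an illustrative rearrangement of the program above, not a recommended parameterization. For brevity it declares y as a vector and inlines the precision-to-scale transform, both of which are assumptions that differ from the original program.

```stan
data {
  int<lower=0> N;           // unmodeled data
  vector[N] y;              // modeled data
  real mu_mu;               // config. unmodeled param
  real<lower=0> sigma_mu;   // config. unmodeled param
  real<lower=0> alpha;      // now configurable: read from the data file
  real<lower=0> beta;       // now configurable: read from the data file
}
parameters {
  real mu_y;                // modeled param
  real<lower=0> tau_y;      // modeled param
}
model {
  tau_y ~ gamma(alpha, beta);
  mu_y ~ normal(mu_mu, sigma_mu);
  y ~ normal(mu_y, pow(tau_y, -0.5));
}
```

With this layout, a data file that omits alpha or beta would cause the data read to fail, which is the cost of making them configurable.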
This program declares two modeled parameters, mu_y and tau_y. These are the location and precision used in the normal model of the values in y. The heart of the model will be sampling the values of these parameters from their posterior distribution.
The modeled parameter tau_y is transformed from a precision to a scale parameter and assigned to the variable sigma_y in the transformed parameters block. Thus the variable sigma_y is considered a derived quantity: its value is entirely determined by the values of other variables.
The generated quantities block defines a value variance_y, which is defined as a transform of the scale or deviation parameter sigma_y. It is defined in the generated quantities block because it is not used in the model. Making it a generated quantity allows it to be monitored for convergence (being a non-linear transform, it will have different autocorrelation and hence convergence properties than the deviation itself).
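The relationship among the precision, the scale, and the variance can be written out explicitly. This only restates the two assignments in the program, with the symbols standing for the program variables tau_y, sigma_y, and variance_y:

```latex
\[
\sigma_y = \tau_y^{-1/2},
\qquad
\mathrm{variance}_y = \sigma_y^{2} = \frac{1}{\tau_y}.
\]
```

Both derived quantities are therefore deterministic, non-linear functions of the single modeled parameter tau_y.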
The generated quantities block can also be used, together with Stan's random number generators for the distributions, to generate replicated data for model checking.
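A minimal sketch of replicated data, assuming a version of Stan that provides the normal_rng function, might look like the following. The variable name y_rep is hypothetical. The model is reduced to a location and scale with implicitly flat priors, and y is declared as a vector, so this is a simplified stand-in for the program above and not a copy of it.

```stan
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu_y;
  real<lower=0> sigma_y;
}
model {
  y ~ normal(mu_y, sigma_y);
}
generated quantities {
  vector[N] y_rep;          // one replicated data set per posterior draw
  for (n in 1:N)
    y_rep[n] = normal_rng(mu_y, sigma_y);
}
```

Each posterior draw then carries its own y_rep, which can be compared against the observed y outside of Stan.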
Finally, the variable n is used as a loop index in the model block.
References
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.