25.12 Standardizing predictors and outputs
Stan programs will run faster if the input is standardized to have a zero sample mean and unit sample variance. This section illustrates the principle with a simple linear regression.
Suppose that $y = (y_1, \ldots, y_N)$ is a sequence of $N$ outcomes and $x = (x_1, \ldots, x_N)$ a parallel sequence of $N$ predictors. A simple linear regression involving an intercept coefficient $\alpha$ and slope coefficient $\beta$ can be expressed as $y_n = \alpha + \beta x_n + \epsilon_n$, where $\epsilon_n \sim \mathrm{normal}(0, \sigma)$.
If either vector $x$ or $y$ has very large or very small values, or if the sample mean of the values is far away from 0 (on the scale of the values), then it can be more efficient to standardize the outputs $y_n$ and predictors $x_n$. The data are first centered by subtracting the sample mean, and then scaled by dividing by the sample standard deviation. Thus a data point $u$ is standardized with respect to a vector $y$ by the function $z_y$, defined by
$$z_y(u) = \frac{u - \bar{y}}{\mathrm{sd}(y)},$$
where the sample mean of $y$ is
$$\bar{y} = \frac{1}{N} \sum_{n=1}^{N} y_n,$$
and the sample standard deviation of $y$ is
$$\mathrm{sd}(y) = \left( \frac{1}{N} \sum_{n=1}^{N} (y_n - \bar{y})^2 \right)^{1/2}.$$
The inverse transform is defined by reversing the two normalization steps, first rescaling by the same standard deviation and then relocating by the sample mean,
$$z_y^{-1}(v) = \mathrm{sd}(y) \, v + \bar{y}.$$
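As a small worked illustration of these definitions, take the hypothetical three-point vector $y = (2, 4, 6)$ (an assumed example, not data from this section) and standardize the point $u = 6$, then map it back:

```latex
\begin{align*}
\bar{y} &= \tfrac{1}{3}(2 + 4 + 6) = 4, \\
\mathrm{sd}(y) &= \left( \tfrac{1}{3}\left( (-2)^2 + 0^2 + 2^2 \right) \right)^{1/2}
               = \sqrt{8/3} \approx 1.633, \\
z_y(6) &= \frac{6 - 4}{\sqrt{8/3}} = \sqrt{3/2} \approx 1.225, \\
z_y^{-1}\!\left(\sqrt{3/2}\right) &= \sqrt{8/3} \, \sqrt{3/2} + 4 = 2 + 4 = 6.
\end{align*}
```

Note that Stan's built-in `sd()` function divides by $N - 1$ rather than $N$, so it would return 2 for this vector. The round trip is unaffected, provided the same mean and standard deviation are used in both directions.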
To standardize a regression problem, the predictors and outcomes are standardized. This changes the scale of the variables, and hence changes the scale of the priors. Consider the following initial model.
    data {
      int<lower=0> N;
      vector[N] y;
      vector[N] x;
    }
    parameters {
      real alpha;
      real beta;
      real<lower=0> sigma;
    }
    model {
      // priors
      alpha ~ normal(0, 10);
      beta ~ normal(0, 10);
      sigma ~ cauchy(0, 5);
      // likelihood
      for (n in 1:N) {
        y[n] ~ normal(alpha + beta * x[n], sigma);
      }
    }
The data block for the standardized model is identical. The standardized predictors and outputs are defined in the transformed data block.
    data {
      int<lower=0> N;
      vector[N] y;
      vector[N] x;
    }
    transformed data {
      vector[N] x_std;
      vector[N] y_std;
      x_std = (x - mean(x)) / sd(x);
      y_std = (y - mean(y)) / sd(y);
    }
    parameters {
      real alpha_std;
      real beta_std;
      real<lower=0> sigma_std;
    }
    model {
      alpha_std ~ normal(0, 10);
      beta_std ~ normal(0, 10);
      sigma_std ~ cauchy(0, 5);
      for (n in 1:N) {
        y_std[n] ~ normal(alpha_std + beta_std * x_std[n],
                          sigma_std);
      }
    }
The parameters are renamed to indicate that they aren't the "natural" parameters, but the model is otherwise identical. In particular, the fairly diffuse priors on the coefficients and error scale are the same. These could have been transformed as well, but here they are left as is, because the scales make sense as diffuse priors for standardized data; the priors could be made more informative. For instance, because the outputs y have been standardized, the error scale sigma_std should not be greater than 1, because that is the scale of the noise when the coefficients alpha_std and beta_std are both 0.
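The original regression $y_n = \alpha + \beta x_n + \epsilon_n$ has been transformed to a regression on the standardized variables,
$$z_y(y_n) = \alpha' + \beta' z_x(x_n) + \epsilon'_n.$$
The original parameters can be recovered with a little algebra,
$$\begin{aligned}
y_n &= z_y^{-1}(z_y(y_n)) \\
    &= z_y^{-1}\left( \alpha' + \beta' z_x(x_n) + \epsilon'_n \right) \\
    &= z_y^{-1}\left( \alpha' + \beta' \left( \frac{x_n - \bar{x}}{\mathrm{sd}(x)} \right) + \epsilon'_n \right) \\
    &= \mathrm{sd}(y) \left( \alpha' + \beta' \left( \frac{x_n - \bar{x}}{\mathrm{sd}(x)} \right) + \epsilon'_n \right) + \bar{y} \\
    &= \left( \mathrm{sd}(y) \left( \alpha' - \beta' \frac{\bar{x}}{\mathrm{sd}(x)} \right) + \bar{y} \right)
       + \left( \beta' \frac{\mathrm{sd}(y)}{\mathrm{sd}(x)} \right) x_n
       + \mathrm{sd}(y) \, \epsilon'_n,
\end{aligned}$$
from which the original scale parameter values can be read off,
$$\alpha = \mathrm{sd}(y) \left( \alpha' - \beta' \frac{\bar{x}}{\mathrm{sd}(x)} \right) + \bar{y}; \qquad
\beta = \beta' \frac{\mathrm{sd}(y)}{\mathrm{sd}(x)}; \qquad
\sigma = \mathrm{sd}(y) \, \sigma'.$$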
These recovered parameter values on the original scales can be calculated within Stan using a generated quantities block following the model block,
    generated quantities {
      real alpha;
      real beta;
      real<lower=0> sigma;
      alpha = sd(y) * (alpha_std - beta_std * mean(x) / sd(x))
              + mean(y);
      beta = beta_std * sd(y) / sd(x);
      sigma = sd(y) * sigma_std;
    }
It is inefficient to compute all of the means and standard deviations every iteration; for more efficiency, these can be calculated once and stored as transformed data. Furthermore, the model sampling statement can be easily vectorized, for instance, in the transformed model, to
y_std ~ normal(alpha_std + beta_std * x_std, sigma_std);
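Putting these two suggestions together, the following is a minimal sketch of the full standardized program, with the means and standard deviations stored once as transformed data and the likelihood vectorized. The names `mean_x`, `sd_x`, `mean_y`, and `sd_y` are introduced here for illustration and do not appear in the text above. The prior scales are diffuse placeholders and should be chosen to suit the problem.

```stan
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
}
transformed data {
  // Illustrative names: summary statistics computed once, when the data are read
  real mean_x = mean(x);
  real sd_x = sd(x);
  real mean_y = mean(y);
  real sd_y = sd(y);
  // Standardized predictors and outcomes
  vector[N] x_std = (x - mean_x) / sd_x;
  vector[N] y_std = (y - mean_y) / sd_y;
}
parameters {
  real alpha_std;
  real beta_std;
  real<lower=0> sigma_std;
}
model {
  // Diffuse priors on the standardized scale (placeholder values)
  alpha_std ~ normal(0, 10);
  beta_std ~ normal(0, 10);
  sigma_std ~ cauchy(0, 5);
  // Vectorized likelihood
  y_std ~ normal(alpha_std + beta_std * x_std, sigma_std);
}
generated quantities {
  // Recover the parameters on the original scale from the stored statistics
  real alpha = sd_y * (alpha_std - beta_std * mean_x / sd_x) + mean_y;
  real beta = beta_std * sd_y / sd_x;
  real<lower=0> sigma = sd_y * sigma_std;
}
```

Because the transformed data block is evaluated only once, the generated quantities block now performs only a few scalar operations per iteration instead of recomputing `mean()` and `sd()` over the full data vectors.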
Standard normal distribution
For many applications on the standard scale, normal distributions with location zero and scale one will be used. In these cases, it is more efficient to use
y ~ std_normal();
than to use
y ~ normal(0, 1);
because the subtraction of the location and division by the scale cancel, as does subtracting the log of the scale.
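The cancellation can be seen by writing out the two log densities. This is the standard normal log density formula, not a description of Stan's internal implementation:

```latex
\begin{align*}
\log \mathrm{normal}(y \mid \mu, \sigma)
  &= -\tfrac{1}{2} \log(2\pi) - \log \sigma
     - \tfrac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^{2}, \\
\log \mathrm{normal}(y \mid 0, 1)
  &= -\tfrac{1}{2} \log(2\pi) - \tfrac{1}{2} y^{2}.
\end{align*}
```

With $\mu = 0$ and $\sigma = 1$, the subtraction $y - \mu$, the division by $\sigma$, and the term $\log \sigma = 0$ all drop out, leaving only the square of $y$ to evaluate.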