22.12 Standardizing Predictors and Outputs

This is an old version, view current version.

22.12 Standardizing Predictors and Outputs

Stan programs will run faster if the input is standardized to have a zero sample mean and unit sample variance. This section illustrates the principle with a simple linear regression.

Suppose that \(y = (y_1,\dotsc,y_N)\) is a sequence of \(N\) outcomes and \(x = (x_1,\dotsc,x_N)\) a parallel sequence of \(N\) predictors. A simple linear regression involving an intercept coefficient \(\alpha\) and slope coefficient \(\beta\) can be expressed as \[ y_n = \alpha + \beta x_n + \epsilon_n, \] where \[ \epsilon_n \sim \textsf{normal}(0,\sigma). \]

If either vector \(x\) or \(y\) has very large or very small values or if the sample mean of the values is far away from 0 (on the scale of the values), then it can be more efficient to standardize the outputs \(y_n\) and predictors \(x_n\). The data are first centered by subtracting the sample mean, and then scaled by dividing by the sample deviation. Thus a data point \(u\) is standardized with respect to a vector \(y\) by the function \(\textsf{z}_y\), defined by \[ \textsf{z}_y(u) = \frac{u - \bar{y}}{\texttt{sd}(y)} \] where the sample mean of \(y\) is \[ \bar{y} = \frac{1}{N} \sum_{n=1}^N y_n, \] and the sample standard deviation of \(y\) is \[ \texttt{sd}(y) = \left( \frac{1}{N} \sum_{n=1}^N (y_n - \bar{y})^2 \right)^{1/2}. \] The inverse transform is defined by reversing the two normalization steps, first rescaling by the same deviation and relocating by the sample mean, \[ \textrm{z}_y^{-1}(v) = \texttt{sd}(y) v + \bar{y}. \]

To standardize a regression problem, the predictors and outcomes are standardized. This changes the scale of the variables, and hence changes the scale of the priors. Consider the following initial model.

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  // priors
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  // likelihood
  for (n in 1:N)
    y[n] ~ normal(alpha + beta * x[n], sigma);
}

The data block for the standardized model is identical. The standardized predictors and outputs are defined in the transformed data block.

data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
}
transformed data {
  vector[N] x_std;
  vector[N] y_std;
  x_std = (x - mean(x)) / sd(x);
  y_std = (y - mean(y)) / sd(y);
}
parameters {
  real alpha_std;
  real beta_std;
  real<lower=0> sigma_std;
}
model {
  alpha_std ~ normal(0, 10);
  beta_std ~ normal(0, 10);
  sigma_std ~ cauchy(0, 5);
  for (n in 1:N)
    y_std[n] ~ normal(alpha_std + beta_std * x_std[n],
                      sigma_std);
}

The parameters are renamed to indicate that they aren’t the “natural” parameters, but the model is otherwise identical. In particular, the fairly diffuse priors on the coefficients and error scale are the same. These could have been transformed as well, but here they are left as is, because the scales make sense as diffuse priors for standardized data; the priors could be made more informative. For instance, because the outputs \(y\) have been standardized, the error \(\sigma\) should not be greater than 1, because that’s the scale of the noise for predictors \(\alpha = \beta = 0\).

The original regression \[ y_n = \alpha + \beta x_n + \epsilon_n \] has been transformed to a regression on the standardized variables, \[ \textrm{z}_y(y_n) = \alpha' + \beta' \textrm{z}_x(x_n) + \epsilon'_n. \] The original parameters can be recovered with a little algebra, \[\begin{align*} y_n &= \textrm{z}_y^{-1}(\textrm{z}_y(y_n)) \\ &= \textrm{z}_y^{-1} \left( \alpha' + \beta' \textrm{z}_x(x_n) + \epsilon_n' \right) \\ &= \textrm{z}_y^{-1} \left( \alpha' + \beta' \left( \frac{x_n - \bar{x}}{\texttt{sd}(x)} \right) + \epsilon_n' \right) \\ &= \texttt{sd}(y) \left( \alpha' + \beta' \left( \frac{x_n - \bar{x}}{\texttt{sd}(x)} \right) + \epsilon_n' \right) + \bar{y} \\ &= \left( \texttt{sd}(y) \left( \alpha' - \beta' \frac{\bar{x}}{\texttt{sd}(x)} \right) + \bar{y} \right) + \left( \beta' \frac{\texttt{sd}(y)}{\texttt{sd}(x)} \right) x_n + \texttt{sd}(y) \epsilon'_n, \end{align*}\] from which the original scale parameter values can be read off, \[ \alpha = \texttt{sd}(y) \left( \alpha' - \beta' \frac{\bar{x}}{\texttt{sd}(x)} \right) + \bar{y}; \qquad \beta = \beta' \frac{\texttt{sd}(y)}{\texttt{sd}(x)}; \qquad \sigma = \texttt{sd}(y) \sigma'. \]

These recovered parameter values on the original scales can be calculated within Stan using a generated quantities block following the model block,

generated quantities {
  real alpha;
  real beta;
  real<lower=0> sigma;
  alpha = sd(y) * (alpha_std - beta_std * mean(x) / sd(x))
           + mean(y);
  beta = beta_std * sd(y) / sd(x);
  sigma = sd(y) * sigma_std;
}

It is inefficient to compute all of the means and standard deviations every iteration; for more efficiency, these can be calculated once and stored as transformed data. Furthermore, the model sampling statement can be easily vectorized, for instance, in the transformed model, to

y_std ~ normal(alpha_std + beta_std * x_std, sigma_std);

Standard Normal Distribution

For many applications on the standard scale, normal distributions with location zero and scale one will be used. In these cases, it is more efficient to use

y ~ std_normal();

than to use

y ~ normal(0, 1);

because the subtraction of the location and division by the scale cancel, as does subtracting the log of the scale.