29.2 Estimation error

29.2.1 Parameter estimates

Estimation is usually considered for unknown parameters. If the data from which the parameters were estimated came from simulated data, the true value of the parameters may be known. If \(\theta\) is the true value and \(\hat{\theta}\) the estimate, then error is just the difference between the prediction and the true value, \[ \textrm{err} = \hat{\theta} - \theta. \]

If the estimate is larger than the true value, the error is positive, and if it’s smaller, then error is negative. If an estimator’s unbiased, then expected error is zero. So typically, absolute error or squared error are used, which will always have positive expectations for an imperfect estimator. Absolute error is defined as \[ \textrm{abs-err} = \left| \hat{\theta} - \theta \right| \] and squared error as \[ \textrm{sq-err} = \left( \hat{\theta} - \theta \right)^2. \] Gneiting and Raftery (2007) provide a thorough overview of such scoring rules and their properties.

Bayesian posterior means minimize expected square error, whereas posterior medians minimize expected absolute error. Estimates based on modes rather than probability, such as (penalized) maximum likelihood estimates or maximum a posterior estimates, do not have these properties.

29.2.2 Predictive estimates

In addition to parameters, other unknown quantities may be estimated, such as the score of a football match or the effect of a medical treatment given to a subject. In these cases, square error is defined in the same way. If there are multiple exchangeable outcomes being estimated, \(z_1, \ldots, z_N,\) then it is common to report mean square error (MSE), \[ \textrm{mse} = \frac{1}{N} \sum_{n = 1}^N \left( \hat{z}_n - z_n\right)^2. \] To put the error back on the scale of the original value, the square root may be applied, resulting in what is known prosaically as root mean square error (RMSE), \[ \textrm{rmse} = \sqrt{\textrm{mean-sq-err}}. \]

29.2.3 Predictive estimates in Stan

Consider a simple linear regression model, parameters for the intercept \(\alpha\) and slope \(\beta\), along with predictors \(\tilde{x}_n\). The standard Bayesian estimate is the expected value of \(\tilde{y}\) given the predictors and training data, \[\begin{eqnarray*} \hat{\tilde{y}}_n & = & \mathbb{E}[\tilde{y}_n \mid \tilde{x}_n, x, y] \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \tilde{y}_n^{(m)} \end{eqnarray*}\] where \(\tilde{y}_n^{(m)}\) is drawn from the sampling distribution \[ \tilde{y}_n^{(m)} \sim p(\tilde{y}_n \mid \tilde{x}_n, \alpha^{(m)}, \beta^{(m)}), \] for parameters \(\alpha^{(m)}\) and \(\beta^{(m)}\) drawn from the posterior, \[ (\alpha^{(m)}, \beta^{(m)}) \sim p(\alpha, \beta \mid x, y). \]

In the linear regression case, two stages of simplification can be carried out, the first of which helpfully reduces the variance of the estimator. First, rather than averaging samples \(\tilde{y}_n^{(m)}\), the same result is obtained by averaging linear predictions, \[\begin{eqnarray*} \hat{\tilde{y}}_n & = & \mathbb{E}\left[ \alpha + \beta \cdot \tilde{x}_n \mid \tilde{x}_n, x, y \right] \\[4pt] & \approx & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n \end{eqnarray*}\] This is possible because \[ \tilde{y}_n^{(m)} \sim \textrm{normal}(\tilde{y}_n \mid \alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n, \sigma^{(m)}), \] and the normal distribution has symmetric error so that the expectation of \(\tilde{y}_n^{(m)}\) is the same as \(\alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n\). Replacing the sampled quantity \(\tilde{y}_n^{(m)}\) with its expectation is a general variance reduction technique for Monte Carlo estimates known as Rao-Blackwellization (Rao 1945; Blackwell 1947).

In the linear case, because the predictor is linear in the coefficients, the estimate can be further simplified to use the estimated coefficients, \[\begin{eqnarray*} \tilde{y}_n^{(m)} & \approx & \frac{1}{M} \sum_{m = 1}^M \left( \alpha^{(m)} + \beta^{(m)} \cdot \tilde{x}_n \right) \\[4pt] & = & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)} + \frac{1}{M} \sum_{m = 1}^M (\beta^{(m)} \cdot \tilde{x}_n) \\[4pt] & = & \frac{1}{M} \sum_{m = 1}^M \alpha^{(m)} + \left( \frac{1}{M} \sum_{m = 1}^M \beta^{(m)}\right) \cdot \tilde{x}_n \\[4pt] & = & \hat{\alpha} + \hat{\beta} \cdot \tilde{x}_n. \end{eqnarray*}\]

In Stan, only the first of the two steps (the important variance reduction step) can be coded in the object model. The linear predictor is defined in the generated quantities block.

data {
  int<lower=0> N_tilde;
  vector[N_tilde] x_tilde;
  // ...
}
// ...
generated quantities {
  vector[N_tilde] tilde_y = alpha + beta * x_tilde;
}

The posterior mean of tilde_y calculated by Stan is the Bayesian estimate \(\hat{\tilde{y}}.\) The posterior median may also be calculated and used as an estimate, though square error and the posterior mean are more commonly reported.

References

Blackwell, David. 1947. “Conditional Expectation and Unbiased Sequential Estimation.” The Annals of Mathematical Statistics 18 (1): 105–10. https://doi.org/10.1214/aoms/1177730497.
Gneiting, Tilmann, and Adrian E Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102 (477): 359–78.
Rao, C. Radhakrishna. 1945. “Information and Accuracy Attainable in the Estimation of Statistical Parameters.” Bulletin of the Calcutta Math Society 37 (3): 81–91.