This is an old version, view current version.

15.3 Arithmetic Precision

The choice of significand provides \(\log_10 2^53 \approx 15.95\) decimal (base 10) digits of arithmetic precision. This is just the precision of the floating-point representation. After several operations are chained together, the realized arithmetic precision is often much lower.

15.3.1 Rounding and probabilities

In practice, the finite amount of arithmetic precision leads to rounding, whereby a number is represented by the closest floating-point number. For example, with only 16 decimal digits of accuracy,

1 + 1e-20 == 1

The closest floating point number to \(1 + 10^{-20}\) turns out to be \(1\) itself. By contrast, 0 + 1e-20 == 1e-20.

This highlights the fact that precision depends on scale. Even though 1 + 1e-20 == 1, we have 1e-20 + 1e-20 == 2e-20, as expected.

Rounding also manifests itself in a lack of transitivity. In particular, it does not usually hold for three floating point numbers \(a, b, c\) that \((a + b) + c = a + (b = c)\).

In statistical applications, problems often manifest in situations where users expect the usual rules of real-valued arithmetic to hold. Suppose we have a lower triangular matrix \(L\) with strictly positive diagonal, so that it is the Cholesky factor of a positive-definite matrix \(L \, L^{\top}\). In practice, rounding and loss of precision may render the result \(L \, L^{\top}\) neither symmetric nor positive definite.

In practice, care must be taken to defend against rounding. For example, symmetry may be produced by adding \(L \, L^{top}\) with its transpose and dividing by two, or by copying the lower triangular portion into the upper portion. Positive definiteness may be maintained by adding a small quantity to the diagonal.

15.3.2 Machine precision and the asymmetry of 0 and 1

The smallest number greater than zero is roughly \(0 + 10^{-323}\). The largest number less than zero is roughly \(1 - 10^{-15.95}\). The asymmetry is apparent when considering the representation of that largest number smaller than one—the exponent is of no help, and the number is represented as the binary equivalent of 0.9999999999999999.

For this reason, the machine precision is said to be roughly \(10^{15.95}\). This constant is available as machine_precision() in Stan.

15.3.3 Complementary and epsilon functions

Special operations are available to mitigate this problem with numbers rounding when they get close to one. For example, consider the operation log(1 + x) for positive x. When x is small (less than \(10^{-16}\) for double-precision floating point), the sum in the argument will round to 1 and the result will round to zero. To allow more granularity, programming languages provide a library function directly implementing \(f(x) = \log (1 + x)\). In Stan (as in C++), this operation is written as log1p(x). Because x itself may be close to zero, the function log1p(x) can take the logarithm of values very close to one, the results of which are close to zero.

Similarly, the complementary cumulative distribution functions (CCDF), defined by \(F^{\complement}_Y(y) = 1 - F_Y(y)\), where \(F_Y\) is the cumulative distribution function (CDF) for the random variable \(Y\). This allows values very close to one to be represented in complementary form.

15.3.4 Catastrophic cancellation

Another downside to floating point representations is that subtraction of two numbers close to each other results in a loss of precision that depends on how close they are. This is easy to see in practice. Consider

  1.23456789012345
- 1.23456789012344
= 0.00000000000001

We start with fifteen decimal places of accuracy in the arguments and are left with a single decimal place of accuracy in the result.

Catastrophic cancellation arises in statistical computations whenever we calculate variance for a distribution with small standard deviations relative to its location. When calculating summary statistics, Stan uses Welford’s algorithm for computing variances. This avoids catastrophic cancellation and may also be carried out in a single pass.

15.3.5 Overflow

Even though 1e200 may be represented as a double precision floating point value, there is no finite value large enough to represent 1e200 * 1e200. The result of 1e200 * 1e200 is said to overflow. The IEEE 754 standard requires the result to be positive infinity.

Overflow is rarely a problem in statistical computations. If it is, it’s possible to work on the log scale, just as for underflow as described below.

15.3.6 Underflow and the log scale

When there is no number small enough to represent a result, it is said to underflow. For instance, 1e-200 may be represented, but 1e-200 * 1e-200 underflows so that the result is zero.

Underflow is a ubiquitous problem in likelihood calculations, For example, if \(p(y_n \mid \theta) < 0.1\), then

\[ p(y \mid \theta) = \prod_{n=1}^N p(y_n \mid \theta) \]

will underflow as soon as \(N > 350\) or so.

To deal with underflow, work on the log scale. Even though \(p(y \mid \theta)\) can’t be represented, there is no problem representing

\[ \begin{array}{rcl} \log p(y \mid \theta) & = & \log \prod_{n=1}^N p(y_n \mid \theta) \\[4pt] & = & \sum_{n = 1}^N \log p(y_n \mid \theta) \end{array} \]

This is why all of Stan’s probability functions operate on the log scale.

15.3.7 Adding on the log scale

If we work on the log scale, multiplication is converted to addition, \[ \log (a \times b) = \log a + \log b. \] Thus we can just start on the log scale and stay there through multiplication. But what about addition? If we have \(\log a\) and \(\log b\), how do we get \(\log (a + b)\)? Working out the algebra,

\[ \log (a + b) \ = \ \log (\exp(\log a) + \exp(\log b)) \]

The nested log of sum of exponentials is so common, it has its own name,

\[ \mathrm{log\_sum\_exp}(u, v) \ = \ \log (\exp(u) + \exp(v)). \]

so that

\[ \log (a + b) \ = \ \mathrm{log\_sum\_exp}(\log a, \log b). \]

Although it appears this might overflow as soon as exponentiation is introduced, evaluation does not proceed by evaluating the terms as written. Instead, with a little algebra, the terms are rearranged into a stable form,

\[ \mathrm{log\_sum\_exp}(u, v) \ = \ \max(u, v) + \log\left( \exp(u - \max(u, v)) + \exp(v - \max(u, v)) \right). \]

Because the terms inside the exponentiations are \(u - \max(u, v)\) and \(v - \max(u, v)\), one will be zero, and the other will be negative. Because the operation is symmetric, it may be assumed without loss of generality that \(u \geq v\), so that

\[ \mathrm{log\_sum\_exp}(u, v) = u + \log(1 + \exp(v - u)). \]

The inner term may itself be evaluated using log1p, there is only limited gain because \(\exp(v - u)\) is only near zero when \(u\) is much larger than \(v\), meaning the result is likely to round to \(u\) anyway.

To conclude, when evaluating \(\log (a + b)\) given \(\log a\) and \(\log b\), and assuming \(\log a > \log b\), return

\[ \log (a + b) \ = \ \log a + \mathrm{log1p}(\exp(\log b - \log a)). \]