15.3 Arithmetic Precision
The choice of significand provides \(\log_{10} 2^{53} \approx 15.95\) decimal (base 10) digits of arithmetic precision. This is just the precision of the floating-point representation. After several operations are chained together, the realized arithmetic precision is often much lower.
15.3.1 Rounding and probabilities
In practice, the finite amount of arithmetic precision leads to rounding, whereby a number is represented by the closest floating-point number. For example, with only 16 decimal digits of accuracy,
1 + 1e-20 == 1
The closest floating point number to \(1 + 10^{-20}\) turns out to be
\(1\) itself. By contrast, 0 + 1e-20 == 1e-20
.
This highlights the fact that precision depends on scale. Even though
1 + 1e-20 == 1
, we have 1e-20 + 1e-20 == 2e-20
, as expected.
Rounding also manifests itself in a lack of transitivity. In particular, it does not usually hold for three floating point numbers \(a, b, c\) that \((a + b) + c = a + (b = c)\).
In statistical applications, problems often manifest in situations where users expect the usual rules of real-valued arithmetic to hold. Suppose we have a lower triangular matrix \(L\) with strictly positive diagonal, so that it is the Cholesky factor of a positive-definite matrix \(L \, L^{\top}\). In practice, rounding and loss of precision may render the result \(L \, L^{\top}\) neither symmetric nor positive definite.
In practice, care must be taken to defend against rounding. For example, symmetry may be produced by adding \(L \, L^{\top}\) with its transpose and dividing by two, or by copying the lower triangular portion into the upper portion. Positive definiteness may be maintained by adding a small quantity to the diagonal.
15.3.2 Machine precision and the asymmetry of 0 and 1
The smallest number greater than zero is roughly \(0 + 10^{-323}\). The largest number less than zero is roughly \(1 - 10^{-15.95}\). The asymmetry is apparent when considering the representation of that largest number smaller than one—the exponent is of no help, and the number is represented as the binary equivalent of 0.9999999999999999.
For this reason, the machine precision is said to be roughly
\(10^{15.95}\). This constant is available as machine_precision()
in
Stan.
15.3.3 Complementary and epsilon functions
Special operations are available to mitigate this problem with numbers
rounding when they get close to one. For example, consider the
operation log(1 + x)
for positive x
. When x
is small (less than
\(10^{-16}\) for double-precision floating point), the sum in the
argument will round to 1 and the result will round to zero. To allow
more granularity, programming languages provide a library function
directly implementing \(f(x) = \log (1 + x)\). In Stan (as in C++),
this operation is written as log1p(x)
. Because x
itself may be
close to zero, the function log1p(x)
can take the logarithm of
values very close to one, the results of which are close to zero.
Similarly, the complementary cumulative distribution functions (CCDF), defined by \(F^{\complement}_Y(y) = 1 - F_Y(y)\), where \(F_Y\) is the cumulative distribution function (CDF) for the random variable \(Y\). This allows values very close to one to be represented in complementary form.
15.3.4 Catastrophic cancellation
Another downside to floating point representations is that subtraction of two numbers close to each other results in a loss of precision that depends on how close they are. This is easy to see in practice. Consider
1.23456789012345
- 1.23456789012344
= 0.00000000000001
We start with fifteen decimal places of accuracy in the arguments and are left with a single decimal place of accuracy in the result.
Catastrophic cancellation arises in statistical computations whenever we calculate variance for a distribution with small standard deviations relative to its location. When calculating summary statistics, Stan uses Welford’s algorithm for computing variances. This avoids catastrophic cancellation and may also be carried out in a single pass.
15.3.5 Overflow
Even though 1e200
may be represented as a double precision floating
point value, there is no finite value large enough to represent 1e200 * 1e200
. The result of 1e200 * 1e200
is said to overflow. The
IEEE 754 standard requires the result to be positive infinity.
Overflow is rarely a problem in statistical computations. If it is, it’s possible to work on the log scale, just as for underflow as described below.
15.3.6 Underflow and the log scale
When there is no number small enough to represent a result, it is said
to underflow. For instance, 1e-200
may be represented, but
1e-200 * 1e-200
underflows so that the result is zero.
Underflow is a ubiquitous problem in likelihood calculations, For example, if \(p(y_n \mid \theta) < 0.1\), then
\[ p(y \mid \theta) = \prod_{n=1}^N p(y_n \mid \theta) \]
will underflow as soon as \(N > 350\) or so.
To deal with underflow, work on the log scale. Even though \(p(y \mid \theta)\) can’t be represented, there is no problem representing
\[ \begin{array}{rcl} \log p(y \mid \theta) & = & \log \prod_{n=1}^N p(y_n \mid \theta) \\[4pt] & = & \sum_{n = 1}^N \log p(y_n \mid \theta) \end{array} \]
This is why all of Stan’s probability functions operate on the log scale.
15.3.7 Adding on the log scale
If we work on the log scale, multiplication is converted to addition, \[ \log (a \times b) = \log a + \log b. \] Thus we can just start on the log scale and stay there through multiplication. But what about addition? If we have \(\log a\) and \(\log b\), how do we get \(\log (a + b)\)? Working out the algebra,
\[ \log (a + b) \ = \ \log (\exp(\log a) + \exp(\log b)) \]
The nested log of sum of exponentials is so common, it has its own name,
\[ \mathrm{log\_sum\_exp}(u, v) \ = \ \log (\exp(u) + \exp(v)). \]
so that
\[ \log (a + b) \ = \ \mathrm{log\_sum\_exp}(\log a, \log b). \]
Although it appears this might overflow as soon as exponentiation is introduced, evaluation does not proceed by evaluating the terms as written. Instead, with a little algebra, the terms are rearranged into a stable form,
\[ \mathrm{log\_sum\_exp}(u, v) \ = \ \max(u, v) + \log\left( \exp(u - \max(u, v)) + \exp(v - \max(u, v)) \right). \]
Because the terms inside the exponentiations are \(u - \max(u, v)\) and \(v - \max(u, v)\), one will be zero, and the other will be negative. Because the operation is symmetric, it may be assumed without loss of generality that \(u \geq v\), so that
\[ \mathrm{log\_sum\_exp}(u, v) = u + \log(1 + \exp(v - u)). \]
The inner term may itself be evaluated using log1p
, there is only
limited gain because \(\exp(v - u)\) is only near zero when \(u\) is much
larger than \(v\), meaning the result is likely to round to \(u\) anyway.
To conclude, when evaluating \(\log (a + b)\) given \(\log a\) and \(\log b\), and assuming \(\log a > \log b\), return
\[ \log (a + b) \ = \ \log a + \mathrm{log1p}(\exp(\log b - \log a)). \]