17.3 Arithmetic precision
The choice of significand provides \(\log_{10} 2^{53} \approx 15.95\) decimal (base 10) digits of arithmetic precision. This is just the precision of the floating-point representation. After several operations are chained together, the realized arithmetic precision is often much lower.
17.3.1 Rounding and probabilities
In practice, the finite amount of arithmetic precision leads to rounding, whereby a number is represented by the closest floating-point number. For example, with only 16 decimal digits of accuracy,
1 + 1e-20 == 1
The closest floating point number to \(1 + 10^{-20}\) turns out to be \(1\) itself. By contrast,
0 + 1e-20 == 1e-20
This highlights the fact that precision depends on scale. Even though
1 + 1e-20 == 1
, we have 1e-20 + 1e-20 == 2e-20
, as expected.
Rounding also manifests itself in a lack of transitivity. In particular, it does not usually hold for three floating point numbers \(a, b, c\) that \((a + b) + c = a + (b = c)\).
In statistical applications, problems often manifest in situations where users expect the usual rules of real-valued arithmetic to hold. Suppose we have a lower triangular matrix \(L\) with strictly positive diagonal, so that it is the Cholesky factor of a positive-definite matrix \(L \, L^{\top}\). In practice, rounding and loss of precision may render the result \(L \, L^{\top}\) neither symmetric nor positive definite.
In practice, care must be taken to defend against rounding. For example, symmetry may be produced by adding \(L \, L^{\top}\) with its transpose and dividing by two, or by copying the lower triangular portion into the upper portion. Positive definiteness may be maintained by adding a small quantity to the diagonal.
17.3.2 Machine precision and the asymmetry of 0 and 1
The smallest number greater than zero is roughly \(0 + 10^{-323}\). The largest number less than zero is roughly \(1 - 10^{-15.95}\). The asymmetry is apparent when considering the representation of that largest number smaller than one—the exponent is of no help, and the number is represented as the binary equivalent of \(0.9999999999999999\).
For this reason, the machine precision is said to be roughly
\(10^{15.95}\). This constant is available as machine_precision()
in
Stan.
17.3.3 Complementary and epsilon functions
Special operations are available to mitigate this problem with numbers
rounding when they get close to one. For example, consider the
operation log(1 + x)
for positive x
. When x
is small (less than
\(10^{-16}\) for double-precision floating point), the sum in the
argument will round to 1 and the result will round to zero. To allow
more granularity, programming languages provide a library function
directly implementing \(f(x) = \log (1 + x)\). In Stan (as in C++),
this operation is written as log1p(x)
. Because x
itself may be
close to zero, the function log1p(x)
can take the logarithm of
values very close to one, the results of which are close to zero.
Similarly, the complementary cumulative distribution functions (CCDF), defined by \(F^{\complement}_Y(y) = 1 - F_Y(y)\), where \(F_Y\) is the cumulative distribution function (CDF) for the random variable \(Y\). This allows values very close to one to be represented in complementary form.
17.3.4 Catastrophic cancellation
Another downside to floating point representations is that subtraction of two numbers close to each other results in a loss of precision that depends on how close they are. This is easy to see in practice. Consider \[\begin{align*} 1&.23456789012345 \\ - 1&.23456789012344 \\ = 0&.00000000000001 \end{align*}\] We start with fifteen decimal places of accuracy in the arguments and are left with a single decimal place of accuracy in the result.
Catastrophic cancellation arises in statistical computations whenever we calculate variance for a distribution with small standard deviations relative to its location. When calculating summary statistics, Stan uses Welford’s algorithm for computing variances. This avoids catastrophic cancellation and may also be carried out in a single pass.
17.3.5 Overflow
Even though 1e200
may be represented as a double precision floating
point value, there is no finite value large enough to represent 1e200 * 1e200
. The result of 1e200 * 1e200
is said to overflow. The
IEEE 754 standard requires the result to be positive infinity.
Overflow is rarely a problem in statistical computations. If it is, it’s possible to work on the log scale, just as for underflow as described below.
17.3.6 Underflow and the log scale
When there is no number small enough to represent a result, it is said
to underflow. For instance, 1e-200
may be represented, but
1e-200 * 1e-200
underflows so that the result is zero.
Underflow is a ubiquitous problem in likelihood calculations, For example, if \(p(y_n \mid \theta) < 0.1\), then \[ p(y \mid \theta) = \prod_{n=1}^N p(y_n \mid \theta) \] will underflow as soon as \(N > 350\) or so.
To deal with underflow, work on the log scale. Even though \(p(y \mid \theta)\) can’t be represented, there is no problem representing \[ \begin{array}{rcl} \log p(y \mid \theta) & = & \log \prod_{n=1}^N p(y_n \mid \theta) \\[4pt] & = & \sum_{n = 1}^N \log p(y_n \mid \theta) \end{array} \]
This is why all of Stan’s probability functions operate on the log scale.