Floating Point Arithmetic
Computers approximate real values in floating point, using only a finite number of bits; the sections below describe the resulting issues of precision, rounding, overflow, and underflow.
Floating-point representations
Stan’s arithmetic is implemented using double-precision floating-point arithmetic. The behavior of most¹ modern computers follows the IEEE Standard for Floating-Point Arithmetic (IEEE 754).
Finite values
The double-precision component of the IEEE 754 standard specifies the representation of real values using a fixed pattern of 64 bits (8 bytes). All values are represented in base two (i.e., binary). The representation is divided into two signed components:
significand (53 bits): base value representing significant digits
exponent (11 bits): power of two multiplied by the base
The value of a finite floating point number is

$$v = \pm\, \text{significand} \times 2^{\text{exponent}}.$$
Normality
A normal floating-point value does not use any leading zeros in its significand; subnormal numbers may use leading zeros. Not all I/O systems support subnormal numbers.
Ranges and extreme values
There are some reserved exponent values so that legal exponent values range between $-1022$ and $1023$. The extreme representable values are

largest normal finite number: $\pm 1.8 \times 10^{308}$
largest subnormal finite number: $\pm 2.2 \times 10^{-308}$
smallest positive normal number: $2.2 \times 10^{-308}$
smallest positive subnormal number: $4.9 \times 10^{-324}$
Signed zero
Because of the sign bit, there are two ways to represent zero, often called “positive zero” and “negative zero”. This distinction is irrelevant in Stan (as it is in R), because the two values are equal (i.e., 0 == -0 evaluates to true).
Not-a-number values
A specially chosen bit pattern is used for the not-a-number value (often written as NaN in programming language output, including Stan’s).

Stan provides a value function not_a_number() that returns this special not-a-number value. It is meant to represent error conditions, not missing values. Usually when not-a-number is an argument to a function, the result will be not-a-number if an exception (a rejection in Stan) is not raised.
Stan also provides a test function is_nan(x) that returns 1 if x is not-a-number and 0 otherwise.
Not-a-number values propagate under almost all mathematical operations. For example, all of the built-in arithmetic operations (addition, subtraction, multiplication, division, and negation) return not-a-number if any of their arguments is not-a-number. The built-in functions such as log and exp have the same behavior, propagating not-a-number values.
Most of Stan’s built-in functions will throw exceptions (i.e., reject) when any of their arguments is not-a-number.
Comparisons with not-a-number always return false, up to and including comparison with itself. That is, not_a_number() == not_a_number() somewhat confusingly returns false. That is why there is a built-in is_nan() function in Stan (and in C++). The only exception is the negated comparison, not-equals, which remains coherent: not_a_number() != not_a_number() returns true.
Undefined operations often return not-a-number values. For example, sqrt(-1) will evaluate to not-a-number.
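As a concrete illustration, the following sketch (the variable name u is ours) prints the propagation and comparison behavior described above from a transformed data block:

```stan
transformed data {
  real u = not_a_number();
  print(is_nan(u + 1));  // 1: not-a-number propagates through arithmetic
  print(u == u);         // 0: every comparison with not-a-number is false ...
  print(u != u);         // 1: ... except not-equals, which stays coherent
  print(is_nan(u));      // 1: is_nan() is the reliable test
}
```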
Positive and negative infinity
There are also two special values representing positive infinity ($+\infty$) and negative infinity ($-\infty$). For example, log(0) evaluates to negative infinity, and exponentiating negative infinity leads back to zero, so that 0 == exp(log(0)). Nevertheless, this should not be done in Stan, because the chain rule used to calculate the derivatives will attempt illegal operations and return not-a-number.
There are value functions positive_infinity() and negative_infinity(), as well as a test function is_inf().
Positive and negative infinity have the expected comparison behavior, so that negative_infinity() < 0 evaluates to true (represented with 1 in Stan). Also, negating positive infinity leads to negative infinity and vice-versa.
Positive infinity added to either itself or a finite value produces positive infinity. Negative infinity behaves the same way. However, attempts to subtract positive infinity from itself produce not-a-number, not zero. Similarly, dividing one infinite value by another results in a not-a-number value.
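A small sketch of these rules, again using print statements (the variable name pinf is ours):

```stan
transformed data {
  real pinf = positive_infinity();
  print(negative_infinity() < 0);      // 1: infinities compare as expected
  print(is_inf(pinf + 1));             // 1: inf plus a finite value is still inf
  print(-pinf == negative_infinity()); // 1: negation flips the sign of infinity
  print(is_nan(pinf - pinf));          // 1: inf minus inf is not-a-number
  print(is_nan(pinf / pinf));          // 1: so is inf divided by inf
}
```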
Literals: decimal and scientific notation
In programming languages such as Stan, numbers may be represented in standard decimal (base 10) notation, e.g., 2.39 or -1567846.276452. Remember there is no point in writing more than 16 significant digits, as they cannot be represented. A number may also be coded in Stan using scientific notation, which consists of a signed decimal representation of a base and a signed integer decimal exponent. For example, 36.29e-3 represents the number $36.29 \times 10^{-3}$, or 0.03629.
Arithmetic precision
The choice of a 53-bit significand provides $\log_{10} 2^{53} \approx 15.95$ decimal (base 10) digits of arithmetic precision.
Rounding and probabilities
In practice, the finite amount of arithmetic precision leads to rounding, whereby a number is represented by the closest floating-point number. For example, with only 16 decimal digits of accuracy,

1 + 1e-20 == 1

The closest floating point number to $1 + 10^{-20}$ turns out to be $1$ itself. By contrast,

0 + 1e-20 == 1e-20

This highlights the fact that precision depends on scale. Even though 1 + 1e-20 == 1, we have 1e-20 + 1e-20 == 2e-20, as expected.
Rounding also manifests itself in a lack of associativity. In particular, it does not usually hold for three floating point numbers $a$, $b$, $c$ that $(a + b) + c == a + (b + c)$.
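These scale and associativity effects can be checked directly; a minimal sketch:

```stan
transformed data {
  // precision depends on scale: 1e-20 vanishes next to 1 but not next to 0
  print(1 + 1e-20 == 1);            // 1 (true)
  print(1e-20 + 1e-20 == 2e-20);    // 1 (true)
  // rounding breaks associativity of addition
  print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3));  // 0 (false)
}
```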
In statistical applications, problems often manifest in situations where users expect the usual rules of real-valued arithmetic to hold. Suppose we have a lower triangular matrix $L$ with strictly positive diagonal, so that it is the Cholesky factor of the positive-definite matrix $L\,L^{\top}$. In practice, rounding may cause the computed product $L\,L^{\top}$ to fail to be exactly symmetric or positive definite.
In practice, care must be taken to defend against rounding. For example, symmetry may be produced by averaging a matrix with its own transpose, $\tfrac{1}{2}(M + M^{\top})$, or by copying the lower triangular portion into the upper portion.
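For example, the following sketch (the matrix values are arbitrary) restores exact symmetry after forming $L\,L^{\top}$:

```stan
transformed data {
  matrix[3, 3] L = [[1, 0, 0],
                    [0.5, 1, 0],
                    [0.25, 0.5, 1]];
  matrix[3, 3] Sigma = L * L';  // symmetric in exact arithmetic, but
                                // rounding can break symmetry in practice
  // defend against rounding by averaging with the transpose
  matrix[3, 3] Sigma_sym = 0.5 * (Sigma + Sigma');
}
```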
Machine precision and the asymmetry of 0 and 1
The smallest number greater than zero is roughly $0 + 10^{-323}$. The smallest number greater than one is roughly $1 + 10^{-15.95}$; the gap is so much larger because precision is relative to the scale of the number being represented. For this reason, the machine precision is said to be roughly $10^{-15.95}$; it is available through the built-in function machine_precision() in Stan.
Complementary and epsilon functions
Special operations are available to mitigate this problem with numbers rounding when they get close to one. For example, consider the operation log(1 + x) for positive x. When x is small (less than $10^{-16}$ for double-precision floating point), the sum $1 + x$ rounds to 1 and the logarithm rounds to 0, losing the value of x entirely. To retain precision, Stan provides the built-in function log1p(x), which evaluates $\log(1 + x)$ without first forming $1 + x$. Because x itself may be close to zero, the function log1p(x) can take the logarithm of values very close to one, the results of which are close to zero.
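A small sketch of the difference:

```stan
transformed data {
  real x = 1e-18;
  print(log(1 + x));  // 0: the sum 1 + x already rounded to 1
  print(log1p(x));    // approximately 1e-18, as desired
}
```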
Similarly, the complementary cumulative distribution function (CCDF), defined by $F^{\complement}_Y(y) = 1 - F_Y(y)$, where $F_Y$ is the cumulative distribution function (CDF) for the random variable $Y$, allows probabilities very close to one to be represented accurately in complementary form.
Catastrophic cancellation
Another downside to floating point representations is that subtraction of two numbers close to each other results in a loss of precision that depends on how close they are. This is easy to see in practice. Consider

1.23456789012345
− 1.23456789012340
= 0.00000000000005

The operands each have fifteen significant digits, but the result has only one; the remaining digits have cancelled.
Catastrophic cancellation arises in statistical computations whenever we calculate the variance of a distribution whose standard deviation is small relative to its location. When calculating summary statistics, Stan uses Welford’s algorithm for computing variances; it avoids catastrophic cancellation and can be carried out in a single pass.
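Stan’s internal implementation is in C++, but the algorithm can be sketched as a user-defined Stan function (the name welford_variance is ours):

```stan
functions {
  // single-pass (Welford) sample variance: tracks a running mean and the
  // sum of squared deviations from it, avoiding the catastrophic
  // cancellation of the naive E[Y^2] - (E[Y])^2 formula
  real welford_variance(vector y) {
    int N = num_elements(y);
    real running_mean = 0;
    real m2 = 0;
    for (n in 1:N) {
      real delta = y[n] - running_mean;
      running_mean += delta / n;
      m2 += delta * (y[n] - running_mean);
    }
    return m2 / (N - 1);
  }
}
```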
Overflow
Even though 1e200 may be represented as a double-precision floating point value, there is no finite value large enough to represent 1e200 * 1e200. The result of 1e200 * 1e200 is said to overflow. The IEEE 754 standard requires the result to be positive infinity.
Overflow is rarely a problem in statistical computations. If it is, it’s possible to work on the log scale, just as for underflow, as described below.
Underflow and the log scale
When there is no number small enough to represent a result, it is said to underflow. For instance, 1e-200 may be represented, but 1e-200 * 1e-200 underflows so that the result is zero.
Underflow is a ubiquitous problem in likelihood calculations. For example, if $p(y_n \mid \theta) \approx 0.1$ for each of $N$ observations, the likelihood $\prod_{n=1}^{N} p(y_n \mid \theta) \approx 10^{-N}$ underflows to zero once $N$ exceeds about 320.

To deal with underflow, work on the log scale. Even though $p(y \mid \theta)$ cannot be represented, there is no problem representing its logarithm,

$$\log p(y \mid \theta) = \sum_{n=1}^{N} \log p(y_n \mid \theta),$$

which in the example above is roughly $-2.3\,N$, comfortably within range.
This is why all of Stan’s probability functions operate on the log scale.
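For instance, a model’s log density accumulates log-scale terms directly; a minimal sketch, assuming normally distributed observations y:

```stan
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // each _lpdf term is a log density; summing them into target builds
  // the log likelihood, which stays representable even when the
  // product of N densities would underflow to zero
  for (n in 1:N) {
    target += normal_lpdf(y[n] | mu, sigma);
  }
}
```

The vectorized form target += normal_lpdf(y | mu, sigma); computes the same sum more efficiently.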
Log sum of exponentials
Working on the log scale, multiplication is converted to addition,

$$\log(a \times b) = \log a + \log b,$$

so sequences of multiplications may be carried out entirely as sums of logarithms. Addition is more challenging: given $\log a$ and $\log b$, the logarithm of the sum is

$$\log(a + b) = \log(\exp(\log a) + \exp(\log b)).$$
Log-sum-exp function
The nested log of sum of exponentials is so common, it has its own name, “log-sum-exp”,

$$\text{log-sum-exp}(u, v) = \log(\exp(u) + \exp(v)),$$

so that $\log(a + b) = \text{log-sum-exp}(\log a, \log b)$.
Although it appears this might overflow as soon as exponentiation is introduced, evaluation does not proceed by evaluating the terms as written. Instead, with a little algebra, the terms are rearranged into a stable form,

$$\text{log-sum-exp}(u, v) = \max(u, v) + \log\big(\exp(u - \max(u, v)) + \exp(v - \max(u, v))\big).$$

Because the terms inside the exponentiations are relative to the maximum, one is always zero and the other is negative; overflow is therefore impossible, and if the smaller term underflows to zero, the result is still accurate to the precision of the larger term.
Although the inner term may itself be evaluated using the built-in function log1p, there is only limited gain, because the exponentiated difference is close to zero only when the two arguments are far apart, in which case the smaller term contributes little to the result anyway.

To conclude, when evaluating $\log(a + b)$ given $\log a$ and $\log b$, use log-sum-exp in its stable form rather than exponentiating and adding directly; Stan provides it as the built-in function log_sum_exp.
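The following user-defined version (the name stable_log_sum_exp is ours) makes the rearrangement explicit:

```stan
functions {
  // log(exp(u) + exp(v)) without overflow: subtracting the max makes
  // both exponents non-positive, so exp() cannot overflow
  real stable_log_sum_exp(real u, real v) {
    real m = fmax(u, v);
    if (is_inf(m) && m < 0) {
      return negative_infinity();  // both arguments are -inf
    }
    return m + log(exp(u - m) + exp(v - m));
  }
}
```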
Applying log-sum-exp to a sequence
The log sum of exponentials function may be generalized to sequences in the obvious way, so that if $v = v_1, \ldots, v_N$, then

$$\text{log-sum-exp}(v) = \log \sum_{n=1}^{N} \exp(v_n) = \max(v) + \log \sum_{n=1}^{N} \exp(v_n - \max(v)).$$
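Stan’s built-in log_sum_exp also accepts a vector, so the sequence version is available directly; a quick sketch of why the stabilized form matters:

```stan
transformed data {
  vector[3] v = [-1000, -1001, -1002]';
  print(log(sum(exp(v))));  // -inf: every exp(v[n]) underflows to zero
  print(log_sum_exp(v));    // about -999.59, computed stably
}
```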
Calculating means with log-sum-exp
An immediate application is to computing the mean of a vector of probabilities entirely on the log scale. If $p = (p_1, \ldots, p_N)$ and $\log p = (\log p_1, \ldots, \log p_N)$, then

$$\log \frac{1}{N} \sum_{n=1}^{N} p_n = \text{log-sum-exp}(\log p) - \log N.$$
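A sketch of this computation as a user-defined function (the name log_mean_exp is ours, not a built-in):

```stan
functions {
  // log of the mean of p, given log_p = log(p), entirely on the log scale:
  // log(mean(p)) = log_sum_exp(log_p) - log(N)
  real log_mean_exp(vector log_p) {
    return log_sum_exp(log_p) - log(num_elements(log_p));
  }
}
```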
Comparing floating-point numbers
Because floating-point representations are inexact, it is rarely a good idea to test exact equality. The general recommendation is that rather than testing x == y, an approximate test should be used, given an absolute or relative tolerance.
Given a positive absolute tolerance of epsilon, x can be compared to y using the conditional

abs(x - y) <= epsilon.
Absolute tolerances work when the scales of x and y and of the relevant comparison are known in advance.
Given a positive relative tolerance of epsilon, a typical comparison is

2 * abs(x - y) / (abs(x) + abs(y)) <= epsilon.
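Both tests can be packaged into a user-defined function; a sketch (the name approx_equal and the rule of passing on either tolerance are ours):

```stan
functions {
  // 1 if x and y match within an absolute OR a relative tolerance
  int approx_equal(real x, real y, real abs_tol, real rel_tol) {
    if (abs(x - y) <= abs_tol) {
      return 1;
    }
    return 2 * abs(x - y) / (abs(x) + abs(y)) <= rel_tol;
  }
}
```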
Footnotes
¹ The notable exception is Intel’s optimizing compilers under certain optimization settings.