Introduction
Many algorithms require the calculation of sums of products. This operation is important enough to have been added to the list of recommended operations in the IEEE Standard for Floating-Point Arithmetic (IEEE Std 754-2008, section 9.4). Examples of operations requiring such calculations are:
- Complex multiplication and division
- Vector length
- Vector dot products
Unfortunately, the most common methods for performing these operations may lead to unacceptable errors due to rounding of intermediate results.
Error-free Transforms
Exact Multiplication
Given the availability of the fma() operation (as required by IEEE Std 754-2008, and as present in many modern CPUs), and assuming no overflow or underflow, it is possible to calculate the exact value of a*b using two floating-point operations:
```
2MultFMA(a, b):
    s = a * b
    if s is non-finite
        return (s, 0)
    t = fma(a, b, -s)
    return (s, t)
```
The exact result is given as s + t.
- Without the check, if a*b overflows, then s will be ±infinity while t will be the infinity of the opposite sign, and their sum will be a NaN.
- If a*b underflows, the result will not be exact, but will be the best possible approximation.
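To make this concrete, here is a minimal C sketch of 2MultFMA. It assumes a correctly rounded fma() from <math.h> and strict IEEE semantics (i.e., no -ffast-math); the helper name two_mult_fma is ours, not part of any library:

```c
#include <math.h>

/* A sketch of 2MultFMA: splits a*b into a rounded part s and an exact
   error term t, so that s + t == a*b whenever no overflow occurs.
   Requires a correctly rounded fma() and strict IEEE arithmetic. */
static void two_mult_fma(double a, double b, double *s, double *t) {
    *s = a * b;               /* rounded product */
    if (!isfinite(*s)) {      /* overflow guard from the pseudocode */
        *t = 0.0;
        return;
    }
    *t = fma(a, b, -*s);      /* exact residual a*b - s, computed without rounding */
}
```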
Exact Addition
```
2Sum(a, b):
    s = RN(a + b)
    if s is non-finite
        return (s, 0)
    b' = RN(s - a)
    a' = RN(s - b')
    δb = RN(b - b')
    δa = RN(a - a')
    t = RN(δa + δb)
    return (s, t)
```
where RN() means "round to nearest, ties to even".
The exact result is given as s + t.
- Without the check, if a+b overflows, then the result will be a NaN.
- If a+b underflows, the result will still be exact.
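A corresponding C sketch of 2Sum, again with a helper name (two_sum) of our own choosing, and assuming the default round-to-nearest mode:

```c
#include <math.h>

/* A sketch of 2Sum (Møller/Knuth): s is the rounded sum and t its exact
   rounding error, so s + t == a + b whenever no overflow occurs.
   No ordering of |a| and |b| is assumed. */
static void two_sum(double a, double b, double *s, double *t) {
    *s = a + b;
    if (!isfinite(*s)) {      /* overflow guard */
        *t = 0.0;
        return;
    }
    double bp = *s - a;       /* b' */
    double ap = *s - bp;      /* a' */
    double db = b - bp;       /* δb */
    double da = a - ap;       /* δa */
    *t = da + db;             /* exact rounding error of a + b */
}
```

For example, two_sum(1.0, 0x1p-60, &s, &t) yields s == 1.0 and t == 0x1p-60, recovering the low-order bits that the rounded sum discards.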
Summation of Multiple Components
Error-free summation of more than two components is surprisingly difficult. The algorithm presented here is a variant of compensated addition, where all operations are exact.
Given a vector of N floating-point numbers, we may sum them as follows:
```
for i = 1 to N-1
    (a[i], a[i-1]) = 2Sum(a[i], a[i-1])
```
This performs an error-free transform of the vector a into a form where a[N-1] contains an approximation to the sum, while a[0] .. a[N-2] contain the residuals of the summation operations.
Repeating this operation “sweeps” the residuals towards the sum, decreasing the magnitudes of the residuals and increasing the accuracy of the result. In practice, repeating the algorithm once (K = 2) is usually sufficient.
Rump, Ogita, and Oishi have analyzed this algorithm, and shown that in the worst case, the error (the sum of all the residuals) converges to a value slightly larger than 0.5 ulp of the sum. This algorithm therefore does not guarantee correct rounding, but in most cases the result is correctly rounded.
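A minimal C sketch of the K-fold sweep, reusing the two_sum() helper sketched above (the name kfold_sum is hypothetical):

```c
#include <stddef.h>

/* A sketch of K-fold distillation: each pass is the error-free transform
   described above; the array is modified in place. After the passes,
   a[n-1] approximates the sum and a[0] .. a[n-2] hold small residuals. */
static double kfold_sum(double a[], size_t n, int k) {
    for (int pass = 0; pass < k; pass++) {
        for (size_t i = 1; i < n; i++) {
            two_sum(a[i], a[i - 1], &a[i], &a[i - 1]);
        }
    }
    /* Fold the (now tiny) residuals back into the approximation. */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```

Calling it with k = 2 corresponds to the K = 2 recommendation above.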
Accuracy
Muller et al. analyze various summation algorithms. In addition to the formal analysis, they give the results of two sample summations:
Errors of the various methods for Xi = RN(cos(i)), with N = 5000 and binary32 arithmetic:

| Method | Error in ulps |
| --- | --- |
| Add values in increasing order (requires sort!) | 18.90625 |
| Add values in decreasing order (requires sort!) | 16.90625 |
| Compensated addition (Kahan) | 6.90625 |
| Doubly compensated (Priest) (requires sort!) | 0.09375 |
| Rump, Ogita, and Oishi | 0.09375 |
Errors of the various methods for Xi = RN(1/i), with N = 10 and binary32 arithmetic:

| Method | Error in ulps |
| --- | --- |
| Add values in increasing order (requires sort!) | 6.860 |
| Add values in decreasing order (requires sort!) | 738.900 |
| Compensated addition (Kahan) | 0.137 |
| Doubly compensated (Priest) (requires sort!) | 0.137 |
| Rump, Ogita, and Oishi | 0.137 |
As we see, the only algorithms that consistently give accurate results (error < 0.5 ulp) are Priest’s and Rump-Ogita-Oishi’s. However, Priest’s method requires that the operands are pre-sorted.
Conclusion
The general method of operation is:
- Transform each multiplication into a pair of floating-point numbers, whose sum is an error-free representation of the result.
- Use the Rump-Ogita-Oishi K-fold addition algorithm to sum this vector of 2*N numbers, giving a result accurate to within slightly more than 0.5 ulp of the correctly-rounded result.
The simple method requires N multiplications and N-1 additions to deliver a result of dubious accuracy. The method described here requires N multiplications, N fma() operations, and 6*K*(2*N-1) additions/subtractions, but gives a highly accurate result.
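As a sketch of how the pieces fit together for a dot product (accurate_dot is a hypothetical name, and two_mult_fma and kfold_sum are the helpers sketched earlier):

```c
#include <math.h>
#include <stdlib.h>

/* A sketch of the full method: split each product exactly with
   two_mult_fma(), then sum the 2*n resulting terms with kfold_sum(). */
static double accurate_dot(const double x[], const double y[],
                           size_t n, int k) {
    if (n == 0)
        return 0.0;
    double *terms = malloc(2 * n * sizeof *terms);
    if (terms == NULL)
        return NAN;  /* allocation failure; real code would signal an error */
    for (size_t i = 0; i < n; i++) {
        /* s and t of each exact product become adjacent summands. */
        two_mult_fma(x[i], y[i], &terms[2 * i], &terms[2 * i + 1]);
    }
    double sum = kfold_sum(terms, 2 * n, k);
    free(terms);
    return sum;
}
```

With k = 2, this performs exactly the N fma() operations and 6*K*(2*N-1) additions/subtractions counted above.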
References
- IEEE Standard for Floating-Point Arithmetic (Std 754-2008)
- O. Møller. Quasi double-precision in floating-point addition. BIT, 5:37–50, 1965.
- T. J. Dekker. A floating-point technique for extending the available precision. Numerische Mathematik, 18(3):224–242, 1971.
- D. Knuth. The Art of Computer Programming, volume 2. Addison-Wesley, Reading, MA, 3rd edition, 1998.
- S. M. Rump, T. Ogita, and S. Oishi. Accurate floating-point summation part I: Faithful rounding. SIAM Journal on Scientific Computing, 31(1):189–224, 2008.
- J.-M. Muller et al. Handbook of Floating-Point Arithmetic. Birkhäuser, 2010.