
Accurate Calculation of Sums of Multiples

2 Mar 2017 · CPOL · 3 min read
A method for nearly correct rounding of sums of multiples

Introduction

Calculation of sums of multiples is required for many algorithms. It is so important that it has been added to the recommended functions list in the IEEE Standard for Floating-Point Arithmetic (IEEE Std 754-2008, section 9.4). Examples of operations requiring such calculations are:

  • Complex multiplication and division
  • Vector length
  • Vector dot products

Unfortunately, the most common methods for performing these operations may lead to unacceptable errors due to rounding of intermediate results.

Error-free Transforms

Exact Multiplication

Given the availability of the fma() operation (as required by IEEE Std 754-2008, and as present in many modern CPUs), and assuming no overflow/underflow, it is possible to calculate the exact value of a*b using two floating-point operations:

2MultFMA(a, b):
  s = a * b
  if s is non-finite
    return (s, 0)
  t = fma(a, b, -s)
  return (s,t)

The exact result is given as s+t.

  • Without the check, if a*b overflows, then s will equal +infinity, while t will equal -infinity, and the sum will be a NaN
  • If a*b underflows, the result will not be exact, but will give the best approximation

Exact Addition

2Sum(a, b):
  s = RN(a + b)
  if s is non-finite
    return (s, 0)
  b' = RN(s - a)
  a' = RN(s - b')
  δb = RN(b - b')
  δa = RN(a - a')
  t = RN(δa + δb)
  return (s, t)

where RN() means "round to nearest, ties to even".

The exact result is given as s+t.

  • Without the check, if a+b overflows, then the result will be a NaN
  • If a+b underflows, the result will still be exact
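2Sum translates directly into C. Again this is a sketch with names of my own (`two_sum`, `dd`); it relies on round-to-nearest arithmetic, so it must not be compiled with `-ffast-math` or similar reassociation flags.

```c
#include <math.h>

/* A pair (s, t) whose exact sum s + t represents a real value. */
typedef struct { double s; double t; } dd;

/* 2Sum (Knuth/Møller): s = RN(a + b), t = exact rounding error,
   so that s + t == a + b exactly (barring overflow). */
dd two_sum(double a, double b) {
    dd r;
    r.s = a + b;
    if (!isfinite(r.s)) {     /* overflow guard */
        r.t = 0.0;
        return r;
    }
    double bb = r.s - a;      /* b' */
    double aa = r.s - bb;     /* a' */
    double db = b - bb;       /* δb */
    double da = a - aa;       /* δa */
    r.t = da + db;
    return r;
}
```

For instance, two_sum(1.0, 1e-20) returns s = 1.0 and t = 1e-20: the small addend is lost in s but preserved exactly in t.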

Summation of Multiple Components

Error-free summation of more than two components is surprisingly difficult. The algorithm presented here is a variant of compensated addition, where all operations are exact.

Given a vector of N floating-point numbers, we may sum them as follows:

for i = 1 to N-1
    (a[i], a[i-1]) = 2Sum(a[i], a[i-1])

This performs an error-free transform of the vector a into a form where a[N-1] contains an approximation to the sum, while a[0] .. a[N-2] contain the residuals of the summation operations.

Repeating this operation "sweeps" the residuals towards the sum, decreasing the magnitudes of the residuals and increasing the accuracy of the result. In practice, two passes in total (K = 2) are usually sufficient.

Rump, Ogita, and Oishi have analyzed this algorithm, and shown that in the worst case, the error (the sum of all the residuals) converges to a value slightly larger than 0.5 ulp of the sum. This algorithm therefore does not guarantee correct rounding, but in most cases the result is correctly rounded.

Accuracy

Muller et al. perform an analysis of various summation algorithms. In addition to the formal analysis, the results of two sample additions are given:

Errors of the various methods for Xi = RN(cos(i)), with N = 5000 and binary32 arithmetic:

  Method                                           Error in ulps
  Add values in increasing order (requires sort!)  18.90625
  Add values in decreasing order (requires sort!)  16.90625
  Compensated addition (Kahan)                     6.90625
  Doubly compensated (Priest) (requires sort!)     0.09375
  Rump, Ogita, and Oishi                           0.09375

Errors of the various methods for Xi = RN(1/i), with N = 10 and binary32 arithmetic:

  Method                                           Error in ulps
  Add values in increasing order (requires sort!)  6.860
  Add values in decreasing order (requires sort!)  738.900
  Compensated addition (Kahan)                     0.137
  Doubly compensated (Priest) (requires sort!)     0.137
  Rump, Ogita, and Oishi                           0.137

As we see, the only algorithms that consistently give accurate results (error < 0.5 ulp) in both tests are Priest's and Rump-Ogita-Oishi's. However, Priest's method requires that the operands be pre-sorted.

Conclusion

The general method of operation is:

  • Transform each multiplication into a pair of floating-point numbers, whose sum is an error-free representation of the result.
  • Use the Rump-Ogita-Oishi K-fold addition algorithm to sum this vector of 2*N numbers, giving a result accurate to within slightly more than 0.5 ulp of the correctly-rounded result.

The simple method requires N multiplications and N-1 additions to deliver a result of dubious accuracy. The method described here requires N multiplications, N fma() operations, and 6*K*(2*N-1) additions/subtractions, but gives a highly accurate result.

References

  1. IEEE Standard for Floating-Point Arithmetic (Std 754-2008)
  2. O. Møller. Quasi double-precision in floating-point addition. BIT, 5:37–50, 1965.
  3. T. J. Dekker. A floating-point technique for extending the available precision. Numerische Mathematik, 18(3):224–242, 1971.
  4. D. Knuth. The Art of Computer Programming, volume 2. Addison-Wesley, Reading, MA, 3rd edition, 1998.
  5. S. M. Rump, T. Ogita, and S. Oishi. Accurate floating-point summation part I: Faithful rounding. SIAM Journal on Scientific Computing, 31(1):189–224, 2008.
  6. J.-M. Muller et al. Handbook of Floating-Point Arithmetic. Birkhäuser, 2010.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)