
Avoiding Overflow, Underflow, and Loss of Precision

Describes why the most obvious way of evaluating functions may be bad and how to do better

Introduction

The most obvious way of evaluating mathematical functions might not work. For example, how would you calculate log(x+1)/x? The most obvious approach would be to add 1 to x, take the log, then divide by x. For some values of x, that will produce an accurate result. For other values, it will be off by 100%.

This article gives four examples where the most direct evaluation of functions can fail. Along the way, it gives four general rules for accurately evaluating functions.

The code included with the article applies the principles in the article to evaluating nine functions that often come up in statistics.

Example 1: Quadratic Equation

Suppose you want to find the roots of 3x<sup>2</sup> + 10<sup>9</sup>x + 5 = 0. This should be simple: just apply the quadratic formula. What could go wrong? Let's see.

C++
double a = 3.0, b = 1.0e9, c = 5.0;
double d = b*b - 4.0*a*c;
double r1 = (-b - sqrt(d))/(2.0*a);
double r2 = (-b + sqrt(d))/(2.0*a);   // subtracts the nearly equal numbers b and sqrt(d)

This code computes r1 accurately, but it computes r2 as 0 even though the correct value to machine precision is about -5.0e-9. The absolute error in computing r2 is small, but the relative error is 100%. Why is one root computed accurately and the other not?

The cardinal rule in numerical analysis is to avoid subtracting nearly equal numbers: the more nearly equal two numbers are, the more precision is lost in the subtraction. Since b is so much larger than a and c, the quantity sqrt(d) is very nearly the same as b. In fact, b and sqrt(d) agree to more significant figures than a double can hold, so the subtraction in the numerator of r2 wipes out essentially all of them. How do we compute r2 more accurately?

Rule 1: Use algebraic trickery to avoid loss of precision.

There's an equivalent but much less familiar form of the quadratic formula. Multiply the numerator and denominator of the standard form by the numerator with the ± sign flipped and simplify; the troublesome root becomes r2 = -2c/(b + sqrt(b<sup>2</sup> - 4ac)). This form adds b and sqrt(d) in the denominator, avoiding the subtraction of nearly equal numbers. The revised code below computes r2 accurately:

C++
double a = 3.0, b = 1.0e9, c = 5.0;
double d = b*b - 4.0*a*c;
double r1 = (-b - sqrt(d))/(2.0*a);
double r2 = -2.0*c/(b + sqrt(d));   // b and sqrt(d) are added, not subtracted

Note that d still rounds to b*b in machine precision, but for these coefficients the error that introduces in r2 is far below machine precision, so the computed root is accurate to machine precision. If you need still more accuracy than the quadratic formula can deliver, abandon it and use a root-finding algorithm.
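
The revised formula still assumes b is positive; if b were negative, the denominator b + sqrt(d) would be the one to cancel. A minimal sketch of a sign-aware version, using the classic trick of computing the well-conditioned root first and recovering the other from the product of the roots (the function name and structure below are illustrative, not taken from the attached code):

C++
#include <cmath>
#include <utility>

// Roots of a*x^2 + b*x + c = 0, assuming a != 0, a non-negative discriminant,
// and q != 0. The intermediate q adds two quantities of the same sign, so no
// cancellation occurs; the two roots are then q/a and c/q.
std::pair<double, double> QuadraticRoots(double a, double b, double c)
{
    double d = b*b - 4.0*a*c;
    double q = -0.5*(b + (b >= 0.0 ? 1.0 : -1.0)*std::sqrt(d));
    return std::make_pair(q/a, c/q);
}

For the example above, this returns -3.3333e8 and -5.0e-9 with no special-casing.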

Along these same lines, see Comparing three methods of computing standard deviation for numerical examples showing how three algebraically equivalent formulas can have very different behavior. A similar phenomenon occurs when fitting a line to data.

Example 2: Ratios of Factorials

Probability calculations often involve taking the ratio of very big numbers to produce a moderate-sized number. The final result may fit inside a double with room to spare, but the intermediate results would overflow. For example, suppose you need to calculate the number of ways to select 10 objects from a set of 200. This is 200!/(190! 10!), about 2.2e16. But 200! and 190! would overflow the range of a double.

There are two ways around this problem. Both use the following rule.

Rule 2: Use algebraic trickery to avoid overflow.

The first trick is to recognize that 200! = 200*199*198*...*191*190! and so 200!/(190! 10!) = 200*199*198*...*191/10!. This certainly works, but it's limited to factorials. A more general technique is to use logarithms to avoid overflow: take the logarithm of the expression you want to evaluate and then exponentiate the result. In this example, log( 200!/(190! 10!) ) = log(200!) - log(190!) - log(10!). If you have code that calculates the logarithm of factorials directly without calculating factorials first, you could use it to find the logarithm of the result you want, then apply the exp function. For more on this example, see Five tips for floating point computing.
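
If your math library has the C99/C++11 function lgamma, it already does this: lgamma(x) returns log(Γ(x)), and Γ(n + 1) = n!, so lgamma(n + 1.0) is log(n!) computed without ever forming n!. A short sketch of the binomial coefficient evaluated this way (the helper name LogChoose is mine, not something in the attached code):

C++
#include <cmath>
#include <cstdio>

// Log of the binomial coefficient C(n, k), computed entirely in log space.
double LogChoose(double n, double k)
{
    return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
}

int main()
{
    // Exponentiating recovers the moderate-sized answer, about 2.2e16.
    std::printf("%g\n", std::exp(LogChoose(200.0, 10.0)));
    return 0;
}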

Example 3: log(1 + x)

Now let's look at the example of computing log(x+1) mentioned in the introduction. Consider the following code:

C++
double x = 1e-16;
double y = log(1 + x)/x;

This code sets y to 0, even though the correct value equals 1 to machine precision. What went wrong?

Double precision numbers are accurate to about 15 decimal places, so 1 + x equals 1 to machine precision. The log of 1 is zero, so y is set to zero. But for small values of x, log(1 + x) is approximately x, and so log(1+x)/x is approximately 1. That means the code above for computing log(1 + x)/x returns a result with 100% relative error. If x is not so small that 1 + x equals 1 in the machine, we can still have problems. If x is moderately small, the bits in x are not totally lost in computing 1 + x, but some of them are. The closer x is to 0, the more bits are lost.

This problem is similar to the problem with the quadratic equation, so we might try a similar solution. Unfortunately, algebraic trickery won't help this time, so we turn to our third rule.

Rule 3: Use analytical approximations to avoid loss of precision.

The numerical analyst's favorite form of approximation is the power series. The power series for log(1 + x) is x - x<sup>2</sup>/2 + x<sup>3</sup>/3 - ... For small values of x, simply returning x for log(1 + x) is already an improvement. That works well for the smallest values of x, but for some not-so-small values it isn't accurate enough, and neither is computing log(1 + x) directly. The sample code included with this article uses a rational approximation for small values of x and direct calculation for larger values.
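
If your math library provides the C99/C++11 function log1p, it computes log(1 + x) accurately and is the simplest fix. Failing that, here is a minimal sketch of the switch-over idea; the cutoff and number of series terms are illustrative (near the cutoff the direct branch still loses a few digits, which is why the attached code uses a rational approximation instead):

C++
#include <cmath>

// Sketch of log(1 + x) that stays accurate near x = 0.
double LogOnePlusXSeries(double x)
{
    if (std::fabs(x) >= 1e-4)
        return std::log(1.0 + x);   // for |x| this large, 1 + x keeps most of x's bits
    // Four series terms; the first omitted term is x^5/5, smaller than the
    // result by a factor of roughly x^4/5 < 2e-17, below double precision.
    return x*(1.0 - x*(1.0/2.0 - x*(1.0/3.0 - x/4.0)));
}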

Example 4: Inverse Logit

Next let's look at computing f(x) = e<sup>x</sup>/(1 + e<sup>x</sup>). (Statisticians call this the "inverse logit" function because it is the inverse of the function they call the "logit" function.) The most direct approach would be to compute exp(x)/(1 + exp(x)). Let's see where that can break down.

C++
double x = 1000;
double t = exp(x);
double y = t/(1.0 + t);

Printing out y gives -1.#IND (Microsoft's notation for a NaN). The calculation of t overflowed: exp(1000) is larger than the largest double, so t becomes infinity and the ratio infinity/(1 + infinity) evaluates to NaN. But we can figure out the value of y easily. If t is so big that we can't store it, then 1 + t is essentially the same as t and the ratio is very nearly 1. This suggests we find out which values of x are so big that f(x) will equal 1 to machine precision, then just return 1 for those values of x to avoid the possibility of overflow. This is our fourth rule.

Rule 4: Don't calculate a result you can accurately predict.

The header file float.h defines a constant DBL_EPSILON, the gap between 1 and the next larger representable double; adding anything much smaller than that to 1 just gives 1 back. A little algebra shows that if x is bigger than -log(DBL_EPSILON), then f(x) will equal 1 to machine precision: writing f(x) = 1/(1 + e<sup>-x</sup>), the difference 1 - f(x) is approximately e<sup>-x</sup>, which falls below DBL_EPSILON exactly when x > -log(DBL_EPSILON). So here's a code snippet for calculating f(x) for large values of x:

C++
const double x_max = -log(DBL_EPSILON);
if (x > x_max) return 1.0;
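
Putting the pieces together, here is a sketch of the whole function (the structure below is illustrative and not necessarily how the attached LogitInverse is written). Rewriting f(x) as 1/(1 + e<sup>-x</sup>) for positive x keeps the argument of exp non-positive, so nothing can overflow even without the early return:

C++
#include <cfloat>
#include <cmath>

// Inverse logit f(x) = exp(x)/(1 + exp(x)).
double LogitInverseSketch(double x)
{
    const double x_max = -std::log(DBL_EPSILON);   // about 36.04
    if (x > x_max)
        return 1.0;                                 // Rule 4: result is 1 to machine precision
    if (x > 0.0)
        return 1.0/(1.0 + std::exp(-x));            // exp(-x) <= 1, cannot overflow
    return std::exp(x)/(1.0 + std::exp(x));         // exp(x) <= 1, cannot overflow
}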

Using the Code

The code provided with this article calculates nine functions that come up in statistics. Each avoids problems of overflow, underflow, or loss of precision that could occur for large negative arguments, large positive arguments, or arguments near zero.

  • LogOnePlusX computes log(1 + x) as in Example 3
  • ExpMinusOne computes e<sup>x</sup>-1
  • Logit computes log(x/(1-x))
  • LogitInverse computes e<sup>x</sup>/(1 + e<sup>x</sup>) as discussed in Example 4
  • LogLogitInverse computes log(e<sup>x</sup>/(1 + e<sup>x</sup>))
  • LogitInverseDifference computes LogitInverse(x) - LogitInverse(y)
  • LogOnePlusExpX computes log(1 + exp(x))
  • ComplementaryLogLog computes log( -log(1 - x) )
  • ComplementaryLogLogInverse computes 1.0 - exp(-exp(x))
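
The first two of these compute the same quantities as the C99/C++11 library functions log1p and expm1, which make a convenient cross-check. This little comparison is mine, not part of the attached code:

C++
#include <cmath>
#include <cstdio>

int main()
{
    double x = 1e-16;
    // The naive forms collapse to 0 because 1 + x and exp(x) both round to 1.
    std::printf("log(1+x) naive: %g   log1p: %g\n", std::log(1.0 + x), std::log1p(x));
    std::printf("exp(x)-1 naive: %g   expm1: %g\n", std::exp(x) - 1.0, std::expm1(x));
    return 0;
}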

Points of Interest

The solutions presented here look unnecessary or even wrong at first. If this kind of code isn't well commented, someone will come through and "simplify" it incorrectly. They'll be proud of all the unnecessary clutter they removed. And if they do not test extreme values, their new code will appear to work correctly. The wrong answers and NaNs will only appear later.

History

  • 16th April, 2008: Initial post
  • 18th April, 2008: Revised article to clarify a couple of points
  • 9th October, 2008: Added links to further examples
  • 29th October, 2008: Added examples

License

This article, along with any associated source code and files, is licensed under The BSD License