
Interpreting Intel 80-bit Long Double Byte Arrays

5 Apr 2004
A simple BitConverter class that is capable of reading and writing Intel 80-bit long doubles.

Abstract

I recently found myself needing to read Intel 80-bit long doubles from a binary stream whilst integrating with another system. For cross-platform compatibility reasons, Microsoft decided not to include a long double (a.k.a. extended) type in the framework, so I was forced to interpret the bytes myself.

This is not an implementation of a long double type, but a BitConverter class that translates between long double byte arrays and regular doubles. Naturally, a certain amount of range and precision is lost during the process; however, for my purposes, this is acceptable.

Introduction

A long double is a floating point number that is 16 bits bigger than a regular double. These additional bits are used to increase both the range and precision of the number and are usually used for mathematical and scientific calculations.

This article describes how to read a long double byte array and create a regular double value from it. The inverse operation is simply a matter of reversing the procedure and will not be covered here.

Problem

The first step is to identify the different kinds of long doubles and their equivalent regular doubles. The rules that define each of these states can be found in the IEEE Standard 754 specification (see References).

Long Double          Double Equivalent
-----------          -----------------
Unsupported          Unsupported
Normal               Zero, Subnormal, Normal or Overflow (depends on e)
Subnormal            Zero
Pseudo-Denormal      Zero
Signed Zero          Zero
Positive Infinity    Positive Infinity
Negative Infinity    Negative Infinity
Quiet NaN            NaN
Signaling NaN        NaN

There are three main kinds of long doubles: normal, subnormal and pseudo-denormal. Each of these represents a different range of magnitudes, subnormal being the smallest, followed by pseudo-denormal (which overlaps the very bottom of the normal range), then normal. Since the entire range of a regular double, including its subnormals, fits into the range of a normal long double, subnormal and pseudo-denormal numbers must round to zero.

The internal bytes of a long double consist of four parts (or fields): a sign bit (s), a biased exponent (e), a significand bit (j) and a fraction (f). Doubles are arranged in a similar way, except that they don't have a significand bit and their exponent and fraction fields are smaller. The following diagram illustrates these differences and shows how the fields are translated.

Figure: Field layouts and the translation between them.

Now you're probably wondering what j is and what happens to it when translating to a double. I found it was of little significance and was only used when detecting the unsupported state, so I won't go into it here.
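As a rough sketch of how the four fields might be pulled out of the raw bytes, the following C# assumes the usual little-endian Intel layout: the 64-bit significand (j in the top bit, f in the remaining 63 bits) in bytes 0 to 7, and the sign and 15-bit exponent in bytes 8 and 9. The names LongDoubleFields and Split are hypothetical and are not taken from the article's download.

```csharp
using System;

public static class LongDoubleFields
{
    // Splits a 10-byte little-endian long double into s, e, j and f.
    // Assumption: Intel byte order, significand first, sign/exponent last.
    public static void Split(byte[] value, int startIndex,
                             out bool s, out int e, out bool j, out ulong f)
    {
        // Bytes 0..7: 64-bit significand, least significant byte first.
        ulong significand = 0;
        for (int i = 7; i >= 0; i--)
            significand = (significand << 8) | value[startIndex + i];

        // Bytes 8..9: sign in bit 15, biased exponent in bits 14..0.
        int signAndExponent = value[startIndex + 8] | (value[startIndex + 9] << 8);

        s = (signAndExponent & 0x8000) != 0;
        e = signAndExponent & 0x7FFF;
        j = (significand & 0x8000000000000000UL) != 0; // bit 63
        f = significand & 0x7FFFFFFFFFFFFFFFUL;        // bits 62..0
    }
}
```

For the long double 1.0 (bytes 00 00 00 00 00 00 00 80 FF 3F), this yields s = false, e = 0x3FFF, j = true and f = 0.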

Solution

So, now the problem can be solved by determining the type of long double and then translating e and f (if it happens to be a normal number).
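Determining the type might be sketched as follows. The rules here are my reading of the Intel extended format rather than the article's own code, and the enum and class names are hypothetical. The sign is ignored, so signed zero and the two infinities each collapse to one entry.

```csharp
using System;

public enum LongDoubleKind
{
    Unsupported, Normal, Subnormal, PseudoDenormal,
    Zero, Infinity, QuietNaN, SignalingNaN
}

public static class LongDoubleClassifier
{
    // e is the 15-bit biased exponent, j the significand bit, f the 63-bit fraction.
    public static LongDoubleKind Classify(int e, bool j, ulong f)
    {
        if (e == 0x7FFF) // all exponent bits set: infinity or NaN
        {
            if (!j) return LongDoubleKind.Unsupported; // pseudo-infinity or pseudo-NaN
            if (f == 0) return LongDoubleKind.Infinity;
            // The top fraction bit (bit 62) separates quiet from signaling NaNs.
            return (f & 0x4000000000000000UL) != 0
                ? LongDoubleKind.QuietNaN
                : LongDoubleKind.SignalingNaN;
        }

        if (e == 0) // no exponent bits set: zero or a very small number
        {
            if (j) return LongDoubleKind.PseudoDenormal;
            return f == 0 ? LongDoubleKind.Zero : LongDoubleKind.Subnormal;
        }

        // Any other exponent must have j set; otherwise it is an "unnormal".
        return j ? LongDoubleKind.Normal : LongDoubleKind.Unsupported;
    }
}
```

Only the Normal case needs the translation of e and f; every other kind maps directly to a fixed double value as listed in the table.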

Starting with f, the following code translates it into a value suitable for a double.

f >>= 11;

The long double's fraction field is 63 bits wide and the double's is 52, so f is shifted right to drop the 11 bits that do not fit. Next, we translate e.

e -= (0x3FFF - 0x3FF);

Since e is biased, to take it from 15 bits to 11 bits involves subtracting the original bias (2^14 - 1) and adding the new one (2^10 - 1).
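To make the two translations concrete, here is a small sketch that wraps them in helper methods. The class and method names are hypothetical, and the low 11 bits of f are simply truncated, as in the snippets above.

```csharp
using System;

public static class FieldTranslation
{
    // Drops the 11 low-order fraction bits that a double cannot hold (63 -> 52 bits).
    public static ulong TranslateFraction(ulong f)
    {
        return f >> 11;
    }

    // Rebases the exponent from the long double bias (0x3FFF) to the double bias (0x3FF).
    public static int TranslateExponent(int e)
    {
        return e - (0x3FFF - 0x3FF);
    }
}
```

For example, the long double 1.5 has e = 0x3FFF and f = 0x4000000000000000 (only the top fraction bit set). After translation, e becomes 0x3FF and f becomes 0x8000000000000, which are the exponent and fraction fields of the double 1.5.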

Clearly e can still fall outside the range of an 11-bit number after this translation. If e is too high, then the number is too big and cannot be represented by a double. An OverflowException is thrown in this case. If e is 0 or less, the number is too small to be a normal double and, unless it can be salvaged, will be rounded to zero.

However, if e is no less than -51, it can be salvaged by translating it into a subnormal double using some careful bit manipulation. The following code does just this:

if (e >= 0x7FF) // outside the range of a double
    throw new OverflowException();
else if (e < -51) // too small to translate into a subnormal
    return 0;
else if (e <= 0) // big enough to translate into a subnormal
{
    // An exponent field of 0 already denotes a subnormal double, so e == 0 belongs here too.
    f |= 0x10000000000000; // make the implicit leading 1 explicit (bit 52)
    f >>= (1 - e);         // halve once for each step e falls below 1
    e = 0;                 // exponent field of a subnormal double
}

To understand the above translation, it is important to understand the mathematical representation of a normal double. The following is a (simplified) definition.

2^(e - 1023) x 1.f,  e > 0

The more we reduce e below 1023, the more times we end up halving the resulting number. In this case, e would have to drop below 1, which is not allowed for a normal number. Another way to reduce the result is to halve the f component instead, that is, to shift its bits to the right. Since the fraction field of a double is 52 bits wide, 52 becomes the maximum number of shifts we can do before we're left with zero.
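The halving argument can be checked with a small worked example. This sketch assumes e has already been rebased and f already shifted down to 52 bits. It also treats a rebased e of exactly 0 as a subnormal, because an exponent field of 0 in a double means "subnormal". The class and method names are hypothetical.

```csharp
using System;

public static class SubnormalSketch
{
    // Returns the fraction field of the subnormal double for a rebased
    // exponent e in the range -51..0 and a 52-bit fraction f.
    public static ulong ToSubnormalFraction(int e, ulong f)
    {
        if (e > 0 || e < -51)
            throw new ArgumentOutOfRangeException("e");

        f |= 0x10000000000000UL; // make the implicit leading 1 explicit (bit 52)
        return f >> (1 - e);     // between 1 and 52 shifts
    }
}
```

With e = -51 and f = 0, the leading 1 is shifted 52 places and lands in the lowest bit, giving the smallest positive double (double.Epsilon). With e = 0 and f = 0, a single shift gives 0x8000000000000, which is 2^-1023.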

The last step in the process is to create a double byte array with our new field values and use the standard BitConverter to read it into a double.
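One way this last step might look is sketched below. It packs the sign, the translated 11-bit exponent and the 52-bit fraction into a 64-bit value, turns that into a byte array and hands it to the standard BitConverter. The DoubleAssembler class is hypothetical and assumes e and f are already in range for a double.

```csharp
using System;

public static class DoubleAssembler
{
    // s: sign, e: 11-bit biased exponent (0..0x7FF), f: 52-bit fraction.
    public static double Assemble(bool s, int e, ulong f)
    {
        ulong bits = (s ? 0x8000000000000000UL : 0UL) // sign in bit 63
                   | ((ulong)e << 52)                 // exponent in bits 62..52
                   | (f & 0xFFFFFFFFFFFFFUL);         // fraction in bits 51..0

        // GetBytes and ToDouble both use the machine's byte order,
        // so the round trip is consistent on any platform.
        byte[] bytes = BitConverter.GetBytes(bits);
        return BitConverter.ToDouble(bytes, 0);
    }
}
```

For instance, Assemble(false, 0x3FF, 0) produces 1.0 and Assemble(true, 0x400, 0) produces -2.0.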

Using the Code

The code is used in the same way that the System.BitConverter class is used. To read a long double byte array, use ToDouble(), specifying a byte array and start index. To generate a byte array from a double, use GetBytes(), specifying the double.

Conclusion

This project proved to be an interesting look into the structure and bitwise manipulation of floating point numbers.

Doubles are surprisingly easy to construct and deconstruct once you understand their internals.

References

History

  • 2004-04-06: Initial release.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
