Abstract
I recently found myself needing to read Intel 80-bit long doubles from a binary stream whilst integrating with another system. For cross-platform compatibility reasons, Microsoft decided not to include a long double (a.k.a. extended) type in the framework, and thus I was forced to interpret the bytes myself.
This is not an implementation of a long double type, but a BitConverter class that translates between long double byte arrays and regular doubles. Naturally, a certain amount of range and precision is lost during the process; however, for my purposes, this is acceptable.
Introduction
A long double is a floating point number that is 16 bits bigger than a regular double. These additional bits increase both the range and the precision of the number, which is mostly useful in mathematical and scientific calculations.
This article describes how to read a long double byte array and create a regular double value from it. The inverse operation is simply a matter of reversing the procedure and will not be covered here.
Problem
The first step is to identify the different kinds of long doubles and their equivalent regular doubles. The rules that define each of these states can be found in the IEEE Standard 754 specification (see References).
| Long Double | Double Equivalent |
| --- | --- |
| Unsupported | Unsupported |
| Normal | Zero, Subnormal, Normal or Overflow (depends on e) |
| Subnormal | Zero |
| Pseudo-Denormal | Zero |
| Signed Zero | Zero |
| Positive Infinity | Positive Infinity |
| Negative Infinity | Negative Infinity |
| Quiet NaN | NaN |
| Signaling NaN | NaN |
There are three main kinds of long doubles: normal, subnormal and pseudo-denormal. Each of these represents an adjacent (and partly overlapping) range of numbers, subnormal being the smallest, followed by pseudo-denormal, then normal. Since the entire range of a regular double fits into the range of a normal long double, subnormal and pseudo-denormal numbers must round to zero.
The internal bytes of a long double consist of four parts (or fields): a sign bit (s), a 15-bit biased exponent (e), a significand bit (j) and a 63-bit fraction (f). Doubles are arranged in a similar way, except that they don't have a significand bit and the exponent and fraction fields are smaller (11 and 52 bits respectively). The following diagram illustrates these differences and shows how the fields are translated.
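The field layout can also be expressed in code. The following is a minimal sketch, not the article's actual implementation: the class and method names are hypothetical, and it assumes the usual little-endian x87 layout, in which bytes 0 to 7 hold j and f and bytes 8 to 9 hold s and e.

```csharp
using System;

public static class LongDoubleFields
{
    // Splits a little-endian 80-bit long double into its four fields.
    // Assumed layout: bytes 0-7 = j (bit 63) and f (bits 62-0),
    //                 bytes 8-9 = s (bit 15) and e (bits 14-0).
    public static void Split(byte[] bytes, int startIndex,
        out bool s, out int e, out bool j, out ulong f)
    {
        if (bytes == null)
            throw new ArgumentNullException(nameof(bytes));
        if (startIndex < 0 || bytes.Length - startIndex < 10)
            throw new ArgumentOutOfRangeException(nameof(startIndex));

        // Assemble the 64-bit significand byte by byte so the result does
        // not depend on the endianness of the machine running the code.
        ulong significand = 0;
        for (int i = 7; i >= 0; i--)
            significand = (significand << 8) | bytes[startIndex + i];

        int signAndExponent = bytes[startIndex + 8] | (bytes[startIndex + 9] << 8);

        s = (signAndExponent & 0x8000) != 0;
        e = signAndExponent & 0x7FFF;
        j = (significand & 0x8000000000000000UL) != 0;
        f = significand & 0x7FFFFFFFFFFFFFFFUL;
    }
}
```

For example, the long double 1.0 is stored as the bytes 00 00 00 00 00 00 00 80 FF 3F, which this sketch splits into s = false, e = 0x3FFF, j = true and f = 0.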
Now you're probably wondering what j is and what happens to it when translating to a double. I found it was of little significance and was only used when detecting the unsupported state, so I won't go into it here.
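For readers who do want to see where j comes into play, the kinds listed in the table can be told apart from e, j and f alone. The sketch below reflects my understanding of the x87 encoding rather than the article's code; the enum and class names are hypothetical, and the sign bit (which separates positive from negative zero and infinity) is left to the caller.

```csharp
using System;

public enum LongDoubleKind
{
    Unsupported, Normal, Subnormal, PseudoDenormal,
    Zero, Infinity, QuietNaN, SignalingNaN
}

public static class LongDoubleClassifier
{
    // e is the raw 15-bit exponent, j the significand bit, f the 63-bit fraction.
    public static LongDoubleKind Classify(int e, bool j, ulong f)
    {
        if (e == 0)
        {
            if (j)
                return LongDoubleKind.PseudoDenormal;
            return f == 0 ? LongDoubleKind.Zero : LongDoubleKind.Subnormal;
        }

        // For any non-zero exponent the significand bit must be set;
        // otherwise the encoding is one of the unsupported forms.
        if (!j)
            return LongDoubleKind.Unsupported;

        if (e < 0x7FFF)
            return LongDoubleKind.Normal;

        if (f == 0)
            return LongDoubleKind.Infinity;

        // With the maximum exponent, the top fraction bit separates
        // quiet NaNs (set) from signaling NaNs (clear).
        return (f & 0x4000000000000000UL) != 0
            ? LongDoubleKind.QuietNaN
            : LongDoubleKind.SignalingNaN;
    }
}
```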
Solution
So, now the problem can be solved by determining the type of long double and then translating e and f (if it happens to be a normal number).
Starting with f, the following code translates it into a value suitable for a double:
f >>= 11;
The fraction is shifted right because the double's fraction field has 11 fewer bits (52 instead of 63); the discarded low-order bits are simply truncated. Next, we translate e.
e -= (0x3FFF - 0x3FF);
Since e is biased, taking it from 15 bits to 11 bits involves subtracting the original bias (2^14 - 1 = 0x3FFF) and adding the new one (2^10 - 1 = 0x3FF).
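The two translations can be tried out in isolation. This is only an illustrative sketch; the class, constant and method names are hypothetical and do not come from the article's source.

```csharp
using System;

public static class FieldTranslation
{
    public const int LongDoubleBias = 0x3FFF; // 2^14 - 1
    public const int DoubleBias = 0x3FF;      // 2^10 - 1

    // Drops the 11 low-order fraction bits (63 bits -> 52 bits), truncating.
    public static ulong TranslateFraction(ulong f) => f >> 11;

    // Re-biases a 15-bit exponent for an 11-bit field.
    // The result may still be out of range and must be checked by the caller.
    public static int TranslateExponent(int e) => e - (LongDoubleBias - DoubleBias);
}
```

For the long double 1.5, whose fields are e = 0x3FFF and f = 0x4000000000000000, this gives e = 0x3FF and f = 0x8000000000000, which are exactly the exponent and fraction fields of the double 1.5.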
Clearly e can still fall outside the range of an 11-bit number after this translation. If e is too high, then the number is too big to be represented by a double, and an OverflowException is thrown. If e falls to zero or below, the number is too small to be a normal double and would otherwise have to be rounded to zero.
However, if e is no less than -51, it can be salvaged by translating it into a subnormal double using some careful bit manipulation. The following code does just this:
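// 0x7FF is reserved for infinity and NaN, so anything that high overflows.
if (e >= 0x7FF)
    throw new OverflowException();
// Too small even for the smallest subnormal double.
else if (e < -51)
    return 0;
// e == 0 is included: an exponent field of 0 already denotes a subnormal.
else if (e <= 0)
{
    f |= 0x10000000000000;   // make the implicit leading 1 explicit
    f >>= (1 - e);           // halve the fraction instead of lowering e
    e = 0;
}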
To understand the above translation, it is important to understand the mathematical representation of a normal double. The following is a (simplified) definition.
2^(e - 1023) × 1.f, e > 0
The more we reduce e below 1023, the more times we end up halving the resulting number. In this case, we have tried to reduce e past 0, which is not allowed. Another way to reduce the result is to halve the f component instead, that is, to bit-shift it to the right. Before shifting, the implicit leading 1 of 1.f must be written into the fraction explicitly (the 0x10000000000000 bit), because a subnormal double has no implicit 1. Since the fraction field of a double is 52 bits wide, 52 becomes the maximum number of shifts we can do before we're left with zero.
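A small worked example may make the shifting easier to follow. The helper below is a hypothetical sketch that takes an already re-biased e and a 52-bit f; it treats e == 0 as needing the subnormal path as well, which is my reading of the format, since an exponent field of 0 already means "subnormal" in a double.

```csharp
using System;

public static class SubnormalSketch
{
    // Returns the fraction field of the subnormal double that represents
    // 2^(e - 1023) x 1.f, for a re-biased exponent in the range -51..0.
    public static ulong ToSubnormalFraction(int e, ulong f)
    {
        if (e > 0 || e < -51)
            throw new ArgumentOutOfRangeException(nameof(e));

        f |= 1UL << 52;        // write the implicit leading 1 explicitly
        return f >> (1 - e);   // one halving for every step e sits below 1
    }
}
```

With e = 0 and f = 0 the result is a single bit in position 51 (the value 2^-1023), and with e = -51 the leading 1 has been shifted all the way down to bit 0, which is the smallest positive double, double.Epsilon.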
The last step in the process is to create a double byte array with our new field values and use the standard BitConverter to read it into a double.
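Putting the steps together, the conversion of a normal long double might look like the following sketch. It is not the article's actual source: the class and method names are hypothetical, the caller is assumed to have already dealt with zero, infinity, NaN and the unsupported encodings, and e == 0 is routed through the subnormal branch on my own reading of the format.

```csharp
using System;

public static class LongDoubleSketch
{
    // Converts the fields of a normal long double (raw 15-bit e, 63-bit f)
    // into a double. Throws OverflowException if the value is too large.
    public static double FromNormalFields(bool s, int e, ulong f)
    {
        f >>= 11;                  // 63-bit fraction -> 52-bit fraction (truncates)
        e -= (0x3FFF - 0x3FF);     // 15-bit bias -> 11-bit bias

        if (e >= 0x7FF)
            throw new OverflowException();
        if (e < -51)
            return 0;              // too small even for a subnormal double
        if (e <= 0)
        {
            f |= 0x10000000000000UL;   // make the implicit leading 1 explicit
            f >>= (1 - e);
            e = 0;
        }

        // Pack sign (bit 63), exponent (bits 62-52) and fraction (bits 51-0).
        ulong bits = (s ? 1UL << 63 : 0UL) | ((ulong)e << 52) | f;

        // GetBytes and ToDouble both use the machine's byte order,
        // so the round trip is safe on either endianness.
        byte[] bytes = BitConverter.GetBytes(bits);
        return BitConverter.ToDouble(bytes, 0);
    }
}
```

For instance, FromNormalFields(true, 0x3FFF, 0x4000000000000000UL) yields -1.5. BitConverter.Int64BitsToDouble would be a shorter way to perform the final step; the byte array is used here only to mirror the description above.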
Using the Code
The code is used in the same way that the System.BitConverter class is used. To read a long double byte array, use ToDouble(), specifying a byte array and start index. To generate a byte array from a double, use GetBytes(), specifying the double.
Conclusion
This project proved to be an interesting look into the structure and bitwise manipulation of floating point numbers.
Doubles are surprisingly easy to construct and deconstruct once you understand their internals.
References
History
- 2004-04-06: Initial release.