UInt128 Series
I think that I shall never envision
An op unlovely as division
An op whose answer must be guessed
And then, through multiply, assessed;
An op for which we dearly pay,
In cycles wasted every day.
Division code is often hairy;
Long division's downright scary.
The proofs can overtax your brain,
The ceiling and floor may drive you insane.
Good code to divide takes a Knuthian hero,
But even God can't divide by zero!
- Henry S. Warren Jr.
Hacker's Delight
Introduction
Welcome to the Boss Battle of Integer Arithmetic. Of all the basic operations, there is no other which has annoyed, irritated, and caused as many sleepless nights as the Division operation. Some of the greatest optimizations in programming have been finding ways to speed up or in some cases, eliminate all together, the Division operation.
I will be showing you two methods for Dividing UInt128's: The Binary Shift and the Divide and Conquer. And before you go to the Comments and ask: "What about Newton-Raphson Division?"
the reason I won't be going over it is that, while it is a very useful method for Division:
- Newton-Raphson Division works very well for larger integers, where the difference in speed between Division and Multiplication is much, much greater. For smaller integers, it's not as fast, especially when you have to calculate the inverse each time.
- Newton-Raphson is primarily useful when you have to divide by the same number multiple times. In that case having to calculate the inverse first is easily worth it; but if you only have to divide by it once, then due to Reason 1, it's not always as fast.
Binary Shift
The Binary Shift Algorithm is one of the simplest division algorithms. The basic idea is to shift the Divisor so that it's highest bit is even with the highest bit of the Dividend. If the Divisor is less than or equal to the Dividend, then you subtract it from the Dividend and add 1 to the Quotient. Then you shift the quotient left once, and the Divisor right once. You keep doing this until the Divisor is back where it started. The Remainder will be what's left of the Dividend.
The binary shift has a bad reputation, because when the Dividend is much larger than the Divisor, it's very slow. However I've found that it's still a very practical algorithm when the Divisor and Dividend are roughly the same size. In that case, it is even faster than the Divide and Conquer algorithm. I've done some informal benchmarking, and it seems that the threshold between the two is when the Dividend is more than 5 bits longer than the Divisor.
void bindivmod128(uint128 M, uint128 N, uint128& Q, uint128 &R)
{
Q.Hi = Q.Lo = 0;
size_t Shift = nlz128(N) - nlz128(M);
shiftleft128(N, Shift, N);
do
{
shiftleft128(Q, 1, Q);
if(compare128(M, N) >= 0)
{
sub128(M, N, M);
Q.Lo |= 1;
}
shiftright128(N, 1, N);
}while(Shift-- != 0);
R.Hi = M.Hi;
R.Lo = M.Lo;
}
Divide and Conquer
So the basic idea behind the Divide and Conquer approach is that instead of calculating the entire Quotient at once, we will generate a "trial" quotient. Using that, we will calculate the remainder. If it's greater than the Divisor, then we increment the quotient and subtract the Divisor from the Remainder. Now if we were to just do that, then it's possible the trial quotient could be way off, in which case this could be even slower than the Binary Shift. But the trick is that we manipulate the Dividend and Divisor so that we can guarantee that it will never be that far off.
That's really about as much as I am capable of explaining. I adapted this code from Hacker's Delight. If you are looking for a more indepth explanation, I suggest you look in there.
So in order to use the Divide and Conquer algorithm, we will need two functions. The first Divides a 128 bit Dividend by a 64 bit Divisor. It calculates both the Quotient and the Remainder.
void divmod128by64(const uint64 u1, const uint64 u0, uint64 v, uint64& q, uint64& r)
{
const uint64 b = 1ll << 32;
uint64 un1, un0, vn1, vn0, q1, q0, un32, un21, un10, rhat, left, right;
size_t s;
s = nlz64(v);
v <<= s;
vn1 = v >> 32;
vn0 = v & 0xffffffff;
if (s > 0)
{
un32 = (u1 << s) | (u0 >> (64 - s));
un10 = u0 << s;
}
else
{
un32 = u1;
un10 = u0;
}
un1 = un10 >> 32;
un0 = un10 & 0xffffffff;
q1 = un32 / vn1;
rhat = un32 % vn1;
left = q1 * vn0;
right = (rhat << 32) + un1;
again1:
if ((q1 >= b) || (left > right))
{
--q1;
rhat += vn1;
if (rhat < b)
{
left -= vn0;
right = (rhat << 32) | un1;
goto again1;
}
}
un21 = (un32 << 32) + (un1 - (q1 * v));
q0 = un21 / vn1;
rhat = un21 % vn1;
left = q0 * vn0;
right = (rhat << 32) | un0;
again2:
if ((q0 >= b) || (left > right))
{
--q0;
rhat += vn1;
if (rhat < b)
{
left -= vn0;
right = (rhat << 32) | un0;
goto again2;
}
}
r = ((un21 << 32) + (un0 - (q0 * v))) >> s;
q = (q1 << 32) | q0;
}
Yes, it uses goto. Like I said, I adapted it.
The second function we need is the main function to divide a 128 bit integer by another. This one calls DivMod128by64
when needed.
static void divmod128by128(uint128 M, uint128 N, uint128& Q, uint128& R)
{
if (N.Hi == 0)
{
if (M.Hi < N.Lo)
{
divmod128by64(M.Hi, M.Lo, N.Lo, Q.Lo, R.Lo);
Q.Hi = 0;
R.Hi = 0;
return;
}
else
{
Q.Hi = M.Hi / N.Lo;
R.Hi = M.Hi % N.Lo;
divmod128by64(R.Hi, M.Lo, N.Lo, Q.Lo, R.Lo);
R.Hi = 0;
return;
}
}
else
{
size_t n = nlz64(N.Hi);
uint128 v1;
shiftleft128(N, n, v1);
uint128 u1;
shiftright128(M, 1, u1);
uint128 q1;
div128by64(u1.Hi, u1.Lo, v1.Hi, q1.Lo);
q1.Hi = 0;
shiftright128(q1, 63 - n, q1);
if ((q1.Hi | q1.Lo) != 0)
{
dec128(q1, q1);
}
Q.Hi = q1.Hi;
Q.Lo = q1.Lo;
mult128(q1, N, q1);
sub128(M, q1, R);
if (compare128(R, N) >= 0)
{
inc128(Q, Q);
sub128(R, N, R);
}
return;
}
}
An Efficient Division Function
As I mentioned above, there are some cases where the Binary Shift algorithm is faster than the Divide and Conquer algorithm, so you need one more function that is able to choose the appropriate one. You also have other cases where you speed things up such as Division by 1 or another power of 2. As well as having to look out for Division by 0.
This means we need a primary function capable of handling all of these situations.
void divmod128(uint128 M, uint128 N, uint128& Q, uint128& R)
{
size_t Nlz, Mlz, Ntz;
int C;
Nlz = nlz128(N);
Mlz = nlz128(M);
Ntz = ntz128(N);
if(Nlz == 128)
{
throw 0;
}
else if((M.Hi | N.Hi) == 0)
{
Q.Hi = R.Hi = 0;
Q.Lo = M.Lo / N.Lo;
R.Lo = M.Lo % N.Lo;
return;
}
else if(Nlz == 127)
{
Q = M;
R.Hi = R.Lo = 0;
return;
}
else if((Ntz + Nlz) == 127)
{
shiftright128(M, Ntz, Q);
dec128(N, N);
and128(N, M, R);
return;
}
C = compare128(M, N);
if(C < 0)
{
Q.Hi = Q.Lo = 0;
R = M;
return;
}
else if(C == 0)
{
Q.Hi = R.Hi = R.Lo = 0;
Q.Lo = 1;
return;
}
if((Nlz - Mlz) > 5)
{
divmod128by128(M, N, Q, R);
}
else
{
bindivmod128(M, N, Q, R);
}
}
Conclusion
First off, I'd like to apologize for the long delay between articles. I kept having problems getting the example code to work, and between that and work and school, it just snowballed into one big frustration for me.
However, in the event I missed any errors, feel free to let me know as always.