Introduction
The Binary Operations extension functions apply to byte arrays, to provide an easy and hopefully fast way to use the basic binary operators.
The operators AND, OR, XOR, NOT, Shift Left, Shift Right are provided.
The provided functions rely on the System.Threading.Tasks.Parallel
library and on many unsafe
zones where pointers
are used to access sequentially the elements of the byte arrays.
Background
In the beginning, I was using a simple for
cycle that repeated the same binary operation on all the elements of the arrays, like in the following statement:
public static byte[] Bin_And_noParallel_noPointers(this byte[] ba, byte[] bt)
{
int longlen = Math.Max(ba.Length, bt.Length);
int shortlen = Math.Min(ba.Length, bt.Length);
byte[] result = new byte[longlen];
for (int i = 0; i < shortlen; i++)
{
result[i] = (byte)(ba[i] & bt[i]);
}
return result;
}
Then a question popped into my mind: "how can I make it faster?" I then tried three solutions (which are implemented in the code at the end
of the article): parallelism, use of pointers, and substitute the pointers to bytes with pointers to integers.
Using the code
The provided class was written using C# 4.0 (in the .NET environment). Once you can access its namespace, this class extends the functionality of the byte
array (byte[]
) with these six functions:
public static byte[] Bin_And(this byte[] ba, byte[] bt)
public static byte[] Bin_Or(this byte[] ba, byte[] bt)
public static byte[] Bin_Xor(this byte[] ba, byte[] bt)
public static byte[] Bin_Not(this byte[] ba)
public static byte[] Bin_ShiftLeft(this byte[] ba, int bits)
public static byte[] Bin_ShiftRight(this byte[] ba, int bits)
These functions are extended methods and are to be used in the way explained in the following example, where a binary 'AND' operation is executed, between two byte arrays:
byte[] MyFirstByteArray, MySecondByteArray;
byte[] result = MyFirstByteArray.Bin_And(MySecondByteArray);
If you are going to try these functions, remember to check the box 'Allow unsafe code' in Project -> Properties -> Build, since the pointers are used in usafe zones...
I've tested these operations by filling the byte arrays with literature texts (like the 'Divine Comedy' from the famous Italian poet Dante Alighieri, who wrote it around the
years 1308 and 1321; so there shouldn't be copyright problems) longer than 500,000 bytes, and it worked fine.
Explanation of the code
As I mentioned before, during my journey to improve the simple binary functions, I tried three different strategies to make them faster: parallelism, pointers, and integers instead of bytes.
At first I tried using the simple System.Threading.Tasks.Parallel.For
and the execution speed increased a bit, but not so much.
Then I tried the pointer approach: accessing the array through the index (as, for example, in MyArray[175]
) is slower than accessing
the same data through a pointer that is moved sequentially from the first to the last byte. This second approach gave good results, similar, if not even faster, than the parallel approach.
Then... why not try and combine them?
Well, using the Parallel.For
was a bad idea, since the overall speed dropped very much.
The final solution was a bit more complicated, but the increase of speed (especially with large arrays) has been worth the while:
look hereinafter to the sample code for the 'AND' binary operation:
public static byte[] Bin_And(this byte[] ba, byte[] bt)
{
int lenbig = Math.Max(ba.Length, bt.Length);
int lensmall = Math.Min(ba.Length, bt.Length);
byte[] result = new byte[lenbig];
int ipar = 0;
object o = new object();
System.Action paction = delegate()
{
int actidx;
lock (o)
{
actidx = ipar++;
}
unsafe
{
fixed (byte* ptres = result, ptba = ba, ptbt = bt)
{
byte* pr = (byte*)ptres;
byte* pa = (byte*)ptba;
byte* pt = (byte*)ptbt;
pr += actidx; pa += actidx; pt += actidx;
while (pr < ptres + lensmall)
{
*pr = (byte)(*pt & *pa);
pr += paralleldegree; pa += paralleldegree; pt += paralleldegree;
}
}
}
};
System.Action[] actions = new Action[paralleldegree];
for (int i = 0; i < paralleldegree; i++)
{
actions[i] = paction;
}
Parallel.Invoke(actions);
return result;
}
In this function, a delegate (containing the code to execute the AND operation) is declared, and then the same delegate is invoked a number of times
equal to the paralleldegree
variable (which is a class level variable assigned inside the static constructor (as you can see in the complete class code, at the end of the article)
and contains the number of processors available for the current machine). The various instances of the delegate are, of course, invoked in a parallel manner.
Inside the delegate, pointers are declared to the input and output byte arrays, and then, inside the while
cycle, they are moved to the next element at each cycle.
The core of the function is the simple line *pr = (*pt & *pa);
, where the AND binary operation is executed between the input values (*pt
and *pa
) putting the result in the output (*pr
).
Now, if paralleldegree
is 1 (i.e., no parallelism), this delegate is quite easy to understand.
But let's assume, for example, paralleldegree
is 4. This creates the problem that the four instances of the delegate would execute exactly the same code four times,
resulting in a very slow execution time. So, the idea, when paralleldegree
is 4 (or any number greater than 1), is to make each delegate start from
a different byte of the input byte arrays, and then, inside the while
cycle, jump to the next 4 bytes, at every cycle, thus calculating the result
without collisions with the other delegates running in parallel.
In this case, the first delegate would calculate the bytes 0, 4, 8, 12, etc; the second delegate would calculate the bytes 1, 5, 9, 13, etc., the third 2, 6, 10, 14 ...,
and the fourth 3, 7, 11, 15... The initial lock
statement serves exactly the purpose of making the delegates start from sequential points,
thanks to the ipar
variable (function level) that is increased at each call (four times when paralleldegree
is 4), and
the actidx
variable (delegate level), which is assigned only once (for every delegate) and acts something like the 'initial offset' of the pointers,
for every delegate (as can be seen in the line pr += actidx; pa += actidx; pt += actidx;
).
Using the parallelism together with the pointers resulted in a very good increase in execution speed (by the waw: to analyze the execution speed, I used to put
the function to be tested inside a loop, and executing it 100 or 1000 times, measuring the time with the System.Diagnostics.Stopwatch
class); but a further improvement
can be done, by replacing the byte pointers (byte*
) with uint
pointers (uint*
): uint
is four times bigger than byte
,
so every cycle would be four times shorter. Well, the speed increased again; not four times, but 'only' two times. So, this last improvement was, in my opinion, worth the while.
Full class code
using System;
using System.Threading.Tasks;
namespace MySpace
{
public static class BinOps
{
private const int bitsinbyte = 8;
private static readonly int paralleldegree;
private static readonly int uintsize;
private static readonly int bitsinuint;
static BinOps()
{
paralleldegree = Environment.ProcessorCount;
uintsize = sizeof(uint) / sizeof(byte); bitsinuint = uintsize * bitsinbyte;
}
public static byte[] Bin_And(this byte[] ba, byte[] bt)
{
int lenbig = Math.Max(ba.Length, bt.Length);
int lensmall = Math.Min(ba.Length, bt.Length);
byte[] result = new byte[lenbig];
int ipar = 0;
object o = new object();
System.Action paction = delegate()
{
int actidx;
lock (o)
{
actidx = ipar++;
}
unsafe
{
fixed (byte* ptres = result, ptba = ba, ptbt = bt)
{
uint* pr = (uint*)ptres;
uint* pa = (uint*)ptba;
uint* pt = (uint*)ptbt;
pr += actidx; pa += actidx; pt += actidx;
while (pr < ptres + lensmall)
{
*pr = (*pt & *pa);
pr += paralleldegree; pa += paralleldegree; pt += paralleldegree;
}
}
}
};
System.Action[] actions = new Action[paralleldegree];
for (int i = 0; i < paralleldegree; i++)
{
actions[i] = paction;
}
Parallel.Invoke(actions);
return result;
}
public static byte[] Bin_Or(this byte[] ba, byte[] bt)
{
int lenbig = Math.Max(ba.Length, bt.Length);
int lensmall = Math.Min(ba.Length, bt.Length);
byte[] result = new byte[lenbig];
int ipar = 0;
object o = new object();
System.Action paction = delegate()
{
int actidx;
lock (o)
{
actidx = ipar++;
}
unsafe
{
fixed (byte* ptres = result, ptba = ba, ptbt = bt)
{
uint* pr = (uint*)ptres;
uint* pa = (uint*)ptba;
uint* pt = (uint*)ptbt;
pr += actidx; pa += actidx; pt += actidx;
while (pr < ptres + lensmall)
{
*pr = (*pt | *pa);
pr += paralleldegree; pa += paralleldegree; pt += paralleldegree;
}
uint* pl = ba.Length > bt.Length ? pa : pt;
while (pr < ptres + lenbig)
{
*pr = *pl;
pr += paralleldegree; pl += paralleldegree;
}
}
}
};
System.Action[] actions = new Action[paralleldegree];
for (int i = 0; i < paralleldegree; i++)
{
actions[i] = paction;
}
Parallel.Invoke(actions);
return result;
}
public static byte[] Bin_Xor(this byte[] ba, byte[] bt)
{
int lenbig = Math.Max(ba.Length, bt.Length);
int lensmall = Math.Min(ba.Length, bt.Length);
byte[] result = new byte[lenbig];
int ipar = 0;
object o = new object();
System.Action paction = delegate()
{
int actidx;
lock (o)
{
actidx = ipar++;
}
unsafe
{
fixed (byte* ptres = result, ptba = ba, ptbt = bt)
{
uint* pr = ((uint*)ptres) + actidx;
uint* pa = ((uint*)ptba) + actidx;
uint* pt = ((uint*)ptbt) + actidx;
while (pr < ptres + lensmall)
{
*pr = (*pt ^ *pa);
pr += paralleldegree; pa += paralleldegree; pt += paralleldegree;
}
uint* pl = ba.Length > bt.Length ? pa : pt;
while (pr < ptres + lenbig)
{
*pr = *pl;
pr += paralleldegree; pl += paralleldegree;
}
}
}
};
System.Action[] actions = new Action[paralleldegree];
for (int i = 0; i < paralleldegree; i++)
{
actions[i] = paction;
}
Parallel.Invoke(actions);
return result;
}
public static byte[] Bin_Not(this byte[] ba)
{
int len = ba.Length;
byte[] result = new byte[len];
int ipar = 0;
object o = new object();
System.Action paction = delegate()
{
int actidx;
lock (o)
{
actidx = ipar++;
}
unsafe
{
fixed (byte* ptres = result, ptba = ba)
{
uint* pr = (uint*)ptres;
uint* pa = (uint*)ptba;
pr += actidx; pa += actidx;
while (pr < ptres + len)
{
*pr = ~(*pa);
pr += paralleldegree; pa += paralleldegree;
}
}
}
};
System.Action[] actions = new Action[paralleldegree];
for (int i = 0; i < paralleldegree; i++)
{
actions[i] = paction;
}
Parallel.Invoke(actions);
return result;
}
public static byte[] Bin_ShiftLeft(this byte[] ba, int bits)
{
int ipar = 0;
object o = new object();
int len = ba.Length;
if (bits >= len * bitsinbyte) return new byte[len];
int shiftbits = bits % bitsinuint;
int shiftuints = bits / bitsinuint;
byte[] result = new byte[len];
if (len > 1)
{
unsafe
{
fixed (byte* fpba = ba, fpres = result)
{
uint* pres = (uint*)fpres + shiftuints;
uint* pba = (uint*)fpba;
*pres = *pba << shiftbits;
}
}
System.Action paction = delegate()
{
int actidx;
lock (o)
{
actidx = ipar++;
}
unsafe
{
fixed (byte* fpba = ba, fpres = result)
{
uint* pres = (uint*)fpres + shiftuints + actidx + 1;
uint* pba1 = (uint*)fpba + actidx + 1;
if (shiftbits == 0)
{
while (pres < fpres + len)
{
*pres = *pba1;
pres += paralleldegree; pba1 += paralleldegree;
}
}
else
{
uint* pba2 = (uint*)fpba + actidx;
while (pres < fpres + len)
{
*pres = *pba2 >> (bitsinuint - shiftbits) | *pba1 << shiftbits;
pres += paralleldegree; pba1 += paralleldegree; pba2 += paralleldegree;
}
}
}
};
};
System.Action[] actions = new Action[paralleldegree];
for (int i = 0; i < paralleldegree; i++)
{
actions[i] = paction;
}
Parallel.Invoke(actions);
}
return result;
}
public static byte[] Bin_ShiftRight(this byte[] ba, int bits)
{
int ipar = 0;
object o = new object();
int len = ba.Length;
if (bits >= len * bitsinbyte) return new byte[len];
int ulen = len / uintsize + 1 - (uintsize - (len % uintsize)) / uintsize;
int shiftbits = bits % bitsinuint;
int shiftuints = bits / bitsinuint;
byte[] result = new byte[len];
if (len > 1)
{
unsafe
{
fixed (byte* fpba = ba, fpres = result)
{
uint* pres = (uint*)fpres + ulen - shiftuints - 1;
uint* pba = (uint*)fpba + ulen - 1;
*pres = *pba >> shiftbits;
}
}
System.Action paction = delegate()
{
int actidx;
lock (o)
{
actidx = ipar++;
}
unsafe
{
fixed (byte* fpba = ba, fpres = result)
{
uint* pres = (uint*)fpres + actidx;
uint* pba1 = (uint*)fpba + shiftuints + actidx;
if (shiftbits == 0)
{
while (pres < ((uint*)fpres) + ulen - shiftuints - 1)
{
*pres = *pba1;
pres += paralleldegree; pba1 += paralleldegree;
}
}
else
{
uint* pba2 = (uint*)fpba + shiftuints + actidx + 1;
while (pres < ((uint*)fpres) + ulen - shiftuints - 1)
{
*pres = (*pba1 >> shiftbits | *pba2 << (bitsinuint - shiftbits));
pres += paralleldegree; pba1 += paralleldegree; pba2 += paralleldegree;
}
}
}
};
};
System.Action[] actions = new Action[paralleldegree];
for (int i = 0; i < paralleldegree; i++)
{
actions[i] = paction;
}
Parallel.Invoke(actions);
}
return result;
}
}
}