Introduction
This article compares different allocation and copy methods for large byte[] buffers in managed code.
Background
Sometimes you have to deal with large byte arrays that are frequently allocated and copied. In video processing, for example, a single RGB24 frame at 320x240 needs a byte[] of 320*240*3 = 230400 bytes. Choosing the right memory allocation and memory copying strategy can be vital for your project.
Using the Code
In my current project, I have to handle hundreds of uncompressed RGB24-Frames on multi core servers in real time. To be able to choose the best architecture for my project, I compared different memory allocations and copy mechanisms.
Because I know how difficult it is to find a good way to measure, I decided to do a really simple test and get a raw comparable result. I simply run a loop for 10 seconds and count the number of loops.
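The loops below all reference two fields, bufsize and duration. They are not listed in the article text, but from the description they are presumably defined along these lines (a sketch; the exact declarations are mine):

```csharp
// One RGB24 frame at 320x240, as described in the Background section.
const int bufsize = 320 * 240 * 3;   // 230400 bytes

// Test duration: 10 seconds, expressed in DateTime ticks (100 ns units).
static readonly long duration = TimeSpan.FromSeconds(10).Ticks;
```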
Allocation
Looking around, I found five different methods to allocate large byte arrays:
new byte[]
Marshal.AllocHGlobal()
Marshal.AllocCoTaskMem()
CreateFileMapping() (shared memory)
stackalloc byte[]
new byte[]
Here is a typical loop showing the new byte[] allocation:
private static void newbyte()
{
    Console.Write("new byte[]: ");
    long start = DateTime.UtcNow.Ticks;
    int i = 0;
    while ((start + duration) > DateTime.UtcNow.Ticks)
    {
        byte[] buf = new byte[bufsize];
        i++;
    }
    Console.WriteLine(i);
}
new byte[] is completely managed code.
Marshal.AllocHGlobal()
Allocates memory from the unmanaged memory of the process.
IntPtr p = Marshal.AllocHGlobal(bufsize);
Marshal.FreeHGlobal(p);
Marshal.AllocHGlobal() returns an IntPtr, and calling it does not require unsafe code. But when you want to access the allocated memory, you will most often need unsafe code.
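The article does not list the timing loop for Marshal.AllocHGlobal(), but following the same pattern as newbyte() above, it presumably looks like this (a sketch, not the author's exact code; bufsize and duration are the same static fields used throughout):

```csharp
private static void allocHGlobal()
{
    Console.Write("Marshal.AllocHGlobal: ");
    long start = DateTime.UtcNow.Ticks;
    int i = 0;
    while ((start + duration) > DateTime.UtcNow.Ticks)
    {
        IntPtr p = Marshal.AllocHGlobal(bufsize);
        Marshal.FreeHGlobal(p);   // unmanaged memory must be freed explicitly
        i++;
    }
    Console.WriteLine(i);
}
```

Note that, unlike the new byte[] loop, this one pays for an explicit free on every iteration; the garbage collector is not involved.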
Marshal.AllocCoTaskMem()
Allocates a block of memory of specified size from the COM task memory allocator.
IntPtr p = Marshal.AllocCoTaskMem(bufsize);
Marshal.FreeCoTaskMem(p);
It has the same need for unsafe code as Marshal.AllocHGlobal().
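To illustrate why unsafe code usually comes into play: to touch the bytes behind the IntPtr, you either go through Marshal.WriteByte()/ReadByte() or cast the pointer. A minimal sketch:

```csharp
IntPtr p = Marshal.AllocCoTaskMem(bufsize);
unsafe
{
    // Cast the IntPtr to a raw byte pointer and zero the buffer.
    byte* buf = (byte*)p.ToPointer();
    for (int n = 0; n < bufsize; n++)
        buf[n] = 0;
}
Marshal.FreeCoTaskMem(p);
```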
CreateFileMapping()
To use shared memory in a managed code project, I wrote my own little helper class around the CreateFileMapping() functions.
Using shared memory is quite simple:
using (SharedMemory mem = new SharedMemory("abc", bufsize, true))
mem has a void* to the buffer and a Length property. From inside another process, you can get access to the same memory by simply passing false in the constructor (and using the same name). SharedMemory uses unsafe code.
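The helper class itself is in the download rather than the article text. A minimal sketch of such a wrapper around CreateFileMapping()/MapViewOfFile() could look like this (the P/Invoke constants are standard Win32 values, but the member names and error handling here are my own assumptions, not the author's code):

```csharp
using System;
using System.Runtime.InteropServices;

public unsafe sealed class SharedMemory : IDisposable
{
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern IntPtr CreateFileMapping(IntPtr hFile, IntPtr lpAttributes,
        uint flProtect, uint dwMaxSizeHigh, uint dwMaxSizeLow, string lpName);

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern IntPtr OpenFileMapping(uint dwDesiredAccess, bool bInheritHandle,
        string lpName);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr MapViewOfFile(IntPtr hMapping, uint dwDesiredAccess,
        uint dwOffsetHigh, uint dwOffsetLow, UIntPtr dwNumberOfBytesToMap);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool UnmapViewOfFile(IntPtr lpBaseAddress);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool CloseHandle(IntPtr hObject);

    const uint PAGE_READWRITE = 0x04;
    const uint FILE_MAP_ALL_ACCESS = 0xF001F;
    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    readonly IntPtr handle;
    public void* Pointer { get; private set; }
    public int Length { get; private set; }

    public SharedMemory(string name, int size, bool create)
    {
        // create == true: back the mapping with the page file and create it;
        // create == false: open an existing mapping of the same name.
        handle = create
            ? CreateFileMapping(INVALID_HANDLE_VALUE, IntPtr.Zero,
                                PAGE_READWRITE, 0, (uint)size, name)
            : OpenFileMapping(FILE_MAP_ALL_ACCESS, false, name);
        if (handle == IntPtr.Zero)
            throw new InvalidOperationException("file mapping failed");
        Pointer = MapViewOfFile(handle, FILE_MAP_ALL_ACCESS, 0, 0,
                                (UIntPtr)size).ToPointer();
        Length = size;
    }

    public void Dispose()
    {
        UnmapViewOfFile(new IntPtr(Pointer));
        CloseHandle(handle);
    }
}
```

This is Windows-only by nature; there is no purely managed equivalent of named shared memory in .NET 2.0/3.5, which is why the class exists at all.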
stackalloc byte[]
Allocates a byte[] on the stack. It will therefore be freed automatically when you return from the current method. Using the stack unwisely may result in stack overflows.
unsafe static void stack()
{
    byte* buf = stackalloc byte[bufsize];
}
Using stackalloc requires unsafe code, too.
Test Results
I don't want to get into single/multi-core, NUMA/non-NUMA architectures and so on. Therefore I just print some interesting results. Feel free to run the test on your own machines!
Debug/Release
Running the test in Debug and Release offers dramatic differences in the number of loops in 10 seconds:
Release
new byte[]: 425340907 100%
Marshal.AllocHGlobal: 19680751 5%
Marshal.AllocCoTaskMem: 21062645 5%
stackalloc: 341525631 80%
SharedMemory: 792007 0.2%
Debug
new byte[]: 71004 0.3%
Marshal.AllocHGlobal: 22660829 89%
Marshal.AllocCoTaskMem: 25557756 100%
stackalloc: 558497 2%
SharedMemory: 785470 3%
As you can see, new byte[] and stackalloc byte[] depend dramatically on the debug/release switch, while the other three do not. This may be because the latter are mainly handled by the kernel.
new byte[] and stackalloc byte[] are the fastest managed options in release mode and the slowest in debug mode. But remember that the garbage collector also has to handle every new byte[].
PC/Server
These two runs were done on my PC (Intel dual-core, Vista 64-bit). So let's compare it to a typical server (dual Xeon quad-core, Windows Server 2008 64-bit) in release mode:
Server Workstation
new byte[]: 553541729 425340907
Marshal.AllocHGlobal: 26460746 19680751
Marshal.AllocCoTaskMem: 28294494 21062645
stackalloc: 466980755 341525631
SharedMemory: 817317 792007
Because the test is single-threaded, the number of cores does not matter much. Remember, the garbage collector runs on its own thread.
32Bit/64Bit
Let's compare 32bit to 64bit (release):
x86-32bit x64-64bit
new byte[]: 1046577767 516441931
Marshal.AllocHGlobal: 21034715 25152330
Marshal.AllocCoTaskMem: 23467574 27787971
stackalloc: 83956017 416630753
SharedMemory: 728858 793750
Marshal.* and SharedMemory are a little faster on x64. new byte[] is up to twice as fast on x86 as on x64. And stackalloc byte[] is five times faster on x64 than on x86. I didn't expect this result!
The same pattern holds on my server.
Conclusion
So think twice before you decide which allocation method and target-platform you choose!
MemCopy
And now let's look at some memcopy variants. I use the same measuring algorithm: let one thread copy a byte[] to another byte[] in a loop for 10 seconds and count the number of copies.
Array.Copy()
Marshal.Copy()
Kernel32.dll CopyMemory()
Buffer.BlockCopy()
MemCopyInt()
MemCopyLong()
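The copy loops follow the same measuring pattern as the allocation loops. Shown here for Buffer.BlockCopy; the other variants differ only in the copy line (a sketch following the article's pattern, with bufsize and duration as before):

```csharp
private static void blockCopy()
{
    byte[] src = new byte[bufsize];
    byte[] dst = new byte[bufsize];
    Console.Write("Buffer.BlockCopy: ");
    long start = DateTime.UtcNow.Ticks;
    int i = 0;
    while ((start + duration) > DateTime.UtcNow.Ticks)
    {
        // Copy bufsize bytes from src to dst, both starting at offset 0.
        Buffer.BlockCopy(src, 0, dst, 0, bufsize);
        i++;
    }
    Console.WriteLine(i);
}
```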
Release/Debug
Release Debug
Array.Copy: 360741 361740
Marshal.Copy: 360680 359712
Kernel32NativeMethods.CopyMemory: 361314 358927
Buffer.BlockCopy: 375440 374004
OwnMemCopyInt: 217736 33833
OwnMemCopyLong: 295372 54601
As expected, only my own MemCopy was a lot slower in debug mode. Let's take a look at it:
static readonly int sizeOfInt = Marshal.SizeOf(typeof(int));

static public unsafe void MemCopy(IntPtr pSource, IntPtr pDest, int Len)
{
    unchecked
    {
        int size = sizeOfInt;
        int count = Len / size;
        int rest = Len % size;   // remaining bytes after the int-sized chunks
        int* ps = (int*)pSource.ToPointer(), pd = (int*)pDest.ToPointer();
        // Copy int-sized chunks first...
        for (int n = 0; n < count; n++)
        {
            *pd = *ps;
            pd++;
            ps++;
        }
        // ...then copy the remaining bytes one by one.
        if (rest > 0)
        {
            byte* ps1 = (byte*)ps;
            byte* pd1 = (byte*)pd;
            for (int n = 0; n < rest; n++)
            {
                *pd1 = *ps1;
                pd1++;
                ps1++;
            }
        }
    }
}
Even when you use unchecked unsafe code, the built-in copy functions perform much better than a hand-written copy loop, especially in debug mode.
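Since MemCopy() takes raw pointers, managed arrays have to be pinned before they can be passed in. A sketch of how the benchmark presumably feeds it (the pinning code here is my own illustration):

```csharp
byte[] src = new byte[bufsize];
byte[] dst = new byte[bufsize];
// Pin both arrays so the GC cannot move them while we hold raw pointers.
GCHandle hs = GCHandle.Alloc(src, GCHandleType.Pinned);
GCHandle hd = GCHandle.Alloc(dst, GCHandleType.Pinned);
try
{
    MemCopy(hs.AddrOfPinnedObject(), hd.AddrOfPinnedObject(), bufsize);
}
finally
{
    // Always unpin, or the GC can never move or collect the arrays.
    hs.Free();
    hd.Free();
}
```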
32Bit/64Bit
32Bit 64Bit
Array.Copy: 230788 360741
Marshal.Copy: 460061 360680
Kernel32NativeMethods.CopyMemory: 365850 361314
Buffer.BlockCopy: 368212 375440
OwnMemCopyInt: 218438 217736
OwnMemCopyLong: 286321 295372
In 32-bit x86 code, Marshal.Copy is significantly faster than in 64-bit code, while Array.Copy is much slower in 32-bit than in 64-bit. My own memcopy loop uses 32-bit integers and therefore runs at the same speed on both. The kernel method is not affected by this setting either.
Conclusion
It is a good idea to use the built-in memcopy functions.
Points of Interest
Try the source on your machine and compare the results.
History
- 9th February, 2009: Initial post