Introduction
This article compares different allocation and copy methods for large byte[] buffers in managed code.
Background
Sometimes you have to deal with large byte arrays that are frequently allocated and copied. In video processing, for example, a single RGB24 frame at 320x240 needs a byte[] of 320*240*3 = 230400 bytes. Choosing the right memory allocation and memory copying strategy can be vital for your project.
Using the Code
In my current project, I have to handle hundreds of uncompressed RGB24-Frames on multi core servers in real time. To be able to choose the best architecture for my project, I compared different memory allocations and copy mechanisms.
Because I know how difficult it is to find a good way to measure, I decided to do a really simple test and get a raw comparable result. I simply run a loop for 10 seconds and count the number of loops.
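The loops below all reference two fields, bufsize and duration. They are not listed in the article text, but from the description they are presumably defined along these lines (a sketch; the exact declarations are mine):

```csharp
// One RGB24 frame at 320x240, as described in the Background section.
const int bufsize = 320 * 240 * 3;   // 230400 bytes

// Test duration: 10 seconds, expressed in DateTime ticks (100 ns units).
static readonly long duration = TimeSpan.FromSeconds(10).Ticks;
```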
Allocation
Looking around, I found five different methods to allocate large byte arrays:
new byte[]
Marshal.AllocHGlobal()
Marshal.AllocCoTaskMem()
CreateFileMapping() (shared memory)
stackalloc byte[]
new byte[]
Here is a typical loop showing the new byte[] allocation:
private static void newbyte()
{
    Console.Write("new byte[]: ");
    long start = DateTime.UtcNow.Ticks;
    int i = 0;
    while ((start + duration) > DateTime.UtcNow.Ticks)
    {
        byte[] buf = new byte[bufsize];
        i++;
    }
    Console.WriteLine(i);
}
new byte[] is completely managed code.
Marshal.AllocHGlobal()
Allocates memory from the unmanaged memory of the process.
IntPtr p = Marshal.AllocHGlobal(bufsize);
Marshal.FreeHGlobal(p);
Marshal.AllocHGlobal() returns an IntPtr, and calling it does not require unsafe code. But when you want to access the allocated memory, you will most often need unsafe code.
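The article does not list the timing loop for Marshal.AllocHGlobal(), but following the same pattern as newbyte() above, it presumably looks like this (a sketch, not the author's exact code; bufsize and duration are the same static fields used throughout):

```csharp
private static void allocHGlobal()
{
    Console.Write("Marshal.AllocHGlobal: ");
    long start = DateTime.UtcNow.Ticks;
    int i = 0;
    while ((start + duration) > DateTime.UtcNow.Ticks)
    {
        IntPtr p = Marshal.AllocHGlobal(bufsize);
        Marshal.FreeHGlobal(p);   // unmanaged memory must be freed explicitly
        i++;
    }
    Console.WriteLine(i);
}
```

Note that, unlike the new byte[] loop, this one pays for an explicit free on every iteration; the garbage collector is not involved.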
Marshal.AllocCoTaskMem()
Allocates a block of memory of specified size from the COM task memory allocator.
IntPtr p = Marshal.AllocCoTaskMem(bufsize);
Marshal.FreeCoTaskMem(p);
It has the same need for unsafe code as Marshal.AllocHGlobal().
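To illustrate why unsafe code usually comes into play: to touch the bytes behind the IntPtr, you either go through Marshal.WriteByte()/ReadByte() or cast the pointer. A minimal sketch:

```csharp
IntPtr p = Marshal.AllocCoTaskMem(bufsize);
unsafe
{
    // Cast the IntPtr to a raw byte pointer and zero the buffer.
    byte* buf = (byte*)p.ToPointer();
    for (int n = 0; n < bufsize; n++)
        buf[n] = 0;
}
Marshal.FreeCoTaskMem(p);
```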
CreateFileMapping()
To use shared memory in a managed code project, I wrote my own little helper class around the CreateFileMapping() functions.
Using shared memory is quite simple:
using (SharedMemory mem = new SharedMemory("abc", bufsize, true))
mem has a void* to the buffer and a Length property. From inside another process, you can get access to the same memory by simply passing false in the constructor (and using the same name). SharedMemory uses unsafe code.
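The helper class itself is in the download rather than the article text. A minimal sketch of such a wrapper around CreateFileMapping()/MapViewOfFile() could look like this (the P/Invoke constants are standard Win32 values, but the member names and error handling here are my own assumptions, not the author's code):

```csharp
using System;
using System.Runtime.InteropServices;

public unsafe sealed class SharedMemory : IDisposable
{
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern IntPtr CreateFileMapping(IntPtr hFile, IntPtr lpAttributes,
        uint flProtect, uint dwMaxSizeHigh, uint dwMaxSizeLow, string lpName);

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern IntPtr OpenFileMapping(uint dwDesiredAccess, bool bInheritHandle,
        string lpName);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr MapViewOfFile(IntPtr hMapping, uint dwDesiredAccess,
        uint dwOffsetHigh, uint dwOffsetLow, UIntPtr dwNumberOfBytesToMap);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool UnmapViewOfFile(IntPtr lpBaseAddress);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool CloseHandle(IntPtr hObject);

    const uint PAGE_READWRITE = 0x04;
    const uint FILE_MAP_ALL_ACCESS = 0xF001F;
    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    readonly IntPtr handle;
    public void* Pointer { get; private set; }
    public int Length { get; private set; }

    public SharedMemory(string name, int size, bool create)
    {
        // create == true: back the mapping with the page file and create it;
        // create == false: open an existing mapping of the same name.
        handle = create
            ? CreateFileMapping(INVALID_HANDLE_VALUE, IntPtr.Zero,
                                PAGE_READWRITE, 0, (uint)size, name)
            : OpenFileMapping(FILE_MAP_ALL_ACCESS, false, name);
        if (handle == IntPtr.Zero)
            throw new InvalidOperationException("file mapping failed");
        Pointer = MapViewOfFile(handle, FILE_MAP_ALL_ACCESS, 0, 0,
                                (UIntPtr)size).ToPointer();
        Length = size;
    }

    public void Dispose()
    {
        UnmapViewOfFile(new IntPtr(Pointer));
        CloseHandle(handle);
    }
}
```

This is Windows-only by nature; there is no purely managed equivalent of named shared memory in .NET 2.0/3.5, which is why the class exists at all.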
stackalloc byte[]
Allocates a byte[] on the stack. It will therefore be freed automatically when you return from the current method. Using the stack unwisely may result in stack overflows.
unsafe static void stack()
{
    byte* buf = stackalloc byte[bufsize];
}
Using stackalloc requires unsafe code, too.
Test Results
I don't want to get into single/multi-core, NUMA/non-NUMA architectures and so on. Therefore I just print some interesting results. Feel free to run the test on your own machines!
Debug/Release
Running the test in Debug and Release offers dramatic differences in the number of loops in 10 seconds:
Release
new byte[]: 425340907 100%
Marshal.AllocHGlobal: 19680751 5%
Marshal.AllocCoTaskMem: 21062645 5%
stackalloc: 341525631 80%
SharedMemory: 792007 0.2%
Debug
new byte[]: 71004 0.3%
Marshal.AllocHGlobal: 22660829 89%
Marshal.AllocCoTaskMem: 25557756 100%
stackalloc: 558497 2%
SharedMemory: 785470 3%
As you can see, new byte[] and stackalloc byte[] depend dramatically on the debug/release switch, while the other three do not. This may be because the latter are mainly handled by the kernel.
new byte[] and stackalloc byte[] are the fastest managed options in release mode and the slowest in debug mode. But remember that the garbage collector also has to handle every new byte[].
PC/Server
These two runs were done on my PC (Intel dual-core, Vista 64-bit). So let's compare it to a typical server (dual Xeon quad-core, Windows Server 2008 64-bit) in release mode:
Server Workstation
new byte[]: 553541729 425340907
Marshal.AllocHGlobal: 26460746 19680751
Marshal.AllocCoTaskMem: 28294494 21062645
stackalloc: 466980755 341525631
SharedMemory: 817317 792007
Because the test is single-threaded, the number of cores does not matter much. Remember, the garbage collector runs on its own thread.
32Bit/64Bit
Let's compare 32bit to 64bit (release):
x86-32bit x64-64bit
new byte[]: 1046577767 516441931
Marshal.AllocHGlobal: 21034715 25152330
Marshal.AllocCoTaskMem: 23467574 27787971
stackalloc: 83956017 416630753
SharedMemory: 728858 793750
Marshal.* and SharedMemory are a little faster on x64. new byte[] is up to twice as fast on x86 as on x64. And stackalloc byte[] is five times faster on x64 than on x86. I didn't expect this result!
The same pattern holds on my server.
Conclusion
So think twice before you decide which allocation method and target-platform you choose!
MemCopy
And now let's look at some memcopy variants. I use the same measuring algorithm: let one thread copy a byte[] to another byte[] in a loop for 10 seconds and count the number of copies.
Array.Copy()
Marshal.Copy()
Kernel32.dll CopyMemory()
Buffer.BlockCopy()
MemCopyInt()
MemCopyLong()
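The copy loops follow the same measuring pattern as the allocation loops. Shown here for Buffer.BlockCopy; the other variants differ only in the copy line (a sketch following the article's pattern, with bufsize and duration as before):

```csharp
private static void blockCopy()
{
    byte[] src = new byte[bufsize];
    byte[] dst = new byte[bufsize];
    Console.Write("Buffer.BlockCopy: ");
    long start = DateTime.UtcNow.Ticks;
    int i = 0;
    while ((start + duration) > DateTime.UtcNow.Ticks)
    {
        // Copy bufsize bytes from src to dst, both starting at offset 0.
        Buffer.BlockCopy(src, 0, dst, 0, bufsize);
        i++;
    }
    Console.WriteLine(i);
}
```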
Release/Debug
Release Debug
Array.Copy: 360741 361740
Marshal.Copy: 360680 359712
Kernel32NativeMethods.CopyMemory: 361314 358927
Buffer.BlockCopy: 375440 374004
OwnMemCopyInt: 217736 33833
OwnMemCopyLong: 295372 54601
As expected, only my own MemCopy was a lot slower in debug mode. Let's take a look at it:
static readonly int sizeOfInt = Marshal.SizeOf(typeof(int));

static public unsafe void MemCopy(IntPtr pSource, IntPtr pDest, int Len)
{
    unchecked
    {
        int size = sizeOfInt;
        int count = Len / size;
        int rest = Len % size;   // remaining bytes after the int-sized chunks
        int* ps = (int*)pSource.ToPointer(), pd = (int*)pDest.ToPointer();
        // Copy int-sized chunks first...
        for (int n = 0; n < count; n++)
        {
            *pd = *ps;
            pd++;
            ps++;
        }
        // ...then copy the remaining bytes one by one.
        if (rest > 0)
        {
            byte* ps1 = (byte*)ps;
            byte* pd1 = (byte*)pd;
            for (int n = 0; n < rest; n++)
            {
                *pd1 = *ps1;
                pd1++;
                ps1++;
            }
        }
    }
}
Even when you use unchecked unsafe code, the built-in copy functions perform much better than a hand-written copy loop, especially in debug mode.
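Since MemCopy() takes raw pointers, managed arrays have to be pinned before they can be passed in. A sketch of how the benchmark presumably feeds it (the pinning code here is my own illustration):

```csharp
byte[] src = new byte[bufsize];
byte[] dst = new byte[bufsize];
// Pin both arrays so the GC cannot move them while we hold raw pointers.
GCHandle hs = GCHandle.Alloc(src, GCHandleType.Pinned);
GCHandle hd = GCHandle.Alloc(dst, GCHandleType.Pinned);
try
{
    MemCopy(hs.AddrOfPinnedObject(), hd.AddrOfPinnedObject(), bufsize);
}
finally
{
    // Always unpin, or the GC can never move or collect the arrays.
    hs.Free();
    hd.Free();
}
```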
32Bit/64Bit
32Bit 64Bit
Array.Copy: 230788 360741
Marshal.Copy: 460061 360680
Kernel32NativeMethods.CopyMemory: 365850 361314
Buffer.BlockCopy: 368212 375440
OwnMemCopyInt: 218438 217736
OwnMemCopyLong: 286321 295372
In 32-bit x86 code, Marshal.Copy is significantly faster than in 64-bit code, while Array.Copy is much slower in 32-bit than in 64-bit. My own memcopy loop uses 32-bit integers and therefore runs at the same speed on both. The kernel method is not affected by this setting either.
Conclusion
It is a good idea to use the built-in memcopy functions.
Points of Interest
Try the source on your machine and compare the results.
History
- 9th February, 2009: Initial post