Introduction
I'm Oleksandr Karpov and this is my first article here, thanks for reading it.
Here, I'm going to show and explain how to copy data really fast and how to use assembly under C# and .NET. In my case, I use it in an application that creates video from images, video and sound.
Also, if you have an assembly method or function that you need to use under C#, this article will show you how to do it in a quick and simple way.
Background
To understand it all, it would help to know assembly language, memory alignment and some advanced C#, Windows and .NET techniques.
To copy data really fast, the memory addresses need to be 16-byte aligned; otherwise the speed is almost the same as the standard copy (in my case, about 1.02 times faster).
The code uses SSE instructions, which are supported by processors starting with the Pentium III (KNI/MMX2) and AMD Athlon (AMD EMMX).
I have tested it on my Pentium Dual-Core E5800 3.2GHz with 4GB RAM in dual mode.
For me, the fast copy method is 1.5 times faster than the standard one with 16-byte aligned memory and almost the same (1.02 times faster) with non-aligned memory addresses.
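Since the speed-up depends on alignment, it can help to check an address before choosing a copy path. The following is a minimal sketch, not part of the original code; the class and method names (AlignmentCheck, IsAligned16, AlignUp16) are hypothetical.

```csharp
using System;

// Hypothetical helper, not part of the article's FastMemCopy class.
static class AlignmentCheck
{
    // True when the address is a multiple of 16.
    public static bool IsAligned16(IntPtr address)
    {
        return (address.ToInt64() & 15) == 0;
    }

    // Rounds an address up to the next multiple of 16
    // (an already aligned address is returned unchanged).
    public static IntPtr AlignUp16(IntPtr address)
    {
        long value = address.ToInt64();
        return new IntPtr((value + 15) & ~15L);
    }
}
```

Calling AlignmentCheck.IsAligned16 on both the source and the destination pointer tells you whether the 1.5x case or the 1.02x case is the one to expect.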
To allocate 16-byte aligned memory in C# under Windows, there are three ways:
a) At this time, it seems that the Bitmap object (actually Windows itself, internally) allocates memory at a 16-byte aligned address, so we can use Bitmap for easy and quick aligned memory allocation;
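b) As a managed array, by adding 8 extra bytes (as the Windows heap is 8-byte aligned) and calculating a 16-byte aligned point within the allocated memory:
int dataLength = 4096;
byte[] buffer = new byte[dataLength + 8];
// Pin the array so the GC cannot move it while its address is in use
GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
long addr = (long)handle.AddrOfPinnedObject();
// Distance from the start of the array to the next 16-byte boundary
int bufferAlignedOffset = (int)((addr + 15) / 16 * 16 - addr);
// Call handle.Free() once the address is no longer needed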
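c) By allocating memory with the VirtualAlloc API:
IntPtr addr = VirtualAlloc(
    IntPtr.Zero,
    new UIntPtr((uint)(dataLength + 8)),
    AllocationTypes.Commit | AllocationTypes.Reserve,
    MemoryProtections.ExecuteReadWrite);
// VirtualAlloc already returns addresses aligned to the allocation granularity,
// so this rounding is only a safeguard
addr = new IntPtr(((long)addr + 15) / 16 * 16);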
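Option (b) is easier to reuse when the pinning and the offset calculation live in one place. Below is a minimal sketch of such a wrapper; the AlignedBuffer class is hypothetical and not part of the original code. It over-allocates by 15 bytes rather than 8, which is an assumption made so that the result does not depend on how the heap itself aligns arrays.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of the article's code.
sealed class AlignedBuffer : IDisposable
{
    private readonly byte[] storage;
    private GCHandle handle;

    // 16-byte aligned address inside the pinned array.
    public IntPtr Pointer { get; private set; }

    // Number of usable bytes starting at Pointer.
    public int Length { get; private set; }

    public AlignedBuffer(int length)
    {
        Length = length;
        // 15 spare bytes are enough to reach the next 16-byte boundary
        // from any starting address.
        storage = new byte[length + 15];
        // Pinning keeps the GC from moving the array while Pointer is in use.
        handle = GCHandle.Alloc(storage, GCHandleType.Pinned);
        long raw = handle.AddrOfPinnedObject().ToInt64();
        Pointer = new IntPtr((raw + 15) & ~15L);
    }

    public void Dispose()
    {
        if (handle.IsAllocated)
            handle.Free();
    }
}
```

Wrapping an instance in a using block releases the pin as soon as the copy is finished, so the array does not stay fixed in the managed heap longer than needed.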
Using the Code
This is a complete performance test that shows the measurements and how to use it all.
The FastMemCopy class contains all the logic for the fast memory copy.
First, create a default Windows Forms application project and put two buttons and a PictureBox control on the form, as we will test it on images.
Let's declare some fields:
string bitmapPath;
Bitmap bmp, bmp2;
BitmapData bmpd, bmpd2;
byte[] buffer = null;
Now, we will create two methods to handle click events for our buttons.
For standard method:
private void btnStandard_Click(object sender, EventArgs e)
{
    using (OpenFileDialog ofd = new OpenFileDialog())
    {
        if (ofd.ShowDialog() != System.Windows.Forms.DialogResult.OK)
            return;
        bitmapPath = ofd.FileName;
    }
    OpenImage();
    UnlockBitmap();
    CopyImage();
    LockBitmap();
    pictureBox1.Image = bmp2;
}
and for fast method:
private void btnFast_Click(object sender, EventArgs e)
{
    using (OpenFileDialog ofd = new OpenFileDialog())
    {
        if (ofd.ShowDialog() != System.Windows.Forms.DialogResult.OK)
            return;
        bitmapPath = ofd.FileName;
    }
    OpenImage();
    UnlockBitmap();
    FastCopyImage();
    LockBitmap();
    pictureBox1.Image = bmp2;
}
OK, now we have the buttons and event handlers, so let's implement the methods that open an image, lock and unlock the bitmaps, and perform the standard copy.
Open an image:
void OpenImage()
{
    pictureBox1.Image = null;
    buffer = null;
    if (bmp != null)
    {
        bmp.Dispose();
        bmp = null;
    }
    if (bmp2 != null)
    {
        bmp2.Dispose();
        bmp2 = null;
    }
    GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
    bmp = (Bitmap)Bitmap.FromFile(bitmapPath);
    buffer = new byte[bmp.Width * 4 * bmp.Height];
    // Note: buffer is not pinned here, so this address stays valid only
    // while the GC does not move the array
    bmp2 = new Bitmap(bmp.Width, bmp.Height, bmp.Width * 4, PixelFormat.Format32bppArgb,
        Marshal.UnsafeAddrOfPinnedArrayElement(buffer, 0));
}
Lock and unlock bitmaps:
// Note: despite its name, this method locks the bitmap bits for direct access
void UnlockBitmap()
{
    bmpd = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadWrite,
        PixelFormat.Format32bppArgb);
    bmpd2 = bmp2.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadWrite,
        PixelFormat.Format32bppArgb);
}
// Note: despite its name, this method releases the locks taken above
void LockBitmap()
{
    bmp.UnlockBits(bmpd);
    bmp2.UnlockBits(bmpd2);
}
Copy the data from one image to another and show the measured time:
void CopyImage()
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < 10; i++)
    {
        System.Runtime.InteropServices.Marshal.Copy(bmpd.Scan0, buffer, 0, buffer.Length);
    }
    sw.Stop();
    MessageBox.Show(sw.ElapsedTicks.ToString());
}
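ElapsedTicks is expressed in units of Stopwatch.Frequency, which differs between machines, so raw tick counts are hard to compare. Here is a minimal sketch for converting the result to milliseconds and averaging over several runs; the CopyTiming class and its method names are hypothetical and not part of the original code.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of the article's code.
static class CopyTiming
{
    // Converts Stopwatch ticks to milliseconds using the timer frequency.
    public static double TicksToMilliseconds(long elapsedTicks)
    {
        return elapsedTicks * 1000.0 / Stopwatch.Frequency;
    }

    // Runs the action once as a warm-up, then `iterations` more times,
    // and returns the average time per timed run in milliseconds.
    public static double MeasureAverageMs(Action action, int iterations)
    {
        action(); // warm-up so one-time costs such as JIT are not counted
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        sw.Stop();
        return TicksToMilliseconds(sw.ElapsedTicks) / iterations;
    }
}
```

For example, the body of the loop in CopyImage could be passed as a lambda to CopyTiming.MeasureAverageMs with 10 iterations, which gives a number that can be compared directly with the fast method on another machine.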
That's it for the standard copy method. There is nothing too complex here; we use the well-known System.Runtime.InteropServices.Marshal.Copy method.
And one more "middle-method" for the fast copy logic:
void FastCopyImage()
{
    FastMemCopy.FastMemoryCopy(bmpd.Scan0, bmpd2.Scan0, buffer.Length);
}
Now, let's implement the FastMemCopy class. Here is the declaration of the class and some types we will use inside of it:
internal static class FastMemCopy
{
    [Flags]
    private enum AllocationTypes : uint
    {
        Commit = 0x1000, Reserve = 0x2000,
        Reset = 0x80000, LargePages = 0x20000000,
        Physical = 0x400000, TopDown = 0x100000,
        WriteWatch = 0x200000
    }
    [Flags]
    private enum MemoryProtections : uint
    {
        Execute = 0x10, ExecuteRead = 0x20,
        ExecuteReadWrite = 0x40, ExecuteWriteCopy = 0x80,
        NoAccess = 0x01, ReadOnly = 0x02,
        ReadWrite = 0x04, WriteCopy = 0x08,
        GuardModifierflag = 0x100, NoCacheModifierflag = 0x200,
        WriteCombineModifierflag = 0x400
    }
    [Flags]
    private enum FreeTypes : uint
    {
        Decommit = 0x4000, Release = 0x8000
    }
    // Parameterless: the machine code reads its arguments from memory slots
    [UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]
    private unsafe delegate void FastMemCopyDelegate();
    private static class NativeMethods
    {
        [DllImport("kernel32.dll", SetLastError = true)]
        internal static extern IntPtr VirtualAlloc(
            IntPtr lpAddress,
            UIntPtr dwSize,
            AllocationTypes flAllocationType,
            MemoryProtections flProtect);
        [DllImport("kernel32")]
        [return: MarshalAs(UnmanagedType.Bool)]
        internal static extern bool VirtualFree(
            IntPtr lpAddress,
            uint dwSize,
            FreeTypes flFreeType);
    }
Now let's declare the method itself:
public static unsafe void FastMemoryCopy(IntPtr src, IntPtr dst, int nBytes)
{
    if (IntPtr.Size == 4)
    {
        // Executable memory that holds the machine code and its parameter slots
        IntPtr code = NativeMethods.VirtualAlloc(
            IntPtr.Zero,
            new UIntPtr((uint)x86_FastMemCopy_New.Length),
            AllocationTypes.Commit | AllocationTypes.Reserve,
            MemoryProtections.ExecuteReadWrite);
        try
        {
            Marshal.Copy(x86_FastMemCopy_New, 0, code, x86_FastMemCopy_New.Length);
            FastMemCopyDelegate _fastmemcopy =
                (FastMemCopyDelegate)Marshal.GetDelegateForFunctionPointer(code,
                    typeof(FastMemCopyDelegate));
            // The last 24 bytes of the block are three 8-byte slots holding
            // src, dst and nBytes; fill them from the end backwards
            IntPtr p = code + x86_FastMemCopy_New.Length;
            p -= 8;
            Marshal.Copy(BitConverter.GetBytes((long)nBytes), 0, p, 4);
            p -= 8;
            Marshal.Copy(BitConverter.GetBytes((long)dst), 0, p, 4);
            p -= 8;
            Marshal.Copy(BitConverter.GetBytes((long)src), 0, p, 4);
            Stopwatch sw = new Stopwatch();
            sw.Start();
            for (int i = 0; i < 10; i++)
                _fastmemcopy();
            sw.Stop();
            System.Windows.Forms.MessageBox.Show(sw.ElapsedTicks.ToString());
        }
        catch (Exception ex)
        {
            System.Windows.Forms.MessageBox.Show(ex.Message);
        }
        finally
        {
            // MEM_RELEASE needs the base address returned by VirtualAlloc and a size of 0
            NativeMethods.VirtualFree(code, 0, FreeTypes.Release);
            GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
        }
    }
    else if (IntPtr.Size == 8)
    {
        throw new ApplicationException("x64 is not supported yet!");
    }
}
and the assembly code, represented as an array of bytes (the 24 zero bytes at the end are the slots for the three parameters):
private static byte[] x86_FastMemCopy_New = new byte[]
{
0x90, 0x60, 0x95, 0x8B, 0xB5, 0x5A, 0x01, 0x00, 0x00, 0x89, 0xF0, 0x83, 0xE0, 0x0F, 0x8B, 0xBD, 0x62, 0x01, 0x00, 0x00, 0x89, 0xFB, 0x83, 0xE3, 0x0F, 0x8B, 0x8D, 0x6A, 0x01, 0x00, 0x00, 0xC1, 0xE9, 0x07, 0x85, 0xC9, 0x0F, 0x84, 0x1C, 0x01, 0x00, 0x00, 0x0F, 0x18, 0x06, 0x85, 0xC0, 0x0F, 0x84, 0x8B, 0x00, 0x00, 0x00, 0x0F, 0x18, 0x86, 0x80, 0x02, 0x00, 0x00, 0x0F, 0x10, 0x06, 0x0F, 0x10, 0x4E, 0x10, 0x0F, 0x10, 0x56, 0x20, 0x0F, 0x18, 0x86, 0xC0, 0x02, 0x00, 0x00, 0x0F, 0x10, 0x5E, 0x30, 0x0F, 0x10, 0x66, 0x40, 0x0F, 0x10, 0x6E, 0x50, 0x0F, 0x10, 0x76, 0x60, 0x0F, 0x10, 0x7E, 0x70, 0x85, 0xDB, 0x74, 0x21, 0x0F, 0x11, 0x07, 0x0F, 0x11, 0x4F, 0x10, 0x0F, 0x11, 0x57, 0x20, 0x0F, 0x11, 0x5F, 0x30, 0x0F, 0x11, 0x67, 0x40, 0x0F, 0x11, 0x6F, 0x50, 0x0F, 0x11, 0x77, 0x60, 0x0F, 0x11, 0x7F, 0x70, 0xEB, 0x1F, 0x0F, 0x2B, 0x07, 0x0F, 0x2B, 0x4F, 0x10, 0x0F, 0x2B, 0x57, 0x20, 0x0F, 0x2B, 0x5F, 0x30, 0x0F, 0x2B, 0x67, 0x40, 0x0F, 0x2B, 0x6F, 0x50, 0x0F, 0x2B, 0x77, 0x60, 0x0F, 0x2B, 0x7F, 0x70, 0x81, 0xC6, 0x80, 0x00, 0x00, 0x00, 0x81, 0xC7, 0x80, 0x00, 0x00, 0x00, 0x83, 0xE9, 0x01, 0x0F, 0x85, 0x7A, 0xFF, 0xFF, 0xFF, 0xE9, 0x86, 0x00, 0x00, 0x00,
0x0F, 0x18, 0x86, 0x80, 0x02, 0x00, 0x00, 0x0F, 0x28, 0x06, 0x0F, 0x28, 0x4E, 0x10, 0x0F, 0x28, 0x56, 0x20, 0x0F, 0x18, 0x86, 0xC0, 0x02, 0x00, 0x00, 0x0F, 0x28, 0x5E, 0x30, 0x0F, 0x28, 0x66, 0x40, 0x0F, 0x28, 0x6E, 0x50, 0x0F, 0x28, 0x76, 0x60, 0x0F, 0x28, 0x7E, 0x70, 0x85, 0xDB, 0x74, 0x21, 0x0F, 0x11, 0x07, 0x0F, 0x11, 0x4F, 0x10, 0x0F, 0x11, 0x57, 0x20, 0x0F, 0x11, 0x5F, 0x30, 0x0F, 0x11, 0x67, 0x40, 0x0F, 0x11, 0x6F, 0x50, 0x0F, 0x11, 0x77, 0x60, 0x0F, 0x11, 0x7F, 0x70, 0xEB, 0x1F, 0x0F, 0x2B, 0x07, 0x0F, 0x2B, 0x4F, 0x10, 0x0F, 0x2B, 0x57, 0x20, 0x0F, 0x2B, 0x5F, 0x30, 0x0F, 0x2B, 0x67, 0x40, 0x0F, 0x2B, 0x6F, 0x50, 0x0F, 0x2B, 0x77, 0x60, 0x0F, 0x2B, 0x7F, 0x70, 0x81, 0xC6, 0x80, 0x00, 0x00, 0x00, 0x81, 0xC7, 0x80, 0x00, 0x00, 0x00, 0x83, 0xE9, 0x01, 0x0F, 0x85, 0x7A, 0xFF, 0xFF, 0xFF, 0x8B, 0x8D, 0x6A, 0x01, 0x00, 0x00, 0x83, 0xE1, 0x7F, 0x85, 0xC9, 0x74, 0x02, 0xF3, 0xA4, 0x0F, 0xAE, 0xF8, 0x61, 0xC3,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};
}
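We call this assembly method via the delegate we created earlier.
This method works in 32-bit mode for now; I will implement the 64-bit mode later.
I will add the source code if anyone is interested in it (almost all of the code is already in the article).
Note that the assembly code throws an exception if it is run under Visual Studio, and I still don't understand why. One possible cause, judging from the byte array: the code starts with xchg eax, ebp and then reads its parameters relative to EBP, so it relies on EAX holding the address of the code on entry, which the calling convention does not guarantee.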
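Until a 64-bit version exists, a managed fallback can keep the application working in x64 processes. The following is a minimal sketch, not the article's method: it swaps in Buffer.MemoryCopy, which is available from .NET Framework 4.6, and must be compiled with unsafe code enabled. The ManagedCopyFallback class name is hypothetical.

```csharp
using System;

// Hypothetical fallback, not part of the article's FastMemCopy class.
static class ManagedCopyFallback
{
    // Copies nBytes from src to dst using the framework's own memory copy.
    // Works in both 32-bit and 64-bit processes.
    public static unsafe void Copy(IntPtr src, IntPtr dst, int nBytes)
    {
        if (nBytes < 0)
            throw new ArgumentOutOfRangeException("nBytes");
        // Third argument: size of the destination; fourth: bytes to copy.
        Buffer.MemoryCopy(src.ToPointer(), dst.ToPointer(), nBytes, nBytes);
    }
}
```

The x64 branch of FastMemoryCopy could call such a fallback instead of throwing; whether it is faster or slower than the SSE routine would have to be measured.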
Points of Interest
While implementing and testing this method, I found that the prefetchnta instruction is not described very clearly even in the Intel specification, so I tried to figure it out myself and via Google.
Also, pay attention to the movntps and movaps instructions, as they work with 16-byte aligned memory addresses only.
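To make the alignment rule concrete, here is a sketch of how the byte array appears to choose its instructions. This is a reading of the machine code by hand, not the author's assembly source, so treat it as an assumption: aligned sources seem to be loaded with movaps and unaligned ones with movups, aligned destinations seem to be stored with movntps and unaligned ones with movups, the main loop moves 128 bytes per iteration, and the remainder is copied with rep movsb. The CopyPathSketch class is hypothetical.

```csharp
using System;

// Hypothetical illustration of the apparent dispatch logic; it copies nothing.
static class CopyPathSketch
{
    // Names the load/store instruction pair that would be used
    // for the given source and destination addresses.
    public static string DescribePath(long src, long dst)
    {
        string load = (src & 15) == 0 ? "movaps" : "movups";
        string store = (dst & 15) == 0 ? "movntps" : "movups";
        return load + " + " + store;
    }

    // Splits a byte count into 128-byte blocks for the SSE loop
    // and a tail of fewer than 128 bytes for the byte-wise copy.
    public static void Split(int nBytes, out int blocks, out int tail)
    {
        blocks = nBytes >> 7;
        tail = nBytes & 127;
    }
}
```

Only the "movaps + movntps" combination avoids unaligned accesses entirely, which matches the observation that the large speed-up shows up only when both addresses are 16-byte aligned.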
History
- Bitmap and 16 byte memory alignment
- Source code and memory alignment samples were added
- First version - 12/19/2014