Introduction
This article first looks at some common algorithms programmers write for compute-intensive image processing applications. It then explains how existing code can be improved, going from native C++ to MMX acceleration. MMX is a feature of Intel processors that can speed up multimedia software by executing integer instructions on several data elements in parallel. Even though the processor supports the acceleration, your application does not benefit from it automatically; it must be written specifically to take advantage of it.
Once you are able to use MMX acceleration, you can go one step further with SSE (Streaming SIMD Extensions), which adds parallel floating point calculations, and SSE2, which extends both floating point and integer calculation. Be aware that not all hardware supports these accelerations. To keep the scope small, I have separated that explanation into another article (to be posted soon).
Now let's take a look at the Intel processors that support each feature.
- Pentium MMX, Pro, Pentium 2 - MMX
- Pentium 3, Xeon - MMX, SSE
- Pentium 4 and above - MMX, SSE, SSE2 / HT (Hyper-Threading)
The article mainly applies to developers using Intel processor hardware. For AMD users, many of the MMX assembly instructions are compatible! Some basic assembly language knowledge is needed, and I will also assume you have some knowledge of the MFC doc/view architecture.
The reason I wrote this article is that I think there is a need to highlight some high performance computing knowledge, especially for developers working on time-critical applications. Image processing is a good example for demo purposes.
To run the downloadable demo, you will need at least a Pentium processor with MMX to see the results.
Background
In image processing, 3D graphics, and similar fields, performance is critical. The less time an algorithm takes, the better, because more functionality can be squeezed in.
In my examples, I generate two images of size 1024 x 1024. Each is one megabyte in size. For the sake of simplicity, both are 8-bit grayscale images. I create synthetic images so that dependencies on other classes/objects/DLLs are kept to a minimum. Image 1 contains white vertical lines on a black background; image 2 is a gradient fill of gray.
The two images will be added together to create a third image. The final product is an image with both the gradient fill and the vertical lines.
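To make the setup concrete, here is a minimal, portable sketch of the two test images and the saturated addition that combines them, at a reduced size. The exact line spacing and gradient formula are my assumptions; the demo's real generator may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Image 1: white vertical lines on a black background.
// (A line every 4 pixels is an assumption for illustration.)
std::vector<unsigned char> MakeLines(int w, int h)
{
    std::vector<unsigned char> img(w * h, 0);      // black background
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; x += 4)
            img[y * w + x] = 255;                  // white vertical line
    return img;
}

// Image 2: a left-to-right gray gradient.
std::vector<unsigned char> MakeGradient(int w, int h)
{
    std::vector<unsigned char> img(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            img[y * w + x] = static_cast<unsigned char>(x * 255 / (w - 1));
    return img;
}

// Saturated addition: the core per-pixel operation of all four algorithms.
std::vector<unsigned char> AddSaturate(const std::vector<unsigned char>& a,
                                       const std::vector<unsigned char>& b)
{
    std::vector<unsigned char> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = static_cast<unsigned char>(std::min(a[i] + b[i], 255));
    return out;
}
```

The clamp to 255 is what distinguishes image addition from a plain integer add: two bright pixels must produce white, not wrap around to a dark value.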
I could simply present the best-performing algorithm in the demo, but instead I want to show you the difference between various kinds of code. I have developed four algorithms that demonstrate image addition: two in C++ and two in assembly language. Around each algorithm I put a high resolution timer to measure the elapsed time. The final image is shown on screen in the main window, and the two source images are displayed in two mini windows on the right. The results for different kinds of PC are presented in a later section of the article.
Writing the algorithms
For some basic understanding, let's go through this C++ code.
void CMMXDemoDoc::OnRunbasic()
{
    int x=0, y=0, i=0, iGray=0;
    m_el.Begin();
    for (y=0; y<m_iHeight; y++)
    {
        for (x=0; x<m_iWidth; x++)
        {
            i = x + y*m_iWidth;                // index of the current pixel
            iGray = m_pImg1[i] + m_pImg2[i];   // add the two source pixels
            if (iGray > 255) iGray = 255;      // saturate to 8 bits
            m_pImg[i] = iGray;
        }
    }
    m_el.End();
    UpdateAllViews(0);
}
m_pImg, m_pImg1, and m_pImg2 are all one-dimensional arrays of unsigned char. All images are the same size.
There are two for loops: the outer loop iterates through each line of the image, and the inner loop iterates through each pixel in the line. For every pixel, we calculate the index into the array, add the pixel from image 1 and image 2, check whether the sum is saturated (> 255), and then store it into the destination image.
What is the problem with this algorithm?
- There are a lot of arithmetic operations: besides the pixel addition, the index calculation involves an addition and a multiplication, and multiplication is slower than addition. Try changing the addition operator to a multiply and see the difference in the demo.
- Array indexing is used to access memory. This causes some slowdown because the processor has to convert each array index into an actual memory address.
- The two levels of for loops add overhead of their own; the compiler may spill the counter register (e.g. ecx) to the stack, and the resulting pushes and pops cost processor cycles.
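Before moving to pointers, the per-pixel multiply in the index calculation can already be eliminated by carrying the index forward incrementally, since the pixels are visited in memory order. A sketch (the free-function form and parameter names here are mine; the demo uses member variables):

```cpp
// Sketch: eliminate the per-pixel multiply x + y*w by noting that the
// pixels are visited sequentially, so a single running index suffices.
void AddIncremental(const unsigned char* img1, const unsigned char* img2,
                    unsigned char* dst, int w, int h)
{
    int i = 0;                            // runs 0 .. w*h-1
    for (int y = 0; y < h; ++y)
    {
        for (int x = 0; x < w; ++x, ++i)  // ++i replaces x + y*w
        {
            int gray = img1[i] + img2[i];
            if (gray > 255) gray = 255;   // saturate
            dst[i] = static_cast<unsigned char>(gray);
        }
    }
}
```

This is the stepping stone to the pointer version in the next section, which removes the indexing entirely.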
Optimizing using pointer arithmetic
void CMMXDemoDoc::OnRunopt()
{
    m_el.Begin();
    BYTE *pSrc    = m_pImg2;
    BYTE *pSrcEnd = m_pImg2 + m_iSize;
    BYTE *pDest   = m_pImg;
    int iGray;
    memcpy(m_pImg, m_pImg1, m_iSize);    // pre-fill destination with image 1
    while (pSrc < pSrcEnd)
    {
        iGray = *pDest + *pSrc;          // add image 2 onto the copy
        if (iGray > 255) iGray = 255;    // saturate
        *pDest = iGray;
        pSrc++;
        pDest++;
    }
    m_el.End();
    UpdateAllViews(0);
}
Notice that I first copy image 1 into the destination image. This lets the loop walk only two pointers instead of three, and memcpy() is an optimized function in VC (though memcpy() itself still takes time).
The code above uses pointers to access the data. There is little more to optimize here, since each pointer already holds the memory address of its data and moves sequentially from one element to the next. The source and destination pointers are each incremented by one after a pixel has been calculated.
Converting to assembly
The code below shows how to write the algorithm in assembly. The elapsed time of the function will be much more consistent, since we control exactly which instructions execute.
int AsmAdd(BYTE *d, BYTE *s, int w, int h)
{
    int iCount = w * h;
    // d must already contain the first image (see the memcpy above);
    // s points to the second image. We assume the registers are free
    // for our use.
    __asm
    {
        // assign pointers and counter to registers
        mov esi, [s]
        mov edi, [d]
        mov ecx, [iCount]
    codeloop:
        mov al, [edi]     // load destination pixel
        add al, [esi]     // add source pixel
        jnc nosat         // no carry: sum fits in 8 bits
        mov al, 255       // carry: saturate to 255
    nosat:
        mov [edi], al
        inc esi
        inc edi
        dec ecx
        jnz codeloop
    }
    return 1;
}
Utilising MMX
We have 8 general purpose registers in the 80x86/Pentium processor: eax, ebx, ecx, edx, esi, edi, ebp, and esp. These registers hold operands for logical and arithmetic operations, operands for address calculations, and memory pointers. Register esp holds the stack pointer most of the time, which leaves us 7 registers. When we have lots of data, we have to make sure it is handled well within those 7 registers. Besides these, there are another eight 80-bit registers located in the FPU (Floating Point Unit). If the processor has the MMX feature, we can use them as MMX registers: the MMX state is aliased onto the low 64 bits of each 80-bit floating point register, which is why only 64 of the 80 bits are accessible. Either way, we are happy with 8 additional registers, named mm0-mm7.
MMX Technology Execution Environment
So what can we do with these 8 additional MMX registers? They will not run individual instructions any faster, since the clock speed of the CPU fixes how many operations it can execute in a given time. So how do we gain? By executing MMX instructions that operate on a preloaded stream of bytes in parallel!
SIMD Execution model
Data Types introduced with MMX Technologies
Parallel execution saves clock cycles per element processed. You can load eight 8-bit, four 16-bit, or two 32-bit values into a 64-bit MMX register. To load or store, you use movd (32 bits) or movq (64 bits). Then you perform calculations by calling MMX instructions. The MMX instruction set consists of 47 instructions, grouped into several categories:
data transfer, arithmetic, comparison, conversion, unpacking, logical, shift, and the empty-MMX-state instruction (EMMS).
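A point worth stressing: the register itself holds plain bits; whether those 64 bits are treated as eight bytes, four words, or two dwords is decided by the instruction, not the register. A portable sketch of the same idea (the helper name is mine; the result shown assumes a little-endian machine such as x86):

```cpp
#include <cstdint>
#include <cstring>

// View the same 8 bytes ("the register contents") as four 16-bit words.
// On little-endian hardware the bytes {0x41, 0x42, ...} form the word
// 0x4241, and so on.
uint16_t WordAt(const uint8_t bytes[8], int n)
{
    uint64_t q;
    std::memcpy(&q, bytes, 8);   // the "register": 64 raw bits
    uint16_t w[4];
    std::memcpy(w, &q, 8);       // reinterpreted as four 16-bit words
    return w[n];
}
```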
Study the code below: an unsigned saturated add is performed on eight 8-bit values at once, held in two 64-bit MMX registers.
unsigned char mask[8] = {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01};
// txt = "abcdefgh"
unsigned char txt[8]  = {0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48};
__asm {
    movq    mm0, txt    // load 8 bytes of text
    movq    mm1, mask   // load 8 bytes of mask
    paddusb mm0, mm1    // packed add, unsigned saturated, on bytes
    movq    txt, mm0    // store the result back
    emms                // empty MMX state so the FPU can be used again
}
The string txt[], which contained "abcdefgh", will now be "bcdefghi".
By using MMX, only four instructions are executed and 8 bytes of data are processed at a time. A typical assembly algorithm without MMX would have to loop 8 times, each iteration executing at least a move and an addition, for a total of at least 16 instructions. The speed improvement? Expect a gain of at least 2x in elapsed time.
In Visual C++ 6.0, the easiest way to use MMX is to write assembly code with the inline assembler, as the code above illustrates. The full MMX instruction set is documented on the MSDN web site. Intrinsics (wrapper functions) are available in VC++ .NET.
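For readers on newer toolchains, here is a sketch of the same saturated add written with intrinsics instead of inline assembly. I use the SSE2 intrinsic _mm_adds_epu8 (the 128-bit descendant of paddusb, processing 16 bytes at a time) because modern compilers accept it everywhere; the MMX intrinsic _mm_adds_pu8 works the same way on 8 bytes. The helper name is mine, not part of the demo.

```cpp
#include <emmintrin.h>   // SSE2 intrinsics

// Saturated per-byte add of one 16-byte block of src onto dst,
// equivalent to two iterations of the paddusb example above.
void AddSaturate16(unsigned char* dst, const unsigned char* src)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    a = _mm_adds_epu8(a, b);   // per-byte unsigned add with saturation
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), a);
}
```

Unlike inline assembly, intrinsics let the compiler handle register allocation and scheduling, and the same source compiles for both 32-bit and 64-bit targets.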
For this demo you should have at least a Pentium processor with MMX to produce the desired results.
Executing the demo
Building and executing the demo is pretty straightforward. The source code is presented together with some inline assembler instructions (well commented). There are also some helper classes and functions:
- CElapsed - provides high resolution timing using QueryPerformanceCounter().
- CMiniView - serves additional views for a CDocument class, and contains some raster drawing code to assist image display.
- CreateNewViews() - shows how to create additional views for a typical document.
- CheckSSEAvail() - checks the availability of SSE in your processor. You can also check for other processor features such as MMX and HT.
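The feature flags that a routine like CheckSSEAvail() inspects come from CPUID leaf 1, register EDX (bit positions per the Intel manuals). The cpuid call itself is compiler-specific (inline asm in VC 6.0, the __cpuid intrinsic in later compilers), but the bit testing is plain C++; a sketch with hypothetical names:

```cpp
// Feature bits reported in EDX by CPUID leaf 01H (Intel SDM):
constexpr unsigned kMMXBit  = 1u << 23;  // CPUID.01H:EDX bit 23
constexpr unsigned kSSEBit  = 1u << 25;  // CPUID.01H:EDX bit 25
constexpr unsigned kSSE2Bit = 1u << 26;  // CPUID.01H:EDX bit 26

// Given the EDX value returned by CPUID leaf 1, test one feature bit.
bool HasFeature(unsigned edxValue, unsigned featureBit)
{
    return (edxValue & featureBit) != 0;
}
```

Checking at startup and choosing the appropriate code path is what lets one binary run on both MMX and non-MMX processors.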
Test Results
| System | Array Index | Pointer arithmetic | MMX Instructions |
| --- | --- | --- | --- |
| Clone Pentium 2, 400 MHz | 46 ms | 35 ms | 16 ms |
| Clone Pentium 3(EB), 600 MHz | 30.6 ms | 24 ms | 11 ms |
| Acer TravelMate 332T, Mobile Pentium 2, 300 MHz | 71 ms | 66 ms | 58 ms |
| Acer TravelMate 270, Mobile Pentium 4, 1.7 GHz | 34 ms | 20 ms | 5.9 ms |
| Toshiba Satellite 2410, Mobile Pentium 4, 1.8 GHz | 22 ms | 14 ms | 4.5 ms |
| Clone, Pentium 4, 1.7 GHz | 30 ms | 18 ms | 5.5 ms |
| Dell Dimension 2400, Pentium 4, 2.2 GHz | 18.5 ms | 12.5 ms | 6 ms |

Test results for various PC platforms on a 1024 x 1024 image.
Please note that the speeds are affected by several factors, including processor clock speed, L1 and L2 cache size, system bus speed, and DRAM speed. A fast system bus and DRAM result in faster data transfer into the registers: data is fetched from DRAM into cache memory and from there into the processor's register set.
Points of interest
I recently read through Alex Farber's "Introduction to SSE Programming" article and found that he had just posted a similar topic. So I modified my content and separated it into two parts: the first emphasizes C++ and MMX, and the second describes an SSE/SSE2 implementation using an external compiler (the Netwide Assembler). The version Alex posted is a .NET implementation of SSE using intrinsics, which leaves room for me to post a VC 6.0 version in my next article.
MMX development is meant for intermediate programmers, and there is a learning curve: you have to catch up on assembly, Intel processor architecture, the MMX instruction set, and assembly optimization and its terminology. I hope this article really helps!
Based on the test results, my laptop, the Toshiba Satellite 2410, seems to be the best performer when utilizing MMX instructions. It is puzzling that the Dell Dimension desktop PC with a 2.2 GHz processor loses out so much, even to a 1.7 GHz clone system (Gigabyte motherboard). I double-checked this by testing on several Dell systems.
And finally, if you like the article, please give me a vote ya!