Introduction
This article first looks at some common algorithms programmers write for compute-intensive image processing applications. It then explains how existing code can be improved, going from native C++ to MMX acceleration. MMX is a feature of Intel processors that can speed up multimedia software by executing integer instructions on several data elements in parallel. Even though the processor supports the acceleration, your application does not benefit from it automatically; it must be written specifically to take advantage of it.
Once you are able to use MMX acceleration, you can go one step further with SSE (Streaming SIMD Extensions), which adds parallel floating point calculations, and SSE2, which extends both floating point and integer calculation. Be aware that not all hardware supports these accelerations. To keep the scope small, I have separated that explanation into another article (to be posted soon).
Now let's take a look at the Intel processors that support each feature.
- Pentium MMX, Pro, Pentium 2 - MMX
- Pentium 3, Xeon - MMX, SSE
- Pentium 4 and above - MMX, SSE, SSE2 / HT (Hyper-Threading)
The article mainly applies to developers using Intel processor hardware. For AMD users, many of the MMX assembly instructions are compatible! Some basic assembly language knowledge is needed, and I will also assume you have some knowledge of the MFC doc/view architecture.
The reason I wrote this article is that I think there is a need to highlight some high performance computing knowledge, especially for developers working on time-critical applications. Image processing is a good example for demo purposes.
To run the downloadable demo, you will need at least a Pentium processor with MMX to see the results.
Background
In image processing, 3D graphics, and similar fields, performance is critical. The less time an algorithm takes, the better, because more functionality can be squeezed in.
In my examples, I generate two images of size 1024 x 1024. Each is one megabyte in size. For the sake of simplicity, both are 8-bit grayscale images. I create synthetic images so that dependencies on other classes/objects/DLLs are kept to a minimum. Image 1 contains white vertical lines on a black background; image 2 is a gradient fill of gray.
The two images will be added together to create a third image. The final product is an image with both the gradient fill and the vertical lines.
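To make the setup concrete, here is a minimal, portable sketch of the two test images and the saturated addition that combines them, at a reduced size. The exact line spacing and gradient formula are my assumptions; the demo's real generator may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Image 1: white vertical lines on a black background.
// (A line every 4 pixels is an assumption for illustration.)
std::vector<unsigned char> MakeLines(int w, int h)
{
    std::vector<unsigned char> img(w * h, 0);      // black background
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; x += 4)
            img[y * w + x] = 255;                  // white vertical line
    return img;
}

// Image 2: a left-to-right gray gradient.
std::vector<unsigned char> MakeGradient(int w, int h)
{
    std::vector<unsigned char> img(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            img[y * w + x] = static_cast<unsigned char>(x * 255 / (w - 1));
    return img;
}

// Saturated addition: the core per-pixel operation of all four algorithms.
std::vector<unsigned char> AddSaturate(const std::vector<unsigned char>& a,
                                       const std::vector<unsigned char>& b)
{
    std::vector<unsigned char> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = static_cast<unsigned char>(std::min(a[i] + b[i], 255));
    return out;
}
```

The clamp to 255 is what distinguishes image addition from a plain integer add: two bright pixels must produce white, not wrap around to a dark value.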
I could simply present the best-performing algorithm in the demo, but instead I want to show you the difference between various kinds of code. I have developed four algorithms that demonstrate image addition: two in C++ and two in assembly language. Around each algorithm I put a high resolution timer to measure the elapsed time. The final image is shown on screen in the main window, and the two source images are displayed in two mini windows on the right. The results for different kinds of PC are presented in a later section of the article.
Writing the algorithms
For some basic understanding, let's go through this C++ code.
void CMMXDemoDoc::OnRunbasic()
{
    int x=0, y=0, i=0, iGray=0;
    m_el.Begin();
    for (y=0; y<m_iHeight; y++)
    {
        for (x=0; x<m_iWidth; x++)
        {
            i = x + y*m_iWidth;                // index of the current pixel
            iGray = m_pImg1[i] + m_pImg2[i];   // add the two source pixels
            if (iGray > 255) iGray = 255;      // saturate to 8 bits
            m_pImg[i] = iGray;
        }
    }
    m_el.End();
    UpdateAllViews(0);
}
m_pImg, m_pImg1, and m_pImg2 are all one-dimensional arrays of unsigned char. All images are the same size.
There are two for loops: the outer loop iterates through each line of the image, and the inner loop iterates through each pixel in the line. For every pixel, we calculate the index into the array, add the pixel from image 1 and image 2, check whether the sum is saturated (> 255), and then store it into the destination image.
What is the problem with this algorithm?
- There are a lot of arithmetic operations: besides the pixel addition, the index calculation involves an addition and a multiplication, and multiplication is slower than addition. Try changing the addition operator to a multiply and see the difference in the demo.
- Array indexing is used to access memory. This causes some slowdown because the processor has to convert each array index into an actual memory address.
- The two levels of for loops add overhead of their own; the compiler may spill the counter register (e.g. ecx) to the stack, and the resulting pushes and pops cost processor cycles.
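Before moving to pointers, the per-pixel multiply in the index calculation can already be eliminated by carrying the index forward incrementally, since the pixels are visited in memory order. A sketch (the free-function form and parameter names here are mine; the demo uses member variables):

```cpp
// Sketch: eliminate the per-pixel multiply x + y*w by noting that the
// pixels are visited sequentially, so a single running index suffices.
void AddIncremental(const unsigned char* img1, const unsigned char* img2,
                    unsigned char* dst, int w, int h)
{
    int i = 0;                            // runs 0 .. w*h-1
    for (int y = 0; y < h; ++y)
    {
        for (int x = 0; x < w; ++x, ++i)  // ++i replaces x + y*w
        {
            int gray = img1[i] + img2[i];
            if (gray > 255) gray = 255;   // saturate
            dst[i] = static_cast<unsigned char>(gray);
        }
    }
}
```

This is the stepping stone to the pointer version in the next section, which removes the indexing entirely.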
Optimizing using pointer arithmetic
void CMMXDemoDoc::OnRunopt()
{
    m_el.Begin();
    BYTE *pSrc    = m_pImg2;
    BYTE *pSrcEnd = m_pImg2 + m_iSize;
    BYTE *pDest   = m_pImg;
    int iGray;
    memcpy(m_pImg, m_pImg1, m_iSize);    // pre-fill destination with image 1
    while (pSrc < pSrcEnd)
    {
        iGray = *pDest + *pSrc;          // add image 2 onto the copy
        if (iGray > 255) iGray = 255;    // saturate
        *pDest = iGray;
        pSrc++;
        pDest++;
    }
    m_el.End();
    UpdateAllViews(0);
}
Notice that I first copy image 1 into the destination image. This lets the loop walk only two pointers instead of three, and memcpy() is an optimized function in VC (though memcpy() itself still takes time).
The code above uses pointers to access the data. There is little more to optimize here, since each pointer already holds the memory address of its data and moves sequentially from one element to the next. The source and destination pointers are each incremented by one after a pixel has been calculated.
Converting to assembly
The code below shows how to write the algorithm in assembly. The elapsed time of the function will be much more consistent, since we control exactly which instructions execute.
int AsmAdd(BYTE *d, BYTE *s, int w, int h)
{
    int iCount = w * h;
    // d must already contain the first image (see the memcpy above);
    // s points to the second image. We assume the registers are free
    // for our use.
    __asm
    {
        // assign pointers and counter to registers
        mov esi, [s]
        mov edi, [d]
        mov ecx, [iCount]
    codeloop:
        mov al, [edi]     // load destination pixel
        add al, [esi]     // add source pixel
        jnc nosat         // no carry: sum fits in 8 bits
        mov al, 255       // carry: saturate to 255
    nosat:
        mov [edi], al
        inc esi
        inc edi
        dec ecx
        jnz codeloop
    }
    return 1;
}
Utilising MMX
We have 8 general purpose registers in the 80x86/Pentium processor: eax, ebx, ecx, edx, esi, edi, ebp, and esp. These registers hold operands for logical and arithmetic operations, operands for address calculations, and memory pointers. Register esp holds the stack pointer most of the time, which leaves us 7 registers. When we have lots of data, we have to make sure it is handled well within those 7 registers. Besides these, there are another eight 80-bit registers located in the FPU (Floating Point Unit). If the processor has the MMX feature, we can use them as MMX registers: the MMX state is aliased onto the low 64 bits of each 80-bit floating point register, which is why only 64 of the 80 bits are accessible. Either way, we are happy with 8 additional registers, named mm0-mm7.
MMX Technology Execution Environment
So what can we do with these 8 additional MMX registers? They will not run individual instructions any faster, since the clock speed of the CPU fixes how many operations it can execute in a given time. So how do we gain? By executing MMX instructions that operate on a preloaded stream of bytes in parallel!
SIMD Execution model
Data Types introduced with MMX Technologies
Parallel execution saves clock cycles per element processed. You can load eight 8-bit, four 16-bit, or two 32-bit values into a 64-bit MMX register. To load or store, you use movd (32 bits) or movq (64 bits). Then you perform calculations by calling MMX instructions. The MMX instruction set consists of 47 instructions, grouped into several categories:
data transfer, arithmetic, comparison, conversion, unpacking, logical, shift, and the empty-MMX-state instruction (EMMS).
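A point worth stressing: the register itself holds plain bits; whether those 64 bits are treated as eight bytes, four words, or two dwords is decided by the instruction, not the register. A portable sketch of the same idea (the helper name is mine; the result shown assumes a little-endian machine such as x86):

```cpp
#include <cstdint>
#include <cstring>

// View the same 8 bytes ("the register contents") as four 16-bit words.
// On little-endian hardware the bytes {0x41, 0x42, ...} form the word
// 0x4241, and so on.
uint16_t WordAt(const uint8_t bytes[8], int n)
{
    uint64_t q;
    std::memcpy(&q, bytes, 8);   // the "register": 64 raw bits
    uint16_t w[4];
    std::memcpy(w, &q, 8);       // reinterpreted as four 16-bit words
    return w[n];
}
```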
Study the code below: an unsigned saturated add is performed on eight 8-bit values at once, held in two 64-bit MMX registers.
unsigned char mask[8] = {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01};
// txt = "abcdefgh"
unsigned char txt[8]  = {0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48};
__asm {
    movq    mm0, txt    // load 8 bytes of text
    movq    mm1, mask   // load 8 bytes of mask
    paddusb mm0, mm1    // packed add, unsigned saturated, on bytes
    movq    txt, mm0    // store the result back
    emms                // empty MMX state so the FPU can be used again
}
The string txt[], which contained "abcdefgh", will now be "bcdefghi".
By using MMX, only four instructions are executed and 8 bytes of data are processed at a time. A typical assembly algorithm without MMX would have to loop 8 times, each iteration executing at least a move and an addition, for a total of at least 16 instructions. The speed improvement? Expect a gain of at least 2x in elapsed time.
In Visual C++ 6.0, the easiest way to use MMX is to write assembly code with the inline assembler, as the code above illustrates. The full MMX instruction set is documented on the MSDN web site. Intrinsics (wrapper functions) are available in VC++ .NET.
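For readers on newer toolchains, here is a sketch of the same saturated add written with intrinsics instead of inline assembly. I use the SSE2 intrinsic _mm_adds_epu8 (the 128-bit descendant of paddusb, processing 16 bytes at a time) because modern compilers accept it everywhere; the MMX intrinsic _mm_adds_pu8 works the same way on 8 bytes. The helper name is mine, not part of the demo.

```cpp
#include <emmintrin.h>   // SSE2 intrinsics

// Saturated per-byte add of one 16-byte block of src onto dst,
// equivalent to two iterations of the paddusb example above.
void AddSaturate16(unsigned char* dst, const unsigned char* src)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    a = _mm_adds_epu8(a, b);   // per-byte unsigned add with saturation
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), a);
}
```

Unlike inline assembly, intrinsics let the compiler handle register allocation and scheduling, and the same source compiles for both 32-bit and 64-bit targets.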
For this demo you should have at least a Pentium processor with MMX to produce the desired results.
Executing the demo
Building and executing the demo is pretty straightforward. The source code is presented together with some inline assembler instructions (well commented). There are also some helper classes and functions:
- CElapsed - provides high resolution timing using QueryPerformanceCounter().
- CMiniView - serves additional views for a CDocument class, and contains some raster drawing code to assist image display.
- CreateNewViews() - shows how to create additional views for a typical document.
- CheckSSEAvail() - checks the availability of SSE in your processor. You can also check for other processor features such as MMX and HT.
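The feature flags that a routine like CheckSSEAvail() inspects come from CPUID leaf 1, register EDX (bit positions per the Intel manuals). The cpuid call itself is compiler-specific (inline asm in VC 6.0, the __cpuid intrinsic in later compilers), but the bit testing is plain C++; a sketch with hypothetical names:

```cpp
// Feature bits reported in EDX by CPUID leaf 01H (Intel SDM):
constexpr unsigned kMMXBit  = 1u << 23;  // CPUID.01H:EDX bit 23
constexpr unsigned kSSEBit  = 1u << 25;  // CPUID.01H:EDX bit 25
constexpr unsigned kSSE2Bit = 1u << 26;  // CPUID.01H:EDX bit 26

// Given the EDX value returned by CPUID leaf 1, test one feature bit.
bool HasFeature(unsigned edxValue, unsigned featureBit)
{
    return (edxValue & featureBit) != 0;
}
```

Checking at startup and choosing the appropriate code path is what lets one binary run on both MMX and non-MMX processors.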
Test Results
| System | Array Index | Pointer arithmetic | MMX Instructions |
| --- | --- | --- | --- |
| Clone Pentium 2, 400 MHz | 46 ms | 35 ms | 16 ms |
| Clone Pentium 3(EB), 600 MHz | 30.6 ms | 24 ms | 11 ms |
| Acer TravelMate 332T, Mobile Pentium 2, 300 MHz | 71 ms | 66 ms | 58 ms |
| Acer TravelMate 270, Mobile Pentium 4, 1.7 GHz | 34 ms | 20 ms | 5.9 ms |
| Toshiba Satellite 2410, Mobile Pentium 4, 1.8 GHz | 22 ms | 14 ms | 4.5 ms |
| Clone, Pentium 4, 1.7 GHz | 30 ms | 18 ms | 5.5 ms |
| Dell Dimension 2400, Pentium 4, 2.2 GHz | 18.5 ms | 12.5 ms | 6 ms |

Test results for various PC platforms on a 1024 x 1024 image.
Please note that the speeds are affected by several factors, including processor clock speed, L1 and L2 cache size, system bus speed, and DRAM speed. A fast system bus and DRAM result in faster data transfer into the registers: data is fetched from DRAM into cache memory and from there into the processor's register set.
Points of interest
I recently read through Alex Farber's "Introduction to SSE Programming" article and found that he had just posted a similar topic. So I modified my content and separated it into two parts: the first emphasizes C++ and MMX, and the second describes an SSE/SSE2 implementation using an external compiler (the Netwide Assembler). The version Alex posted is a .NET implementation of SSE using intrinsics, which leaves room for me to post a VC 6.0 version in my next article.
MMX development is meant for intermediate programmers, and there is a learning curve: you have to catch up on assembly, Intel processor architecture, the MMX instruction set, and assembly optimization and its terminology. I hope this article really helps!
Based on the test results, my laptop, the Toshiba Satellite 2410, seems to be the best performer when utilizing MMX instructions. It is puzzling that the Dell Dimension desktop PC with a 2.2 GHz processor loses out so much, even to a 1.7 GHz clone system (Gigabyte motherboard). I double-checked this by testing on several Dell systems.
And finally, if you like the article, please give me a vote ya!