Introduction
The arcane art of code profiling and tuning may seem unnecessary with today's
desktop CPUs cranking out 2.2 GHz, which is pretty crazy pixel-pushing power if you ask me.
The release of .NET seems to promote machine abstraction, with more benefits for the time- and cost-aware programmer.
However, if you're more interested in writing code for the pure enjoyment of writing code, rather
than making money, you're probably just as interested in writing fast code. Depending on how intimate
you care to get with your current machine, there are some pretty cool optimizations you can perform to
squeeze every last redundant clock out of your code. Because assembly language is about as popular
as machine code, far harder to read, and makes you REALLY machine dependent, I will ignore the common
optimizations performed in assembly. In any high level language there are some simple optimizations
you can perform which will reduce the code size and the cycles required to execute it; probably the most
overlooked optimization of all is your choice of algorithm.
Consider the following add/delete/insert functions as an example:
Add (Index)
- Adds an item at the currently indexed element and pushes trailing elements in the array down by one.
Del (Index)
- Removes the indexed element from the array and pulls trailing elements in the array up by one.
Ins (Index)
- Inserts/overwrites an item at the currently indexed element.
Add(Index, newItem)
{
    // push trailing elements down one slot, then place newItem at Index
}
Del(Index)
{
    // remove the element at Index and pull trailing elements up one slot
}
Ins(Index, newItem)
{
    // overwrite the element at Index
}
The Ins function, programmed independently of the others, would cause code bloat; one simple solution
to the redundancy is to design Ins() this way:
Ins(Index, newItem)
{
Del(Index);
Add(Index, newItem);
}
Note: if you wanted Ins() to add elements when the Index was out of bounds, and NOT just overwrite
the currently indexed element in the array, you would need to add more memory-managing routines, which would
complicate your code and its management. By having Add() manage this, and making Ins() call Del() then Add(), you
have now removed a lot of unnecessary code and left the tricky memory allocations to only two functions rather
than three. If you one day decide to use virtual memory instead of the heap, it's a whole lot easier to manage.
You have now seen an example of poor algorithm design and of redundant-code removal, so let's move on to less
obvious code-tweaking techniques.
Now we'll talk about speed enhancements rather than code size.
If you ever cross paths with an old game/graphics or assembly book you'll often read about
bit shifting, "a faster multiplication/division". It was (I reluctantly speak in the past tense, as I don't know
if bit shifting is still faster) a commonly known quirk of the 80x86 instruction set(s) that shifting was
a faster way of multiplying or dividing by powers of 2.
128  64  32  16    8   4   2   1
================================
  0   0   0   0    0   0   0   1 = 1
  0   0   0   0    0   0   1   0 = 2 -> left-shift by one multiplied the value by two
  0   0   0   0    0   1   0   0 = 4 -> left-shift by one multiplied the value by two again
Of course this is only useful for integer division/multiplication. Basically, as a general rule of thumb:
int result = b * 2;
is faster written as
int result = b << 1;
These are some common, easy to apply optimizations in any language that supports bit shifting.
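As a quick sanity check that the shift forms really compute the same values, they can be wrapped in tiny helpers and compared against plain arithmetic. This is a sketch; mul2 and div2 are my names, and the right-shift form only matches integer division for non-negative values.

```cpp
// Shift-based doubling and halving; valid for non-negative integers.
int mul2(int b) { return b << 1; }   // same result as b * 2
int div2(int b) { return b >> 1; }   // same result as b / 2 when b >= 0
```

In general, shifting left by n multiplies by 2 to the power n, and shifting right by n divides by it (again, for non-negative integers).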
Some other optimizations that don't require intimate knowledge of the
computer architecture are given below.
Removal of loop invariants
for (int x = 0; x <= 10000; x++) {
    rect[x].left = nSpace + nWidth;
}
would be more efficient like this
temp = nSpace + nWidth;
for (int x = 0; x <= 10000; x++) {
    rect[x].left = temp;
}
In the first example you're performing an unnecessary addition 10001 times, whereas the second calculates
the value once, prior to iterating over the array.
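The two loops can be put side by side as compilable code. This is a hedged sketch under my own assumptions: the rect[] array of structs is simplified to plain int arrays, and the function names are mine.

```cpp
const int N = 10001;
int before[N];   // filled by the original loop
int after[N];    // filled by the hoisted version

// Original: the invariant nSpace + nWidth is re-added on every iteration.
void fillBefore(int nSpace, int nWidth)
{
    for (int x = 0; x < N; x++)
        before[x] = nSpace + nWidth;
}

// Hoisted: the invariant is computed once, outside the loop.
void fillAfter(int nSpace, int nWidth)
{
    int temp = nSpace + nWidth;
    for (int x = 0; x < N; x++)
        after[x] = temp;
}
```

Both fill the array with identical values; the hoisted version simply does the addition once instead of 10001 times.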
Common sub-expression elimination
result = a*((a+b+c)+b*(a+b+c)+c);
would be more efficient like this
temp = a+b+c;
result = a*(temp+b*temp+c);
In the above you're calculating the redundant expression only once instead of twice (as in the first)
and storing it in an intermediate variable; thus you get faster (and, I think, smaller) code when compiled.
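The before and after forms can be checked against each other as two small functions. A sketch, with names of my own choosing:

```cpp
// Original form: a+b+c is written (and, without optimization, evaluated) twice.
int resultNaive(int a, int b, int c)
{
    return a * ((a + b + c) + b * (a + b + c) + c);
}

// After elimination: the common sub-expression is computed once and reused.
int resultCse(int a, int b, int c)
{
    int temp = a + b + c;
    return a * (temp + b * temp + c);
}
```

Both return identical results for any inputs; only the number of additions differs.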
These are just two common code-optimization techniques you can easily apply to your own code.
There are quite a few more you can perform and, depending on your compiler,
several automatic optimizations available to you; I will ignore these for the sake of brevity.
You now have an idea of how easy it is to optimize your code,
so without further delay I'll explain how my profiler works (or should anyway), but first the basics of profiling.
What is code profiling?
Code profiling records the clock cycles expended by your profiled section of code or function.
By repeatedly profiling your code, you get a clear picture as to which functions or sections of
code are causing bottlenecks or hot spots. After you localize your bottleneck it's then just a
matter of applying the techniques mentioned above (and more) to lower the number of clock cycles
used in your function or section of code.
What are clocks...?
Clocks or clock cycles are what the instructions on your CPU consume to execute tasks.
When you write code in a HLL (high level language) your compiler outputs a list of
instructions native to the target machine. Each instruction requires so many clock cycles to execute.
Hypothetically speaking, assume the ADD instruction needs 4 cycles to execute.
If your computer is the newest Pentium 4 rated at 2.2 GHz (2,200,000,000 cycles per second), then your CPU could execute:
2,200,000,000 cycles per second / 4 clocks per ADD instruction = 550,000,000 adds in ONE second. That's blazing speed, huh?
History:
After thinking about how code profiling actually worked I decided to visit Intel's site
and download their VTune program (free for 30 days). However, after downloading and playing
with it for only a few minutes I decided it offered far too many features, bells, and whistles
for simply profiling my MFC code. I'm not designing the autopilot on shuttle Discovery, so why would
I need all this info? So I sat down and started to think of a quick and dirty solution. Naturally,
recording the elapsed time from function invocation to function return immediately came to mind,
but how do you do this? And accurately? I started asking around CP and many people answered
GetTickCount and QueryPerformanceCounter. I knew that GetTickCount wasn't
exactly high performance, so I opted for QueryPerformanceCounter. After about seven hours of development and searching
the net for resources on the subject at hand, I stumbled across CProfile
on CodeGuru...at last, an example.
The first example on CG I ran across used the following approach to profiling code. It's a class that uses
GetTickCount and two member functions, Start and Finish (I think), and I believe it
optionally outputs a log file to disk or to the debug output window. The major benefit of this class is that
by using the two functions you can be more precise about what you want to analyze;
you maintain total control over what you profile. If you know precisely where your bottleneck is,
you can profile exactly that code only...and better observe the enhancements you make.
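For illustration, here is a minimal sketch of that Start/Finish style of class. The CodeGuru original used GetTickCount; std::chrono::steady_clock stands in here so the sketch stays portable, and the class and member names are my own, not the original's.

```cpp
#include <chrono>

// Explicit Start()/Finish() profiler: you bracket exactly the code you want
// to measure, keeping total control over the profiled section.
class CSimpleProfile
{
public:
    void Start()  { m_start = std::chrono::steady_clock::now(); }
    void Finish() { m_elapsed = std::chrono::steady_clock::now() - m_start; }

    // Elapsed time of the last Start()/Finish() pair, in milliseconds.
    double ElapsedMs() const
    {
        return std::chrono::duration<double, std::milli>(m_elapsed).count();
    }

private:
    std::chrono::steady_clock::time_point m_start;
    std::chrono::steady_clock::duration   m_elapsed{};
};
```

The cost of this control is that forgetting either call silently produces garbage timings, which is exactly the weakness the next design addresses.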
If you know of this example, you probably read the comments at the bottom of
the article, where someone suggests he's developed a better profiling class. This class is more object
oriented (friendly?). Instead of explicitly supplying the start and end of your profiled section of
code, you rely on the constructor to start recording and the destructor to stop. This means you're less likely to get faulty profile timings, because now you're guaranteed to have properly recorded start and stop times; if you forget the Stop() in the other class, then obviously you're
going to get an incorrect measurement of the clocks expended. The drawback to this method is that you no longer have direct control over the code section you want to profile: profiling starts whenever the object is instantiated and stops whenever the object is destroyed or goes out of scope. In an interesting message posted at the bottom of that article, another fellow stated that
QueryPerformanceCounter had known issues and wasn't as accurate as one would initially believe.
I did some more research and found a few sites that discussed the QueryPerformanceCounter function;
all had negative things to say, but nothing really concrete. Following my instincts, I decided
to read up a little more on the RDTSC instruction, which is Intel's own and which I had read about a while back.
Solution found! I'd use this instruction instead of the two Windows functions. Some drawbacks,
however, are still present, since my CProfile
class will be tied to Intel platforms ONLY! That was no big deal, because
I wasn't going to post the code; but now that I have, it might be for some people. Sorry!
The CProfile class I developed uses the RDTSC (read time-stamp counter)
instruction...I believe this to be the most accurate method available (for those with supporting
architectures, anyway). The design is similar to the second class mentioned: starting and stopping
clock-cycle recording occurs at construction and destruction. Only I designed it with DEBUG mode in
mind, so you don't actually instantiate the object with normal syntax, but rather with the macro provided.
This offers a cool advantage over the other two classes. Now you can plug a PROFILER macro into
every function in your code (at the very beginning of each function) and leave it there...it acts like an
ASSERT or TRACE macro...it only compiles under debug mode; it's not included in release.
All three classes (I believe) create log files...but what good is a log file without tools to view all
the data? So I wrote a quick and dirty utility to load and view the profiled data. I
did lose interest in the project rather quickly (something else has my attention now), so
there are many improvements to be made. My hope is that by sharing with you, you'll share
with me and let me know about bugs; fixes would be great! I ask that if you add
or change anything, you send me the updated code so I can keep this project alive. I'll make
sure to make mention of your name, I promise.
The following people I owe great thanks to, not just for this project, but for any problem they help me solve.
Helped being the key word here though, so don't expect anything when I make billions.
Credits:
- Roger Wright - Proofed this article and showed me the proper usage of semicolons, although I may have changed a few.
- Joaquin - Huge help in writing a previous buffer class; showed me the coolest and most efficient way of using templates.
- Christian Graus and Nish - Both sped up my development tenfold by offering solutions I want today, NOT tomorrow.
- Kilowatt - Reminded me that OnNcPaint is sometimes an excellent place to paint.
- Rick York - Made some corrections to the first draft and gave me some tools/info which I plan on including in my next update.
And of course the author of CSortListCtrl for a job well done!!!
Known Bugs:
None...what, are you kidding?
Somewhere in the utility program I think I have used a long instead of
a double or __int64. When working with clocks this can mean loss of data. If
you find what I think I'm looking for, please inform me.
Other than that, the code is pretty sloppy, so if you see a lack of consts or of
clarity somewhere, please let me know and I'll change it, or you can change it and send me
the adjusted code. One problem I realized late in development is that some functions rate really high
in clocks, which kind of overshadows the other profiled functions. Using the
status bar in the list view doesn't help and isn't very accurate to view, so I plan on designing a
graph of sorts for displaying the data more accurately. Like I said above, there is much code to be redone,
so by all means have a crack at it.
I plan on updating my current version of VC++ to VC++ .NET someday, and I am curious to know
if inline assembler is still available (with all that fancy byte-code stuff). If you know of a
more future-friendly design, let me know and I'll start thinking of tomorrow, today.
Bug Fixes:
None yet!