Introduction
There are many profilers avialable in the market, I could not use some of them because my platform/IDE does not support it or it is being a licensed one. And the commercial products are costly and it cannot be customized in a way we want. For example how to ignore some of the functions from being profiled? Say, if I have a MFC application where we get flurry of paint messages and I dont want to profile them. How do you do it ? So I started to write my own profiler to profile C++ programs. Keep in mind that, this is not kind of profiler which profiles other applications rather we are modifying the target process to profile itself. Yeah, thats correct, I'm able to read your voice. There are many open source codes for profiler and why dont we use the same? I searched over web for long time for a profiler code which can be used for x64 platform. Most of these profilers do support x86 and not x64. And even I pinged some of the authors who implemented the profier in x86 and asked them how to do in x64. Either they are engaged with their routine work or they are not very sure on x64.
So I planned, instead of investing the time to search a profiler for x64, why dont we develop ourself? My requirement is very simple: I would like to measure the apporximate time taken for all the functions executed during an application's life time. And obviously I would like to filter out some of the functions called indirectly by our functions like std::xxxx functions, C++ library function calls etc.,
Limitation
There are few limitations on this work and let me list out at the very begining. So that the audience can decide to really go ahead further or not.
- This profiler is not a kind of one which can profile other applications. Rather this article focus on how to modify the application such that it does the self profiling. This may involve changing the project settings, adding files to the existing project and adding the dll to your app which does the job of profiling.
- This may not work out for release mode as we rely heavily on debug information to find the function name etc.
Background
When I started to search for the profiler, the first thing ( fortunately ) I came across is the compiler flags supported by Microsoft C/C++ compiler like /GH and /Gh. These options allow the user to inject a function that can be called before any function execution and function exit.
And the when I went through some of the profilers on x86, they were using this as a base to build over it. So I got a base to go ahead. Ok let's start!!!!
My idea was very simple:
- As _penter is called for every function call, lets start ticking the counter when _penter is called.
- Stop the counter at _pexit.
Preity simple and sounds good. But when I started to put my hands on, I was able to see some serious issues piled up.
Issue1 : I just added _penter and _pexit and compiled my app. Oh no, I could not succeed. Because these functions should follow naked calling convention. This calling convention is not supported by x64.
Issue2: Say somehow if I'm able to add them, _penter and _pexit does not do anything by itself. They are mere skeletons. It is upto the user to provide the implementation. And the documentation says that these functions should take care of managing the procedure prolog and epilog, by pushing the required registers and poping them back. This implication results into multiple questions.
- Which of the registers to be pushed?
- How to push the registers? x86 supports pushing the registers and poping them back with pushad and popad. x64 does not have these functions.
- Wait!!! Accessing the registers can be done only inside the inline assembly code which usually starts with _asm keyword. Oh dear, x64 does not support inline assembly code. Refer this link for further reading. How to have assembly code in x64 then?
Let's take one by one and untie the knot.
Step1: Adding _penter and _pexit
As _penter and _pexit is not supported in x64, I decided to move them to assembly code. But I already mentioned that inline assembly is not supported by VS in x64. Yes, inline assembly is not supported but we can have them in separate assembly files [ Refer ]. Let's create an assembly file ( .asm extension ) and add the skeletons for _penter and _pexit.
- To add an .asm file to the exisiting project -> Right click on Project -> Add -> New Item -> Let the filter be in C++ File ( .cpp ) -> Enter the name of the file as xxxxxx.asm.
Done. But how to compile the .asm file?
Step2: Compiling .asm file
As visual studio does not have any built in support to compile .asm files by itself, we have to borrow the external tool to do this job. This can be done using MASM assembler. Let me tell you simple steps. Once after adding the .asm file, do specify how these files to be compiled.
Righ click project -> Build Customization -> select masm.
Step3: Stuff _penter
Stack Manipulation:
I hope all of you aware of how the stack is manipulated and managed during function call. If not, please refer here. Even though it gives you details on x86, you should be aware of the stack management.
Let's see the disassembly code for a simple program with out /GH and /Gh flags.
#include <iostream>
using namespace std;
int Foo(int a, int b )
{
return ( a+b);
}
void main( )
{
cout<<Foo(5,6);
}
For the above code, while entering into main , stack pointer [ RSP ] points to 00000000002FF9E8. Let's see the content at this memory location. It is 6d 40 03 3f 01 00 00 00 , which is nothing but the little endian notation of 000000013f03406d. 000000013f03406d is the return address of main.
Figure 1.
I just wanted to show you that at the start of the function RSP holds the address of return instruction. And the other thing is rdi refers to the frame pointer where as in case of x86 ebp points to the frame pointer.
The above trace is for the program when compiled with out /Gh and /GH switch. But we are going to compile our project with /GH and /Gh flags. Let me show another sample program with these switches.
extern "C" void _penter;
extern "C" void _pexit;
void Subtract( int a, int b )
{
int c = a - b;
cout<<" c = "<<c<<endl;
}
void main()
{
Subtract(5,3);
}
Figure 2.
We can see the entries for _penter and _pexit for the main function. Similarly, if you see the disassembly for Subtract function, you can find the entries for _penter and _pexit.
What to stuff in _penter and _pexit?
As it's the responsibilty of user to provide definition for _penter and _pexit, we should aware what to stuff. And the documentation also says that, these function should take care of pushing the register contents during entry and pop out during exit.
As said already, for x86 pushad and popad are readily available. What about x64? Which of the x64 registers should be considered? [ Refer details of x86 registers ].
We should take care of the volatile registers [ Refer here ]. As we are not manipulating the floating point registers, our target is RAX,RCX,RDX,R8,R9,R10 and R11.We have to explicity push and pop these registers individually as there is no call available to do this job.
Now we know which registers to be stored and restored back in _penter and _pexit.
_penter proc
push r11
push r10
push r9
push r8
push rax
push rdx
push rcx
Stack Alignment: x64 looks for the stack pointer to be 16 byte aligned.
At this moment, as we have pushed 7 registers ( each one is of size 8 byte) , the stack should go misaligned ( 7 * 8 = 56 bytes which is not multiple of 16 ). But during _penter's entry, return address is being pushed on the stack. This is the address of the instruction where the control has to be transferred back once _penter completes it's execution. Because of this, the stack is now perfectly aligned ( 8 [ret address] + 56 = 64 bytes ). You can notice that at the start of _penter, the stack pointer points also points this address.
x64 Calling Convention:
At this point you should be aware of x64 calling convention [ Refer ]. It's similar to __fastcall calling convention where the first 4 integer parameters of the function is passed using the registers [RCX, RDX, R8, and R9]. And any parameters after that are pushed on the stack. If the parameters are say float, then the first 4 parameters are passing via XMM0 to XMM3 and the remaining are pushed on stack.
In _penter we should reserve the space for the first 4 registers, even though it is moved via registers RCX,RDX,R8 and R9. This may be because, later inside the function if the address of the parameter is referred, then this space on the stack will be used.I'm not very sure on this.
_penter proc
push r11
push r10
push r9
push r8
push rax
push rdx
push rcx
; reserve space for 4 registers [ RCX,RDX,R8 and R9 ]
sub rsp,20h
Figure 3.
Compute Function Address:
Now to pick the return address of the function, we have to climb 88 bytes up ( because after the return address , 7 registers were pushed [7 * 8 = 56 bytes ] and 32 bytes was reserved. So the stack has grown totally 56+32 = 88 [58h] bytes). This is nothing but the return address of the _penter.
Once, we know the return address , we can find out the address of the instruction which actually calls the _penter by subtracting 5 bytes from the return address. Because the call instruction in x86 is 5 bytes. The instruction [ CALL ] is 1 byte and the operand [ function address ] is 4 bytes.
This is fine for x86. But is it not 9 bytes for x64? [ 1 byte instruction + 8 byte operand ]. No. It's 5 bytes only. How come? In case of x86, the operand is 4 bytes absolute value.
CALL DWORD PTR[xxxxxxxx] -> where xxxxxxxx is the address of the function.
For x64, the operand is an offset value relative to the address of the instruction from where it is being called.
00300200 : CALL DWORD PTR [xxxxxxxx] -> So the function actually resides at the address 00300200 + xxxxxxxx, where xxxxxxxx is an 4 byte offset value from 00300200. As the operand is a 4 byte offset value, it is still 5 bytes only.
You can notice this in the figure2.
Return address ( 000000013F4C14BA ) - 5 = 000000013F4C14B5.
This address (000000013F4C14B5) has to be passed to the exported function in dll ( FindSymbol ) which finds the name of the function. To pass this address to the function, we just move this value to RCX register. Because the parameters are passed via registers.Refer.
_penter proc
push r11
push r10
push r9
push r8
push rax
push rdx
push rcx
; reserve space for 4 registers [ RCX,RDX,R8 and R9 ]
sub rsp,20h
; Get the return address of the function
mov rcx,rsp
mov rcx,qword ptr[rcx+58h]
sub rcx,5
; Call the exported function in dll which finds the name of the function
FindSymbol
Step4: Getting Function name and start time
Once getting the address of a function, we can get the name of the function using Debug help functions. This is being implemented in a separate dll. Keep in mind that, this dll is compiled with out /GH and /Gh flags. This dll has an exported function, FindSymbol, which takes the address of the calling function and find out its name using debug helper functions. This is the function which is being called from _penter.OK, we know the address of the function, how to get its name?
InitSymbols:
Lets do reverse engg. to find out name of a given symbol,
- We should load all the symbols of the module.
- How do we know name of module whose symbols to be loaded? We can use GetModuleFileName function to get it by using the function's base address.
- How to get the function base address? Call VirtualQuery method to retrieve the information about a given address.
- This is what the InitSymbols function does.
The dll has the entry point function Dllmain. When this function is called with reason as DLL_PROCESS_ATTACH ( usually this happens when the dll is loaded when the process starts in case of implicit linking ), get the base address where the dll is being loaded. I'm using this address to initialize the symbol handler for the current process.This is done in the function InitSymbols.
FindSymbol:
When this function is called from _penter, an address is passed as parameter to this function. This function calls a helper function, FindFunction. This function allocates memory for PSYMBOL_INFO structure and calls the function SymFromAddr to get the name of the function.
The received name ( avaialble at SYMBOL_INFO::Name ) may be a decorated one because of C++ name mangling schemes. To get the undecorated name, call the UnDecorateSymbolName function.
Once after getting the function name, an instance of ProfileInfo structure is created. This structure stores the function name, thread id, start time of the function and end time of the function etc., Now record the time using QueryPerformanceCounter.This is the approximate start time for the function. This instance is then added to the map g_mapProfileInfo.
g_mapProfileInfo:
This map's key is the thread id and the value is the vector of ProfileInfo instances. Each element of the vector belongs to a function. So this map ultimately stores for profiling information for all the functions called during an thread's execution. This map is protected using a critical section.
Step5 : Stack Cleanup
Once the function is recorded by calling FindSymbol, the control comes back to _penter. Next thing to do in _penter is stack clean up. This is done in 2 steps.
- Releasing the 32 bytes space reserved for registers.
- Poping out the volatile registers.
_penter proc
push r11
push r10
push r9
push r8
push rax
push rdx
push rcx
; reserve space for 4 registers [ RCX,RDX,R8 and R9 ]
sub rsp,20h
; Get the return address of the function
mov rcx,rsp
mov rcx,qword ptr[rcx+58h]
sub rcx,5
; Call the exported function in dll which finds the name of the function
FindSymbol
; release the reserved space by moving stack pointer up by 32 bytes
add rsp,20h
; pop out the pushed registers
pop rcx
pop rdx
pop rax
pop r8
pop r9
pop r10
pop r11
ret
Step 6: Stuff _pexit
The assembly code of _pexit is exactly same as the _penter. The only difference is the function called from dll is different and this time it is FindSymbol_1. This function is same as FindSymbol, but it records the end time of the function which called _pexit. FindSymbol_1 records the time at the very start of the function where as FindSymbol measures the time just before adding the ProfileInfo instance into the map.
Step 5: Dump Profiling
Profiling information is dumped just before the program completes its execution. This is done by the function DisplayProfileData. This function is called, when the Dllmain is called with reason as DLL_PROCESS_DETACH. We can modify this how ever we want. For example, we may want to display all the profiling info for a thread,when it completes the execution.This can be done in DLL_THREAD_DETACH case. Just find out the thread which completes its execution and search the thread in the g_mapProfileInfo and display only that information.
About source code
There are 3 sample programs being attached with this article. Both the programs uses the dll called SymbolServer.dll which actually does the job of finding the function name from its address, starting the timers, collecting the profile info for each function etc.,
1.First program is a simple console program named ProfilerX64.
- This program uses a simple static library known as ClientStaticLib to show that the function called from this library is also getting profiled.
- Both ProfilerX64 and ClientStaticLib are compiled using the /Gh and /GH swiches.
2.The second one is also a console application but uses multiple threads instead of one.
3. A Simple MFC Application
- This just draws a circle when left mouse button is clicked in the client area.
- When right mouse button is clicked, a series of dummy calls are executed and a message box is shown at the end.
- A console window is created to show the log.
- Some of the functions are filtered out like GetRunTimeClass,OnPaint,member functions of CDocTemplate,CView etc., Otherwise you will end up with flurry of these entries.
- Function name is dumped during its entry and exit in the console window. Just try left click and right click on the client area. You can notice the entries in the console window immediately. This behaviour is bit different from than SymbolServer.dll used for ProfilerX64.
- At the end/exit of the application, all the profiled information is shown and it waits for the user's input to enter a value before it closes the console window. This is just a hack, otherwise the console window is getting closed along with application.
These sample programs uses x64 Debug configuration. Because as we have mentioned already we need the debug information to get the name of the function in SymbolServer.dll.
Points of Interest
I'm very new to the assembly programming. So it took really some time to understand the stack management, register manipulation etc., I invested some time to get to know these things and it was really interesting.
Future work
In the next version, I'm planning to record the calling function of each function ( kind of callstack ). For example, in case of main I would like to record _tmainCRTStartup in the ProfileInfo associated with main.
May be we can add a pointer to ProfileInfo in ProfileInfo structure ( kind of linked list ) which will point to the ProfileInfo of calling function. This may give you the profile info of the calling function.
And similarly we can have a vector of ProfileInfo pointers in ProfileInfo where ProfileInfo of each child function can be added. This gives you the profile info of all the functions called by a particular function.
Acknowledgement
I would like to thank all who educated me on assembly programming concepts either directly or indirectly when I started this work, who answered my questions with out any delay and authors of various articles which talks about the profiler,assembly programming and x64. In particular I would like to convey my sincere thanks to the MSDN members Mike Danes and Crescens for answering all my questions posted on the forum.
History:
5th August 2014: Article published.