Introduction
Are you interested in how profilers work and how to walk mixed-mode applications? We will explore the ICorProfiler
interface for doing a mixed-mode stack walking. Due to limited documentation, I had to learn some things the hard way. In this article I will share some of the DOs and DONTs, limitations, and possibilities, links to resources, and things I have learned.
Background
This is a continuation of an article where we learned how to traverse a mixed-mode (unmanaged/managed) application using IDebugInterface
and IXCLRProcessData
. The reason I started with this was the need to optimize a C++/C# application, which was slow at importing large data files. Neither a native profiler nor a CLR profiler run separately could pin-point the problem. A Mixed-Mode Stackwalk with IDebugClient Interface which was my first attempt to walk a mixed-mode callstack resulted in a partial success. I could see function calls of the CLR library but my own code appeared camouflaged as some calls to a CLR function where my JITted code was being executed. In this article we will instead explore a profiler API of the CLR runtime, called ICorProfiler
.
Definitions
Mixed-Mode Callstack
When we talk about a mixed-mode callstack, we refer to a non pure managed application. A truly managed application only contains managed code. A managed application, using some sort of interop/pinvoke, is not pure. It is a mixed-mode application. COM, C++/CLI, Interop, and/or PInvoke can be used to mix the two worlds. Native code that calls managed code, which in turn calls native code again. We will illustrate this with a sample.
We start from the bottom, where a C++ app makes a call into a C++/CLI managed DLL, the managed DLL in turn calls native code. Now we have two transitions already. If this native code uses a COM object written in C#, we get yet another transition, and so it can continue. In the normal case, you would only have 1-2 transitions between managed/unmanaged code. If you get 5-6 layers, you probably have an indicator that you should redesign.
Profiling
Profiling is a form of dynamic analysis of the execution of a software. It can measure memory consumption, how many times a function has been called, the duration of a function call, and much more. It is generally used to guide optimization by pin-pointing the worst CPU hogs. Profiling can be done at source code level or at binary level.
The application I wanted to optimize had been "profiled" before, at source code level, by writing the stacktrace to disk in functions we thought was called often. Not very sophisticated, but it worked. Its weakness was that it only supported native code. Unfortunately I had reason to believe that the problem was in managed code.
Instrumenting Profiling
My own experience goes back to the 90s. I used a native profiler called Quantify from Rational. That profiler did instrumentation on a binary level by patching function entry/leave to do logging. The result was a new binary, and every library that was dynamically loaded was also automatically patched.
Event Based Profiling
Nowadays, languages running in a virtual machine have really gained ground. Native profilers are not good for those languages. An instrumenting profiler would profile also the virtual machine. Those languages have their own profiler, to profile code that is run by the VM. The VMs usually provide callbacks for profilers to inform about JIT-compilation, object creation, enter/leave.
Sampling Profiling
Sampling is when you take samples (walk the stack) at regular intervals (every 5-20 ms). This is a very fast method since it lets the application run at almost full speed. If the interval is small, the data collected should be quite accurate.
ICorProfiler
Exploring a sample profiler
Now we are going to build a profiler of both native and managed code. The documentation for ICorProfiler
is minimal and the samples are few. Fortunately, I found an example here at CodeProject: Creating a Custom NET Profiler. Respect! Unfortunately, it is a managed-only profiler, but it serves to give an understanding how the profiler API is supposed to be used.
The profiler was extremely easy to use. Profiling was started by setting two environment variables:
c:\>SET COR_ENABLE_PROFILING=1
c:\>SET COR_PROFILER={C6DBEE4B-017D-43AC-8689-3B107A6104EF}
c:\>SampleApp.exe
The first one is pretty obvious. It enables profiling. The second one points to a specific profiler implementation by its CLSID.
The project was supplied with a sample app, called Hello World. Unfortunately, I could not use this profiler implementation for several reasons.
- It was only for managed code
- It was too slow
- I needed control over the start and stop
- I wanted to avoid deployment involving the Registry
- I wanted to learn how to do the programming
In my particular case, I didn't have a need to measure the performance of the overall execution. Most of the time the application was waiting for user input. In particular it was the performance of the import/restore of a database I wanted to have a closer look at. I needed to start profiling immediately before the import, and stop it immediately after. When the app starts, a lot of data is loaded from the database and filesystem, and some internal data structures are built up. The import uses the same components and to some degree the same functions to do checking data integrity. If I start the profiler and let it run until the end of the import, I will get a big profile trace, which is not useful for me. I will not be able to see if it was generated from the application startup or from the import.
ICorProfiler Documentation
A good start is a Profiling Overview, where we learn that the profiler must be implemented as a COM DLL. This DLL runs in-process, meaning that it is loaded into the process space of the application you want to profile. The profiler uses the Profiler API of the CLR, and gets notifications. Furthermore, the profiler must be written in native/unmanaged code, otherwise its own execution would trigger events in the CLR.
The pages describe two ways of profiling. Using the Enter/Leave/Tail callbacks or using DoStackSnapshot
. I decided to go for the latter because I found a guideline Profiler Stack Walking in the .NET Framework 2.0: Basics and Beyond on how to walk a mixed stack.
Exploring the Interfaces
A good starting point is Profiling Interfaces at MSDN.
There seems to be a lot of different interfaces, but they are in reality a few because many of them are just extensions of the interfaces.
ICLRProfiling
ICorProfilerCallback
(1,2,3) ICorProfilerInfo
(1,2,3)
The reason there are several versions of the same interface is backward compatibility. An interface must never change. If it would, components written for the old interface would break. In order to add functionality, you can create a new interface containing the new functions. There are other options to implement versioning, but this is how Microsoft implements it with ICorProfiler
:
ICorProfilerCallback
is from CLR V1.0 and forward ICorProfilerCallback2
is from v2.0 and forward ICorProfilerCallback3
is from v4.0 and forward.
ICLRProfiling
The first class that appears is ICLRProfiling
. It is the start Interface. It just contains one method.
CLRProfiling::AttachProfiler
HRESULT AttachProfiler(
[in] DWORD dwProfileeProcessID,
[in] DWORD dwMillisecondsMax, [in] const CLSID * pClsidProfiler,
[in] LPCWSTR wszProfilerPath, [in] size_is(cbClientData)] void * pvClientData, [in] UINT cbClientData);
It attaches a profiler to a running process. We have the options to specify a file path, meaning that we can avoid the Registry.
Creating an ICLRProfiling Object
CLR Hosting API
We are going to use the CLR Hosting API. It's an API that lets an unmanaged application integrate a CLR runtime into itself. Not only can you startup the CLR inside a native application, but you can also replace the garbage collector, add global exception handlers, and much more. There is an interesting book regarding the CLR Hosting API: Customizing the Microsoft® .NET Framework Common Language Runtime.
ICLRMetaHost
The CLR Hosting APIs are provided by MSCorEE.dll. It is a thin version-independent API. The first thing to do is to create an ICLRMetaHost
object. This object is the door to the CLR.
#include <Metahost.h>
ICLRMetaHost* m_metahost = nullptr;
HRESULT hr = CLRCreateInstance(CLSID_CLRMetaHost, IID_ICLRMetaHost, (LPVOID*)&m_metahost);
ICLRMetaHost::GetRuntTime
The ICLRMetaHost
object contains functions such as EnumerateInstalledRuntimes
and GetRuntime
, the latter starts a CLR runtime in the current process. Let's do that.
ICLRRuntimeInfo* m_runtimeInfo = nullptr;
hr = m_metahost->GetRuntime(_T("v4.0.30319"), IID_ICLRRuntimeInfo, (LPVOID*)&m_runtimeInfo);
ICLRProfiling
We can use the ICLRRuntime
object to create objects that interact with the running CLR runtime. For doing profiling, we are going to need the ICLRProfiling
object.
ICLRProfiling* m_clrProfiling = nullptr;
hr = m_runtimeInfo->GetInterface(CLSID_CLRProfiling, IID_ICLRProfiling, (LPVOID *)&m_clrProfiling);
ICLRProfiling::AttachProfiler
LPVOID pvClientData = NULL;
DWORD cbClientData = 0;
CLSID clsidProfiler;
HRESULT hr;
hr = CLSIDFromString(L"{C6DBEE4B-017D-43AC-8689-3B107A6104EF}", (LPCLSID)&clsidProfiler);
hr = m_clrProfiling->AttachProfiler(
m_processId,
timeout,
&clsidProfiler,
NULL, pvClientData ,
cbClientData );
The ICLRProfiling
object has an AttachProfiler
method. Since at this point I didn't have an implementation of a profiler, I tried attaching an existing implementation, the Net Profiler.
Attachprofiler
needs the CLSID of the Profiler DLL. If the CLSID is in the Registry, the file path can be NULL
, otherwise you must supply the absolute file path.
It did not work with the Net Profiler! Running the code returned a failure code of 0x80070002: "The system cannot find the file specified". After verifying the file path a 100 times, Googling the error code, double checking the CLSID too, I found the problem. I had not read the manual for AttachProfiler
properly. Among its error codes, is to be found the following error code:
HRESULT_FROM_WIN32(ERROR_FILE_NOT_FOUND)
The target process does not exist or is not running a CLR that supports attachment.
This may indicate that the CLR was unloaded since the call to the runtime enumeration method.
I would have wanted to see a more accurate error code for this. The problem was that my process did not support attachment. The reason was that I had tried to attach a profiler from the application that I wanted to be profiled. This was not allowed according to the API.
In my second attempt, I tried to attach to another process. It didn't work either. The reason was that the profiler that I had was old. It was written for the CRL v2.0 profiling interfaces. Attaching a profiler to running processes is only supported by CLR v4.0 and forward. The sample profiler didn't implement this interface.
The Solution
First of all, if it wasn't for some great sources on the internet such as Profiler Stack Walking in .NET Framework 2.0: Basics and Beyond and Building a mixed mode stack walker I would probably not have finished my own implementation. The sources are theoretical, no source code are available. My contribution is about explaining how it can be done in practice, and tell you about the DOs and DONTs, and my experiences.
What We Need to Implement
We will need to implement the ICorProfilerCallback
interface. The CLR will call the interface methods, implemented by our Profiler, to inform about events.
Startup and Initialization
ICorProfilerCallback::Initialize
If the profiler DLL is loaded at application startup, ICorProfilerCallback::Initialize()
will be called.
ICorProfilerCallback::InitializeForAttach
If we instead attach to a running process using the ICLRProfiling::AttachProfiler
method, ICorProfilerCallback::InitializeForAttach()
is called instead.
ICorProfilerInfo::SetEventMask
Probably the first thing one wants to do is to set up the event mask, which is how you tell the CLR what events you are interested in getting callbacks for. There is a whole lot of events to subscribe for.
COR_PRF_MONITOR_NONE = 0,
COR_PRF_MONITOR_FUNCTION_UNLOADS = 0x1,
COR_PRF_MONITOR_CLASS_LOADS = 0x2,
COR_PRF_MONITOR_MODULE_LOADS = 0x4,
COR_PRF_MONITOR_ASSEMBLY_LOADS = 0x8,
COR_PRF_MONITOR_APPDOMAIN_LOADS = 0x10,
COR_PRF_MONITOR_JIT_COMPILATION = 0x20,
COR_PRF_MONITOR_EXCEPTIONS = 0x40,
COR_PRF_MONITOR_GC = 0x80,
COR_PRF_MONITOR_OBJECT_ALLOCATED = 0x100,
COR_PRF_MONITOR_THREADS = 0x200,
COR_PRF_MONITOR_REMOTING = 0x400,
COR_PRF_MONITOR_CODE_TRANSITIONS = 0x800,
COR_PRF_MONITOR_ENTERLEAVE = 0x1000,
COR_PRF_MONITOR_CCW = 0x2000,
COR_PRF_MONITOR_REMOTING_COOKIE = 0x4000 | COR_PRF_MONITOR_REMOTING,
COR_PRF_MONITOR_REMOTING_ASYNC = 0x8000 | COR_PRF_MONITOR_REMOTING,
COR_PRF_MONITOR_SUSPENDS = 0x10000,
COR_PRF_MONITOR_CACHE_SEARCHES = 0x20000,
COR_PRF_MONITOR_CLR_EXCEPTIONS = 0x1000000,
COR_PRF_MONITOR_ALL = 0x107ffff,
COR_PRF_ENABLE_REJIT = 0x40000,
COR_PRF_ENABLE_INPROC_DEBUGGING = 0x80000,
COR_PRF_ENABLE_JIT_MAPS = 0x100000,
COR_PRF_DISABLE_INLINING = 0x200000,
COR_PRF_DISABLE_OPTIMIZATIONS = 0x400000,
COR_PRF_ENABLE_OBJECT_ALLOCATED = 0x800000,
COR_PRF_ENABLE_FUNCTION_ARGS = 0x2000000,
COR_PRF_ENABLE_FUNCTION_RETVAL = 0x4000000,
COR_PRF_ENABLE_FRAME_INFO = 0x8000000,
COR_PRF_ENABLE_STACK_SNAPSHOT = 0x10000000,
COR_PRF_USE_PROFILE_IMAGES = 0x20000000,
COR_PRF_MONITOR_THREADS
In my case, I needed COR_PRF_MONITOR_THREADS
which give callbacks when threads are created and destroyed.
STDMETHODIMP ICorProfilerCallback::ThreadAssignedToOSThread(ThreadID managedThreadID, DWORD osThreadID)
{
m_managed2NativeMap[managedThreadID] = osThreadID;
m_native2ManagedMap[osThreadID] = managedThreadID;
return S_OK;
}
It is called when a native thread is assigned a managed thread ID. I use the ThreadAssignedToOSThread
callback to maintain a mapping between the two IDs. The callback is only called once per thread. This is actually a problem when we choose to attach to running processes, because the callback will never come. I haven't found a good way to find out if a thread is managed or not. We can use the ICorDebug
interface, but you cannot attach a debugger to yourself.
COR_PRF_ENABLE_STACK_SNAPSHOT
I also needed COR_PRF_ENABLE_STACK_SNAPSHOT
, which permitted me to do the actual stack walk with ICorProfilerInfo2::DoStackSnapshot
. It works kind of like StackWalk64
, but instead of you calling StackWalk64
for each frame, you call DoStackSnapshot
with a callback function pointer, and it calls your callback for each frame.
ICorProfilerInfo2::DoStackSnapshot
The ICorProfilerInfo2::DoStackSnapshot
routine:
HRESULT DoStackSnapshot(
[in] ThreadID thread,
[in] StackSnapshotCallback *callback,
[in] ULONG32 infoFlags,
[in] void *clientData,
[in, size_is(contextSize), length_is(contextSize)] BYTE context[],
[in] ULONG32 contextSize);
A call to DoStackSnapshot
needs a function call pointer and a register CONTEXT (EIP, EBP, ESP). When DoStackSnapshot
is called, the framework calls the function callback for each Managed frame on the stack, and once for each native code block in between (may contain many native frames).
HRESULT __stdcall StackSnapshotCallback (
[in] FunctionID funcId,
[in] UINT_PTR ip,
[in] COR_PRF_FRAME_INFO frameInfo,
[in] ULONG32 contextSize,
[in] BYTE context[],
[in] void *clientData
);
A call on DoStackSnapShot
will get us a call sequence as illustrated below.
For all callbacks, we get a CPU CONTEXT (EIP, EBP, ESP) parameter, which represents the CPU state of the current frame.
When the frame is managed, we get also a FuncId, which we can resolve to a symbol name via the CLR.
If the frame is unmanaged, we can use the context parameter to traverse the callstack until the next managed frame. Unfortunately, we don't know where the next managed frame starts until we get the next callback. So, the best thing to do is to postpone the stackwalk until we know where it ends. In my implementation, I store the callback information FuncId and part of the CONTEXT in a list. Afterwards, when DoStackSnapshot
returns, I iterate over the list expanding all unmanaged frames until they either end or we reach the next managed IP.
In order for a native stackwalk to succeed properly, and to get readable names, it is a must to have the symbol files (.pdb) loaded. This is done with a call to SymInitialize
. The problem with SymInitialize
is that it is not so fast. If the profiling starts right from the from process startup, the PDB files are not yet available, and SymGetModuleBase64
and SymGetSymFromAddr64
fail. They should start working when the PDB file eventually gets loaded.
Wrapping Up
Requirements for my Profiler
- Should work on the v2.0 and v4.0 runtime
- Needs to start and stop profiling
- No need for registration
Forcing Apps into the v4.0 Runtime
The ICorProfiler
interface has many good features in v4.0, not available in v2.0. In order to avoid the work of supporting both v2.0 and v4.0, I chose to concentrate on getting the support of v4.0 working. V2.0 apps would be supported by forcing them to run in the v4.0 runtime. There are very few compatibility issues between v2.0 and v4.0, so it should work 95% of the time.
There is support for this already in the CLR, through an environment variable called COMPLUS_version
.
C:\>SET COMPLUS_version=v4.0.30319
C:\>MyApp.exe
The version string must of course match the one that you have installed.
Setting up a Timer to take Stacktraces
Since the plan was to do a manual stackwalk, I need a timer to be triggered at regular intervals. At the point of writing, it is set to every 40 ms. That is 25 times per second. To get a more accurate profiling, every 5ms or 10ms is better.
Starting and Stopping the Profiler
Since the DLL gets loaded into the process space of the target, we need to find a way to communicate with it. One nice solution would be to register a COM object in the Running Object Table (ROT), and then get that object from the application where I wished to control the profiler. I think ROT requires using the Registry, so I chose an approach using Pipes for communication instead.
There is an advantage of being able to communicate with the profiler. We can start the app with profiling on hold. Doing this permits us to get all the early events that are happening in the CLR which we may miss if we attach at a later point. One callback that particularly interests us is ThreadAssignedToOSThread
.
In the InitializeAttach
routine of the Profiler callback interface, I spawned a new thread, which I dedicate as a command listener. The pipe communication uses may look strange, it is because I experimented a little with asynchronous IO.
Demo Applications
There are in total four sub projects in the solution:
- DiagProfiler.dll
- DiagProfilerClient.dll
- DiagProfilerConsole.exe
- DiagProfilerLauncher.exe
DiagProfiler.dll
This is the profiler. It needs to be configured from the environment variables. You have to do the configuration in a Cmd console before starting the target app. There is a file called profile.bat that you can update in order to reflect your installation. Run it in the Cmd window, then start your app from the same Cmd window.
REM Microsoft CORProfilerAPI
SET COR_ENABLE_PROFILING=1
SET COR_PROFILER={C6DBEE4B-017D-43AC-8689-3B107A6104EF}
SET COR_PROFILER_PATH=C:\Source\MixedModeProfiler\MixedModeProfiler_demo\DiagProfiler.dll
REM CLR Option
SET COMPLUS_Version=v4.0.30319
REM DiagProfiler
SET DIAG_PRF_SYMBOLPATH=SRV*c:\symbols*http://msdl.microsoft.com/download/symbols;
SET DIAG_PRF_DEBUGTRACE=C:\temp\diag_debugtrace.txt
SET DIAG_PRF_STACKTRACE=C:\temp\diag_stacktrace.txt
SET DIAG_PRF_ONLY_MANAGED_THREADS=1
COR_PROFILER_PATH
must have an absolute path. Relative paths don't work. If DIAG_PRF_STACKTRACE
is omitted, the trace will be saved in the My Documents folder. If DIAG_PRF_SYMBOLPATH
is omitted, the current path of the target application will be used.
DiagProfilerClient.dll
This is a managed library that we can use to get further control over the profiler.
public ref class ProfileClrApp
{
bool StartProcess(String^ exePath, String^ stackOutputPath, String^ debugOutputPath);
bool AttachProfiler(DWORD pId);
bool ConnectGUI();
bool StartSampling();
bool StopSampling();
int GetSampleCount();
int GetSampleRate();
bool SetMaxSamples(int maxNoSamples);
bool SetSampleRate(int rateMs);
bool SetStackLoggerFile(String^ stackOutputPath);
};
Profiler Launcher and Controller
I built a small GUI client app to get better control over the profiler. It supports start/stop, attaching, querying the current sample count, etc.
The stack output file will be saved to My Documents of the current user. Double click on the field to change the file location.
DiagProfilerConsole.exe
A small console app that is using the DiagProfilerClient.dll library. The application accepts a PId as a parameter. Running it without parameters makes it start the sample app CppCliApp.exe.
C:\...>DiagProfilerConsole.exe 6292
v2.0.50727
v4.0.30319
AttachProfiler(6292, "C:\...\DiagProfiler.dll") : S_OK
Pipe = \\.\pipe\PipeServer_6292
Pipe = \\.\pipe\PipeServer_6292
Pipe = \\.\pipe\PipeServer_6292
Connected
Number of bytes read: 6
Message = [1][1]
Sampling Rate: 25
Number of bytes read: 6
Message = [2][1]
SetMaxSamples: 100
Number of bytes read: 6
Message = [3][1]
StackLogFile=C:\...\diagstack_6292.txt
Number of bytes read: 8
Message = [4][1]25
Samplerate: 25 ms
Number of bytes read: 6
Message = [5][1]
Started Sampling
Will run #100 samples
Number of bytes read: 6
Message = [6][1]
Stopped Sampling
Number of bytes read: 9
Message = [7][1]100
Got #100 samples
Press any key to continue . . .
Message = [MessageCount=7][CommandSucceded=1]100
"100" is the result from the executed server command.
Two files will be created in the current directory.
diagstack_#pid.txt and diagdebug_#pid.txt
Debugging
It is a bit cumbersome to debug an in-process profiler. One way is using OutputDebugString
, printf
s, and inserting "__debugbreak()
" in the code.
The log files can optionally also echo to OutputDebug
. Files are a bit awkward to work with when you have crashes, since the files are not flushed and closed. If you decide to see how it works, and run into problems, enable the debug output.
g_diagInit.m_debugLogger.Echo2OutputDebug(true);
g_diagInit.m_debugLoggerVerbose.Echo2OutputDebug(true);
DiagInit
is a class that should be used to initialize the objects. Some of my global variables could probably be moved into that class.
Usually when creating a COM DLL, its class factory has to be written, and a lot of glue code. I had a strange error which I couldn't get rid of, so I decided to write my own class factory, instead of using ATL COM. That file is called DiagProfilerFactory
. I made some sanity checks and logging there, but I later moved the code to DiagInit
which is statically created. That is, even before DiagProfilerFactory
, which is created dynamically.
Things Left to Do
Optimizations
There are quite a few optimizations to do. At this point, the symbol information for every call-frame is resolved, and the stacktrace is written to disk. This procedure is repeated, even if the same info has been resolved before. Here is a lot of time to save, caching the symbol information and mappings to managed functions. I would suggest building a cache, and storing only the IPs at every sample interval. When the profiling has finished, you go back to the sample data and resolve the symbol information.
Don't Profile all Threads
At some point, the profiler took samples of all threads belonging to a process. In a normal app, and particularly in a managed app, there are a lot of background threads, e.g., the garbage collector and finalizer threads. Usually the performance of those aren't interesting. Those are usually in a wait state (e.g., WaitForMultipleObjects
) and will always give you the same stacktrace. These threads can be removed too, either by not profiling them or removing them from the trace afterwards. In my implementation, profiling from the application startup profiles only the threads that have been assigned a managed thread ID. With respect to profiling an existing process, it profiles all threads, because I cannot obtain the managed thread IDs.
Fix AttachProfiler with Managed Thread IDs
I tried using ICorDebug
and to enumerate all managed threads. That part worked, but the ID of the thread was the Win32 thread ID, not the managed thread ID. The way to go is probably to enable more CLR callbacks, and call GetCurrentThreadId()
on them. The problem with that approach is that SetEvent
can only be called once, from Initialize
or InitializeForAttach
. Meaning that I cannot turn the events off again.
Points of Interest
Two interesting resources are Profiler Stack Walking in .NET Framework 2.0: Basics and Beyond and Building a mixed mode stack walker.
The managed-only profiler mentioned earlier Creating a Custom Net Profiler actually has a continuation by the same author, called Slimtune. It is also a mixed-mode profiler. I found out about Slimtune at a very late stage. It was a bit discouraging at first, but I decided to continue on my implementation anyway, because I didn't want to miss the learning experience. It is a more advanced version than mine (but you have to figure out for yourself how it works). Their implementation combines DoStackSnapShot
and ENTER/LEAVE/TAIL hooks in a smart way. They also seem to track memory allocations, and garbage collection. If you are interested in learning mixed-mode stackwalking, my implementation is probably easier to understand because it is smaller, has source code comments, and has this accompanying article. It is also a direct implementation of the Profiler Stack Walking in .NET Framework 2.0: Basics and Beyond guideline on MSDN.
Another good source of Managed debugging, ICorDebug
, and random .NET stuff is Mike Stall's blog.