I was working on building an RPC server the other day when I ran into an interesting threading related problem. RPC server functions are, as you might know, simple global functions implemented in your executable that the RPC runtime invokes whenever remote clients initiate a call. Here's a block diagram that shows how it works:
One of the key issues here is that the RPC runtime will call your implementation on an arbitrary thread that it creates/draws out from a thread pool. For the application that I was working on, it so happened that there were about 20 functions exposed from the server and all of them needed access to a window that had been popped up on the screen from the server EXE's primary thread. This, of course, presents us with a problem because the Win32 UI system insists that any kind of manipulation on a window must always occur from the thread that created the window. Now, given that my server functions will always be called on arbitrary threads, I had to figure out some way of transferring control from the RPC thread to the thread that had created the window. This article presents a technique that will let us do this by taking advantage of the way function calls occur on 32-bit x86 systems.
Before we proceed, I'd like to set some terminology straight. Just so I don't hasten my inexorable descent into carpal tunnel syndrome and lose my means for livelihood any sooner than is strictly necessary, I am going to use the term UI thread to refer to the thread that created the window and the term worker thread while referring to all other threads. The primary distinguishing factor between the two, apart from the fact that the former has created a window, is the presence of a message loop in it. A UI thread will have a message pump going that, typically, looks something like this:
MSG msg;
BOOL bReturn;

while( ( bReturn = GetMessage( &msg, NULL, 0, 0 ) ) != FALSE )
{
    if( bReturn == -1 )
    {
        // GetMessage failed; handle the error and bail out of the loop
    }
    else
    {
        TranslateMessage( &msg );
        DispatchMessage( &msg );
    }
}
A worker thread, on the other hand, just does a specific task and terminates.
It is useful to remember that this distinction is only logical, in that from the operating system's (OS) and the CPU's perspective, all threads are created equal and are treated likewise, except of course, when one of them has its thread priority boosted, in which case the OS scheduler gives it the right of way. Now that we've got the terminology out of the way, let's see how you can switch some threads around for fun and profit!
First, I'll show you how you can use the header tswitch.h in your own C++ project for switching the thread on which a given routine executes. For it to be applicable, however, your project and your routine must satisfy certain pre-requisites:
- Your solution must feature a window of some sort (duh!).
- Your routine must only take and return real and/or integral type parameters. Pointers are also allowed. This means that your function parameters and return values must be ints, floats, chars, or pointers. This only leaves out doubles, structures, and unions that are passed and returned by value. We'll see why this restriction exists later on in the article.
- Your application must target the 32-bit x86 architecture. x64 is not supported.
- Your routine must use the __stdcall or the __cdecl function calling convention.
Assume that you have a function called TheGreatDoofus that is executing on a worker thread. You need it to run on the thread that has created a window, the handle to which is available to you in the variable hWnd. Here's how you'd use tswitch.h to switch the execution of the routine to the UI thread.
#include "tswitch.h"
void __cdecl TheGreatDoofus( HWND hWnd,
int foo,
char *bar,
AStructure *pStruct )
{
THREAD_SWITCH( hWnd, TheGreatDoofus, 5, ccCdecl )
SetWindowText( hWnd,
_T( "I switched the great doofus!" ) );
}
Here's how the macro THREAD_SWITCH has been declared in tswitch.h:
#define InvokeRequired(hwnd) ( GetCurrentThreadId() != \
GetWindowThreadProcessId( hwnd, NULL ) )
#define THREAD_SWITCH(hwnd, fn, params, conv) \
if( InvokeRequired(hwnd) ) \
{ \
SendMessage( hwnd, WM_INVOKE, \
(WPARAM)new ThreadSwitchContext( (DWORD_PTR)fn, \
ThreadSwitchContext::GetEBP(), params, conv ), 0 ); \
return; \
}
I'll explain the code that the macro expands to in a moment. Let's look at another example where the function returns a 32-bit LONG value and happens to use the __stdcall function calling convention (we'll examine calling conventions in some detail later):
#include "tswitch.h"
LONG __stdcall TheGreatDoofus( HWND hWnd,
int foo,
char *bar,
AStructure *pStruct )
{
THREAD_SWITCH_WITH_RETURN( hWnd, TheGreatDoofus,
5, ccStdcall, LONG )
SetWindowText( hWnd,
_T( "I switched the great doofus!" ) );
return 0L;
}
The parameters are exactly the same as THREAD_SWITCH, except that it requires one additional parameter specifying the type of the return value (LONG in the example above).
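THREAD_SWITCH_WITH_RETURN itself isn't listed in this article, so here's a rough sketch of what it might expand to. This is an assumption on my part - it presumes that THREAD_SWITCH_INVOKE (used below) hands the invoked routine's return value back as the result of SendMessage - so treat it as illustrative rather than as the actual contents of tswitch.h:

// Hypothetical expansion - the real tswitch.h may differ.
#define THREAD_SWITCH_WITH_RETURN(hwnd, fn, params, conv, rettype) \
    if( InvokeRequired(hwnd) ) \
    { \
        return (rettype)SendMessage( hwnd, WM_INVOKE, \
            (WPARAM)new ThreadSwitchContext( (DWORD_PTR)fn, \
            ThreadSwitchContext::GetEBP(), params, conv ), 0 ); \
    }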
And finally, in your window procedure, you'd need to handle a message called WM_INVOKE, like so:
LRESULT CALLBACK WndProc( HWND hWnd, UINT message,
                          WPARAM wParam, LPARAM lParam )
{
    if( message == WM_INVOKE )
    {
        THREAD_SWITCH_INVOKE(wParam);
    }

    return DefWindowProc( hWnd, message, wParam, lParam );
}
If you're using MFC, then you'll have to add an ON_REGISTERED_MESSAGE message map entry. The MFC version might look like this:
BEGIN_MESSAGE_MAP( CDoofusFrame, CFrameWnd )
    ON_REGISTERED_MESSAGE( WM_INVOKE, OnInvoke )
END_MESSAGE_MAP()

LRESULT CDoofusFrame::OnInvoke( WPARAM wParam,
                                LPARAM lParam )
{
    THREAD_SWITCH_INVOKE( wParam );
}
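A small aside: the ON_REGISTERED_MESSAGE entry implies that WM_INVOKE is a registered window message rather than a plain WM_APP/WM_USER offset. The exact name string is tswitch.h's business, but its definition presumably looks something along these lines (the string below is made up for illustration):

const UINT WM_INVOKE = ::RegisterWindowMessage( _T( "WM_TSWITCH_INVOKE" ) );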
That's it! Couldn't get simpler, could it?! Now, we take a look at the deep dark implementation secrets!
Before we get to the deep dark secrets, however, you'll need to know a bit of x86 assembly language programming. I'll cover just enough assembly in this section so you can follow along without too much trouble. While I am no assembly expert, I will try and share my smattering of knowledge. In fact, we'll only look at four assembly instructions now, because those are the only instructions used here!
A program in assembly is composed of a sequence of instructions. Well, aren't all programs written in just about any language a sequence of instructions, you rightfully ask! Yes, indeed they are. The key difference is that, with assembly, most of the time, a single instruction actually does only a single thing (as opposed to a line of C# code, for instance, which might set off a chain of events on computers distributed all over the planet, and you might not even know about it)! Another aspect of programming using assembly is that you are coding close to the metal, so to speak. You find yourself having to deal with CPU registers, stacks etc., directly.
The notion of a stack data structure is central to assembly language programming (in fact, we use stacks even when using C and C++ - all function parameters and local variables are allocated on the stack; I am going to assume that you know what it is). As you might know, the stack is a last in, first out structure. There is a set of assembly instructions that allow you to manipulate it in various ways. To push an item on to the stack, you use an instruction called, well, push! Here's a sample that pushes the value 10 on to the stack.
push 10
Simple, eh?! In the code given above, push is the instruction mnemonic and 10 is the operand. If you are using Visual C++, then you can directly embed assembly into your source files. Here's an example:
void DoTheAssemblyJiggyWiggy()
{
    __asm push 20

    __asm
    {
        push 10
        push 20
        push 30
    }

    int value = 30;
    __asm push value
}
What's really neat about embedded assembly is that the compiler allows you to refer to and access local variables from the assembly! In the snippet above, for example, the number 30 stored in the variable value is pushed on to the stack by referring to it from the assembly.
CPU registers are the fastest form of computer memory available to a CPU. There are various kinds of registers on a typical CPU, differentiated by size and purpose. There are four general purpose registers called - for lack of better names - A, B, C, and D! But, somebody thought those names were too short, and decided to append an X to each one. So, we ended up having AX, BX, CX, and DX, instead. These registers are all 16 bits wide, and you can even access their low and high bytes separately, if you are so inclined. For example, AH and AL refer to the high and low bytes of the register AX. You can use a similar scheme for BX, CX, and DX as well. On 32-bit CPUs, these registers grew to 32 bits, and the 32-bit versions simply had an E prepended to their names. The same four registers, when 32 bits wide, are therefore called EAX, EBX, ECX, and EDX.
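Here's a small throwaway snippet that pokes at the sub-registers of EAX, just to make the relationship concrete (the function name is made up, and the comments trace what EAX ends up holding):

void DoTheSubRegisterJiggyWiggy()
{
    __asm
    {
        mov eax, 0x12345678   // EAX is now 0x12345678
        mov ax,  0xABCD       // AX is the low 16 bits: EAX is now 0x1234ABCD
        mov al,  0xEF         // AL is the low byte of AX: EAX is now 0x1234ABEF
        mov ah,  0x12         // AH is the high byte of AX: EAX is now 0x123412EF
    }
}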
So, how do you assign a value to one of these registers? You use the mov instruction. Unlike push, mov requires two operands. The first one is the destination where the assigned value must be stored, and the second is the value that is to be assigned. Here're a few examples:
void DoTheAssemblyJiggyWiggyAgain()
{
    __asm mov ax, 10

    __asm
    {
        mov bl, 5
        mov bh, 5
    }

    unsigned long value = 0xFFFFFFFF;
    __asm mov ecx, value
}
To add two values, you use the add instruction. Like mov, add requires two operands. It simply accumulates the second operand value into the first. The following snippet, for instance, adds 10 to whatever happens to be stored in ECX. If ECX had had the value 5, then after the following line is executed, it will have the value 15.
add ecx, 10
A function call is made using the call instruction. Here's an example:
void Dooga()
{
}

void DoTheAssemblyJiggyWiggyOnceAgainPlease()
{
    __asm call Dooga
}
That's it! That's all the assembly you'll need to know to understand thread switching internals. You might get to know a few more registers and learn about two other small instructions off-hand, but otherwise, this is it. And, you thought this was going to be hard!
Let's now take an under-the-hood look at how function calls really work on 32-bit x86 systems. Consider the following insultingly simple C++ code:
int Add( int value, char ch )
{
    return (value + 10);
}

void Booga()
{
    int value = 10;
    value = Add( value, 'a' );
}
Here's the disassembly for this code snippet, generated using DUMPBIN. I've edited the output a bit so that we can focus on the important stuff. Let's try and dissect it.
?Add@@YAHHD@Z (int __cdecl Add(int,char)):
push ebp
mov ebp,esp
mov eax,dword ptr [ebp+8]
add eax,0Ah
pop ebp
ret
?Booga@@YAXXZ (void __cdecl Booga(void)):
push ebp
mov ebp,esp
push ecx
mov dword ptr [ebp-4],0Ah
push 61h
mov eax,dword ptr [ebp-4]
push eax
call ?Add@@YAHHD@Z
add esp,8
mov dword ptr [ebp-4],eax
mov esp,ebp
pop ebp
ret
The first thing that might have caught your eye is the keyword __cdecl that seems to have been surreptitiously inserted between the return type and the function name. Here's what I am talking about:

(int __cdecl Add(int,char)):
__cdecl is an extension keyword (i.e., a keyword that is not part of the C or C++ standards) used by the Visual C++ compiler to identify functions that are to use the C function calling convention. A function calling convention is used to tell the compiler exactly how parameters are to be passed to a function and how the stack state should be restored after the function call is complete. We review here two of the most commonly used calling conventions on Windows systems - __cdecl and __stdcall.
The __cdecl calling convention is the default for C and C++ programs when using the Visual C++ compiler; i.e., if you do not explicitly specify a calling convention, then the compiler will use __cdecl. With this convention, function parameters are passed via the stack in right-to-left order. Further, the caller is expected to restore the stack state after the function call is complete. If you wanted to call the Add function shown above, passing the values 10 and 'a' to it, then this is how you would do it in assembly:
push 97       ; 97 is the ASCII code for 'a'
push 10
call Add
add  esp, 8
ESP is a special purpose register that the CPU uses to track the next available location on the stack. Of the total address space available to a process, stack addresses start out from the high end, and heap addresses start out from the low end. Here's a picture depicting this:

What this means is that when you push something on to the stack, ESP is reduced by the size of the data that was pushed. For example, if ESP is currently pointing to the address 100, then pushing a 4 byte integer would cause the register value to be decremented by 4, so that after the push, ESP would have the value 96. Freeing this space is a simple matter of adding 4 back to ESP using the assembly instruction add. In the code snippet above, since we are using the __cdecl calling convention, it is the caller's responsibility to restore the stack state after the function call completes, which we do by adding 8 to ESP. We add 8 because we pushed two 4 byte values on to the stack before the call instruction. Even though the second parameter is a 1 byte char, values pushed on to the stack need to be 32 bits wide.
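Here's a small snippet that demonstrates this ESP arithmetic directly. The function and variable names are mine and purely illustrative; the comments trace what happens to ESP:

void DoTheStackPointerJiggyWiggy()
{
    unsigned long before = 0, after = 0;

    __asm
    {
        mov  before, esp    // say ESP is 0x0012FF70 at this point
        push 10             // a 4 byte value goes on to the stack...
        mov  after, esp     // ...so ESP is now 0x0012FF6C (before - 4)
        add  esp, 4         // give the 4 bytes back; ESP equals 'before' again
    }

    // At this point, after == before - 4.
}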
As you might imagine, using the __cdecl function calling convention results in code bloat, because the add instruction to appropriately adjust ESP must be inserted into the code stream at every point the function call is made.
The __stdcall calling convention is similar to __cdecl in that parameters are passed via the stack in right-to-left order. The stack restoration responsibility, however, resides with the called function. If the function Add, as defined above, had been set to use the __stdcall calling convention, then the code for calling it would look like this in assembly:
push 97
push 10
call Add
As you might have noticed, we do not restore the stack after the function call; the function Add would have done this for us.
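How does Add give the stack back when it's __stdcall? The callee's epilogue typically does it as part of the return itself - a hypothetical __stdcall epilogue for Add would end along these lines:

pop ebp      ; restore the caller's base pointer
ret 8        ; return, and pop the two 4-byte parameters off the stack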
Now that we've seen the effect that calling conventions have upon the code that is generated while making a function call, let us next take a look at the code that is generated inside the called function itself. I reproduce the disassembly for the function Add here, which, as you might recall, was configured to use the __cdecl calling convention.
?Add@@YAHHD@Z (int __cdecl Add(int,char)):
push ebp
mov ebp,esp
mov eax,dword ptr [ebp+8]
add eax,0Ah
pop ebp
ret
The text ?Add@@YAHHD@Z is the decorated (mangled) name of the function Add. The first two lines represent the function prologue, and the last two lines the function epilogue. Everything in between is the function body. The prologue and epilogue are automatically inserted by the compiler for all functions (except for functions marked as naked, via __declspec(naked)).
The EBP register is used to store the base pointer for a given function. The base pointer points to the location on the stack from where the parameters for the current function can be accessed. As we saw a while back, calling a function like Add involves the pushing of parameters on to the stack, followed by a call instruction. The call instruction does two things - it first pushes the return address on to the stack, and then transfers control to the called address. So, when you're at the first instruction in the function Add, here's what the stack looks like:
The first thing that Add does is to push the current value of EBP, which contains the base pointer of the caller, on to the stack. It does this so that it can restore the caller's base pointer when the function returns. It then loads the current value of ESP into EBP to establish the current function's base pointer. Here's what the stack looks like once the function prologue has been executed:
In any given function, therefore, you can access all the parameters passed to it by adding 8 bytes to the address stored in EBP. We add 8 bytes in order to skip the caller's saved EBP and the return address.
The function epilogue does the reverse - it pops the caller's base pointer value from the stack into the EBP register, and executes a ret instruction that causes the CPU to shift control to the return address that's stored at the top of the stack. The pop instruction, as you might have guessed, does the inverse of push - it reads whatever happens to be at the location pointed at by ESP into the destination that you give as the instruction's operand, and adds 4 to ESP.
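To tie the prologue, body, and epilogue together, here's the same Add listing once more, this time with a comment of mine against each instruction:

?Add@@YAHHD@Z (int __cdecl Add(int,char)):
push ebp                      ; prologue: save the caller's base pointer
mov  ebp,esp                  ; prologue: establish this function's base pointer
mov  eax,dword ptr [ebp+8]    ; fetch the first parameter ('value') from EBP+8
add  eax,0Ah                  ; add 10; EAX carries the return value
pop  ebp                      ; epilogue: restore the caller's base pointer
ret                           ; epilogue: jump to the return address on the stack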
So, is that all there is to calling functions? Well, not really. There are quite a few other calling conventions that I haven't covered here (__fastcall, __naked, __thiscall) because they aren't relevant to the topic under discussion. But, you should be able to easily find more information on these somewhere else on the great internet!
Now that we've covered all the background, thread switching itself is quite straightforward. Here's what we do from the worker thread:
- We check if the current routine is executing on the window's thread by comparing the value that GetCurrentThreadId returns with the value that GetWindowThreadProcessId returns. GetWindowThreadProcessId returns the ID of the thread on which the given window has been created. In tswitch.h, this check is performed from a macro called InvokeRequired. Here's how it has been defined:
#define InvokeRequired(hwnd) ( GetCurrentThreadId() != \
GetWindowThreadProcessId( hwnd, NULL ) )
- If the routine is not running on the UI thread, then we send a message called WM_INVOKE to the given window, along with the following information packed up in a data structure called ThreadSwitchContext, passed via WPARAM:
  - The address of the function to be called under the UI thread's context.
  - The EBP value of the current routine.
  - The number of parameters accepted by the current function.
  - And, the calling convention used by it.
This is how the ThreadSwitchContext structure has been declared:
struct ThreadSwitchContext
{
    DWORD_PTR Address;
    DWORD_PTR Ebp;
    BYTE ParamsCount;
    CallingConvention Conv;

    ThreadSwitchContext( DWORD_PTR addr, DWORD_PTR ebp,
                         BYTE count, CallingConvention conv ) :
        Address( addr ), Ebp( ebp ),
        ParamsCount( count ), Conv( conv )
    {
    }

    ... other structure members here ...
};
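The ccCdecl and ccStdcall values used with the macros come from a CallingConvention enumeration. tswitch.h isn't reproduced in full in this article, but a minimal declaration of it would look something like this (the comments are just reminders of who cleans up the stack):

enum CallingConvention
{
    ccCdecl,    // __cdecl: the caller cleans up the stack
    ccStdcall   // __stdcall: the callee cleans up the stack
};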
Please note that we send WM_INVOKE to the target window using SendMessage, and not PostMessage. This is important because SendMessage is a blocking call that will not return till the message in question has been processed by the target window, which is a key requirement for this technique to work. Basically, we need the stack frame of the worker thread to remain intact till WM_INVOKE has been processed completely.
In the window procedure of the target window, we process WM_INVOKE like so. This implementation is available in the ThreadSwitchContext::Invoke method.
- We cast the WPARAM parameter to a ThreadSwitchContext pointer and retrieve the worker thread's base pointer value stored in ThreadSwitchContext::Ebp.
- We add 8 to ThreadSwitchContext::Ebp to access the first parameter to the function.
- We then launch a loop that iterates ThreadSwitchContext::ParamsCount times. During each iteration, we access the worker thread routine's input parameters in the order that they were pushed on to that thread's stack, and push them on to the local thread's stack. Here's what it looks like:
LPBYTE params = (LPBYTE)Ebp;
for( int i = ParamsCount - 1 ; i >= 0 ; --i )
{
    DWORD_PTR param = *((PDWORD_PTR)(params + (i * sizeof(DWORD_PTR))));
    __asm push param
}
Once this is done, the local stack is fully set up for making a call to the worker thread routine. We do the invocation using the assembly call instruction.
DWORD_PTR address = Address;
__asm call address
Once the called routine completes execution, the return value will be made available in the register EAX. We save this value into a local variable.
DWORD_PTR dwEAX;
__asm mov dwEAX, eax
And finally, if the calling convention of the routine is __cdecl, then we clean up the stack by adding back the space that we had taken off it while pushing the parameters.
DWORD stack = ( ParamsCount * sizeof( DWORD ) );
if( Conv == ccCdecl )
{
    __asm add esp, stack;
}
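Putting these fragments together, ThreadSwitchContext::Invoke would look roughly like the sketch below. This is a reconstruction from the snippets above rather than the verbatim contents of tswitch.h - in particular, the "+ 8" adjustment simply restates step 2 of the description, and returning dwEAX assumes the value is handed back as the result of SendMessage:

DWORD_PTR ThreadSwitchContext::Invoke()
{
    // Step 2: skip the worker routine's saved EBP and return address so
    // that 'params' points at its first parameter.
    LPBYTE params = ((LPBYTE)Ebp) + 8;

    // Step 3: re-push the worker routine's parameters, right to left, on
    // to this (UI) thread's stack.
    for( int i = ParamsCount - 1 ; i >= 0 ; --i )
    {
        DWORD_PTR param = *((PDWORD_PTR)(params + (i * sizeof(DWORD_PTR))));
        __asm push param
    }

    // Call the routine on this thread; any return value lands in EAX.
    DWORD_PTR address = Address;
    __asm call address

    DWORD_PTR dwEAX;
    __asm mov dwEAX, eax

    // For __cdecl routines, the caller (that's us) restores the stack.
    DWORD stack = ( ParamsCount * sizeof( DWORD ) );
    if( Conv == ccCdecl )
    {
        __asm add esp, stack
    }

    return dwEAX;
}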
That's all there is to it, really! The only other interesting implementation detail is the mechanism we use to access the worker thread's EBP register. I wanted the enabling of thread switching in a given routine to be as painless as possible, which is why I wrapped up all the gory details in friendly macros such as THREAD_SWITCH and THREAD_SWITCH_WITH_RETURN. One problem that I ran into while doing this was that you can't use embedded assembly in macros - inline assembly in a macro seems to really mess up the pre-processor! This became a problem when I tried to access the local routine's EBP register, which, had this problem not existed, I would have done like this:
#define THREAD_SWITCH(hwnd, fn, params, conv) \
    if( InvokeRequired(hwnd) ) \
    { \
        DWORD dwEBP; \
        __asm mov dwEBP, ebp \
        SendMessage( hwnd, WM_INVOKE, \
            (WPARAM)new ThreadSwitchContext( (DWORD_PTR)fn, \
            dwEBP, params, conv ), 0 ); \
        return; \
    }
Since this was producing weird compilation errors, I decided to use a helper function such as the following to access the EBP value and use it from the macro:
DWORD_PTR GetEBP()
{
    DWORD_PTR dwEBP;
    __asm mov dwEBP, ebp
    return dwEBP;
}
If you've been reading the article carefully, then you'd know that this cannot work! It cannot work because, now, GetEBP itself is a function and has its own base pointer, which naturally is different from the caller's. So then, I modified GetEBP to look like this:
DWORD_PTR GetEBP()
{
    DWORD_PTR dwEBP;
    __asm mov dwEBP, ebp
    return *( (PDWORD_PTR)dwEBP );
}
The idea is to take advantage of the fact that the value pointed at by the local function's base pointer is the caller's base pointer (recall that to access the function parameters, we had to skip this saved base pointer and the return address). Now, everything was hunky dory, till I did a release build, i.e., a build having all the compiler optimizations turned on. Suddenly, I found my parameters having strange values after the thread switch. As it turned out, since GetEBP was so small, the compiler went ahead and inlined it - which meant that the mov was now reading the calling routine's EBP directly, and the extra dereference walked one stack frame too far! Talk about coming full circle! So finally, I settled on the following definition for GetEBP:
DWORD_PTR GetEBP()
{
    DWORD_PTR dwEBP;
    __asm mov dwEBP, ebp

#ifdef _DEBUG
    // Debug builds don't inline this helper, so EBP here belongs to GetEBP
    // itself; dereference it once to get the caller's base pointer.
    return *( (PDWORD_PTR)dwEBP );
#else
    // Release builds inline GetEBP into the caller, so EBP is already the
    // caller's base pointer.
    return dwEBP;
#endif
}
As I had mentioned when listing out the pre-requisites, the technique described here cannot be used with functions that accept structure, union, or double parameters by value. The same thing applies to function return values as well. The reason why this is so is that our thread switching logic assumes that each parameter to the function consumes exactly 4 bytes on the stack. With structures, unions, and doubles, this might not be the case.
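For instance, neither of these (made-up) signatures could be switched with the macros above, because a double and a by-value structure each occupy more than 4 bytes on the stack:

// Hypothetical signatures that the technique cannot handle.
double __cdecl   ComputeAverage( double a, double b );    // 8-byte parameters and an 8-byte return value
RECT   __stdcall InflateBy( RECT bounds, int amount );    // a structure passed and returned by value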
So, there it is! You can download tswitch.h here, and you can download a demo project here. Feel free to report bugs and give feedback on this article's discussion forum, and I'll be happy to respond and oblige!
- September 16, 2007: Article first published.