Introduction
This article describes, as its title suggests, a way to provide so-called single-threaded concurrency for complex programs with many components. First, we'll talk about concurrency and distinguish its different types. In particular, we'll see that concurrency is not a synonym for multi-threading, and that in many cases (actually, in most cases), single-threaded concurrency is the way to go.
The approach that I want to describe in this article is something I implemented several years ago, and since then, I've written dozens of applications using it, some of which are pretty complex. It has proved to be extremely handy and flexible.
I had always felt there was a need for an article on this topic, and I really should have posted it long ago. Up until today, I have never seen a correctly designed complex application with concurrency. This may sound like an exaggeration, but unfortunately, it is the sad truth. I've worked in several companies, and everywhere, I have seen little monsters awfully overloaded with threads, without much thought given to synchronization, deadlock prevention, etc.
Concurrency
First, let's decouple concurrency and multithreading without digging too much into the theory.
Some operations take significant time to complete, and during this time, we want to be able to do something else; that's where concurrency is needed. Let's look at the two main reasons for operations to be time-consuming:
- Operations that involve heavy computations. They actually need CPU time.
- Operations that depend on network / hardware / user interaction / etc. They need time because they need to wait for something to happen.
For operations of the first kind, multithreading is preferred. This is because on MP (multi-processor) systems, several threads may actually be executed simultaneously. This way, higher overall performance is achieved.
However, for operations of the second kind, multithreading is useless: creating several threads to wait for some conditions to occur (such as a network event or user interaction) won't make those conditions occur any faster. In fact, a single thread may wait for any of the specified conditions to occur and do what is necessary for any of them.
This is actually the most common case: 99% of all applications do nothing 99% of the time (at least they should). Creating multiple threads for such an application just to put all of them into a suspended state isn't good for overall system performance (every thread context has a significant cost for the system: context data, context switches, etc.). In addition, multi-threaded applications require some skill to write correctly, without all kinds of deadlocks, synchronization problems, or race conditions. So, a single-threaded approach is preferred here.
The single-threaded approach, however, is a challenge to implement for complex applications with different components. When all those components work in different threads, there's no problem: every component does whatever it wants in its own thread, and the whole thing "works" in parallel (as said before, it actually does nothing at all most of the time). However, if we want those components to work in a single thread, a particular component is not allowed to call a waiting function (such as WaitForXXXX), because by doing so, it blocks the other components. Instead, it should work as a state machine: respond to an event and return control immediately. And if it does call a waiting function, that function should wait for all the events of all the components in this thread, and pass control to the component whose event was signaled.
To summarize, multi-threading is preferred when:
- Significant CPU time is needed.
- Need to call an awkward synchronous (blocking) API.
- It's too complex to convert the components into state machines. (Theoretically, this can be solved by using fibers, but we won't discuss that here.)
In all other cases, single-threading should be preferred.
The purpose of this article is to present a ThreadCtx library that implements the above design, so that multiple components may work in a single thread without direct interaction. We'll see that it's a very lightweight library that imposes minimal restrictions. Most importantly, it's not a runtime that steals control from you: you retain control over the program execution at all times. The only thing the library restricts is blocking a component from responding to its events, and all the components are guaranteed to receive their notifications and not block each other. For this to work correctly, all the components must do all their scheduling and waiting via ThreadCtx, and never call waiting functions directly. The rest is up to the ThreadCtx library.
Wait? For what?
Before we discuss the API and the implementation of ThreadCtx, let's cover some technical details about waiting conditions and functions. There are plenty of Win32 functions that may put a thread into a suspended state (a.k.a. waiting) until some criterion is met. First, we'll classify those functions and look at what they actually wait for.
- Timer (Sleep; many other waiting functions accept a timeout as well)
- Win32 waitable handles (WaitForSingleObject)
- Win32 window messages (GetMessage, WaitMessage, etc.)
- APC completion (SleepEx)
- I/O completion port (GetQueuedCompletionStatus)
The last type (I/O completion port) is an advanced waitable object. It's designed especially to be used by multiple threads, and we won't discuss it here.
So, from the OS's point of view, a thread may wait for one (or more) of the first four types of conditions. Luckily, there are Win32 API functions that combine different types of conditions to wait for, so that when we need to wait for several different types of conditions, we may always use the appropriate API function:
- Sleep - timeout
- WaitForMultipleObjects - timeout, waitable handles
- MsgWaitForMultipleObjects - timeout, waitable handles, messages
- SleepEx - timeout, APC
- WaitForMultipleObjectsEx - timeout, waitable handles, APC
- MsgWaitForMultipleObjectsEx - timeout, waitable handles, messages, APC
So, theoretically, there's no problem waiting for any of those events, and several objects that need to wait for them may be served in a single thread. However:
- There's a limit on the handle count for WaitForMultipleObjects (and similar functions): MAXIMUM_WAIT_OBJECTS, currently defined as 64. But this can be worked around: either ThreadCtx may create internal threads to wait for more handles, or WaitForMultipleObjects may be called with timeout=0 several times for all the waitable handles, followed by Sleep(0) or SwitchToThread() if none of the objects is signaled. So, for the design of the ThreadCtx API, this limitation is irrelevant: the library should guarantee proper scheduling for an arbitrary number of waitable handles. The consumer of ThreadCtx should keep in mind that overall performance may degrade if there are too many waitable handles, and if many synchronization objects are needed, another mechanism should be preferred. For instance, applications with many asynchronous I/O objects (such as sockets) should rather use the APC mechanism (which doesn't have this limitation) to track I/O completion.
- A waiting function may receive only one timeout, whereas several objects may need to be scheduled with different timers. There are API functions that enable multiple timers (such as SetTimer, CreateWaitableTimer, etc.), but they impose other restrictions. So, ThreadCtx will be responsible for managing timers internally and ensuring that the waiting functions are called with the appropriate (closest) timeout.
ThreadCtx design
As we said, objects will not call waiting functions directly; instead, they'll use the ThreadCtx API. The per-thread state of ThreadCtx will be stored in the thread's TLS. For those not familiar with TLS: it's "thread local storage", something like global variables that are unique to every thread. TLS is a very lightweight and fast mechanism; performance-wise, accessing a TLS variable is comparable to accessing a regular variable.
The main API element of ThreadCtx is the Schedule. It's a base class that represents any waitable condition, and for each of the four condition types (timer, Win32 handle, Win32 message, APC), there's a derived class with the appropriate extra methods.
class ThreadCtx {
public:
    class Schedule
    {
    public:
        bool Wait();
        bool get_Signaled() const;
        void put_Signaled(bool bVal);
        __declspec(property(get=get_Signaled,put=put_Signaled)) bool _Signaled;
        virtual void OnSchedule();
    };
};
As we can see, a Schedule has the following:
- The Signaled property. When set, the waitable condition is satisfied but not yet processed (more on this later).
- The OnSchedule virtual function. It's invoked when ThreadCtx detects that the waitable condition is satisfied. If overridden, it may do in-place processing of the event. If it sets the Signaled property (as the default implementation does), in-place processing was not enough, and control should return to the appropriate component (more on this later).
- The Wait function. This is a blocking function (and the only blocking function) that waits for the condition this Schedule represents. If the condition is satisfied, the function returns true. However, the function actually keeps in mind the waitable conditions of all the other Schedules, and if any of them is satisfied first and its in-place processing was not enough (Signaled is set), Wait returns false.
Now, a more detailed explanation. The main function is Wait; in fact, this is the only place where ThreadCtx gets involved. The function does the following:
1. Check if there are signaled schedules. If so, go to step (5).
2. Look at all the schedules instantiated in the current thread, and wait for any of the requested conditions (via one of the Win32 waiting functions discussed above).
3. After the waiting function returns, identify the satisfied condition, and process the appropriate schedule(s) (call the appropriate OnSchedule).
4. Go to step (1).
5. If the schedule on which Wait was called is in the signaled state, reset it (set Signaled to false) and return true. Otherwise, return false.
This may sound complicated, but in fact, there's nothing complex here. The idea is that Wait attempts to wait for the condition of the schedule it was called on. If it returns true, the condition is met, and the caller may go ahead. But if Wait returns false, the condition was not satisfied, yet it's impossible to wait any longer, because another condition has been satisfied and not yet handled.
OnSchedule may be overridden. When it's called, you may do something without setting the Signaled property. This means that you have handled the condition, and the caller of Wait (on whatever schedule) may continue waiting. Such behavior makes it possible to do something "while waiting" for something else (a.k.a. background processing).
As we've said, there are four types of conditions. Here is how they're defined:
class ThreadCtx {
public:
    class ScheduleHandle : public Schedule
    {
    public:
        void put_Handle(HANDLE hVal);
        HANDLE get_Handle() const { return m_hHandle; }
        __declspec(property(get=get_Handle,put=put_Handle)) HANDLE _Handle;
    };
    class ScheduleTimer : public Schedule
    {
    public:
        void KillTimer() { _Timeout = INFINITE; }
        ULONG get_Timeout() const;
        void put_Timeout(ULONG);
        __declspec(property(get=get_Timeout,put=put_Timeout)) ULONG _Timeout;
        bool IsTimerSet() const { return 0 != m_Key; }
    };
    struct NeedApc {
        NeedApc();
        ~NeedApc();
    };
    struct NeedMsg {
        NeedMsg();
        ~NeedMsg();
    };
};
- ScheduleHandle is a Win32 waitable handle condition. When its Handle is set to a non-NULL value, it's attached to the scheduling mechanism, so that Wait is aware of it.
- ScheduleTimer is a single-shot timer with the standard precision of the Win32 waiting functions (not a high-precision multimedia timer). When its _Timeout is set to some value other than INFINITE, it's attached to the scheduling mechanism, so that Wait is aware of it.
- NeedApc represents the wish to wait for a queued APC completion. Whenever such an object is instantiated in the thread, ThreadCtx is aware of it, hence a subsequent Wait call will select an APC-aware Win32 waiting function. Note: NeedApc is not derived from Schedule, because the processing of an APC event is done automatically by the OS (the APC routine is called).
- NeedMsg - same as NeedApc. It represents the wish to wait for a Windows message in the message queue, and tells Wait to select a message-aware Win32 waiting function. Whenever messages are detected in the message queue, they're automatically dispatched (via DispatchMessage), and the actual message processing is done by the appropriate window.
Below, we'll look at a couple of examples to make things clear.
// Somewhere in the program, the "aborting" conditions are set up:
ThreadCtx::ScheduleHandle abortSched;
abortSched._Handle = handleThreadStop;

ThreadCtx::ScheduleTimer timeoutSched;
timeoutSched._Timeout = 5000;

CallIo();

void CallIo()
{
    ThreadCtx::ScheduleHandle socketIoSched;
    socketIoSched._Handle = handleSocketEvent;

    if (!socketIoSched.Wait())
        return; // aborted: the timeout expired or the stop event was signaled
}
In this example, CallIo performs a blocking operation that may be aborted by different conditions: either the timeout expiring or the abort event being signaled. Most importantly, CallIo doesn't have to be aware of all the "aborting" conditions; it just attempts to do what it wants, and the abort happens automatically.
class PeriodicalWork : public ThreadCtx::ScheduleTimer
{
public:
    virtual void OnSchedule()
    {
        // do some periodic work, then re-arm the timer
        _Timeout = 3000;
    }
};

class SomeEventHandler : public ThreadCtx::ScheduleHandle
{
public:
    virtual void OnSchedule()
    {
        // handle the event in-place
    }
};

PeriodicalWork worker1;
worker1._Timeout = 2000;

SomeEventHandler worker2;
worker2._Handle = /* ... */;

CallIo();
In this example, two objects process their events (a Win32 waitable handle and a timer) while another blocking operation is in progress, and that operation does not need to be aware of them.
PeriodicalWork worker1;
worker1._Timeout = 2000;

SomeEventHandler worker2;
worker2._Handle = /* ... */;

ThreadCtx::ScheduleHandle term;
term._Handle = /* ... */;
term.Wait();
In this example, we initialize several objects, and then wait for the termination handle to become signaled. During this period, all the objects will process their events.
Something familiar
This design resembles the Windows message system in some ways. That is, you may create several windows and "run the message loop", which is a loop of GetMessage and DispatchMessage (with optional TranslateMessage, idle processing, and other things), and all the windows will get their messages, which their window procedures will process. Such windows may process events asynchronously, without direct interaction between each other. Plus, windows can do timed jobs, using per-window timers. In addition, there are blocking calls, such as MessageBox, DialogBoxParam, GetOpenFileName, etc. They're wrapped in MFC by things like AfxMessageBox, DoModal, etc. Those are all blocking calls; however, they run the message loop as well, hence all the windows created by the thread still process their messages.
This is similar to our design, though rather awkward. Creating a window object is equivalent (in some sense) to creating a Schedule in our design; the role of the window procedure is now played by the virtual OnSchedule function. The major difference is that when you create a window, you receive all the messages in the world, most of which you just pass to DefWindowProc, whereas our schedule gets only one specific notification.
Calling one of the "blocking" functions (such as DialogBoxParam) is equivalent to calling the Wait function of our schedule. Exactly as in the Windows API, where calling DialogBoxParam doesn't prevent other windows from receiving their messages, in our API, calling Wait on one Schedule doesn't disable all the others.
Note: our Wait may be aborted (if another Schedule gets scheduled and its Signaled property is set), and Windows is designed like that too! In Windows, when you call GetMessage, it returns 0 if the encountered message is WM_QUIT, and by convention, the appropriate message loop should then stop. You call the PostQuitMessage function, which automatically "aborts" the blocking calls. That is, DialogBoxParam or MessageBox usually returns the result passed to EndDialog; however, if a WM_QUIT message is encountered, they return -1 to indicate an "aborted blocking operation".
In the times of Windows 3.11, things worked that way. In fact, anyone could block others from execution, either by doing long-duration work (such as heavy calculations) or by calling GetMessage for a specific window only (actually, I have no idea why this was allowed), but that was known to be bad practice, and everyone was aware of it. Things got messy when Win32 arrived with all its threads, synchronization objects, APC queues, and many different waiting functions that wait for specified things only. In fact, the wait-for-anything function (MsgWaitForMultipleObjectsEx) was only added in Windows 98.
Our design is not totally new. It just establishes a convention that should have arrived with Win32. Plus, unlike the awkward message subsystem, we have very clear definitions of how things behave. And, instead of one heavy window object responsible for everything, we have lightweight primitives, each responsible for one specific thing.
ThreadCtx implementation
The implementation is very lightweight. Its extra cost is negligible compared to a call to any waiting function (which involves a kernel-mode transition).
The ThreadCtx object must be instantiated in every thread that may access it (wherever Schedules are used). For instance, it can be created on the stack "at the beginning" of the thread. In its constructor, it saves its pointer in the TLS, and every function invoked on a Schedule accesses it.
Note: optionally, this can be changed so that the ThreadCtx object is created in every thread on demand (that is, the first use of a Schedule automatically creates the ThreadCtx for the given thread). This method requires per-thread cleanup as well.
The ThreadCtx consists of the following:
- A list of pointers to the ScheduleHandle objects. Every ScheduleHandle object automatically inserts itself into this list when its Handle is set to a non-NULL value, and removes itself when it's set back to NULL (naturally, it also ensures removal from this list in its destructor).
- A binary tree of ScheduleTimer objects. When the _Timeout of a ScheduleTimer object is set to something other than INFINITE, it inserts itself into this tree, where the tree key is the "wakeup" time of the timer (GetTickCount() + _Timeout). This way, before invoking any waiting function, ThreadCtx can identify the closest timer in logarithmic time (which is fast, of course). It correctly handles the tick counter wraparound, as well as detecting already-late timers.
- A counter of instantiated NeedMsg objects, and a counter of instantiated NeedApc objects. Those objects increment/decrement the counters in their constructors/destructors. This way, before invoking a waiting function, ThreadCtx knows whether the wait should be message/APC-aware (the appropriate counter is non-zero).
- A counter of schedules in the Signaled state. Upon a state change, every schedule automatically increments/decrements this counter, so ThreadCtx always knows whether there are signaled schedules before calling a waiting function.
When you call Schedule::Wait, it accesses the ThreadCtx, which invokes a waiting function in a loop and schedules the appropriate Schedule object(s), until one of them sets the Signaled property. Every invocation of a waiting function goes in the following order:
- Find the earliest timer (in the binary tree). If this timer is already late, remove it from the tree, schedule it, and return. If it's not late, calculate its timeout. If the tree is empty, the timeout is INFINITE.
- Check the counters of instantiated NeedMsg and NeedApc objects. Based on them, select the appropriate waiting function.
- Prepare all the Win32 waitable handles in an array that can be passed to the selected waiting function. If the object count doesn't exceed the MAXIMUM_WAIT_OBJECTS limit, call the waiting function with all the objects. Otherwise, call it several times (each time with different handles in the array) with timeout=0, and if the waiting function returns on timeout, call Sleep(0).
- Now, analyze the return value of the waiting function. If it returned because of a handle or a timeout, schedule the appropriate Schedule. If it returned because of a Win32 message, run the PeekMessage/DispatchMessage loop.
Optionally, ThreadCtx may be modified to perform additional steps for APC completion and Windows messages. For instance, we may want a global 'pre-translator' for window messages (or a per-thread message handler).
Mixing with UI
Theoretically, this can be done. That is, the whole application may be single-threaded, including the UI. For this to work correctly, blocking functions such as GetMessage and WaitMessage must not be called directly. In particular, the program's message loop should be rewritten:
// the standard message loop:
MSG msg;
while (GetMessage(&msg, NULL, 0, 0) > 0)
    DispatchMessage(&msg);

// rewritten in terms of ThreadCtx:
class TermSched : public ThreadCtx::ScheduleMsg
{
public:
    virtual void TranslateMsg(MSG& msg)
    {
        if (WM_QUIT == msg.message)
            _Signaled = true;
    }
};

TermSched sched;
sched.Wait();
LRESULT CALLBACK MainWndProc(UINT uMsg, /* ... */)
{
    switch (uMsg)
    {
    case WM_DESTROY:
        globalTermSched._Signaled = true;
        return 0;
    }
    // ...
}

ThreadCtx::NeedMsg careMsg;
ThreadCtx::Schedule globalTermSched;
globalTermSched.Wait();
Note: in the last case, we've used the base Schedule class, not one of its derivatives. Such a schedule does not have an intrinsic (built-in) waitable condition; it may only be signaled explicitly. Its main use is to execute "message loops" which may be interrupted explicitly.
So, theoretically, everything, including the UI, may be done in a single thread. However, in big projects with heavy UI, it may be a challenge to guarantee that all the blocking functions are executed in terms of ThreadCtx and never directly. Big UI projects are usually written using some runtime (such as MFC) which conceals the message loop completely. Theoretically, you may override the message loop (in MFC, it's possible to override CWinThread::Run and CWinThread::PumpMessage), but it's still risky; G*d knows how many such places there are. And it's very easy to break things: one call to MessageBox is enough to shut down the whole mechanism until the user closes the message box.
Another limitation is that it becomes impossible to use standard UI components that run their own message loop. This concerns all the standard Windows dialogs, such as GetOpenFileName, ChooseFont, and ChooseColor, various ActiveX controls, and "console" functions such as scanf and getch (a.k.a. easy-win).
Theoretically, there are ways to "intercept" those function calls (such as J. Richter's method of replacing import section addresses in an executable module) and do what's needed in a ThreadCtx-aware way. But all those methods are tricky and not guaranteed to always work.
In addition, projects with "rich" UI tend to be heavy; that is, their UI can, in some circumstances, consume significant CPU time. So:
- No problem to mix a "simple" (and well-controlled) UI with other stuff.
- May be complex for "rich" UI, especially when used with some runtime. Better not to mix.
Writing complex objects
So far, we've used schedules as-is, or inherited from them to override OnSchedule and perform in-place processing. But all our schedules are primitives: each processes only one specific event. What if we need to process various events of different kinds in a dependent way? There are some patterns for doing so:
#define NESTED_OBJ(outer_class, this_var, outer_name) \
    outer_class& Get##outer_name() { return * (outer_class*) \
        (((char*) this) - offsetof(outer_class, this_var)); } \
    __declspec(property(get=Get##outer_name)) outer_class& outer_name;

#define NESTED_SCHEDULE(outer_class, this_var, sched_class, outer_func) \
    class SchedNested_##this_var : public sched_class { \
        NESTED_OBJ(outer_class, this_var, _parent) \
        virtual void OnSchedule() { \
            _parent.outer_func(); \
        } \
    } this_var;
class ComplexWorker
{
    NESTED_SCHEDULE(ComplexWorker, m_schedMyEvent1, ThreadCtx::ScheduleHandle, ProcessEvt1)
    NESTED_SCHEDULE(ComplexWorker, m_schedMyEvent2, ThreadCtx::ScheduleHandle, ProcessEvt2)
    NESTED_SCHEDULE(ComplexWorker, m_schedMyTimer1, ThreadCtx::ScheduleTimer, ProcessTimer1)
    NESTED_SCHEDULE(ComplexWorker, m_schedMyTimer2, ThreadCtx::ScheduleTimer, ProcessTimer2)
public:
    void ProcessEvt1()
    {
        // example: an event on m_schedMyEvent1 re-arranges the timers
        m_schedMyTimer1.KillTimer();
        m_schedMyTimer2._Timeout = 1000;
    }
    // ...
};
This may look complicated; however, there's nothing complex about it. The magic NESTED_SCHEDULE macro declares a class derived from the schedule type you pass as the third parameter, overrides OnSchedule, and makes it invoke the outer object's function passed as the fourth parameter. The other bit of magic is how the declared class obtains the pointer to the outer object. This is done by the NESTED_OBJ macro, which gives the inner object a pointer to the outer object, based on member offsets known at compile time.
Additional notes
First, it's worth saying that there are no inherent limitations on re-entrance. That is, you may call Wait inside an OnSchedule that was executed in the context of another Wait (which might itself be executing inside another OnSchedule, and so on). But you should be careful with re-entrance, since the program flow may become complicated. Also, take into account that if one Wait executes inside another Wait, and the Schedule of the outer Wait is Signaled before the inner one, the inner Wait will stop waiting and return false. So, re-entrance is a complex matter, and if used, should be controlled very carefully.
The correct pattern is a state machine. That is, most objects should not call Wait at all; they should do whatever is needed in OnSchedule and return immediately. One object per thread can be implemented in an "algorithmic" way, but no more than that.
It's also worth noting that the whole mechanism is exception-proof. That is, when an exception is raised, it passes freely through all the Wait/OnSchedule scopes and propagates directly to the catching block. And if there are schedules declared on the stack along the way, they're destroyed (like all C++ objects) and removed from the scheduling mechanism (in their destructors). So there's no problem.
Schedules may be allocated on either the stack or the heap; it doesn't matter. Of course, using stack memory is much faster, and since schedules are very lightweight objects, it's better to allocate them on the stack.
The application may be multi-threaded, and every thread may use the ThreadCtx library (it maintains per-thread state). However, all operations on a particular schedule must be done in the same thread! Working with the same Schedule object from several threads is strictly forbidden.
In order to use ThreadCtx in DLLs, it must be modified. It uses TLS, hence it requires a global TLS index. Currently, this is achieved by decorating the per-thread pointer to the ThreadCtx with the __declspec(thread) semantics, but this won't work correctly with DLLs. Instead, the TLS access must be explicit: a global TLS index must be explicitly allocated (TlsAlloc) and shared with the DLL.
Conclusions
Well, this whole article may seem complicated on first read, especially the implementation part. But in fact, it's not that complex. Just look at the implementation code: it's only about 350 lines!
One more thing: if you really intend to look at the code level, I recommend reading the following articles:
This, by the way, is a good demonstration of the flexibility and light-weight nature of the above container classes.
Alternatively, you may skip the implementation details and just try to use it. I hope the examples provided are enough to understand the principle of working with ThreadCtx.
As I've said, I've written many applications and Win32 services using this design, and it has proved to be really handy. I'll post some demonstrations ASAP.
Criticism is welcome.