Preemptive and priority scheduling create numerous critical regions that applications must protect with mutexes, which have nothing to do with implementing a product specification and all too often lead to errors that are difficult to reproduce and debug. This article presents a Thread class that supports cooperative and proportional scheduling, which can dramatically reduce the number of mutexes required in applications.
Introduction
Race conditions are an unbecoming source of complexity in many systems. Developers must constantly anticipate them and protect against them—and when they sneak into released software, try to reproduce them and hunt them down. The emergent cesspool of locks, semaphores, and their variants is sheer artificial complexity that has nothing to do with the specifications that the software seeks to implement. And in many cases, most of this complexity is actually avoidable.
The techniques described in this article apply when designing a new system, but even a legacy system may be able to adopt them with relative ease. Nevertheless, you may find what follows to be rather iconoclastic, so be prepared to think outside the box. But the approaches outlined in this article have been successfully used in large servers, including the core network servers that handle the calls in AT&T's wireless network. Those servers run on a proprietary operating system, but this article describes how to implement the same concepts on commercially available operating systems such as Windows and Linux. If the concepts are a fit for your system, you will improve its quality and your productivity by averting bugs that occur too easily and that can be difficult to pinpoint.
Many of the concepts in this article were covered in my book Robust Communications Software, which includes this compelling statement from Dennis DeBruler:
As a system integrator of code being developed by several programmers, it seems that my job consists of finding critical region bugs. Most critical region bugs can be found only by subjecting the system to load testing. Because of this need to do load testing, we have the additional challenge of finding and fixing the most difficult bugs late in the development cycle. Also, testing becomes a matter of peeling the onion as you find one unguarded critical region after another. Each one smaller and harder to find. Each one closer to the deadline—or further behind the deadline. Thus we have the joy that the worse the bug, the more layers of management are asking, "Have you found the bug yet?" The really small critical region bugs make it out to the field and become the reason for the "once every month or so in one of the hundred customer sites" bug. These are very hard to even recreate in the lab, let alone find.
Unless dealing with these kinds of challenges fills you with delight, read on.
Background
It is assumed that the reader is familiar with the issue of thread safety and how to achieve it with critical region protection mechanisms such as locks and semaphores.
This article's title is a little cheeky. Two essential contributions by the late Edsger Dijkstra were Go To Statement Considered Harmful, whose thesis is now accepted virtually without question, and Over de sequentialiteit van procesbeschrijvingen, which introduced the semaphore operations P (wait) and V (signal) to protect critical regions.
Dijkstra argued that goto should be eliminated from high-level programming languages but allowed for it in machine code. So although goto is indispensable, only compilers and people programming in assembler should use it. This article makes a similar argument about semaphores: although indispensable, their use in application code should be restricted, usually by confining them to the internal workings of a base Thread class. Just as the need for goto in application software points to a flawed programming language, the need for semaphores throughout application software points to a flawed threading/scheduling framework.
How We Got In…
It is deplorable that preemptive and priority scheduling have become the de facto standards in virtually all of today's operating systems. Preemptive scheduling arose in timesharing systems so that one CPU could serve multiple users. Some users were deemed more important than others, so they received higher priority. Priorities later proved useful in hard real-time systems, where not doing something on time can have severe consequences.
But in many systems, preemptive and priority scheduling are inappropriate. Your system is a unified whole, not the disparate software of users competing for time, so why should your threads be scheduled in and out at random? And presumably your system doesn't include software that enjoys being starved by higher priority work. Such software couldn't be very important, so why would it be there?
Both scheduling disciplines—preemptive and priority scheduling—force applications to protect critical regions at a granular level. Within the same priority, context switches occur haphazardly, and they also occur as soon as a thread of higher priority is ready to run.
It probably shouldn't be too surprising that preemptive and priority scheduling, which work well when scheduling unrelated processes, are often inappropriate when scheduling a process' threads, which are working toward a common goal.
…And How to Get Out
To reduce the number of critical regions, you need to control context switching yourself.1 In many cases, a thread has a limited amount of work to do when it is ready to run, so let it run that work to completion. That is, let it run unpreemptably. When it has finished its work, it sleeps, and the next thread runs. Or if it has many things to do, it does some of them and then yields, returning to the end of the ready queue to give other threads a chance to run. This is known as cooperative scheduling, and it significantly reduces the number of critical regions by completing one logical unit of work before starting the next one. If each application thread returns to its work loop before sleeping, yielding, or handling the next work item, there can be no race conditions between them.
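To make this concrete, here is a minimal, self-contained sketch of such a work loop. It is not RSC code: CooperativeWorker, Enqueue, and Run are hypothetical names, and std::this_thread::yield merely stands in for the cooperative yield that a real Thread class provides.

#include <deque>
#include <functional>
#include <thread>
#include <utility>

//  A minimal sketch of a cooperative work loop (not RSC code). Each
//  iteration runs one logical unit of work to completion before the
//  thread yields, so no other thread ever sees its data mid-update.
//
class CooperativeWorker
{
public:
   void Enqueue(std::function<void()> job)
   {
      work_.push_back(std::move(job));
   }

   void Run()
   {
      while(!work_.empty())
      {
         auto job = std::move(work_.front());
         work_.pop_front();
         job();                      // one logical unit of work
         std::this_thread::yield();  // return to the end of the ready queue
      }

      //  With a real Thread class, the loop would now sleep until
      //  more work arrived, instead of returning.
   }
private:
   std::deque<std::function<void()>> work_;
};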
Naturally, if it were that simple, everyone would be doing it. But it requires a suitable base Thread class. And it does give rise to some issues. So it's time to dive into a bit of code.
Using the Code
The code in this article is taken from the Robust Services Core (RSC). If this is the first time that you're reading an article about an aspect of RSC, please take a few minutes to read this preface.
Walkthroughs
We will now look at how RSC's Thread class controls scheduling, which is one of its two main responsibilities. The other is to recover from exceptions that would otherwise cause an abort, as described in Robust C++: Safety Net.
Creating a Thread
How to create a thread is described here. Each Thread has a SysThread member, which encapsulates a native thread as part of RSC's operating system abstraction layer. A Thread subclass adds its own data, so there is no need for thread_local in RSC.
Entering a Thread
Two design principles eliminate race conditions, and hence the need for semaphores, in application software:
- All application threads run unpreemptably (also referred to as running locked). Before a thread starts to run, it must become the active thread. This allows it to run until it finishes one or more logical units of work. At that point, the next thread gets its turn, and so on. The active thread is tracked by the variable ActiveThread_, of type std::atomic<Thread*>, in Thread.cpp.
- All application threads run at the same priority. Strictly speaking, this isn't necessary when each of them must become the active thread in order to run. But because it's a consequence of the design, we'll be honest about what's going on and give each unpreemptable application thread the same priority.
Each thread, as soon as it is created and entered, must become the active thread before proceeding:
main_t Thread::EnterThread(void* arg)
{
   auto self = static_cast<Thread*>(arg);
   self->Ready();
   self->Resume();
   RegisterForSignals();
   return self->Start();
}
void Thread::Ready()
{
   priv_->readyTime_ = SteadyTime::Now();
   priv_->waiting_ = true;

   if(ActiveThread() == nullptr)
   {
      Singleton<InitThread>::Instance()->Interrupt(InitThread::Schedule);
   }

   systhrd_->Wait();
   priv_->waiting_ = false;
   priv_->currStart_ = SteadyTime::Now();
   priv_->locked_ = (priv_->unpreempts_ > 0);
}
void Thread::Resume()
{
   auto time = InitialTime() << ThreadAdmin::WarpFactor();
   priv_->currEnd_ = priv_->currStart_ + time;
}
Note that Resume records a time by which the thread must yield or otherwise cause itself to be scheduled out. A warp factor is applied to this time if, for example, the thread's function calls are being recorded, which will cause it to run much slower. We will return to this point later.
A Thread Yields
A thread invokes Pause to schedule itself out after it has completed one or more logical units of work:
DelayRc Thread::Pause(msecs_t time)
{
   auto drc = DelayCompleted;
   auto thr = RunningThread();

   EnterBlockingOperation(BlockedOnClock);
   {
      if(time != TIMEOUT_IMMED) drc = thr->systhrd_->Delay(time);
   }
   ExitBlockingOperation();

   return drc;
}
If time is TIMEOUT_NEVER, the thread sleeps until another thread invokes Thread::Interrupt to wake it up. If time is finite, the thread (unless interrupted) sleeps until its specified timeout occurs, after which it must again become the active thread before it can run.
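As an illustration, the entry function of a consumer thread might look like the following sketch. MsgThread, queue_, Dequeue, and Process are hypothetical names; only Start, Pause, and TIMEOUT_NEVER come from the Thread interface discussed here:

main_t MsgThread::Start()
{
   while(true)
   {
      //  Drain the work queue, then sleep until a producer invokes
      //  Interrupt() on this thread after enqueuing more work.
      //
      while(auto msg = queue_.Dequeue())
      {
         Process(*msg);
      }

      Pause(TIMEOUT_NEVER);
   }
}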
Pause invokes EnterBlockingOperation and ExitBlockingOperation because a thread must allow another thread to become active whenever it stops running, and become the active thread again before it resumes execution. BlockedOnClock is defined in the following enum:
enum BlockingReason
{
   NotBlocked,
   BlockedOnClock,
   BlockedOnNetwork,
   BlockedOnStream,
   BlockedOnDatabase,
   BlockingReason_N
};
A Thread Blocks
To perform a blocking operation, a thread must surround it with calls to EnterBlockingOperation and ExitBlockingOperation, the same as in Pause. Here is a fragment from TcpIoThread, which services TCP sockets:
EnterBlockingOperation(BlockedOnNetwork);
{
   ready = SysTcpSocket::Poll(sockets, size, msecs_t(2000));
}
ExitBlockingOperation();
The code between the two calls is placed in an inner block as a coding convention. Let's see what happens:
void Thread::EnterBlockingOperation(BlockingReason why)
{
   auto thr = RunningThread();
   thr->blocked_ = why;
   thr->Suspend();
}

void Thread::Suspend()
{
   priv_->currEnd_ = SteadyTime::Now();
   Schedule();
}

void Thread::Schedule()
{
   auto active = this;

   if(!ActiveThread_.compare_exchange_strong(active, nullptr))
   {
      return;
   }

   Singleton<InitThread>::Instance()->Interrupt(InitThread::Schedule);
}
At this point, when the thread performs the blocking operation, it is no longer the active thread. When the operation is completed, the thread must again become the active thread before resuming, the same as in EnterThread:
void Thread::ExitBlockingOperation()
{
   auto thr = RunningThread();
   thr->priv_->currStart_ = SteadyTime::Now();
   thr->blocked_ = NotBlocked;
   thr->Ready();
   thr->Resume();
}
A Thread Has a Big Job to Do
Maybe you noticed that Ready checked a counter named unpreempts_, which implies that a thread can also run preemptably. What's that all about, given that preemption reopens the door to race conditions?! Although RSC makes each thread unpreemptable when it is created, a thread must run preemptably when it has a very time-consuming job to do, such as reading or writing a large file. The definition of "time-consuming" depends on your system's latency requirements. In other words, something is time-consuming if it could prevent other threads from running for whatever your specifications say is an unacceptable length of time. When faced with such a situation, a thread makes itself preemptable:
void Thread::MakePreemptable()
{
   auto thr = RunningThread(std::nothrow);
   if(thr->priv_->unpreempts_ == 0) return;
   if(--thr->priv_->unpreempts_ == 0) Pause();
}
When a thread becomes preemptable, Pause is invoked to schedule it out. (The default time argument for Pause is TIMEOUT_IMMED, in which case the thread simply goes to the end of the ready queue.)
After it has finished its large job, the thread should resume running unpreemptably:
void Thread::MakeUnpreemptable()
{
   auto thr = RunningThread(std::nothrow);

   if(thr->priv_->unpreempts_ >= MaxUnpreemptCount)
   {
      Debug::SwLog(Thread_MakeUnpreemptable, "overflow", thr->Tid());
      return;
   }

   if(++thr->priv_->unpreempts_ == 1) Pause();
}
Note that calls to MakeUnpreemptable and MakePreemptable can be nested, so that a thread only becomes preemptable when all the functions on its stack agree that this is acceptable.
Instead of invoking these functions directly, most threads use the class FunctionGuard. For example, FileThread is responsible for receiving an ostringstream and writing it to a file. Just before it opens the file, it does this:
FunctionGuard guard(Guard_MakePreemptable);
When guard is constructed, it invokes MakePreemptable. It's a stack variable, and its destructor invokes MakeUnpreemptable when it goes out of scope. This ensures that the call to MakeUnpreemptable will always occur, even if an exception gets thrown while performing the big job.
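The RAII idea can be sketched in a few lines, assuming the static MakePreemptable and MakeUnpreemptable functions shown above. (RSC's actual FunctionGuard is more general than this; the sketch only illustrates the principle.)

//  A minimal sketch of the RAII idea behind FunctionGuard. The
//  destructor restores unpreemptability even if an exception
//  unwinds the stack.
//
class PreemptableGuard
{
public:
   PreemptableGuard() { Thread::MakePreemptable(); }
   ~PreemptableGuard() { Thread::MakeUnpreemptable(); }

   //  Copying the guard would cause extra invocations, so forbid it.
   //
   PreemptableGuard(const PreemptableGuard&) = delete;
   PreemptableGuard& operator=(const PreemptableGuard&) = delete;
};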
A Thread Doesn't Want to Run Too Long
Recall that Resume recorded a time before which a thread should yield. If a thread has many small jobs to do, it may want to handle as many as it can before its deadline. Thus the following:
void Thread::PauseOver(word limit)
{
   if(RtcPercentUsed() >= limit) Pause();
}

word Thread::RtcPercentUsed()
{
   auto thr = RunningThread();
   if(!thr->IsLocked()) return 0;

   nsecs_t used = SteadyTime::Now() - thr->priv_->currStart_;
   nsecs_t full = thr->priv_->currEnd_ - thr->priv_->currStart_;
   if(used < full) return ((100 * used) / full);
   return 100;
}
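A work loop that drains a queue of small jobs might therefore look like this hypothetical fragment, in which queue_, Dequeue, and Process are assumed names:

while(auto item = queue_.Dequeue())
{
   Process(*item);

   //  Schedule out, and later rejoin the ready queue, once 90% of
   //  the allotted time has been used.
   //
   PauseOver(90);
}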
A Thread Needs More Time
When an exception occurs, RSC captures the thread's stack. This is a time-consuming operation, so Exception's constructor gives the thread more time:
void Thread::ExtendTime(const msecs_t& time)
{
   auto thr = RunningThread(std::nothrow);
   if(thr == nullptr) return;
   thr->priv_->currEnd_ += time;
}
A Thread Runs Too Long
The danger with cooperative scheduling is a lack of cooperation! A thread might run for longer than is acceptable, or it might even get into an infinite loop. This is unlikely to matter during testing, but it's a serious problem in something like a live server. Consequently, there must be a way to kill a thread that doesn't yield before its deadline.
RSC's InitThread kills any thread that has run too long. It is created soon after main is entered, and its first job is to initialize the system.
Once the system is up and running, InitThread ensures that each unpreemptable thread yields before it reaches its deadline. To do this, InitThread runs preemptably, without having to become the active thread. By default, the deadline imposed on an unpreemptable thread is 10 ms, which is far longer than the interval at which most operating systems, at their usual manic pace, perform context switches. This means that InitThread will soon get scheduled in and be able to kill a thread that runs too long.
When InitThread detects that a thread has run locked too long, it invokes Thread::RtcTimeout to send the signal SIGYIELD to that thread.2 This results in an exception being thrown on the target thread. Instead of forcing the thread to exit, RSC allows it to clean up the work that it was performing. RSC then reinvokes the thread's entry function so that it can continue to service requests.
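The mechanics are covered in the Safety Net article, but the idiom amounts to something like the following sketch, in which SignalException and Recover are stand-ins rather than RSC's exact names:

//  Sketch of the recovery idiom in the function that wraps a
//  thread's entry function.
//
while(true)
{
   try
   {
      return Start();          // run the thread's work loop
   }
   catch(SignalException& ex)  // thrown when SIGYIELD is raised
   {
      Recover();               // clean up the interrupted work
   }                           // loop around to reenter Start()
}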
As we saw, the functions Ready and Schedule both interrupt InitThread to schedule the next thread. After scheduling the next thread, InitThread sleeps until the time by which that thread should have scheduled itself out. If the thread cooperates, InitThread is again interrupted to schedule the next thread. But if the thread runs too long, InitThread times out and sends it the SIGYIELD signal:
void InitThread::HandleTimeout()
{
   auto thr = LockedThread();

   if(thr == nullptr)
   {
      Thread::SwitchContext();
      return;
   }

   if(thr->priv_->waiting_)
   {
      thr->Proceed();
      return;
   }

   if((thr->TimeLeft() == ZERO_SECS) && !ThreadAdmin::BreakEnabled())
   {
      thr->RtcTimeout();
   }
}
Scheduling a Thread
When InitThread gets interrupted to schedule the next thread, it invokes the following:
Thread* Thread::SwitchContext()
{
   auto curr = ActiveThread();

   if((curr != nullptr) && curr->IsLocked())
   {
      if(curr->priv_->waiting_)
      {
         curr->Proceed();
      }
      return curr;
   }

   auto next = Singleton<ThreadRegistry>::Instance()->Select();

   if(next != nullptr)
   {
      if(next == curr)
      {
         return curr;
      }

      if(!ActiveThread_.compare_exchange_strong(curr, next))
      {
         return curr;
      }

      if(curr != nullptr) curr->Preempt();
      next->Proceed();
      return next;
   }

   return curr;
}

void Thread::Proceed()
{
   systhrd_->SetPriority(SysThread::DefaultPriority);
   if(priv_->waiting_) systhrd_->Proceed();
}
The thread that will run next is currently selected with a simple round-robin strategy:
Thread* ThreadRegistry::Select()
{
   auto t = threads_.find(NextSysThreadId_);
   if(t == threads_.end()) t = threads_.begin();
   Thread* next = nullptr;

   for(NO_OP; t != threads_.end(); ++t)
   {
      auto thread = t->second.thread_;
      if(thread == nullptr) continue;

      if(thread->CanBeScheduled())
      {
         next = thread;
         break;
      }
   }

   if(next == nullptr)
   {
      for(t = threads_.begin(); t != threads_.end(); ++t)
      {
         auto thread = t->second.thread_;
         if(thread == nullptr) continue;

         if(thread->CanBeScheduled())
         {
            next = thread;
            break;
         }
      }
   }

   if(next != nullptr)
   {
      auto entry = threads_.find(next->NativeThreadId());

      while(true)
      {
         ++entry;

         if(entry == threads_.end())
         {
            NextSysThreadId_ = 0;
            break;
         }

         auto thread = entry->second.thread_;
         if(thread == nullptr) continue;
         NextSysThreadId_ = thread->NativeThreadId();
         break;
      }
   }

   return next;
}
Preempting a Thread
By default, an application thread runs unpreemptably. But if it cannot finish a logical unit of work before being signaled for running too long, it must run preemptably. During this time, unpreemptable threads take turns running. The platform's scheduler, however, will still give time to the preemptable thread. This steals time from the unpreemptable thread, which could cause it to miss its deadline. Therefore, when an unpreemptable thread is selected and a preemptable thread is running, we lower the priority of the preemptable thread so that it won't contend with the unpreemptable thread:
void Thread::Preempt()
{
   priv_->readyTime_ = SteadyTime::Now();
   systhrd_->SetPriority(SysThread::LowPriority);
}
When RSC was ported to Linux (using WSL2 and Ubuntu), priority scheduling was not available. In that case, an unpreemptable thread receives a longer timeslice if a preemptable thread is also running.
A Thread Has a Hard Deadline
As previously mentioned, a "hard" real-time system is one in which not doing certain things on time can have severe consequences. These systems use priority scheduling to ensure that those things will happen before their deadlines. Does this mean that they cannot use what is described in this article?
They can, but not fully. Any software that faces a hard deadline should run at a higher priority. Such software usually runs briefly, at regular intervals, to interface with external hardware. Maybe it reads sensors or controls servos, for example. It might also update memory with the results of its calculations or its new settings. If this causes a critical region problem, something in <atomic> can probably address it.
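For example, the real-time software could publish a sensor reading through an atomic variable that threads running at standard priority read without any mutex. This is a generic sketch, not RSC code; the names are illustrative:

#include <atomic>
#include <cstdint>

//  A reading published by a high-priority thread and consumed by
//  cooperative threads. No mutex is required.
//
std::atomic<int32_t> LatestReading(0);

void PublishReading(int32_t value)  // invoked by the real-time thread
{
   LatestReading.store(value, std::memory_order_release);
}

int32_t ReadReading()  // invoked by cooperative threads
{
   return LatestReading.load(std::memory_order_acquire);
}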
What remains is the software that doesn't face hard deadlines. There might not be much of it, or it might be most of the system. It's this software that is a good candidate for cooperative scheduling. Even then, some of it would run preemptably while, as previously discussed, it was performing a big job.
Points of Interest
Balancing the Workload
If a thread can only run for so long before yielding, and if it then goes to the end of the ready queue, what do you do if it becomes a bottleneck?
One solution is to create multiple instances of the thread: a thread pool. This is also a common solution when a thread frequently blocks. Providing multiple instances of the thread then reduces the turnaround time for its clients.
Depending on what a system does, thread pools can get complicated. The size of each pool must be engineered to keep the system in balance. And if user behavior can differ from one site to the next, the situation starts to get unwieldy.
Perhaps the most flexible solution is proportional scheduling. Here, each thread belongs to a faction, which is defined based on the type of work that its threads perform. Each faction is guaranteed a minimum percentage of the CPU time. When the system is heavily loaded, some factions may receive much more time than others, but each will receive at least some time. A server, for example, should dedicate most of its time to handling client requests. However, it should also ensure that time is reserved for logging problems, responding to administrative requests, and auditing object pools. None of these things is absolutely more important than the others to the point where it should run at a higher priority. Yet there have been cases of servers assigning a higher priority to client requests. So when they were heavily loaded, they failed to respond to administrative commands and were therefore rebooted in the tragic belief that they were hung.
Under proportional scheduling, a thread pool is still needed if a thread frequently blocks. But multiple instances of the thread are not required for its work to be allotted enough CPU time.
RSC will support proportional scheduling by changing ThreadRegistry::Select. If priority scheduling is available, the overall scheme will continue to be

- Highest priority: watchdog threads, such as InitThread (priority scheduling)
- Higher priority: hard real-time threads (priority scheduling)
- Standard priority: unpreemptable threads (cooperative/proportional scheduling)
- Lower priority: preemptable threads (when a preemptable thread is selected, it runs at standard priority)
If priority scheduling is not available, the watchdog threads run at standard priority, the same as application threads. There can be no "higher priority" hard real-time threads, and an unpreemptable thread gets extra time to do its work if a preemptable thread is also running.
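As a sketch of what proportional selection could look like when factions are introduced (this is not RSC's implementation; the names and numbers are illustrative), Select could first pick the faction that has received the smallest portion of its guaranteed share, and then look for a ready thread in that faction:

#include <array>
#include <cstddef>
#include <limits>

//  A sketch of proportional selection (illustrative, not RSC code).
//  Each faction is guaranteed a fraction of the CPU time, and the
//  faction whose usage lags its entitlement the most runs next.
//
constexpr std::size_t FactionCount = 3;

struct FactionStats
{
   double share;  // guaranteed fraction of CPU time (shares sum to 1.0)
   double used;   // CPU time consumed so far, in seconds
};

std::size_t SelectFaction
   (const std::array<FactionStats, FactionCount>& factions, double elapsed)
{
   std::size_t next = 0;
   auto lowest = std::numeric_limits<double>::max();

   for(std::size_t i = 0; i < FactionCount; ++i)
   {
      //  A ratio below 1.0 means that the faction has received less
      //  than its guaranteed share of the elapsed CPU time.
      //
      auto ratio = factions[i].used / (factions[i].share * elapsed);

      if(ratio < lowest)
      {
         lowest = ratio;
         next = i;
      }
   }

   return next;
}

If the chosen faction had no thread that could be scheduled, Select would simply fall back to the faction with the next-lowest ratio.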
As mentioned at the outset of this article, there can be no race conditions if each thread returns to its work loop before the next thread runs. This holds true among all the threads that run unpreemptably, so it's the other threads that can cause problems. Fortunately, their interaction with standard priority threads tends to be rather limited, so the net effect is to dramatically reduce the number of critical regions when compared to haphazard preemptive scheduling. This can be seen with RSC's >mutexes CLI command, which lists all of the system's mutexes. RSC's Mutex class records how often it caused blocking, and most mutexes rarely do. They are only needed in case a preemptable thread does something that contends with another thread.
Symmetric Multiprocessing
Symmetric multiprocessing (SMP) refers to a platform in which multiple CPUs share a common memory. This allows multiple instances of an executable to run in parallel, usually improving throughput.
For a legacy system that needs to improve its capacity, SMP is an appealing solution. However, it may create additional race conditions that must be found and protected against. This is because, when running on a single CPU, a thread usually interacts only with different threads, where there can be a reasonable separation of concerns that reduces the number of race conditions. But when a thread runs on an SMP platform, it may also have to interact with other instances of itself running on other cores, which can introduce previously unseen race conditions.
When designing a new system, it should be unsurprising that I consider using an SMP platform in the usual way to be an abomination because it reintroduces the need for widespread thread safety crapola. The throughput of an SMP system can also quickly reach an asymptote because of semaphore contention or cache collisions—or even fall off a cliff as the result of a deadlock.
I also look at it this way: the only good numbers are 0, 1, and ∞. So if I need more than one core, will 2n always be enough? The general solution is to design software that runs in a distributed (networked) manner. This adds complexity of its own, but it's the truly scalable solution and is unavoidable if you need an arbitrary amount of horsepower. The resulting software can also run on an SMP platform by dividing the shared memory so that each core has its own sandbox, with any shared memory perhaps used only for efficient interprocessor messaging. And that efficiency is best saved for a subsequent software release, to recover lost performance after customers complain that the software has slowed down—the result of accommodating their feature requests, of course.
Acknowledgments
I want to thank honey the codewitch, whose post "if you hate writing and trying to test multithreaded code raise your hand" prompted me to write this article. A little of it was lifted almost verbatim from some of what I wrote during our discussion.
Notes
1 The techniques described in this article are implemented on top of a standard preemptive/priority scheduler by causing it to context switch at more appropriate times, not by replacing it.
2 See Robust C++: Safety Net for a discussion of signals, exceptions, and how RSC handles them. SIGYIELD is delivered to the targeted thread as described in Signaling Another Thread.
History
- 15th December, 2019: Updated to reflect software changes since the article was originally published
- 25th September, 2019: Added more about hard real-time systems, legacy systems, and proportional scheduling
- 23rd September, 2019: Initial version