Introduction
In this article, I will present a method for detecting time-constraint violations within a complex system embedded in a process.
Background
In complex systems, a thread will occasionally deadlock or fall into an infinite loop. This is highly undesirable, yet much harder to spot than a simple access violation or anything else that raises an exception, because from the point of view of the OS and the processor the system is working correctly. In a very large system, such errors are extremely hard to reproduce and fix. Debugging them in the production environment is rarely possible because of physical distance to the customer, release builds that strip all debugging information, and numerous other reasons.
The system I present detects such situations by defining a time constraint on a code region and using a watcher thread to detect its violation. It does not introduce any symbolic information accessible to debuggers, nor does it degrade performance significantly.
Basic concept
The basic concept behind this code is the idea of code regions. Unlike functions, they can be defined at any scope, ranging from a single line of code to a span covering several function calls. These regions are linked to form a stack, which is useful when retrieving information about the violation point.
Example
void function()
{
    REGION("My region", 1000)
    { }
    ENDREGION();
}
The main effort in implementing this system is to keep the region entry/leave code as lightweight as possible, shifting the load to the watcher thread, which runs relatively rarely. That means that neither dynamic allocation nor long conditional sequences are allowed in the region-guarding code.
The other part of the idea is a watcher that monitors the threads and detects termination events and time-constraint violations. The watcher must hold a handle to every watched thread, so that no handles are created or closed during its normal operation. When a constraint is violated, the watcher invokes user-defined callbacks, passing them the thread information, the thread's current (top-most) region, and the offending region.
Keeping the stack frames linked
In order to keep a stack-like structure linked across function calls, some global data is necessary. Global data is not a good idea when threads are concerned, and that is where Thread Local Storage (TLS) comes in. TLS is a set of data that resides at a specified memory location or under an index in some array, but every thread sees its own copy of the data. Moreover, when TLS is allocated, it appears in all existing threads and in all threads created later.
Since we have a perfect place to store our stack frame pointer, we can define structures.
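To illustrate the per-thread frame pointer, here is a minimal portable sketch. It assumes the C++11 `thread_local` keyword in place of the Win32 TLS API, and the names `Frame`, `g_top`, `depth`, and `depthInOtherThread` are hypothetical, not part of the article's code.

```cpp
#include <cassert>
#include <thread>

// Hypothetical frame type: each frame links to the one below it.
struct Frame {
    const char *name;
    Frame *prev;
    explicit Frame(const char *n);
    ~Frame();
    Frame(const Frame &) = delete;
    Frame &operator=(const Frame &) = delete;
};

// Top of the frame stack; every thread sees its own copy of this pointer.
thread_local Frame *g_top = nullptr;

// Entering a scope pushes a frame on the calling thread's stack.
inline Frame::Frame(const char *n) : name(n), prev(g_top) { g_top = this; }

// Leaving the scope pops it again; frames must be destroyed in LIFO order.
inline Frame::~Frame() {
    assert(g_top == this);
    g_top = prev;
}

// Counts the frames currently linked on the calling thread.
inline int depth() {
    int n = 0;
    for (const Frame *f = g_top; f; f = f->prev) ++n;
    return n;
}

// Holds two frames on the calling thread and reports the depth
// observed by a freshly started thread that pushes one frame of its own.
inline int depthInOtherThread() {
    Frame outer("outer");
    Frame inner("inner");
    int other = -1;
    std::thread worker([&other] {
        Frame f("worker");
        other = depth();   // only the worker's own frame is visible here
    });
    worker.join();
    return other;
}
```

The worker thread never sees the caller's two frames, which is exactly the property the stack-frame pointer needs.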
Structures
Since the solution is quite low-level, it is built around a few plain data structures. The structures needed for this concept to function include:
- thread information
- region description
- region-stack frame
The thread information block contains the thread name, the current top-most stack frame pointer, the thread ID, and the thread handle.
class ThreadInfo
{
public:
    inline ThreadInfo();
    inline ~ThreadInfo();
    class Lock
    {
    public:
        inline Lock(ThreadInfo *ti);
        inline ~Lock();
        volatile LONG *pLockCount;
    };
    ThreadCodeRegion *volatile Region;
    volatile LONG LockCount;
    int ThreadId;
    HANDLE ThreadHandle;
    std::string ThreadName;
};
The region description contains the name and the timeout. Some additional data might be put there.
class CodeRegion
{
public:
    inline CodeRegion(const char *name, int timeout);
    const char *Name;
    int Timeout;
};
The region stack frame contains information about the region, the time it was entered by the thread, and a link to a lower frame.
class ThreadCodeRegion
{
public:
    inline ThreadCodeRegion(CodeRegion *cr);
    inline ~ThreadCodeRegion();
    CodeRegion *CurrentRegion;
    ThreadCodeRegion *PrevFrame;
    int EntryTime;
    bool Ignore;
};
The watcher
The main object of the system is called TimeCop. It implements a thread that watches all threads registered in the system. It maintains a map from ThreadId to ThreadInfo, which is used to perform the periodic stack-frame searches. It also provides functions to register threads and to retrieve the descriptor block of the calling thread.
Algorithm
The algorithm used is straightforward. The watcher iterates through all thread information blocks, and for each of them, it does the following:
- acquire the spin lock
- go from the top of the stack to the bottom, looking for a frame where the time constraint has been violated
- release the spin lock
If a violating frame has been found, the system suspends the offending thread and invokes user-defined callbacks that can perform program-specific action. After the callbacks have returned, the thread is resumed.
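The per-thread search in the steps above could look roughly like the following sketch. The types `RegionDesc` and `FrameRec` and the function `findViolation` are hypothetical simplifications of the article's structures. Two things are assumptions on my part: frames flagged as ignored are skipped, and the first violating frame found from the top is the one reported. Locking is left out; the caller is expected to hold the thread's lock.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified counterparts of CodeRegion and ThreadCodeRegion.
struct RegionDesc {
    const char *name;
    std::uint32_t timeoutMs;
};

struct FrameRec {
    const RegionDesc *region;
    const FrameRec *prev;      // next lower frame, nullptr at the bottom
    std::uint32_t entryMs;     // tick count when the region was entered
    bool ignore;               // set by a callback to silence this frame
};

// Walks the stack from top to bottom and returns the first frame whose
// time in region exceeds its timeout, or nullptr if there is none.
// On success, timeInRegionMs receives the elapsed time of that frame.
inline const FrameRec *findViolation(const FrameRec *top,
                                     std::uint32_t nowMs,
                                     std::uint32_t &timeInRegionMs) {
    for (const FrameRec *f = top; f; f = f->prev) {
        assert(f->region != nullptr);
        if (f->ignore) continue;
        // Unsigned subtraction stays correct when the tick counter wraps.
        const std::uint32_t elapsed = nowMs - f->entryMs;
        if (elapsed > f->region->timeoutMs) {
            timeInRegionMs = elapsed;
            return f;
        }
    }
    return nullptr;
}
```

Keeping the walk free of allocation and system calls matters here, because the watched thread may be blocked on the lock for as long as the walk takes.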
The watcher class header is as follows:
class TimeCop
{
public:
    TimeCop();
    ~TimeCop();
    static TimeCop *Instance;
    static inline ThreadInfo *GetCurrentThreadInfo();
    ThreadInfo *InitCurrentThreadInfo(const char *ThreadName);
    typedef void (*Callback)(
        void *context,
        ThreadInfo *Thread,
        int TimeInRegion,
        ThreadCodeRegion *TopFrame,
        ThreadCodeRegion *FrameAtFault);
    void AddCallback(Callback cb, void *context);
    void RemoveCallback(Callback cb, void *context);
private:
    static int TlsEntry;
    CRITICAL_SECTION m_MapLock;
    typedef std::map<DWORD, ThreadInfo *> ThreadMap;
    ThreadMap threadmap;
    DWORD StartAtId;
    CRITICAL_SECTION m_CBLock;
    struct Delegate
    {
        Callback cb;
        void *context;
    };
    std::vector<Delegate> callbacks;
    void InvokeCallbacks(
        ThreadInfo *Thread,
        int TimeInRegion,
        ThreadCodeRegion *TopFrameAtFault,
        ThreadCodeRegion *FrameAtFault);
    void Start();
    void Run();
    void Analyze();
    bool Analyze(
        ThreadMap::iterator it,
        int &TimeInRegion,
        ThreadCodeRegion *&FrameAtFault,
        ThreadCodeRegion *&TopFrameAtFault);
    static ULONG WINAPI _run(void *This);
    HANDLE m_hThread;
    HANDLE m_hWait;
    bool ExitRequested;
};
Entering and leaving regions
As I stated in the Basic concept section, the design of the region entry and exit code is crucial to keeping the system efficient. The code is placed in the constructor and the destructor of the ThreadCodeRegion class.
ThreadCodeRegion::ThreadCodeRegion(CodeRegion *cr)
    : CurrentRegion(cr)
{
    ThreadInfo *ti = TimeCop::GetCurrentThreadInfo();
    PrevFrame = ti->Region;
    EntryTime = GetTickCount();
    Ignore = false;
    // Publish the fully initialized frame as the new top of the stack.
    InterlockedExchangePointer((PVOID volatile *)&ti->Region, this);
}
The code above contains very few API calls, and none of them changes the processor mode (no true system calls are issued).
Leaving a region is a bit more problematic, since the watcher may be using the frame. To ensure it is not, a lock is acquired on the ThreadInfo before the object is detached from the stack, after which it can be safely destroyed.
ThreadCodeRegion::~ThreadCodeRegion()
{
    ThreadInfo *ti = TimeCop::GetCurrentThreadInfo();
    // Wait until the watcher is no longer walking this thread's frames.
    ThreadInfo::Lock lock(ti);
    InterlockedExchangePointer((PVOID volatile *)&ti->Region, PrevFrame);
}
This lock is in fact a quasi-spinlock: if the lock cannot be acquired, the thread yields execution instead of spinning. If the lock is acquired on the first attempt, no system call is issued.
class Lock
{
public:
    inline Lock(ThreadInfo *ti) : pLockCount(&ti->LockCount)
    {
        while (InterlockedCompareExchange(pLockCount, 1, 0))
            SwitchToThread();
    }
    inline ~Lock()
    {
        InterlockedDecrement(pLockCount);
    }
    volatile LONG *pLockCount;
};
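For readers outside Windows, the same yield-on-contention idea can be sketched with standard C++11 facilities. This is a hypothetical portable analogue, not the article's code: `YieldingLock` and `countWithLock` are names introduced here, `std::atomic` replaces the Interlocked functions, and `std::this_thread::yield()` replaces `SwitchToThread()`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical portable counterpart of the Lock class above.
class YieldingLock {
public:
    explicit YieldingLock(std::atomic<long> &flag) : flag_(flag) {
        long expected = 0;
        // Try to flip 0 -> 1; on failure give up the time slice and retry.
        while (!flag_.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
            expected = 0;   // a failed exchange overwrote it with the current value
            std::this_thread::yield();
        }
    }
    ~YieldingLock() {
        assert(flag_.load(std::memory_order_relaxed) == 1);  // we hold the lock
        flag_.store(0, std::memory_order_release);
    }
    YieldingLock(const YieldingLock &) = delete;
    YieldingLock &operator=(const YieldingLock &) = delete;

private:
    std::atomic<long> &flag_;
};

// Increments a plain (non-atomic) counter from several threads,
// relying on the lock alone for mutual exclusion.
inline long countWithLock(int threads, int itersPerThread) {
    std::atomic<long> flag{0};
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&flag, &counter, itersPerThread] {
            for (int i = 0; i < itersPerThread; ++i) {
                YieldingLock guard(flag);
                ++counter;
            }
        });
    }
    for (auto &th : pool) th.join();
    return counter;
}
```

In the uncontended case the constructor costs one atomic compare-and-swap, which matches the goal of keeping region exit cheap.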
Using the code
The system can be used by the end developer by means of three macros:
REGION(name, timeout)
ENDREGION()
INIT_THREAD(name)
The first one defines a region of code. It defines a static local variable of the CodeRegion class, and a stack-based local variable of type ThreadCodeRegion that enters the aforementioned CodeRegion.
#define REGION(name, timeout) {\
static CodeRegion __code_region(name, timeout);\
ThreadCodeRegion __region(&__code_region);
The ENDREGION() macro is simply a '}'. The INIT_THREAD macro is a convenient shortcut to the InitCurrentThreadInfo function.
#define INIT_THREAD(name) TimeCop::Instance->InitCurrentThreadInfo(name)
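The split between a static descriptor and a stack-based frame means the descriptor is built once, while a frame is created on every entry. The sketch below demonstrates this with hypothetical stand-ins (`RegionInfo`, `ScopedFrame`, `SKETCH_REGION`, `SKETCH_ENDREGION`, and the counters are all invented for the illustration and do no time tracking).

```cpp
#include <cassert>

// Counters used only to make the construction pattern observable.
int g_descriptorsBuilt = 0;  // region descriptors constructed so far
int g_entries = 0;           // region entries so far
int g_active = 0;            // regions currently open

// Hypothetical stand-in for CodeRegion: built once per REGION site.
struct RegionInfo {
    const char *name;
    int timeoutMs;
    RegionInfo(const char *n, int t) : name(n), timeoutMs(t) { ++g_descriptorsBuilt; }
};

// Hypothetical stand-in for ThreadCodeRegion: built on every entry.
struct ScopedFrame {
    explicit ScopedFrame(RegionInfo *) { ++g_entries; ++g_active; }
    ~ScopedFrame() {
        assert(g_active > 0);
        --g_active;
    }
};

// Same shape as REGION/ENDREGION: open a block, declare both objects,
// and let the closing brace run the frame's destructor.
#define SKETCH_REGION(name, timeout) { \
    static RegionInfo sketch_info(name, timeout); \
    ScopedFrame sketch_frame(&sketch_info);
#define SKETCH_ENDREGION() }

// Enters the same region twice and sums the number of open regions seen inside.
int enterTwice() {
    int seen = 0;
    for (int i = 0; i < 2; ++i) {
        SKETCH_REGION("loop body", 100)
            seen += g_active;
        SKETCH_ENDREGION();
    }
    return seen;
}
```

Because the descriptor is a function-local static, its one-time construction cost is paid on the first entry only, and later entries touch just the stack frame.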
A sample thread-entry function can look as follows:
ULONG WINAPI Thread(LPVOID param)
{
    INIT_THREAD("My thread");
    REGION(__FUNCTION__, 10000)
        REGION(__FUNCTION__ " internal region", 1000)
        ENDREGION();
    ENDREGION();
    return 0;
}
Callbacks
To do something useful with the system, we need to register a callback. This is done by calling AddCallback on the TimeCop object. The callback consists of a function and an arbitrary pointer that may be used to pass extra context to the callback.
Example
void Violation(void *context,
               ThreadInfo *Thread,
               int time,
               ThreadCodeRegion *top,
               ThreadCodeRegion *fault)
{
    std::cerr << "Thread " << Thread->ThreadName
              << " violated a time constraint" << std::endl;
    // Print the region stack of the offending thread, top-most first.
    for (const ThreadCodeRegion *frame = top; frame; frame = frame->PrevFrame)
    {
        std::cerr << " " << frame->CurrentRegion->Name << std::endl;
    }
    // Do not report this frame again.
    fault->Ignore = true;
}
int main()
{
    TimeCop tc;
    tc.AddCallback(Violation, 0);
    CreateThread(0, 0, Thread, 0, 0, 0);
    ...
    ...
}
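The callback list itself is not shown in the article. One possible shape is sketched below, assuming a `std::mutex` in place of the CRITICAL_SECTION and a callback signature reduced to two parameters. `CallbackList`, its methods, and `addToInt` are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical, simplified registry of (function, context) pairs.
class CallbackList {
public:
    // Reduced signature; the article's callback takes more arguments.
    typedef void (*Callback)(void *context, int timeInRegion);

    void Add(Callback cb, void *context) {
        assert(cb != nullptr);
        std::lock_guard<std::mutex> guard(lock_);
        items_.push_back(Delegate{cb, context});
    }

    // Removes every registration matching both the function and the context.
    void Remove(Callback cb, void *context) {
        std::lock_guard<std::mutex> guard(lock_);
        items_.erase(std::remove_if(items_.begin(), items_.end(),
                                    [cb, context](const Delegate &d) {
                                        return d.cb == cb && d.context == context;
                                    }),
                     items_.end());
    }

    // Calls every registered callback with its own context.
    void Invoke(int timeInRegion) {
        std::vector<Delegate> snapshot;
        {
            std::lock_guard<std::mutex> guard(lock_);
            snapshot = items_;   // copy, so callbacks run without the lock held
        }
        for (const Delegate &d : snapshot) d.cb(d.context, timeInRegion);
    }

private:
    struct Delegate {
        Callback cb;
        void *context;
    };
    std::mutex lock_;
    std::vector<Delegate> items_;
};

// Sample callback: treats the context as an int and accumulates into it.
inline void addToInt(void *context, int timeInRegion) {
    *static_cast<int *>(context) += timeInRegion;
}
```

Invoking from a snapshot lets a callback add or remove registrations without deadlocking on the list's own lock, at the price of one vector copy per violation, which should be a rare event.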
Applications
This system can be used for a variety of purposes. The simplest is logging the event or displaying some warning message (not recommended for services that don't have a desktop). It can also be used to terminate the offending thread, or do some control work on blocking objects. In some cases, this system can be used for managing performance. For example, we can schedule a thread to do some time-consuming job in the background, with low priority, while keeping the system responsive. However, if the task is not completed in the desired time, we can increase the thread's priority in order to force the task to complete sooner.
Future work
This code uses a rather inaccurate time source, GetTickCount(). A better timer could be used, although if violations are only checked for once a second, the difference hardly matters. Measuring only the thread's own execution time (rather than system time) might also prove useful.
I will also try to make the violation-checking mechanism more responsive, e.g., by introducing a control variable that holds the time of the next timeout. The watcher period could then be much shorter (e.g., 50 ms), with the actual check performed only once the system time passes the time stored in the variable.
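The control-variable idea could be sketched as follows. This is a speculative design, not code from the article: `NextDeadline`, `kNoDeadline`, and the method names are hypothetical, times are plain 64-bit millisecond values, and the sketch assumes region entry may afford one extra atomic operation.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>

// Sentinel meaning "no region is currently being timed".
constexpr std::int64_t kNoDeadline = std::numeric_limits<std::int64_t>::max();

// Hypothetical shared variable holding the earliest pending deadline.
class NextDeadline {
public:
    // Called on region entry with entry time + timeout.
    // Lowers the stored value if this deadline is earlier (lock-free minimum).
    void Propose(std::int64_t deadlineMs) {
        assert(deadlineMs >= 0);
        std::int64_t cur = next_.load(std::memory_order_relaxed);
        while (deadlineMs < cur &&
               !next_.compare_exchange_weak(cur, deadlineMs,
                                            std::memory_order_relaxed)) {
            // cur now holds the latest value; the loop condition re-checks it
        }
    }

    // Called by the watcher on every short tick; cheap enough for a 50 ms period.
    bool ScanDue(std::int64_t nowMs) const {
        return nowMs >= next_.load(std::memory_order_relaxed);
    }

    // Called by the watcher after a full scan with the earliest deadline
    // it still found pending, or with no argument if there was none.
    void Reset(std::int64_t earliestPendingMs = kNoDeadline) {
        next_.store(earliestPendingMs, std::memory_order_relaxed);
    }

private:
    std::atomic<std::int64_t> next_{kNoDeadline};
};
```

Leaving a region does not raise the stored value in this sketch, so the watcher may occasionally run a full scan that finds nothing; that costs one extra scan and keeps region exit unchanged. A complete design would also have to handle a region entered while the watcher is in the middle of a scan, which this sketch does not address.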
Final note
I plan to use this code in some server applications where sometimes extremely rare deadlock conditions occur. If you need to perform some time-constraint-based self-diagnostic or performance management, the system might prove useful. If you find this subject interesting or the code useful, I'd be glad to know, and even more satisfied to have proved myself helpful.