TimeCop - a tool for detecting time-constraint violations

Mike65536

23 Nov 2007

This article demonstrates a tool for detecting situations when certain code region execution times exceed the specified timeout, and provides a run-time reaction mechanism for these situations.

Introduction

In this article, I will present a method for detecting time-constraint violations within a complex system embedded in a process.

Background

In complex systems, it often happens that a thread deadlocks or falls into an infinite loop. This is highly undesirable, yet much harder to spot than a simple access violation or anything else that triggers an exception, because from the point of view of the OS and the processor the system is working correctly. When the system is very large, such errors are extremely hard to reproduce and fix. Fixing such an error in its working environment is rarely possible, due to the physical distance to the customer, release builds that strip all debugging information, and various other reasons.

The system I present can be used for detecting such situations by defining a time constraint on a code region and detecting its violation with a watcher thread. It does not introduce any symbolic information that a debugger could access, nor does it degrade performance significantly.

Basic concept

The basic concept behind this code is the idea of code regions. Unlike functions, they can be defined at any scope, ranging from a single line of code up to several function calls. These regions are linked to form a stack, which is useful when retrieving information about the violation point.

Example

C++

void function()
{
    REGION("My region", 1000)
    { // code here should not execute for longer
      // than 1000 milliseconds
    }
    ENDREGION();
}

The main effort in implementing this system goes into keeping the region entry/leave code as lightweight as possible, shifting the load to the watcher thread, which does its job relatively rarely. That means that neither dynamic allocation nor long conditional sequences are allowed in the region guarding code.

The other part of the idea is a watcher that monitors the threads and detects termination events and time-constraint violations. The watcher thread must hold a set of handles to the watched threads, so that no handles are created or closed during its normal operation. When a constraint is violated, the watcher invokes user-defined callbacks, passing information about the current region, the offending region, and the thread.

Keeping the stack frames linked

In order to keep a stack-like structure linked across function calls, some global data is necessary. Plain global data is not a good idea where threads are concerned, and that is where Thread Local Storage (TLS) comes in. TLS is a set of data that resides at a specified memory location or under an index in some array, but every thread sees its own copy of the data. Moreover, when TLS is allocated, it appears in all existing threads and in all threads created later.
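To illustrate the "every thread sees its own copy" behaviour, here is a minimal sketch. It uses the standard C++11 thread_local keyword rather than the Win32 TLS API this article is built on, and the names tls_marker and worker_value_after_set are made up for the example.

```cpp
#include <cassert>
#include <thread>

// Hypothetical per-thread slot: every thread gets its own copy.
thread_local int tls_marker = 0;

// Sets the slot inside a worker thread and reports what the worker saw.
// The caller's own copy of tls_marker is not affected.
int worker_value_after_set(int value)
{
    int seen = -1;
    std::thread t([&] {
        assert(tls_marker == 0);  // a fresh thread starts with its own zeroed copy
        tls_marker = value;
        seen = tls_marker;
    });
    t.join();
    return seen;
}
```

Whatever the calling thread stored in tls_marker before the call is still there afterwards, because the worker only ever touched its own copy.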

Now that we have a suitable place to store the stack frame pointer, we can define the structures.

Structures

Since the solution is quite low-level, it is built around a few plain data structures. The structures needed for this concept to function are:

thread information
region description
region-stack frame

The thread information block contains the thread name, the current top-most stack frame pointer, the thread ID, and the thread handle.

C++

class ThreadInfo
{
public:
    inline ThreadInfo();
    inline ~ThreadInfo();

    class Lock
    {
    public:
        inline Lock(ThreadInfo *ti);
        inline ~Lock();
        volatile LONG *pLockCount;
    };

    ThreadCodeRegion *volatile  Region;
    volatile LONG               LockCount;
    int             ThreadId;
    HANDLE          ThreadHandle;
    std::string     ThreadName;
};

The region description contains the name and the timeout. Some additional data might be put there.

C++

class CodeRegion
{
public:
    inline CodeRegion(const char *name, int timeout);
    const char *Name;
    int         Timeout;

};

The region stack frame contains information about the region, the time it was entered by the thread, and a link to a lower frame.

C++

class ThreadCodeRegion
{
public:
    inline ThreadCodeRegion(CodeRegion *cr);
    inline ~ThreadCodeRegion();
    CodeRegion          *CurrentRegion;
    ThreadCodeRegion    *PrevFrame;
    int                  EntryTime;
    bool                 Ignore;
};

The watcher

The main object of the system is called TimeCop. It implements a thread that watches all threads registered in the system. It maintains a map from thread ID to ThreadInfo, which is used for the periodic stack-frame searches. It also provides functions to register threads and to retrieve the descriptor block for the calling thread.

Algorithm

The algorithm used is straightforward. The watcher iterates through all thread information blocks, and for each of them, it does the following:

acquire the spin lock
go from the top of the stack to the bottom, looking for a frame where the time constraint has been violated
release the spin lock

If a violating frame has been found, the system suspends the offending thread and invokes user-defined callbacks that can perform program-specific action. After the callbacks have returned, the thread is resumed.
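The frame search in the middle step can be sketched as follows. This is a simplified, portable illustration and not the actual Analyze() code: Region and Frame are stripped-down stand-ins for CodeRegion and ThreadCodeRegion, the function name find_violation is hypothetical, and I assume that the Ignore flag means "already reported, skip this frame". Locking and thread suspension are left out.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for the article's CodeRegion / ThreadCodeRegion.
struct Region { const char *name; std::uint32_t timeout_ms; };
struct Frame {
    const Region *region;
    const Frame  *prev;      // next frame towards the bottom of the stack
    std::uint32_t entry_ms;  // tick count at region entry
    bool          ignore;    // assumed: set once a violation was reported
};

// Walk from the top frame down. Return the first frame whose time in
// region exceeds its timeout, or nullptr if there is none.
const Frame *find_violation(const Frame *top, std::uint32_t now_ms,
                            std::uint32_t &time_in_region)
{
    for (const Frame *f = top; f != nullptr; f = f->prev) {
        assert(f->region != nullptr);
        std::uint32_t elapsed = now_ms - f->entry_ms;  // unsigned, so wrap-safe
        if (!f->ignore && elapsed > f->region->timeout_ms) {
            time_in_region = elapsed;
            return f;
        }
    }
    return nullptr;
}
```

With an outer region (10000 ms, entered at tick 0) and an inner one (1000 ms, entered at tick 500), a scan at tick 2000 reports the inner frame with 1500 ms in region, while the outer frame is only reported once the tick count passes 10000.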

The watcher class header is as follows:

C++

class TimeCop
{
public:
    TimeCop();
    ~TimeCop();

    static TimeCop *Instance;
    static inline ThreadInfo *GetCurrentThreadInfo();
    ThreadInfo *InitCurrentThreadInfo(const char *ThreadName);

    typedef void (*Callback)(
        void *context,
        ThreadInfo *Thread,
        int TimeInRegion,
        ThreadCodeRegion *TopFrame,
        ThreadCodeRegion *FrameAtFault);

    void AddCallback(Callback cb, void *context);
    void RemoveCallback(Callback cb, void *context);

private:
    static int TlsEntry;

    CRITICAL_SECTION m_MapLock;
    typedef std::map<DWORD, ThreadInfo *> ThreadMap;
    ThreadMap threadmap;

    DWORD StartAtId;

    CRITICAL_SECTION m_CBLock;
    struct Delegate
    {
        Callback cb;
        void *context;
    };
    std::vector<Delegate> callbacks;
    void InvokeCallbacks(
         ThreadInfo *Thread,
         int TimeInRegion,
         ThreadCodeRegion *TopFrameAtFault,
         ThreadCodeRegion *FrameAtFault);

    void Start();
    void Run();   
    void Analyze();
    bool Analyze(
        ThreadMap::iterator it,
        int &TimeInRegion,
        ThreadCodeRegion *&FrameAtFault,
        ThreadCodeRegion *&TopFrameAtFault);

    static ULONG WINAPI _run(void *This);
    HANDLE m_hThread;
    HANDLE m_hWait;
    bool ExitRequested;
};

Entering and leaving regions

As I stated in the Basic concept section, the design of the region entering and leaving code is crucial to keeping the system efficient. The code is placed in the constructor and the destructor of the ThreadCodeRegion class.

C++

ThreadCodeRegion::ThreadCodeRegion(CodeRegion *cr)
    : CurrentRegion(cr)
{
    // this is a single lookup in the TLS
    ThreadInfo *ti=TimeCop::GetCurrentThreadInfo();
    PrevFrame=ti->Region;
    EntryTime=GetTickCount();
    Ignore=false;

    // this comes last, when all else is initialized
    // - we can avoid acquiring any locks here
    InterlockedExchangePointer(&(ti->Region), this);
}

The code above contains very few API calls, and none of them changes the processor mode (no true system calls are issued).
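For readers who want to experiment with the frame linking outside Windows, here is a minimal portable sketch of the same idea. It swaps the Win32 TLS and Interlocked calls for C++11 thread_local and std::atomic, and the names Frame, top_frame and current_depth are hypothetical. In the article's design the top pointer lives in ThreadInfo so that the watcher can reach it through its map; here it is a plain thread_local for brevity, and the watcher-side locking on removal is omitted.

```cpp
#include <atomic>
#include <cassert>

struct Frame;

// Per-thread top-of-stack pointer (stand-in for ThreadInfo::Region).
thread_local std::atomic<Frame *> top_frame{nullptr};

struct Frame {
    const char *name;
    Frame      *prev;  // next frame towards the bottom of the stack

    explicit Frame(const char *n)
        : name(n), prev(top_frame.load(std::memory_order_relaxed))
    {
        // Publish last, once every field is initialised.
        top_frame.store(this, std::memory_order_release);
    }
    ~Frame()
    {
        assert(top_frame.load() == this);  // frames unwind in LIFO order
        top_frame.store(prev, std::memory_order_release);
    }
    Frame(const Frame &) = delete;
    Frame &operator=(const Frame &) = delete;
};

// Number of regions the calling thread is currently inside.
int current_depth()
{
    int depth = 0;
    for (const Frame *f = top_frame.load(); f != nullptr; f = f->prev)
        ++depth;
    return depth;
}
```

Declaring a Frame in a nested scope pushes it, and leaving the scope pops it, so current_depth() goes 0, 1, 2 and back again as two nested scopes are entered and left.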

Leaving a region is a bit more problematic, since the frame may be in use by the watcher. To ensure it is not, a lock on the ThreadInfo is acquired before the frame is detached from the stack; only then can the object be safely destroyed.

C++

ThreadCodeRegion::~ThreadCodeRegion()
{
    ThreadInfo *ti=TimeCop::GetCurrentThreadInfo();
    ThreadInfo::Lock lock(ti); // acquire a spinlock
    InterlockedExchangePointer(&(ti->Region), PrevFrame);
}

This spinlock is in fact a quasi-spinlock - if the lock cannot be acquired, the thread yields execution. If it is acquired at the first attempt, no system call is issued.

C++

class /*ThreadInfo::*/ Lock
{
public:
    inline Lock(ThreadInfo *ti) : pLockCount(&ti->LockCount)
    {
        // spinlock
        while (InterlockedCompareExchange(pLockCount, 1, 0))
            SwitchToThread(); // yield execution
    }
    inline ~Lock()
    {
        InterlockedDecrement(pLockCount);
    }
    volatile LONG *pLockCount;
};
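A portable analogue of this lock, for experimenting without the Win32 headers, might look like the sketch below. It assumes std::atomic and std::this_thread::yield() in place of InterlockedCompareExchange and SwitchToThread(). The names YieldingLock and locked_increments are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Try to flip the flag 0 -> 1; yield the time slice instead of
// busy-waiting when another thread holds the lock.
class YieldingLock {
public:
    explicit YieldingLock(std::atomic<long> &flag) : flag_(flag)
    {
        long expected = 0;
        while (!flag_.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire)) {
            expected = 0;               // a failed exchange overwrote it
            std::this_thread::yield();  // counterpart of SwitchToThread()
        }
    }
    ~YieldingLock() { flag_.store(0, std::memory_order_release); }
    YieldingLock(const YieldingLock &) = delete;
    YieldingLock &operator=(const YieldingLock &) = delete;
private:
    std::atomic<long> &flag_;
};

// Two threads increment a plain (non-atomic) counter under the lock.
long locked_increments(int per_thread)
{
    std::atomic<long> flag{0};
    long counter = 0;
    auto work = [&] {
        for (int i = 0; i < per_thread; ++i) {
            YieldingLock guard(flag);
            ++counter;
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    assert(flag.load() == 0);  // the lock is free once both threads finish
    return counter;
}
```

Because every increment happens while the lock is held, the two threads never lose an update, even though the counter itself is an ordinary variable.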

Using the code

The system can be used by the end developer by means of three macros:

REGION(name, timeout)
ENDREGION()
INIT_THREAD(name)

The first one defines a region of code. It defines a static local variable of the CodeRegion class, and a stack-based local variable of type ThreadCodeRegion entering the aforementioned CodeRegion.

C++

#define REGION(name, timeout) {\
    static CodeRegion __code_region(name, timeout);\
    ThreadCodeRegion __region(&__code_region);

ENDREGION() is simply a '}'. INIT_THREAD is a convenient shortcut to the InitCurrentThreadInfo function.

C++

#define INIT_THREAD(name) TimeCop::Instance->InitCurrentThreadInfo(name)

A sample thread-entry function can look as follows:

C++

ULONG WINAPI Thread(LPVOID param)
{
    INIT_THREAD("My thread");

    REGION(__FUNCTION__, 10000) // ten seconds
        // some code

        // only one second here
        REGION(__FUNCTION__ " internal region", 1000)
           // some critical code
        ENDREGION();
        // some code
    ENDREGION();

    return 0;
}

Callbacks

To do something useful with the system, we need to register at least one callback. This is done by simply calling AddCallback on the TimeCop object. A callback consists of a function and an arbitrary pointer that may be used to pass some extra context to the function.

Example

C++

void Violation(void *context,
    ThreadInfo *Thread,
    int time,
    ThreadCodeRegion *top,
    ThreadCodeRegion *fault)
{
    cerr << "Thread " << Thread->ThreadName
         << " violated a time constraint" << endl;

    // dump all region names to STDERR
    for (const ThreadCodeRegion *frame=top; frame; frame=frame->PrevFrame)
    {
        cerr << "  " << frame->CurrentRegion->Name << endl;
    }
    fault->Ignore=true;
}

int main()
{
    TimeCop tc;
    tc.AddCallback(Violation, 0);

    CreateThread(0, 0, Thread, 0, 0, 0);
    ...
    ...
}
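The function-plus-context pairing can be tried in isolation with a sketch like the one below. It is not TimeCop's code: it assumes a std::mutex where TimeCop uses a CRITICAL_SECTION, shortens the callback signature to a single int for brevity, and the names CallbackList and add_to_total are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <vector>

class CallbackList {
public:
    typedef void (*Callback)(void *context, int time_in_region);

    void Add(Callback cb, void *context)
    {
        assert(cb != nullptr);
        std::lock_guard<std::mutex> guard(lock_);
        items_.push_back(Delegate{cb, context});
    }
    // Removes every entry registered with this exact (function, context) pair.
    void Remove(Callback cb, void *context)
    {
        std::lock_guard<std::mutex> guard(lock_);
        items_.erase(std::remove_if(items_.begin(), items_.end(),
                         [cb, context](const Delegate &d) {
                             return d.cb == cb && d.context == context;
                         }),
                     items_.end());
    }
    // Invoke on a copy, so a callback may add or remove entries itself.
    void Invoke(int time_in_region)
    {
        std::vector<Delegate> snapshot;
        {
            std::lock_guard<std::mutex> guard(lock_);
            snapshot = items_;
        }
        for (const Delegate &d : snapshot)
            d.cb(d.context, time_in_region);
    }
private:
    struct Delegate { Callback cb; void *context; };
    std::mutex lock_;
    std::vector<Delegate> items_;
};

// Example callback: adds the reported time to an int passed as context.
void add_to_total(void *context, int time_in_region)
{
    *static_cast<int *>(context) += time_in_region;
}
```

Identifying a registration by both the function and the context pointer lets the same function be registered several times with different contexts.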

Applications

This system can be used for a variety of purposes. The simplest is logging the event or displaying some warning message (not recommended for services that don't have a desktop). It can also be used to terminate the offending thread, or do some control work on blocking objects. In some cases, this system can be used for managing performance. For example, we can schedule a thread to do some time-consuming job in the background, with low priority, while keeping the system responsive. However, if the task is not completed in the desired time, we can increase the thread's priority in order to force the task to complete sooner.

Future work

This code uses quite an inaccurate time measurement method - GetTickCount() - so a better timer could be used (though, if violations are only checked for periodically, every second, who would care?). Measuring only the thread's own execution time (rather than system time) might also prove useful.
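As a rough sketch of what the time measurement could look like, the block below shows two things. The first is a tick difference computed in unsigned arithmetic, which stays correct when a 32-bit millisecond counter such as the one returned by GetTickCount() wraps around. The second is a monotonic alternative based on std::chrono::steady_clock. Neither is part of the downloadable code, and the names elapsed_ticks and Stopwatch are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Milliseconds between two 32-bit tick counts. Unsigned subtraction
// gives the right answer across a single wraparound of the counter.
std::uint32_t elapsed_ticks(std::uint32_t entry, std::uint32_t now)
{
    return now - entry;
}

// Monotonic alternative built on std::chrono::steady_clock.
class Stopwatch {
public:
    Stopwatch() : start_(std::chrono::steady_clock::now()) {}

    long long elapsed_ms() const
    {
        auto d = std::chrono::steady_clock::now() - start_;
        long long ms =
            std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
        assert(ms >= 0);  // steady_clock never goes backwards
        return ms;
    }
private:
    std::chrono::steady_clock::time_point start_;
};
```

For example, a region entered at tick 0xFFFFFFF0 and checked at tick 0x00000010 has been running for 0x20 (32) milliseconds, even though the raw counter value went down.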

I will also try to make the violation checking mechanism more responsive, e.g., by introducing a control variable holding the time of the next timeout - then the period could be much shorter (e.g., 50 ms) and the actual check made only when the system time exceeds the time stored in the variable.
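One possible shape for such a control variable is sketched below. It is only an illustration of the idea and does not exist in the current code. It assumes a 64-bit millisecond clock and std::atomic, and the class NextDeadline with its Propose, ScanDue and Reset members is entirely hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Earliest tick at which any active region could time out.
// kNone means "nothing to watch at the moment".
class NextDeadline {
public:
    static constexpr std::uint64_t kNone = UINT64_MAX;

    // Region entry: lower the stored deadline if ours is earlier.
    void Propose(std::uint64_t deadline_ms)
    {
        assert(deadline_ms != kNone);  // kNone is reserved as the marker
        std::uint64_t cur = value_.load(std::memory_order_relaxed);
        while (deadline_ms < cur &&
               !value_.compare_exchange_weak(cur, deadline_ms)) {
            // cur was refreshed by the failed exchange; try again
        }
    }
    // Watcher, on every short tick (e.g. every 50 ms): is a scan needed?
    bool ScanDue(std::uint64_t now_ms) const
    {
        return now_ms >= value_.load(std::memory_order_relaxed);
    }
    // Watcher, after a full scan has recomputed the earliest deadline.
    // Simplified: a real version must not lose a concurrent Propose().
    void Reset(std::uint64_t deadline_ms) { value_.store(deadline_ms); }
private:
    std::atomic<std::uint64_t> value_{kNone};
};
```

The region entry code would pay for one extra atomic operation, which has to be weighed against the goal of keeping entry and leave as cheap as possible.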

Final note

I plan to use this code in some server applications where extremely rare deadlock conditions occasionally occur. If you need to perform some time-constraint-based self-diagnostics or performance management, the system might prove useful. If you find this subject interesting or the code useful, I'd be glad to know, and even more satisfied to have proved myself helpful.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).