Fast Readers/Writer Lock

Ultrano

4.70/5 (21 votes)

5 Mar 2007CPOL7 min read

926

Optimized implementation of a readers/writers-lock sync object.

Download source code - 25.3 KB

Introduction

Readers/Writer Lock (RWLock) is a mechanism to synchronize access to a resource/object in a multithreaded software. Further information can be found in the following articles:

Background

The implementation of a Readers/Writer Lock requires attention, since it can degrade the overall performance and stability of a multithreaded application/server if not done right. Fortunately, the x86 CPU has specialized instructions for perfect synchronization on multi-processor systems, which makes it possible to make a completely-stable implementation of an RWLock. Since it'll be only an advantage, we'll use assembly-language for the whole implementation, pack it in a .lib, and use it from our C/C++ software.

But, simply writing it in assembly isn't a complete solution - we must be aware of how Windows works and really how fast it works. I'll be using a Windows 2000 SP4 running on a Sempron3000+, for the benchmarks.

First of all, context-switching. It is largely believed to be something really slow... but it is not. The actual numbers are: 322 cycles if no other threads are active, 587 cycles when one other thread is active, 600-700 cycles when there are 2-80 active threads. This benchmark was done many times while there were 530-610 threads (from 27-36 processes) running, and the CPU load was 10-30%. So, if our app is switched with another, that simply switches back, just 1175 cycles will have passed. Also, switching to another process' thread takes only a dozen or so extra cycles.

An application can force a context-switch via the the SwitchToThread() API. Though, this function is not available on Win98 (only a stub is provided there), so we have to use Sleep(0). But, Sleep(0) ignores threads with higher priority than the current thread's priority, so we should avoid it on NT-based Windows (Win2K, XP, Server2K3, Vista). Thus, our code should call Sleep(0) on Windows 98, and SwitchToThread() on the newer OSs.

Kernel Events are commonly used for notifications that some object/data is ready: here, we could use them as notifications that the RWLock object is ready for reading or writing. Quynh Nguyen has already presented such an approach here. There, the thread that requires a notification calls CreateEvent(...) and starts waiting by WaitForSingleObject(...). From then on, the thread-scheduler ignores this thread until the specified event is signaled. But, CreateEvent(...) by itself takes 1500-1900 cycles, which equals the time of three thread-switches. SetEvent(), which is used to signal the event, takes at least 600 cycles. Thus, the event-based approach is suitable when lock-downs (i.e., exclusive writing) takes at least (1900+600)/700*Quantum/3 times more time than the system's timer's period (15ms usually). "Quantum" is 6 on desktops, and up to 36 on servers. So, on a server, the event-based approach is most useful if lock-downs usually take more than 514ms (85ms on a desktop system).
Critical Sections - those 24-byte objects that allow synchronized exclusive access to a resource/object are quite fast. EnterCriticalSection() and LeaveCriticalSection() take just 20 cycles. We'll need something like a CSection to check/modify our RWLock's count of readers/writers. Still, 24 bytes could be a bit too much... knowing that we can actually use just one single byte to do the same stuff, and do it faster. The "xchg" and "cmpxchg" instructions of x86 CPUs come in handy here.
Other synchronization objects like mutexes, semaphores, and so on are too slow to consider using: we already have fast thread-switching by SwitchToThread(), and a fast locker to guard our numReaders and numWriters. So, we start coding based on the pseudocode for RWLock.

Implementation

typedef struct{
    char  locker1; // 0 for unlocked, 1 for locked

    char  numWriters; // [0..1]

    short numReaders; // [0..65535]

}RWLock; // 4 bytes


void StartRead(){
    do{
        LockByte();
        // get exclusive access to numReader
        // and numWriter via "locker1"

        if(numWriters==0){
            numReaders++;
            UnlockByte(); // release "locker1"

            return; // we've gained read-only access

        }
        UnlockByte();// unsatisfactory conditions,

        SwitchThread();// so we'd best yield execution

    }while(1);
}

LockByte() here is interesting, because it'll have to behave differently depending on whether the system has one CPU or more. The code between a LockByte() -> UnlockByte() block is usually just a few instructions, so on multiprocessor systems, we'll only have to waste a few additional cycles and retry to gain access if the first attempt was unsuccessful. But, if we do the same on a uni-processor system, we'll be wasting 30 to 120ms! So, if one CPU is present, we'd best call SwitchThread() immediately. Also, on multiprocessor systems, if a thread was switched-out before it calls UnlockByte(), and the next thread wants to LockByte() this same object... we'd be wasting cycles for a whole time slice - thus a spin-count is necessary (similar to CriticalSections).

SwitchThread() is also environment-dependent, since we'll call Sleep(0) on Win98, and SwitchToThread() on NT-based Windows.

Next, we implement StartWrite(), based on the pseudocode:

void StartWrite(){
    do{
        LockByte();
        if(numWriters==0){
            numWriters=1; // halt subsequent StartRead() calls if numReaders>0

            do{
                if(numReaders==0){
                    UnlockByte();
                    return; // we've gained read-write access

                }
                UnlockByte();
                SwitchThread();
                LockByte();
            }while(1);
        }
        UnlockByte();
        SwitchThread();
    }while(1);
}

We could make the StartWrite() allow the creation of more readers (than the already existing ones), but it would only delay the write-operation indefinitely. We might never start writing that way.

The EndRead() and EndWrite() are much simpler:

void EndRead(){
    LockByte();
    numReaders--;
    UnlockByte();
}
void EndWrite(){
    LockByte();
    numWriters=0;
    UnlockByte();
}
// Both procs can be unified in one, whose pseudocode would be:

void EndAccess(){
    LockByte();
    if(numReaders)numReaders--;
    else numWriters=0;
    UnlockByte();
}

Now, we have the access-locking procs, but what's left are the Init() and Free() procedures. Especially, we should note that Free() should wait for all readers and the writer to leave our RWLock object. The code:

void Init(){
    locker1=0;
    numReaders=0;
    numWriters=0;
}
void Free(){
    do{
        LockByte();
        if(numReaders==0 && numWriters==0){
            return; // !! leaves the object locked on purpose

        }
        UnlockByte();
        SwitchThread();
    }while(1);
}

Avoiding hidden risks

We'll leave the RWLock object locked after Free(), because this way, during application-development, we will notice a deadlock if we're inappropriately using the RWLock object. Where can we go wrong? Only when we call Free() on an object that is directly or indirectly known by the other threads. Example:

Dynamo* List1[10]={null};

class Dynamo{
    RWLock locker;
    int ID;
    char* myData;
    
    Dynamo(){
        myData = malloc(1000); // allocate memory for myData

        Init(&locker);
        ID = FindSuitableID_In_List1(); // returns 0..9

        List1[ID] = this; // "registers" the Dynamo object

    }
    ~Dynamo(){
        Free(&locker); 
        free(myData); // release the memory, taken for myData

        List1[ID] = null;
        // "unregisters" the Dynamo object. Here's our error.

    }    
    
    void SomeFunc(){
        if(random(2)==0){
            StartReading(&locker);
            [- read stuff from myData[] -]
            EndReading(&locker);
        }else{
            StartWriting(&locker);
            [- write stuff into myData[] -]
            EndWriting(&locker);
        }
    }
};

Let's assume we've got a multithreaded app, and several threads iteratively call List1[iterator]->SomeFunc() most of the time. If we delete one Dynamo object, and just before "List1[ID]=null", some thread starts this same object's SomeFunc(), we're in trouble. Why? Because, this Dynamo object and its "myData" probably don't exist anymore. Instead, we must make sure we "unregister" our Dynamo object before Free(&locker)! And, we must use Free(&locker) before we call free(myData). So, the correct destructor is:

~Dynamo(){
    List1[ID] = null;
    // at this point several threads
    // could be using this Dynamo object

    Free(&locker);
    // wait for all threads to release this Dynamo object

    free(myData);
    // ok, we can now safely destroy stuff from this Dynamo object
}

Same trouble would happen if our "Init(&locker)" line was after "List1[ID]=this". So far so good. But, to make sure our app is perfectly stable, we should also guard access to the List1[]'s contents. By using a global RWLock :) . Otherwise, memory/application cache might play a dirty trick. Having taken care of even the negligible risks with a probability 1/(10^18) makes the software stable: and ultimately makes happy users and even happier developers :).

Changing privileges

So far, the implementation seems complete... until we find-out that our locked objects have some operation that would take too much time... reminding of roadblocks in peak-hours. It happens that when we need to do a long Read+Write operation on a locked object, it's best to initially get Read-only access, do computations/allocations, then release the read-access, gain write-access, and modify that object. It also happens that our fingers are crossed during the transition from Read-Only to Read-Write, in the hope that the data we've read hasn't been modified by anyone else before we gain write-access. So, we'll need a way to let a reader become a writer, without giving write-access to anyone else meanwhile. Thus, we compose, and later implement, the following pseudo-code:

void UpgradeToWriter(){
    do{
        LockByte();
        if(numReaders==1 && numWriters==0){
            numReaders=0;
            numWriters=1;
            UnlockByte();
            return;
        }
        UnlockByte();
        SwitchThread();
    }while(1);
}

// and let's provide for completeness:

void DowngradeToReader(){
    LockByte();
    numReaders=1;
    numWriters=0;
    UnlockByte();
}

Multi-lock

If you need to write-lock two or more objects at once, you'd usually do:

StartWrite(object1); 
StartWrite(object2);
StartWrite(object3);
//... now read+write from+to the three objects

But, this way, while our app is waiting to gain exclusive access to object1, readers and writers from other threads can lock object2 and object3. If that is not acceptable, then we should do a simultaneous lock on the objects:

Mark the objects as prepared for writing (this will stop new reader-threads from locking object2 and object3),
wait until all existing read-locks are released. We can save up to (NumObjects-1) context-switches this way.

So, we compose, and then implement, the following pseudo-code:

void MultiStartWrite(int NumObjects,Objects){
    char credits[300]={2};
    // each object starts with 2 credits. Credits = # of chores to do

    int TotalCredits = 2*NumObjects; // total number of chores to do

    
    while(TotalCredits){
        for(int i=0;i<NumObjects;i++){
            if(credits[i]){
            // if this object still has something to be done on

                LockByte(); // this is Objects[i]->LockByte();

                if(credits==2){
                // if we've not marked this object for writing

                    if(numWriters==0){
                    // if there's no other writer

                        numWriters=1; // reserve this object for writing

                        credits[i]=1; // this object has one more thing to do

                        TotalCredits--;
                        if(numReaders==0){
                            credits[i]=0; // this object is completely ready

                            TotalCredits--;
                        }
                    }
                }else{ // credits=1

                    // we've marked the object for writing, now we only have

                    // to wait for the readers to release it

                    if(numReaders==0){
                        credits[i]=0; // this object is completely ready

                        TotalCredits--;
                    }
                }
                UnlockByte(); // this is Objects[i]->UnlockByte();

            }
        }
        if(TotalCredits==0)return; // we're ready

        SwitchThread();
    }
}
void MultiEndWrite(int NumObjects,Objects){
    for(int i=0;i<NumObjects;i++){
        LockByte();
        numWriters=0;
        UnlockByte();
    }
}

The implemented code (in ASM) can be called in two ways:

//------[ stack-based parameter-passing ]---------[

RWLock_MultiStartWrite(3,object1,object2,object3);
... // do stuff

RWLock_MultiEndWrite(3,object1,object2,object3);
//------------------------------------------------/


//or

//------[ array-based parameter-passing ]-------[

RWLOCK* Array1[3];
... // fill-in Array1


RWLock_MultiStartWrite(-3,Array1);
... // do stuff

RWLock_MultiEndWrite(-3,Array1);
//----------------------------------------------/

Expanded outlook

Since this RWLock object contains only numbers in its structure, and no Windows objects, one can combine it with CreateFileMapping+MapViewOfFile to make inter-process RWLock objects. Since the RWLOCK object takes only 4 bytes, and does not allocate any additional memory, large database-like designs can be simplified a lot. One can easily allocate dozens to hundreds of millions of RWLock-guarded objects without sweat. The objects can also be saved to disk, and later restored. Each of the Start/End procedures takes 30-40 cycles, leaving enough CPU power in your hands to use for your multithreaded application.

Finally, we're ready with this simple, but fast, implementation of a Readers/Writer Lock. I've included the ASM source code, the .h header, and the pre-built .lib. You're ready for using the library in your C/C++ right-away. If you want to modify+recompile the ASM code, you need MASM: find ml.exe in your Visual Studio folders, or get the MASM32 package from here. It's important to check the version of your ml.exe: version 8.x is extremely buggy, so avoid it. Version 6.1X is the best to use.

History

06.Feb.2007

First release.

06.Mar.2007

Fixed missing "lock" prefix before cmpxchg in RWLock.asm;
Fixed deadlock in StartWrite();

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)