Introduction
Readers/Writer Lock (RWLock) is a mechanism to synchronize access to a resource/object in a multithreaded software. Further information can be found in the following articles:
Background
The implementation of a Readers/Writer Lock requires attention, since it can degrade the overall performance and stability of a multithreaded application/server if not done right. Fortunately, the x86 CPU has specialized instructions for perfect synchronization on multi-processor systems, which makes it possible to make a completely-stable implementation of an RWLock. Since it'll be only an advantage, we'll use assembly-language for the whole implementation, pack it in a .lib, and use it from our C/C++ software.
But, simply writing it in assembly isn't a complete solution - we must be aware of how Windows works and really how fast it works. I'll be using a Windows 2000 SP4 running on a Sempron3000+, for the benchmarks.
First of all, context-switching. It is largely believed to be something really slow... but it is not. The actual numbers are: 322 cycles if no other threads are active, 587 cycles when one other thread is active, 600-700 cycles when there are 2-80 active threads. This benchmark was done many times while there were 530-610 threads (from 27-36 processes) running, and the CPU load was 10-30%. So, if our app is switched with another, that simply switches back, just 1175 cycles will have passed. Also, switching to another process' thread takes only a dozen or so extra cycles.
An application can force a context-switch via the the SwitchToThread()
API. Though, this function is not available on Win98 (only a stub is provided there), so we have to use Sleep(0)
. But, Sleep(0)
ignores threads with higher priority than the current thread's priority, so we should avoid it on NT-based Windows (Win2K, XP, Server2K3, Vista). Thus, our code should call Sleep(0)
on Windows 98, and SwitchToThread()
on the newer OSs.
- Kernel Events are commonly used for notifications that some object/data is ready: here, we could use them as notifications that the RWLock object is ready for reading or writing. Quynh Nguyen has already presented such an approach here. There, the thread that requires a notification calls
CreateEvent(...)
and starts waiting by WaitForSingleObject(...)
. From then on, the thread-scheduler ignores this thread until the specified event is signaled. But, CreateEvent(...)
by itself takes 1500-1900 cycles, which equals the time of three thread-switches. SetEvent()
, which is used to signal the event, takes at least 600 cycles. Thus, the event-based approach is suitable when lock-downs (i.e., exclusive writing) takes at least (1900+600)/700*Quantum/3 times more time than the system's timer's period (15ms usually). "Quantum" is 6 on desktops, and up to 36 on servers. So, on a server, the event-based approach is most useful if lock-downs usually take more than 514ms (85ms on a desktop system). - Critical Sections - those 24-byte objects that allow synchronized exclusive access to a resource/object are quite fast.
EnterCriticalSection()
and LeaveCriticalSection()
take just 20 cycles. We'll need something like a CSection
to check/modify our RWLock's count of readers/writers. Still, 24 bytes could be a bit too much... knowing that we can actually use just one single byte to do the same stuff, and do it faster. The "xchg
" and "cmpxchg
" instructions of x86 CPUs come in handy here. - Other synchronization objects like mutexes, semaphores, and so on are too slow to consider using: we already have fast thread-switching by
SwitchToThread()
, and a fast locker to guard our numReader
s and numWriter
s. So, we start coding based on the pseudocode for RWLock.
Implementation
typedef struct{
char locker1;
char numWriters;
short numReaders;
}RWLock;
void StartRead(){
do{
LockByte();
if(numWriters==0){
numReaders++;
UnlockByte();
return;
}
UnlockByte();
SwitchThread();
}while(1);
}
LockByte()
here is interesting, because it'll have to behave differently depending on whether the system has one CPU or more. The code between a LockByte()
-> UnlockByte()
block is usually just a few instructions, so on multiprocessor systems, we'll only have to waste a few additional cycles and retry to gain access if the first attempt was unsuccessful. But, if we do the same on a uni-processor system, we'll be wasting 30 to 120ms! So, if one CPU is present, we'd best call SwitchThread()
immediately. Also, on multiprocessor systems, if a thread was switched-out before it calls UnlockByte()
, and the next thread wants to LockByte()
this same object... we'd be wasting cycles for a whole time slice - thus a spin-count is necessary (similar to CriticalSections).
SwitchThread()
is also environment-dependent, since we'll call Sleep(0) on Win98, and SwitchToThread() on NT-based Windows.
Next, we implement StartWrite()
, based on the pseudocode:
void StartWrite(){
do{
LockByte();
if(numWriters==0){
numWriters=1;
do{
if(numReaders==0){
UnlockByte();
return;
}
UnlockByte();
SwitchThread();
LockByte();
}while(1);
}
UnlockByte();
SwitchThread();
}while(1);
}
We could make the StartWrite()
allow the creation of more readers (than the already existing ones), but it would only delay the write-operation indefinitely. We might never start writing that way.
The EndRead()
and EndWrite()
are much simpler:
void EndRead(){
LockByte();
numReaders--;
UnlockByte();
}
void EndWrite(){
LockByte();
numWriters=0;
UnlockByte();
}
void EndAccess(){
LockByte();
if(numReaders)numReaders--;
else numWriters=0;
UnlockByte();
}
Now, we have the access-locking procs, but what's left are the Init()
and Free()
procedures. Especially, we should note that Free()
should wait for all readers and the writer to leave our RWLock object. The code:
void Init(){
locker1=0;
numReaders=0;
numWriters=0;
}
void Free(){
do{
LockByte();
if(numReaders==0 && numWriters==0){
return;
}
UnlockByte();
SwitchThread();
}while(1);
}
Avoiding hidden risks
We'll leave the RWLock object locked after Free()
, because this way, during application-development, we will notice a deadlock if we're inappropriately using the RWLock object. Where can we go wrong? Only when we call Free()
on an object that is directly or indirectly known by the other threads. Example:
Dynamo* List1[10]={null};
class Dynamo{
RWLock locker;
int ID;
char* myData;
Dynamo(){
myData = malloc(1000);
Init(&locker);
ID = FindSuitableID_In_List1();
List1[ID] = this;
}
~Dynamo(){
Free(&locker);
free(myData);
List1[ID] = null;
}
void SomeFunc(){
if(random(2)==0){
StartReading(&locker);
[- read stuff from myData[] -]
EndReading(&locker);
}else{
StartWriting(&locker);
[- write stuff into myData[] -]
EndWriting(&locker);
}
}
};
Let's assume we've got a multithreaded app, and several threads iteratively call List1[iterator]->SomeFunc()
most of the time. If we delete one Dynamo
object, and just before "List1[ID]=null
", some thread starts this same object's SomeFunc()
, we're in trouble. Why? Because, this Dynamo
object and its "myData
" probably don't exist anymore. Instead, we must make sure we "unregister
" our Dynamo object before Free(&locker)
! And, we must use Free(&locker)
before we call free(myData)
. So, the correct destructor is:
~Dynamo(){
List1[ID] = null;
Free(&locker);
free(myData);
}
Same trouble would happen if our "Init(&locker)
" line was after "List1[ID]=this
". So far so good. But, to make sure our app is perfectly stable, we should also guard access to the List1[]
's contents. By using a global RWLock :) . Otherwise, memory/application cache might play a dirty trick. Having taken care of even the negligible risks with a probability 1/(10^18) makes the software stable: and ultimately makes happy users and even happier developers :).
Changing privileges
So far, the implementation seems complete... until we find-out that our locked objects have some operation that would take too much time... reminding of roadblocks in peak-hours. It happens that when we need to do a long Read+Write operation on a locked object, it's best to initially get Read-only access, do computations/allocations, then release the read-access, gain write-access, and modify that object. It also happens that our fingers are crossed during the transition from Read-Only to Read-Write, in the hope that the data we've read hasn't been modified by anyone else before we gain write-access. So, we'll need a way to let a reader become a writer, without giving write-access to anyone else meanwhile. Thus, we compose, and later implement, the following pseudo-code:
void UpgradeToWriter(){
do{
LockByte();
if(numReaders==1 && numWriters==0){
numReaders=0;
numWriters=1;
UnlockByte();
return;
}
UnlockByte();
SwitchThread();
}while(1);
}
void DowngradeToReader(){
LockByte();
numReaders=1;
numWriters=0;
UnlockByte();
}
Multi-lock
If you need to write-lock two or more objects at once, you'd usually do:
StartWrite(object1);
StartWrite(object2);
StartWrite(object3);
But, this way, while our app is waiting to gain exclusive access to object1
, readers and writers from other threads can lock object2
and object3
. If that is not acceptable, then we should do a simultaneous lock on the objects:
- Mark the objects as prepared for writing (this will stop new reader-threads from locking
object2
and object3
), - wait until all existing read-locks are released. We can save up to (
NumObjects-1
) context-switches this way.
So, we compose, and then implement, the following pseudo-code:
void MultiStartWrite(int NumObjects,Objects){
char credits[300]={2};
int TotalCredits = 2*NumObjects;
while(TotalCredits){
for(int i=0;i<NumObjects;i++){
if(credits[i]){
LockByte();
if(credits==2){
if(numWriters==0){
numWriters=1;
credits[i]=1;
TotalCredits--;
if(numReaders==0){
credits[i]=0;
TotalCredits--;
}
}
}else{
if(numReaders==0){
credits[i]=0;
TotalCredits--;
}
}
UnlockByte();
}
}
if(TotalCredits==0)return;
SwitchThread();
}
}
void MultiEndWrite(int NumObjects,Objects){
for(int i=0;i<NumObjects;i++){
LockByte();
numWriters=0;
UnlockByte();
}
}
The implemented code (in ASM) can be called in two ways:
RWLock_MultiStartWrite(3,object1,object2,object3);
...
RWLock_MultiEndWrite(3,object1,object2,object3);
RWLOCK* Array1[3];
...
RWLock_MultiStartWrite(-3,Array1);
...
RWLock_MultiEndWrite(-3,Array1);
Expanded outlook
Since this RWLock object contains only numbers in its structure, and no Windows objects, one can combine it with CreateFileMapping
+MapViewOfFile
to make inter-process RWLock objects. Since the RWLOCK object takes only 4 bytes, and does not allocate any additional memory, large database-like designs can be simplified a lot. One can easily allocate dozens to hundreds of millions of RWLock-guarded objects without sweat. The objects can also be saved to disk, and later restored. Each of the Start/End procedures takes 30-40 cycles, leaving enough CPU power in your hands to use for your multithreaded application.
Finally, we're ready with this simple, but fast, implementation of a Readers/Writer Lock. I've included the ASM source code, the .h header, and the pre-built .lib. You're ready for using the library in your C/C++ right-away. If you want to modify+recompile the ASM code, you need MASM: find ml.exe in your Visual Studio folders, or get the MASM32 package from here. It's important to check the version of your ml.exe: version 8.x is extremely buggy, so avoid it. Version 6.1X is the best to use.
History
- 06.Feb.2007
- 06.Mar.2007
- Fixed missing "lock" prefix before
cmpxchg
in RWLock.asm; - Fixed deadlock in
StartWrite()
;