Introduction
Various homegrown reader/writer lock objects were suggested by a number of developers striving to provide better performance than the existing API constructs could offer. Recently, Microsoft introduced the SRWLOCK
, a new built-in reader/writer primitive for Windows Vista.
Existing API synchronization constructs provide simple and fast access to a shared resource. On the other hand, API primitives (for example, a critical section) might not work well in scenarios when high contention on the lock occurs. Lock convoy is not the only problem here. If a writer competes with multiple readers, performance of a particular reader may not be guaranteed, varying in a wide range, to the extent where minimum performance is not acceptable due to the application requirements. As a rule, custom reader/writer lock objects are slower than API constructs. However, the benefit of using custom locks is that a hand-coded object may provide reasonable performance for all reader/writer threads. We will be looking for a solution that compromises between maximum performance of multiple readers and acceptable performance of a writer.
The tests results are summarized in Figure 1 and Figure 6. One writer and four readers were concurrently accessing a shared resource. The writer thread acquired an exclusive lock, filled the shared buffer with random bytes, and calculated the CRC. Each reader thread acquired a shared lock, calculated the CRC, and compared the latter with the CRC saved by the writer (an exception was supposed to be thrown in case of a CRC error, which would indicate incorrectness of the lock implementation). The following reader/writer lock objects were tested: the “CSRW
” object was the simplest implementation using the Windows API critical section; the “RWFavorNeither
” and “RWLockFavorWriters
” objects were taken from the MSDN magazine article “CONCURRENCY: Synchronization Primitives New To Windows Vista” written by Robert Saccone and Alexander Taskov (http://msdn.microsoft.com/en-us/magazine/cc163405.aspx); the “Ruediger_Asche_RWLock
” object was taken from the MSDN article “Compound Win32 Synchronization Objects” written by Ruediger R. Asche (http://msdn.microsoft.com/en-us/library/ms810427.aspx). This lock implements the “CRWLock
” compound object. The “Jim_B_Robert_Wiener_RWLock
” lock object was taken from “Multithreading Applications in Win32: The Complete Guide to Threads” written by Jim Beveridge and Robert Wiener. The new Windows Vista API user-space lock, SRWLOCK
, was also tested.
Figure 1. Testing one writer and four readers on a Uniprocessor PC, AMD Sempron(tm) 3000+, 2.00 GHz, 32-bit Windows XP Home Edition SP3, compiled in VC++ 2005 Express. Duration per operation in microseconds and the coefficient of variation, %.
Without locking
reader: t=0.213 (s=0.85%)
writer: t=3.234 (s=0.46%)
Critical section CSRW lock
reader: t1=41586.02 (s1=161.18%) t2=0.974 (s2=40.32%)
writer: t1=83.78 (s1=202.08%) t2=6.938 (s2=0.69%)
RWFavorNeither lock
reader: t1=1.166 (s1=9.18%) t2=1.09 (s2=7.05%)
writer: t1=230.71 (s1=118.71%) t2=7.741 (s2=0.86%)
RWLockFavorWriters lock
reader: t1=1.553 (s1=19.52%) t2=1.431 (s2=10.41%)
writer: t1=28.7240 (s1=11.4%) t2=8.623 (s2=1.86%)
Jim_B_Robert_Wiener_RWLock
reader: t1=30.202 (s=8.05%) t2=9.973 (s2=1.26%)
writer: t1=6.303 (s=6.84%) t2=6.105 (s2=2.68%)
Ruediger_Asche_RWLock
reader: t1=2.605 (s1=6.84%) t2=2.444 (s2=1.53%)
writer: t1=4484987.15 (s1=76.96%) t2=14.438 (s2=1.22%)
The average duration per read/write operation “t”, “t1”, and “t2” in microseconds and the coefficients of variation “s”, “s1”, and “s2” in % were calculated based on 50 tests; each of them included 1,000,000 and 4,000,000 iterations for a writer and a reader, respectively. Comparison of the “t” value (duration per operation without acquiring locks) with “t1”, and “t2” values (the duration when thread concurrency is present) shows performance degradation due to contention on the locks. The average duration “t2” was measured from the beginning to the end of the working threads. After all the threads were done with the given number of iterations, the overall duration per read/write operations was calculated. The following pseudo-code illustrates calculating “t2”:
Figure 2. Calculating average duration per operation.
worker_thread_procedure()
{
...
worker.m_perf_counter.start();
for (int i = 0; i < test_iteration; i++)
{
if( false == worker.simulate_work() )
{
::RaiseException( STATUS_NONCONTINUABLE_EXCEPTION, 0, 0, 0);
}
else
{
}
}
worker.m_perf_counter.end();
worker.m_perf_counter.set_iteration_done(test_iteration);
}
The duration “t2” in Figure 1 shows that there is not much difference in the locks' performance. Although using “t2” as criteria may appear simple and reasonable at first glance, it has several problems. If many reader threads are holding a shared lock, a writer can become completely locked from acquiring an exclusive lock due to the race condition. Being almost blocked and therefore having poor performance in the beginning, the writer may show “fine” overall performance at the end of the test after the other reader threads have completed the given “test_iteration
”. The “Ruediger_Asche_RWLock
” object is an example of a lock that behaves this way. As one of four readers has done all the 4,000,000 iterations, the writer may complete only 1-7 iterations (from 1,000,000), see Figure 3:
Figure 3. Individual results of testing the “Ruediger_Asche_RWLock” object. The number of iterations done and the duration per operation in milliseconds.
reader thread 1: n=3999973 (t=0.0025) writer thread: n=7 (t=1433.846)
reader thread 2: n=4000000 (t=0.0025)
reader thread 3: n=3999993 (t=0.0025)
reader thread 4: n=3999901 (t=0.0025)
When all the readers have finished, the writer unlocks and quickly finishes the remaining 999,993 iterations. Although, the “overall” performance of the “Ruediger_Asche_RWLock
” object may appear acceptable (t2=14.438 microseconds per write operation), it does not. Take a look at the duration t1=4484987.15 microseconds per writing operation in Figure 1. The estimation “t1” is similar to “t2”, but it is calculated during a short period of time at the beginning of the test, when high-contention on the shared/exclusive locks exists. The pseudo-code below illustrates calculating the duration “t1”:
Figure 4. Calculating average duration per operation when all the threads work concurrently.
worker_thread_procedure()
{
...
worker.m_perf_counter.start();
int i;
for (i = 0; i < test_iteration; i++)
{
if( false == worker.simulate_work() )
{
::RaiseException( STATUS_NONCONTINUABLE_EXCEPTION, 0, 0, 0);
}
else
{
}
if(TRUE == g_stop_test)
{
break;
}
else
{
}
}
worker.m_perf_counter.end();
::InterlockedExchange(&g_stop_test, TRUE);
worker.m_perf_counter.set_iteration_done(i);
}
“t1” and “s1” show poor performance of the critical section object “CSRW
” on Windows XP. Several reader threads can be almost “blocked” under high concurrency condition. While one reader thread has completed 4,000,000 iterations, the others may complete 6-48 iterations only. Figure 5 illustrates deviation in the “CSRW
” performance within individual threads and tests:
Figure 5. Individual test results when testing the “CSRW” object. The number of iterations done and the duration per operation in milliseconds.
...
test1:
reader thread 1: n=138302 (t=0.013412) writer thread: n= 4151 (t=0.449985)
reader thread 2: n=4000000 (t=0.000464)
reader thread 3: n=69812 (t=0.026571)
reader thread 4: n=3513004 (t=0.000492)
test2:
reader thread 1: n=30 (t=64.038158) writer thread: n= 294219 (t=0.006664)
reader thread 2: n=30 (t=64.035178)
reader thread 3: n=30 (t=64.032356)
reader thread 4: n=4000000 (t=0.000480)
...
Although, some reader threads are very fast (0.464-0.492 microseconds), the average reading cost, however, is extremely high (t1= 41586.02 microseconds). The high value of the coefficient of variation s1=161.18% indicates that performance of the individual reader threads varies in a large extent. It is difficult for the CSRW
to guarantee performance of all the readers. When using a new superior Windows Vista API SRWLOCK
object, high contention may have a very negative impact on performance. If one writer competes with many readers running on a Multi-core PC, it can be quite difficult for the writer to acquire an exclusive lock. Figure 6 shows poor SRWLOCK
writer performance, t1=186.171 microseconds, with a quite high coefficient of variation, s1=25.17%, on a Quad-core PC. A writer could only complete 5,000-30,000 iterations (from 1,000,000), while readers had almost finished all the 4,000,000 iterations. When changing the number of concurrent reader threads, the SRWLOCK
writer performance may vary in a great extend. If the number of readers increases from 2 to 4, most of the lock objects show performance regression from 40% to 100%, but the SRWLOCK
does not. The SRWLOCK
writer performance degrades form 7.1 to 186.2 microseconds (~26 times).
Figure 6. Testing one writer and four readers on a Quad-CPU PC, Intel Q6700, 2.66 GHz, Hyper-Threading disabled, 64-bit Windows Vista Home Premium SP1, compiled in VC++ 2008 Express. Duration per operation in microseconds and the coefficient of variation, %.
Without locking
reader: t=0.106 (s=1.45%)
writer: t=1.416 (s=0.79%)
Critical section CSRW lock
reader: t1=19.337 (s1=48.99%) t2=1.322 (s2=5.65%)
writer: t1=2.018 (s1=3.17%) t2=2.194 (s2=4.71%)
Windows Vista SRWLOCK lock
reader: t1=0.991 (s1=11.77%) t2=0.881 (s2=8.2%)
writer: t1=186.171 (s1=25.17%) t2=5.204 (s2=1.31%)
RWFavorNeither lock
reader: t1=1.274 (s1=6.59%) t2=1.381 (s2=3.07%)
writer: t1=72.13 (s1=5.31%) t2=6.981 (s2=1.64%)
RWLockFavorWriters lock
reader: t1=6.311 (s1=4.83%) t2=5.511 (s2=3.48%)
writer: t1=21.948 (s1=4.34%) t2=21.127 (s2=5.06%)
Jim_B_Robert_Wiener_RWLock
reader: t1=61.688 (s1=1.03%) t2=59.166 (s2=0.59%)
writer: t1=84.911 (s1=9.15%) t2=85.005 (s2=9.6%)
Ruediger_Asche_RWLock
reader: t1=1.755 (s1=11.69%) t2=1.803 (s2=7.95%)
writer: t1=2107697.58 (s1=108.77%) t2=10.607 (s2=5.17%)
The decision on which reader/writer lock is better to employ depends on the hardware and software architecture and a particular application scenario. However, the test results show that a likely candidate might be the “RWLockFavorWriters
” object. The cost of locks is relatively low (1.6 - 6.3 and 22 - 29 microseconds for reader/writer, respectively). Unlike the others, the “RWLockFavorWriters
” object performs consistently on different hardware/software platforms and test scenarios. If a writer competes with two readers (in place of four readers in Figure 1 and Figure 6), the readers and the writer perform ~30-60 % faster on both Uniprocessor and Quad-core PCs. The coefficient of variation stays relatively low (the same improving performance of the “RWFavorNeither
” object will lead to increasing the coefficient of variation up to 60-70%). On the other hand, depending on the number of concurrent reader threads, the kernel CPU loading can be high ~10-40%; a large number of thread context switching per second ~150000-200000 can cause additional overheads. As against to custom locks, the new user-space Windows slim lock performs very well, loading kernel CPU ~0%, and generating a small number of thread context switching per second ~10-1000. The SRWLOCK
object seems to be the best reader/writer lock to employ, if the writer performance is acceptable, and solution scalability is considered with respect to the increasing number of reader threads; as the number of reader threads increases, high contention on the lock may have a negative impact on the SRWLOCK
writer performance.
Conclusion
- It might be a disadvantage to use the Windows API critical section lock
CSRW
as a reader/writer lock in scenarios with multiple readers and under the condition when high contention on the lock occurs. - Windows Vista
SRWLOCK
seems to be the best object to employ, if potential regression of the writer performance along with increasing number of readers is acceptable. - The “
RWLockFavorWriters
” lock object is a high-performance, robust implementation of a reader/writer lock object in a wide range of application scenarios, performing consistently on both Uniprocessor and Multi-core platforms, and giving reasonable parity of reader/writer performance.