This article covers benchmarks and cross-benchmarks of the synchronization primitives implemented in the GreenSuperGreen library of the Unified Concurrency framework, for two distinct scenarios, the Heavy Contention and the Bad Neighbor.
The first two articles prepared the stage for benchmarking all the synchronization primitives implemented under Unified Concurrency and for cross-benchmarking them, to understand the advantages, disadvantages, costs, and behavior of the synchronization primitives relative to each other, based on the scenarios and circumstances.
The first group of benchmarks studied here will cover each synchronization primitive individually, each for both scenarios, the Heavy Contention and the Bad Neighbor:
- Monitor (C# lock) - from the previous article
- LockUC
- SpinLockUC
- TicketSpinLockUC
- AsyncLockUC
The AsyncSpinLockUC and the AsyncTicketSpinLockUC won't be covered, because their behavior is exactly the same as that of their general threading counterparts, the SpinLockUC and the TicketSpinLockUC. The AsyncSpinLockUC and the AsyncTicketSpinLockUC always return a completed awaiter implemented as a struct, not a class, so memory allocation is avoided and actual awaiting is avoided as well; the behavior is therefore obviously the same as for the non-async counterparts, which is supported by the measured data.
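To make the completed-awaiter argument concrete, below is a minimal sketch of an always-completed struct awaiter. It is an illustrative assumption, not the actual GreenSuperGreen type: because IsCompleted returns true, the compiler-generated state machine continues synchronously, so no heap allocation and no real awaiting take place.

```csharp
// Hypothetical sketch of an always-completed struct awaiter (not the library's actual type).
using System;
using System.Runtime.CompilerServices;

public struct CompletedEntrySketch : INotifyCompletion
{
    // The access has already been acquired synchronously, so the awaiter is complete.
    public bool IsCompleted => true;

    // The type is its own awaiter; being a struct, it stays on the stack, no class allocation.
    public CompletedEntrySketch GetAwaiter() => this;

    // Nothing to unwrap; the caller simply continues.
    public void GetResult() { }

    // Never reached when IsCompleted is true; present only to satisfy the awaiter pattern.
    public void OnCompleted(Action continuation) => continuation();
}

// Usage: 'await new CompletedEntrySketch();' continues synchronously with no Task overhead.
```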
The second group will cover cross-benchmarking, each for both scenarios, the Heavy Contention and the Bad Neighbor:
- Monitor (C# lock), SpinLockUC and TicketSpinLockUC
- Monitor (C# lock), LockUC
- Monitor (C# lock), LockUC and AsyncLockUC
- Monitor (C# lock), LockUC and AsyncLockUC, SpinLockUC, TicketSpinLockUC
The Unified Concurrency framework is implemented in the open-source GreenSuperGreen library, available on GitHub and NuGet (3 packages: library, unit tests, benchmark).
The throughput period of the Monitor, C# lock, behaves very well, monotonically, in a similar way to the sequential throughput period, without surprises. But the amount of CPU resources spent is troubling. The shorter the throughput period gets, the more CPU resources are wasted in the Heavy Contention scenario! That is actually significant, as it is contrary to common sense! In general, we expect that short throughput periods should result in a lower CPU load, an expectation driven by the thread suspend/resume technique, yet the Monitor, C# lock, is driven in these cases by its hybrid nature. If the contention is high enough, or there is a possibility that a peak load will create high contention, then the Monitor, C# lock, becomes a serious bottleneck. There are two possible cases:
- The high contention occurs normally, the code base is built for a high throughput period, and thus the issue will be recognized and handled soon.
- The high contention occurs only occasionally, during high peaks that are difficult to re-create with a complex code base; then the performance issue of the code base can be easily overlooked and can stay hidden for a long time.
The explanation is very simple. As stated, the Monitor, C# lock, is a hybrid lock. During an attempt to gain access, the thread is not suspended immediately when another thread already owns the access; instead, it spin-waits in an attempt to gain access sooner, in the hope of limiting the thread suspend/resume costs, and only if that does not work is the thread suspended. This hybrid approach was introduced in the times of the first multi-core boxes and it made the throughput period faster back then, but now, with 24 cores or more, it comes with heavy costs, and with each extra core it gets worse.
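A minimal sketch of the spin-then-suspend idea may help here. This is only an illustrative model built from a CAS flag and an AutoResetEvent, an assumption for explanation, not Monitor's actual implementation:

```csharp
// Illustrative model of a hybrid lock: spin first, suspend the thread only as a fallback.
using System.Threading;

public sealed class HybridLockSketch
{
    private int _taken;                                        // 0 = free, 1 = owned
    private readonly AutoResetEvent _wake = new AutoResetEvent(false);

    public void Enter()
    {
        var spinner = new SpinWait();
        // Phase 1: spin-wait briefly, burning CPU in the hope of a quick hand-over.
        for (int i = 0; i < 100; i++)
        {
            if (Interlocked.CompareExchange(ref _taken, 1, 0) == 0) return;
            spinner.SpinOnce();
        }
        // Phase 2: give up spinning and suspend the thread until the owner signals.
        while (Interlocked.CompareExchange(ref _taken, 1, 0) != 0)
        {
            _wake.WaitOne();
        }
    }

    public void Exit()
    {
        Volatile.Write(ref _taken, 0);
        _wake.Set();                                           // wake one suspended waiter, if any
    }
}
```

With few cores the spinning phase often pays off; with 24 or more cores under heavy contention, many threads spin simultaneously and the spinning phase turns into pure CPU waste.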
Scenarios with only short work inside the Monitor, C# lock, but with many requests, lose lots of CPU resources! Obviously, we can no longer neglect legacy code bases and aging software architectures lagging behind hardware architectures.
Based on the chart, a throughput period below 64 ms on the given box should be avoided, because over 60% of the CPU resources will be dissipated into heat! The number can get lower with a lower number of threads. Avoiding high levels of concurrency is always a good checkpoint, to verify that we are not throwing too much parallelism at a sequential problem. But the throughput period level at which 60% or more of the CPU resources are wasted will rise with more CPU cores, and that is where the current trend is heading.
Chart 1: Monitor (C# lock) - Heavy Contention Benchmark
Please remember that the Bad Neighbor scenario is about the cumulative effects of multiple unrelated thread pairs, each with the minimal contention of only two threads - a pair. As mentioned above in the Heavy Contention report for the Monitor, C# lock, with low contention the throughput period achievable through the Monitor, C# lock, can be significantly lower without great CPU wastage. But even here we can see that the Monitor, C# lock, plays a role in the CPU wastage, twice as much as expected for throughput periods below 16 ms. The Monitor, C# lock, is thus really a Bad Neighbor: a few unrelated services/threads with low contention can cumulatively dissipate over 50% of the CPU resources into heat, with only 48% of useful work being done. This statement is based on the ideal sequential trend lines in the chart.
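For clarity, here is a minimal sketch of the Bad Neighbor setup described above. The pair count, iteration count, and structure are assumptions for illustration, not the article's actual benchmarking harness: several unrelated pairs of threads, each pair contending only on its own private lock.

```csharp
// Sketch of the Bad Neighbor scenario: unrelated thread pairs, each with its own lock.
using System.Threading;

public static class BadNeighborSketch
{
    public static void Run(int pairs = 4)
    {
        for (int p = 0; p < pairs; p++)
        {
            object pairLock = new object();              // each pair contends only on this lock
            for (int t = 0; t < 2; t++)
            {
                new Thread(() =>
                {
                    for (int i = 0; i < 1_000_000; i++)
                    {
                        lock (pairLock)                  // minimal contention: only two threads
                        {
                            // very short protected work
                        }
                    }
                }) { IsBackground = true }.Start();
            }
        }
    }
}
```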
Chart 2: Monitor (C# lock) - Bad Neighbor Benchmark
The LockUC is a new synchronization primitive based on TaskCompletionSource<object>, in usage very similar to the Monitor, C# lock, but it avoids spin-waiting, the hybrid part of the Monitor, C# lock. Access is fair. The LockUC was designed to outperform the Monitor, C# lock, by avoiding its hybrid nature in the area where the thread suspend/resume technique is still useful, and to serve legacy code that does not require a super high throughput period across a wide range of throughput periods, but at the same time should not dissipate CPU resources into heat needlessly. A thread unable to gain lock access is suspended immediately! This should be the encouraged scenario for all cases where we do not expect high contention, or do not even care about it. What we should care about is the impact of an overlooked synchronization primitive in edge cases that are difficult to create from other parts of the service. We should go for the easiest tools with the lowest impact that are capable of guaranteeing their behavior. The LockUC is a good choice for classic threading without the async/await option. The LockUC clearly outperforms the Monitor, C# lock, with fewer CPU resources wasted, and even the throughput period is better down to a 1.5 ms throughput period on the given box. Below a 1.5 ms throughput period, the Monitor, C# lock, is gaining on the LockUC, but both are already burning almost all available CPU resources, effectively burning more CPU resources on synchronization than on doing useful work! This suggests that threading alone with the thread suspend/resume approach won't work well beyond that line, and other approaches must be considered.
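The following is a minimal sketch of the idea behind a TaskCompletionSource<object>-based fair lock. The class name and structure are illustrative assumptions, not the actual LockUC implementation: waiters are queued in FIFO order and a waiting thread is suspended immediately on its own task, with no spin-waiting.

```csharp
// Minimal sketch of a fair, TaskCompletionSource-based blocking lock (not the LockUC code).
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed class FairBlockingLockSketch
{
    private readonly object _gate = new object();    // protects only the tiny internal state
    private readonly Queue<TaskCompletionSource<object>> _waiters = new Queue<TaskCompletionSource<object>>();
    private bool _taken;

    public IDisposable Enter()
    {
        TaskCompletionSource<object> waiter = null;
        lock (_gate)
        {
            if (!_taken) _taken = true;
            else
            {
                waiter = new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);
                _waiters.Enqueue(waiter);            // FIFO order => fair access
            }
        }
        waiter?.Task.Wait();                         // thread is suspended immediately, no spin-waiting
        return new Exit(this);
    }

    private void Release()
    {
        TaskCompletionSource<object> next = null;
        lock (_gate)
        {
            if (_waiters.Count > 0) next = _waiters.Dequeue();   // hand ownership to the next waiter
            else _taken = false;
        }
        next?.SetResult(null);                       // wake exactly one suspended thread
    }

    private sealed class Exit : IDisposable
    {
        private readonly FairBlockingLockSketch _owner;
        public Exit(FairBlockingLockSketch owner) { _owner = owner; }
        public void Dispose() { _owner.Release(); }
    }
}
```

The caller-side shape is the same as a lock statement: `using (theLock.Enter()) { /* protected section */ }`.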
Chart 3: LockUC - Heavy Contention Benchmark
Please remember that the Bad Neighbor scenario is about the cumulative effects of multiple unrelated thread pairs, each with minimal contention. The LockUC loses a little throughput period to the Monitor, C# lock, but in exchange, the CPU performance is significantly improved. Where the throughput period would be an issue, another synchronization primitive should be used, like the SpinLockUC.
Chart 4: LockUC - Bad Neighbor Benchmark
The SpinLockUC is a thin wrapper around the .NET SpinLock struct and is designed to offer the best throughput period for cases where accessing threads have a low probability of meeting each other in the protected section and where giving up fairness helps to improve the throughput period. If the operation in the protected section is very short, it is hard, even with 24 threads, to cause a situation where multiple threads would be spinning for a long time to gain access. The more probable situation is that before a new access request comes, the previous access request will most of the time have been processed already. That is the ideal situation, but the chart above shows that 24 threads on the given box reach a critical point when the throughput period gets worse than 324 µs. Past that point, under high contention, some threads effectively spend their time spinning/waiting to gain access, even for seconds, if more threads are attempting to gain access at the same time, because the access is not fair.
This situation is a great problem on NUMA boxes, because for some threads accessing certain memory is easier than for others, based on the NUMA node distance metric between the particular memory and the CPU core, which amplifies the issue of the lack of fairness. The cost of cross-NUMA-node access can be relatively measured with the CoreInfo tool from Mark Russinovich, Microsoft: https://docs.microsoft.com/en-us/sysinternals/downloads/coreinfo. The design requirement to seek the highest throughput period for the SpinLock came with this price. Please understand that these limitations are not some errors left in the SpinLock implementation! These issues are intrinsic properties of synchronization primitives based on atomic instructions where access fairness is not guaranteed! The future seems to be heading in the direction of more CPU cores and more NUMA nodes in the same box. That is a promise that future boxes will push the critical point of the throughput period way below 324 µs. We should be prepared and count on this, because these issues are the reason why Red Hat Enterprise Linux already implemented a TicketSpinLock. It has a slightly worse throughput period, but in exchange, the TicketSpinLock is capable of guaranteeing fair access, which is important for better load balancing and pushes the system closer to real-time operating system qualities.
Unfair access can play a significant role in unexplained momentary degradations of the throughput period, and we should keep in mind how to avoid it! We must make execution inside the SpinLock entry very fast, ideally only memory update instructions, avoiding slow computations and strictly avoiding thread suspension from I/O or kernel calls, and, if possible, also avoiding memory allocation.
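As a sketch of that guideline, the hypothetical counter below keeps the .NET SpinLock entry down to a single memory update; the class and member names are illustrative assumptions:

```csharp
// Keep the SpinLock protected section extremely short: only a memory update, no I/O, no allocation.
using System.Threading;

public sealed class SpinLockedCounter
{
    private SpinLock _spin = new SpinLock(enableThreadOwnerTracking: false);
    private long _value;

    public void Increment()
    {
        bool taken = false;
        try
        {
            _spin.Enter(ref taken);
            _value++;                // the only work done inside the locked entry
        }
        finally
        {
            if (taken) _spin.Exit();
        }
    }
}
```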
The unfairness of the SpinLockUC is clearly visible in Chart 5, because it causes peaks in the throughput period statistics: different threads fare starkly differently. For throughput periods above 324 µs, some threads get lots of entries and a few get no entries at all for ten seconds.
Chart 6 shows only the throughput period, specifically the median and max throughput period, where we can see the behavior of two distinct groups of threads: one that almost always gets the entry, closer to the ideal throughput period, and another, closer to the max throughput period of 10 seconds, of threads that do not get many entries or get no entry at all! It has to be said that attempting to execute long-running code inside the SpinLock is just a bad idea, and Charts 5 and 6 show how pointless it is.
Chart 5: SpinLockUC/AsyncSpinLockUC - Heavy Contention Benchmark
Chart 6: SpinLockUC/AsyncSpinLockUC - Heavy Contention Benchmark - Unfairness - Median vs. Max Throughput Period
The SpinLockUC, based on the .NET SpinLock struct, plays pretty well in the Bad Neighbor scenario, because the contention is minimal yet still concurrent with a pair of threads. The SpinLock is designed for situations where contention is infrequent and execution inside the locked entry is kept extremely short. The CPU wastage can be lowered further by balancing the code so that the execution path takes longer outside the locked entry than inside it.
Chart 7: SpinLockUC/AsyncSpinLockUC - Bad Neighbor Benchmark
The TicketSpinLockUC is a general TicketSpinLock implementation that contends on the atomic operation until it gets access based on the ticket order of arrival, so the access is fair. There is no spin-waiting or yielding or sleeping as in the .NET SpinLock, and for that, it pays the heavy cost of atomic operation contention and takes a lot of CPU resources; because of the fairness, it also loses the throughput period below 200 µs on the given box, but that is not important! It is meant for extremely short execution paths in the locked entry, ideally just a few instructions or updating some variables, with the certainty that fair access, the FIFO order, will be ensured. This is a synchronization primitive useful for load-balancing algorithms with ensured thread starvation protection.
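A minimal sketch of the general ticket spin lock idea follows. It is an illustrative model, not the GreenSuperGreen TicketSpinLockUC code: each entrant atomically takes a ticket, then spins on the now-serving counter, which enforces FIFO, i.e., fair, access.

```csharp
// Illustrative ticket spin lock: fair FIFO access paid for with busy spinning.
using System.Threading;

public sealed class TicketSpinLockSketch
{
    private int _nextTicket;    // next ticket number to hand out
    private int _nowServing;    // ticket number currently allowed into the locked entry

    public void Enter()
    {
        // Atomically take a ticket; the order of arrival defines the access order.
        int myTicket = Interlocked.Increment(ref _nextTicket) - 1;

        // Contend until our ticket is served; the protected section must be extremely short.
        while (Volatile.Read(ref _nowServing) != myTicket)
        {
        }
    }

    public void Exit()
    {
        // Admit the next ticket holder.
        Interlocked.Increment(ref _nowServing);
    }
}
```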
Chart 8: TicketSpinLockUC/AsyncTicketSpinLockUC - Heavy Contention Benchmark
The TicketSpinLockUC implementation shows an improvement in the throughput period below 200 µs on the given box as the number of threads is lowered, in the Bad Neighbor scenario, to just two threads per synchronization primitive. This is a great example that throwing too much parallelism at a sequential problem is a bad idea, and it gives us an incentive to strike the right balance between the number of threads and the ratio of execution time inside the locked entry to that outside the locked entry, where the latter should take longer.
Chart 9: TicketSpinLockUC/AsyncTicketSpinLockUC - Bad Neighbor Benchmark
The AsyncLockUC is the only real async/await synchronization primitive here that actually returns an incomplete awaiter; timing and scheduling are then based on the ThreadPool. Obviously, if the ThreadPool threads are depleted, then the synchronization will lose performance, but let's face it, that would also imply poorly designed software, and suspend/resume synchronization techniques would lose performance in that case as well, due to many active threads competing for a CPU time slot.
The AsyncLockUC seems to have interesting performance; specifically, the CPU wastage here is very low. Access is fair. Locking inside the internal data structures is lock-free based and as fast and as short as possible, to avoid any contention. It is the only synchronization primitive in this report that does not cause a CPU gridlock => the moment when the CPU is burning cycles but the work actually being done is very small, because the cost of synchronization is taking almost all the CPU resources.
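To illustrate the usage shape of such an async/await lock, here is a minimal sketch built on SemaphoreSlim(1, 1). This is an assumption for illustration only; the real AsyncLockUC has its own lock-free internals, but the caller-side pattern of awaiting an entry instead of blocking a thread is the same idea.

```csharp
// Minimal async lock sketch: the waiting caller awaits a Task instead of blocking a thread.
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class AsyncLockSketch
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    public async Task<IDisposable> EnterAsync()
    {
        await _gate.WaitAsync().ConfigureAwait(false);   // no thread is held while waiting
        return new Releaser(_gate);
    }

    private sealed class Releaser : IDisposable
    {
        private readonly SemaphoreSlim _gate;
        public Releaser(SemaphoreSlim gate) { _gate = gate; }
        public void Dispose() { _gate.Release(); }
    }
}

// Usage:
// using (await asyncLock.EnterAsync())
// {
//     // protected section; the continuation is scheduled on the ThreadPool
// }
```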
The AsyncLockUC demonstrates that it can handle any workload, including a peak Heavy Contention; it does not dissipate CPU resources needlessly, and other services on the box still have resources available to carry on with their duties. The CPU usage here is below 42% max!
The AsyncLockUC is as close to the Ideal Synchronization Primitive as it can get, because for throughput periods longer than 200 µs its behavior fast approaches the Ideal Synchronization Primitive's performance!
Chart 10: AsyncLockUC - Heavy Contention Benchmark
The AsyncLockUC seems to behave reasonably in the Bad Neighbor scenario as well.
Chart 11: AsyncLockUC - Bad Neighbor Benchmark
Finally, we get to the first Cross-Benchmark! The Monitor, C# lock, has a hybrid nature, so it is fitting to cross-examine its behavior with the SpinLockUC and the TicketSpinLockUC.
The Monitor, C# lock, is a thread suspend/resume synchronization primitive enhanced with a hybrid, heuristic avoidance of thread suspension through spin-waiting, but it is not free of charge! We can see that below a 1 ms throughput period, the SpinLockUC and the Monitor, C# lock, behave very much the same in both CPU and throughput period. The TicketSpinLockUC with fair access loses to them on the throughput period.
I believe that the chart nicely shows the relations and especially where to use what:
- Obviously, the SpinLockUC should be avoided for any throughput period above 324 µs, because thread starving is highly probable.
- The TicketSpinLockUC should be used only in cases where we need ensured fairness, like load-balancing algorithms, and only for very short operations in the locked area. I would suggest throughput periods below 100 µs, because otherwise the CPU waste is not reasonable.
- This leaves an open question: what would be a replacement for the Monitor, C# lock, between 324 µs and 32 ms throughput periods?
Chart 12: Monitor (C# lock) & SpinLockUC & TicketSpinLockUC - Heavy Contention Cross-Benchmark
The Cross-Benchmark of the Monitor, C# lock, for the Bad Neighbor scenario clearly shows where the SpinLockUC will find its best usage: in algorithms where the throughput period is short and contention is minimal. There, the SpinLockUC will be capable of saving considerable CPU resources, because the hybrid nature of the Monitor, C# lock, clearly wastes a lot of CPU resources in these cases.
Chart 13: Monitor (C# lock) & SpinLockUC & TicketSpinLockUC - Bad Neighbor Cross-Benchmark
The Cross-Benchmark of the Monitor, C# lock, and its reasonable replacement, the LockUC, in the Heavy Contention scenario answers which synchronization primitive to use between 324 µs and 32 ms throughput periods. The LockUC clearly serves pretty well as a replacement, because the CPU wastage is lower and it gets worse much later than with the Monitor, C# lock.
Chart 14: Monitor (C# lock) & LockUC - Heavy Contention Cross-Benchmark
The Bad Neighbor scenario for the Cross-Benchmark of the Monitor, C# lock, and the LockUC with minimal contention shows the performance issues of the Monitor, C# lock, because even for extremely short execution paths, where the LockUC's CPU waste is dropping, the Monitor, C# lock, is still burning the CPU!
Chart 15: Monitor (C# lock) & LockUC - Bad Neighbor Cross-Benchmark
Finally, we can examine the Cross-Benchmark of the Monitor (C# lock), the LockUC, and the objectively best synchronization primitive, the AsyncLockUC, in the Heavy Contention scenario.
If we have to stay in a general threading environment, it is best to replace the Monitor (C# lock) with the LockUC, because it wastes fewer CPU resources and in some configurations actually does more work in the same time.
In case we can rewrite the code for the async/await environment, we can choose the best option there is under GreenSuperGreen/Unified Concurrency, and that is the AsyncLockUC, which is as close to the Ideal Synchronization Primitive as it can get, plus it avoids the CPU gridlock of other synchronization primitives!
There are some cases that were carried out by the Monitor, C# lock, where a low throughput period is necessary, disregarding the CPU wastage. I hope that the benchmarks have sufficiently shown that the CPU wastage of using the Monitor, C# lock, in such cases is insanely high. Such throughput periods are achievable by the SpinLockUC/AsyncSpinLockUC and the TicketSpinLockUC/AsyncTicketSpinLockUC, and all the benchmarking suggests that these parts of the code require different thinking and design, otherwise they become a bottleneck.
Chart 16: Monitor (C# lock) & LockUC & AsyncLockUC - Heavy Contention Cross-Benchmark
The Bad Neighbor scenario should not be any surprise. The suggestions are the very same as for the Heavy Contention scenario.
If we have to stay in a general threading environment, it is best to replace the Monitor (C# lock) with the LockUC, because it wastes fewer CPU resources and in some configurations actually does more work in the same time.
In case we can rewrite the code for the async/await environment, we can choose the best option there is under GreenSuperGreen/Unified Concurrency, and that is the AsyncLockUC, which is as close to the Ideal Synchronization Primitive as it can get, plus it avoids the CPU gridlock of other synchronization primitives!
Chart 17: Monitor (C# lock) & LockUC & AsyncLockUC - Bad Neighbor Cross-Benchmark
Finally, we can examine the Cross-Benchmark of all the primitives together, the Monitor (C# lock), the LockUC, the AsyncLockUC, the SpinLockUC and the TicketSpinLockUC, in the Heavy Contention scenario. The conclusions are the same as in the previous Cross-Benchmark: if we have to stay in a general threading environment, it is best to replace the Monitor (C# lock) with the LockUC, and if we can rewrite the code for the async/await environment, the AsyncLockUC is the best option under GreenSuperGreen/Unified Concurrency, as close to the Ideal Synchronization Primitive as it can get while avoiding the CPU gridlock of other synchronization primitives.
The cases previously carried by the Monitor, C# lock, where a low throughput period is necessary regardless of the CPU wastage, are achievable by the SpinLockUC/AsyncSpinLockUC and the TicketSpinLockUC/AsyncTicketSpinLockUC, and all the benchmarking suggests that these parts of the code require different thinking and design, otherwise they become a bottleneck.
Chart 18: Monitor (C# lock) & LockUC & AsyncLockUC & SpinLockUC & TicketSpinLockUC - Heavy Contention Cross-Benchmark
The Bad Neighbor scenario should not be any surprise. The suggestions are the very same as for the Heavy Contention scenario: stay with the LockUC in a general threading environment, or choose the AsyncLockUC where the code can be rewritten for the async/await environment.
Chart 19: Monitor (C# lock) & LockUC & AsyncLockUC & SpinLockUC & TicketSpinLockUC - Bad Neighbor Cross-Benchmark
Summary
This article covered benchmarks and cross-benchmarks of the synchronization primitives implemented in the GreenSuperGreen library of the Unified Concurrency framework for two distinct scenarios, the Heavy Contention and the Bad Neighbor.
We discussed the advantages, disadvantages, costs, weaknesses, strengths, and behavior of the synchronization primitives, individually and comparatively against each other, based on the scenarios and circumstances.
Revision History
- 28th June, 2018: Complete Cross-benchmark: Monitor (C# lock), LockUC and AsyncLockUC, SpinLockUC, TicketSpinLockUC