Introduction
This first Back to Basics article discusses concepts related to multi-threading to lay the groundwork for subsequent articles. As with any of my articles, if you see an error or believe I've skipped something important, please leave a comment!
Since this article will be a sort of crash course on low-level details to build up to subsequent articles, there is a lot of non-essential information I'll be skipping over. I will provide links for further reading where appropriate.
Table of Contents
Note: ToC will be updated as more articles in this series come out.
CPU, Operating System, and the CLR
Let's look at the Core i5 3570K and the Core i7 7700T. What we're mainly interested in is what the spreadsheet calls "Threads", which represents the number of logical CPU cores, while "Cores" is the number of physical CPU cores. A core is a processing unit which executes instructions. For more information about why the number of hardware threads is higher than the number of cores on the i7, check out hyper-threading.
In operating systems the kernel is the central unit that controls core functions such as management of threads, processes, memory, I/O, and scheduling - in fact it controls everything that occurs in the OS. When your executable file is run the OS creates a process kernel object for your application. A process is an isolated data container of sorts. It holds all the necessary information for an application including virtual address space and threads. Each process contains at least one thread kernel object. These kernel threads are what get scheduled onto CPU cores for processing. Each scheduled kernel thread gets a time-slice during which it executes. This execution can be interrupted by the kernel if
- The thread's time-slice expires.
- A higher-priority thread becomes schedulable.
- The thread blocks on some task like I/O and yields its time-slice.
When this interruption occurs the kernel saves the thread's state then schedules the next ready thread during what is called a context-switch. If the thread being switched to isn't from the same process a switch to the new process's address space is also required.
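You can peek at these kernel objects from managed code. Here's a minimal sketch (my own illustration, assuming a Windows console app) that uses System.Diagnostics to list the kernel threads backing the current process - even a trivial program will show several, since the CLR adds helper threads of its own:

using System;
using System.Diagnostics;

class KernelThreadPeek
{
    static void Main()
    {
        // The Process object wraps the process kernel object the OS created for us.
        Process current = Process.GetCurrentProcess();
        Console.WriteLine($"Process ID: {current.Id}");

        // Each ProcessThread wraps one of the process's thread kernel objects -
        // the things the kernel actually schedules onto cores.
        foreach (ProcessThread pt in current.Threads)
            Console.WriteLine($"Kernel thread {pt.Id}, state: {pt.ThreadState}");
    }
}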
After the process is created the OS loads the CLR into the process's address space since this is a .NET application, as indicated in the executable file's header. The CLR sets up the .NET environment for your application by loading domain-neutral DLLs such as MSCorLib.dll into the process, creating a default AppDomain for the application, and loading assemblies and required DLLs into the AppDomain. Once all this is set up the application is ready to execute.
In a nutshell, an application runs inside an AppDomain, which is managed by the CLR, which runs inside a process, which is managed by the OS/kernel.
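As a quick sanity check, here's a small sketch (targeting the .NET Framework, where AppDomain is fully supported) that prints the default AppDomain the CLR created and the assemblies it has loaded into it so far:

using System;

class DomainPeek
{
    static void Main()
    {
        // The default AppDomain the CLR created for this application.
        AppDomain domain = AppDomain.CurrentDomain;
        Console.WriteLine($"AppDomain: {domain.FriendlyName}");

        // Assemblies the CLR has loaded into this domain so far
        // (MSCorLib plus whatever the application has touched).
        foreach (var assembly in domain.GetAssemblies())
            Console.WriteLine(assembly.GetName().Name);
    }
}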
Thread and AppDomain
When inside the managed environment all threads are represented by a Thread object. These are called user or managed threads. Managed threads can map 1:1, N:1, or N:M with kernel threads depending on the CLR and external factors.
An operating-system ThreadId has no fixed relationship to a managed thread, because an unmanaged host can control the relationship between managed and unmanaged threads.
That's good to know, but knowing the default is also useful, so let's use the Windows API and Thread information to do some comparisons by looking at the main ways you'll access threading in C#: Thread, ThreadPool, and the Task Parallel Library.
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;

// Matches the Windows SYSTEMTIME structure filled in by GetLocalTime.
[StructLayout(LayoutKind.Sequential, Pack = 2)]
public struct SystemTime
{
    public ushort Year;
    public ushort Month;
    public ushort DayOfWeek;
    public ushort Day;
    public ushort Hour;
    public ushort Minute;
    public ushort Second;
    public ushort Milliseconds;

    public string ToHMSM() =>
        $"{Hour}:{Minute}:{Second}:{Milliseconds,3}";
}

class Program
{
    // Returns the ID of the kernel thread the call is running on.
    [DllImport("Kernel32.dll", CallingConvention = CallingConvention.StdCall)]
    public static extern uint GetCurrentThreadId();

    [DllImport("Kernel32.dll", CallingConvention = CallingConvention.StdCall)]
    public static extern void GetLocalTime(out SystemTime time);

    static void Main(string[] args)
    {
        int threadNumber = 10;
        Stopwatch sw = new Stopwatch();

        Console.WriteLine("Starting Threads...");
        Console.WriteLine("CLR ThreadID OS ThreadID Time");
        Console.WriteLine("------------ ----------- ----");
        CountdownEvent countdownEvent = new CountdownEvent(threadNumber);
        sw.Start();
        for (int i = 0; i < threadNumber; i++)
            new Thread(() =>
            {
                SystemTime time;
                GetLocalTime(out time);
                Console.WriteLine($"{Thread.CurrentThread.ManagedThreadId,12} {GetCurrentThreadId(),11:X} {time.ToHMSM()}");
                countdownEvent.Signal();
            }).Start();
        countdownEvent.Wait();
        sw.Stop();
        Console.WriteLine($"Total run time (ms): {sw.ElapsedMilliseconds}");

        Console.WriteLine("\nStarting ThreadPool threads...");
        Console.WriteLine("CLR ThreadID OS ThreadID Time");
        Console.WriteLine("------------ ----------- ----");
        countdownEvent.Reset();
        sw.Reset();
        sw.Start();
        for (int i = 0; i < threadNumber; i++)
            ThreadPool.QueueUserWorkItem(_ =>
            {
                SystemTime time;
                GetLocalTime(out time);
                Console.WriteLine($"{Thread.CurrentThread.ManagedThreadId,12} {GetCurrentThreadId(),11:X} {time.ToHMSM()}");
                countdownEvent.Signal();
            });
        countdownEvent.Wait();
        sw.Stop();
        Console.WriteLine($"Total run time (ms): {sw.ElapsedMilliseconds}");

        Console.WriteLine("\nStarting Tasks...");
        Console.WriteLine("CLR ThreadID OS ThreadID Time");
        Console.WriteLine("------------ ----------- ----");
        Task[] tasks = new Task[threadNumber];
        sw.Reset();
        sw.Start();
        for (int i = 0; i < threadNumber; i++)
            tasks[i] = Task.Run(() =>
            {
                SystemTime time;
                GetLocalTime(out time);
                Console.WriteLine($"{Thread.CurrentThread.ManagedThreadId,12} {GetCurrentThreadId(),11:X} {time.ToHMSM()}");
            });
        Task.WaitAll(tasks);
        sw.Stop();
        Console.WriteLine($"Total run time (ms): {sw.ElapsedMilliseconds}");

        Console.WriteLine("\nStarting TPL tasks...");
        Console.WriteLine("CLR ThreadID OS ThreadID Time");
        Console.WriteLine("------------ ----------- ----");
        sw.Reset();
        sw.Start();
        Parallel.For(0, threadNumber, _ =>
        {
            SystemTime time;
            GetLocalTime(out time);
            Console.WriteLine($"{Thread.CurrentThread.ManagedThreadId,12} {GetCurrentThreadId(),11:X} {time.ToHMSM()}");
        });
        sw.Stop();
        Console.WriteLine($"Total run time (ms): {sw.ElapsedMilliseconds}");

        Console.ReadKey();
    }
}
It appears that by default there is a 1:1 relationship between a Thread and a kernel thread. Don't give too much weight to the run times; I'll be doing stricter comparisons when I discuss each in future articles, and not all of these runs are strictly equivalent due to things like the CountdownEvent. However, the general times do reinforce that running short tasks on a small pool of threads is more efficient than creating a unique thread for each task.
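If you're curious how large that pool actually is, the ThreadPool class will tell you its current bounds. A minimal sketch (my own addition - the exact numbers depend on your core count and CLR version):

using System;
using System.Threading;

class PoolBounds
{
    static void Main()
    {
        int minWorkers, minIo, maxWorkers, maxIo;

        // Query the CLR thread pool's configured minimum and maximum sizes.
        ThreadPool.GetMinThreads(out minWorkers, out minIo);
        ThreadPool.GetMaxThreads(out maxWorkers, out maxIo);

        Console.WriteLine($"Worker threads:         {minWorkers} min / {maxWorkers} max");
        Console.WriteLine($"I/O completion threads: {minIo} min / {maxIo} max");
    }
}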
So now that we understand the relationship between a kernel thread and a managed Thread, what is the relationship between a process and an AppDomain? The point of a process is to isolate application environments. Since all managed code passes through the CLR, the CLR can make certain guarantees about that code. This allows .NET to use a lightweight process, the AppDomain, to isolate an application. Context-switching between AppDomains running inside a single process requires no kernel (a.k.a. system) calls - it is handled entirely by the CLR. Process creation and context-switching are time-consuming, so using AppDomains avoids these costly operations. This is a win-win for both space and time overheads. AppDomains are also useful for unloading individual assemblies: an assembly cannot be unloaded on its own, so every domain containing it must be unloaded instead. Creating a new AppDomain for an assembly allows you to unload that assembly in the future without unloading your main AppDomain and the other assemblies it contains.
In summation, a Thread is a lightweight, managed version of a kernel thread and an AppDomain is a lightweight, managed version of a process. They are basically managed abstractions over their operating system counterparts, with subtle differences such as kernel threads being owned by their process while a Thread can traverse AppDomains.
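To make the unloading point concrete, here's a sketch (again .NET Framework only - .NET Core and later dropped support for multiple AppDomains) that spins up a second domain, runs a callback in it, and unloads it without touching the default domain:

using System;

class DomainLifetime
{
    static void PrintDomain() =>
        Console.WriteLine($"Running in: {AppDomain.CurrentDomain.FriendlyName}");

    static void Main()
    {
        // Create a second, isolated AppDomain inside this same process.
        AppDomain worker = AppDomain.CreateDomain("Worker");

        // Run a callback inside the new domain; any assemblies it loads stay in that domain.
        worker.DoCallBack(PrintDomain);

        // Unload the whole domain - and everything loaded into it - while the
        // default AppDomain (and its assemblies) keeps running.
        AppDomain.Unload(worker);
        Console.WriteLine("Worker domain unloaded.");
    }
}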
When multiple threads exist a strategy for handling execution is necessary. A core can only execute a single thread at a time. While multiple cores exist on modern computers this is still insufficient for handling all the threads executing on a computer while remaining responsive. Concurrency is a strategy for dealing with this situation. Concurrency is when multiple threads are executing in the same time frame. Note I said time frame. When threads execute at the same time this is called parallelism which is a subset of concurrency.
Concurrency is commonly implemented with a technique called context-switching, which was discussed in the article's first section and is represented in the left image above. Each thread is given an amount of time to execute, after which the state of the thread is saved and execution context is switched to another thread. This prevents a single thread from starving other threads of CPU time.
Asynchrony refers to not being forced to execute synchronously - surprise! Something that is asynchronous may move on to other tasks during a blocking operation, like waiting on data to be sent, then resume the blocked work later. This does not imply multi-threading: implementations of asynchrony may be multi-threaded, but the concept itself does not necessitate it. Asynchronous behavior is useful when dealing with external systems where delays can occur.
As an example, imagine going to fill out a form such as a driver's license form. You request the form then check your phone while you wait for it. You (thread #1) are waiting for the form asynchronously by performing other tasks until it becomes available; having another person waiting with you wouldn't help at all. Now you finally get the form and start filling it out. If another person (thread #2) were helping, you would be filling out the form concurrently, and that extra pair of hands would improve the time it takes. A contrived example for sure, but I hope it gets the point across.
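If you want to see "asynchronous but not necessarily multi-threaded" in code, here's a small sketch of the same scenario using async/await (my own aside - async/await itself isn't covered in this article). The await on Task.Delay doesn't tie up a thread while "waiting for the form":

using System;
using System.Threading;
using System.Threading.Tasks;

class FormExample
{
    static void Main()
    {
        FillOutFormAsync().Wait();
    }

    static async Task FillOutFormAsync()
    {
        Console.WriteLine($"Requesting form on thread {Thread.CurrentThread.ManagedThreadId}");

        // Simulate waiting for the form. No thread sits blocked during the delay;
        // the method resumes (likely on a thread-pool thread) when the wait is over.
        await Task.Delay(1000);

        Console.WriteLine($"Filling out form on thread {Thread.CurrentThread.ManagedThreadId}");
    }
}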
The goal is to share as little data as possible between threads. The more data that is local to a thread, the less room there is for error. This is often not an option though, so understanding how memory works and what you need to guard against is important. So what happens in this scenario?
static void Main(string[] args)
{
    int a = 0, b = 1;
    Random rnd = new Random();

    Parallel.For(0, 10, index => Parallel.Invoke(
        () =>
        {
            a = rnd.Next(100);
            b = rnd.Next(100);
            Console.WriteLine($"Set {a}, {b} on Thread {Thread.CurrentThread.ManagedThreadId}, Iteration {index}\n");
        },
        () =>
        {
            Console.WriteLine($"Read {a}, {b} on Thread {Thread.CurrentThread.ManagedThreadId}, Iteration {index}\n");
        }));

    Console.ReadKey();
}
Here's an execution example:
I have two tracepoints placed on each Console.WriteLine. Some problems should be immediately noticeable. First, the reads and sets are out of order between the outputs even though both outputs come from the same statements. Second, the reads and sets are not reporting the same values. For example, look at the last three reads of each output: the console reports 33,91 twice while the tracepoint reports 34,96 twice. If you look closely over these results something even more disastrous is noticeable. Look at lines 8 and 9 of the console output (thread 11/iteration 3 and thread 10/iteration 6). Now look at the console sets and tracepoint output.
There is a third problem. Data can be lost depending on how the ordering and race conditions between reads and sets play out. In fact, the console and tracepoint outputs each have 7 distinct datasets, with one dataset each that doesn't appear in the other (36,79 and 94,30), while 10 datasets should have been generated. We have lost 30% of the datasets per output (20% total) including both reads and sets. If considering only reads it's even worse. What is going wrong here?
The compiler and JIT-er are free to optimize and re-arrange code however they see fit as long as it does not affect the single-threaded execution behavior. In practice many of these optimizations don't occur frequently, but being aware of them is still a good idea for debugging purposes.
Read Optimization
This can occur in multiple flavors. Reads can be eliminated entirely or reads can be multiplied. For the case where reads are eliminated consider these examples:
// Original:
int a = 1;
int b = a;

// Possible optimized form - the read of a is replaced with the known value:
int a = 1;
int b = 1;

// Original:
int c = 1;
c = 2;

// Possible optimized form:
int c = 2;
For an example where reads are multiplied:
// Original:
public void Test(Object obj)
{
    Object obj2 = obj;
    if (obj2 != null)
        Console.Write(obj2);
}

// Possible optimized form - the local copy is removed and obj is read again:
public void Test(Object obj)
{
    if (obj != null)
        Console.Write(obj);
}
Instead of being read once, obj is read twice after optimization. Now the code could write a null if obj is modified after the conditional, whereas in the pre-optimization example this would not occur due to the local assignment to obj2.
Loop-Invariant Code Extraction
This occurs when code inside a loop does not depend on the loop header or body.
// Original:
int x = 10;
int count = 0;
for (int i = 0; i < x; i++)
{
    count++;
}

// Possible optimized form - the increment is pulled out of the loop:
int x = 10;
int count = 0;
for (int i = 0; i < x; i++)
{
}
count = x;
In fact, the compiler or jitter may even optimize it further by simply doing int count = x or int count = 10.
Other optimizations include loop-unrolling (repeating loop code in the loop body to avoid loop condition checks) and function inlining (copy-pasting function code directly where it's called). I'm sure there are others I either forgot or simply don't know about. I'm not even mentioning optimizations the jitter can do at run-time, like removing entire statements such as an if statement that is always true in that application instance. However, in practice the most common optimization will simply be statement re-ordering. The classic example is:
int x = 0;
bool isCompleted = false;

public void Func1()
{
    x = 10;
    isCompleted = true;
}

public void Func2()
{
    if (isCompleted)
        Console.WriteLine(x);
}
If Func1 and Func2 are executed on separate threads, is it possible for Func2 to write 0 to the console? Yep: if isCompleted = true is moved before x = 10, then once isCompleted = true executes, Func2 can run in full before x = 10 does.
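If you want to hunt for this yourself, a harness like the sketch below (my own construction) runs both methods repeatedly on separate threads and checks for a stale read. Whether you ever see one depends on the JIT and the hardware - x86's strong ordering makes it unlikely, weaker architectures like ARM much less so - so treat a clean run as luck, not proof of correctness:

using System;
using System.Threading.Tasks;

class ReorderingHunt
{
    static int x;
    static bool isCompleted;

    static void Func1() { x = 10; isCompleted = true; }
    static int Func2() => isCompleted ? x : -1;   // -1 means "not completed yet"

    static void Main()
    {
        for (int i = 0; i < 1000000; i++)
        {
            x = 0;
            isCompleted = false;

            int observed = -1;
            Parallel.Invoke(
                Func1,
                () => observed = Func2());

            // Seeing 0 here means isCompleted became visible before x did.
            if (observed == 0)
                Console.WriteLine($"Observed x == 0 on iteration {i}");
        }
        Console.WriteLine("Done.");
    }
}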
It is difficult to talk about this topic without writing a small book because many details vary wildly between architectures like x86/x86-64, ARM, and Itanium. They all differ in instruction sets, memory models, and physical architecture. There are many components that may or may not be present - a store buffer, load buffer, write-combining buffer, invalidation queue, and memory-ordering buffer, just to name a few. Then there are all the various controllers, buses, and point-to-point interconnects that differ between architectures. Since this is a C# series, I will only be discussing common concepts that impact our applications. Specifics on each architecture can usually be found on the vendor's site if more information is desired - ex. the Intel64 and IA-32 Developer Manuals.
Cache
Cache is extremely fast memory that lives on the CPU. The speed is due to many factors such as proximity to the core, the type of RAM (SRAM), and not having to travel through the front-side bus or memory bus. In modern CPUs there are usually multiple caches, some shared between CPU cores and others dedicated to a specific core. Cache memory is split into data blocks called cache lines. These cache lines hold the data along with other information such as the data's address in main memory (the memory slots in your motherboard). When a core requests data from main memory, the cache is first checked to see if it already contains that data; if it doesn't, the cache fetches the data from main memory. From this description it is apparent that cores deal mainly with cache to reduce latency. If each core mainly deals with its own cache though, how is data integrity managed when the caches inevitably need to write new data back to main memory? Multiple caches may be handling the same main memory address.
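Before getting to that question, cache lines are easy to observe indirectly from C#. The sketch below (my own illustration - exact numbers vary by machine) touches every element of an array twice: once sequentially, so each cache line fetched from main memory services many accesses, and once with a large stride, so almost every access misses the cache. Same amount of work, very different times.

using System;
using System.Diagnostics;

class CacheLines
{
    static void Main()
    {
        const int size = 64 * 1024 * 1024;
        byte[] data = new byte[size];
        var sw = new Stopwatch();

        // Sequential: each cache line pulled from memory services many accesses.
        sw.Start();
        for (int i = 0; i < size; i++)
            data[i]++;
        sw.Stop();
        Console.WriteLine($"Sequential: {sw.ElapsedMilliseconds} ms");

        // Strided: jumping 4 KB at a time touches a different cache line almost every access.
        sw.Restart();
        for (int stride = 0; stride < 4096; stride++)
            for (int i = stride; i < size; i += 4096)
                data[i]++;
        sw.Stop();
        Console.WriteLine($"Strided:    {sw.ElapsedMilliseconds} ms");
    }
}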
Cache-coherency Protocols
To handle this cache coherence problem, cache-coherency protocols were introduced. The basic idea is that each cache line also has a state associated with it. With the MESI protocol, for example, each cache line can be in the modified, exclusive, shared, or invalid state. Depending on the current state and what operation is performed on a cache line, various signals are sent over a bus so the caches can maintain coherency between them. This ensures that if some data from main memory is loaded into one cache and another cache then attempts to load the same data, the first cache will service the load request, with both caches aware that they do not hold the only copy (the shared state). If one of these caches then tries to write to this data, it knows, due to the shared state, to tell the other caches to invalidate their cache line containing this data. This communication ensures that when the data is stored back to main memory, the cache performing the write holds the only modified copy. When exactly this write to main memory occurs is determined by the cache's write policy.
The takeaway is that coherency between caches is already handled by the hardware. It is not something the software developer usually has to deal with unless the architecture does not implement a cache-coherency protocol, or the protocol itself introduces some re-ordering - and even then, that re-ordering will most likely be covered by handling the main problem developers face in a multi-threaded environment: out-of-order execution.
Out-of-order Execution
Executing every instruction in-order would waste valuable instruction cycles if a delay occurred like waiting for data from memory. This is why the CPU is allowed to re-order instruction execution. We don't actually care about this execution re-ordering directly though because the instruction results are queued and re-ordered according to the architecture's memory ordering. What we care about is this result/memory re-ordering. There are four different flavors of memory ordering:
- Load-Load: loads can be re-ordered with other loads.
- Load-Store: loads can be re-ordered with subsequent stores.
- Store-Load: stores can be re-ordered with subsequent loads.
- Store-Store: stores can be re-ordered with other stores.
Different architectures have different guarantees on what they allow to be re-ordered. To my knowledge no architecture allows any re-ordering which would affect single-threaded behavior - the exception to this might be Alpha when dealing with data dependency. Leave a comment if you know! This means operations on the same address (data) won't be re-ordered if it would change the result in a single-threaded environment. The problem is when multiple threads are involved. For example on x86:
; Core 1
mov [x], 1
mov a, [y]

; Core 2
mov [y], 1
mov b, [x]
One could propose that, due to interleaving, this example might give the results [a=b=1], [a=0,b=1], or [a=1,b=0]. However, there is actually a fourth option. x86 allows (and in fact only allows) Store-Load memory re-ordering. This means both loads could be moved before the stores, resulting in [a=b=0]. Since each core's two instructions operate on separate addresses, this is allowed. The following, however, returns what you would expect:
; Core 1
mov [x], 1
mov a, [x]

; Core 2
mov [y], 1
mov b, [y]
The only result from this is [a=b=1]. Notice that each core is now operating on a single address, so no re-ordering occurs since it would affect single-threaded behavior.
Side note: Actually some re-ordering in this example is still possible because stores can be delayed in a buffer. What prevents the application from viewing this subtle re-ordering is known as store-to-load forwarding.
While a store is temporarily held in a processor's store buffer, it can satisfy the processor's own loads but is not visible to (and can not satisfy) loads by other processors.
Memory barriers (a.k.a. fences) allow developers to specify where compiler and/or CPU memory re-ordering should not occur. This is where we're going to step out from hardware into C#. The reason is that C# doesn't have native support for assembly, so assembly barriers are largely irrelevant to C# developers unless you "hack" something together with a VC++ DLL or C++/CLI. In addition, CLR barriers prevent compiler and JIT-er re-ordering, so why not use them? CLR barriers offer three distinct semantics:
- Acquire: a read/acquire barrier ensures no subsequent memory operations execute before preceding reads.
- Release: a write/release barrier ensures no preceding memory operations execute after subsequent writes.
- Memory/Full: a full barrier combines acquire and release semantics preventing any execution movement.
The "...before preceding reads" and "...after subsequent writes" distinction is very important. Some books and online sources will describe these as "...before/after the barrier." This behavior would be dangerously incorrect:
someObject.SomeValue = 3;
<write_barrier>
someObject.SomeValue = 5;
In this example nothing stops someObject.SomeValue = 5 from moving before someObject.SomeValue = 3 if the barrier only prevents re-ordering from before the barrier to after. This is why the true distinction is important: since the barrier actually prevents re-ordering from before the barrier to after subsequent writes, this re-arranging cannot take place.
So, since <write_barrier> is obviously not real C# code, what barrier primitives does C# actually expose? There are five: Volatile.Read, Volatile.Write, Thread.VolatileRead, Thread.VolatileWrite, and Thread.MemoryBarrier. MemoryBarrier is a full barrier: it prevents compiler, JIT-er, and CPU re-ordering in both directions according to the "Full" rule above. Volatile.Read prevents re-ordering according to the "Acquire" rule above while Volatile.Write prevents re-ordering according to the "Release" rule.
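Applied to the earlier Func1/Func2 example, one way (not the only way) to rule out the troublesome re-ordering is to pair a release write with an acquire read. A minimal sketch:

using System;
using System.Threading;

class FixedExample
{
    static int x;
    static bool isCompleted;

    static void Func1()
    {
        x = 10;
        // Release: no preceding memory operation may move after this write,
        // so x = 10 is already visible by the time isCompleted reads as true.
        Volatile.Write(ref isCompleted, true);
    }

    static void Func2()
    {
        // Acquire: no subsequent memory operation may move before this read,
        // so the read of x cannot be hoisted above the check.
        if (Volatile.Read(ref isCompleted))
            Console.WriteLine(x);
    }
}

With this pairing, Func2 can no longer print 0: if it sees isCompleted as true, the release write guarantees the write to x already happened.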
Thread.VolatileRead and Thread.VolatileWrite are a little different. They were originally a mistake: they use a MemoryBarrier to prevent re-ordering, which yields a stronger guarantee (and thus more overhead) than is actually needed. This is why Volatile.Read and Volatile.Write were implemented, to provide a proper implementation of volatile semantics. In multi-threading your intention needs to be cleanly expressed in code or you run the risk of difficult debugging sessions and/or unmaintainable code, so my advice is to not use Thread.VolatileRead/Write if you have access to the Volatile.Read/Write methods introduced in .NET Framework 4.5.
You'll notice I didn't mention the volatile keyword. I recommend never using it. Not only does it hide the behavior inside the type definition, but the exact semantics are in a state of flux. For example, some older documentation says that volatile translates into Thread.VolatileRead/Write calls while it actually translates into Volatile.Read/Write now [^]. If this changes again in the future, your code could break overnight for no easily discernible reason.
If working solely on a specific operating system, synchronization primitives exposed by that operating system's API such as Windows API Synchronization Functions are also available.
Additional Notes
I avoided talking about this earlier but the memory barriers provided by the architecture often have different guarantees than the ones provided by your language or framework. Generally the language barriers are composites of the architecture barriers. For example, these are the guarantees for memory barriers on x86:
- Loads cannot pass an earlier lfence or mfence.
- Stores cannot pass an earlier lfence, sfence, or mfence.
- An lfence cannot pass earlier loads.
- An sfence cannot pass earlier stores.
- An mfence cannot pass earlier loads or stores.
Also in some languages such as C++11 there can be a distinction between an acquire/release operation and acquire/release barrier. An operation guarantees acquire/release semantics in relation to a single operation while a barrier guarantees semantics to all relevant reads/writes.
Bonus
I've seen a question elsewhere regarding the exact implementation of Thread.VolatileRead/Write where part of the question was ignored by every response, so I figured it would make an interesting bonus section to answer! Let's look at Thread.VolatileRead(int):
[MethodImpl(MethodImplOptions.NoInlining)]
public static int VolatileRead(ref int address)
{
    int num = address;
    MemoryBarrier();
    return num;
}
The ignored part of the question revolved around why [MethodImpl(MethodImplOptions.NoInlining)] is used. The answer is that function calls provide their own ordering guarantees. The compiler can't make assumptions about the side effects of a function call and therefore can't re-order around the call. This makes complete sense if you consider that the function may live in an external library, so its effect on memory would be unknown. If the function is inlined, however, the compiler knows exactly what operations will be performed and optimizations can occur. Since this function is designed to be used in any scenario, the NoInlining option is a safeguard against any possible inlining and subsequent re-ordering that may break the intention of the function. In fact, the ref indirection and the num = address copy might be optimized away entirely by the compiler once inlined:
int x = 5;
ref int address = ref x;
int num = address;
MemoryBarrier();
Removing the two statements above the MemoryBarrier wouldn't affect single-threaded results since x = 5 already satisfies num = 5 (the result of num = address).
You may also be wondering why ref is used in VolatileRead. I suspect the reason is to observe the latest possible value right up until the num = address copy is made. The copy itself is necessary to prevent changes to the value after the barrier but before the return, which could occur if another interleaving thread had a reference to the integer.
Final Comments
If you're interested in more technical aspects of processors and caches, I recommend checking out the links in this article and researching related topics like write-through/write-back caches and cache associativity. Very interesting stuff, but a little too technical for the purposes of this article. I also highly recommend CLR via C# by Jeffrey Richter if you want to learn more about the CLR. I may be biased here, but there are some really interesting topics in that book.
Next article we'll be going into higher-level synchronization objects such as lock, Semaphore, and Mutex, which are easier to use and reason about, along with an important concept known as atomicity. Thread and ThreadPool might also make their debut!
History
- 1/24/17: Initial release.
- 1/24/17: Updates to some wording and other minor changes.