We will jump straight to the code. This innocent-looking little program has a major issue (when compiled as a release build with optimizations, on my Mac using GCC, Apple's Clang, and LLVM, as well as on Windows using Visual Studio 2017, and run on a multicore machine). Can you spot the problem?
#include <chrono>    // for 100ms
#include <iostream>
#include <thread>
using namespace std;

int main(int argc, char** argv)
{
    bool flag = false;
    thread t1([&]() {
        this_thread::sleep_for(100ms);
        cout << "t1 started" << endl;
        flag = true;
        cout << "t1 signals and exits" << endl;
    });
    thread t2([&]() {
        cout << "t2 started" << endl;
        while (flag == false) ;
        cout << "t2 got signaled and exits" << endl;
    });
    t1.join();
    t2.join();
    return 0;
}
That’s right! It will never terminate! It will hang forever! The while loop in thread t2 will never break. But why? Thread t1 sets flag to true after all. Yes, but it does so too late (notice the 100ms sleep). By that point, thread t2 has already cached flag (in a register or its L1 cache) and will never see the updated value. If you think that making flag volatile will help, you’re wrong. It may appear to work on your particular compiler and machine, but that is no guarantee: volatile forces the compiler to re-read the variable, yet it provides no atomicity or ordering guarantees between threads. Now what?
This was one of the hardest lessons in C++ and computer science for me. Before continuing to the fix section, I highly recommend you read about the following: memory barriers, the C++ memory model (see also C++ Memory Model at Modernest C++), and memory ordering. I’ll see you in a couple of days.
The Fix
The simplest fix is to wrap every access to flag in a mutex lock/unlock, or to make flag an atomic<bool> (both solutions insert the appropriate memory barriers). But that’s not always an option for other, more complex data types…
We need to make sure that t2 eventually sees the write that t1 performs. For this, we need to force synchronization between the caches of the different CPU cores. We can do it in three ways:
- By inserting memory barriers in the right places
- By inserting loads and stores of an atomic variable using release/acquire semantics
- By inserting loads and stores of a dependent atomic variable using release/consume semantics
Below is the corrected version of our example; define each macro (for instance on the compiler command line) to engage a different fix:
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
using namespace std;

#if defined ATOMIC_FENCE
    #define FENCE_ACQUIRE atomic_thread_fence(memory_order_acquire)
    #define FENCE_RELEASE atomic_thread_fence(memory_order_release)
#elif defined ATOMIC_RELEASE
    atomic_bool f{false};
    #define FENCE_ACQUIRE f.load(memory_order_acquire)
    #define FENCE_RELEASE f.store(true, memory_order_release)
#elif defined ATOMIC_CONSUME
    atomic_bool f{false};
    #define FENCE_ACQUIRE f.load(memory_order_consume)
    #define FENCE_RELEASE f.store(flag, memory_order_release)
#else
    #define FENCE_ACQUIRE
    #define FENCE_RELEASE
#endif

int main(int argc, char** argv)
{
    bool flag = false;
    thread t1([&]() {
        this_thread::sleep_for(100ms);
        cout << "t1 started" << endl;
        flag = true;
        FENCE_RELEASE;
        cout << "t1 signals and exits" << endl;
    });
    thread t2([&]() {
        cout << "t2 started" << endl;
        while (flag == false) FENCE_ACQUIRE;
        cout << "t2 got signaled and exits" << endl;
    });
    t1.join();
    t2.join();
    return 0;
}
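To try each fix, build the corrected program with the corresponding macro defined on the command line. A sketch, assuming the source is saved as fences.cpp (a hypothetical file name) and a GCC-style compiler:

```shell
# Each -D selects one branch of the #if/#elif chain above.
g++ -std=c++17 -O2 -pthread -DATOMIC_FENCE   fences.cpp -o fences && ./fences
g++ -std=c++17 -O2 -pthread -DATOMIC_RELEASE fences.cpp -o fences && ./fences
g++ -std=c++17 -O2 -pthread -DATOMIC_CONSUME fences.cpp -o fences && ./fences
# With no -D at all, you get the original (broken) busy-wait back.
```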