C++ Multithreaded Count Benchmark on Windows and Ubuntu using Visual C++, G++ and Clang
The example code is hosted on Github.
The benchmark code for this article comes from the VC++: 30 Multithreading Mistakes on Windows article, where it is scattered across different sections. This article gathers those benchmarks to give them the attention they deserve in one place. C++17 parallel std::for_each
is used to invoke the threads on all logical cores, unless otherwise stated. The counting benchmark is run on an Intel i7 870 with 12 logical cores on Windows 11. The Ubuntu 24.04 benchmark results are shown in the last section.
The first example uses a C++11 mutex
to synchronize the count across multiple threads.
void inc_mutex()
{
timer watch;
watch.start("inc mutex");
int count = 0;
std::mutex mut;
std::for_each(std::execution::par, Vec.begin(), Vec.end(),
[&mut, &count](size_t num) -> void
{
if (is_valid((int)(num)))
{
std::lock_guard<std::mutex> guard(mut);
++count;
}
});
watch.stop();
std::cout << "total count:" << count << std::endl;
}
Vec
is initialized randomly up to a VEC_SIZE
of 100000000
. rand()
injects indeterminacy so that the compiler cannot compute the result at compile time and optimize the code away.
const size_t VEC_SIZE = 100000000U;
void init_vector()
{
srand((unsigned)time(NULL));
for (size_t i = 0; i < VEC_SIZE; ++i)
Vec.push_back(rand() % 20);
}
bool is_valid(int n)
{
return (n % 2 == 0);
}
The mutex
benchmark result is shown below.
Windows Benchmark Results
=========================
inc mutex: 3136ms
Since we are only dealing with an integer count
variable, we can forgo the mutex
and use a C++11 atomic
integer for count
to improve performance.
void inc_atomic()
{
timer watch;
watch.start("inc atomic");
std::atomic<int> count = 0;
std::for_each(std::execution::par, Vec.begin(), Vec.end(),
[&count](size_t num) -> void
{
if (is_valid((int)(num)))
++count;
});
watch.stop();
std::cout << "total count:" << count << std::endl;
}
The benchmark results for mutex
and atomic
integer are shown below. The atomic version is over 66% faster.
Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
If you are using a pre-C++11 compiler on Windows, you can use the Windows InterlockedIncrement
API to achieve the same performance gain.
#ifdef _WIN32
void inc_winInterlocked()
{
timer watch;
watch.start("inc win Interlocked");
LONG count = 0;
std::for_each(std::execution::par, Vec.begin(), Vec.end(),
[&count](size_t num) -> void
{
if (is_valid((int)(num)))
::InterlockedIncrement(&count);
});
watch.stop();
std::cout << "total count:" << count << std::endl;
}
#endif
The benchmark result for mutex
, atomic
integer and InterlockedIncrement
is shown below.
Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
For this lockless benchmark, we use a hand-written loop::parallel_for
instead of the C++17 parallel std::for_each
. The function declaration of loop::parallel_for
is below. When numOfThreads
is zero, no thread is spawned and no work is done. When numOfThreads
is one, all work is done in the current thread. When numOfThreads
is -1
, the number of threads equals the number of logical cores. numOfThreads
can also be set to any other number; a rule of thumb is to keep it below the number of logical cores.
template<typename Func>
void parallel_for(int numOfThreads, size_t beginIndex, size_t endIndex, Func f)
The signature of Func
is below. threadID
and index
are what enable us to do away with synchronization. The value of threadID
ranges from 0
to numOfThreads
- 1
. The value of index
ranges from beginIndex
to endIndex - 1
.
void Func(int threadID, size_t index);
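For readers curious how such a function might work, below is a minimal sketch of a possible loop::parallel_for implementation. It is not the author's actual code (which lives in parallel_for_each.h); it only illustrates the documented contract: zero threads means no work, one thread means run inline, -1 means use all logical cores, and each worker receives its threadID plus a contiguous slice of indices.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

namespace loop
{
// Hypothetical sketch: partition [beginIndex, endIndex) into contiguous
// chunks, one per thread; the last thread also takes the remainder.
template<typename Func>
void parallel_for(int numOfThreads, size_t beginIndex, size_t endIndex, Func f)
{
    if (numOfThreads == 0)
        return;                                  // no thread spawned, no work done
    if (numOfThreads < 0)
        numOfThreads = (int)std::thread::hardware_concurrency();
    if (numOfThreads == 1)
    {
        for (size_t i = beginIndex; i < endIndex; ++i)
            f(0, i);                             // all work in the current thread
        return;
    }
    const size_t total = endIndex - beginIndex;
    const size_t chunk = total / (size_t)numOfThreads;
    std::vector<std::thread> workers;
    for (int t = 0; t < numOfThreads; ++t)
    {
        size_t start = beginIndex + chunk * (size_t)t;
        size_t end = (t == numOfThreads - 1) ? endIndex : start + chunk;
        workers.emplace_back([t, start, end, &f]()
        {
            for (size_t i = start; i < end; ++i)
                f(t, i);                         // each index visited exactly once
        });
    }
    for (auto& w : workers)
        w.join();                                // wait for all chunks to finish
}
}
```

Because each thread owns a disjoint index range and a distinct threadID, callers can write to per-thread slots without any locking, which is exactly what the lockless benchmark below exploits.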
We create a CountStruct
to store the count
for each thread. The buf
member pads the struct to prevent false sharing. After all the threads finish, the total
count is accumulated from all the CountStruct
instances.
struct CountStruct
{
CountStruct() : Count(0)
{
memset((void*)buf, 0, sizeof(buf));
}
int Count;
char buf[128];
};
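As an aside (not part of the original code), since C++17 the padding size need not be hard-coded to 128: std::hardware_destructive_interference_size from &lt;new&gt; reports the cache-line granularity to pad against. A sketch of that alternative, with a fallback for standard libraries that do not yet provide it:

```cpp
#include <cstddef>
#include <new>

// Use the standard constant when the library provides it; otherwise fall
// back to 64 bytes, the common x86 cache-line size.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64;
#endif

// alignas pads and aligns the struct to a full cache line, so adjacent
// elements in a std::vector<PaddedCount> never share a line.
struct alignas(kCacheLine) PaddedCount
{
    int Count = 0;
};
```

Note that GCC emits a warning (-Winterference-size) when this constant is used in an ABI-affecting way, which is one reason a fixed char buffer, as in CountStruct above, remains a pragmatic choice.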
void inc_no_lock()
{
int threads = std::thread::hardware_concurrency();
std::vector<CountStruct> count(threads);
timer watch;
watch.start("inc no lock");
loop::parallel_for(threads, (size_t)(0), (size_t)(VEC_SIZE),
[&count](int threadIndex, size_t index) -> void
{
if (is_valid(Vec[index]))
++(count[threadIndex].Count);
});
int total = 0;
for (auto& st : count)
total += st.Count;
watch.stop();
std::cout << "total count:" << total << std::endl;
}
The benchmark for lockless is below.
Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
inc no lock: 73ms
In this section, we use a vanilla for
loop to benchmark single-threaded performance. Note: this for
loop does not have the thread start-up overhead of the previous benchmarks.
void inc_single_thread()
{
timer watch;
watch.start("inc single_thread");
int count = 0;
for (size_t i = 0; i < Vec.size(); ++i)
{
if (is_valid(Vec[i]))
++count;
}
watch.stop();
std::cout << "total count:" << count << std::endl;
}
The benchmark result for single threaded is shown.
Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
inc no lock: 73ms
inc single_thread: 86ms
In this section, we still use C++17 parallel std::for_each
, but we forgo its convenience: we spawn one task per logical core, and inside each task we compute that task's share of the work and loop over it manually, storing the count in a local temp_count
. At the end of each task, we acquire a mutex
before accumulating temp_count
into count
. Note: the number of mutex
acquisitions is exactly the number of logical cores.
void inc_less_mutex_lock()
{
timer watch;
watch.start("inc less mutex lock");
const size_t threads = std::thread::hardware_concurrency();
std::vector<size_t> vecIndex;
for (size_t i = 0; i < threads; ++i)
vecIndex.push_back(i);
int count = 0;
std::mutex mut;
std::for_each(std::execution::par, vecIndex.begin(), vecIndex.end(),
[&mut, &count, threads](size_t index) -> void
{
size_t chunk = Vec.size() / threads;
size_t start = chunk * index;
size_t end = start + chunk;
if (index == (threads - 1))
{
size_t remainder = Vec.size() % threads;
end += remainder;
}
int temp_count = 0;
for (size_t i = start; i < end; ++i)
{
if (is_valid((int)(Vec[i])))
{
++temp_count;
}
}
{
std::lock_guard<std::mutex> guard(mut);
count += temp_count;
}
});
watch.stop();
std::cout << "total count:" << count << std::endl;
}
Very surprisingly, acquiring the mutex
once per logical core (12 times in my Intel i7 870 case) achieves the best result on Windows.
Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
inc no lock: 73ms
inc single_thread: 86ms
inc less mutex lock: 33ms
This section lists the benchmark results of G++ 13.2 and Clang 18.1.3 under Windows 11 WSL Ubuntu 24.04. The CPU is the same Intel i7 870 with 12 logical cores. Both compilers produce similar results. Taking no lock, taking the mutex
once per core, and running single-threaded all yield similar performance. Do note that for a normal workload, say counting prime numbers, the time taken by the work dwarfs the time spent writing count
. The advantage of the two multithreaded versions over the single-threaded one, despite similar counting performance, is that we can divide the workload among threads.
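To make the prime-counting remark concrete, here is a hypothetical sketch (not part of the benchmark suite) where the per-element work, trial-division primality testing, is heavy enough that the cost of even an atomic counter increment becomes negligible:

```cpp
#include <atomic>
#include <cstddef>

// Trial-division primality test: O(sqrt(n)) work per element, which far
// outweighs the cost of the single atomic increment recording the result.
bool is_prime(size_t n)
{
    if (n < 2)
        return false;
    for (size_t d = 2; d * d <= n; ++d)
        if (n % d == 0)
            return false;
    return true;
}

size_t count_primes_below(size_t limit)
{
    std::atomic<size_t> primes{0};
    // A real run would split [2, limit) across threads, e.g. with the
    // parallel_for above; shown single-threaded to keep the sketch short.
    for (size_t n = 2; n < limit; ++n)
        if (is_prime(n))
            ++primes;
    return primes.load();
}
```

With work this heavy per element, the choice between mutex, atomic, and per-thread counters barely moves the total time; the win from multithreading comes from dividing the primality tests themselves across cores.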
Ubuntu G++ Benchmark Results
============================
inc mutex: 719ms
inc atomic: 473ms
inc no lock: 51ms
inc single_thread: 48ms
inc less mutex lock: 54ms
Ubuntu Clang Benchmark Results
==============================
inc mutex: 745ms
inc atomic: 463ms
inc no lock: 54ms
inc single_thread: 52ms
inc less mutex lock: 58ms
Below are the commands to compile the code under G++ and Clang. Remember to copy parallel_for_each.h
and timer.h
to your Ubuntu folder. The Windows InterlockedIncrement
benchmark is excluded when compiling on non-Windows platforms.
g++ MultithreadedCount.cpp -O3 -std=c++17
clang++ MultithreadedCount.cpp -O3 -std=c++17
- 13th May, 2024: Initial version