C++ Multithreaded Count Benchmark on Windows and Ubuntu using Visual C++, G++ and Clang
The example code is hosted on Github.
The benchmark code for this article comes from the VC++: 30 Multithreading Mistakes on Windows article, where it is scattered across different sections. This article gathers those benchmarks to give them the attention they deserve in one place. C++17 parallel std::for_each
is used to invoke the threads on all logical cores, unless otherwise stated. The counting benchmark is run on an Intel i7 870 with 12 logical cores on Windows 11. The Ubuntu 24.04 benchmark results are shown in the last section.
The first example uses a C++11 mutex
to synchronize the count across multiple threads.
void inc_mutex()
{
timer watch;
watch.start("inc mutex");
int count = 0;
std::mutex mut;
std::for_each(std::execution::par, Vec.begin(), Vec.end(),
[&mut, &count](size_t num) -> void
{
if (is_valid((int)(num)))
{
std::lock_guard<std::mutex> guard(mut);
++count;
}
});
watch.stop();
std::cout << "total count:" << count << std::endl;
}
Vec
is initialized randomly up to a VEC_SIZE
of 100000000
. rand()
injects indeterminacy so that the compiler cannot compute the result at compile time and optimize the code away.
const size_t VEC_SIZE = 100000000U;
void init_vector()
{
srand((unsigned)time(NULL));
for (size_t i = 0; i < VEC_SIZE; ++i)
Vec.push_back(rand() % 20);
}
bool is_valid(int n)
{
return (n % 2 == 0);
}
The mutex
benchmark result is shown below.
Windows Benchmark Results
=========================
inc mutex: 3136ms
Since we are only dealing with an integer count
variable, we can forgo the mutex
and use a C++11 atomic
integer for count
to improve performance.
void inc_atomic()
{
timer watch;
watch.start("inc atomic");
std::atomic<int> count = 0;
std::for_each(std::execution::par, Vec.begin(), Vec.end(),
[&count](size_t num) -> void
{
if (is_valid((int)(num)))
++count;
});
watch.stop();
std::cout << "total count:" << count << std::endl;
}
The benchmark results for mutex
and atomic
integer are shown below. The atomic version is over 66% faster.
Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
If you are using a pre-C++11 compiler on Windows, you can use the Windows InterlockedIncrement
API to achieve the same performance gain.
#ifdef _WIN32
void inc_winInterlocked()
{
timer watch;
watch.start("inc win Interlocked");
LONG count = 0;
std::for_each(std::execution::par, Vec.begin(), Vec.end(),
[&count](size_t num) -> void
{
if (is_valid((int)(num)))
::InterlockedIncrement(&count);
});
watch.stop();
std::cout << "total count:" << count << std::endl;
}
#endif
The benchmark result for mutex
, atomic
integer and InterlockedIncrement
is shown below.
Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
For this lockless benchmark, we use a hand-written loop::parallel_for
instead of the C++17 parallel std::for_each
. The function declaration of loop::parallel_for
is below. When numOfThreads
is zero, no thread is spawned and no work is done. When numOfThreads
is one, all work is done in the current thread. When numOfThreads
is -1
, the number of threads equals the number of logical cores. numOfThreads
can also be set to any other number; a rule of thumb is to keep it below the number of logical cores.
template<typename Func>
void parallel_for(int numOfThreads, size_t beginIndex, size_t endIndex, Func f)
The signature of Func
is below. threadID
and index
are what enable us to do away with synchronization. The value of threadID
ranges from 0
to numOfThreads
- 1
. The value of index
ranges from beginIndex
to endIndex - 1
.
void Func(int threadID, size_t index);
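For readers curious how such a function might work, below is a minimal sketch of a possible loop::parallel_for implementation. It is not the author's actual code (which lives in parallel_for_each.h); it only illustrates the documented contract: zero threads means no work, one thread means run inline, -1 means use all logical cores, and each worker receives its threadID plus a contiguous slice of indices.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

namespace loop
{
// Hypothetical sketch: partition [beginIndex, endIndex) into contiguous
// chunks, one per thread; the last thread also takes the remainder.
template<typename Func>
void parallel_for(int numOfThreads, size_t beginIndex, size_t endIndex, Func f)
{
    if (numOfThreads == 0)
        return;                                  // no thread spawned, no work done
    if (numOfThreads < 0)
        numOfThreads = (int)std::thread::hardware_concurrency();
    if (numOfThreads == 1)
    {
        for (size_t i = beginIndex; i < endIndex; ++i)
            f(0, i);                             // all work in the current thread
        return;
    }
    const size_t total = endIndex - beginIndex;
    const size_t chunk = total / (size_t)numOfThreads;
    std::vector<std::thread> workers;
    for (int t = 0; t < numOfThreads; ++t)
    {
        size_t start = beginIndex + chunk * (size_t)t;
        size_t end = (t == numOfThreads - 1) ? endIndex : start + chunk;
        workers.emplace_back([t, start, end, &f]()
        {
            for (size_t i = start; i < end; ++i)
                f(t, i);                         // each index visited exactly once
        });
    }
    for (auto& w : workers)
        w.join();                                // wait for all chunks to finish
}
}
```

Because each thread owns a disjoint index range and a distinct threadID, callers can write to per-thread slots without any locking, which is exactly what the lockless benchmark below exploits.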
We create a CountStruct
to store the count
for each thread. The buf
member pads the struct to prevent false sharing. After all the threads finish, the total
count is accumulated from all the CountStruct
instances.
struct CountStruct
{
CountStruct() : Count(0)
{
memset((void*)buf, 0, sizeof(buf));
}
int Count;
char buf[128];
};
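As an aside (not part of the original code), since C++17 the padding size need not be hard-coded to 128: std::hardware_destructive_interference_size from &lt;new&gt; reports the cache-line granularity to pad against. A sketch of that alternative, with a fallback for standard libraries that do not yet provide it:

```cpp
#include <cstddef>
#include <new>

// Use the standard constant when the library provides it; otherwise fall
// back to 64 bytes, the common x86 cache-line size.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64;
#endif

// alignas pads and aligns the struct to a full cache line, so adjacent
// elements in a std::vector<PaddedCount> never share a line.
struct alignas(kCacheLine) PaddedCount
{
    int Count = 0;
};
```

Note that GCC emits a warning (-Winterference-size) when this constant is used in an ABI-affecting way, which is one reason a fixed char buffer, as in CountStruct above, remains a pragmatic choice.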
void inc_no_lock()
{
int threads = std::thread::hardware_concurrency();
std::vector<CountStruct> count(threads);
timer watch;
watch.start("inc no lock");
loop::parallel_for(threads, (size_t)(0), (size_t)(VEC_SIZE),
[&count](int threadIndex, size_t index) -> void
{
if (is_valid(Vec[index]))
++(count[threadIndex].Count);
});
int total = 0;
for (auto& st : count)
total += st.Count;
watch.stop();
std::cout << "total count:" << total << std::endl;
}
The benchmark for lockless is below.
Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
inc no lock: 73ms
In this section, we use a vanilla for
loop to benchmark single-threaded performance. Note: this for
loop does not have the thread start-up overhead of the previous benchmarks.
void inc_single_thread()
{
timer watch;
watch.start("inc single_thread");
int count = 0;
for (size_t i = 0; i < Vec.size(); ++i)
{
if (is_valid(Vec[i]))
++count;
}
watch.stop();
std::cout << "total count:" << count << std::endl;
}
The benchmark result for single threaded is shown.
Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
inc no lock: 73ms
inc single_thread: 86ms
In this section, we still use C++17 parallel std::for_each
, but we forgo its convenience: we spawn one task per logical core, and inside each task we compute that task's share of the work and loop over it manually, storing the count in a local temp_count
. At the end of each task, we acquire a mutex
before accumulating temp_count
into count
. Note: the number of mutex
acquisitions is exactly the number of logical cores.
void inc_less_mutex_lock()
{
timer watch;
watch.start("inc less mutex lock");
const size_t threads = std::thread::hardware_concurrency();
std::vector<size_t> vecIndex;
for (size_t i = 0; i < threads; ++i)
vecIndex.push_back(i);
int count = 0;
std::mutex mut;
std::for_each(std::execution::par, vecIndex.begin(), vecIndex.end(),
[&mut, &count, threads](size_t index) -> void
{
size_t chunk = Vec.size() / threads;
size_t start = chunk * index;
size_t end = start + chunk;
if (index == (threads - 1))
{
size_t remainder = Vec.size() % threads;
end += remainder;
}
int temp_count = 0;
for (size_t i = start; i < end; ++i)
{
if (is_valid((int)(Vec[i])))
{
++temp_count;
}
}
{
std::lock_guard<std::mutex> guard(mut);
count += temp_count;
}
});
watch.stop();
std::cout << "total count:" << count << std::endl;
}
Very surprisingly, acquiring the mutex
once per logical core (12 times in my Intel i7 870 case) achieves the best result on Windows.
Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
inc no lock: 73ms
inc single_thread: 86ms
inc less mutex lock: 33ms
This section lists the benchmark results of G++ 13.2 and Clang 18.1.3 under Windows 11 WSL Ubuntu 24.04. The CPU is the same Intel i7 870 with 12 logical cores. Both compilers produce similar results. Taking no lock, taking the mutex
once per core, and running single-threaded all yield similar performance. Do note that for a normal workload, say counting prime numbers, the time taken by the work dwarfs the time spent writing count
. The advantage of the two multithreaded versions over the single-threaded one, despite similar counting performance, is that we can divide the workload among threads.
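To make the prime-counting remark concrete, here is a hypothetical sketch (not part of the benchmark suite) where the per-element work, trial-division primality testing, is heavy enough that the cost of even an atomic counter increment becomes negligible:

```cpp
#include <atomic>
#include <cstddef>

// Trial-division primality test: O(sqrt(n)) work per element, which far
// outweighs the cost of the single atomic increment recording the result.
bool is_prime(size_t n)
{
    if (n < 2)
        return false;
    for (size_t d = 2; d * d <= n; ++d)
        if (n % d == 0)
            return false;
    return true;
}

size_t count_primes_below(size_t limit)
{
    std::atomic<size_t> primes{0};
    // A real run would split [2, limit) across threads, e.g. with the
    // parallel_for above; shown single-threaded to keep the sketch short.
    for (size_t n = 2; n < limit; ++n)
        if (is_prime(n))
            ++primes;
    return primes.load();
}
```

With work this heavy per element, the choice between mutex, atomic, and per-thread counters barely moves the total time; the win from multithreading comes from dividing the primality tests themselves across cores.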
Ubuntu G++ Benchmark Results
============================
inc mutex: 719ms
inc atomic: 473ms
inc no lock: 51ms
inc single_thread: 48ms
inc less mutex lock: 54ms
Ubuntu Clang Benchmark Results
==============================
inc mutex: 745ms
inc atomic: 463ms
inc no lock: 54ms
inc single_thread: 52ms
inc less mutex lock: 58ms
Below are the commands to compile the code under G++ and Clang. Remember to copy parallel_for_each.h
and timer.h
to your Ubuntu folder. The Windows InterlockedIncrement
benchmark is excluded when compiling on non-Windows platforms.
g++ MultithreadedCount.cpp -O3 -std=c++17
clang++ MultithreadedCount.cpp -O3 -std=c++17
- 13th May, 2024: Initial version