This tip explores two areas where C++ outperforms C, sorting and summing, with benchmarks that show the speed advantage.
This short write-up is a tip that showcases two tricks C++ developers use to gain performance over C. As such, it is not eligible to participate in the CodeProject monthly article competition.
Introduction
When it comes to performance, C is king, but there are two areas in which C++ is supreme over C: sorting and summing. When the benchmark results were revealed, the C programmers were baffled and intrigued, and finally won over to the C++ side. In this tip, we examine why these operations are faster in C++ and benchmark whether this still holds today.
C++ sort VS C qsort
In C, qsort is the de facto way to do QuickSort. qsort requires a comparator, supplied as a function pointer, to compare two elements.
// Comparator for qsort: sorts in descending order
// (negative return means the first argument sorts earlier).
int comparator(const void* p, const void* q)
{
const int64_t l = *(const int64_t*)p;
const int64_t r = *(const int64_t*)q;
if (l > r) return -1;
else if (l < r) return +1;
else return 0;
}

const int size = 1000000;
int64_t arr[size];
qsort((void*)arr, size, sizeof(arr[0]), comparator);
In C++, std::sort is used for sorting. Before C++11, a function object had to be provided as the comparator; since C++11, a lambda suffices. Early implementations of std::sort used plain QuickSort; after the complexity requirement was tightened, implementations switched to IntroSort because, although QuickSort is the fastest sorting algorithm on average, its Achilles heel is O(N²) worst-case time, which a naive pivot choice can hit, for example on an almost-sorted array.
cppreference.com notes for std::sort:
Before LWG713, the complexity requirement allowed sort() to be implemented using only Quicksort, which may need O(N²) comparisons in the worst case.
Introsort can handle all cases with O(N·log(N)) comparisons (without incurring additional overhead in the average case), and thus is usually used for implementing sort().
libc++ has not implemented the corrected time complexity requirement until LLVM 14
This is Wikipedia's opening introduction to IntroSort:
Introsort or introspective sort is a hybrid sorting algorithm that provides both fast average performance and (asymptotically) optimal worst-case performance. It begins with quicksort, it switches to heapsort when the recursion depth exceeds a level based on (the logarithm of) the number of elements being sorted and it switches to insertion sort when the number of elements is below some threshold. This combines the good parts of the three algorithms, with practical performance comparable to quicksort on typical data sets and worst-case O(n log n) runtime due to the heap sort.
std::vector<int64_t> vec(1000000);
// Sort in descending order, matching the C comparator above.
std::sort(vec.begin(), vec.end(), [](int64_t a, int64_t b) {
return a > b;
});
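For completeness, here is a minimal sketch of the pre-C++11 function-object form the text mentions; the `DescendingCompare` name and the `sort_descending` helper are hypothetical, for illustration only:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Pre-C++11 equivalent of the lambda: a function object whose
// operator() performs the descending comparison.
struct DescendingCompare
{
    bool operator()(int64_t a, int64_t b) const { return a > b; }
};

// Sorts a vector in descending order using the functor comparator.
inline void sort_descending(std::vector<int64_t>& vec)
{
    std::sort(vec.begin(), vec.end(), DescendingCompare());
}
```

Because `operator()` is an inlineable member function rather than a call through a function pointer, the functor enjoys the same optimization opportunity as the lambda.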
As you can see from the above, the std::sort version is more concise. The reason std::sort outperforms qsort lies in the comparison function: qsort's comparator is called through a function pointer, which prevents inlining, and inside it two casts from void* plus dereferences are needed to reach the element values, whereas the std::sort comparator can be inlined and receives elements of the correct type directly. C++17 parallel std::sort is added to the benchmark. Below are the benchmark results on an Intel i7 8700 (6 physical cores, 12 logical cores) with 16GB RAM. The GCC and Clang benchmarks are run on WSL Ubuntu 22.04. It seems that parallel std::sort is not implemented on GCC and Clang, as it shows the same result as serial std::sort.
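A minimal sketch of the parallel overload mentioned above, assuming the general shape rather than the article's exact benchmark code; it requires C++17's `<execution>` header, and on GCC/Clang typically also linking against TBB (`-ltbb`):

```cpp
#include <algorithm>
#include <cstdint>
#include <execution>
#include <vector>

// Parallel sort: the only change from the serial call is passing an
// execution policy (std::execution::par) as the first argument.
inline void parallel_sort_descending(std::vector<int64_t>& vec)
{
    std::sort(std::execution::par, vec.begin(), vec.end(),
              [](int64_t a, int64_t b) { return a > b; });
}
```

An implementation is allowed to fall back to a serial sort, which is consistent with the GCC and Clang timings below matching the serial run.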
VC++ 2022

C qsort Benchmark: 13103ms
C++ stable sort Benchmark: 5702ms
C++ sort Benchmark: 6300ms
C++ par sort Benchmark: 1451ms
gcc version 11.3.0
C qsort Benchmark: 8994ms
C++ stable sort Benchmark: 6389ms
C++ sort Benchmark: 5095ms
C++ par sort Benchmark: 5114ms
clang version 14.0.0
C qsort Benchmark: 8954ms
C++ stable sort Benchmark: 6206ms
C++ sort Benchmark: 5232ms
C++ par sort Benchmark: 5227ms
The GCC and Clang command to build SortBenchmark.cpp is as follows. Remember to copy timer.h to your Ubuntu folder. It makes no difference to the benchmark whether -std=c++17 or an earlier standard is used, so IntroSort does not appear to be the deciding factor.
g++ SortBenchmark.cpp -O3 -std=c++17
clang++ SortBenchmark.cpp -O3 -std=c++17
C++ accumulate VS C Summing For Loop
Below is an example of the summing for loop.
const int size = 1000000;
uint64_t arr[size];
uint64_t sum = 0;
for (int i = 0; i < size; ++i)
{
sum += arr[i];
}
arr[i] is equivalent to *(arr + i) in C, which compiles to the assembly below with GCC 13.2 at O1 optimization. The CPU register rdx holds sum and rax is the loop counter i. Note the multiplication by 8 in the addressing mode: 8 is the size of uint64_t. You can examine the generated assembly code at Godbolt.
.L9:
add rdx, QWORD PTR [rcx+rax*8]
add rax, 1
cmp rax, rsi
jb .L9
The C++ example is illustrated below using std::accumulate.
std::vector<uint64_t> vec(1000000);
constexpr uint64_t Zero = 0;
uint64_t sum = std::accumulate(vec.cbegin(), vec.cend(), Zero);
The code std::accumulate generates is roughly as follows.
uint64_t sum = 0;
auto it = vec.cbegin();
auto end = vec.cend();
for (; it != end; ++it)
{
sum += *it;
}
++it and *it compile to the assembly below: a plain addition of 8, without the multiplication in the addressing mode. The reason I use O1 optimization rather than O0 is that at O0 some functions, such as vec[i] and vector::size, are not inlined but called, so no assembly is shown at the call site for my examination. O1 is the sweet spot for manual analysis. There are higher optimization levels such as O2 and O3, but at those levels all my assumptions are off: the optimizer recognizes that these three loops are equivalent and generates the same optimized assembly for all of them.
.L19:
add rdx, QWORD PTR [rax]
add rax, 8
cmp rax, rcx
jne .L19
The C++11 range-based for loop generates the same code as the std::accumulate and iterator versions above.
for (auto n : vec)
{
sum += n;
}
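One detail worth calling out from the accumulate example above: the type of the initial value determines the accumulator type, which is why the example passes a uint64_t Zero constant rather than a plain 0. A minimal sketch (the `sum64` helper name is mine, for illustration):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// std::accumulate sums into the type of its initial value. Passing a
// plain int literal 0 would accumulate large uint64_t values into an
// int and risk overflow; uint64_t{0} keeps the running sum 64-bit.
inline uint64_t sum64(const std::vector<uint64_t>& vec)
{
    return std::accumulate(vec.cbegin(), vec.cend(), uint64_t{0});
}
```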
The benchmark results for the different compilers are as follows. The parallel version of std::accumulate() is called std::reduce() in C++17. The printed sum can be ignored: it is there so that the for loop is not optimized away. For Visual C++ and Clang, accumulate and the range-based for loop are faster. For GCC, all four results hover around 300ms. From the GCC and Clang results, we can clearly see that std::reduce() is not parallelized in the Standard Library I am using.
VC++ 2022
Increment For Loop: 439ms, sum:500000500000
Range For Loop: 277ms, sum:500000500000
Accumulator: 239ms, sum:500000500000
Parallel Reduce: 111ms, sum:500000500000
gcc version 11.3.0
Increment For Loop: 328ms, sum:500000500000
Range For Loop: 295ms, sum:500000500000
Accumulator: 273ms, sum:500000500000
Parallel Reduce: 313ms, sum:500000500000
clang version 14.0.0
Increment For Loop: 339ms, sum:500000500000
Range For Loop: 266ms, sum:500000500000
Accumulator: 265ms, sum:500000500000
Parallel Reduce: 265ms, sum:500000500000
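A minimal sketch of the std::reduce call used for the parallel benchmark, assuming the general shape rather than the article's exact code; like parallel std::sort, it requires `<execution>` and, on GCC/Clang, typically linking with `-ltbb`:

```cpp
#include <cstdint>
#include <execution>
#include <numeric>
#include <vector>

// std::reduce with a parallel policy may reorder and regroup the
// additions, which is harmless for integers but can produce slightly
// different results for floating point.
inline uint64_t parallel_sum64(const std::vector<uint64_t>& vec)
{
    return std::reduce(std::execution::par, vec.cbegin(), vec.cend(),
                       uint64_t{0});
}
```

An implementation may run this serially, which matches the GCC and Clang timings above being level with the serial accumulate.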
The GCC and Clang command to build SummingBenchmark.cpp is as follows:
g++ SummingBenchmark.cpp -O3 -std=c++17
clang++ SummingBenchmark.cpp -O3 -std=c++17
History
- 19th February 2024: Added stable_sort to the SortBenchmark because qsort should rightfully be compared to stable_sort: the former's comparator can report equality directly, while the latter checks for equality using the expression !(x<y) && !(y<x).
- 17th February 2024: Corrected the initialization for SortBenchmark. Added C++17's parallel std::sort to the SortBenchmark. Added C++17's parallel std::accumulate to the SummingBenchmark. Note: the parallel version of std::accumulate is called std::reduce. The C++ committee decided to rename it because, with floating point, std::reduce can return slightly different results due to its parallel order of operations. If the order of accumulation matters to you, stick to std::accumulate. The benchmark is run on an Intel i7 8700 with 6 physical cores and 12 logical cores.
- 16th February 2024: Updated messages in the benchmark code download
- 7th February 2024: Added assembly code from GCC and removed the confusing C++ code which is not valid C++ pointer arithmetic for the summing section
- 4th February 2024: Corrected the comparison function of qsort according to Fly Gheorghe
- 31st January 2024: First release