This tip explores two areas where C++ outperforms C, sorting and summing, with benchmarks that show the speed advantage.
This short write-up is a tip that showcases two tricks C++ developers use to gain performance over C. As such, it is not eligible to participate in the CodeProject monthly article competition.
Introduction
When it comes to performance, C is king, but there are two areas in which C++ is supreme over C: sorting and summing. When the benchmark results were revealed, the C programmers were baffled and intrigued, and finally won over to the C++ side. In this tip, we examine why these operations are faster in C++ and benchmark whether this still holds today.
C++ sort VS C qsort
In C, qsort is the de facto way to do QuickSort. qsort requires a comparator, supplied as a function pointer, to compare two elements.
// Comparator for qsort: sorts in descending order
// (negative return means the first argument sorts earlier).
int comparator(const void* p, const void* q)
{
const int64_t l = *(const int64_t*)p;
const int64_t r = *(const int64_t*)q;
if (l > r) return -1;
else if (l < r) return +1;
else return 0;
}

const int size = 1000000;
int64_t arr[size];
qsort((void*)arr, size, sizeof(arr[0]), comparator);
In C++, std::sort is used for sorting. Before C++11, a function object had to be provided as the comparator; since C++11, a lambda suffices. Early implementations of std::sort used plain QuickSort; after the complexity requirement was tightened, implementations switched to IntroSort because, although QuickSort is the fastest sorting algorithm on average, its Achilles heel is O(N²) worst-case time, which a naive pivot choice can hit, for example on an almost-sorted array.
cppreference.com notes for std::sort:
Before LWG713, the complexity requirement allowed sort() to be implemented using only Quicksort, which may need O(N²) comparisons in the worst case.
Introsort can handle all cases with O(N·log(N)) comparisons (without incurring additional overhead in the average case), and thus is usually used for implementing sort().
libc++ has not implemented the corrected time complexity requirement until LLVM 14
This is Wikipedia's opening introduction to IntroSort:
Introsort or introspective sort is a hybrid sorting algorithm that provides both fast average performance and (asymptotically) optimal worst-case performance. It begins with quicksort, it switches to heapsort when the recursion depth exceeds a level based on (the logarithm of) the number of elements being sorted and it switches to insertion sort when the number of elements is below some threshold. This combines the good parts of the three algorithms, with practical performance comparable to quicksort on typical data sets and worst-case O(n log n) runtime due to the heap sort.
std::vector<int64_t> vec(1000000);
// Sort in descending order, matching the C comparator above.
std::sort(vec.begin(), vec.end(), [](int64_t a, int64_t b) {
return a > b;
});
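For completeness, here is a minimal sketch of the pre-C++11 function-object form the text mentions; the `DescendingCompare` name and the `sort_descending` helper are hypothetical, for illustration only:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Pre-C++11 equivalent of the lambda: a function object whose
// operator() performs the descending comparison.
struct DescendingCompare
{
    bool operator()(int64_t a, int64_t b) const { return a > b; }
};

// Sorts a vector in descending order using the functor comparator.
inline void sort_descending(std::vector<int64_t>& vec)
{
    std::sort(vec.begin(), vec.end(), DescendingCompare());
}
```

Because `operator()` is an inlineable member function rather than a call through a function pointer, the functor enjoys the same optimization opportunity as the lambda.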
As you can see from the above, the std::sort version is more concise. The reason std::sort outperforms qsort lies in the comparison function: qsort's comparator is called through a function pointer, which prevents inlining, and inside it two casts from void* plus dereferences are needed to reach the element values, whereas the std::sort comparator can be inlined and receives elements of the correct type directly. C++17 parallel std::sort is added to the benchmark. Below are the benchmark results on an Intel i7 8700 (6 physical cores, 12 logical cores) with 16GB RAM. The GCC and Clang benchmarks are run on WSL Ubuntu 22.04. It seems that parallel std::sort is not implemented on GCC and Clang, as it shows the same result as serial std::sort.
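A minimal sketch of the parallel overload mentioned above, assuming the general shape rather than the article's exact benchmark code; it requires C++17's `<execution>` header, and on GCC/Clang typically also linking against TBB (`-ltbb`):

```cpp
#include <algorithm>
#include <cstdint>
#include <execution>
#include <vector>

// Parallel sort: the only change from the serial call is passing an
// execution policy (std::execution::par) as the first argument.
inline void parallel_sort_descending(std::vector<int64_t>& vec)
{
    std::sort(std::execution::par, vec.begin(), vec.end(),
              [](int64_t a, int64_t b) { return a > b; });
}
```

An implementation is allowed to fall back to a serial sort, which is consistent with the GCC and Clang timings below matching the serial run.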
VC++ 2022

C qsort Benchmark: 13103ms
C++ stable sort Benchmark: 5702ms
C++ sort Benchmark: 6300ms
C++ par sort Benchmark: 1451ms
gcc version 11.3.0
C qsort Benchmark: 8994ms
C++ stable sort Benchmark: 6389ms
C++ sort Benchmark: 5095ms
C++ par sort Benchmark: 5114ms
clang version 14.0.0
C qsort Benchmark: 8954ms
C++ stable sort Benchmark: 6206ms
C++ sort Benchmark: 5232ms
C++ par sort Benchmark: 5227ms
The GCC and Clang command to build SortBenchmark.cpp is as follows. Remember to copy timer.h to your Ubuntu folder. It makes no difference to the benchmark whether -std=c++17 or an earlier standard is used, so IntroSort does not appear to be the deciding factor.
g++ SortBenchmark.cpp -O3 -std=c++17
clang++ SortBenchmark.cpp -O3 -std=c++17
C++ accumulate VS C Summing For Loop
Below is an example of the summing for loop.
const int size = 1000000;
uint64_t arr[size];
uint64_t sum = 0;
for (int i = 0; i < size; ++i)
{
sum += arr[i];
}
arr[i] is equivalent to *(arr + i) in C, which compiles to the assembly below with GCC 13.2 at O1 optimization. The CPU register rdx holds sum and rax is the loop counter i. Note the multiplication by 8 in the addressing mode: 8 is the size of uint64_t. You can examine the generated assembly code at Godbolt.
.L9:
add rdx, QWORD PTR [rcx+rax*8]
add rax, 1
cmp rax, rsi
jb .L9
The C++ example is illustrated below using std::accumulate.
std::vector<uint64_t> vec(1000000);
constexpr uint64_t Zero = 0;
uint64_t sum = std::accumulate(vec.cbegin(), vec.cend(), Zero);
The code std::accumulate generates is roughly as follows.
uint64_t sum = 0;
auto it = vec.cbegin();
auto end = vec.cend();
for (; it != end; ++it)
{
sum += *it;
}
++it and *it compile to the assembly below: a plain addition of 8, without the multiplication in the addressing mode. The reason I use O1 optimization rather than O0 is that at O0 some functions, such as vec[i] and vector::size, are not inlined but called, so no assembly is shown at the call site for my examination. O1 is the sweet spot for manual analysis. There are higher optimization levels such as O2 and O3, but at those levels all my assumptions are off: the optimizer recognizes that these three loops are equivalent and generates the same optimized assembly for all of them.
.L19:
add rdx, QWORD PTR [rax]
add rax, 8
cmp rax, rcx
jne .L19
The C++11 range-based for loop generates the same code as the std::accumulate and iterator versions above.
for (auto n : vec)
{
sum += n;
}
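One detail worth calling out from the accumulate example above: the type of the initial value determines the accumulator type, which is why the example passes a uint64_t Zero constant rather than a plain 0. A minimal sketch (the `sum64` helper name is mine, for illustration):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// std::accumulate sums into the type of its initial value. Passing a
// plain int literal 0 would accumulate large uint64_t values into an
// int and risk overflow; uint64_t{0} keeps the running sum 64-bit.
inline uint64_t sum64(const std::vector<uint64_t>& vec)
{
    return std::accumulate(vec.cbegin(), vec.cend(), uint64_t{0});
}
```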
The benchmark results for the different compilers are as follows. The parallel version of std::accumulate() is called std::reduce() in C++17. The printed sum can be ignored: it is there so that the for loop is not optimized away. For Visual C++ and Clang, accumulate and the range-based for loop are faster. For GCC, all four results hover around 300ms. From the GCC and Clang results, we can clearly see that std::reduce() is not parallelized in the Standard Library I am using.
VC++ 2022
Increment For Loop: 439ms, sum:500000500000
Range For Loop: 277ms, sum:500000500000
Accumulator: 239ms, sum:500000500000
Parallel Reduce: 111ms, sum:500000500000
gcc version 11.3.0
Increment For Loop: 328ms, sum:500000500000
Range For Loop: 295ms, sum:500000500000
Accumulator: 273ms, sum:500000500000
Parallel Reduce: 313ms, sum:500000500000
clang version 14.0.0
Increment For Loop: 339ms, sum:500000500000
Range For Loop: 266ms, sum:500000500000
Accumulator: 265ms, sum:500000500000
Parallel Reduce: 265ms, sum:500000500000
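A minimal sketch of the std::reduce call used for the parallel benchmark, assuming the general shape rather than the article's exact code; like parallel std::sort, it requires `<execution>` and, on GCC/Clang, typically linking with `-ltbb`:

```cpp
#include <cstdint>
#include <execution>
#include <numeric>
#include <vector>

// std::reduce with a parallel policy may reorder and regroup the
// additions, which is harmless for integers but can produce slightly
// different results for floating point.
inline uint64_t parallel_sum64(const std::vector<uint64_t>& vec)
{
    return std::reduce(std::execution::par, vec.cbegin(), vec.cend(),
                       uint64_t{0});
}
```

An implementation may run this serially, which matches the GCC and Clang timings above being level with the serial accumulate.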
The GCC and Clang command to build SummingBenchmark.cpp is as follows:
g++ SummingBenchmark.cpp -O3 -std=c++17
clang++ SummingBenchmark.cpp -O3 -std=c++17
History
- 19th February 2024: Added stable_sort to the SortBenchmark because qsort should rightfully be compared to stable_sort: the former's comparator can report equality directly, while the latter checks for equality using the expression !(x<y) && !(y<x).
- 17th February 2024: Corrected the initialization for SortBenchmark. Added C++17's parallel std::sort to the SortBenchmark. Added C++17's parallel std::accumulate to the SummingBenchmark. Note: the parallel version of std::accumulate is called std::reduce. The C++ committee decided to rename it because, with floating point, std::reduce can return slightly different results due to its parallel order of operations. If the order of accumulation matters to you, stick to std::accumulate. The benchmark is run on an Intel i7 8700 with 6 physical cores and 12 logical cores.
- 16th February 2024: Updated messages in the benchmark code download
- 7th February 2024: Added assembly code from GCC and removed the confusing C++ code which is not valid C++ pointer arithmetic for the summing section
- 4th February 2024: Corrected the comparison function of qsort according to Fly Gheorghe
- 31st January 2024: First release