Introduction
The initial motivation was to measure the overhead of the different for-loop types in C++.
Code
Copy and paste the code below into the Godbolt Compiler Explorer to see the generated assembly. Note: in the actual benchmark the array is initialized; the code below is simplified so that you only read the relevant assembly. Including other code just adds noise that you have to wade through to find your loop's assembly.
Before using Godbolt, I was poring over the assembly generated by Visual C++, which was difficult to read because the optimized assembly for each C++ line was interleaved with the assembly for other lines. My guess is the compiler does this to take advantage of CPU pipelining, so that instructions which do not depend on the result of a previous operation can execute independently. Godbolt is by far the easiest way to inspect generated assembly. Trust me.
#include <cstddef>
#include <cstdint>
#include <algorithm>
#include <numeric>
#include <iterator>

const size_t LEN = 1000000;

// Note: vec is left uninitialized in this simplified listing;
// the real benchmark fills it before summing.
uint64_t func1() // Increment For Loop
{
    uint64_t vec[LEN];
    uint64_t sum = 0;
    for (size_t i = 0; i < LEN; ++i)
    {
        sum += vec[i];
    }
    return sum;
}

uint64_t func2() // Range For Loop
{
    uint64_t vec[LEN];
    uint64_t sum = 0;
    for (auto n : vec)
    {
        sum += n;
    }
    return sum;
}

uint64_t func3() // Iterator For Loop
{
    uint64_t vec[LEN];
    uint64_t sum = 0;
    for (auto it = std::cbegin(vec); it != std::cend(vec); ++it)
    {
        sum += *it;
    }
    return sum;
}

uint64_t func4() // Accumulator
{
    uint64_t vec[LEN];
    uint64_t sum = 0;
    // uint64_t initial value, so accumulate does not sum in a narrower int type
    const uint64_t Zero = 0;
    sum = std::accumulate(std::cbegin(vec), std::cend(vec), Zero);
    return sum;
}
Test Machine: Intel i7 6700 at 3.4 GHz
Visual C++ 2017 (15.4 Update) Result
Please ignore the sum values themselves; I print the resulting sum only to prevent the compiler from optimizing the loop away. Visual C++ vectorized the code with SSE2.
Increment For Loop: 599ms, sum:500000500000
Range For Loop: 446ms, sum:500000500000
Iterator For Loop: 558ms, sum:500000500000
Accumulator: 437ms, sum:500000500000
Investigation shows that the multiplication by 8 for array index subscripting could be the culprit behind the slowdown in the Increment For Loop.
sum += vec[i];
movdqu xmm0, XMMWORD PTR vec$[rsp+rax*8] <== multiplication by 8
As for the Range For Loop, the address is incremented by 16 each iteration (8 + 8, because the loop is unrolled); no multiplication is needed to calculate the address. The Accumulator code uses the same tactic. Earlier in the decade, C++ programmers were baffled as to why std::accumulate was faster than a plain for loop. Now we know the reason.
As for the Iterator For Loop's poorer result, my guess is iterator overhead.
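To see the difference for yourself in Godbolt, compare the indexed loop against a hand-written pointer-increment loop. This is a hypothetical variant for illustration, not part of the original benchmark; with full optimization both versions usually compile down to the same pointer-stepping code.

```cpp
#include <cstddef>
#include <cstdint>

const size_t LEN = 1000000;

// Indexed loop: the address of vec[i] is computed as vec + i*8,
// which is where the multiplication by 8 comes from.
uint64_t sum_indexed(const uint64_t* vec)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < LEN; ++i)
        sum += vec[i];
    return sum;
}

// Pointer loop: the address is simply advanced by 8 bytes per step,
// the same strategy the compiler uses for the Range For Loop.
uint64_t sum_pointer(const uint64_t* vec)
{
    uint64_t sum = 0;
    for (const uint64_t* p = vec; p != vec + LEN; ++p)
        sum += *p;
    return sum;
}
```

Both functions return the same result; the interesting part is the addressing mode in the generated assembly.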
Cygwin clang++ 3.9.1 Result
clang++ generated similar code for all four loops, hence the similar timings. clang++ vectorized the loops with SSE2. To compile the code with clang++, use the command below.
clang++ ForLoopBenchmark.cpp -O2 -std=c++14
Increment For Loop: 392ms, sum:500000500000
Range For Loop: 406ms, sum:500000500000
Iterator For Loop: 381ms, sum:500000500000
Accumulator: 391ms, sum:500000500000
Cygwin g++ 5.4 Result
Like clang++, g++ also generated similar code for all four loops, so the timings are close, but sadly the loops are not vectorized at -O2. Specifying -O3 vectorizes all the loops, and the result is then on par with clang++'s -O2. To compile the code with g++, use the command below.
g++ ForLoopBenchmark.cpp -O2 -std=c++14
Increment For Loop: 558ms, sum:500000500000
Range For Loop: 552ms, sum:500000500000
Iterator For Loop: 542ms, sum:500000500000
Accumulator: 544ms, sum:500000500000
"Is this information even useful?"
One example is the FIX Protocol (used in financial markets), which calculates a simple checksum of each message by summing its bytes.
If you find Godbolt useful, do consider becoming a patron of Matt Godbolt on Patreon to show your appreciation and help him defray monthly server costs. Of course, donating is not obligatory. I have been his patron since December.
Benchmark source code is hosted here. Thanks for reading!