The benchmark is hosted at Github.
I discovered a real case of premature micro-optimization when you don't measure it. You know it is pretty bad when you read premature and micro in the same sentence. On page 44 of Optimizing C++ ebook by Agner Fog, it is written ...
In some cases it is possible to replace a poorly predictable branch by a table lookup. For example:
float a;
bool b;
a = b ? 1.5f : 2.6f;
The ?: operator here is a branch. If it is poorly predictable then replace it by a table lookup:
float a;
bool b = 0;
const float lookup[2] = {2.6f, 1.5f};
a = lookup[b];
If a bool is used as an array index then it is important to make sure it is initialized or comes from a reliable source so that it can have no other values than 0 or 1. In some cases the compiler can automatically replace a branch by a conditional move.
I was trying to implement this lookup table optimization (to replace ternary operator) on a floating-point value which the code is compiled with G++ and ran on Linux. I also ran the integer benchmark and on other compilers such like Visual C++ 2019 and Clang and also the Visual C# 7 to see their differences. Below is the C++ benchmark code. The lookup array is declared as a static local variable inside the function.
int64_t IntTernaryOp(bool value)
{
return value ? 3 : 4;
}
int64_t IntArrayOp(bool value)
{
static const int64_t arr[2] = { 3, 4 };
return arr[value];
}
double FloatTernaryOp(bool value)
{
return value ? 3.0f : 4.0f;
}
double FloatArrayOp(bool value)
{
static const double arr[2] = { 3.0f, 4.0f };
return arr[value];
}
int main()
{
const size_t MAX_LOOP = 1000000000;
int64_t sum = 0;
double sum_f = 0;
timer stopwatch;
stopwatch.start("IntTernaryOp");
sum = 0;
for (size_t k = 0; k < MAX_LOOP; ++k)
{
sum += IntTernaryOp(k % 2);
}
stopwatch.stop(sum);
stopwatch.start("IntArrayOp");
sum = 0;
for (size_t k = 0; k < MAX_LOOP; ++k)
{
sum += IntArrayOp(k % 2);
}
stopwatch.stop(sum);
stopwatch.start("FloatTernaryOp");
sum_f = 0;
for (size_t k = 0; k < MAX_LOOP; ++k)
{
sum_f += FloatTernaryOp(k % 2);
}
stopwatch.stop(sum_f);
stopwatch.start("FloatArrayOp");
sum_f = 0;
for (size_t k = 0; k < MAX_LOOP; ++k)
{
sum_f += FloatArrayOp(k % 2);
}
stopwatch.stop(sum_f);
return 0;
}
In Visual C# code, static local variable is not permitted so the lookup array is declared as a static member variable inside the class. C# does not allow casting integer bool to integer (to be used as array index), so I use an integer instead.
static Int64[] arrInt64 = new Int64[] { 3, 4 };
static Double[] arrDouble = new Double[] { 3.0, 4.0 };
static Int64 IntTernaryOp(Int64 value)
{
return (value==1) ? 3 : 4;
}
static Int64 IntArrayOp(Int64 value)
{
return arrInt64[value];
}
static double FloatTernaryOp(Int64 value)
{
return (value == 1) ? 3.0f : 4.0f;
}
static double FloatArrayOp(Int64 value)
{
return arrDouble[value];
}
static void Main(string[] args)
{
const int MAX_LOOP = 1000000000;
Int64 sum = 0;
double sum_f = 0.0;
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
sum = 0;
for (Int64 k = 0; k < MAX_LOOP; ++k)
{
sum += IntTernaryOp(k % 2);
}
stopWatch.Stop();
DisplayElapseTime("IntTernaryOp", stopWatch.Elapsed, sum);
stopWatch = new Stopwatch();
stopWatch.Start();
sum = 0;
for (Int64 k = 0; k < MAX_LOOP; ++k)
{
sum += IntArrayOp(k % 2);
}
stopWatch.Stop();
DisplayElapseTime("IntArrayOp", stopWatch.Elapsed, sum);
stopWatch = new Stopwatch();
stopWatch.Start();
sum_f = 0;
for (Int64 k = 0; k < MAX_LOOP; ++k)
{
sum_f += FloatTernaryOp(k % 2);
}
stopWatch.Stop();
DisplayElapseTime("FloatTernaryOp", stopWatch.Elapsed, sum_f);
stopWatch = new Stopwatch();
stopWatch.Start();
sum_f = 0;
for (Int64 k = 0; k < MAX_LOOP; ++k)
{
sum_f += FloatArrayOp(k % 2);
}
stopWatch.Stop();
DisplayElapseTime("FloatArrayOp", stopWatch.Elapsed, sum_f);
}
These are the benchmark results on a computer with Intel i7-8700 at 3.2Ghz with 16GB RAM.
VC++ /Ox results
IntTernaryOp: 562ms, result:3500000000
IntArrayOp: 523ms, result:3500000000
FloatTernaryOp: 3972ms, result:3.5e+09
FloatArrayOp: 1030ms, result:3.5e+09
G++ 7.4.0 -O3 results
IntTernaryOp: 306ms, result:3500000000
IntArrayOp: 519ms, result:3500000000
FloatTernaryOp: 1030ms, result:3.5e+09
FloatArrayOp: 1030ms, result:3.5e+09
Clang++ 6.0.0 -O# results
IntTernaryOp: 585ms, result:3500000000
IntArrayOp: 523ms, result:3500000000
FloatTernaryOp: 1030ms, result:3.5e+09
FloatArrayOp: 1030ms, result:3.5e+09
C# 7 Release Mode, .NET Framework 4.7.2
IntTernaryOp: 1311ms, result:3500000000
IntArrayOp: 1038ms, result:3500000000
FloatTernaryOp: 2448ms, result:3500000000
FloatArrayOp: 1036ms, result:3500000000
For the integer benchmark, the lookup table got worse performance than lookup table with G++, while in Visual C++/C# and Clang++, there is improvement. As it turns out in the floating benchmark I am looking for (in G++), improvement is only seen in Visual C++/C# while G++ and Clang++ retained the same timing for both ternary and lookup table code. I did check the assembly code at the Godbolt Compiler Explorer, none of the ternary operator is converted to use conditional move as suggested by Agner Fog's book so I guess the short branch is still with the CPU cache, therefore it makes little difference in G++ and Clang++ floating-point test. The morale of the story is never take a book's word at its value and always measure to confirm your hypothesis. In my case, I forgo this lookup micro-optimization.
History
- 31st January, 2020: Initial version