Introduction
Requirements
The Problem
Related Topics
Getting Started - Trying to work with recursion
The recursion problem in detail, and the fix
The good stuff - Writing a generalized static_for implementation
The code
A test program
Code commentary
How can I choose between a compile-time loop, and a run-time loop?
Questions and comments
History
Introduction
I recently came across a problem that I found interesting. How can I easily loop at compile time?
I'd thought about writing a static_for
method a few times in the past to simplify some template
code, but until now there was no pressing need for it.
I recently found my reason to write this one. I needed to test some compile-time code, and with that came the requirement to loop through it. The built-in tools weren't quite enough, and so... here I am, and here is my static_for
!
By looping at compile time, you can interact with compile-time code (such as templates).
static_for
will also unroll loops.
Requirements
A compiler that supports C++11 or later. The download contains a workaround is provided for Visual Studio's current lack of constexpr
support by using enum
s, and more template
s.
The Problem
Looping at compile time. Looping at run-time is easy:
for (int i = 0; i < upper_bound; ++i)
{
do_stuff(i);
}
Since we're working at compile time, the solution isn't that simple. I'd like to do:
template<int n> void do_stuff()
{
}
for (int i = 0; i < 1000; i++)
{
do_stuff<i>();
}
But... this doesn't work. i
is a runtime counter, and do_stuff<int>()
requires a compile-time counter. Because I don't want to write:
do_stuff<0>
do_stuff<1>
do_stuff<2>
do_stuff<3>
do_stuff<4>
...
do_stuff<997>
do_stuff<998>
do_stuff<999>
And of course, that would still require me to hard-code the number of iterations. I need to loop to a number that is calculated through compile time expressions, and yet have the same run-time performance as hard coding above.
Today I'll walk you through a bit of my process in creating a generalized solution to this problem, as well as show you some of the potential pitfalls and my reasons for needing to create something of this complexity.
Or, you can just download the code, and play around with it :D
Does unrolling a loop increase performance?
Function signature
Overload resolution
Perfect Forwarding
Tag Dispatching
Tail Recursion
Template argument deduction
Template instantiation
Template meta-programming
Generic programming
n-ary/k-ary tree
Getting started - Trying to work with recursion
So let's say that I want to instantiate this template
with values from 0 to 99, and call the function:
#include <iostream>
template<int index> void do_stuff()
{
std::cout << index << std::endl;
}
This is the type of problem I'm looking to solve here, as it's not one that we can handle with a normal run-time loop. The initial naive approach to solving this problem will involve recursion. Maybe this is good?
template<int max_index, int index = 0> void stuff_helper()
{
if (index <= max_index)
{
do_stuff<index>();
stuff_helper<max_index, index + 1>();
}
}
int main()
{
stuff_helper<100>();
return 0;
}
Depending on how familiar you are with working with template
s in C++, you probably either think this looks good, or you're already familiar with why this won't work. Or maybe you think this looks good and know why it won't work.
On the surface, it could look like the if
statement would be responsible for terminating the recursion, like how this would work with a "normal" run-time based recursion algorithm. But that's the problem. What works at runtime doesn't work at compile time.
This is an infinite loop, and only stops because compilers limit themselves to a certain recursion depth. In clang, I get an error fatal error: recursive template instantiation exceeded maximum depth of 256
. You can expect a similar error with your compiler of choice.
Fortunately, there's a reasonable way to fix this.
The recursion problem in detail, and the fix
So the problem with the first attempt is that template
s don't always work the way you might intuit how they work. I've found it a bit frustrating to learn myself, and this seems to be a common difficulty. However, once the quirks are understood and accepted, they don't have to be a problem any longer.
Your trusty C++ compiler will attempt to instantiate the template
regardless of "normal" logic that is run-time based.
Which means that when we do this:
template<int max_index, int index = 0> void stuff_helper()
{
if (index <= max_index)
{
do_stuff<index>();
The if
logic doesn't kick in when we want it to. The if
has nothing to do with compile-time decisions, and your lovely compiler is obligated to attempt to compile
stuff_helper<max_index, index + 1>();
... even when you're at or past your max_index
!
So what we need is something a little different. What we need here is a way for logic to branch at compile time rather than at run-time.
How about using specialization?
#include <type_traits>
constexpr int terminate_index{ 100 };
template<int index = 0> void stuff_helper();
template<> void stuff_helper<terminate_index>()
{
}
template<int index> void stuff_helper<index>()
{
do_stuff<index>();
stuff_helper<index + 1>();
}
int main()
{
stuff_helper();
return 0;
}
Nope, that doesn't work. C++ doesn't allow partial specialization of functions. Thankfully, there's a workaround to simulate partial specialization in this case: we can use tags. Tags in this context are empty types that will separate one function signature from another. So let's use std::integral_constant
, and try this:
#include <type_traits>
constexpr int terminate_index{ 100 };
void stuff_helper(std::integral_constant<int, terminate_index>)
{
}
template<int index = 0>
void stuff_helper(std::integral_constant<int, index> = std::integral_constant<int, 0>())
{
do_stuff<index>();
stuff_helper(std::integral_constant<int, index + 1>());
}
int main()
{
stuff_helper();
return 0;
}
Success! But how does it work?
This relies on the compiler's method for matching a function call to the proper function and makes use of template
parameter deduction and tag dispatching.
The first time we call stuff_helper
, only the second (template
d) function matches the call, so that one is chosen.
Recursive calls up until terminate_index
are also pretty straight-forward. We pass the std::integral_constant
, and the template
parameter is deduced by the compiler to match the int
inside of the integral_constant
.
std::integral_constant
works here because every index creates a new type, and a new function signature. In essence std::integral_constant<int, 0>
and std::integral_constant<int, 1>
are completely different types even though they share an underlying template
.
The last call is the special one though, as that's where there are two valid function signatures to choose from. When the inevitable call to stuff_helper(std::integral_constant<int, max_index>());
is made, the compiler must choose the proper one. But which one is that?
The compiler will choose a function without a template
before one that does have a template
. In effect, since the compiler can find a function signature that works without having to instantiate a template, it chooses that one without looking at the template
d one. This process is known as overload resolution under the heading "best viable function".
Oh, and don't worry about the runtime performance cost of passing the unused std::integral_constant
around. Unnamed types are optimized out in release mode on all compilers I'm aware of.
But, this code has its limits. What happens when we change our code to loop a few more times?
constexpr int max_index{ 1000 };
Kablooie! We hit our template
recursion error again. Or, at least I do on my compiler which has a maximum recursion depth of 256.
There's an easy way, and a fun way to work around this.
The easy way is to tell your compiler to increase its maximum template
depth past 256 to something more appropriate to your specific situation.
The fun way is to work within the limit of a maximum template
depth of 256, and allow ourselves to loop an arbitrarily large number of times. It's like solving a tricky puzzle. Maybe you don't like puzzles though. *shrug* As an added bonus, the fun way provides a generalized solution that is useful in other places.
Also to note: I'm writing library code where I don't have control over the user's compiler settings. As such, it's very important for me to be able to solve this situation in a way that will "just work" on a modern C++ compiler without modifying compiler flags.
The good stuff - Writing a generalized compile-time static_for
method
Here's the basic interface I've created:
template<size_t count, typename functor_template, typename... functor_types>
inline void static_for(functor_types&&... functor_args);
template<size_t start, size_t end, typename functor_template, typename... functor_types>
inline void static_for(functor_types&&... functor_args);
The static_for
method takes either a count
or a start
and end
point, a typename
that contains a templated
function called func
that will be instantiated by static_for
, and a list of function parameters to pass on with C++11's perfect forwarding (which in turn are automatically deduced!). Oh, and it also takes an optional argument called sequence_width
which is used to internally. More on that one below!
Though the declaration looks complicated, using it is simple. As an example:
struct do_stuff
{
template<int index> static inline void func()
{
std::cout << index << std::endl;
}
};
int main()
{
static_for<5000, do_stuff>();
return 0;
}
Looks good right? So now we have our interface. Now it's just a matter of writing the code to implement it...
Unfortunately, we need to wrap func()
in a helper struct, as there is no way to pass the name of a templated function directly. This is another short coming of the C++ language in regards to what we can do at compile time.
The code
My solution to loop at compile time without triggering template
depth errors is to break up the problem into smaller pieces.
Basically, the initial naive approach with recursion goes in a straight line. No matter whether you want to loop 20 times, or 20,000, the basic approach just keeps going until it's done.
What I've done differently is to break that up into smaller pieces. The template
parameter sequence_width
can be used to set how big the smaller pieces will be. This structure is known as an n-ary tree, where n
is equal to sequence_width
. I originally wrote this with a binary tree, and I moved to an n-ary tree as a compile time optimization.
Although the code below is suitable for large loops, you will almost certainly run into other limits in your compiler, or a hard limit based on how much time will be required to compile your program.
In other words: this isn't a magic bullet, and although this will allow you larger compile time loops than without this, there are still other things you will be contending with. Also: compiling template
s are notoriously slow.
I'm going to share with you the header file right here, as well as some comments about how this thing works. The full working code is provided below. A test function and project files are included with the download.
The public interface:
template<size_t count, typename functor, size_t sequence_width = 70,
typename... functor_types>
inline void static_for(functor_types&&... functor_args)
{
static_for_impl<0, count-1, functor, sequence_width, functor_types...>::
loop(std::forward<functor_types>(functor_args)...);
}
template<size_t start, size_t end, typename functor, size_t sequence_width = 70,
typename... functor_types>
inline void static_for(functor_types&&... functor_args)
{
static_for_impl<start, end, functor, sequence_width, functor_types...>::
loop(std::forward<functor_types>(functor_args)...);
}
The internals with comments:
template<size_t for_start, size_t for_end, typename functor, size_t sequence_width,
typename... functor_types>
struct static_for_impl
{
static inline void loop(functor_types&&... functor_args)
{
using sequence = point<for_start, for_end>;
next<sequence>
(std::integral_constant<bool, sequence::is_end_point_>(),
std::forward<functor_types>(functor_args)...);
}
private:
template<size_t pt_start, size_t pt_end> struct point
{
static constexpr size_t start_ { pt_start };
static constexpr size_t end_ { pt_end };
static constexpr size_t count_ { end_ - start_ + 1 };
static constexpr bool is_end_point_ { count_ <= sequence_width };
static constexpr size_t sequence_count()
{
return
points_in_sequence(sequence_width) > sequence_width
?
sequence_width
:
points_in_sequence(sequence_width);
}
private:
static constexpr size_t child_start(size_t index)
{
return
index == 0
?
pt_start
:
child_end(index - 1) + 1;
}
static constexpr size_t child_end(size_t index)
{
return
index == sequence_count() - 1
?
pt_end
:
pt_start + points_in_sequence(sequence_count()) * (index + 1) -
(index < count_
?
1
:
0);
}
static constexpr size_t points_in_sequence(size_t max)
{
return count_ / max + (
(count_ % max) > 0
?
1
:
0);
}
public:
template<size_t index> using child_point = point<child_start(index), child_end(index)>;
};
template<size_t flat_start, size_t flat_end, class flat_functor> struct flat_for
{
static inline void flat_loop(functor_types&&... functor_args)
{
flat_next(std::integral_constant<size_t, flat_start>(),
std::forward<functor_types>(functor_args)...);
}
private:
static inline void flat_next
(std::integral_constant<size_t, flat_end + 1>, functor_types&&...)
{
}
template<size_t index>
static inline void flat_next
(std::integral_constant<size_t, index>, functor_types&&... functor_args)
{
flat_functor::template func<index>(std::forward<functor_types>(functor_args)...);
flat_next(std::integral_constant<size_t, index + 1>(),
std::forward<functor_types>(functor_args)...);
}
};
template<typename sequence> struct flat_sequence
{
template<size_t index> static inline void func(functor_types&&... functor_args)
{
using pt = typename sequence::template child_point<index>;
next<pt>
(std::integral_constant<bool, pt::is_end_point_>(),
std::forward<functor_types>(functor_args)...);
}
};
template<typename sequence> static inline void next
(std::true_type, functor_types&&... functor_args)
{
flat_for<sequence::start_, sequence::end_, functor>::
flat_loop(std::forward<functor_types>(functor_args)...);
}
template<typename sequence> static inline void next
(std::false_type, functor_types&&... functor_args)
{
flat_for<0, sequence::sequence_count() - 1, flat_sequence<sequence>>::
flat_loop(std::forward<functor_types>(functor_args)...);
}
};
The current version of Visual Studio doesn't include full support for constexpr
. The source code download has a version of static_for
that doesn't require it.
A test program
struct do_stuff
{
template<size_t index> static inline void func(bool* b)
{
assert(!b[index]);
b[index] = true;
std::cout << index << std::endl;
}
};
int main()
{
#define for_count 2000
bool b[for_count + 100];
memset(b, false, for_count);
memset(&b[for_count], true, 100);
static_for<for_count, do_stuff>(b);
for (size_t i{ 0 }; i < for_count; ++i)
assert(b[i]);
return 0;
}
The static_for_impl
is initialized with the information you pass it, and then it creates a point
. Consider a point
to be like a node of a tree.
If the point
contains sequence_width
items or less, then the point is marked as one that doesn't require splitting (the value is_end_point_
will equal true
for that point
).
If the point
contains more than sequence_width
items, then its start_
and end_
values will be split into child_point
s.
The child_point
s are checked to determine if they need further splitting.
If so, then each child gets a number of point
items (types) equal to the parent's count_
divided by sequence_width.
Else, if no further splitting is required, then the number of children point
s are minimized to the parent's count_
divided by sequence_width
.
Each point
is fed through the internal flat_for
template. In there, the point
is used to either call more instances of flat_for
with child point
s, or it is used to instantiate the functor
that was provided with the initial static_for
call at the proper index
.
Overall, the code works by breaking apart the task of the for loop into point
s which are sequence_width
or smaller. The point
s are internally structured as an n-ary tree. It then feeds each point
through flat_for
, which is used to call your code at each index that you requested when you call static_for
.
How can I choose between a compile-time and run-time loop?
There are some cases where you don't have a choice.
I wrote this because I had a need to loop with other compile-time code, and in that case I could not use a run-time loop.
There are often times where there is logic related to the loop that cannot be known until runtime. An example would be a loop that requires external data to know when it terminates.
Much of the time, the question comes down to an engineering problem, as it requires weighing the pros and cons of both choices.
Compile-time/unrolled loops will take longer to compiler. They might allow for slightly higher runtime performance. This could be a benefit that is worthy of the slower compile time in some cases.
In many cases, the performance difference of saving a handful of nanoseconds will be irrelevant. And yet, in others saving a handful of nanoseconds will be of vital importance.
At the very least, it is good to be aware of the tradeoffs, and to know why you are picking one choice over another.
Questions and comments
I would appreciate any questions and comments you have about this. Let me know what you like, what you don't like, and how I can make this better.
Let me know what kind of template meta-programming topics you're interested in, and what you'd like to see me take on with a follow-up article.
History
v 1.0 - December 28th, 2014 - Initial release
v 1.1 - December 29th, 2014 - Additional info and cleanup
v 1.2 - December 30th, 2014 - Code cleanup, additional information, added test program and more "related topics"