Introduction
This article compares the performance of several techniques for implementing a map function over arrays and introduces a novel high-performance map implementation that uses expression tree rewriting.
A map operation constructs a new sequence or list from an existing list, where each element in the new list is the result of applying a function to the corresponding element in the original list. The most well-known example of a map operation for C# programmers is the IEnumerable.Select() extension method provided by LINQ in the .NET base class library.
The map operation is interesting to study for optimization because it is very common and also embarrassingly parallel. A high-performance map operation is the cornerstone of an efficient collection library.
Observations
The primary observations of the experiment are:
- The simplest non-parallelized implementation far outperformed the parallelized implementations when arrays were small (< 10,000 elements).
- Using Parallel.ForEach() with a range partitioner consistently outperforms AsParallel().AsOrdered().Select().ToArray().
- The performance of Parallel.ForEach() with a range partitioner can be improved further by passing the lambda as an expression tree and rewriting it.
Methodology
These tests were run on an Intel i7 quad-core machine running at 2.3 GHz.
To compare the different implementations of map, I created a test suite covering six different implementations. Each implementation was tested with nine different function arguments and with input array lengths ranging from 100 to 10,000,000 integers.
For each array size, the test functions are run repeatedly, alternating between the different implementation techniques, until the total elapsed time reaches one second.
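To make this concrete, the sketch below shows one way such a timing loop could be written. The names and structure are my own illustration and are not taken from the article's test suite: implementations are run round-robin over the same input, accumulating the time spent in each, until the overall one-second budget is exhausted.
// Illustrative sketch only; requires System, System.Collections.Generic,
// System.Diagnostics and System.Linq.
static Dictionary<string, double> TimeImplementations(
    int[] input, Dictionary<string, Func<int[], int[]>> implementations)
{
    // Accumulated milliseconds per implementation name.
    var totals = implementations.Keys.ToDictionary(k => k, k => 0.0);
    var overall = Stopwatch.StartNew();
    var sw = new Stopwatch();
    // Alternate between implementations until the total elapsed time reaches 1 second.
    while (overall.ElapsedMilliseconds < 1000)
    {
        foreach (var pair in implementations)
        {
            sw.Restart();
            pair.Value(input);
            sw.Stop();
            totals[pair.Key] += sw.Elapsed.TotalMilliseconds;
        }
    }
    return totals;
}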
Techniques for Implementing a Map Operation on Arrays
In my experiments I studied the following implementations for a map operation on arrays.
Simple Sequential Map
Here is a simple implementation of map using a for loop.
U[] Map<T, U>(T[] xs, Func<T, U> f) {
var r = new U[xs.Length];
for (int i=0; i < xs.Length; ++i) r[i] = f(xs[i]);
return r;
}
For small arrays (up to about 1,000 elements) this was consistently the most efficient implementation.
Implementing Map using IEnumerable.Select
The simplest implementation of map uses the IEnumerable.Select() operation, followed by a call to ToArray():
U[] Map<T, U>(T[] xs, Func<T, U> f) {
return xs.Select(f).ToArray();
}
In my measurements this was most often the slowest approach to implementing map, especially for arrays of 10,000 elements and larger. This makes sense: it is less efficient than the hand-written version because it goes through an enumerator, which makes it hard for the compiler and runtime to take advantage of the underlying array type when optimizing.
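For illustration only (this is an approximation, not the actual framework implementation), an iterator-based Select behaves roughly like the sketch below: every element is produced through IEnumerator<T> interface calls, and the subsequent ToArray() has to grow a buffer because the result length is not known up front.
// Rough approximation of an iterator-based Select, for illustration.
static IEnumerable<U> SelectSketch<T, U>(IEnumerable<T> xs, Func<T, U> f)
{
    // Each item flows through the compiler-generated iterator state machine,
    // i.e. a MoveNext()/Current call per element on the consumer side.
    foreach (var x in xs)
        yield return f(x);
}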
Parallel Map operation using Parallel.For
Transforming the hand-written map operation into a parallel operation can be done easily using Parallel.For() as follows:
static U[] SimpleParallelMap<T, U>(T[] xs, Func<T, U> f) {
var r = new U[xs.Length];
Parallel.For(0, xs.Length, i => r[i] = f(xs[i]));
return r;
}
Somewhat surprisingly, this implementation never significantly outperformed the non-parallel version, even for large arrays, and in some cases it even underperformed it.
Parallel Map using IEnumerable.AsParallel()
An alternative parallel implementation of map uses the IEnumerable.AsParallel() extension method:
U[] Map<T, U>(T[] xs, Func<T, U> f) {
return xs.AsParallel().AsOrdered().Select(f).ToArray();
}
The performance of this technique is comparable to the Parallel.For() approach, sometimes better and sometimes worse. For some reason, performance was consistently and significantly worse than the Parallel.For() technique for arrays of precisely 1,000,000 items (sometimes taking 2x as long or more).
Parallel Map using Range Partitioner
Using Parallel.ForEach() with a partitioner can decrease the burden on the task scheduler by creating a smaller number of tasks, each of which operates on a sub-range of the array.
static U[] PartitionedParallelMap<T, U>(T[] xs, Func<T, U> f)
{
var r = new U[xs.Length];
    Parallel.ForEach(Partitioner.Create(0, xs.Length), range => {
        for (int i = range.Item1; i < range.Item2; ++i)
            r[i] = f(xs[i]);
    });
return r;
}
The performance of this approach is the best so far for large arrays, but like most of the parallel approaches it only starts to outperform the simple hand-written sequential map once the array size reaches about 10,000 elements.
Parallel Map using Range Partitioner and Expression Tree Rewriting
The performance of the PartitionedParallelMap() function can be improved if we observe that the time spent in the function passed to ForEach() is dominated by the cost of invoking the lambda argument f for each element. Generally speaking, invoking a lambda function has a significant performance cost.
If the lambda is instead passed as an expression tree, its body can be inlined into a dynamically generated function, which is then compiled and passed to the Parallel.ForEach() call.
Here is the general approach in pseudo-code:
static U[] OptimizedPartitionedParallelMap<T, U>(T[] xs, Expression<Func<T, U>> fexpr)
{
    var r = new U[xs.Length];
    // Rewrite the lambda into an action that fills r over a given index range.
    Expression<Action<T[], U[], int, int>> fexpr2 = RewriteExpressionTree(fexpr);
    Action<T[], U[], int, int> f = fexpr2.Compile();
    Parallel.ForEach(Partitioner.Create(0, xs.Length), range =>
        f(xs, r, range.Item1, range.Item2));
    return r;
}
The rewriting code creates a new expression tree and uses the ExpressionVisitor class to inline the body of the function argument (fexpr) within it, thus eliminating the expensive lambda invocation inside the for-loop.
public class FunctionInliner<T, U> : ExpressionVisitor
{
    Expression argument;
    ParameterExpression parameter;

    public FunctionInliner(Expression<Func<T, U>> function, Expression argument) {
        parameter = function.Parameters[0];
        this.argument = argument;
    }

    public Expression Modify(Expression expression) {
        return Visit(expression);
    }

    protected override Expression VisitParameter(ParameterExpression node) {
        // Replace references to the lambda's parameter with the supplied argument expression.
        return node == parameter ? argument : node;
    }
}

public static class FunctionInliner
{
    public static Expression InlineFunctionCall<T, U>(Expression<Func<T, U>> function, Expression arg) {
        var fi = new FunctionInliner<T, U>(function, arg);
        return fi.Modify(function.Body);
    }
}
This map implementation only performs reasonably if the result of Expression.Compile() is memoized (i.e. cached), which is done as follows:
private static Dictionary<Expression, object> memoizedFastMaps = new Dictionary<Expression, object>();
public static U[] FastMap<T, U>(this T[] self, Expression<Func<T, U>> fexpr) {
Action<T[], U[], int, int> f;
lock (memoizedFastMaps)
{
if (!memoizedFastMaps.ContainsKey(fexpr))
memoizedFastMaps[fexpr] = ComputeFastMap(fexpr);
f = (Action<T[], U[], int, int>)memoizedFastMaps[fexpr];
}
var r = new U[self.Length];
Parallel.ForEach(Partitioner.Create(0, self.Length),
range => f(self, r, range.Item1, range.Item2));
return r;
}
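The ComputeFastMap() helper referenced above is not listed in the article. Purely as an illustration of how the rewriting could be assembled with the Expression API and the FunctionInliner from the previous section, here is a possible sketch; the names and exact shape are assumptions, not the author's implementation:
// Sketch only; requires System and System.Linq.Expressions.
static Action<T[], U[], int, int> ComputeFastMap<T, U>(Expression<Func<T, U>> fexpr)
{
    var xs = Expression.Parameter(typeof(T[]), "xs");
    var r = Expression.Parameter(typeof(U[]), "r");
    var begin = Expression.Parameter(typeof(int), "begin");
    var end = Expression.Parameter(typeof(int), "end");
    var i = Expression.Variable(typeof(int), "i");

    // Inline the body of fexpr, substituting xs[i] for its parameter.
    var body = FunctionInliner.InlineFunctionCall(fexpr, Expression.ArrayIndex(xs, i));

    // Build the equivalent of: for (i = begin; i < end; ++i) r[i] = <inlined body>;
    var breakLabel = Expression.Label("LoopBreak");
    var loop = Expression.Block(
        new[] { i },
        Expression.Assign(i, begin),
        Expression.Loop(
            Expression.IfThenElse(
                Expression.LessThan(i, end),
                Expression.Block(
                    Expression.Assign(Expression.ArrayAccess(r, i), body),
                    Expression.PostIncrementAssign(i)),
                Expression.Break(breakLabel)),
            breakLabel));

    return Expression.Lambda<Action<T[], U[], int, int>>(loop, xs, r, begin, end).Compile();
}
With the delegate compiled once per expression and cached by FastMap(), the per-range work inside Parallel.ForEach() reduces to a direct array loop with no delegate call per element.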
The performance of this technique was generally a bit better than simply using the range partitioner.
Results
The following table contains the results of the different map tests (also available as a color-coded Excel sheet in the zip package). Each value is the total number of milliseconds spent in each test, and each column corresponds to a different array size.
| Implementation | 100 | 1,000 | 10,000 | 100,000 | 1,000,000 | 10,000,000 |
| x * 2 | | | | | | |
| IEnumerable.Select | 47.0013 | 177.8057 | 328.9458 | 360.1104 | 354.0814 | 645.1824 |
| SimpleMap | 7.5816 | 34.7406 | 69.7591 | 94.4403 | 92.0408 | 189.9622 |
| Parallel.For | 356.2646 | 250.5682 | 145.9923 | 234.1428 | 103.7589 | 214.9254 |
| ParallelQuery | 154.392 | 230.0595 | 288.8722 | 237.1865 | 385.0914 | 131.25 |
| Partitioned ForEach | 209.446 | 154.7732 | 107.0929 | 44.0619 | 25.6843 | 52.4506 |
| Expression-tree map | 202.83 | 138.4812 | 55.6059 | 29.6556 | 70.3177 | 47.7849 |
| x % 3 | | | | | | |
| IEnumerable.Select | 50.9754 | 187.2522 | 326.9465 | 351.2355 | 404.2332 | 546.7806 |
| SimpleMap | 11.0956 | 53.0866 | 96.3511 | 120.8589 | 136.8649 | 205.0038 |
| Parallel.For | 335.2865 | 221.5865 | 138.5964 | 213.4733 | 121.2471 | 159.3329 |
| ParallelQuery | 158.7274 | 228.8488 | 277.0174 | 219.9876 | 260.9124 | 125.8866 |
| Partitioned ForEach | 217.3006 | 152.4565 | 89.5992 | 56.3834 | 41.3112 | 57.5527 |
| Expression-tree map | 203.2819 | 144.2859 | 68.2051 | 40.7922 | 97.9225 | 56.9034 |
| x % 10 == 0 | | | | | | |
| IEnumerable.Select | 44.749 | 154.2371 | 269.2269 | 299.4427 | 375.8133 | 486.5356 |
| SimpleMap | 11.0698 | 54.2887 | 97.424 | 135.1354 | 181.7922 | 213.2659 |
| Parallel.For | 327.3782 | 233.9054 | 144.3949 | 134.1175 | 147.8055 | 192.6815 |
| ParallelQuery | 159.6129 | 233.3072 | 275.8584 | 331.0025 | 185.0597 | 143.3002 |
| Partitioned ForEach | 227.4341 | 157.5438 | 128.4379 | 55.6181 | 53.5953 | 59.0363 |
| Expression-tree map | 207.7087 | 153.7541 | 81.5905 | 47.0574 | 62.8087 | 61.7823 |
| x * y | | | | | | |
| IEnumerable.Select | 47.5342 | 176.2845 | 330.5266 | 417.78 | 389.5859 | 641.2155 |
| SimpleMap | 6.6616 | 29.933 | 60.012 | 82.1904 | 86.6731 | 164.2725 |
| Parallel.For | 335.883 | 239.2442 | 151.766 | 151.0844 | 109.0079 | 208.9425 |
| ParallelQuery | 156.0036 | 229.0513 | 307.9049 | 260.6253 | 319.6288 | 114.221 |
| Partitioned ForEach | 219.3628 | 156.26 | 81.5716 | 41.7017 | 34.5045 | 54.1712 |
| Expression-tree map | 211.8914 | 155.145 | 65.0756 | 45.8949 | 80.9438 | 57.4718 |
| Math.Sqrt(x) | | | | | | |
| IEnumerable.Select | 52.052 | 178.1152 | 325.6915 | 337.2131 | 314.3897 | 594.8534 |
| SimpleMap | 11.5972 | 60.8885 | 108.1417 | 150.2933 | 142.9398 | 226.6415 |
| Parallel.For | 332.5255 | 214.7269 | 155.6848 | 115.8106 | 107.1056 | 206.2448 |
| ParallelQuery | 151.1703 | 236.2765 | 262.9026 | 221.0244 | 333.1709 | 108.9785 |
| Partitioned ForEach | 218.4954 | 153.5197 | 70.759 | 133.0735 | 53.0308 | 116.2612 |
| Expression-tree map | 212.4254 | 146.074 | 74.6249 | 46.6229 | 51.6872 | 84.8528 |
| 1.0 / x | | | | | | |
| IEnumerable.Select | 57.7009 | 192.1225 | 351.9372 | 342.5118 | 342.8452 | 422.6756 |
| SimpleMap | 13.6616 | 61.0858 | 113.1279 | 215.0919 | 149.2644 | 159.6373 |
| Parallel.For | 327.4285 | 211.7903 | 168.9644 | 123.7927 | 123.0176 | 130.4033 |
| ParallelQuery | 156.755 | 233.4242 | 211.9213 | 156.8925 | 305.5861 | 379.4584 |
| Partitioned ForEach | 222.793 | 151.4817 | 73.8911 | 51.0748 | 56.9122 | 86.312 |
| Expression-tree map | 200.7036 | 139.8924 | 77.7088 | 115.2319 | 54.257 | 59.8788 |
| F(x) | | | | | | |
| IEnumerable.Select | 48.3564 | 172.7315 | 328.2192 | 339.3331 | 401.4378 | 685.65 |
| SimpleMap | 9.3622 | 53.5944 | 119.9011 | 134.9806 | 153.9984 | 271.0598 |
| Parallel.For | 317.6485 | 199.1581 | 138.0239 | 188.8737 | 127.2151 | 202.1605 |
| ParallelQuery | 148.9416 | 201.4768 | 241.8003 | 197.0903 | 154.3222 | 141.0791 |
| Partitioned ForEach | 215.59 | 172.11 | 64.8258 | 47.5668 | 45.3556 | 65.6707 |
| Expression-tree map | 239.0486 | 190.5578 | 103.7636 | 92.9645 | 160.6785 | 143.9428 |
| xs[x % xs.Length] | | | | | | |
| IEnumerable.Select | 53.4851 | 194.6166 | 326.2611 | 371.1026 | 431.2638 | 568.9713 |
| SimpleMap | 11.1547 | 52.3415 | 92.1421 | 123.3329 | 142.5564 | 203.5745 |
| Parallel.For | 321.7491 | 215.2261 | 135.3974 | 225.5697 | 129.4563 | 188.0756 |
| ParallelQuery | 161.9284 | 219.4217 | 290.0379 | 169.8483 | 139.5091 | 118.6672 |
| Partitioned ForEach | 219.876 | 157.8085 | 73.0295 | 58.3842 | 42.7669 | 59.3939 |
| Expression-tree map | 210.0777 | 148.9512 | 80.0346 | 53.4993 | 121.5434 | 70.5305 |
| x.ToString() | | | | | | |
| IEnumerable.Select | 174.3339 | 318.9216 | 322.7798 | 263.6981 | 263.9824 | 2986.9454 |
| SimpleMap | 123.489 | 255.0868 | 285.1731 | 247.5723 | 297.4845 | 2781.8611 |
| Parallel.For | 198.1301 | 109.1844 | 115.8489 | 147.2986 | 174.429 | 1952.4264 |
| ParallelQuery | 145.9428 | 112.4906 | 96.8155 | 125.436 | 183.3579 | 2024.5259 |
| Partitioned ForEach | 182.2398 | 96.6949 | 89.1915 | 129.1918 | 226.7504 | 1928.2476 |
| Expression-tree map | 165.3074 | 104.9725 | 89.9368 | 121.4181 | 178.6953 | 1835.5693 |
Final Words
This survey was informal and not very rigorous. Ideally, more test functions would be added, more input element types would be explored (not just arrays of ints), and more CPU configurations would be tested.
That said, the results did provide some interesting insights: the size of the array has a significant impact on which technique is most efficient, and range partitioning can be very effective for parallel map operations on arrays.
While the expression tree rewriting technique is interesting, the performance gains were smaller than I expected and inconsistent. For now I wouldn't recommend using it, due to its complexity, but I hope others may find ways to improve its performance.
Of course all comments are appreciated, but I'd especially like to hear about results on other machines and about suggested alternative implementations.
History
September 2, 2012 - First Version