Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

Efficient Map Operations for Arrays in C#

0.00/5 (No votes)
3 Sep 2012 1  
An informal survey of implementation techniques of the map higher-order function for arrays in C#.

Introduction  

This article compares the performance of various techniques for implementing a map function for transforming arrays and introduces a novel high-performance implementation of map that uses expression tree rewriting.

A map operation constructs a sequence or list from another list where each element in the new list is the result applying a function to each corresponding element in the original list. The most well-known example of a map operation for C# programmers is the IEnumerable.Select() function in the system library.   

The map operation is interesting to study for optimization because it is very common and also embarassingly parallel. A high-performance map operation is the cornerstone of an efficient collection library.  

Observations 

The primary observations of the experiment are:  

  • The simplest non-parallelized implementations far outperformed parallelized implementation when arrays were small (< 10,000 elements).  
  • Using Parallel.ForEach() with a ranged partitioner consistently outperforms  AsParallel().AsOrdered().Select().ToArray()
  • We can improve performance of Parallel.ForEach() with a ranged partitioner by passing a lambda as an expression tree and rewriting it. 

Methodology   

These tests were run on an Intel i7 quad-core machine running at 2.3 Ghz.   

In order to compare the different implementation of map, I created a test suite comparing 6 different implementation of maps. I tried each implementation with inputs of 9 different function arguments and input array lengths ranging from 100 to 10,000,000 integers.

Test functions are run on each array size repeatedly alternating between the different implementation techniques until the total time elapsed reaches 1 second.  

Techniques for Implementing a Map Operation on Arrays

In my experiments I studied the following implementations for a map operation on arrays.

Simple Sequenctial Map

Here is a simple implementation of map using a for loop.

U[] Map<T, U>(T[] xs, Func<T, U> f)  {   
  var r = new U[xs.Length];
  for (int i=0; i < xs.Length; ++i) r[i] = f(xs[i]);
  return r;
}

For small arrays (<= 1000) this was consistenly the most efficient implementation.

Implementing Map using IEnumerable.Select   

The simplest implementation of map uses the  IEnumerable.Select()  operation, followed by a call to ToArray()

U[] Map<T, U>(T[] xs, Func<T, U> f)  {   
  return xs.Select(f).ToArray();
}
In my measurements this was most often the slowest approach to implementing map, especially for arrays of size 10,000 and larger. It makes sense that this is less efficient than the hand-written version, because it uses an enumerator making it hard for the compiler to leverage knowledge about the underlying data types for optimization.

Parallel Map operation using Parallel.For  

Transforming the hand-written map operation into a parallel operation can be done easily using Parallel.For() as follows:   

static U[] SimpleParallelMap<T, U>(T[] xs, Func<T, U> f) {
  var r = new U[xs.Length];
  Parallel.For(0, xs.Length, i => r[i] = f(xs[i]));
  return r;
}

Somewhat surprisingly this implementation never significantly outperformed the non-parallel version, even for large arrays, and even under-performs in cases.

Parallel Map using IEnumerable.AsParallel()   

An alternative parallel implementation of map uses the IEnumerable.AsParallel()

U[] Map<T, U>(T[] xs, Func<T, U> f)  {   
  return xs.Select(f).AsParallel().AsOrdered().ToArray();
}     

The performance of this technique is comparable with the Parallel.For() approach, sometimes better, and sometimes worse. For some reason the performance was consistently and significantly worse than the Parallel.For() technique for arrays of precisely 1,000,000 items (sometimes as much 2x as long or more).

Parallel Map using Range Partitioner 

Using Parallel.ForEeach()  a partitioner can decrease the burden on the task scheduler by creating a smaller number of tasks that operate on sub-ranges of the array.

static U[] PartitionedParallelMap<T, U>(T[] xs, Func<T, U> f)
{
  var r = new U[xs.Length];
  Parallel.ForEach(Partitioner.Create(0, xs.Length), 
    range => { for (int i = range.Item1; i < range.Item2; ++i) 
      r[i] = f(xs[i]); });
  return r;
}

The performance of this approach is the best so far for large arrays but like most parallel approaches it starts to outperform the simple hand-written sequential map implementation only once the array size reaches 10,000 elements.  

Parallel Map using Range Partitioner and Expression Tree Rewriting

The performance of the PartitionedParallelMap() function can be improved if we observe that the function passed to ForEach is dominated by the cost of invoking the lambda function argument. Generally speaking invoking a lambda function has a significant performance cost.  

If the lambda function is passed as an expression tree it can instead be inlined in a dynamically created function and passed to the Parallel.ForEach() function.    

Here is the general approach in pseudo-code:  

static U[] OptimizedPartitionedParallelMap<T, U>(T[] xs, Expresion<Func<T, U>> fexpr)
{
  var r = new U[xs.Length];
  Expression<Action<T[], U[], int, int>> fexpr2 = RewriteExpressionTree(fexpr);
  Action<T[], U[], int, int> f = fexpr.Compile();
  Parallel.ForEach(Partitioner.Create(0, xs.Length), range => 
   { f(xs, r, range.Item1, range.Item2); });
  return r;
}

The rewriting code creates a new expression tree and uses the ExpressionVisitor class to inline the body of the function argument (fexpr) within it thus eliminating the need for an expensive lambda invocation call within the for-loop.  

public class FunctionInliner <t,> : ExpressionVisitor
{
  Expression argument;
  ParameterExpression parameter;

  public FunctionInliner(Expression<func<t,>> function, Expression argument){
     parameter = function.Parameters[0];
     this.argument = argument;
  }

  public Expression Modify(Expression expression) {
    return Visit(expression);
  }

  protected override Expression VisitParameter(ParameterExpression node){ 
    return node = parameter ? argument : node;
  }
}

public static class FunctionInliner
{
  public static Expression InlineFunctionCall<t,>(Expression<func<t,>> function, Expression arg) {
     var fi = new FunctionInliner<t,>(function, arg);
     return fi.Modify(function.Body);
  }
}

This map implementation only has reasonable performance if the Expression.Compile() is memoized (i.e. cached) which is done as follows: 

private static Dictionary<Expression, object> memoizedFastMaps = new Dictionary<Expression, object>();

public static U[] FastMap<T, U>(this T[] self, Expression<Func<T, U>> fexpr) {
  Action<T[], U[], int, int> f;
  lock (memoizedFastMaps)
  {
    if (!memoizedFastMaps.ContainsKey(fexpr))
       memoizedFastMaps[fexpr] = ComputeFastMap(fexpr);
    f = (Action<T[], U[], int, int>)memoizedFastMaps[fexpr];
  }
  var r = new U[self.Length];
  Parallel.ForEach(Partitioner.Create(0, self.Length), 
  range => f(self, r, range.Item1, range.Item2));
  return r;
}  

The performance of this technique was generally a bit better than simply using the range partitioner. 

Results    

This table contains the results for the different map tests. This is also available as a color coded excel sheet in the zip package. Each value represents the total number of msec spent in each test. Each column represents a different size of the array.

100 1000 10,000 100,000 1,000,000 10,000,000
x * 2 
IEnumerable.Select   47.0013 177.8057 328.9458 360.1104 354.0814 645.1824
SimpleMap            7.5816 34.7406 69.7591 94.4403 92.0408 189.9622
Parallel.For         356.2646 250.5682 145.9923 234.1428 103.7589 214.9254
ParallelQuery        154.392 230.0595 288.8722 237.1865 385.0914 131.25
Partitioned ForEach  209.446 154.7732 107.0929 44.0619 25.6843 52.4506
Expression-tree map  202.83 138.4812 55.6059 29.6556 70.3177 47.7849
x % 3 
IEnumerable.Select   50.9754 187.2522 326.9465 351.2355 404.2332 546.7806
SimpleMap            11.0956 53.0866 96.3511 120.8589 136.8649 205.0038
Parallel.For         335.2865 221.5865 138.5964 213.4733 121.2471 159.3329
ParallelQuery        158.7274 228.8488 277.0174 219.9876 260.9124 125.8866
Partitioned ForEach  217.3006 152.4565 89.5992 56.3834 41.3112 57.5527
Expression-tree map  203.2819 144.2859 68.2051 40.7922 97.9225 56.9034
x % 10 == 0 
IEnumerable.Select   44.749 154.2371 269.2269 299.4427 375.8133 486.5356
SimpleMap            11.0698 54.2887 97.424 135.1354 181.7922 213.2659
Parallel.For         327.3782 233.9054 144.3949 134.1175 147.8055 192.6815
ParallelQuery        159.6129 233.3072 275.8584 331.0025 185.0597 143.3002
Partitioned ForEach  227.4341 157.5438 128.4379 55.6181 53.5953 59.0363
Expression-tree map  207.7087 153.7541 81.5905 47.0574 62.8087 61.7823
x * y 
IEnumerable.Select   47.5342 176.2845 330.5266 417.78 389.5859 641.2155
SimpleMap            6.6616 29.933 60.012 82.1904 86.6731 164.2725
Parallel.For         335.883 239.2442 151.766 151.0844 109.0079 208.9425
ParallelQuery        156.0036 229.0513 307.9049 260.6253 319.6288 114.221
Partitioned ForEach  219.3628 156.26 81.5716 41.7017 34.5045 54.1712
Expression-tree map  211.8914 155.145 65.0756 45.8949 80.9438 57.4718
Math.Sqrt(x)
IEnumerable.Select   52.052 178.1152 325.6915 337.2131 314.3897 594.8534
SimpleMap            11.5972 60.8885 108.1417 150.2933 142.9398 226.6415
Parallel.For         332.5255 214.7269 155.6848 115.8106 107.1056 206.2448
ParallelQuery        151.1703 236.2765 262.9026 221.0244 333.1709 108.9785
Partitioned ForEach  218.4954 153.5197 70.759 133.0735 53.0308 116.2612
Expression-tree map  212.4254 146.074 74.6249 46.6229 51.6872 84.8528
1.0 / x 
IEnumerable.Select   57.7009 192.1225 351.9372 342.5118 342.8452 422.6756
SimpleMap            13.6616 61.0858 113.1279 215.0919 149.2644 159.6373
Parallel.For         327.4285 211.7903 168.9644 123.7927 123.0176 130.4033
ParallelQuery        156.755 233.4242 211.9213 156.8925 305.5861 379.4584
Partitioned ForEach  222.793 151.4817 73.8911 51.0748 56.9122 86.312
Expression-tree map  200.7036 139.8924 77.7088 115.2319 54.257 59.8788
F(x) 
IEnumerable.Select    48.3564 172.7315 328.2192 339.3331 401.4378 685.65
SimpleMap             9.3622 53.5944 119.9011 134.9806 153.9984 271.0598
Parallel.For          317.6485 199.1581 138.0239 188.8737 127.2151 202.1605
ParallelQuery         148.9416 201.4768 241.8003 197.0903 154.3222 141.0791
Partitioned ForEach   215.59 172.11 64.8258 47.5668 45.3556 65.6707
Expression-tree map  239.0486 190.5578 103.7636 92.9645 160.6785 143.9428
xs[x % xs.Length] 
IEnumerable.Select   53.4851 194.6166 326.2611 371.1026 431.2638 568.9713
SimpleMap            11.1547 52.3415 92.1421 123.3329 142.5564 203.5745
Parallel.For         321.7491 215.2261 135.3974 225.5697 129.4563 188.0756
ParallelQuery        161.9284 219.4217 290.0379 169.8483 139.5091 118.6672
Partitioned ForEach  219.876 157.8085 73.0295 58.3842 42.7669 59.3939
Expression-tree map  210.0777 148.9512 80.0346 53.4993 121.5434 70.5305
x.ToString() 
IEnumerable.Select   174.3339 318.9216 322.7798 263.6981 263.9824 2986.9454
SimpleMap            123.489 255.0868 285.1731 247.5723 297.4845 2781.8611
Parallel.For         198.1301 109.1844 115.8489 147.2986 174.429 1952.4264
ParallelQuery        145.9428 112.4906 96.8155 125.436 183.3579 2024.5259
Partitioned ForEach  182.2398 96.6949 89.1915 129.1918 226.7504 1928.2476
Expression-tree map  165.3074 104.9725 89.9368 121.4181 178.6953 1835.5693

Final Words 

This survey was informal and not very rigorous. Ideally more tests would be added, and more input array types would be explored (not just arrays of ints). Finally more CPU configurations need to be tested.  

That said the results did provide some interesting insights into the fact that the size of the array has a signficiant impact on what technique is most efficient, and shows how effective range partitioning can be in parallel map operations for arrays. 

While the expression tree rewriting technique is interesting, the performance gains were less than I expected and inconsistent. For now I wouldn't recommend using it due to the complexity, but I hope others may find ways to improve its performance.  

Of course all comments are appreciated, but I'd especially like to here about results on other machines, and suggested alternative implementations. 

History        

September 2, 2012 - First Version

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here