Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

Fast and Less Fast Loops in C#

0.00/5 (No votes)
17 Jan 2011 4  
How fast can a loop reading from memory be made to run and how does loop constructs, data types, interfaces, unrolling and hoisting affect performance?

Introduction

After I did the first article on QS, I decided to use the tool to do a few experiments to investigate how the CPU cache affects performance. During these experiments, I got a few insights with regard to the performance of various C# constructs that I will share with you here.

Background

The memory system of your PC most likely consists of a large but slow main memory and smaller but faster CPU caches for instructions, data and virtual memory management. The experiment I originally set out to do was about the data cache and specifically about read performance, so here is a short and simplified description of how a memory read works:

When a memory address is accessed, the CPU sends its request to the closest cache (L1) and if the cache holds the value for the address, it simply responds with the value. In fact, the cache will not respond with just the value but will have the entire line containing the address ready (on my system that is 64 bytes). The trick is that finding the line and preparing it is slow (3 clock cycles on my machine) while fetching the data is fast (1 word per cycle), so if the CPU over the next couple of cycles request other data in the line that data will be returned much faster than the first piece of data. The CPU will do its best to predict the next memory access so that the cache can prepare the next line while the CPU works its way through the first line.

So what happens if the L1 cache does not contain the data we are looking for? The L1 cache will look in the L2 cache and if it’s there the exact same thing happens as above, but this time with a higher latency (17 clock cycles on my machine). And what if it is not in L2? Then the main memory (or L3 if such a cache is present) is accessed, again with an increased latency, this time a whopping 290 cycles on my system.

If you want to learn more about caches work, see the article http://en.wikipedia.org/wiki/CPU_cache on Wikipedia or check out the document “What Every Programmer Should Know About Memory” at http://www.unilim.fr/sci/wiki/_media/cali/cpumemory.pdf.

If you are curious about your own system, you can use a benchmark tool to find its characteristics. I used SiSoft Sandra from http://www.sisoftware.net/ for this article.

Measuring Cache Effects

How can the effects of a cache be measured? This is a matter of allocating buffers of various sizes and reading from these buffers while timing how long it takes; If the cache fits into L1, we should expect fast access times and slower times if the data is in L2 or main memory. In reality, it’s more complicated and different access patterns will yield different results due to line sizes, associativity, pre-fetching and pipelining within the CPU, but the first step is really to find out if C# / .NET is fast enough to reveal any cache effects at all. This question is the subject of the remainder of the article.

Before we start, here are some facts about my system:

CPU: Intel T9600 dual core CPU @ 2.8GHz
L1i & L1d caches each 16KB
L2 cache 6MB shared between cores
Main memory: 4 GB dual channel PC6400 DDR2 RAM
Max theoretical transfer rate (MTTR) = 12.8 GB/s.

Since I have a 32 bit OS, I’ll assume that the CPU with a single thread can read at most one 32 bit word per cycle from the memory system (i.e. from the L1d cache). At 2.8 GHz, this yields a maximum theoretical transfer rate from L1d into core of 11.2 GB/s.

Measurements

The first experiment I set out to do was to see how close I could get to the L1d->Core transfer rate of 11.2 GB/s. That involved creating a number of methods with different loop constructs and using different data types. All experiments have an inner loop and an outer loop and in many of the experiments, the loop has been unrolled. The inner loop sequentially reads and sums the contents of a 4KB buffer into a local variable (the sum is to prevent the compiler from eliminating the loop body) while the outer loop is repeated until a total of 64MB is read. The 4KB buffer fits into the L1d cache ensuring the fastest memory access possible so any difference in performance has to do with the loop construct.

Regarding measurements, all measurements have been repeated at least 100 times and the best execution times have been picked.

The table below show the results and as you can see, there is almost a factor 10 between the fastest and the slowest loop. The number of cycle is computed based on the assumption that the max transfer rate is 11.2 GB/s.

Method Time
(ms)
Transfer rate
(GB/s)
% Max Cycles / read Reads / iteration Cycles / iteration


Validation that 64 bit reads (long) does not increase transfer rate

For4_Unroll4_Long 17,36 3,87 35% 2,90 4 11,59


Different ways of indexing the buffer and different amounts of unrolling

For4_Unroll4

For-loop, 16 x unroll, prefix++

6,23 10,76 96% 1,04 16 16,65
For4_Unroll3

For-loop, 4 x unroll, prefix++

7,52 8,93 80% 1,25 4 5,02
For4_Unroll2

For-loop, 4 x unroll, postfix++

14,76 4,55 41% 2,46 4 9,85
For4_Unroll1

For-loop, 4 x unroll, index + offset

8,91 7,53 67% 1,49 4 5,95
For4_Foreach

Foreach loop

8,92 7,52 67% 1,49 1 1,49


Different ways of setting up the loop

For1_For1

Loop variable declared in loop, custom property read in loop.

17,71 3,79 34% 2,96 1 2,96
For1_For2

Loop variable declared in loop, array.length property read in loop.

11,84 5,67 51% 1,98 1 1,98
For3_For3

Loop variable declared in loop, array.length property read before loop.

11,87 5,65 50% 1,98 1 1,98
For4_For4

Loop variable declared outside loop, array.length property read before loop.

11,91 5,64 50% 1,99 1 1,99
BORDER-TOP: #f0f0f0; BORDER-RIGHT: windowtext 1pt solid; PADDING-TOP: 0cm"> For4_For5

Loop variable declared outside loop, array.length property read before loop, index incremented in loop body

11,82 5,68 51% 1,97 1 1,97
For4_While

Inner loop using while

11,82 5,68 51% 1,97 1 1,97
For4_DoWhile

Inner loop using do-while

11,80 5,69 51% 1,97 1 1,97


Using generic List<int> instead of array int[]

For4_Unroll4_List 23,82 2,82 25% 3,98 16 63,60
For4_Foreach_List 59,04 1,14 10% 9,85 1 9,85

Establishing a Baseline

The fastest loop ”For4_Unroll4” is very close to the maximum theoretical speed and implemented like this:

int unused = 0, i, j;
int imax = b.spec.CountIterations;
int jmax = b.spec.CountItems;
TestContext.Current.Timer.Start();
for (i = 0; i != imax; ++i)
{
  for (j = 0; j != jmax; )
  {
    unused += b.intArray[j]; ++j;
    // repeated 16 times
  }
}
TestContext.Current.Timer.Stop();

For4_Unroll3 is identical but only unrolled four times, while Unshared_For4_For5 has the same structure with no unrolls. From For4_For5 to For4_Unroll3, the unrolling shaves away 75% of the iterations in the inner loop from 16M to 4M iterations and from For4_Unroll3 to For4_Unroll4 another 75% is shaved away down to 1M iterations. Since all methods do the same work of summing up 16M integers and the only difference is the number of times the inner loop is repeated, we can compare the number of cycles per iteration which reveals that the loop itself (i.e., compare and branch) costs around 1 cycle while reading the integer, adding it to a local variable and incrementing the index costs another cycle.

Postfix Increment Considered Harmful

Postfix operators (e.g. i++) are generally considered bad style because they are errorprone to use, but they can nevertheless save a lazy programmer a couple of keystrokes. Consider the inner loop of the method For4_Unroll2:

for (j = 0; j != jmax; )
{
  unused += b.intArray[j++];
  unused += b.intArray[j++];
  unused += b.intArray[j++];
  unused += b.intArray[j++];
}

It’s certainly shorter and more concise than the code from For4_Unroll4 but interestingly you don’t only get punished by the code police, but also take a severe hit on performance. This code takes twice as long to execute compared to the unrolled version that uses the prefix increment!

Post fix increment incurs overhead because a temporary copy must be made of the initial value before the increment, and we apparently cannot rely on the compilers to realise that this copy is unnecessary.

Foreach on Arrays is Pretty Good

The last unrolled loop ”For4_Unroll1” looks like this:

for (j = 0; j != jmax; j += 4)
{
  unused += b.intArray[j];
  unused += b.intArray[j + 1];
  unused += b.intArray[j + 2];
  unused += b.intArray[j + 3];
}

It’s runs in 8.9 ms, about 20% slower than the loop using postfix increments (7.5 ms) with the extra time corresponding to one extra cycle per iteration of the inner loop. The interesting bit is to compare this result to the loop For4_Foreach where the inner loop has been replaced by a foreach loop:

foreach (int x in b.intArray)
{
  unused += x;
}

This loop also runs in 8.9 ms just as For4_Unroll1, and 25% faster than any of the for, while and do-while variants have not been unrolled.

A Loop is a Loop but Properties Cost

Does it matter where variables are declared? Does it matter where the index is incremented? Does it matter if the length of an array is referenced in the loop condition? To answer that, I tried some variations of the for-loop:

For1_For2

for (int i = 0; i != b.spec.CountIterations; ++i)
{
  for (int j = 0; j != b.intArray.Length; ++j)
  {
    unused += b.intArray[j];
  }
}

For3_For3

for (int i = 0; i != imax; ++i)
{
  for (int j = 0; j != jmax; ++j)
  {
    unused += b.intArray[j];
  }
}

For4_For4

for (i = 0; i != imax; ++i)
{
  for (j = 0; j != jmax; ++j)
  {
    unused += b.intArray[j];
  }
}

For4_For5

for (i = 0; i != imax; ++i)
{
  for (j = 0; j != jmax; )
  {
    unused += b.intArray[j]; ++j;
  }
}

For4_While

for (i = 0; i != imax; ++i)
{
  j = 0;
  while (j != jmax)
  {
    unused += b.intArray[j];
    ++j;
  }
}

For4_DoWhile

for (i = 0; i != imax; ++i)
{
  j = 0;
  do
  {
    unused += b.intArray[j];
    ++j;
  } while (j != jmax);
}

As you can see from the results, they all perform exactly the same (11.8 ms or 2 cycles per iteration) and so do the variants involving while and do-while. The one loop that sticks out as a poor performer with 17.7 ms or 3 cycles/iteration is For1_For1:

for (int i = 0; i != b.spec.CountIterations; ++i)
{
  for (int j = 0; j != b.spec.CountItems; ++j)
  {
    unused += b.intArray[j];
  }
}

public int CountItems { get { return countItems; } }
private int countItems;

I take this to mean that reading a property in a loop condition costs an extra cycle per iteration. Note however that semantics are not the same if you read the property in the condition or make a local copy. In a multithreaded program, a property may change between calls but a local copy will not change. Compare this to For1_For2 where the loop condition depends on the Length property of an array and For3_For3 where the length is read before the loop. These two loops perform the same which is expected since the array length cannot change (arrays cannot be resized).

Performance of List<int>

Being pleasantly surprised by the performance of foreach on arrays, I had my hopes high that List<int> would also perform well, but as you can see it doesn’t: With a loop unrolled 16 times (reading through the list’s indexer), it takes 4 cycles to do a read, and using foreach it takes 10 cycles per read!

Access Through IList<int>

The next step in the quest to test looping with C# is to come up with a way to simulate different access patterns. My approach to this is to fill a list of integers with an index to the next position in the list to read, then I can simply populate the list with the access pattern I want to test and use the same code for all patterns I come up with. The previous experiments showed that loop unrolling gave a significant performance boost, so I’ll go for single loop unrolled 16 times. The code looks like this:

int[] list = b.intArray;
for (; i != max; ++i)
{
  j = list[j];
  // repeat 16 times
}

At this point, it is clear that I should use int[] as data structure, but I am still interested in the general performance of C# so I decided to do another experiment where I use int[], List<int> and see what happens if I access them directly or through the common Ilist<int> interface. Results are in the table below:

Method Time (ms) Transfer rate (GB/s) % Max Cycles / read Reads / iteration Cycles / iteration
List<int> 23,80 2,82 25,2% 3,97 16 63,5
int[] 17,56 3,82 34,1% 2,93 16 46,9
Ilist<int> on List<Int> 92,33 0,73 6,5% 15,41 16 246,5
Ilist<int> on int[] 86,90 0,77 6,9% 14,50 16 232,0

I still get 4 cycles per read when accessing List<int> directly, but the work on the int array has increased from 1 to 3 cycles per read. That is OK since I’m done with sequential reads (those were primarily for the L1 cache and read-ahead and once I get cache misses against L1 access times are 3 or more cycles anyway).

Another point is that sequential summing a list of integers is not a typical task. Neither is the loop to test access patterns but its increased complexity is closer to the things real programs do and it is interesting that in this scenario the execution time increases by 33% and not 300% in the sequential summing example. This makes the tradeoff between the convenience of generic lists and the slower execution much more acceptable.

Access through the IList<int> interface is 4 times slower compared to using List<int> directly and 5 times slower compared to int[]. I guess the reason is that the interface prevents the compilers from doing optimizations relying on the concrete implementation. For example, there is no way for the compiler to tell that the length of the integer array is constant or that the memory layout is sequential, forcing both the C# compiler and the JIT compiler to emit code that constantly re-evaluates all aspects of the loop condition and the memory access.

Using the Code

The attached archive contains a VS 2008 project for the loops and control structures used to get the results above. To run the test, you need to download the tool Quality Gate One Studio. The archive contains a project for this tool with a couple of test sets set up for the experiments mentioned in the article. Simply run the test sets and generate reports to get results on your system.

Conclusion

This article covers a precursor to an experiment to measure practical cache performance with the purpose to identify whether it is possible to write C# code that executes fast enough to reveal cache effects. The overall conclusion to that question is that with some care and a bit of loop unrolling it is possible. In fact, for simple constructs on simple data (arrays of int) and with some loop unrolling, C# performs really well and is capable of summing integers from memory at an average rate of one integer per clock cycle.

Some things perform better than others and the following caveats have been discovered:

  • Postfix increment (e.g. i++) is expensive and adds on average 1.25 cycles reducing throughput by 65%.
  • Reading properties, e.g. in loop conditions can hurt performance because some optimizations cannot be done in a multithreaded environment. Truly constant properties like the length of an array can be optimized, but if in doubt the safest bet is to read the value of the property before entering the loop.
  • Generic lists are significantly slower (a factor 4) than arrays in this specialized case, but the evidence seems to be that the performance cost is much less in more typical cases. This article showed an example where the execution time increased only 33%.
  • Use of interfaces instead of concrete types comes with a additional performance hit. In this case, a factor 4 to 5 was observed. An educated guess is that interfaces prevent the compilers from doing optimizations that are available when using concrete types instead.

On the positive side, the evidence shows that:

  • Choice of loop construct (for, while or do-while) does not affect performance.
  • The standard foreach construct can be faster (1,5 cycles per step) than a simple for-loop (2 cycles per step), unless the loop has been unrolled (1.0 cycles per step). So for everyday code, performance is not a reason to use the more complex for, while or do-while constructs.

Not all code is performance critical, but when things must execute fast, you may have to sacrifice good programming practices for the performance:

  • Prefer concrete types over interfaces
  • Prefer simple arrays over List<>
  • Manually hoist constants outside loops if these involve properties or variables / members unless you are absolutely sure the compilers will recognize them as constants - also in a multithreaded environment.
  • Manually unroll loops

In general, don’t use postfix operators: They are both error prone and have poor performance.

Regarding compiler / platform versions, I have tried to compile for both .NET 2.0 (VS 2008) and .NET 4.0 (VS 2010) but have not identified any significant differences between the two.

Finally, this article is about fast iteration in C# but the background is an attempt to measure cache effects. I think this article is long enough as it is, but I made a few observations on my way: First, when reading sequentially using For4_Unroll4 the rate of one step per cycle is sustained regardless of buffer size which basically means that pipelining and prefetching is doing a hell of a job to get the data into the CPU as fast as possible all the way from main memory and up. When changing the access pattern to use random access or strides above 4 bytes cache effects become visible because the hardware cannot predict and prefetch fast enough.

If you want to play around on your own system, the attached code is prepared for the above experiments, but if you just want the hard facts about your system, you can get it from a benchmark tool such as SiSoft Sandra.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here