Introduction
Everybody who does performance optimization stumbles sooner or later over the Stopwatch class in the System.Diagnostics namespace. And everybody has noticed that measurements of the same function on the same computer can differ by 25%-30% in run time. This article shows how single-threaded test programs must be designed to get an accuracy of 0.1%-0.2% out of the Stopwatch class. With this accuracy, algorithms can be tested and compared.
Background
Modern CPUs have multiple cores, large caches, instruction pipelines and many other features that affect the run time of an algorithm in a particular test scenario. White box testing techniques - such as an attached debugger or profiler - disturb the CPU's cache lines, pipelines, etc. The real run time is hidden, so an algorithm optimized for these modern superscalar CPUs can appear slower (because it executes more instructions) than a non-optimized one while a profiler is attached. Black box testing (run time measurement) without an attached debugger or profiler reveals the real performance of an algorithm and completes its performance analysis.
Setting Up the Test Scenario
The most important thing is to prevent the test from switching between CPU cores or processors. A switch invalidates the caches, etc. and has a huge performance impact on the test. This can be done by setting the ProcessorAffinity mask of the process:
Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(2);
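The value is a bit mask: bit 0 stands for the first core, bit 1 for the second, and so on, so new IntPtr(2) pins the process to the second core. If the core should not be hard-coded, a variant like the following (my own sketch, not part of the original sample) picks the machine's last core:
// Pin the process to the last available core instead of a hard-coded one.
// Note: this sketch assumes the core count fits into the IntPtr bit mask.
int lastCore = Environment.ProcessorCount - 1;
Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(1L << lastCore);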
To get the CPU core more exclusively, we must prevent other threads from using it. We raise our process and thread priority to achieve this:
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
Thread.CurrentThread.Priority = ThreadPriority.Highest;
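Since a high priority affects the rest of the system, it can be worth saving the defaults and restoring them after the measurement. A minimal sketch (the variable names are mine, not from the original sample):
Process process = Process.GetCurrentProcess();

// Remember the current settings before changing them.
IntPtr oldAffinity = process.ProcessorAffinity;
ProcessPriorityClass oldPriorityClass = process.PriorityClass;
ThreadPriority oldThreadPriority = Thread.CurrentThread.Priority;

// ... pin the core, raise the priorities, run the measurements ...

// Restore the original settings afterwards.
process.ProcessorAffinity = oldAffinity;
process.PriorityClass = oldPriorityClass;
Thread.CurrentThread.Priority = oldThreadPriority;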
And last but not least, we need a warm-up phase in our tests, so that the JIT compiler has compiled the test function and the CPU caches are filled before we start measuring. On my system, the results are stable after a warm-up phase of 1000-1500 milliseconds. We can use the stopwatch itself to control the warm-up (at least 1200 ms here):
stopwatch.Start();
while (stopwatch.ElapsedMilliseconds < 1200)
{
    result = TestFunction(seed, count);
}
stopwatch.Stop();
Here is the complete sample:
using System;
using System.Diagnostics;
using System.Threading;

namespace PreciseMeasure
{
    class Program
    {
        static void Main(string[] args)
        {
            Stopwatch stopwatch = new Stopwatch();
            long seed = Environment.TickCount; // runtime value, cannot be folded into a constant
            long result = 0;
            int count = 100000000;

            Console.WriteLine("20 Tests without correct preparation");
            Console.WriteLine("Warmup");
            for (int repeat = 0; repeat < 20; ++repeat)
            {
                stopwatch.Reset();
                stopwatch.Start();
                result ^= TestFunction(seed, count);
                stopwatch.Stop();
                Console.WriteLine("Ticks: " + stopwatch.ElapsedTicks +
                    " mS: " + stopwatch.ElapsedMilliseconds);
            }

            // Pin the process to one core and raise the priorities.
            Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(2);
            Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
            Thread.CurrentThread.Priority = ThreadPriority.Highest;

            Console.WriteLine();
            Console.WriteLine();
            Console.WriteLine("20 Tests with correct preparation");
            Console.WriteLine("Warmup");

            // Warm-up phase: run the test function for at least 1200 ms.
            stopwatch.Reset();
            stopwatch.Start();
            while (stopwatch.ElapsedMilliseconds < 1200)
            {
                result = TestFunction(seed, count);
            }
            stopwatch.Stop();

            for (int repeat = 0; repeat < 20; ++repeat)
            {
                stopwatch.Reset();
                stopwatch.Start();
                result ^= TestFunction(seed, count);
                stopwatch.Stop();
                Console.WriteLine("Ticks: " + stopwatch.ElapsedTicks +
                    " mS: " + stopwatch.ElapsedMilliseconds);
            }

            // Print the accumulated result so the calls cannot be optimized away.
            Console.WriteLine(result);
        }

        public static long TestFunction(long seed, int count)
        {
            long result = seed;
            for (int i = 0; i < count; ++i)
            {
                result ^= i ^ seed;
            }
            return result;
        }
    }
}
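When comparing two algorithms, it can also help to aggregate the 20 measurements instead of reading them one by one. A minimal sketch of such a helper (the class and method names are mine, not part of the original sample; it assumes the per-run tick values have been collected into a list):
using System;
using System.Collections.Generic;
using System.Linq;

static class MeasurementStats
{
    // Prints min, max, average and the relative spread of a series of
    // Stopwatch tick counts collected from repeated test runs.
    public static void Print(List<long> ticks)
    {
        long min = ticks.Min();
        long max = ticks.Max();
        double avg = ticks.Average();
        double spreadPercent = 100.0 * (max - min) / min;
        Console.WriteLine("Min: {0}  Max: {1}  Avg: {2:F0}  Spread: {3:F2} %",
            min, max, avg, spreadPercent);
    }
}
In the measurement loop above, one would add stopwatch.ElapsedTicks to the list after each stopwatch.Stop() and call MeasurementStats.Print once at the end.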
Results
Without Correct Preparation
- Ticks: 1580367 mS: 572 <-- highest Value
- Ticks: 1577003 mS: 571
- Ticks: 1576140 mS: 571
- Ticks: 1560964 mS: 565
- Ticks: 1351663 mS: 489
- Ticks: 1248383 mS: 452
- Ticks: 1115361 mS: 404
- Ticks: 1112813 mS: 403
- Ticks: 1113112 mS: 403
- Ticks: 1112012 mS: 402 <-- lowest Value
- Ticks: 1330444 mS: 482
- Ticks: 1558493 mS: 564
- Ticks: 1501097 mS: 543
- Ticks: 1517796 mS: 549
- Ticks: 1542712 mS: 558
- Ticks: 1574959 mS: 570
- Ticks: 1483975 mS: 537
- Ticks: 1390578 mS: 503
- Ticks: 1546904 mS: 560
- Ticks: 1349507 mS: 488
The run time varies between 402 ms and 572 ms. That is a difference of 170 ms, or 42%. These results are useless.
With Correct Preparation
- Ticks: 1110518 mS: 402
- Ticks: 1110540 mS: 402
- Ticks: 1110543 mS: 402
- Ticks: 1110684 mS: 402 <-- highest Value
- Ticks: 1110508 mS: 402
- Ticks: 1110553 mS: 402
- Ticks: 1110600 mS: 402
- Ticks: 1110433 mS: 402 <-- lowest Value
- Ticks: 1110509 mS: 402
- Ticks: 1110508 mS: 402
- Ticks: 1110489 mS: 402
- Ticks: 1110568 mS: 402
- Ticks: 1110503 mS: 402
- Ticks: 1110566 mS: 402
- Ticks: 1110625 mS: 402
- Ticks: 1110474 mS: 402
- Ticks: 1110571 mS: 402
- Ticks: 1110448 mS: 402
- Ticks: 1110555 mS: 402
- Ticks: 1110495 mS: 402
The 20 samples produce the same result of 402 ms and can only be distinguished by the tick count (the internal performance counter value). The difference is 251 ticks or 0.02%. On my system, the Stopwatch frequency is 2760029 ticks per second, so the difference between the test runs is only 0.09 ms. That is really good and can be used to measure and compare algorithm run times.
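For reference, Stopwatch ticks are converted to time via the static Stopwatch.Frequency field (ticks per second). A small helper (the method name is mine) reproduces the 0.09 ms figure above:
// Converts raw Stopwatch ticks to milliseconds via the counter frequency.
// With the frequency reported above (2,760,029 ticks/s), a spread of
// 251 ticks gives 251 * 1000.0 / 2760029 ≈ 0.09 ms.
static double TicksToMilliseconds(long ticks)
{
    return ticks * 1000.0 / Stopwatch.Frequency;
}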
Points of Interest
One very important thing should be kept in mind: the best (lowest) value without preparation isn't as good as the worst value with preparation. CPU context switches and core switches have a huge impact on the run time of applications.
Links and Additional Resources
This article is one of the performance tests that can be found on benchmarks.codeplex.com. They serve as research for the "Second WAF" web application framework.
History
- 28th February, 2010: Initial post