Introduction
Everybody who does performance optimization stumbles sooner or later over the Stopwatch class in the System.Diagnostics namespace. And everybody has noticed that measurements of the same function on the same computer can differ by 25%-30% in run time. This article shows how single-threaded test programs must be designed to get an accuracy of 0.1%-0.2% out of the Stopwatch class. With this accuracy, algorithms can be tested and compared.
Background
Modern CPUs have multiple cores, large caches, instruction pipelines and many other features that affect the run time of an algorithm in a particular test scenario. White box testing techniques - such as an attached debugger or profiler - disturb the CPU's cache lines, pipelines, etc. The real run time is hidden, so an algorithm optimized for these modern superscalar CPUs can appear slower (because it executes more instructions) than a non-optimized one while a profiler is attached. Black box testing (run time measurement) without an attached debugger or profiler reveals the real performance of an algorithm and completes its performance analysis.
Setting Up the Test Scenario
The most important thing is to prevent the test from switching between CPU cores or processors. A switch invalidates the caches, etc. and has a huge performance impact on the test. This can be done by setting the ProcessorAffinity mask of the process:
Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(2);
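The value is a bit mask: bit 0 stands for the first core, bit 1 for the second, and so on, so new IntPtr(2) pins the process to the second core. If the core should not be hard-coded, a variant like the following (my own sketch, not part of the original sample) picks the machine's last core:
// Pin the process to the last available core instead of a hard-coded one.
// Note: this sketch assumes the core count fits into the IntPtr bit mask.
int lastCore = Environment.ProcessorCount - 1;
Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(1L << lastCore);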
To get the CPU core more exclusively, we must prevent other threads from using it. We raise our process and thread priority to achieve this:
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
Thread.CurrentThread.Priority = ThreadPriority.Highest;
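Since a high priority affects the rest of the system, it can be worth saving the defaults and restoring them after the measurement. A minimal sketch (the variable names are mine, not from the original sample):
Process process = Process.GetCurrentProcess();

// Remember the current settings before changing them.
IntPtr oldAffinity = process.ProcessorAffinity;
ProcessPriorityClass oldPriorityClass = process.PriorityClass;
ThreadPriority oldThreadPriority = Thread.CurrentThread.Priority;

// ... pin the core, raise the priorities, run the measurements ...

// Restore the original settings afterwards.
process.ProcessorAffinity = oldAffinity;
process.PriorityClass = oldPriorityClass;
Thread.CurrentThread.Priority = oldThreadPriority;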
And last but not least, we need a warm-up phase in our tests, so that the JIT compiler has compiled the test function and the CPU caches are filled before we start measuring. On my system, the results are stable after a warm-up phase of 1000-1500 milliseconds. We can use the stopwatch itself to control the warm-up (at least 1200 ms here):
stopwatch.Start();
while (stopwatch.ElapsedMilliseconds < 1200)
{
    result = TestFunction(seed, count);
}
stopwatch.Stop();
Here is the complete sample:
using System;
using System.Diagnostics;
using System.Threading;

namespace PreciseMeasure
{
    class Program
    {
        static void Main(string[] args)
        {
            Stopwatch stopwatch = new Stopwatch();
            long seed = Environment.TickCount; // runtime value, cannot be folded into a constant
            long result = 0;
            int count = 100000000;

            Console.WriteLine("20 Tests without correct preparation");
            Console.WriteLine("Warmup");
            for (int repeat = 0; repeat < 20; ++repeat)
            {
                stopwatch.Reset();
                stopwatch.Start();
                result ^= TestFunction(seed, count);
                stopwatch.Stop();
                Console.WriteLine("Ticks: " + stopwatch.ElapsedTicks +
                    " mS: " + stopwatch.ElapsedMilliseconds);
            }

            // Pin the process to one core and raise the priorities.
            Process.GetCurrentProcess().ProcessorAffinity = new IntPtr(2);
            Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
            Thread.CurrentThread.Priority = ThreadPriority.Highest;

            Console.WriteLine();
            Console.WriteLine();
            Console.WriteLine("20 Tests with correct preparation");
            Console.WriteLine("Warmup");

            // Warm-up phase: run the test function for at least 1200 ms.
            stopwatch.Reset();
            stopwatch.Start();
            while (stopwatch.ElapsedMilliseconds < 1200)
            {
                result = TestFunction(seed, count);
            }
            stopwatch.Stop();

            for (int repeat = 0; repeat < 20; ++repeat)
            {
                stopwatch.Reset();
                stopwatch.Start();
                result ^= TestFunction(seed, count);
                stopwatch.Stop();
                Console.WriteLine("Ticks: " + stopwatch.ElapsedTicks +
                    " mS: " + stopwatch.ElapsedMilliseconds);
            }

            // Print the accumulated result so the calls cannot be optimized away.
            Console.WriteLine(result);
        }

        public static long TestFunction(long seed, int count)
        {
            long result = seed;
            for (int i = 0; i < count; ++i)
            {
                result ^= i ^ seed;
            }
            return result;
        }
    }
}
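When comparing two algorithms, it can also help to aggregate the 20 measurements instead of reading them one by one. A minimal sketch of such a helper (the class and method names are mine, not part of the original sample; it assumes the per-run tick values have been collected into a list):
using System;
using System.Collections.Generic;
using System.Linq;

static class MeasurementStats
{
    // Prints min, max, average and the relative spread of a series of
    // Stopwatch tick counts collected from repeated test runs.
    public static void Print(List<long> ticks)
    {
        long min = ticks.Min();
        long max = ticks.Max();
        double avg = ticks.Average();
        double spreadPercent = 100.0 * (max - min) / min;
        Console.WriteLine("Min: {0}  Max: {1}  Avg: {2:F0}  Spread: {3:F2} %",
            min, max, avg, spreadPercent);
    }
}
In the measurement loop above, one would add stopwatch.ElapsedTicks to the list after each stopwatch.Stop() and call MeasurementStats.Print once at the end.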
Results
Without Correct Preparation
- Ticks: 1580367 mS: 572 <-- highest Value
- Ticks: 1577003 mS: 571
- Ticks: 1576140 mS: 571
- Ticks: 1560964 mS: 565
- Ticks: 1351663 mS: 489
- Ticks: 1248383 mS: 452
- Ticks: 1115361 mS: 404
- Ticks: 1112813 mS: 403
- Ticks: 1113112 mS: 403
- Ticks: 1112012 mS: 402 <-- lowest Value
- Ticks: 1330444 mS: 482
- Ticks: 1558493 mS: 564
- Ticks: 1501097 mS: 543
- Ticks: 1517796 mS: 549
- Ticks: 1542712 mS: 558
- Ticks: 1574959 mS: 570
- Ticks: 1483975 mS: 537
- Ticks: 1390578 mS: 503
- Ticks: 1546904 mS: 560
- Ticks: 1349507 mS: 488
The run time varies between 402 ms and 572 ms. That is a difference of 170 ms, or 42%. These results are useless.
With Correct Preparation
- Ticks: 1110518 mS: 402
- Ticks: 1110540 mS: 402
- Ticks: 1110543 mS: 402
- Ticks: 1110684 mS: 402 <-- highest Value
- Ticks: 1110508 mS: 402
- Ticks: 1110553 mS: 402
- Ticks: 1110600 mS: 402
- Ticks: 1110433 mS: 402 <-- lowest Value
- Ticks: 1110509 mS: 402
- Ticks: 1110508 mS: 402
- Ticks: 1110489 mS: 402
- Ticks: 1110568 mS: 402
- Ticks: 1110503 mS: 402
- Ticks: 1110566 mS: 402
- Ticks: 1110625 mS: 402
- Ticks: 1110474 mS: 402
- Ticks: 1110571 mS: 402
- Ticks: 1110448 mS: 402
- Ticks: 1110555 mS: 402
- Ticks: 1110495 mS: 402
The 20 samples produce the same result of 402 ms and can only be distinguished by the tick count (the internal performance counter value). The difference is 251 ticks or 0.02%. On my system, the Stopwatch frequency is 2760029 ticks per second, so the difference between the test runs is only 0.09 ms. That is really good and can be used to measure and compare algorithm run times.
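For reference, Stopwatch ticks are converted to time via the static Stopwatch.Frequency field (ticks per second). A small helper (the method name is mine) reproduces the 0.09 ms figure above:
// Converts raw Stopwatch ticks to milliseconds via the counter frequency.
// With the frequency reported above (2,760,029 ticks/s), a spread of
// 251 ticks gives 251 * 1000.0 / 2760029 ≈ 0.09 ms.
static double TicksToMilliseconds(long ticks)
{
    return ticks * 1000.0 / Stopwatch.Frequency;
}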
Points of Interest
One very important thing should be kept in mind: the best (lowest) value without preparation isn't as good as the worst value with preparation. CPU context switches and core switches have a huge impact on the run time of applications.
Links and Additional Resources
This article is one of the performance tests that can be found on benchmarks.codeplex.com. They serve as research for the "Second WAF" web application framework.
History
- 28th February, 2010: Initial post