Performance optimization in bigger systems is hard because the measured numbers can vary greatly depending on the measurement method of your choice.
To measure the execution timing of specific methods in your application, you usually use one of the following:
Time Measurement Method | Potential Pitfalls |
--- | --- |
Stopwatch | The most accurate method on recent processors. Internally it uses the RDTSC instruction. Since the counter is processor specific, you can get greatly different values when your thread is scheduled onto another core or the core goes into a power saving mode. But luckily things do change. Intel's Software Developer's Manual Vol. 3b, section 16.11.1 "Invariant TSC" states: "The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource." |
DateTime.Now | Good, but it has a resolution of only about 16 ms, which may not be enough if you need more accuracy. |
Reporting Method | Potential Pitfalls |
--- | --- |
Console.WriteLine | Ok if not called too often. |
Debug.Print | Are you really measuring performance with Debug builds? Shame on you. |
Trace.WriteLine | Better, but you need to plug in a good output listener such as a trace file. Be aware that the first time you call this method it will read your app.config and deserialize your system.diagnostics section, which also takes time. |
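To get a feeling for the numbers in the first table, here is a minimal sketch (nothing assumed beyond the BCL; the TimerResolution class name is mine) that prints the Stopwatch resolution and busy-waits to observe the DateTime.Now granularity:

    using System;
    using System.Diagnostics;

    class TimerResolution
    {
        static void Main()
        {
            // Stopwatch: is the high resolution counter used and how fine does it tick?
            Console.WriteLine("High resolution: {0}, {1:N0} ticks/s ({2:F1} ns per tick)",
                Stopwatch.IsHighResolution, Stopwatch.Frequency, 1e9 / Stopwatch.Frequency);

            // DateTime.Now: busy wait until the returned value changes to see its granularity
            DateTime start = DateTime.Now;
            DateTime next = start;
            while (next == start)
            {
                next = DateTime.Now;
            }
            Console.WriteLine("DateTime.Now granularity: {0:F1} ms", (next - start).TotalMilliseconds);
        }
    }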
In general it is a good idea to use a tracing library that measures the timing for you; you only need to decorate some methods with tracing so you can later verify whether something has changed for the better or worse. In my previous article I compared measuring performance to quantum mechanics. The analogy works surprisingly well. When you measure a quantum system there is a lower limit to how accurately you can measure it. The Heisenberg uncertainty relation tells us that you cannot measure the momentum and the location of a particle at the same time with infinite accuracy. For programmers the two variables are execution time and memory allocations.
If you try to measure the timings of all methods in your application, you will need to store them somewhere. The fastest storage space besides the CPU cache is main memory. But if your timing values consume all available memory, there is no memory left for the actual application to run.
On the other hand, if you try to record all memory allocations of your application, you will also need to store the data somewhere. This costs you memory and execution time.
These constraints are always there, regardless of how good the marketing of the performance and memory profiler vendors is:
Any measurement will disturb the system in an unpredictable way.
Commercial tool vendors will tell you that they calculate this overhead and subtract it from the measured values to give you the most accurate numbers, but in reality that is not entirely true. After falling into the trap of trusting the profiler timings several times, I have gotten into the habit of the following routine:
- Measure with a profiler to get an idea where potential bottlenecks are.
- Measure again, tracing only the specific methods, to check whether a method is really worth optimizing (see the sketch after this list).
- Optimize it.
- Measure again.
- Be surprised that your optimization has made things worse.
- Think harder.
- Implement something that really works.
- Measure again.
- Finished! Or look for the next bottleneck.
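For the "trace only the specific methods" step you do not necessarily need a full tracing library. A minimal hand-rolled sketch (the MethodTimer and TracedCode names are mine, not any particular library's API) that writes timings to the configured trace listeners could look like this:

    using System;
    using System.Diagnostics;

    // Disposable helper so that a method can be timed with a single using statement
    sealed class MethodTimer : IDisposable
    {
        readonly string _Name;
        readonly Stopwatch _Watch = Stopwatch.StartNew();

        public MethodTimer(string name)
        {
            _Name = name;
        }

        public void Dispose()
        {
            _Watch.Stop();
            // Ends up in whatever trace listener is configured (trace file, debugger, ...)
            Trace.WriteLine(String.Format("{0} took {1:N3} ms", _Name, _Watch.Elapsed.TotalMilliseconds));
        }
    }

    class TracedCode
    {
        void SerializeAll()
        {
            using (new MethodTimer("SerializeAll"))
            {
                // ... the code you suspect to be the bottleneck ...
            }
        }
    }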
Recently I have looked into issues with serialization performance. DataContractSerializer was used for serialization, and I was not sure whether XML is really the optimal wire format. After looking around I found protobuf-net, which uses Google's Protocol Buffers format, a compact binary serialization format. What is good for Google should be good for us.
A small sample app to check out performance was a matter of minutes:
using ProtoBuf;
using System;
using System.Diagnostics;
using System.IO;
using System.Reflection;
using System.Runtime.Serialization;

// protobuf-net can reuse the DataContract attributes as long as an explicit Order is given
[DataContract, Serializable]
class Data
{
    [DataMember(Order = 1)]
    public int IntValue { get; set; }

    [DataMember(Order = 2)]
    public string StringValue { get; set; }

    [DataMember(Order = 3)]
    public bool IsActivated { get; set; }

    [DataMember(Order = 4)]
    public BindingFlags Flags { get; set; }
}

class Program
{
    static MemoryStream _Stream = new MemoryStream();

    // Reuse one MemoryStream and reset it before every serialization so we do not
    // measure the allocation of a new stream on every iteration
    static MemoryStream Stream
    {
        get
        {
            _Stream.Position = 0;
            _Stream.SetLength(0);
            return _Stream;
        }
    }

    static void Main(string[] args)
    {
        DataContractSerializer ser = new DataContractSerializer(typeof(Data));
        Data data = new Data
        {
            IntValue = 100,
            IsActivated = true,
            StringValue = "Hi this is a small string value to check if serialization does work as expected"
        };

        var sw = Stopwatch.StartNew();
        int Runs = 1000 * 1000;
        for (int i = 0; i < Runs; i++)
        {
            Serializer.Serialize<Data>(Stream, data);
        }
        sw.Stop();
        Console.WriteLine("Did take {0:N0}ms for {1:N0} objects", sw.Elapsed.TotalMilliseconds, Runs);
        Console.ReadLine();
    }
}
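The DataContract numbers in the results below were taken with the same skeleton; presumably only the loop body changes to something along these lines, using the ser instance created above:

    for (int i = 0; i < Runs; i++)
    {
        // DataContractSerializer writes its XML representation into the same reused stream
        ser.WriteObject(Stream, data);
    }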
The results are indeed promising:
Serializer | Time in ms | N objects |
--- | --- | --- |
protobuf-net | 807 | 1,000,000 |
DataContract | 4,402 | 1,000,000 |
Nearly a factor of 5 faster and a much more compact wire format. Let's use it! After switching over to protobuf-net, the transferred wire data dropped by a factor of two (good), but the performance got worse by nearly a factor of two. How is that possible? We did measure it! Protobuf-net is much faster! As it turns out, protobuf-net is faster, but at a cost: the first time a type is serialized or deserialized it uses some very smart code generation, which does not come for free. Let's try to measure this by setting the Runs value of our performance test app not to one million but to 1.
Serializer | Time in ms | N objects |
--- | --- | --- |
protobuf-net | 85 | 1 |
DataContract | 24 | 1 |
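If the serialized types are known up front, this one-time cost can at least be shifted to application startup instead of hitting the first real request. A sketch, assuming a protobuf-net version that exposes Serializer.PrepareSerializer<T>():

    // Pay the one-time code generation cost at startup instead of during the first real serialization
    Serializer.PrepareSerializer<Data>();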
The code-gen overhead is significant and can take up to 200 ms for more complex types. The break-even point where the code-gen cost is amortized by the faster serialization is (assuming small objects) somewhere between 20,000 and 40,000 serialized objects, essentially the one-time code-gen cost divided by the few microseconds saved per object. As it turned out, my specific scenario involved about 100 types and 1,000 serializations in total. That explains why the good old DataContractSerializer is not so easy to take out of business. The final approach I ended up with was to reduce the number of types and to serialize primitive types via BinaryWriter directly, which turned out to be a pretty good alternative.
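A minimal sketch of that last approach for the Data class from the sample above (the WriteData/ReadData helpers and the field order are my own choice; the real code had to cover far more types):

    // Could live next to Main in the Program class of the sample above
    static void WriteData(BinaryWriter writer, Data data)
    {
        writer.Write(data.IntValue);
        writer.Write(data.StringValue ?? String.Empty);  // length-prefixed string
        writer.Write(data.IsActivated);
        writer.Write((int)data.Flags);                   // enums go out as their underlying int
    }

    static Data ReadData(BinaryReader reader)
    {
        return new Data
        {
            IntValue = reader.ReadInt32(),
            StringValue = reader.ReadString(),
            IsActivated = reader.ReadBoolean(),
            Flags = (BindingFlags)reader.ReadInt32()
        };
    }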
It sounded good until I measured again and found that my optimizations so far did not help much. After looking more deeply at the profiling data I found that one of the 1,000 calls took 50% of the time. So how do I find out which call it was? Normal profilers fall short at this discipline. A relatively unknown profiler (its obscurity totally undeserved) is SpeedTrace, which, unlike normal profilers, creates traces of your application by instrumenting your IL code at runtime. This way you can look at the full call stack of the one slow serializer call to find out whether this stack was something special.
Unfortunately the call stack showed nothing special. But luckily I have my own tracing as well, and I could see that the slow serializer call happened during the serialization of a bool value. When, after much analysis, you encounter something unreasonable that you cannot explain, chances are good that your thread was suspended by the garbage collector. Whether there is a problem with excessive GCs remains to be investigated, but so far the serialization performance seems to be mostly OK.
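A cheap way to confirm or rule out that suspicion without a profiler is to record the GC collection counts around the suspicious call. A sketch, reusing the names from the sample above:

    // Remember the collection counts before the call ...
    int gen0 = GC.CollectionCount(0);
    int gen1 = GC.CollectionCount(1);
    int gen2 = GC.CollectionCount(2);

    Serializer.Serialize<Data>(Stream, data);

    // ... and log whether a collection happened while we looked "slow"
    Trace.WriteLine(String.Format("GCs during call: Gen0 {0}, Gen1 {1}, Gen2 {2}",
        GC.CollectionCount(0) - gen0,
        GC.CollectionCount(1) - gen1,
        GC.CollectionCount(2) - gen2));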
When you profile a complex system with many interconnected processes, you can never be sure that the timings you just measured are accurate at all. Some process might be hitting the disk, slowing things down for all other processes for several seconds. There is a big difference between warm and cold startup. If you restart all processes you can basically forget the first run, because the OS disk cache, JIT and GCs make the measured timings very volatile. When you are in need of a random number generator you should measure cold startup times of a sufficiently complex system. After the first run you can try again and get different and much lower numbers. Now try again at least two times to get some feeling for how stable the numbers are.
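Within a single process you can get a feeling for that variance by simply repeating the measurement a few times in one go; the first iteration typically carries the JIT and warm-up cost. A sketch, again reusing the serialization loop from above:

    for (int run = 1; run <= 5; run++)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Runs; i++)
        {
            Serializer.Serialize<Data>(Stream, data);
        }
        sw.Stop();
        // Run 1 usually stands out because of JIT and cache warm-up; runs 2-5 show how stable the number is
        Console.WriteLine("Run {0}: {1:N0} ms", run, sw.Elapsed.TotalMilliseconds);
    }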
Oh, and try to do the same thing the next day. It might be that the bottleneck you found yesterday is gone today. Thanks to the GC and other random stuff, it can become pretty hard to find anything worth optimizing once no big bottlenecks are left, except bloatloads of code.
When I have found a spot worth optimizing, I make the code changes and measure again to check whether something has changed. If it has got slower and I am certain that my change should have made it faster, I can blame the GC again. The thing is that if you optimize and allocate fewer objects, the GC pauses will shift to some other location. If you are unlucky, they will make your now faster code look slower because you see GCs at times where none happened before. This is where things get really tricky. A safe escape hatch is to create a repro of the slow code in an isolated application so you can change things quickly in a reliable manner. Then the normal profilers also start working again.
As Vance Morrison points out, it is much harder to profile a system against wall clock time than to optimize for CPU time. The reason is that for wall clock analysis you need to understand how your system works and know which of your threads (if you have not one but perhaps 20) cause a visible delay to the end user and which threads can wait a long time without affecting the user experience at all.
Next time: Commercial profiler shootout.