Recently, Roger Johansson wrote a post titled Wire – Writing one of the fastest .NET serializers, describing the optimizations that were implemented to make Wire as fast as possible. He also followed up that post with a set of benchmarks, showing how Wire compared to other .NET serialisers.
Using BenchmarkDotNet, this post will analyze the individual optimizations and show how much faster each change is. For reference, the full list of optimizations in the original blog post is:
- Looking up value serializers by type
- Looking up types when deserializing
- Byte buffers, allocations and GC
- Clever allocations
- Boxing, Unboxing and Virtual calls
- Fast creation of empty objects
Looking Up Value Serializers By Type
This optimization changed code like this:
public ValueSerializer GetSerializerByType(Type type)
{
    ValueSerializer serializer;
    if (_serializers.TryGetValue(type, out serializer))
        return serializer;
    ...
}
into this:
public ValueSerializer GetSerializerByType(Type type)
{
    if (ReferenceEquals(type.GetTypeInfo().Assembly, ReflectionEx.CoreAssembly))
    {
        if (type == TypeEx.StringType)
            return StringSerializer.Instance;
        if (type == TypeEx.Int32Type)
            return Int32Serializer.Instance;
        if (type == TypeEx.Int64Type)
            return Int64Serializer.Instance;
        ...
    }
    ...
}
So it has replaced a dictionary lookup with an if statement. In addition, it is caching the Type instance of known types, rather than calculating them every time. As you can see, the optimization pays off in some circumstances but not in others, so it’s not a clear win. It depends on where the type is in the list of if statements. If it’s near the beginning (e.g. System.String), it’ll be quicker than if it’s near the end (e.g. System.Byte[]), which makes sense as all the other comparisons have to be done first.
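To give a flavour of how these trade-offs can be measured, here is a minimal BenchmarkDotNet sketch along the same lines (a simplified stand-in, not the post’s actual benchmark code; the string values stand in for the real serializer instances):

using System;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class SerializerLookupBenchmarks
{
    private readonly Dictionary<Type, string> _serializers = new Dictionary<Type, string>
    {
        { typeof(string), "StringSerializer" },
        { typeof(int), "Int32Serializer" },
        { typeof(byte[]), "ByteArraySerializer" },
    };

    [Benchmark(Baseline = true)]
    public string DictionaryLookup()
    {
        string serializer;
        _serializers.TryGetValue(typeof(byte[]), out serializer);
        return serializer;
    }

    [Benchmark]
    public string IfStatementLookup()
    {
        var type = typeof(byte[]);
        // byte[] is last in the chain, so this is the worst case for the if version
        if (type == typeof(string)) return "StringSerializer";
        if (type == typeof(int)) return "Int32Serializer";
        if (type == typeof(byte[])) return "ByteArraySerializer";
        return null;
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<SerializerLookupBenchmarks>();
}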
Full benchmark code and results
Looking Up Types When Deserializing
The second optimization works by removing all unnecessary memory allocations. It does this by:
- Using a custom struct (value type) rather than a class
- Pre-calculating the hash code once, rather than each time a comparison is needed
- Doing string comparisons with the raw byte[], rather than deserializing to a string first
Full benchmark code and results
Note: These results nicely demonstrate how BenchmarkDotNet can show you memory allocations as well as the time taken.
Interestingly, they hadn’t actually removed all memory allocations, as the comparisons between OptimisedLookup and OptimisedLookupCustomComparer show. To fix this, I sent a P.R. that removes the unnecessary boxing by using a custom comparer rather than the default struct comparer.
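To sketch the idea (the names and layout here are illustrative, not Wire’s exact code), a struct key can carry its pre-computed hash code and compare the raw bytes directly, while a typed IEqualityComparer<T> avoids the boxing that the default comparer can introduce for structs:

using System;
using System.Collections.Generic;

public struct ByteArrayKey
{
    public readonly byte[] Bytes;
    private readonly int _hashCode;

    public ByteArrayKey(byte[] bytes)
    {
        Bytes = bytes;
        _hashCode = ComputeHash(bytes); // hash is calculated once, up front
    }

    private static int ComputeHash(byte[] bytes)
    {
        unchecked
        {
            var hash = 17;
            foreach (var b in bytes)
                hash = hash * 31 + b;
            return hash;
        }
    }

    // Typed comparer: Equals/GetHashCode take ByteArrayKey directly,
    // so the struct is never boxed during dictionary lookups
    public sealed class Comparer : IEqualityComparer<ByteArrayKey>
    {
        public int GetHashCode(ByteArrayKey key) => key._hashCode;

        public bool Equals(ByteArrayKey x, ByteArrayKey y)
        {
            if (x._hashCode != y._hashCode) return false;
            var a = x.Bytes;
            var b = y.Bytes;
            if (a.Length != b.Length) return false;
            for (var i = 0; i < a.Length; i++)
                if (a[i] != b[i]) return false;
            return true; // compared as raw bytes, no string was deserialized
        }
    }
}

The dictionary is then constructed with new Dictionary<ByteArrayKey, Type>(new ByteArrayKey.Comparer()), so lookups neither hash the bytes repeatedly nor allocate.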
Byte Buffers, Allocations and GC
Again, removing unnecessary memory allocations was key in this optimization, most of which can be seen in the NoAllocBitConverter. Clearly serialization spends a lot of time converting from the in-memory representation of an object to its serialized version, i.e. a byte[]. So several tricks were employed to ensure that temporary memory allocations were either removed completely or, if that wasn’t possible, done by re-using a buffer from a pool rather than allocating a new one each time (see “Buffer recycling”).
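The core of the trick looks something like the following sketch (hypothetical names; Wire’s NoAllocBitConverter covers many more types): rather than calling BitConverter.GetBytes, which allocates a fresh array on every call, the value is decomposed into bytes and written into a re-used buffer:

using System;
using System.IO;

public static class NoAllocWriter
{
    // One small buffer per thread, re-used across calls
    [ThreadStatic]
    private static byte[] _buffer;

    public static void WriteInt32(Stream stream, int value)
    {
        var buffer = _buffer ?? (_buffer = new byte[8]);

        // BitConverter.GetBytes(value) would allocate a new byte[4] here;
        // instead the int is split into bytes manually (little-endian)
        buffer[0] = (byte)value;
        buffer[1] = (byte)(value >> 8);
        buffer[2] = (byte)(value >> 16);
        buffer[3] = (byte)(value >> 24);

        stream.Write(buffer, 0, 4);
    }
}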
Full benchmark code and results
Clever Allocations
This optimization is perhaps the most interesting, because it’s implemented by creating a custom data structure, tailored to the specific needs of Wire. So, rather than using the default .NET dictionary, they implemented FastTypeUShortDictionary. In essence, this data structure optimizes for holding zero or one item, but falls back to a regular dictionary when it grows larger. To see this in action, here is the code from the TryGetValue(..) method:
public bool TryGetValue(Type key, out ushort value)
{
    switch (_length)
    {
        case 0:
            value = 0;
            return false;
        case 1:
            if (key == _firstType)
            {
                value = _firstValue;
                return true;
            }
            value = 0;
            return false;
        default:
            return _all.TryGetValue(key, out value);
    }
}
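For context, the fields that TryGetValue(..) relies on look roughly like this (inferred from the method above; a sketch, not the full Wire source):

public class FastTypeUShortDictionary
{
    private int _length;                    // how many entries are stored
    private Type _firstType;                // the single fast-path key
    private ushort _firstValue;             // and its associated value
    private Dictionary<Type, ushort> _all;  // fallback once _length > 1
    ...
}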
Like we’ve seen before, the performance gains aren’t clear-cut. For instance, it depends on whether FastTypeUShortDictionary contains the item you are looking for (Hit vs Miss), but generally it is faster:
Full benchmark code and results
Boxing, Unboxing and Virtual Calls
This optimization is based on a widely used trick that I imagine almost all .NET serialisers employ. For a serializer to be generic, it has to be able to handle any type of object that is passed to it. Therefore, the first thing it does is use reflection to find the public fields/properties of that object, so that it knows the data it has to serialise. Doing reflection like this time-and-time again gets expensive, so the way to get round it is to do the reflection once and then use dynamic code generation to compile a delegate that you can then call again and again.
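As a rough illustration of the pattern (a simplified sketch, not Wire’s actual compiler — TestClass and CompileGetter are made up for the example), the reflection happens once and the resulting Expression tree is compiled into a delegate that can be cached:

using System;
using System.Linq.Expressions;
using System.Reflection;

public class TestClass
{
    public int Value { get; set; }
}

public static class GetterCompiler
{
    public static Func<T, object> CompileGetter<T>(string propertyName)
    {
        // Reflection is done once, here, not on every serialization call
        PropertyInfo property = typeof(T).GetProperty(propertyName);

        ParameterExpression instance = Expression.Parameter(typeof(T), "obj");
        MemberExpression getProperty = Expression.Property(instance, property);

        // Box the result so a single delegate shape works for any property type
        UnaryExpression boxed = Expression.Convert(getProperty, typeof(object));

        return Expression.Lambda<Func<T, object>>(boxed, instance).Compile();
    }
}

// Usage: compile once, cache, then invoke many times
// var getter = GetterCompiler.CompileGetter<TestClass>("Value");
// object value = getter(new TestClass { Value = 42 }); // returns 42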
If you are interested in how to implement this, see the Wire compiler source or this Stack Overflow question. As shown in the results below, compiling code dynamically is much faster than reflection and only a little bit slower than if you read/write the property directly in C# code:
Full benchmark code and results
Fast Creation of Empty Objects
The final optimization trick used is also based on dynamic code creation, but this time it is purely dealing with creating empty objects. Again, this is something that a serializer does many times, so any optimizations or savings are worth it.
Basically, the benchmark is comparing code like this:
FormatterServices.GetUninitializedObject(type);
with dynamically generated code, based on Expression trees:
var newExpression = ExpressionEx.GetNewExpression(typeToUse);
Func<TestClass> optimization = Expression.Lambda<Func<TestClass>>(newExpression).Compile();
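Expanded into a self-contained sketch (ExpressionEx is Wire’s own helper; here Expression.New is used directly and a public parameterless constructor is assumed):

using System;
using System.Linq.Expressions;

public static class FastActivator
{
    public static Func<T> CompileFactory<T>() where T : new()
    {
        // Expression.New(...) emits a call to the parameterless constructor;
        // compiling it once yields a delegate roughly as fast as `new T()`
        NewExpression newExpression = Expression.New(typeof(T));
        return Expression.Lambda<Func<T>>(newExpression).Compile();
    }
}

// Usage: the factory is compiled once and then called repeatedly
// Func<TestClass> factory = FastActivator.CompileFactory<TestClass>();
// TestClass instance = factory();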
However, this trick only works if the constructor of the type being created is empty; otherwise it has to fall back to the slow version. But as shown in the results below, the optimization is a clear win and worth implementing:
Full benchmark code and results
Summary
So it’s obvious that Roger Johansson and Szymon Kulec (who also contributed performance improvements) know their optimizations and as a result, they have steadily made the Wire serializer faster, which makes it an interesting project to learn from.