Introduction
This is the first of two (or possibly three, depending on interest) articles on optimizing serialization, especially for use in remoting.
This first article includes general-purpose code which is used to store 'owned data' (defined later) into a compact unit with maximum speed. The second article will provide an example of how to use this code to serialize datasets as self-contained units. The possible third article will cover how to serialize Entity and EntityCollection objects from LLBLGenPro - the leading O/R mapper - as an example of how to take over the whole serialization process to get excellent results. Although this is specific to one application, you may find the techniques used useful in your code.
The code in this article was inspired by three articles on CodeProject.
Background
If you've ever used .NET remoting for large amounts of data, you will have found that there are problems with scalability. For small amounts of data, it works well enough, but larger amounts take a lot of CPU and memory, generate massive amounts of data for transmission, and can fail with Out Of Memory exceptions. There is also a big problem with the time taken to actually perform the serialization - large amounts of data can make it unfeasible for use in apps, regardless of the CPU/memory overhead, simply because it takes so long to do the serialization/deserialization. Using data compression via Server and Client sinks can help with the resulting transmission size, but doesn't help with the excesses earlier in the process.
Having said that, you can't blame .NET for this, bearing in mind all of the work it does: it will ensure that the whole graph of objects required to rebuild the original object is discovered, and that multiple references to the same object are dealt with properly to ensure that only one common instance is deserialized. It also has to do this via reflection, and be able to do this without having any prior knowledge of the objects involved, so overall it does a pretty good job. It will also allow you to take part in the serialization/deserialization process by letting you implement the ISerializable interface if you know you can do a better job than just recreating the field data via reflection.
'Prior knowledge' is the key here. We can use that to optimize how certain 'owned data' (defined later) is stored, and let .NET deal with the rest. That is what this article will be about.
As a taster, let me give an example of the scale of optimization that may be possible:
I had a set of reference data from 34,423 rows in a database table, and it was stored in a collection of entities. Serializing this (to a MemoryStream for maximum speed) took a whopping 92 seconds, and resulted in a 13.5MB lump of serialized data. Deserializing this took around 58 seconds - not very usable in a remoting scenario!
By using the techniques in this article, I was able to serialize the same data down to 2.1MB, which took just 0.35 seconds to serialize, and 0.82 seconds to deserialize! Also, the CPU usage and memory were just a fraction of that used by the raw .NET serializer.
Using the Code
As mentioned in the Introduction, the downloadable code is pretty general purpose (have to avoid using 'generic' here!), so there is nothing specific to remoting as such. Basically, you put 'owned' data into an instance of the SerializationWriter class, and then use the ToArray method to get a byte[] containing the serialized data. This can then be stored in the SerializationInfo parameter passed to the ISerializable.GetObjectData() method, as normal, like this:
public virtual void GetObjectData(SerializationInfo info,
                                  StreamingContext context)
{
    SerializationWriter writer = new SerializationWriter();
    writer.Write(myInt32Field);
    writer.Write(myDateTimeField);
    writer.Write(myStringField);
    info.AddValue("data", writer.ToArray());
}
Deserialization is essentially the reverse: in your deserialization constructor, you retrieve the byte[] and create a SerializationReader instance, passing the byte[] to its constructor. Data is then retrieved in the same order that it was written:
protected EntityBase2(SerializationInfo info, StreamingContext context)
{
    SerializationReader reader = new SerializationReader(
        (byte[]) info.GetValue("data", typeof(byte[])));
    myInt32Field = reader.ReadInt32();
    myDateTimeField = reader.ReadDateTime();
    myString = reader.ReadString();
}
Just copy the FastSerializer.cs file into your project, change the namespace to suit, and it is ready for use. The download also includes a FastSerializerTest.cs file with 220+ tests, but this isn't required - only include it if you want to modify the code and ensure you don't break anything.
(from v2.2)
Each class now has its own file. Just copy the files from the FastSerialization/Serialization folder into your project and change the namespace to suit, and it is ready for use. The download also includes 700+ unit tests under the FastSerializer.UnitTest folder.
Owned Data
I've mentioned previously the concept of 'owned data', so let's try to define this. Owned Data is object data that is:

- Any value-type/struct data such as Int32, Byte, or Boolean. Since value-types are structs and are recreated as they are passed around, any value-type data within your object cannot be affected by another object, and so it is always safe to serialize/deserialize it.
- Strings. Although they are reference-types, they are immutable (cannot be changed once created), and so have value-type semantics, and it is always safe to serialize them.
- Other reference-types created by (or passed in to) your object that are never used by external objects. This would include internal or private Hashtables, ArrayLists, Arrays, etc., because they are not accessible by external objects. It could also include objects that were created externally and passed to your object to use exclusively.
- Other reference-types (again created by your object or passed to it) which might be used by other objects, but which you know will not cause a problem during deserialization. The problem here is that your object itself has no knowledge of what other objects may be serialized in the same object graph. So if it serialized a shared object using the SerializationWriter into a byte[], and that same shared object was serialized again by another external object using its own SerializationWriter, then two instances would end up being deserialized - a different instance for each - because the serialization infrastructure would never get to see them to check and deal with the multiple references.

This may or may not be a problem depending on the shared object: if the shared object is immutable or has no field data, then getting two instances created during deserialization, although inefficient, would not cause a problem. But if there is field data involved and it is supposed to be shared by the two objects referencing it, then it would be a problem, because each now has its own independent copy. The worst-case scenario is when the shared object stores references back to the object that references it: then there is a risk of a loop, which would cause serialization to fail pretty quickly with an OutOfMemoryException or StackOverflowException. Having said all this, it is only a problem if multiple referencing objects are serialized in the same graph. If only one object is part of the graph, then the referenced object can be considered 'owned data' - the other references become immaterial - but it is up to you to identify this situation.
The bottom line is to make sure that only carefully identified 'Owned Data' is stored within the SerializationWriter, and let the .NET serialization infrastructure take care of the rest.
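To make the distinction concrete, here is a rough sketch (the field names are hypothetical) of an ISerializable implementation that keeps owned data in the SerializationWriter but hands a possibly-shared reference to the .NET infrastructure, so that multiple references to it are still detected:

public virtual void GetObjectData(SerializationInfo info,
                                  StreamingContext context)
{
    SerializationWriter writer = new SerializationWriter();
    writer.Write(myInt32Field);   // value type - always owned
    writer.Write(myStringField);  // string - immutable, so safe to own
    info.AddValue("data", writer.ToArray());

    // Possibly shared with other objects in the graph - let .NET
    // serialize it so duplicate references are resolved correctly.
    info.AddValue("shared", mySharedObject);
}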
How Does It Work (Short Version)

SerializationWriter has a number of Write(xxx) methods overloaded for a number of types. It also has a set of WriteOptimized(xxx) methods which can store certain types in a more optimized manner, but may have some restrictions on the values to be stored (which are documented with each method). For data that is unknown at compile-time, there is the WriteObject(object) method, which will store the data type as well as the value so that SerializationReader knows how to restore the data again. The data type is stored using a single byte based around an internal enum called SerializedType.
SerializationReader has a number of methods to match those on SerializationWriter. These can't be overloaded in the same manner, so each is a separate method named to describe its use. So, for example, a string written using Write(string) would be retrieved using ReadString(), WriteOptimized(Int32) would be retrieved using ReadOptimizedInt32(), WriteObject(object) would be retrieved using ReadObject(), and so on. As long as the equivalent method to retrieve data is called on SerializationReader and, importantly, in the same order, then you will get back exactly the same data that was written.
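As a quick sketch of the pairing (the values are arbitrary):

SerializationWriter writer = new SerializationWriter();
writer.Write("Hello");              // paired with ReadString()
writer.WriteOptimized(42);          // paired with ReadOptimizedInt32()
writer.WriteObject(DateTime.Today); // paired with ReadObject()

SerializationReader reader = new SerializationReader(writer.ToArray());
string s   = reader.ReadString();
int    i   = reader.ReadOptimizedInt32();
object obj = reader.ReadObject();   // type byte tells it a DateTime follows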
How Does It Work (Long Version)
The Write(xxx) methods store data using the normal size for the type, so an Int32 will always take 4 bytes, a Double will always take 8 bytes, etc. The WriteOptimized(xxx) methods use an alternative storage method that depends not only on the type but also on its value. So, an Int32 that is less than 128 can be stored in a single byte (by using 7-bit encoding), but Int32 values of 268,435,456 or larger, or negative numbers, can't be stored using this technique (otherwise they would take 5 bytes, and so can't be considered optimizable!). If you want to store a value, say the number of items in a list, that you know will never be negative and will never reach the limit, then use the WriteOptimized() method instead.
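For illustration, here is a minimal sketch of the 7-bit encoding idea (not the library's actual code): each byte carries seven bits of the value, and the high bit flags that another byte follows, which is why values of 2^28 (268,435,456) or more, and negative numbers, would spill into a fifth byte and so aren't optimizable.

static void Write7BitEncoded(Stream stream, uint value)
{
    while (value >= 0x80)
    {
        stream.WriteByte((byte) (value | 0x80)); // low 7 bits + 'more' flag
        value >>= 7;
    }
    stream.WriteByte((byte) value);              // final byte, high bit clear
}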
DateTime is another type that has an optimized method. Its limitation is that it can't optimize a DateTime value that is precise to sub-millisecond levels, but for the common case where just a date is stored without a time, it will take up only 3 bytes in the stream. (A DateTime with hh:mm and no seconds will take 5 bytes - still much better than the 8 bytes taken by the Write(DateTime) method.)

The WriteObject() method uses the SerializedType enumeration to describe the type of object which is next in the stream. The enumeration is defined as byte, which gives us 256 possible values. Each basic type takes up one of these values, but 256 is quite a lot to go at, so I've made use of some of them to encode well-known values in with their type. So each numeric type also has a 'zero' version (some also have a 'minus one' version too), which allows just a single byte to specify both the type and the value. This allows objects which have a lot of data, whose type is unknown at compile-time, to be serialized very compactly.
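Conceptually, the dispatch looks something like this sketch (the helper names are hypothetical; the SerializedType names are from the table below):

void WriteObjectSketch(object value)
{
    if (value == null)
    {
        WriteTypeCode(SerializedType.NullType);                        // 1 byte total
    }
    else if (value is int)
    {
        int i = (int) value;
        if (i == 0)       WriteTypeCode(SerializedType.ZeroInt32Type);     // 1 byte
        else if (i == -1) WriteTypeCode(SerializedType.MinusOneInt32Type); // 1 byte
        else { WriteTypeCode(SerializedType.Int32Type); Write(i); }    // 1 + 4 bytes
    }
    // ...one case per supported type...
}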
Since strings are used quite extensively, I have paid special attention to them to ensure that they are always optimized (strings don't have a WriteOptimized(string) overload since they are always optimized!). To that end, I have allocated 128 of the 256 values for string usage - actually, string lists. This allows any string to be written using a string list token (consisting of a byte plus an optimized Int32), to ensure that a given string value is written once and once only - if a string is seen multiple times, then only the string list token is stored multiple times. By making 128 string lists available, each having fewer strings, rather than one string list containing many strings, the string token will be just 1 byte for the first 16,256 unique strings, thereafter taking 2 bytes for the next 2,080,768 unique strings! That should be enough for anyone! Special care has been taken to generate a new list once the current list reaches 127 in size, to take advantage of the smaller string token size - once all 128 possible lists are created, strings are allocated to them in round-robin fashion. Strings are only tokenized if they are longer than two chars - Null, Empty, 'Y', 'N', and a single space have their own SerializedType codes, and other 1-char strings will take 2 bytes (1 for the type, and 1 for the char itself).
The other big advantage of using string tokens is that, during deserialization, only one instance of a given string will be stored in memory. Whilst the .NET serializer does the same for the same string reference, it doesn't do this where the references are different but the value is the same. This happens a lot when reading database tables that contain the same string value in a column - because the values arrive via a DataReader at different times, they have the same value but different references, and get serialized many times. This doesn't matter to SerializationWriter - it uses a Hashtable to identify identical strings regardless of their reference. So a deserialized graph is usually more memory-efficient than the graph before serialization.
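A simplified sketch of the tokenization (the real code spreads strings across the 128 round-robin lists; a single list is shown here). Keying the Hashtable on the string value, not the reference, is what makes duplicate values collapse:

Hashtable stringToToken = new Hashtable();
ArrayList stringTable = new ArrayList();

void WriteTokenizedString(string value)
{
    object token = stringToToken[value];
    if (token == null)
    {
        stringToToken[value] = stringTable.Count;
        stringTable.Add(value);
        // first occurrence: the string data itself is also written
    }
    // the (7-bit encoded) token is written to the stream here
}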
Arrays of objects have also been given special attention, since they are so prevalent in database-type work, whether as part of a DataTable or entity classes. Their contents are written using the WriteObject() method, of course, to store the type and/or optimized value, but there are special optimizations, such as looking for sequences of null or DBNull.Value. Where a sequence is identified, a SerializedType identifying the sequence is written (1 byte), followed by the length of the sequence (1 byte, typically). There is also an overloaded method called WriteOptimized(object[], object[]) which takes two object arrays of the same length, such as you might find in a modified DataRow or a modified entity; the first object[] is written as described above, but the values in the second array are compared to their equivalents in the first, and where the values are identical, a specific SerializedType identifies this, thus reducing the size of each identical pair to a single byte, regardless of its normal storage size.
Whilst developing SerializationWriter, I came across a need for serializing an object (a factory class) that would be used by many entities in a collection. Since this factory class had no data of its own, only its type needed serializing, but it would be helpful to ensure that each entity ended up using the same factory instance during deserialization. To this end, I have added tokenization of any object: using WriteTokenizedObject(object) will put a token into the stream to represent that object, and the object itself will be serialized later, just after the string tokens are serialized. I have also added an overload to this method: as an additional space saver, if your object can be recreated by a parameterless constructor, then use the WriteTokenizedObject(object, true) overload instead - this will store just the Type name as a string, and SerializationReader will recreate the object directly using Activator.CreateInstance.
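A sketch of the intended usage (the factory type and the name of the matching read method are assumptions here):

// Every entity writes the same factory instance; only a token goes
// into the stream each time, and with 'true' only the type name is
// stored for the object itself.
writer.WriteTokenizedObject(entityFactory, true);

// On deserialization, each entity gets back one shared instance,
// recreated once via its parameterless constructor.
entityFactory = (MyEntityFactory) reader.ReadTokenizedObject();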
There is just one property available for configuration: SerializationWriter.OptimizeForSize (true by default) controls whether the WriteObject() method should use optimized serialization where possible. Because it has to check whether each value falls within the parameters required for optimization, this takes slightly longer to serialize. Setting this property to false will bypass these checks and use the quick and simple method. In reality, these checks will not be noticeable for small sets of data, and take only a few extra milliseconds for large sets of data (tens of megabytes), so normally, leave this property as-is. All of the optimizations are fully documented in the code, especially the requirements for optimization, which you should see as a tooltip.
Golden Rules for Best Optimization
Here is a list of key points to bear in mind for optimizing your serialization:
- Know your data: By knowing your data, how it is used, what range of values is likely, what range of values is possible, etc., you can identify 'owned' data, and that will help decide which methods are appropriate (or indeed, whether you should not use any methods, and serialize the data directly into the SerializationInfo block). There is always a non-optimized method available for any primitive data type, but using the optimized version gives the best results.
- Read the data in the same order that it was written: Because we are streaming data rather than associating it with a name, it is essential that it is read back in exactly the same order that it was written. Actually, you will find this isn't too much of a problem - the process will fail pretty quickly if there is an ordering problem, and this will be discovered at design time.
- Don't serialize anything that isn't necessary: You can't get better optimization than data that takes up zero bytes! See Techniques for Optimization, later in the article, for an example.
- Consider whether child objects can be considered 'Owned Data': If you serialize a collection, for example, are its contents considered part of the collection, or separate objects to be serialized separately and just referenced? This consideration can have a big impact on the size of the serialized data, since if the former is true, then a single SerializationWriter is effectively shared by many objects and string tokenization can have a dramatic effect (see the sketch after this list). See Part 2 of this article for an example.
- Remember that serialization is a black-box process: By this, I mean that the data you serialize doesn't have to be in the same format as it is in memory. As long as, at the end of deserialization, the data is as it was before serialization, it doesn't matter what happens to it in the meantime - anything goes! Optimize by serializing just enough data to be able to recreate the objects at the other end. See Techniques for Optimization, later in the article, for an example.
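Here is a rough sketch of the fourth rule (the types and the item-level method are hypothetical): the collection treats its items as owned data, so one SerializationWriter - and one shared set of string tokens - covers the entire contents.

public virtual void GetObjectData(SerializationInfo info,
                                  StreamingContext context)
{
    SerializationWriter writer = new SerializationWriter();
    writer.WriteOptimized(items.Count);
    foreach (MyEntity entity in items)
    {
        entity.SerializeOwnedData(writer); // all items share one writer
    }
    info.AddValue("data", writer.ToArray());
}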
SerializedType Enum
Below is a table showing the currently-used SerializedType values. 128 are reserved for string tables, and 70 are listed below, leaving 58 available for other uses.
NullType | Used for all null values
NullSequenceType | Used internally to identify sequences of null values in object arrays
DBNullType | Used for all DBNull.Value instances
DBNullSequenceType | Used internally to identify sequences of DBNull.Value in object arrays (DataSets use this value extensively)
OtherType | Used for any unrecognized types
BooleanTrueType, BooleanFalseType | For the Boolean type and its values
ByteType, SByteType, CharType, DecimalType, DoubleType, SingleType, Int16Type, Int32Type, Int64Type, UInt16Type, UInt32Type, UInt64Type | Standard numeric value types
ZeroByteType, ZeroSByteType, ZeroCharType, ZeroDecimalType, ZeroDoubleType, ZeroSingleType, ZeroInt16Type, ZeroInt32Type, ZeroInt64Type, ZeroUInt16Type, ZeroUInt32Type, ZeroUInt64Type | Optimization to store a numeric type and a zero value in a single byte
OneByteType, OneSByteType, OneDecimalType, OneDoubleType, OneSingleType, OneInt16Type, OneInt32Type, OneInt64Type, OneUInt16Type, OneUInt32Type, OneUInt64Type | Optimization to store a numeric type and a one value in a single byte
MinusOneInt16Type, MinusOneInt32Type, MinusOneInt64Type | Optimization to store a numeric type and a minus one value in a single byte
OptimizedInt32Type, OptimizedUInt32Type, OptimizedInt64Type, OptimizedUInt64Type | Stores 32- and 64-bit types using the fewest bytes possible - see the code for restrictions
EmptyStringType, SingleSpaceType, SingleCharStringType, YStringType, NStringType | Optimization to store empty and single-char strings efficiently (1 or 2 bytes)
ObjectArrayType, ByteArrayType, CharArrayType | Optimizations for common array types
DateTimeType, MinDateTimeType, MaxDateTimeType | DateTime struct with often-used values
TimeSpanType, ZeroTimeSpanType | TimeSpan struct with often-used values
GuidType, EmptyGuidType | Guid struct with often-used values
BitVector32Type | Optimized to store a BitVector32 in 1 to 4 bytes
DuplicateValueType | Used internally when storing a pair of object arrays
BitArrayType | Optimized to store BitArrays
TypeType | Stores a Type as a string (uses the full AssemblyQualifiedName for non-system Types)
SingleInstanceType | Used internally to identify that a tokenized object should be recreated using Activator.CreateInstance()
ArrayListType | Optimization for ArrayList
Techniques for Optimization
OK, we have identified 'Owned Data' and seen how it is possible to store it using fewer bytes than its actual in-memory size, using tokens and well-known values, but is there anything else we can do to improve optimization? Certainly there is... Let's look at an example of Golden Rule #3 - don't serialize anything that isn't necessary.
A straightforward example of this is a Hashtable that is used internally to quickly locate a particular item based on one of its properties. That Hashtable can easily be recreated from the deserialized data, so there is no need to store the Hashtable itself. For most other scenarios, the problem isn't serialization itself, it's the deserialization: the deserialization needs to know what to expect in the data stream - if this isn't implicit, as in the previous example, then you need to store that information somehow.
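For example, a deserialization constructor along these lines (the types are hypothetical) costs zero bytes for the index, because it rebuilds it from the items themselves:

protected MyCollection(SerializationInfo info, StreamingContext context)
{
    SerializationReader reader = new SerializationReader(
        (byte[]) info.GetValue("data", typeof(byte[])));
    int count = reader.ReadOptimizedInt32();
    items = new ArrayList(count);
    itemsByName = new Hashtable(count);
    for (int i = 0; i < count; i++)
    {
        MyItem item = new MyItem(reader.ReadString());
        items.Add(item);
        itemsByName[item.Name] = item; // recreated, never serialized
    }
}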
Enter the BitVector32 class: a little-known class that is your friend here. See the docs for full information, but basically it is a struct taking four bytes that can be used in either of two ways (but not both together!): it can use its 32 bits to store 32 boolean flags, or you can allocate sections of multiple bits to pack in data (the DateTime optimization in SerializationWriter uses this technique, so have a look at the code). In its boolean flag mode, it is invaluable for identifying which bits of data have actually been stored; at deserialization time, your code can check the flags and either read the expected data or take some other action - use a default value, create an empty object, or do nothing (a default value may have already been created in a constructor, for example).
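As a sketch of the section mode (this particular layout is illustrative - see the DateTime optimization in the code for the real one), a day/month/year date packs into 23 bits, which is why a date-only DateTime fits in 3 bytes:

BitVector32 packed = new BitVector32();
BitVector32.Section day   = BitVector32.CreateSection(31);          // 5 bits
BitVector32.Section month = BitVector32.CreateSection(12, day);     // 4 bits
BitVector32.Section year  = BitVector32.CreateSection(9999, month); // 14 bits

packed[day]   = 25;
packed[month] = 9;
packed[year]  = 2006;  // 23 bits used - the date fits in the low 3 bytes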
Other benefits of using a BitVector32 are that boolean data values are stored as a single bit, and the BitVector32 may be stored optimized (provided that no more than 21 bits are used - otherwise use Write(BitVector32) for a fixed four bytes), so a BitVector32 using fewer than 8 flags will take only a single byte! Similarly, if you find you need lots of flags - say you have a large list of objects and you need to store a boolean for each - then use a BitArray, which will still use a single bit per item (just rounded up to the nearest byte) but can store many, many bits.
As an example of how useful bit flags can be, here is some sample code from the fast DataSet serializer I will write about in Part 2. The flags are created using the BitVector32.CreateMask() method, which is overloaded to 'chain' subsequent masks to the previous ones. They are static and read-only, so memory-efficient. This particular set of flags is for a DataColumn: it will take two bytes per serialized column, but note that some data, such as AllowNull and ReadOnly, is already serialized by the flag itself, and that other data will now only be serialized conditionally. In fact, one bit flag (HasAutoIncrement) is used to conditionally serialize three pieces of data (AutoIncrement, AutoIncrementSeed, and AutoIncrementStep).
static readonly int MappingTypeIsNotElement = BitVector32.CreateMask();
static readonly int AllowNull = BitVector32.CreateMask(MappingTypeIsNotElement);
static readonly int HasAutoIncrement = BitVector32.CreateMask(AllowNull);
static readonly int HasCaption = BitVector32.CreateMask(HasAutoIncrement);
static readonly int HasColumnUri = BitVector32.CreateMask(HasCaption);
static readonly int ColumnHasPrefix = BitVector32.CreateMask(HasColumnUri);
static readonly int HasDefaultValue = BitVector32.CreateMask(ColumnHasPrefix);
static readonly int ColumnIsReadOnly = BitVector32.CreateMask(HasDefaultValue);
static readonly int HasMaxLength = BitVector32.CreateMask(ColumnIsReadOnly);
static readonly int DataTypeIsNotString = BitVector32.CreateMask(HasMaxLength);
static readonly int ColumnHasExpression = BitVector32.CreateMask(DataTypeIsNotString);
static readonly int ColumnHasExtendedProperties = BitVector32.CreateMask(ColumnHasExpression);
static BitVector32 GetColumnFlags(DataColumn dataColumn)
{
    BitVector32 flags = new BitVector32();
    flags[MappingTypeIsNotElement] = dataColumn.ColumnMapping != MappingType.Element;
    flags[AllowNull] = dataColumn.AllowDBNull;
    flags[HasAutoIncrement] = dataColumn.AutoIncrement;
    flags[HasCaption] = dataColumn.Caption != dataColumn.ColumnName;
    flags[HasColumnUri] = ColumnUriFieldInfo.GetValue(dataColumn) != null;
    flags[ColumnHasPrefix] = dataColumn.Prefix != string.Empty;
    flags[HasDefaultValue] = dataColumn.DefaultValue != DBNull.Value;
    flags[ColumnIsReadOnly] = dataColumn.ReadOnly;
    flags[HasMaxLength] = dataColumn.MaxLength != -1;
    flags[DataTypeIsNotString] = dataColumn.DataType != typeof(string);
    flags[ColumnHasExpression] = dataColumn.Expression != string.Empty;
    flags[ColumnHasExtendedProperties] = dataColumn.ExtendedProperties.Count != 0;
    return flags;
}
Here are the methods that make use of the flags to serialize/deserialize all of the columns of a DataTable. You can see the flags being used to combine serialization of optional data with mandatory data, such as ColumnName, and defaulted data, such as DataType, where that data is always required but only needs to be serialized if it isn't our chosen default (in this case, typeof(string)).
void SerializeColumns(DataTable table)
{
    DataColumnCollection columns = table.Columns;
    writer.WriteOptimized(columns.Count);
    foreach (DataColumn column in columns)
    {
        BitVector32 flags = GetColumnFlags(column);
        writer.WriteOptimized(flags);
        writer.Write(column.ColumnName);
        if (flags[DataTypeIsNotString])
            writer.Write(column.DataType.FullName);
        if (flags[ColumnHasExpression])
            writer.Write(column.Expression);
        if (flags[MappingTypeIsNotElement])
            writer.WriteOptimized((int) column.ColumnMapping);
        if (flags[HasAutoIncrement])
        {
            writer.Write(column.AutoIncrementSeed);
            writer.Write(column.AutoIncrementStep);
        }
        if (flags[HasCaption]) writer.Write(column.Caption);
        if (flags[HasColumnUri])
            writer.Write((string) ColumnUriFieldInfo.GetValue(column));
        if (flags[ColumnHasPrefix]) writer.Write(column.Prefix);
        if (flags[HasDefaultValue]) writer.WriteObject(column.DefaultValue);
        if (flags[HasMaxLength]) writer.WriteOptimized(column.MaxLength);
        if (flags[ColumnHasExtendedProperties])
            SerializeExtendedProperties(column.ExtendedProperties);
    }
}
void DeserializeColumns(DataTable table)
{
    int count = reader.ReadOptimizedInt32();
    DataColumn[] dataColumns = new DataColumn[count];
    for (int i = 0; i < count; i++)
    {
        BitVector32 flags = reader.ReadOptimizedBitVector32();
        string columnName = reader.ReadString();
        Type dataType = flags[DataTypeIsNotString]
            ? Type.GetType(reader.ReadString())
            : typeof(string);
        string expression = flags[ColumnHasExpression]
            ? reader.ReadString()
            : string.Empty;
        MappingType mappingType = flags[MappingTypeIsNotElement]
            ? (MappingType) reader.ReadOptimizedInt32()
            : MappingType.Element;
        DataColumn column = new DataColumn(columnName, dataType,
                                           expression, mappingType);
        column.AllowDBNull = flags[AllowNull];
        if (flags[HasAutoIncrement])
        {
            column.AutoIncrement = true;
            column.AutoIncrementSeed = reader.ReadInt64();
            column.AutoIncrementStep = reader.ReadInt64();
        }
        if (flags[HasCaption])
            column.Caption = reader.ReadString();
        if (flags[HasColumnUri])
            ColumnUriFieldInfo.SetValue(column, reader.ReadString());
        if (flags[ColumnHasPrefix])
            column.Prefix = reader.ReadString();
        if (flags[HasDefaultValue])
            column.DefaultValue = reader.ReadObject();
        column.ReadOnly = flags[ColumnIsReadOnly];
        if (flags[HasMaxLength])
            column.MaxLength = reader.ReadOptimizedInt32();
        if (flags[ColumnHasExtendedProperties])
            DeserializeExtendedProperties(column.ExtendedProperties);
        dataColumns[i] = column;
    }
    table.Columns.AddRange(dataColumns);
}
In Part 2 of this article, I will go into more details about using bit flags and serializing child objects to take full advantage of the optimization features listed in this article.
Please feel free to post comments/suggestions for improvements here on CodeProject.
Changes from v1 to v2
- Added support for .NET 2.0 using conditional compilation. Either add "NET20" to the conditional compilation symbols in your project properties (on the Build tab, under General), or search for "#if NET20" and manually remove the unwanted code and conditional constructs.
- Supports .NET 2.0 DateTime, including DateTimeKind.
- Added support for typed arrays.
- Added support for Nullable generic types.
- Added support for List<T> and Dictionary<K,V> generic types.
- Added support for optional data compression - see the MiniLZO section below for details.
- Added the IOwnedDataSerializableAndRecreatable interface to allow classes and structs to be recognized as types that can entirely serialize/deserialize themselves (a sketch follows this list).
- Added tests for all new features.
- Fixed one known bug (BitArray deserialized but rounded up to the next 8 bits).
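As referenced in the list above, here is a rough sketch of a self-serializing type. The member signatures shown are an assumption for illustration - check the downloaded source for the exact interface definition:

public class Money : IOwnedDataSerializableAndRecreatable
{
    decimal amount;
    string currencyCode;

    // Member signatures assumed for illustration.
    public void SerializeOwnedData(SerializationWriter writer, object context)
    {
        writer.WriteOptimized(amount);
        writer.Write(currencyCode);
    }

    public void DeserializeOwnedData(SerializationReader reader, object context)
    {
        amount = reader.ReadOptimizedDecimal();
        currencyCode = reader.ReadString();
    }
}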
See History section below for details of all changes.
Changes from v2 to v2.1
- Bugfix in WriteOptimized(decimal), whereby an incorrect flag was used under certain circumstances. Thanks to marcin.rawicki and DdlSmurf for spotting this.
- WriteOptimized(decimal value) will now store decimal values without their scale where possible. This optimization is based on the fact that a given decimal may be stored in different ways depending on how it was created. For example, the number '2' could be stored as '2' with no scaling (if created as decimal d = 2m) or as '200' with a scaling of '2' (if created using decimal d = 2.00m). Data retrieved from SQL Server preserves the scaling, and so would normally be stored in memory using the latter format. There is absolutely no numeric difference between the two, and the only visible effect is seen when you display the number without specific formatting using ToString(). However, from the point of view of optimizing serialization, a '2' can be stored more efficiently than a '200', so the code performs a simple check for this condition and uses a zero scaling where possible. This optimization is on by default, but I have added a static boolean property called DefaultPreserveDecimalScale to allow turning it off if required.
- Negative integers can now be stored optimized. This optimization uses two's complement to transform a negative number and, provided the now-positive number is optimizable, will use a different type code to store both the type and the fact that it should be negated on deserialization.
- Int16/UInt16 values can now be stored optimized. Of course, the potential reductions here are very limited, but they are included for completeness. This also includes typed array support and negative numbers.
- Enum values stored using WriteObject(object) are now automatically optimized where possible. A check is made to determine whether the integer value can be optimized or not, then the Enum type is stored, followed by the optimized or non-optimized value. Since Enums are usually non-negative with a limited range, optimization will be on most of the time. Storing the Enum Type also allows the deserializer to determine the underlying integer type and get the correct size value back.
- Added support for 'Type Surrogates', which allow external helper classes to be written that know how to serialize/deserialize a particular Type or set of Types. This is a relatively simple feature, but it has great scope to extend support to non-directly-supported types without using up the limited type codes, without modification of the Fast Serialization code, and without needing to have control of the Type being serialized. This feature was always in the back of my mind to implement, but special thanks to Dan Avni for giving me good reasons for doing it now, and for feedback and testing in the field. A number of ways of achieving the goal were examined, including a dictionary of delegates and an alternative set of type codes, but the chosen implementation allows good modularization and reuse of code.

A new interface has been created called IFastSerializationTypeSurrogate, which has just three members:

- bool SupportsType(Type), which allows SerializationWriter to query a helper class to see if a given Type is supported;
- void Serialize(SerializationWriter writer, object value), which does the serialization; and
- object Deserialize(SerializationReader reader, Type type), which does the deserialization.

Any number of Type Surrogate helper classes can be used, and they are simply added to a static property called TypeSurrogates on SerializationWriter (no need to duplicate this on SerializationReader), which is either a List<IFastSerializationTypeSurrogate> or, for .NET 1.1, an ArrayList.

The idea is that Type Surrogate helper classes are added once at the start of your app; where WriteObject has exhausted its list of known types for a given object and would otherwise fall back to a plain BinaryFormatter, it will first query each helper in the list to see if the Type is supported. If a match is found, then a type code is stored, followed by the object's Type, and then the Serialize method is called on the helper to do the actual work. Deserialization is the reverse process, and the same set of helpers must be available to perform the deserialization.

There are a couple of sample Type Surrogate helper classes in the download, covering Color, Pair, Triplet, StateBag, Unit, and a simple implementation of Hashtable. The structure of a class implementing IFastSerializationTypeSurrogate could be done in a number of ways, but by making the implementation code for serialization/deserialization also available via public static methods, the helper class can also be used to serialize Types that are known at design-time, maybe as part of a larger class supporting IOwnedDataSerializable.
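To show the shape of a surrogate, here is a minimal sketch for System.Drawing.Color (the download contains a worked-up version; this one just round-trips the ARGB value):

public sealed class ColorSurrogate : IFastSerializationTypeSurrogate
{
    public bool SupportsType(Type type)
    {
        return type == typeof(Color);
    }

    public void Serialize(SerializationWriter writer, object value)
    {
        writer.Write(((Color) value).ToArgb());
    }

    public object Deserialize(SerializationReader reader, Type type)
    {
        return Color.FromArgb(reader.ReadInt32());
    }
}

// Registered once at application startup:
// SerializationWriter.TypeSurrogates.Add(new ColorSurrogate());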
If you create a useful Type Surrogate helper class, you might want to post it here so it can be shared with others to save reinventing the wheel.
Changes from v2.1 to v2.2
- You can now pass any Stream to SerializationReader and SerializationWriter.
- Assumptions about a stream's start Position have been removed. The start Position is stored and used relatively.
- Neither SerializationWriter nor SerializationReader requires a seekable stream. Passing a non-seekable stream just means that the header cannot be updated. ToArray() will still return just the portion of the stream written to by SerializationWriter.
- The stored data stream has been made more linear. There is now either a 4-byte or a 12-byte header, followed immediately by the serialized data.
- Tokenized strings and objects are now written inline as they are first seen, rather than appended later all at once.
- The counts for tokenized strings and tokenized objects (used for pre-sizing the token-table lists in SerializationReader) are now stored in the header for the normal case of using a MemoryStream. (For alternative streams which are not seekable (e.g. a compression stream), or where allowUpdateHeader was set to false in the relevant constructor, the header will not be updated. In this case, you can specify presize information in the SerializationReader constructor, either by passing the final table sizes from the SerializationWriter externally from the stream, or by making a reasonable guess. Alternatively, you can omit the presize information altogether and let the tables grow as the tokenized items are pulled from the data stream, though this can be memory-inefficient and is not recommended.)
- SerializationReader is now a forward-only stream reader - there is no need to jump to the end of the stream and back.
- Once data has been deserialized from a Stream by a SerializationReader, the stream Position will be in the correct position for any subsequent data - no need to jump over token tables.
- For the normal MemoryStream case, a 12-byte header is used: an Int32 for the total length of the serialized data, an Int32 for the string table size, and an Int32 for the object table size. If the header is not to be updated, then there will be just a 4-byte Int32 header, which will be zero.
- Replaced AppendTokenTables() with UpdateHeader(), since there is no longer any appending.
MiniLZO - Realtime Compression
Included with the v2 source code is a file called MiniLZO.cs, which contains a slightly modified version of the code from the Pure C# MiniLZO port article by Astaelan. That article's code is a direct port of the original C version and, as such, didn't store the original uncompressed size. The modifications I have made are as follows:
- Modified the method signatures to enable any part of a byte array to be compressed.
- Stored the uncompressed size in 4 bytes at the end of the compressed data.
- Added a special method overload which takes a MemoryStream and uses its internal byte[] buffer as the source data. In addition, it will look at the unused bytes in this buffer and, where possible, use those bytes to provide in-place compression, thus saving memory.
It is important to note that:

- MiniLZO is covered by the GNU General Public License. The rest of the code in this article is not, however - you are free to do with it as you wish.
- MiniLZO uses 'unsafe' code - 'unsafe' only in the .NET sense, meaning that the code uses pointers and so cannot be guaranteed by the .NET runtime not to corrupt memory. In this context it is pretty safe, however, since it will detect if any of its pointers are outside the range of the byte array and throw an exception. The project in which it is contained (either a separate DLL project or an existing one) will need to have the unsafe option checked for it to compile.
Its benefit is that, being a memory-buffer-only compressor, it is extremely quick to perform compression and even faster to decompress. In my testing, I got a reduction in size of around 45%-55%, which was much faster than other, stream-based compressors at their fastest settings. Other compressors might produce slightly better compression, but at the cost of reduced speed.
Usage of compression has been left entirely optional - however, if you know you will be compressing serialized data one way or another, be sure to set the SerializationWriter.OptimizeForSize static property to false for best results.
To use this compression, where you previously used this code...
byte[] serializedData = mySerializationWriter.ToArray();
...do this instead...
mySerializationWriter.AppendTokenTables();
byte[] serializedData =
    MiniLZO.Compress((MemoryStream) mySerializationWriter.BaseStream);
Deserialization is even simpler...
byte[] uncompressedData = MiniLZO.Decompress(serializedData);
History
- 2006-09-25 v1.0 released onto CodeProject
- 2006-11-27 v2.0 released onto CodeProject
- Added MiniLZO.cs for real-time compression
- FIX: Fixed bug in BitArray serialization - was rounding up to the nearest 8 bits.
- Added .NET 2.0-conditional code where required.
- Added a static DefaultOptimizedForSize boolean property - used by the OptimizeForSize property.
- Renamed DateHasTimeMask to DateHasTimeOrKindMask.
- Added an internal UniqueStringList class for faster string token matching:
  - string tokens now assume a fixed 128 string lists and use round-robin allocation
  - uses arithmetic on deserialization rather than multi-dimensional arrays - proved faster
  - tweaked hashtable sizes - quadruples at lower sizes, then reverts to doubling
- Added the WriteOptimized(string) overload: Write(string) calls a new method rather than the other way around, since the new method is not virtual and therefore slightly faster. Also gives consistent naming.
- Added CLSCompliant attributes where appropriate.
- Reorganized all methods (no improvement, but easier to relate method types).
- Added new Serialized Types:
DuplicateValueSequenceType
ObjectArrayType
EmptyTypedArrayType
EmptyObjectArrayType
NonOptimizedTypedArrayType
FullyOptimizedTypedArrayType
PartiallyOptimizedTypedArrayType
OtherTypedArrayType
BooleanArrayType
ByteArrayType
CharArrayType
DateTimeArrayType
DecimalArrayType
DoubleArrayType
SingleArrayType
GuidArrayType
Int16ArrayType
Int32ArrayType
Int64ArrayType
SByteArrayType
TimeSpanArrayType
UInt16ArrayType
UInt32ArrayType
UInt64ArrayType
OwnedDataSerializableAndRecreatableType
- Added placeholder Serialized types to show how many remain (31 currently)
- Added the IOwnedDataSerializableAndRecreatable interface.
- Refactored the processObject code to keep array determination in a separate method so that it is reusable.
- .NET 2.0 dates, including DateTimeKind, are now fully supported.
- WriteOptimized(object[], object[]) updated: slightly optimized; now looks for sequences of duplicate values.
- writeObjectArray(object[]) updated: slightly optimized; now looks for sequences of duplicate values.
- Now uses .Equals instead of == everywhere.
- Added support for arrays of all primitive/built-in types.
- Added support for structures and arrays of structures.
- Added support for arrays of classes that implement IOwnedDataSerializableAndRecreatable.
- Added support for Dictionary<K,V>:
  - Write<K,V>(Dictionary<K,V> value) method
  - Dictionary<K,V> ReadDictionary<K,V>() method for simple creation of a Dictionary<K,V>
  - ReadDictionary<K,V>(Dictionary<K,V> dictionary) method to populate a pre-created Dictionary
- Added support for List<T>:
  - Write<T>(List<T> value) method
  - List<T> ReadList<T>() method
- Added support for Nullable<T>:
  - WriteNullable(ValueType value) - just calls WriteObject(value), but included for clarity
  - a full list of ReadNullableXXX() methods for all primitive/built-in types
- Refactored how arrays are handled for the .NET 2.0 covariance 'issue'.
- ToArray() refactored to split out token writing from returning the byte array. Allows any external compression routine to work on a MemoryStream if required.
- Reordered processing within WriteObject() - in particular, DBNull has higher priority.
- Test suite updated with lots of tests.
- 2007-02-25 v2.1 released onto CodeProject
- FIX: Fixed bug in WriteOptimized(decimal).
- Added/updated some comments.
- Added optimization to store Decimals without their scale where possible:
  - Added a static DefaultPreserveDecimalScale property - false by default
  - Added a public PreserveDecimalScale property which takes its initial value from the static default, but allows configuration on a per-instance basis
  - Updated WriteObject() to always store Decimals optimized, since there will always be a saving
  - Removed the OptimizedDecimalType typecode for the same reason
- Added support for optimizing Int16/UInt16 values:
  - Added a public constant for HighestOptimizable16BitValue
  - Added an internal constant for OptimizationFailure16BitValue
  - Updated code in WriteObject to look for optimizable 16-bit values
  - Added WriteOptimized(Int16) and WriteOptimized(UInt16) methods
  - Added WriteOptimized(Int16[]) and WriteOptimized(UInt16[]) methods
  - Added ReadOptimizedInt16() and ReadOptimizedUInt16() methods
  - Added ReadOptimizedInt16Array() and ReadOptimizedUInt16Array() methods
  - Updated ReadInt16Array() and ReadUInt16Array() methods
  - Added new Serialized Types: OptimizedInt16Type, OptimizedUInt16Type
- Added support for some negative integer values:
  - Updated code in WriteObject to look for optimizable negative values
  - Added new Serialized Types: OptimizedInt16NegativeType, OptimizedInt32NegativeType, OptimizedInt64NegativeType
- Added support for Enum types:
  - Updated WriteObject to look for Enum values and store them as their Type and integer value - optimized where possible
  - Added new Serialized Types: EnumType, OptimizedEnumType
- Added support for Type Surrogate helpers:
  - Added the IFastSerializationTypeSurrogate interface
  - Added the TypeSurrogates static property (List<IFastSerializationTypeSurrogate> for .NET 2.0, or ArrayList for .NET 1.1)
  - Updated WriteObject to query the helpers in TypeSurrogates before using BinaryFormatter as a last resort
  - Added an internal static method to find a helper class for a given type - shared by SerializationWriter and SerializationReader
  - Added a new Serialized Type
- 2010-05 v2.2 released onto CodeProject
- Corrected some typos and changed some article text where it differs for v2.2.
- Removed conditional compilation - just .NET 2.0 or higher now.
- Separated classes into different files.
- Renamed some methods (did I really use lower-case method names!).
- Used switches rather than nested if/elses where possible.
- Used var.
- The release now contains two projects: one for the code and one for the unit tests. VS2010 is used, but the code is in subfolders and easily extracted.
- AppendTokenTables() replaced with UpdateHeader().
- Storing token tables inline gets around the problem reported by Simon Thorogood, where tokenized strings written by an IOwnedDataSerializableAndRecreatable class were not written into the stream.
- Added support for any stream and any starting position.