Alternatives to encode List<string> into one string

11 Nov 2012, CPOL
XML serialization versus a hand-crafted codec (as an alternative to Advanced String Split and Joiner)

Introduction

This tip is an alternative to Advanced String Split and Joiner.

Why the heck this alternative?

I attempt to state the problem more clearly than the original post does. From that clearer statement follow some alternatives for encoding/decoding (or serializing/deserializing) a List<string> to and from a single string.

Example usage

C#
List<string> source = ...
...
ICodecStringList codec = new ... // to be "invented", see this tip
string encoded = codec.Encode(source);
...
List<string> target = codec.Decode(encoded);
...
With:
C#
/// <summary>Encoding a list-of-string to a string and decoding from a string.</summary>
public interface ICodecStringList
{
    /// <summary>Encode the list into a string that can be decoded into a list again.</summary>
    string Encode(List<string> item);
    /// <summary>Decode the formerly encoded list.</summary>
    List<string> Decode(string encoded);
}
I will provide two codec implementations for the sample code above.

The Problem Statement

  1. We need a class that encodes a List<string> into a string.
  2. That very class shall provide a decoding function to revert that encoded string into a List<string> again.
Below I show two approaches:
  • XML serialization
  • hand crafted encode/decode
Finally, the summary argues that XML serialization takes far less effort than a hand-crafted version.
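
The two requirements boil down to a round-trip property: decoding an encoded list must give back the original list. The sketch below checks that property for any codec. CodecCheck and NaiveJoinCodec are hypothetical names introduced only for this illustration, and the naive codec is deliberately simplistic (it is not the code of the original post). It shows why a plain Join/Split is not enough once an element contains the separator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface ICodecStringList
{
    string Encode(List<string> item);
    List<string> Decode(string encoded);
}

/// <summary>Hypothetical naive codec: joins with ';' and splits on ';' without any escaping.</summary>
public sealed class NaiveJoinCodec : ICodecStringList
{
    public string Encode(List<string> item) { return string.Join(";", item); }
    public List<string> Decode(string encoded) { return encoded.Split(';').ToList(); }
}

public static class CodecCheck
{
    /// <summary>True if decoding the encoded list yields the same elements in the same order.</summary>
    public static bool RoundTrips(ICodecStringList codec, List<string> source)
    {
        string encoded = codec.Encode(source);
        List<string> target = codec.Decode(encoded);
        return source.SequenceEqual(target);
    }
}
```

With this, CodecCheck.RoundTrips(new NaiveJoinCodec(), new List<string> { "a;b", "c" }) returns false, because the decoder sees three fields instead of two. An empty list fails as well, since splitting an empty string yields one empty entry.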

XML Serializing

The most straightforward way to write the List<string> into a string is XML serialization. The following codec (coder/decoder) implementation uses it.
C#
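using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

/// <summary>codec with XML serialization</summary>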
public class CodecXml: ICodecStringList
{
    private XmlSerializer<List<string>> _xml = new XmlSerializer<List<string>>();
    /// <summary>encode to xml</summary>
    public string Encode(List<string> item) { return _xml.Serialize(item); }
    /// <summary>decode from xml</summary>
    public List<string> Decode(string encoded) { return _xml.Deserialize(encoded); }
}

/// <summary>Generic XML serializer (omitting namespaces and schema declaration where possible)</summary>
public class XmlSerializer<T> where T: class
{
    /// <summary>
    /// Avoid up-front xml declaration of the namespaces.
    /// If e.g. a nil value must be serialized, the needed namespace is put into the respective element.
    /// </summary>
    private static XmlSerializerNamespaces NoXmlNamespaces
    { get { var ns = new XmlSerializerNamespaces(); ns.Add(string.Empty, string.Empty); return ns; } }
    /// <summary>Omit xml encoding preprocessor header.</summary>
    private static XmlWriterSettings NoXmlEncoding
    { get { return new XmlWriterSettings() { OmitXmlDeclaration = true }; } }
    /// <summary>Serialize into xml.</summary>
    /// <remarks>Throws exceptions if serialization fails.</remarks>
    public string Serialize(T item)
    {
        StringBuilder xml = new StringBuilder();
        using (var writer = XmlWriter.Create(xml, NoXmlEncoding))
        {
            new XmlSerializer(typeof(T)).Serialize(writer, item, NoXmlNamespaces);
        }
        return xml.ToString();
    }
    /// <summary>Deserialize a formerly serialized item.</summary>
    /// <remarks>Throws exceptions if deserialization fails.</remarks>
    public T Deserialize(string xml)
    {
        using (var reader = new StringReader(xml))
        {
            return new XmlSerializer(typeof(T)).Deserialize(reader) as T;
        }
    }
}
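
To see what the XML codec produces, here is a condensed, self-contained sketch of the same idea that calls the framework's XmlSerializer directly. XmlListDemo is a hypothetical name used only for this illustration; the writer settings and the empty namespace mirror the ones in the class above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public static class XmlListDemo
{
    /// <summary>Serialize the list without XML declaration and without namespace attributes.</summary>
    public static string Encode(List<string> items)
    {
        var ns = new XmlSerializerNamespaces();
        ns.Add(string.Empty, string.Empty);
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var xml = new StringBuilder();
        using (var writer = XmlWriter.Create(xml, settings))
        {
            new XmlSerializer(typeof(List<string>)).Serialize(writer, items, ns);
        }
        return xml.ToString();
    }

    /// <summary>Deserialize a string produced by Encode().</summary>
    public static List<string> Decode(string encoded)
    {
        using (var reader = new StringReader(encoded))
        {
            return (List<string>)new XmlSerializer(typeof(List<string>)).Deserialize(reader);
        }
    }
}
```

For the list { "a", "b;c" } I would expect an encoded string of the form <ArrayOfString><string>a</string><string>b;c</string></ArrayOfString>, and Decode() to return the two original entries. Characters such as < and & are escaped by the serializer, so they need no special treatment.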
So much for the standard way of serializing/deserializing. The next approach shows how one could encode by a hand crafted codec. Not sure why one would do so, though... ;-)

Hand crafted Codec

There may be situations where you do not want the standard XML serialization for encoding an object into one string. If so, the following class may give you an idea of how this could be achieved.

The grammar

If you generate text that is to be parsed again, you must have a clear idea of the grammar that is encoded in the string. Here, I suggest a CSV-like grammar in which the List<string> represents one record and each element of the list is a field in that record. The grammar:
[Image 1 and Image 2: diagrams of the grammar]
If we assume BEGIN_CHAR = ", END_CHAR = ", and SEP_CHAR = ; (or ,), we get the CSV grammar for one record.
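
As a worked example of that CSV parametrization, the following minimal sketch encodes one record. It rests on my reading of the grammar from the codec in the next section: a record is a sequence of fields separated by SEP_CHAR, and a field is BEGIN_CHAR, content, END_CHAR, with every END_CHAR inside the content doubled. CsvRecordSketch and its members are hypothetical names, and the sketch assumes non-null fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CsvRecordSketch
{
    const string Begin = "\"";  // BEGIN_CHAR
    const string End = "\"";    // END_CHAR
    const string Sep = ";";     // SEP_CHAR

    /// <summary>field := BEGIN_CHAR content END_CHAR, with embedded END_CHAR doubled.</summary>
    public static string EncodeField(string field)
    {
        return Begin + field.Replace(End, End + End) + End;
    }

    /// <summary>record := field (SEP_CHAR field)*</summary>
    public static string EncodeRecord(IEnumerable<string> fields)
    {
        return string.Join(Sep, fields.Select(EncodeField));
    }
}
```

Encoding the four fields a, b;c, say "hi" and the empty string yields "a";"b;c";"say ""hi""";"". The separator inside the second field is harmless because it sits between the quotes, and the doubled quotes in the third field keep the field end unambiguous.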

The codec

The following implementation provides first a generic encoder/decoder for the grammar above, plus the concrete CSV-like parametrization of the generic one.
C#
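using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

/// <summary>generic class for encoding/decoding string containers</summary>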
public class StringContainerToStringCoDec
{
    private string _begin, _end, _sep, _esc;
    private Regex _tokenizer;
    /// <summary>calculate a separator that does not clash with the begin/end characters</summary>
    private static string GetSep(char begin, char end, string sep)
    {
        if (sep != null && sep.Length == 1)
            foreach (char c in string.Format("{0},;\t", sep[0]))
                if (c != begin && c != end) return new string(c, 1);
        return sep ?? ";";
    }
    /// <summary>parametrize codec grammar</summary>
    public StringContainerToStringCoDec(char fieldBegin, char fieldEnd, string fieldSep)
    {
        _begin = new string(fieldBegin, 1);
        _end = new string(fieldEnd, 1);
        _esc = _end + _end;
        _sep = GetSep(fieldBegin, fieldEnd, fieldSep);
        _tokenizer = new Regex(string.Format(@"{0}((?:{1}|[^{2}])*){2}(?:{3})?",
            Regex.Escape(_begin), Regex.Escape(_esc), Regex.Escape(_end), Regex.Escape(_sep)),
            RegexOptions.Compiled | RegexOptions.Singleline);
    }
    /// <summary>encode strings into 'begin'-content-'end' fields, separated by 'sep'</summary>
    public string Encode(IEnumerable<string> items)
    { return string.Join(_sep, items.Select(
                           item => _begin + (item ?? string.Empty).Replace(_end, _esc) + _end));
    }
    /// <summary>decode a string that was encoded by the sibling Encode() method.</summary>
    /// <remarks>Undefined behaviour if s is not properly encoded</remarks>
    public IEnumerable<string> Decode(string s)
    {
        // Console.WriteLine("PATTERN = {0}", _tokenizer.ToString());
        return _tokenizer.Matches(s).Cast<Match>().Where(m => m.Groups[1].Success)
            .Select(m => m.Groups[1].Value.Replace(_esc, _end));
    }
}
/// <summary>
/// CSV like codec: fields enclosed in "..." (doubling embedded quotes),
/// separated by culture dependent list separator
/// </summary>
public class CsvCoDec : ICodecStringList
{
    private readonly StringContainerToStringCoDec _codec
        = new StringContainerToStringCoDec('"', '"', CultureInfo.CurrentCulture.TextInfo.ListSeparator);
    /// <summary>Encode to CSV like record.</summary>
    /// <remarks>Throws exception if encoding fails.</remarks>
    public string Encode(List<string> item) { return _codec.Encode(item); }
    /// <summary>Decode from CSV like record.</summary>
    /// <remarks>Throws exception if decoding fails.</remarks>
    public List<string> Decode(string encoded) { return _codec.Decode(encoded).ToList(); }
}
This codec cannot distinguish between empty and null entries: Encode() writes a null entry as an empty field, so Decode() returns it as an empty string. If the distinction is needed, a special field token must be invented, e.g. a bare # alongside the normal quoted fields ("...").
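
To make the tokenizer less of a black box, here is a minimal sketch with the pattern written out for one fixed parametrization (begin = end = ", separator = ;). CsvSketch is a hypothetical name; the article's codec takes its separator from the current culture, so the fixed ; is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvSketch
{
    // Pattern for begin = end = '"' and sep = ";":
    //   "((?:""|[^"])*)"(?:;)?
    // Group 1 captures the field content, still with doubled quotes.
    static readonly Regex Tokenizer =
        new Regex("\"((?:\"\"|[^\"])*)\"(?:;)?", RegexOptions.Singleline);

    /// <summary>Quote each field, double embedded quotes, write null as an empty field.</summary>
    public static string Encode(IEnumerable<string> items)
    {
        return string.Join(";", items.Select(
            item => "\"" + (item ?? string.Empty).Replace("\"", "\"\"") + "\""));
    }

    /// <summary>Collect the content of each field and undo the quote doubling.</summary>
    public static List<string> Decode(string encoded)
    {
        return Tokenizer.Matches(encoded).Cast<Match>()
            .Select(m => m.Groups[1].Value.Replace("\"\"", "\""))
            .ToList();
    }
}
```

Decoding "a;b";"say ""hi""";"" gives the three entries a;b, say "hi" and the empty string. Encoding the list { "x", null } gives "x";"", which decodes to x and an empty string: the null is gone, which is exactly the limitation described above.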

A null-aware codec

The adjusted grammar:
[Image 3 and Image 4: diagrams of the adjusted grammar]
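The adjusted, null-aware codec is shown below. Compared with the version above, it adds the _nil field, the fieldNil constructor parameter, a nil check in GetSep(), a second alternative in the tokenizer pattern, and null handling in Encode() and Decode():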
C#
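using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

/// <summary>generic class for encoding/decoding string containers</summary>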
public class StringContainerToStringCoDec
{
    private string _begin, _end, _sep, _esc, _nil;
    private Regex _tokenizer;
    /// <summary>calculate a separator that does not clash with the begin/end/nil characters</summary>
    private static string GetSep(char begin, char end, char nil, string sep)
    {
        if (sep != null && sep.Length == 1)
            foreach (char c in string.Format("{0},;\t|", sep[0]))
                if (c != begin && c != end && c != nil) return new string(c, 1);
        return sep ?? ";";
    }
    /// <summary>parametrize codec grammar</summary>
    public StringContainerToStringCoDec(char fieldBegin, char fieldEnd, char fieldNil, string fieldSep)
    {
        _begin = new string(fieldBegin, 1);
        _end = new string(fieldEnd, 1);
        _esc = _end + _end;
        _nil = new string(fieldNil, 1);
        _sep = GetSep(fieldBegin, fieldEnd, fieldNil, fieldSep);
        _tokenizer = new Regex(string.Format(@"(?:{0}((?:{1}|[^{2}])*){2}|({4}))(?:{3})?",
            Regex.Escape(_begin), Regex.Escape(_esc), Regex.Escape(_end), Regex.Escape(_sep),
            Regex.Escape(_nil)), 
            RegexOptions.Compiled | RegexOptions.Singleline);
    }
    /// <summary>encode fields into 'begin-char' - content - 'end-char' fields, separated by 'sep-string'</summary>
    public string Encode(IEnumerable<string> items)
    { return string.Join(_sep, items.Select(item => item == null ? _nil :  _begin + item.Replace(_end, _esc) + _end)); }
    /// <summary>decode a string that was encoded by the sibling Encode() method.</summary>
    /// <remarks>Undefined behaviour if s is not properly encoded</remarks>
    public IEnumerable<string> Decode(string s)
    {
        // Console.WriteLine("PATTERN = {0}", _tokenizer.ToString());
        return _tokenizer.Matches(s).Cast<Match>().Where(m => m.Groups[1].Success || m.Groups[2].Success)
            .Select(m => m.Groups[2].Success ? null : m.Groups[1].Value.Replace(_esc, _end));
    }
}
/// <summary>
/// CSV like codec: fields enclosed in "..." (doubling embedded quotes),
/// separated by culture dependent list separator
/// </summary>
public class CsvCoDec : ICodecStringList
{
    private readonly StringContainerToStringCoDec _codec
        = new StringContainerToStringCoDec('"', '"', '#', CultureInfo.CurrentCulture.TextInfo.ListSeparator);
    /// <summary>Encode to CSV like record.</summary>
    /// <remarks>Throws exception if encoding fails.</remarks>
    public string Encode(List<string> item) { return _codec.Encode(item); }
    /// <summary>Decode from CSV like record.</summary>
    /// <remarks>Throws exception if decoding fails.</remarks>
    public List<string> Decode(string encoded) { return _codec.Decode(encoded).ToList(); }
}
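
A minimal sketch of the null-aware variant in action, again with a fixed parametrization (begin = end = ", nil = #, separator = ;). NullAwareSketch is a hypothetical name, and the fixed separator is an assumption of this sketch rather than what CsvCoDec uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NullAwareSketch
{
    // Pattern for begin = end = '"', nil = '#' and sep = ";":
    //   (?:"((?:""|[^"])*)"|(\#))(?:;)?
    // Group 1 captures a quoted field, group 2 captures the bare nil token.
    static readonly Regex Tokenizer =
        new Regex("(?:\"((?:\"\"|[^\"])*)\"|(\\#))(?:;)?", RegexOptions.Singleline);

    /// <summary>Write null as a bare '#', everything else as a quoted field.</summary>
    public static string Encode(IEnumerable<string> items)
    {
        return string.Join(";", items.Select(
            item => item == null ? "#" : "\"" + item.Replace("\"", "\"\"") + "\""));
    }

    /// <summary>Return null for the nil token, the unescaped content otherwise.</summary>
    public static List<string> Decode(string encoded)
    {
        return Tokenizer.Matches(encoded).Cast<Match>()
            .Select(m => m.Groups[2].Success ? null : m.Groups[1].Value.Replace("\"\"", "\""))
            .ToList();
    }
}
```

The list { "a", null, "", "#" } encodes to "a";#;"";"#" and decodes back to the same four entries. The literal # in the last entry is not mistaken for null because it sits inside quotes.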

Summary

The original post on using "advanced" Join/Split to encode/decode a List<string> into a string is, in my eyes, not a useful approach to the encoding/decoding problem. Either one uses established means (e.g. XML serialization), or one defines the problem carefully enough to cover the encoding/decoding in a reasonable way: define the grammar for the encoding.

Once the grammar is defined, the encoding is usually easy (the only challenge is to handle null values and embedded end-of-field characters). The decoder is more of a challenge: you must tokenize the string and parse it. This is done either by hand again or by means of Regex (which, admittedly, may be a bit of a challenge of its own for many... ;-)).
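
For comparison, this is roughly what the hand-written alternative to the Regex tokenizer could look like for the plain CSV case (begin = end = "). It is a sketch under my own assumptions, not part of the codec above: HandTokenizer is a hypothetical name, anything between fields is skipped as a separator, and an unterminated field is reported with a FormatException.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HandTokenizer
{
    const char Quote = '"';

    /// <summary>Decode quoted fields character by character, undoing doubled quotes.</summary>
    public static List<string> Decode(string s)
    {
        var result = new List<string>();
        int i = 0;
        while (i < s.Length)
        {
            // Outside a field: skip everything up to the next opening quote.
            if (s[i] != Quote) { i++; continue; }
            i++; // consume the opening quote

            var field = new StringBuilder();
            while (i < s.Length)
            {
                if (s[i] == Quote)
                {
                    // A doubled quote is an escaped quote inside the field.
                    if (i + 1 < s.Length && s[i + 1] == Quote)
                    {
                        field.Append(Quote);
                        i += 2;
                        continue;
                    }
                    break; // a single quote closes the field
                }
                field.Append(s[i]);
                i++;
            }
            if (i >= s.Length) throw new FormatException("Unterminated field.");
            i++; // consume the closing quote
            result.Add(field.ToString());
        }
        return result;
    }
}
```

Even this small loop needs more care than the one-line pattern, which fits the point about maintenance effort. One difference in behaviour: where the Regex version quietly ignores an unterminated field, this version throws.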

I would go for XML-serialized data unless I had a real issue with that. The hand-crafted variant is simply too much maintenance effort...

History

V1.0  2012-11-11  Initial version as alternative to Advanced String Split and Joiner
V1.1  2012-11-11  Improve GetSep() method to catch null sep argument value

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)