Symbols as extensible enums

Qwertie

25 Feb 2014

Use the Symbol class for enum-like values that can be extended by other classes.

Download source code - 4.92 KB

Note: this code contains unit tests for NUnit, but you can easily strip out the tests if you don't have NUnit.

Introduction

In C#, sometimes you would like to define a base class, or a library, that uses an enumeration, and you would like to allow derived classes or users of your library to define new values for it. Trouble is, enums are not extensible: a derived class or user code cannot define new values.

For instance, one time, I wrote a library for serializing and deserializing "shapes" with geographic coordinates to a text file. In this library, several different shapes were supported: circles, rectangles, lines, polygons, and so on. Suppose there is a ShapeType enumeration for this:

public enum ShapeType {
    Circle, 
    Rect,
    Line,
    Polygon
}

Enumerations are great for storing in a text file, since you can write t.ToString() to convert a ShapeType t to a string, and Enum.Parse(typeof(ShapeType), s) to convert the string s back to a ShapeType.
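As a quick illustration of that round trip, here is a minimal sketch. The helper class ShapeTypeText and its method names are made up for this example; only ToString and Enum.Parse come from the framework.

```csharp
using System;

public enum ShapeType { Circle, Rect, Line, Polygon }

// Hypothetical helper, named for this example only.
public static class ShapeTypeText
{
    // The text that would be written to the file, e.g. "Rect".
    public static string Write(ShapeType t)
    {
        return t.ToString();
    }

    // Converts text from the file back to the enum.
    // Enum.Parse throws ArgumentException if the name is not defined.
    public static ShapeType Read(string s)
    {
        return (ShapeType)Enum.Parse(typeof(ShapeType), s);
    }
}
```

ShapeTypeText.Read(ShapeTypeText.Write(ShapeType.Polygon)) gives back ShapeType.Polygon, while a name the enum does not define makes Read throw.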

But suppose you would like to allow other developers to define their own shapes. Other developers cannot add new values to ShapeType, and even if they could, there is a risk that two developers would assign the same integer value to different kinds of shapes. How can we solve these problems?

Ruby to the Rescue

Sometimes, when an extensible enum is needed, people use strings or integer constants (const int or readonly int) instead of enums. These solutions have at least the following problems:

Strings and integers are not normally used for enumerations. Therefore, when other developers see a "string" or "int" property or parameter, they don't immediately realize that it is used for an enumeration.
Some of the benefits of static typing are lost, since you can mistype a string or accidentally put a string/integer in a location that was intended to hold an enum value (or vice versa).
Strings can't be renamed with a refactoring tool (like the "Rename" feature of Visual Studio).
Strings are much slower than enumerations when comparing for equality, and they are slower than integers when used as dictionary keys (though due to an odd decision by Microsoft, enums also perform poorly as dictionary keys).
When using integers, it's hard to guarantee that two different developers each use unique values when extending the enumeration.

In the dynamic language Ruby, we commonly use symbols instead of enumerations. Symbols are like string literals, but instead of a string like "Circle", you use the symbol syntax :Circle.

For the most part, symbols solve the above problems. They can be compared as fast as integers, and they cannot be confused with strings. Since anyone can define a new symbol at any time, symbols are like an enumeration of unlimited size. And, if you use them as I prescribe below, it is possible to rename them with a refactoring tool.

(Edit: later I found out that other languages also have Symbols as a built-in concept, e.g. LISP.)

Symbols in .NET

I have written a Symbol implementation for .NET that you can use as an extensible enum. I will now demonstrate how we can rewrite our ShapeType enum to use Symbols instead. First, change enum ShapeType into a class. Then, replace each enum value with a Symbol:

public static class ShapeType
{
    public static readonly Symbol Circle  = GSymbol.Get("Circle");
    public static readonly Symbol Rect    = GSymbol.Get("Rect");
    public static readonly Symbol Line    = GSymbol.Get("Line");
    public static readonly Symbol Polygon = GSymbol.Get("Polygon");
}

If a third party wants to extend this list of symbols, they should write another static class with the additional possibilities. For example, Xyz corporation might write this extension:

public static class FractalShape
{
    public static readonly Symbol Mandelbrot = 
                  GSymbol.Get("XyzCorp.Mandelbrot");
    public static readonly Symbol Julia = GSymbol.Get("XyzCorp.Julia");
    public static readonly Symbol Fern = GSymbol.Get("XyzCorp.Fern");
}

To ensure two independent parties don't accidentally define the same symbol for two different shapes (for example, if two different parties both made a shape called Fern), it is advisable to try to use a unique name when calling GSymbol.Get. That's because GSymbol.Get always returns the same Symbol when given the same input string. Therefore, in this example, I used the prefix "XyzCorp." to ensure that names defined by Xyz corporation are unique.
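To see why the prefix matters, here is a minimal sketch of name interning. MiniSymbol, MiniPool and the three vendor classes are hypothetical stand-ins written for this illustration, not the classes in the download. The only behavior borrowed from the article is that Get returns the same object whenever it is given the same string.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for Symbol: a name wrapped in an object.
public sealed class MiniSymbol
{
    public readonly string Name;
    internal MiniSymbol(string name) { Name = name; }
}

// Hypothetical stand-in for the global pool behind GSymbol.Get.
public static class MiniPool
{
    private static readonly Dictionary<string, MiniSymbol> _byName =
        new Dictionary<string, MiniSymbol>();

    // Returns the existing symbol for this name, or creates it on first use.
    public static MiniSymbol Get(string name)
    {
        lock (_byName)
        {
            MiniSymbol sym;
            if (!_byName.TryGetValue(name, out sym))
            {
                sym = new MiniSymbol(name);
                _byName.Add(name, sym);
            }
            return sym;
        }
    }
}

// Two unrelated vendors pick the same plain name...
public static class VendorA { public static readonly MiniSymbol Fern = MiniPool.Get("Fern"); }
public static class VendorB { public static readonly MiniSymbol Fern = MiniPool.Get("Fern"); }

// ...while a prefixed name stays distinct.
public static class XyzCorpShapes
{
    public static readonly MiniSymbol Fern = MiniPool.Get("XyzCorp.Fern");
}
```

In this sketch VendorA.Fern and VendorB.Fern are the same object even though the two vendors meant different shapes, whereas XyzCorpShapes.Fern is a separate object.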

Typesafe Symbols

When I first wrote this article, people complained that Symbols are not type-safe: you could accidentally mix up two unrelated enumerations, since they both have type Symbol. And by votes of 3, my article was cast into the pit of zero readership. Besides, ShapeType as defined above is not a drop-in replacement for its enum equivalent, because ShapeType variable declarations have to be changed. For example, this declaration:

ShapeType rect = ShapeType.Rect;

would have to be changed to this:

Symbol rect = ShapeType.Rect;

Now you can overcome these limitations using a type-safe "symbol pool". A SymbolPool is a "namespace" for symbols. There is one permanent, global pool (used by GSymbol.Get), and you can create an unlimited number of private pools. I will say more about how they work below, but for now, let me just show you how to make a type-safe extensible enum using SymbolPool<ShapeType>:

public class ShapeType : Symbol
{
    private ShapeType(Symbol prototype) : base(prototype) { }
    public static new readonly SymbolPool<ShapeType> Pool 
                         = new SymbolPool<ShapeType>(p => new ShapeType(p));

    public static readonly ShapeType Circle  = Pool.Get("Circle");
    public static readonly ShapeType Rect    = Pool.Get("Rect");
    public static readonly ShapeType Line    = Pool.Get("Line");
    public static readonly ShapeType Polygon = Pool.Get("Polygon");
}

Since ShapeType's constructor is private, the only way to make a new ShapeType is by calling ShapeType.Pool.Get().

Now, a third party "XyzCorp" can define new ShapeTypes as follows:

public class FractalShape : ShapeType
{
    public static readonly ShapeType Mandelbrot = 
                  Pool.Get("XyzCorp.Mandelbrot");
    public static readonly ShapeType Julia = Pool.Get("XyzCorp.Julia");
    public static readonly ShapeType Fern = Pool.Get("XyzCorp.Fern");
}

Note that the members of FractalShape still have the type ShapeType. It is not necessary to derive FractalShape from ShapeType; I only do so to make it clear that the two are related.

Using Symbols

To convert a Symbol s to a string, call s.Name. You can also call s.ToString(). Originally this prefixed the name with a colon (:) as in Ruby, but I removed the colon in newer versions of Symbol, so Symbols act more like ordinary strings and enums.
Instead of Enum.Parse, when you want to convert a string back to a Symbol, you can call GSymbol.Get(string) to create a global symbol, or Pool.Get(string) where Pool is a private symbol pool.
To get all symbols in a pool, just enumerate the pool:

foreach (ShapeType s in ShapeType.Pool)
    Console.WriteLine(s.Name); // or any other per-symbol work

The symbols are returned in the same order as they were created.

Note that every Symbol you create consumes a small amount of memory that cannot be garbage-collected if the Symbol's pool is stored in a global variable. Therefore, if the strings come from a large file, you may wish to call Pool.GetIfExists(string) instead (where Pool is either a private pool or GSymbol) to avoid a memory leak. GetIfExists does not create new symbols; it only returns symbols that already exist. Therefore, if you get a nonsense name like "fdjlas", GetIfExists returns null instead of a valid Symbol.
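The difference between the two lookups can be sketched as follows. Sym, NamePool and ShapeFileReader are hypothetical names invented for this example and the real SymbolPool is more elaborate. The sketch only assumes what is described above: Get creates a symbol on demand, and GetIfExists returns null for unknown names.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for Symbol.
public sealed class Sym
{
    public readonly string Name;
    public Sym(string name) { Name = name; }
}

// Hypothetical stand-in for a symbol pool.
public sealed class NamePool
{
    private readonly Dictionary<string, Sym> _byName = new Dictionary<string, Sym>();

    public int Count { get { lock (_byName) return _byName.Count; } }

    // Creates the symbol if it does not exist yet, so the pool can grow.
    public Sym Get(string name)
    {
        lock (_byName)
        {
            Sym sym;
            if (!_byName.TryGetValue(name, out sym))
            {
                sym = new Sym(name);
                _byName.Add(name, sym);
            }
            return sym;
        }
    }

    // Never creates anything: returns null when the name is unknown.
    public Sym GetIfExists(string name)
    {
        lock (_byName)
        {
            Sym sym;
            _byName.TryGetValue(name, out sym); // sym is null if not found
            return sym;
        }
    }
}

public static class ShapeFileReader
{
    // Decodes names read from a file without growing the pool.
    public static List<Sym> Decode(NamePool pool, IEnumerable<string> names)
    {
        var known = new List<Sym>();
        foreach (string name in names)
        {
            Sym sym = pool.GetIfExists(name);
            if (sym != null)
                known.Add(sym);
        }
        return known;
    }
}
```

With this approach, a file containing millions of unrecognized names leaves the pool exactly as large as it was before decoding.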

There is a catch: if you use GetIfExists, you need to make sure that all desired symbols already exist. Therefore, before calling ShapeType.Pool.GetIfExists to decode a shape type name, you must make sure that derived types such as FractalShape are initialized. Accessing any ShapeType from FractalShape will do the trick:

// Returns null if FractalShape has never been used
ShapeType s = ShapeType.Pool.GetIfExists("XyzCorp.Fern");

s = FractalShape.Julia;
s = ShapeType.Pool.GetIfExists("XyzCorp.Fern"); // guaranteed to work

How Symbols Work, Briefly

This library has four classes.

A Symbol is simply a small class with a read-only Name, integer Id, and a reference to its Pool. Every Symbol is cataloged in a SymbolPool.
A SymbolPool contains a set of Symbols. SymbolPool contains a List<Symbol> and a Dictionary<string, Symbol> which are used to look up symbols by ID and by name, respectively. SymbolPool is thread-safe; you can safely create Symbols in the same pool from different threads.
SymbolPool<T> is a derived class of SymbolPool that creates Ts, where T is a derived class of Symbol. You pass a factory function to its constructor, and when someone calls Get() to create a T, the SymbolPool calls your factory function, passing a "prototype" as a parameter (the "prototype" is a Symbol that you can use to construct a T).
GSymbol contains the "global" SymbolPool. Call GSymbol.Get to create a "global" Symbol.
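To make the factory-and-prototype idea concrete, here is a simplified sketch. ProtoSym, TypedPool<T>, GetById and Color are hypothetical names for this illustration. Unlike the real SymbolPool<T>, this pool does not derive from a non-generic base class, and the details of the actual implementation may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for Symbol.
public class ProtoSym
{
    public readonly string Name;
    public readonly int Id;

    public ProtoSym(string name, int id) { Name = name; Id = id; }

    // Lets a derived class build itself from a prototype.
    protected ProtoSym(ProtoSym prototype)
    {
        Name = prototype.Name;
        Id = prototype.Id;
    }
}

// Hypothetical stand-in for SymbolPool<T>.
public class TypedPool<T> where T : ProtoSym
{
    private readonly Func<ProtoSym, T> _factory;
    private readonly List<T> _byId = new List<T>();             // lookup by ID
    private readonly Dictionary<string, T> _byName =
        new Dictionary<string, T>();                            // lookup by name

    public TypedPool(Func<ProtoSym, T> factory) { _factory = factory; }

    public T Get(string name)
    {
        lock (_byName) // one lock guards both collections
        {
            T sym;
            if (!_byName.TryGetValue(name, out sym))
            {
                int id = _byId.Count + 1; // IDs start at 1 in this sketch
                sym = _factory(new ProtoSym(name, id));
                _byId.Add(sym);
                _byName.Add(name, sym);
            }
            return sym;
        }
    }

    public T GetById(int id)
    {
        lock (_byName) return _byId[id - 1];
    }
}

public class Color : ProtoSym
{
    private Color(ProtoSym prototype) : base(prototype) { }

    public static readonly TypedPool<Color> Pool =
        new TypedPool<Color>(p => new Color(p));

    public static readonly Color Red = Pool.Get("Red");
    public static readonly Color Green = Pool.Get("Green");
}
```

The pool never needs to know how to construct a Color. It hands a prototype carrying the name and ID to the factory, and the private constructor stays private because the lambda is written inside Color itself.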

Each Symbol has an ID number; this is nothing more than the value of a counter that is incremented each time a symbol is created. IDs are unique within a given pool, but may be duplicated across pools. Private pools have positive IDs by default, starting at 1; the global pool has negative IDs starting at -1, except for GSymbol.Empty, which is the Symbol that represents the empty string (Name == "").

GetHashCode() is fast because it returns an ID number instead of obtaining the hash code of the string; therefore, Symbols are fast when used as keys in a Dictionary. Comparing Symbols for equality is fast, because only the references are compared, not the contents of each Symbol. Two Symbols are the same if and only if they are located at the same memory location.
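The following sketch shows what ID-based hashing and reference equality mean for dictionary lookups. IdSym and HashDemo are hypothetical names for this example, and the public constructor exists only so the sketch can create two look-alike objects. The real library hands out Symbols through a pool precisely so that this cannot happen.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for Symbol.
public sealed class IdSym
{
    public readonly string Name;
    public readonly int Id;

    public IdSym(string name, int id) { Name = name; Id = id; }

    // Hashing costs one field read, no matter how long Name is.
    public override int GetHashCode() { return Id; }

    // Equals is deliberately not overridden, so the inherited
    // object.Equals compares references only.
}

public static class HashDemo
{
    // Returns { found by same object, found by look-alike object }.
    public static bool[] Run()
    {
        var circle = new IdSym("Circle", 1);
        var lookalike = new IdSym("Circle", 1); // same contents, different object

        var sides = new Dictionary<IdSym, int>();
        sides[circle] = 0;

        return new[] { sides.ContainsKey(circle), sides.ContainsKey(lookalike) };
    }
}
```

HashDemo.Run() returns { true, false }: the look-alike has the same hash code but is a different object, so it is not treated as the same key. This is why a pool must guarantee exactly one Symbol per name.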

Besides making type-safe extensible enumerations, another reason to use a SymbolPool is to construct a temporary set of Symbols that can be garbage-collected later. A SymbolPool and all the Symbols it contains can be garbage-collected when there are no references left to the pool itself or any of its Symbols. Note that a Symbol has a reference to its pool, so any lingering references to a Symbol will keep its entire pool alive.

As for me...

In the Loyc compiler tooling project, source code is represented with Loyc trees. In Loyc trees, I use Symbols rather than strings to represent all identifiers in source code (variable and method names) as well as names of built-in operators and constructs. This avoids storing multiple copies of strings and allows fast equality comparison.

History

June 1, 2008: First version.
December 12, 2009: Introduced SymbolPools.
December 14, 2009: Released on CodeProject.
February 24, 2010: Added support for type-safe Symbols.
February 25, 2014: Formatting error corrected. Edited some text based on newer information.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
