
UniqueStringList Revisited

12 May 2011 · CPOL
A useful class now twice as fast.

Introduction

This article is about a class called UniqueStringList, various versions of which I have been using for many years. I thought it was fully optimized but I recently used a couple of new techniques to make it twice as quick and thought I would share them with you.

Background

Reusing the same copy of a distinct string value is called String Interning. String Interning is a good thing for two main reasons:

  • It reduces memory consumption: 100,000 references to the same string is far better than 100,000 references to 100,000 copies of the same string.
  • It speeds up string comparisons: If two string references point to the same place, you know they are equal without having to compare a single character.

The C# compiler automatically Interns all literal strings within your source files when building an assembly. You may have seen the String.Intern() method. Don't use it in your code! Any string you add to this Intern Pool cannot be removed and will stay there until your app closes!
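The effect of the compiler's literal interning is easy to observe (a minimal standalone sketch):

C#
```csharp
using System;

static class LiteralInterningDemo
{
    static void Main()
    {
        // Two identical literals compile down to a single interned instance.
        string a = "hello";
        string b = "hello";
        Console.WriteLine(ReferenceEquals(a, b));        // True

        // A string built at runtime is a distinct instance...
        string c = new string(new[] { 'h', 'e', 'l', 'l', 'o' });
        Console.WriteLine(ReferenceEquals(a, c));        // False

        // ...but IsInterned shows an equal string already sits in the pool.
        Console.WriteLine(string.IsInterned(c) != null); // True
    }
}
```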

Although not String Interning per se, in my Optimizing Serialization in .NET article, one of the biggest optimizations was to produce a token for each unique string being serialized. It did not affect the strings on the serializing end but ensured that at the deserializing end, only one copy of the string would exist - automatically interned if you will.

Prevention of duplicate strings is very beneficial if you can remove the duplicates at source. Imagine a CSV file being read with 100,000 lines and assume that some string appears on every line (e.g., "No fault found"). Since each line is read separately and has no knowledge of the lines above and below it, you will be storing 100,000 copies of that same string in memory. If you can Intern all the strings read from a single line, you will end up with just one copy of each unique string, which can save megabytes of memory.
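A sketch of that CSV scenario (the file name and column index are invented for illustration, and it assumes the UniqueStringList class presented later in this article):

C#
```csharp
using System.Collections.Generic;
using System.IO;

static class CsvInterningSketch
{
    // Returns one column of a CSV file with all duplicate values
    // collapsed to a single shared string instance.
    public static List<string> ReadColumn(string path, int columnIndex)
    {
        var interner = new UniqueStringList();
        var values = new List<string>();

        foreach (var line in File.ReadLines(path))
        {
            var fields = line.Split(',');

            // Every repeat of e.g. "No fault found" now references
            // the same string rather than a fresh copy per line.
            values.Add(interner.Intern(fields[columnIndex]));
        }

        return values;
    }
}
```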

It is a similar case when reading XML files or retrieving data from a middle-tier layer and creating POCOs (Plain Old CLR Objects) for each item: the chances are you will be getting duplicates of the strings involved unless you remove them yourself.

Imagine displaying a large amount of data in a grid. You click on the filter icon in the header and, after a short pause, it displays a list of unique items to filter against. Wouldn't it be nice if the process of generating this list was so quick it appeared instantly?

An even simpler scenario is just asking whether the string has been seen before and maybe skipping some processing if it has.

Of course, the process of interning strings is not free - it takes some time, so the less time it takes, the better, and that is what this article is about. Also, don't forget that you will be saving some time by making fewer memory allocations, so ultimately you could get the best of both worlds: large memory savings with little or no additional overhead.

So we have some scenarios here for string Interning/identifying duplicates but with slightly different requirements:

  1. The 'tokenizing' scenario needs to get an index back and that index must not subsequently change. In addition, it is helpful to know whether that index is for an existing string or a new one.
  2. The 'reuse of identical strings' scenario (i.e., real Interning) requires a string as an output - an index is neither here nor there and it doesn't matter if it is a new string or not.
  3. The 'have I seen this string before' scenario simply needs to return a bool.
  4. The 'just give me all the unique strings' scenario doesn't need to return anything when adding - just a way of getting all of them back out at the end.

The code I eventually came up with uses ideas from three sources:

  1. The original code from the Optimizing Serialization in .NET article. From here I took the concepts of using Golden Primes, accelerating the bucket count for lower capacities, and using a LoadFactor to have more buckets than capacity, reducing the chance of HashCode collisions.
  2. The .NET 4.0 HashSet<T> code as seen in Reflector. Instead of storing the strings directly in an array, I use a Slot struct which also stores the hash code of the string and the index of the Next Slot. The storage of the hash code eliminates a lot of string compares and pre-calculations especially when expanding capacity. The Next index helps with collisions ensuring that the minimum number of compares is achieved. Whilst Reflectoring HashSet<T>, I also discovered a HashHelpers class in the System.Collections namespace which has a number of primes pre-calculated. Unfortunately, this (and a near identical class of the same name in System.Collections.Generic) are marked as internal.
  3. An extremely interesting article by rob tillaart called Optimizing integer divisions with Multiply Shift in C# which shows how to speed up integer divisions, and by implication, modulo operations. The code I have doesn't use divide operations but does use modulo to determine which hash bucket to use for a given hash code.

To promote reusability, I have created a MathHelper class and a QuickDivideInfo struct and the code for these is listed below. These incorporate the ideas of pre-calculating Golden Primes and pre-calculating the Fast Division information for them.

From the QuickDivideInfo struct, only the ModuloPositive() method is actually used by UniqueStringList but I have included the other methods for completeness and checked them against all possible int numerators. Apart from int.MinValue, all other values return exactly the same result as the '%' operator but 3 times faster.

QuickDivideInfo struct

My timing tests have shown that all these methods are inlinable.

I have also added unchecked statements around the arithmetic since there is no chance of overflow. Compiling with the "Check for arithmetic overflow/underflow" option set will therefore not affect the speed increase.

C#
public struct QuickDivideInfo
{
    public readonly int Divisor;
    public readonly long Multiplier;
    public readonly int Shift;

    public QuickDivideInfo(int divisor, long multiplier, int shift)
    {
        Divisor = divisor;
        Multiplier = multiplier;
        Shift = shift;
    }

    public int Divide(int numerator)
    {
        return numerator < 0 ? DivideNegative(numerator) : DividePositive(numerator);
    }

    public int DividePositive(int numerator)
    {
        unchecked
        {
            return (int) ((numerator * Multiplier) >> Shift);
        }
    }

    // Does not work with int.MinValue as the numerator!
    public int DivideNegative(int numerator)
    {
        unchecked
        {
            return (int) -((-numerator * Multiplier) >> Shift);
        }
    }

    public int Modulo(int numerator)
    {
        return numerator < 0 ? ModuloNegative(numerator) : ModuloPositive(numerator);
    }

    public int ModuloPositive(int numerator)
    {
        return numerator - DividePositive(numerator) * Divisor;
    }

    // Does not work with int.MinValue as the numerator!
    public int ModuloNegative(int numerator)
    {
        return numerator - DivideNegative(numerator) * Divisor;
    }
}
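As a quick sanity check (a standalone sketch, assuming the struct above is compiled alongside, and using the (53, 1296593901, 36) entry that appears in the GoldenPrimes table below), the fast divide and modulo can be verified against the built-in operators:

C#
```csharp
using System;

static class QuickDivideCheck
{
    static void Main()
    {
        // Divisor/multiplier/shift taken from the GoldenPrimes table in MathHelper.
        var info = new QuickDivideInfo(53, 1296593901, 36);

        // Spot-check a range of positive numerators against '/' and '%'.
        for (int n = 0; n < 1000000; ++n)
        {
            if (info.DividePositive(n) != n / 53)
                throw new Exception("Divide mismatch at " + n);
            if (info.ModuloPositive(n) != n % 53)
                throw new Exception("Modulo mismatch at " + n);
        }

        Console.WriteLine("All checks passed.");
    }
}
```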

MathHelper

For completeness, I have also added the normal pre-calculated Primes as used by Microsoft and pre-calculated their Fast Division information too.

C#
public static class MathHelper
{
  public const int Lower31BitMask = 0x7fffffff;

  // Allows quadrupling of bucket table size for smaller sizes then
  // reverting to doubling.
  static readonly int[] GoldenPrimeAccelerators =
    new[]
      {
        389, 1543, 6151, 24593, 98317
      };

  // Based on Golden Primes (as far as possible from nearest two powers of two)
  // at http://planetmath.org/encyclopedia/GoodHashTablePrimes.html
  // and Optimizing integer divisions with Multiply Shift in C#
  // at /KB/string/FindMulShift.aspx
  static readonly QuickDivideInfo[] GoldenPrimes =
    new[]
      {
        new QuickDivideInfo(53, 1296593901, 36),    // acceleration skip
        new QuickDivideInfo(97, 354224107, 35),     // acceleration skip
        new QuickDivideInfo(193, 356059465, 36),    // acceleration skip
        new QuickDivideInfo(389, 2826508041, 40),
        new QuickDivideInfo(769, 2859588109, 41),   // acceleration skip
        new QuickDivideInfo(1543, 356290223, 39),
        new QuickDivideInfo(3079, 714200473, 41),   // acceleration skip
        new QuickDivideInfo(6151, 178753313, 40),
        new QuickDivideInfo(12289, 1431539267, 44), // acceleration skip
        new QuickDivideInfo(24593, 2861332257, 46),
        new QuickDivideInfo(49157, 1431510145, 46), // acceleration skip
        new QuickDivideInfo(98317, 2862932929, 48),
        new QuickDivideInfo(196613, 715809679, 47),
        new QuickDivideInfo(393241, 1431564749, 49),
        new QuickDivideInfo(786433, 1431653945, 50),
        new QuickDivideInfo(1572869, 2863302429, 52),
        new QuickDivideInfo(3145739, 2863301519, 53),
        new QuickDivideInfo(6291469, 2863305615, 54),
        new QuickDivideInfo(12582917, 1431655197, 54),
        new QuickDivideInfo(25165843, 1431654685, 55),
        new QuickDivideInfo(50331653, 2863311247, 57),
        new QuickDivideInfo(100663319, 2863310877, 58),
        new QuickDivideInfo(201326611, 2863311261, 59),
        new QuickDivideInfo(402653189, 357913937, 57),
        new QuickDivideInfo(805306457, 2863311215, 61),
        new QuickDivideInfo(1610612741, 1431655761, 61),

      };

  // Based on the list of primes in System.Collections.Generic.HashHelpers
  static readonly QuickDivideInfo[] Primes =
    new[]
      {
        new QuickDivideInfo(3, 2863311531, 33),
        new QuickDivideInfo(7, 2454267027, 34),
        new QuickDivideInfo(11, 780903145, 33),
        new QuickDivideInfo(17, 2021161081, 35),
        new QuickDivideInfo(23, 2987803337, 36),
        new QuickDivideInfo(29, 2369637129, 36),
        new QuickDivideInfo(37, 3714566311, 37),
        new QuickDivideInfo(47, 2924233053, 37),
        new QuickDivideInfo(59, 582368447, 35),
        new QuickDivideInfo(71, 3871519817, 38),
        new QuickDivideInfo(89, 3088515809, 38),
        new QuickDivideInfo(107, 1284476201, 37),
        new QuickDivideInfo(131, 1049152317, 37),
        new QuickDivideInfo(163, 210795941, 35),
        new QuickDivideInfo(197, 1395319325, 38),
        new QuickDivideInfo(239, 2300233531, 39),
        new QuickDivideInfo(293, 3752599413, 40),
        new QuickDivideInfo(353, 3114763819, 40),
        new QuickDivideInfo(431, 2551071063, 40),
        new QuickDivideInfo(521, 1055193501, 39),
        new QuickDivideInfo(631, 871245347, 39),
        new QuickDivideInfo(761, 1444824741, 40),
        new QuickDivideInfo(919, 2392843587, 41),
        new QuickDivideInfo(1103, 498418689, 39),
        new QuickDivideInfo(1327, 3314277703, 42),
        new QuickDivideInfo(1597, 2753942713, 42),
        new QuickDivideInfo(1931, 284700059, 39),
        new QuickDivideInfo(2333, 1885146383, 42),
        new QuickDivideInfo(2801, 785085061, 41),
        new QuickDivideInfo(3371, 2609342339, 43),
        new QuickDivideInfo(4049, 2172411219, 43),
        new QuickDivideInfo(4861, 904761677, 42),
        new QuickDivideInfo(5839, 188304783, 40),
        new QuickDivideInfo(7013, 2508510773, 44),
        new QuickDivideInfo(8419, 2089581429, 44),
        new QuickDivideInfo(10103, 870641693, 43),
        new QuickDivideInfo(12143, 1448751219, 44),
        new QuickDivideInfo(14591, 602843741, 43),
        new QuickDivideInfo(17519, 2008355049, 45),
        new QuickDivideInfo(21023, 1673613285, 45),
        new QuickDivideInfo(25229, 1394600345, 45),
        new QuickDivideInfo(30293, 2322937451, 46),
        new QuickDivideInfo(36353, 3871413319, 47),
        new QuickDivideInfo(43627, 3225926339, 47),
        new QuickDivideInfo(52361, 167989401, 43),
        new QuickDivideInfo(62851, 1119612165, 46),
        new QuickDivideInfo(75431, 932888921, 46),
        new QuickDivideInfo(90523, 97169703, 43),
        new QuickDivideInfo(108631, 2591110979, 48),
        new QuickDivideInfo(130363, 2159163081, 48),
        new QuickDivideInfo(156437, 1799286465, 48),
        new QuickDivideInfo(187751, 2998385913, 49),
        new QuickDivideInfo(225307, 1249295303, 48),
        new QuickDivideInfo(270371, 2082138815, 49),
        new QuickDivideInfo(324449, 1735095357, 49),
        new QuickDivideInfo(389357, 722922605, 48),
        new QuickDivideInfo(467237, 2409697663, 50),
        new QuickDivideInfo(560689, 2008064911, 50),
        new QuickDivideInfo(672827, 1673386929, 50),
        new QuickDivideInfo(807403, 87154425, 46),
        new QuickDivideInfo(968897, 2324085857, 51),
        new QuickDivideInfo(1162687, 1936720557, 51),
        new QuickDivideInfo(1395263, 403472287, 49),
        new QuickDivideInfo(1674319, 336226223, 49),
        new QuickDivideInfo(2009191, 1120749503, 51),
        new QuickDivideInfo(2411033, 933956447, 51),
        new QuickDivideInfo(2893249, 1556589021, 52),
        new QuickDivideInfo(3471899, 1297157443, 52),
        new QuickDivideInfo(4166287, 135120301, 49),
        new QuickDivideInfo(4999559, 56299961, 48),
        new QuickDivideInfo(5999471, 3002664487, 54),
        new QuickDivideInfo(7199369, 2502219085, 54),
      };

  public static bool IsPrime(int candidate)
  {
    if ((candidate & 1) == 0)
    {
      return candidate == 2;
    }

    int max = (int) Math.Sqrt(candidate);

    for (int i = 3; i <= max; i += 2)
    {
      if (candidate % i == 0)
      {
        return false;
      }
    }

    return true;
  }

  public static int GetPrime(int min)
  {
    return GetPrimeCore(Primes, min);
  }

  public static int GetGoldenPrime(int min)
  {
    return GetPrimeCore(GoldenPrimes, min);
  }

  public static QuickDivideInfo GetPrimeInfo(int min)
  {
    return GetPrimeInfoCore(Primes, min);
  }

  public static QuickDivideInfo GetGoldenPrimeInfo(int min, bool accelerate = true)
  {
    if (accelerate)
    {
      foreach (var goldenPrimeAccelerator in GoldenPrimeAccelerators)
      {
        if (min > goldenPrimeAccelerator) continue;

        min = goldenPrimeAccelerator;
        break;
      }
    }

    return GetPrimeInfoCore(GoldenPrimes, min);
  }

  static QuickDivideInfo GetPrimeInfoCore(QuickDivideInfo[] set, int min)
  {
    for (int i = 0; i < set.Length; ++i)
    {
      if (set[i].Divisor >= min)
      {
        return set[i];
      }
    }

    throw new ArgumentOutOfRangeException("min", 
      "You really need a prime larger than 1,610,612,741?!");
  }

  static int GetPrimeCore(QuickDivideInfo[] set, int min)
  {
    for (int i = 0; i < set.Length; ++i)
    {
      int num = set[i].Divisor;
      if (num >= min) return num;
    }

    for (int i = min | 1; i < 2147483647; i += 2)
    {
      if (IsPrime(i))
      {
        return i;
      }
    }

    return min;
  }

  public static bool IsPowerOfTwo(int value)
  {
    return value > 0 && (value & (value - 1)) == 0;
  }

  public static bool IsPowerOfTwo(long value)
  {
    return value > 0 && (value & (value - 1)) == 0;
  }
}

UniqueStringList

I have kept this class lean and mean for performance purposes.

  • Null values are not supported, nor checked for - guarding against them is the responsibility of the caller.
  • There is no option to remove individual values once added. You can however use the Clear() method to remove them all.
  • I've added a few additional members such as Count, Contains(), IndexOf(), and this[] to show it is still a list, but I suspect the Add() and Intern() methods will be used primarily.
  • If you have an idea of the maximum capacity required, then it can be provided in a constructor to pre-size the internal arrays. Note: this will be rounded up to match the Golden Prime used and the LoadFactor but will not be less than specified.
C#
public sealed class UniqueStringList
{
  const float LoadFactor = .72f;

  Slot[] slots;
  int[] buckets;
  int slotIndex;
  QuickDivideInfo quickDivideInfo;

  public UniqueStringList()
  {
    Expand(0);
  }

  public UniqueStringList(int capacity)
  {
    Expand(capacity);
  }

  public int Count
  {
    get { return slotIndex; }
  }

  public void Clear()
  {
    Array.Clear(slots, 0, slotIndex);
    Array.Clear(buckets, 0, buckets.Length);

    slotIndex = 0;
  }

  public string Intern(string item)
  {
    int index;

    return slotIndex == (index = AddIfMissing(item)) ? item : slots[index].Value;
  }

  public void Intern(ref string item)
  {
    int index;

    if (slotIndex != (index = AddIfMissing(item)))
    {
      item = slots[index].Value;
    }
  }

  public void AddRange(IEnumerable<string> items)
  {
    foreach (var item in items)
    {
      if (item == null) continue;

      AddIfMissing(item);
    }
  }

  public bool Add(string item)
  {
    return slotIndex == AddIfMissing(item);
  }

  public bool Add(string item, out int index)
  {
    return slotIndex == (index = AddIfMissing(item));
  }

  public IEnumerable<string> GetItems()
  {
    var index = 0;

    while(index < slotIndex)
    {
      yield return slots[index++].Value;
    }
  }

  public string this[int index]
  {
    get { return slots[index].Value; }
  }

  public bool Contains(string item)
  {
    return IndexOf(item) != -1;
  }

  public int IndexOf(string item)
  {
    int hashCode = item.GetHashCode() & MathHelper.Lower31BitMask;
    int bucketIndex = quickDivideInfo.ModuloPositive(hashCode);

    for (int i = buckets[bucketIndex] - 1; i >= 0; i = slots[i].Next)
    {
      if (slots[i].HashCode == hashCode && string.Equals(slots[i].Value, item))
      {
        return i;
      }
    }

    return -1;
  }

  int AddIfMissing(string item)
  {
    var hashCode = item.GetHashCode() & MathHelper.Lower31BitMask;
    var bucketIndex = quickDivideInfo.ModuloPositive(hashCode);

    for (int i = buckets[bucketIndex] - 1; i >= 0; i = slots[i].Next)
    {
      if (slots[i].HashCode == hashCode && string.Equals(slots[i].Value, item))
      {
        return i;
      }
    }

    if (slotIndex == slots.Length)
    {
      Expand(slots.Length + 1);
      bucketIndex = quickDivideInfo.ModuloPositive(hashCode);
    }

    slots[slotIndex].HashCode = hashCode;
    slots[slotIndex].Value = item;
    slots[slotIndex].Next = buckets[bucketIndex] - 1;
    buckets[bucketIndex] = ++slotIndex;

    return slotIndex - 1;
  }

  void Expand(int capacity)
  {
    quickDivideInfo = MathHelper.GetGoldenPrimeInfo(Math.Max(quickDivideInfo.Divisor + 1, 
                                                   (int) (capacity / LoadFactor)));
    capacity = Math.Max(capacity, (int) (quickDivideInfo.Divisor * LoadFactor));

    buckets = new int[quickDivideInfo.Divisor];

    Array.Resize(ref slots, capacity);

    for (int i = 0; i < slotIndex; ++i)
    {
      int bucketIndex = quickDivideInfo.ModuloPositive(slots[i].HashCode);

      slots[i].Next = buckets[bucketIndex] - 1;
      buckets[bucketIndex] = i + 1;
    }

  }

  struct Slot
  {
    public int HashCode;
    public string Value;
    public int Next;
  }

}

Using the code

For simple interning use, there are two Intern() overloads:

  • The first will return the string passed in or an equivalent already seen:
  • C#
    myString = myUniqueStringList.Intern(myString);
  • The second just updates (or not) the string passed by ref:
  • C#
    myUniqueStringList.Intern(ref myPOCOClass.AStringField);

There are also two Add() overloads:

  • Both will return true if the string was new, otherwise false:
  • C#
    var justAdded = myUniqueStringList.Add("XXX");
  • The second additionally returns the index of the string in the list:
  • C#
    int index;
    
    if (myUniqueStringList.Add("XXX", out index))
    {
        ...
    }

The AddRange() method allows the addition of multiple strings at once (nulls are checked here and will be ignored).

The GetItems() method returns an enumerator to extract the unique strings in the order they were added. The output from this can populate a string[] or List<string> or whatever.
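Putting these together, a sketch of the grid-filter scenario from the introduction (assuming the UniqueStringList class above; the method name is mine):

C#
```csharp
using System.Collections.Generic;

static class FilterListSketch
{
    // Given one column's raw values (duplicates, nulls and all),
    // return the distinct values in first-seen order.
    public static List<string> DistinctValues(IEnumerable<string> columnValues)
    {
        var unique = new UniqueStringList();
        unique.AddRange(columnValues); // AddRange skips nulls for us

        return new List<string>(unique.GetItems());
    }
}
```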

How it works

Overview

The highest level overview for the process is: given an input string, we look in a list for a string that is equal to it and return its index within that list. If the string cannot be found, then it is added to the end of the list and that index is returned.

The main problem is the speed at which the list of strings is searched. We could start at the top and then check each string in turn until we find one that is equal. However, imagine a million strings to search - it would take, on average, 500,000 equality checks to find an existing string (and a full million if the string was not found!). This would be far too slow, of course.

An alternative is to sort the strings as they are added. It would then be very quick to locate a string by using a binary search but incredibly slow to build the list in the first place, since strings would have to be constantly shuffled to keep them sorted. Also, the requirements define a non-changing index so this would not work anyway.

What we need is to retain the idea of being able to just append new strings to the list but when searching the list, reducing the number of comparisons to be the fewest possible. Hashing is the solution; this quickly identifies a very small subset of the strings that could match the input string (or, if you prefer, to exclude all those strings that cannot be equal). We then just need to check against those to see if any is a match.

Hashing uses an array of 'hash buckets' and a calculation that determines into which bucket an item should be placed or associated. For a given item, the calculation will always choose the same bucket, so the contents of all other buckets can be ignored because they cannot contain the sought item. This is what gives hashing its fast search speed.

Hashing is achieved by producing a hash code for an item - a simple int value, usually obtained by its GetHashCode() method, although any method can be used. For our purposes, it is effectively a 'random' number between int.MinValue and int.MaxValue that is guaranteed to be the same for strings with the same content. It is not unique however - completely different strings can produce the same hash code but a good hashing algorithm will produce numbers evenly spread over the whole int range to minimize this. So for example, using words from my test dictionary:

"wren".GetHashCode() -> 1400949012
"inwood".GetHashCode() -> 1400949012

This produces 'collisions' which are expected and dealt with in any hashing system.

But it is not the only source of collisions:

"bougainvillia".GetHashCode() -> 1516975564
"aniseroot".GetHashCode() -> -630508084

Because we need to choose a bucket in an array, we need a positive number, so we have to make the hash code positive. The quickest way is to simply mask off the top bit - the negative flag - and so our latter case ends up with the same positive number, 1516975564, as the former - another source of collisions.

Furthermore, since the buckets are in the form of an array, we need to calculate an index between 0 and buckets.Length - 1, done using a modulo operation, which will result in even more collisions. With more buckets, fewer collisions are created by this final step, reducing the number of equality checks but requiring more memory. With fewer buckets, we save some memory but must do more equality checks - the old speed vs. memory trade-off. So, as is common in hashing, a Load Factor is used for the buckets, defining a reasonable compromise. When this load factor is reached (we are using 72% here), we increase the number of buckets and recalculate the buckets for the existing items to spread them out.
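These two steps - masking off the sign bit and taking the modulo - can be seen in a tiny standalone sketch (the numbers are illustrative, using 53 buckets):

C#
```csharp
using System;

static class BucketIndexSketch
{
    static void Main()
    {
        const int bucketCount = 53;                // a small Golden Prime

        // Mask off the sign bit to force a positive hash code.
        int negativeHash = -630508084;             // "aniseroot" from above
        int positiveHash = negativeHash & 0x7fffffff;
        Console.WriteLine(positiveHash);           // 1516975564

        // Two different positive hash codes can still share a bucket.
        Console.WriteLine(100 % bucketCount);      // 47
        Console.WriteLine(153 % bucketCount);      // 47 - a modulo collision
    }
}
```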

Collisions are dealt with by looking at each item associated with the bucket and performing equality checks to find the exact match - in our case, a string equality check. The number of checks we need to do will be tiny compared to the total strings in the list. Our system will use a linked list of Slots, the head of which is pointed to by the bucket. Although this collision resolution sounds problematic and potentially slow, in reality the figures are excellent. In my tests, after I added 213,557 unique strings, the longest chain is 7 and the average chain length is 1.2717 - much better than the 106,778 average if we tried to search sequentially!

So we will use two arrays. The first array will store Slot structs (which contain the string itself, the positive hash code to save recalculation when expanding, and an index to the next Slot in the linked list). The second array is for the hash buckets - a simple int[] which holds an index to the head item in the Slot array.

Our refined high-level overview changes to: given the input string, we obtain a hash code for it. From the hash code, we get a bucket index. From the bucket index, we get a Slot index. If we are lucky then the Slot will hold the string we are looking for; if not, the Slot.Next property will direct us to another Slot to try - a linked list within the array. If we reach the end of the linked list and have still not found our string, then it must be new and we add it into the system.

Detail

Most of the work is done in the AddIfMissing() method.

The first step is to get a hash code for the item and make it a positive number by ANDing it with 01111111111111111111111111111111 to mask off the negative flag. Then we calculate a bucket index from the positive hash code using our souped-up modulo operation (i.e., the remainder after dividing by the number of buckets).

The content of the bucket is an int acting as a 'pointer' into the slots array, but note that in the for loop we subtract one first to get the actual index to use. The reason is that a newly created array initially contains all zeros (all memory allocations work this way). We could set every value in the buckets array to -1 to indicate an empty bucket, but that would be slower than simply treating the contents of the array as 1-based and always subtracting 1 to get the 0-based head Slot index. An empty bucket holds its original 0, so the loop starts with i == -1 and exits immediately because of the conditional expression.

If the for loop is entered, then we have a valid Slot, but not necessarily one holding our search string. We compare hash codes first (the positive hash code we stored) and only if they are the same do we actually compare the strings. If we find a match, we jump out of the loop with the search complete and the index found; otherwise the loop expression uses the Next member of the Slot to set i to the next Slot to try. (A Next of -1 indicates the end of the list, with the item not found, and exits the loop.)

If execution ends up after the for/next loop, then the search string has not been located and will need to be added to the list. At this point, we check to see if the slots array is currently full and, if it is, call the Expand method to increase it (and the buckets array in proportion), not forgetting to recalculate the bucket index. The last unused Slot is then populated with the information we have and its index returned to the caller. Thus any string is always found. Note how the Next member is set to buckets[bucketIndex] - 1 which was the head of the linked list. This Slot will now become second in the linked list, with our newly populated Slot becoming the head. We increment slotIndex first and then store it in the buckets array (the increment is done first because buckets is 1-based) to make it officially the head Slot for the hash code.

The Expand() method is even simpler. The first thing is to calculate the new number of buckets always ensuring we use a Golden Prime number and one that is larger than the previous one used. We store this QuickDivideInfo struct and adjust the capacity variable to match our LoadFactor.

The increase in bucket capacity invalidates the existing buckets so they are simply thrown away and recreated as a new array. The slots array still holds (mostly) valid information so Array.Resize is used to increase the capacity but preserve the existing Slots. The final step is to recreate the hash buckets from the preserved Slots. The loop iterates over all the existing Slots and uses the positive hash code we preserved to calculate the new bucket index without having to call GetHashCode() again - another improvement over the original code, especially for long strings. The Next member on the Slot is set to whatever the current bucket holds, and then the bucket is updated to point to the current Slot (adjusting for the 1-based indexing in both cases). Thus the linked list is recreated automatically.

The other code is straightforward, but it is worth pointing out that the Add() and Intern() methods all rely on C#'s left-to-right expression evaluation to detect when a string has just been added to the list. The *current* slotIndex is evaluated first and then AddIfMissing() is called, which may increment slotIndex. If the numbers are the same, then the string was just added. Note, however, that if AddIfMissing() were evaluated first, this would not work and the methods would always return false.

Performance testing

A question was asked about the code used to test performance so I have now added a download containing the whole solution (.NET 4.0/VS2010) including the source code above and the performance testing code (based on NUnit 2.5.2).

Whilst I was tidying the code so it was suitable for release, I refactored the speed testing code into an abstract class suitable for reuse. You just need to inherit your test class from TimingTestFixture and you will have access to the same performance testing code.

As an example, a [Test] method would look like:

C#
[Test]
public void UniqueStringList_AddCopies()
{
  TimeProcess(() =>
            {
              var l = new UniqueStringList();

              for (var i = 0; i < words.Length; i++)
              {
                l.Add(words[i]);
                l.Add(copiedWords[i]);
              }

              Assert.AreEqual(words.Length, l.Count);
            });
}

and return this output to the Console (or Unit Test Output Window):

Timing per Process; 10 sets; 20 repeats per set; GC Mode=PerRepeat:

 1: Total set time=639 ms. Average process time=31.950000 ms. GC(62/62/61). Memory used=6,236,952 bytes.
 2: Total set time=640 ms. Average process time=32.000000 ms. GC(61/61/61). Memory used=6,236,952 bytes.
 3: Total set time=639 ms. Average process time=31.950000 ms. GC(61/61/61). Memory used=6,236,952 bytes.
 4: Total set time=626 ms. Average process time=31.300000 ms. GC(2077/2077/61). Memory used=6,236,952 bytes.
 5: Total set time=627 ms. Average process time=31.350000 ms. GC(1088/1088/61). Memory used=6,236,952 bytes.
 6: Total set time=628 ms. Average process time=31.400000 ms. GC(61/61/61). Memory used=6,236,952 bytes.
 7: Total set time=625 ms. Average process time=31.250000 ms. GC(1248/1248/61). Memory used=6,236,952 bytes.
 8: Total set time=626 ms. Average process time=31.300000 ms. GC(596/596/61). Memory used=11,196,296 bytes.
 9: Total set time=627 ms. Average process time=31.350000 ms. GC(912/912/61). Memory used=6,236,952 bytes.
10: Total set time=629 ms. Average process time=31.450000 ms. GC(61/61/61). Memory used=6,236,952 bytes.

All:
Fastest: 625 ms, Slowest: 640 ms, Average: 630.60 ms (31.530000 ms), StdDevP: 5.82 ms (0.92%)

Best 9: (excludes 2).
Fastest: 625 ms, Slowest: 639 ms, Average: 629.56 ms (31.477778 ms), StdDevP: 5.17 ms (0.82%)

Best 8: (excludes 1-2).
Fastest: 625 ms, Slowest: 639 ms, Average: 628.38 ms (31.418750 ms), StdDevP: 4.18 ms (0.67%)

Best 7: (excludes 1-3).
Fastest: 625 ms, Slowest: 629 ms, Average: 626.86 ms (31.342857 ms), StdDevP: 1.25 ms (0.20%)

As you can see, the output is reasonably comprehensive. As well as the per-set and per-process timings, it also shows memory usage and the counts for each Garbage Collection generation collected during each set.

The additional summaries at the bottom show the figures with the slower sets excluded, which is useful when other processes on your machine happen to kick in and slow things down while you are performance testing.

The tests can be executed using ReSharper's Unit Test Runner (or equivalent) for maximum convenience, or individually as a Console application to give slightly faster and more accurate figures. Either way, always remember to ensure Release Mode is on!

TimingTestFixture has a number of properties such as TimingCount (default=10), RepeatCount (default=20), and so on.

You can change any or all of these defaults by adding a constructor to your [TestFixture] class that calls this constructor on TimingTestFixture:

C#
protected TimingTestFixture(int defaultTimingCount = 10,
              int defaultRepeatCount = 20,
              GarbageCollectMode defaultGarbageCollectMode = 
                                 GarbageCollectMode.PerRepeat,
              TimingMode defaultTimingMode = TimingMode.AutoByRepeatCount,
              bool defaultPerformJITPass = true,
              bool defaultUseHighPriorityThread = true)

All tests in your class will then use your default settings.
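
For example, a fixture that overrides two of the defaults might look like this sketch (the class name and the chosen values are illustrative only; the parameter names come from the constructor signature above):

C#
[TestFixture]
public class UniqueStringListTimingTests : TimingTestFixture
{
  // Run 5 sets of 50 repeats for every test in this fixture;
  // the remaining parameters keep their documented defaults.
  public UniqueStringListTimingTests()
    : base(defaultTimingCount: 5, defaultRepeatCount: 50)
  {
  }
}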

Additionally, you can set any of the properties at the start of your [Test] method to override the default for that method only - TimingTestFixture/NUnit will reset back to the defaults before running other [Test] methods.
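
A per-test override might be sketched like this (the test name and the chosen RepeatCount value are illustrative; the words array, TimeProcess, and UniqueStringList are as shown earlier):

C#
[Test]
public void UniqueStringList_AddWithExtraRepeats()
{
  // Override the fixture default for this test only;
  // it is reset before other [Test] methods run.
  RepeatCount = 50;

  TimeProcess(() =>
            {
              var l = new UniqueStringList();

              for (var i = 0; i < words.Length; i++)
              {
                l.Add(words[i]);
              }

              Assert.AreEqual(words.Length, l.Count);
            });
}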

For comparison, UniqueStringListOriginal returned these figures:

Timing per Process; 10 sets; 20 repeats per set; GC Mode=PerRepeat:

 1: Total set time=1,286 ms. Average process time=64.300000 ms. GC(61/61/61). Memory used=4,824,344 bytes.
 2: Total set time=1,285 ms. Average process time=64.250000 ms. GC(61/61/61). Memory used=4,824,344 bytes.
 3: Total set time=1,285 ms. Average process time=64.250000 ms. GC(61/61/61). Memory used=4,824,344 bytes.
 4: Total set time=1,285 ms. Average process time=64.250000 ms. GC(61/61/61). Memory used=4,824,344 bytes.
 5: Total set time=1,289 ms. Average process time=64.450000 ms. GC(61/61/61). Memory used=4,824,344 bytes.
 6: Total set time=1,287 ms. Average process time=64.350000 ms. GC(61/61/61). Memory used=4,824,344 bytes.
 7: Total set time=1,291 ms. Average process time=64.550000 ms. GC(61/61/61). Memory used=4,824,344 bytes.
 8: Total set time=1,289 ms. Average process time=64.450000 ms. GC(61/61/61). Memory used=4,824,344 bytes.
 9: Total set time=1,293 ms. Average process time=64.650000 ms. GC(61/61/61). Memory used=4,824,344 bytes.
10: Total set time=1,287 ms. Average process time=64.350000 ms. GC(61/61/61). Memory used=4,824,344 bytes.

All:
Fastest: 1,285 ms, Slowest: 1,293 ms, Average: 1,287.70 ms (64.385000 ms), StdDevP: 2.61 ms (0.20%)

Best 9: (excludes 9).
Fastest: 1,285 ms, Slowest: 1,291 ms, Average: 1,287.11 ms (64.355556 ms), StdDevP: 2.02 ms (0.16%)

Best 8: (excludes 7, 9).
Fastest: 1,285 ms, Slowest: 1,289 ms, Average: 1,286.63 ms (64.331250 ms), StdDevP: 1.58 ms (0.12%)

Best 7: (excludes 5, 7, 9).
Fastest: 1,285 ms, Slowest: 1,289 ms, Average: 1,286.29 ms (64.314286 ms), StdDevP: 1.39 ms (0.11%)

Since the new code averages ~31.5ms compared to the old code's ~64.3ms, I can confidently claim it is now twice as fast!

Points of interest

Whilst writing this code, I was especially interested in ensuring that calls to the methods on QuickDivideInfo got inlined for speed. I don't know of an easy way to inspect the JIT output, so I relied on timings of test cases. I wrote three tests for each method: one timing the normal operation (divide or modulo), one with the multiplication and shift physically inlined, and one calling the QuickDivideInfo method. If the latter two produced roughly equal times, the method was assumed to have been inlined. Both should also produce times roughly one third of those for the version using the normal operator.
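
For readers unfamiliar with the technique, here is a minimal sketch of the multiply-and-shift replacement for constant division. The class and member names here are illustrative, not the actual QuickDivideInfo API; the magic pair shown is the standard one for dividing a signed 32-bit int by 10:

C#
static class FastDivideBy10
{
  // Standard magic pair for dividing a signed 32-bit int by 10:
  // the same multiplier/shift that optimizing compilers emit.
  const long Multiplier = 0x66666667L;
  const int Shift = 34;

  public static int Divide(int numerator)
  {
    // Divide the magnitude, then restore the sign, so the result
    // truncates towards zero exactly like the / operator.
    return numerator >= 0
         ? (int) ((numerator * Multiplier) >> Shift)
         : -(int) ((-(long) numerator * Multiplier) >> Shift);
  }
}

As a quick sanity check, Divide(12345) gives 1234 and Divide(-12345) gives -1234, matching the / operator's truncation towards zero.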

I was surprised when one of the sets produced odd results. At first I thought my method was not being inlined, but then I noticed that the test with the physically inlined optimization was also not showing the expected speed-up!

This code:

C#
var check = 0;

for (var numerator = 0; numerator <= maxNumerator; numerator++)
{
  check += numerator >= 0
         ? numerator - (int) ((numerator * qdi.Multiplier) >> qdi.Shift) * qdi.Divisor
         : numerator - (int) -((-numerator * qdi.Multiplier) >> qdi.Shift) * qdi.Divisor;
}

return check;

ran 3 times faster than this code:

C#
var check = 0;

for (var numerator = 0; numerator <= maxNumerator; numerator++)
{
  check += numerator >= 0
         ? (int) ((numerator * qdi.Multiplier) >> qdi.Shift)
         : (int) -((-numerator * qdi.Multiplier) >> qdi.Shift);
}

return check;

yet it had more operations to complete!

I eventually discovered that I could get the expected speed increase back either by inverting the condition to '< 0' or by changing to an if/else construct (with either condition).
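
The if/else form that produced fast JITed code looks like this (logically identical to the conditional-expression version of the loop above):

C#
var check = 0;

for (var numerator = 0; numerator <= maxNumerator; numerator++)
{
  if (numerator >= 0)
  {
    check += (int) ((numerator * qdi.Multiplier) >> qdi.Shift);
  }
  else
  {
    check += (int) -((-numerator * qdi.Multiplier) >> qdi.Shift);
  }
}

return check;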

So out of the four ways of writing the expression, three produced fast JITed code, but the other - the one I happened to have chosen - produced slow JITed code.

So my conclusion is that there isn't a bug as such in the JIT compiler, since the results were correct, but there is definitely something that could be improved. It has made me a little uneasy about using ?: in performance-critical code.

History

  • v1 - Initial release.
  • v2 - Added the How it works section.
  • v2.1 - Added solution download including code, tests, test harness; added the Performance testing section to the article.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)