Introduction
The new System.Collections.Generic
namespace contains generic versions of the most often used data structures such as lists and dictionaries. But unfortunately, it does not contain a generic set type. The project included with this article provides a generic set type to fill this gap.
Why another set type?
Yes, I know what you are thinking. Why another set type? Several set collections have been posted to this site, for example this and this. Well, first of all, this is a generic set implementation, so it will be more convenient to use and also much faster than the aforementioned non-generic set types. Implementing a generic collection class is also a good way to see how .NET generics work.
The Interface
Since this set type tries to fill a gap in the System.Collections.Generic namespace, it is coded in a similar style so that it fits in easily. It also strives to provide good interoperability with other collection types.
I tried to provide methods that are similar in name and function to the ones provided by the other collections in the System.Collections.Generic
namespace. For example, I implemented the very useful ConvertAll, TrueForAll, FindAll and ForEach methods exactly as they are in the List<T>
class. Of course, I also implemented IEnumerable
and ICollection
in both generic and non-generic versions. So, hopefully, there will be no surprises when using the Set<T>
class.
The implementation
The requirements for a set class are quite simple: the Add
, Remove
and Contains
methods should be as fast as possible. Fortunately, the System.Collections.Generic
namespace already contains a collection that has exactly these properties, namely the Dictionary<K,V> type. The Dictionary<K,V>
type is a strongly typed version of the System.Collections.HashTable
class.
The keys in a Dictionary<K,V>
are unique. And adding, removing, or finding keys in a Dictionary<K,V>
will be fast as long as the storage type K
provides a good GetHashCode()
implementation. So, in the interest of avoiding code duplication, it makes sense to use a Dictionary<K,V>
internally to store the set elements.
Inheritance or composition
Since Dictionary<K,V>
is not sealed, it would theoretically be possible to let Set<T>
inherit from Dictionary<T,V>
. This would in fact be the most efficient way to implement a set that is backed by a HashTable
, but I decided against it for several reasons: First of all, Dictionary<K,V>
contains several methods and properties which have nothing to do with sets, and which would needlessly clutter the Set<T>
class and confuse the user. And second, the fact that the Set<T>
class is implemented using a Dictionary<K,V>
is an implementation detail that should be invisible for the user of the class. If I should decide to write my own Set<T>
implementation instead of delegating all the work to Dictionary<K,V>
, I could do this without breaking API compatibility.
So, I decided to use a private field to store the Dictionary<K,V>
instead of inheriting from Dictionary<K,V>
even though this means that one more object is created.
Efficiency?
As we have seen, the keys of a HashTable
can be used to emulate a set. But what about the values? We don't need them, so we could just use arbitrary values for them. This is an often-used idiom in non-generic .NET programming.
This small code snippet demonstrates how to use a non-generic HashTable
to make the values in an int[]
unique. Since the value we store in the HashTable
is arbitrary and we want to avoid unnecessary boxing, we use null
even though it looks a bit funny.
int[] values=new int[] {1,1,2,2,3,3,3,4,6,6,7,8,9,9,9,9,9,9};
Hashtable temp=new Hashtable();
foreach(int i in values)
temp[i]=null;
int[] uniquevalues=new int[temp.Count];
temp.Keys.CopyTo(uniquevalues,0);
The above code works and is reasonably efficient, but there are several places where it is less than optimal. Since the non-generic HashTable
class uses object references internally for keys and values, every time we add an int
to the HashTable
, it is boxed and placed on the heap. And, we still need the space for storing the reference to the value even though we are not interested in it.
Generics to the rescue
The generics implementation in .NET really shines when it comes to value types. When a generic type is instantiated using a value type as type parameter, the CLR generates special IL code for this specialization. So, using value types with generics totally eliminates the boxing overhead. With this in mind, we can rewrite the above code using the generic version of a HashTable
:
int[] values=new int[] {1,1,2,2,3,3,3,4,6,6,7,8,9,9,9,9,9,9};
Dictionary<int,object> temp=new Dictionary<int,object>();
foreach(int i in values)
temp[i]=null;
int[] uniquevalues=new int[temp.Count];
temp.Keys.CopyTo(uniquevalues,0);
This code will be quite a bit faster than the non-generic version, since adding a new key to the temporary dictionary no longer has any boxing overhead. But there is still the unnecessary baggage of a reference to a value we never need.
Zero-Size structs?
I thought that I could eliminate this overhead by using a zero-size struct
, but as it turns out, the current .NET runtime reserves some memory even for zero-size struct
s.
This will hopefully change in the future, since there are quite a few nice tricks that use zero-size struct
s.
Here is what I tried: since we do not care about the value, we can give it any type we want. Obviously, we want the smallest type we can get. The smallest primitive type is byte
, which is 1 byte long (duh) but will usually consume between 4 and 8 bytes depending on pack size. But we can do even better by defining our own dummy type with zero size:
struct Dummy {};
Dummy dummy=new Dummy();
int[] values=new int[] {1,1,2,2,3,3,3,4,6,6,7,8,9,9,9,9,9,9};
Dictionary<int,Dummy> temp=new Dictionary<int,Dummy>();
foreach(int i in values)
temp[i]=dummy;
int[] uniquevalues=new int[temp.Count];
temp.Keys.CopyTo(uniquevalues,0);
Too bad it does not work as advertised.
Alternative implementation
Using a Dictionary<T,X>
to store the Set<T>
data is reasonably efficient. But it has some memory overhead due to the unused dictionary values. I am currently working on a Set<T>
that is backed by a List<T>
sorted by the hash code. This would be much faster and more space efficient in many situations. When I am happy with the code, I will post it to the site mentioned below. It will have exactly the same public methods as this implementation, so you will be able to use it as a drop-in replacement.
Set operations
In addition to the usual collection operations, the Set<T>
type supports all kinds of set specific operations. Here is an overview:
Method |
Operator |
Description |
a.Union(b) |
a|b |
All elements that are in a or b |
a.Difference(b) |
a-b |
All elements in a that are not in b |
a.SymmetricDifference(b) |
a^b |
(a^b)=(a-b)|(b-a) |
a.Intersection(b) |
a&b |
All elements that are in a and in b |
a.Equals(b) |
a==b |
a and b contain exactly the same elements |
- |
a<b |
b contains all elements in a and some additional elements |
- |
a<=b |
b contains all elements in a and maybe some additional elements |
There is a slight difference in the operator methods and the non-operator methods. While the operator methods all require sets as parameters, the non-operator methods often use an IEnumerable<T>
as parameter. I did this to avoid accidental conversions to Set<T>
when using the operators.
Conclusion
I found the new System.Collections.Generic
namespace a real pleasure to work with. It is much faster for value types because of the elimination of the boxing overhead. The many useful methods such as the above mentioned ConvertAll, TrueForAll, FindAll and ForEach make it possible to eliminate most loops. The only thing it was missing is a Set type. But no longer.
If you can afford using .NET 2.0 features, you should by all means use the System.Collections.Generic
namespace even if you don't need the type safety. For example, you should use List<object>
instead of ArrayList
.
About the code
The included project contains the described Set<T>
type, and a small demo program that performs a few performance and correctness tests on sets. The code is under a BSD license, so feel free to use it for your own projects.
References
- New versions of the code will be published here.
- Other set types for .NET can be found here and here.
- You can vote for the inclusion of a set type in .NET Whidbey.
- Please vote for this as well. It has nothing to do with sets, but it is much more important.
- If you want to experiment with .NET generics, you can download Visual C# Express 2005. At least on my machine, it did not break my VS.NET 2003 installation, but YMMV.
- Online documentation for .NET Whidbey namespaces.