Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

A Generic Set Data Structure

0.00/5 (No votes)
26 Jun 2005 2  
A "set" type data structure in C# with basic logical operators.

Introduction

I was recently challenged by a co-worker (and recent .NET convert from Java) to point him to an equivalent of the java.util.Set class.

Considering the breadth of the data structures in the framework, I was disappointed to see that no native support exists for sets in v1.1 of the framework. I also realized how quickly one could be coded though. CodeProject has a couple of set projects already, but they�re geared towards more specific scenarios like efficient memory usage or storage of specific types and do not explore set operations as I do below, or are much more vast and difficult to follow.

This project is aimed at beginners to explain the implementation of basic collection iteration and operator overloading.

Game, Set, Match

What's a set and why would you want one? The Java docs for java.util.Set define a set as a collection that contains no duplicate elements. Sets are like enums in the framework, except enums are static constructs defined at compile-time (ignoring fancy tricks with the Reflection.Emit namespace). An enum is a defined list of values for a particular data type. A particular constant is in the enum (is a member of the set), or it is not in the enum. Sets are unlike enums in that enum members have a name and a value. Sets only have the values.

Sets are usually written like this: {1, 2, 5, 8, 9}. They are frequently numerical, but sets of other "things" are no harder to imagine and no less useful. You can have a set of three Stooges members {Larry, Moe, Curly}, or planets {Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, Pluto}, or anything else. They are mechanisms for grouping and representing logically similar things.

You would want a set if you wanted to group data in a way that made more sense to store as a set than as a queue, or a stack, or as a dictionary, or as a collection. It�s a recursive definition, but some data just makes more sense as a set.

Basic set operations

Aside from the basic set requirement of being able to iterate over the members, you must be able to add new members to the set (and have the uniqueness constraint enforced), you should be able to clear the values of the set and I'm going to allow client classes to be able to access the members of the set with an indexer. Sets also support some basic logical functions, and a test for equality.

If you remember back to elementary school math, you were probably taught about sets with Venn diagrams. Consider the following two sets: set "A" {1, 2, 3, 4, 5} and set "B" {3, 5, 7, 8, 9}. They can be drawn this way:

You can see that there are members unique to A, members unique to B, and members that they share. The members that are common to the two sets are called the "intersection" of the two sets, and all of the members of the two sets together are called the "union" of the two sets.

Intersection, union and equality are the operations that will be implemented by this class.

The code

The basic internal storage unit for the Set class is an ArrayList. ArrayLists are agreeably dynamic in size and ability to "take" different data types as members.

The Set class implements IEnumerable, so the contents of the set can be iterated over, and IDisposable, because I like to have all my classes behave in a consistent way when shutting down.

public class Set : System.Collections.IEnumerable, System.IDisposable

Implementing IEnumerable is pretty simple and only requires a single method:

public IEnumerator GetEnumerator() {
   return new SetEnumerator(this);
}

The SetEnumerator object is a private class defined at the bottom part of the Set.cs file:

private class SetEnumerator : System.Collections.IEnumerator { 
   private Set _set = null;
   private Int32 _position = -1;

   public SetEnumerator(Set set) {
      _set = set;
   } 

   public void Reset() { 
      _position = -1; 
   }

   public object Current { 
      get {return _set[_position];} 
   } 

   public bool MoveNext() { 
      if (_position < _set.Length - 1) {
         _position++; 
         return true; 
      } else 
         return false; 
   } 
}

The Current property of the enumerator class exploits the fact that I exposed the members of the set through an indexer (directly wired to the underlying ArrayList). If we didn't do that, we would have to provide some other mechanism to do the same thing to make the object enumerable.

Now that we've satisfied the requirement of being able to iterate over the members in a foreach(){} construct, we have to write some other methods to get some data into the set and manipulate it.

Public methods

I wrote two constructors. The first is a trivial constructor with no parameters. The second accepts an array of Set objects and creates a new one based on a union of all the elements of the array. I originally had a constructor that took a single Set object as a parameter and duplicated it, but then realized that the Set[] constructor could handle that case easily enough and I chopped it from the library.

The Add() method is implemented as follows:

public void Add(object member) {
   for(Int32 i = 0; i < _members.Count; i++)
      if (_members[i].Equals(member))
         if (_behaviour == BadBehaviourResponses.BeAggressive)
            throw new ArgumentException(...);
         else
            return;

   _members.Add(member);
}

The top six lines search the existing members for the member being added. If it's found in the set already, the method returns before the new member is added to the ArrayList.

The BadBehaviourResponses.BeAggressive enumeration is a tool to allow the method to work differently depending on whether it is being called by an internal method or by an external client. If an external client calls Add() and it turns out that the member being added is already in the set, I want an ArgumentException to be thrown (the same exception thrown by a dictionary when duplicate keys are added) to force the client to do something with the duplicate data. If an internal method is calling Add() though, I want to be able to short-circuit the exception throwing. Catching the exceptions is pretty expensive computationally, especially when you known you're just going to ignore the error.

Even though there are only two states, BeAggressive and BeCool, I implemented them as an enumeration. I tried writing the same functionality with a boolean, but the readability of the code really suffered.

Most of the other methods are pretty self-explanatory and are straightforward to code. Clear(), IsEmpty(), Contains() and Remove() are pretty trivial. I added ToArray() because I get that easily from the underlying ArrayList, and also because I implemented an explicit cast operator to create Set objects from arrays (discussed below), and ToArray() is the complement to that cast. The Length property and the indexer are both tied to the respective properties of the underlying ArrayList.

ToString() is overridden to display the contents of the set surrounded by brace-brackets {} and delimited by commas and is helpful for debugging purposes.

Overloaded operators

The logical operators allowed on the Set object are more interesting. Since "intersection" is pretty analogous conceptually to a bitwise "and", I overrode the & operator like this:

public static Set operator & (Set s1, Set s2) {
   Set result = new Set();
   result.DuplicateDataResponse = BadBehaviourResponses.BeCool;
   if (s1.Length > s2.Length) {
      foreach (object o in s1)
         if (s2.Contains(o))
            result.Add(o);
   } else {
      foreach(object o in s2)
         if (s1.Contains(o))
            result.Add(o);
   }
   result.DuplicateDataResponse = BadBehaviourResponses.BeAggressive;
   return result;
}

Overloading the & operator allows the client code to look like this:

Set s3 = s2 & s1; // s3 is the intersection of s1 and s2

Algorithmically, the method determines which set is bigger and then iterates over the members of the larger set. If any of the members in the larger set are also in the smaller set, they get added to the newly created Set object. Whatever members are in the result object at the end of the method are those that are common to the two input Set objects.

Setting the private property to control the duplicate data addition behaviour ensures that we don't have to trap exceptions in these loops, and setting the behaviour back to BeAggressive before the new object is returned means that consumers of this object will still have to trap their exceptions for subsequent exceptions.

Similarly to the case for intersection, "union" is analogous to "or", and I overloaded the bitwise or operator | like this:

public static Set operator | (Set s1, Set s2) {
   return new Set(new Set[2] {s1, s2});
}

It directly uses the constructor I discussed earlier, which takes an array of existing Set objects, and returns a new one with the input Set objects unioned together:

public Set(Set[] sources) {
   _behaviour = BadBehaviourResponses.BeCool;

   foreach(Set initialSet in sources)
      foreach(object o in initialSet)
         this.Add(o);

   _behaviour = BadBehaviourResponses.BeAggressive;
}

Again I remember to return the error raising behaviour of the Set after it's created back to the default action of BeAggressive. No matter how an instance of Set is created, either explicitly when newed up by a client or implicitly by one of the overridden operators, I want it to throw exceptions when it is fully created.

Implementing an overload of the equals operator (==) is the easiest of all the operators we're overloading.

public static bool operator == (Set s1, Set s2) {
   if (s1.Length != s2.Length)
      return false;

   foreach(object o in s1)
      if (!s2.Contains(o))
         return false;

   return true;
}

The method returns quickly if the two sets being compared are not equal in length. If they are equal in length, the loop looks for a mismatch between the members of the sets. If one is found, false is returned. If no mismatches are found, the method returns true. Since we're not concerned about whether the sets are sorted, members of the sets being compared don't have to be in the same position; if all the members of one set are also members of the second, they are equal.

The final operator to overload is the explicit cast from an array, so the client code can look like this:

Int32[]  intArray1 = new Int32[] {1, 4, 5, 7, 8, 9};
Set s5 = (Set)intArray1;

The code to overload the Set cast operator is pretty straightforward too:

public static explicit operator Set(Array array) {
   Set s = new Set();

   foreach(Object o in array) {
      try {
         s.Add(o);
      } catch (ArgumentException e) {
         throw new InvalidCastException(�, e);
      }
   }

   return s;
}

A shorter way to do this would have been to skip the try/catch block and just rely on any duplicate exceptions generated by s.Add(o) to bubble back to the client, but this would have resulted in ArgumentExceptions and not InvalidCastExceptions, which is more accurate in this case.

Conclusion

This class can store any data type, including intrinsic types like Int32 and System.String, more complex types like exceptions and database connections, or any other user defined types. It can store and operate on instances of any type that can be differentiated by the evaluation of their Equals() methods.

You can never have too many data structures at your disposal. For those cases when you need a set instead of a hash table or queue or stack or some other collection, you can use and extend this code to get the flexibility you need. I�ve included a sample project to exercise the library, and MathLib.chm created by NDoc, for documentation purposes.

Revision history

  • June 26th, 2005 � Initial revision.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here