Introduction
I decided to try my hand at a custom generic collection class, and started looking into the implementations of the generic collection classes such as List<T>
or the interfaces like ICollection<T>
. But while what I found as I went along was useful information, ultimately, one question rose above all others as I looked at example after example of simple and complex implementations of CollectionBase
or ICollection<T>
that at its core just wrapped an ArrayList
or List<T>
internally, which to me defeats the purpose of implementing a baser class or interface in the first place. If you could write a class that wraps a List<T>
successfully, why bother with implementing ICollection
or IList
if you could just put a List<T>
in your class? What if I didn't want to wrap an existing structure that is more advanced than the more primitive interfaces I was considering? What options are there?
Background
Eventually, I ended up with my own collection class, which I think could still be useful even in the age of 3.0 - HashSet<T>
is a speedy unique-value list that can beat my collection on Contains checks for unique values, but it relies on being unique and cannot sort, and List<T>
is slightly faster than my collection on Sorts, but not by much. Neither one still natively provide a single-value sorted unique-value list. (See Benchmarks near the end for actual numbers.)
Even if it is rendered obsolete by extending a List<T>
to add a Unique function, I think it's still good information to know. I had never broken down a collection to this level before, and picked up a few new concepts.
Getting to the code
The most basic question I started with was simply this:
Given a class A, how do you implement A[index]? There was a quick obvious answer of an internal array, but the most obvious drawback to that was the static size, since we obviously will want a dynamic Add option that doesn't cause a recreation of the array on every Add. The less obvious drawback that I saw immediately that caused me to code far too much was that it did exactly what I didn't want, namely it implemented a pre-defined collection class (Array
) even if this was a far simpler collection than List<T>
and the like.
The answer that I ended up with is a nested class. A nested class is simply a class that is declared within your class, and in this case, I declared a nested class with a generic type property, and I can carry that generic type value to the nested class in the constructor. In this code, we see both a declaration of a class with a "generic" value type as well as a nested class that carries the same T
type. Note that I am implementing zero inheritance, they derive directly from System.Object
.
public class basicList<T>
{
private class basicItem
{
private T _data;
public T Value { get { return _data; } set { _data = value; } }
public basicItem(T t) { _data = t; }
}
}
Note that the nested class is private
, but the constructor/property is public
- if you had a public nested class, it could be accessed outside the parent class, but since it's private only, the parent class can access it; however, the methods your parent class need to use (including the constructor) must not be private
.
Now, we need a method on the parent basicList
class that creates a basicItem
, something like this:
public class basicList<T>
{
public Add(T t) { basicItem newItem = new basicItem(t);}
}
You are probably saying that's great that we made a basicItem
with data, but how do we find anything once we leave Add
?
Answer: we need to implement indexing of some kind. This is probably where people with more common sense or less free time than me decide to implement a List
or ArrayList
because they're pretty quick to spring to mind when you think to yourself, "What can I use to keep a collection of values and also easily access by index?" But, part of this project was no implementation of another collection class, partially because I want direct access to the data instead of just passing through another collection, and partially because that was the whole point I got started on.
So, with a collection of basicItem
objects that contain T
data, how can I keep track of them for indexing? The answer is what got me finally going: each basicItem
needs a link to another basicItem
in the form of a reference property, and we need to keep track of the count of objects, and the parent object needs to keep a reference to one of the objects. This string of references will keep the classes from getting collected as well as provide easy enumeration implementation. For performance reasons, I went with the ability to go forwards or backwards, so I need to keep track of:
basicList
:
- reference to first
basicItem
- reference to last
basicItem
- current count of items
basicItem
:
- reference to previous
basicItem
- reference to next
basicItem
On construction of a basicItem
in a basic Add
function, it can know the previous item if it's not the first, but not the next, since in a basic Add
, the new item is appended, there is nothing after it.
The new basicItem
constructor and basicList
constructor, and the Add
method:
public class basicList<T>
{
private basicItem _firstItem, _lastItem;
private int _count;
public basicList()
{
_firstItem = null;
_lastItem = null;
_count = 0;
}
public void Add(T t)
{
basicItem n = new basicItem(t, _lastItem);
if (_count == 0) { _firstItem = n; }
else { _lastItem.Next = n; }
_lastItem = n;
_count++;
}
private class basicItem
{
private basicItem _prev, _next;
private T _data;
public basicItem Previous { get { return _prev; } set { _prev = value; } }
public basicItem Next { get { return _next; } set { _next = value; } }
public T Value { get { return _data; } set { _data = value; } }
public basicItem(T t) : this(t, null) { }
public basicItem(T t, basicItem prev)
{
_prev = prev;
_data = t;
}
}
}
Now we are getting somewhere! When basicList.Add()
is called with a value, it first calls the basicItem
constructor with the value given and a link to _lastItem
to store in Previous
. Note that if it is the first item in the collection, it gets a null reference from the basicList
constructor, and if it is not, the previous _lastItem
gets a Next
link to the new basicItem
. Items placed by Add
are appended by default, so the item created is always the _lastItem
. _count
goes up by 1, and if it is the first item, it is also placed in _firstItem
. This system also has the advantage of not hard-coding an index in the basicItem
, but rather by letting it be relative, we can easily move/remove/add at will by simply changing the links of the basicItem
to their new orientation.
Now to implement indexing! With what we have here, we know _firstItem
and its index is always 0, and we have _lastItem
and we know its index is always _count - 1
. Two things we can implement here - one is a generic IEnumerator<T>
to support foreach
operations, and a this[int index]
to allow direct access. Since we are implementing a IEnumerator<T>
, our parent class can also inherit from IENumerable<T>
for no more cost than simply pointing its implementation of IEnumerator<T>
to the one we are creating. Incidentally, this is the only inheritance the basicList
class will ever get - it's probably not adding much, but I thought I might as well.
New code:
public class basicList<T> : IEnumerable<T>
{
public IEnumerator<T> GetEnumerator()
{
basicItem current = _lastItem;
while (current != null)
{
yield return current.Value;
current = current.Previous;
}
}
IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
public T this[int index]
{
get
{
int curIndex = 0;
basicItem curItem = _firstItem;
while(curIndex < index)
{
curItem = _firstItem.Next;
curIndex++;
}
if(curItem != null) { return curItem.Value;}
return default(T);
}
set
{
int curIndex = 0;
basicItem curItem = _firstItem;
while(curIndex < index)
{
curItem = _firstItem.Next;
curIndex++;
}
if(curItem != null) { curItem.Value = value;}
}
}
}
We now have a fully functional generic list! We can Add
an item, foreach
the collection, or get/set a value directly by index by simply going through our collection from the beginning (or end, more on this later).
Extending the list functions
OK, so a list we can append an item to, enumerate, and access by index is the most basic of lists. Surely, we can implement some other functions while still retaining zero inheritance (besides the freebie) and staying completely generic? Of course, we can!
Basic list functions - zero inheritance and fully generic:
AddRange
AddAt
RemoveAt
Clear
Contains
I'm not going to rewrite the code for AddAt
/RemoveAt
in this section, since I implemented some performance enhancement I haven't gotten to yet - but basically, because the indexes are not hard-coded, all you have to do is just re-hook the links. In the case of RemoveAt
, you grab the basicItem
at the index, then link the Previous
and Next
directly to each other, then either null the item (for actual removal) or just re-hook it wherever you want to go in the case of an uproot and replant. In the case of AddAt
, you grab the items before and after, and hook them to the new item, and then hook the new item to the before and after. In both cases, the first/last need to be checked for replacement, and the count needs to be updated.
AddRange
couldn't be easier, just take a T[]
and loop through them, calling Add
on each value. For Clear
, I actually hooked into the constructor to call, but it nulls everything, resets values to 0, and parses through the collection removing links just to make sure we don't have a long chain floating out in the abyss that the collection doesn't clean up. We can implement Contains
at this point, though we are limited to the Object
inherited Equals
.
public class basicList<T> : IEnumerable<T>
{
public basicList() { this.Clear(); }
public void AddRange(T[] t) { for (int i = 0; i < t.Length; i++) { Add(t[i]); } }
public bool Contains(T t)
{
foreach (T tVal in this) { if (tVal.Equals(t)) { return true; } }
return false;
}
public void Clear()
{
if (_firstItem != null)
{
basicItem curItem = _firstItem;
while (curItem.Next != null)
{
curItem = curItem.Next;
basicItem prevItem = curItem.Previous;
prevItem = null;
}
}
_firstItem = null;
_lastItem = null;
_count = 0;
}
}
This is about as far as we can possibly get with an unrestricted generic Object
-derived no-inherit list. We only have the basic Object
functions without limiting T
. At this point, if we want more, we can either subclass basicList
, or just use the minimal restriction on T
to get the rest of the functions we want.
Performance enhancements
I would go directly to the part where we implement sorting and the extended functions, but unfortunately, that is pretty complex, and uses what I'm about to describe here. Rather than rewriting entire code segments to make them make sense without explaining this system, I thought I would explain it first.
What I implemented is what I call the pointer system. This is the main reason my list class can compete (and beat some) of the predefined collection classes. Now I know what pointer means, and I'm not really using pointers. I'm using object references, but for my own clarity, I call them pointers, since by grabbing one, you are essentially grabbing the object it's referring to.
Recall that _firstNode
and _lastNode
are essentially references to the first and last nodes, whose positions we know as 0 and (Count - 1), respectively. What this system does is keep more of these references, except instead of being anchored to the first and last, they are free to be wherever within those bounds. Necessarily, this system is pretty complex to keep track of, but it is vital to actually perform anything other than appending items. Remember that at this stage, to get index[y], we have to start at 0 and iterate 0 -> y. Some performance enhancement can be had by determining if y is closer to the end or start and going from there, and that was my initial implementation. But what happens when we have to go through a collection one by one and flip back and forth between nodes, such as an operation like sorting requires? I'll tell you - in small collections, it's not very noticeable, but in large ones, it's extremely noticeable, as to get to the middle of the collection, it always has to start from the start or end.
In my implementation, I allow the user to optionally set the number of pointers used; in large collections, larger numbers will help somewhat, although all my tests were done with the default number, which is 4.
Since this article is getting long already, I'll briefly explain the implementation which you can see in the source code.
- Pointers are kept in two separate arrays, a
basicItem[] _curPointers
and a int[] _curPointerIndices
.
_curPointers[]
contains the actual object references_curPointerIndices[]
corresponds to basicList[index]
to the current node index
- When a call is made to get a pointer, the function
_getClosestPointer
just returns the index of the most strategic pointer to _getNodeAt
, who then passes it on to the caller.
- if none exists, it creates one at zero or the end depending on which is closer to
_createNewPointer
- if some exist, but making a new one at zero or end would be closer:
- if there is an empty spot to create one, grab it and create it
- if not, replace the least strategically placed pointer (based on proximity to other pointers and start/end) in
_calcPointerWorthLeast
- Functions then take that index and just use
_getClosestPointer
, and functions that grab the object reference use _getNodeAt
- functions keep track of their working object acquired from the pointer and its index, or just the index depending on the function
- they update the pointer/index where needed, either by directly inputting to
_curPointers
/ _curPointerIndices
or with _updatePointer
- if any nodes are added/removed during use, affected indices of pointers are updated accordingly with
_movePointers
- pointers support a reverse lookup, given a
basicItem
_getPointerIndex
can return the index of the pointer for updates
Enhanced functions
Now that I've explained that a bit, let's move on to the final part where we set up a Sort
and reforged Contains
, and enable auto-sorting on Add
. First off, to do any kind of comparison other than the basic Object
-derived Equals(obj)
, we need to finally put a little limitation on our T
, namely asking it to implement IComparable<T>
by tacking on where T:Comparable<T>
to the class name. I chose to augment the current list, rather than stop the class here and inherit it and then tack on the requested inheritance to T
, because IComparable<T>
is incredibly easy to implement. It has only one method - CompareTo
, which given t1.CompareTo(obj t2)
, returns -1 if t1
< t2
, 0 if t1
= t2
, 1 if t1
> t2
. I usually implement that myself in custom classes with a redirect to a type-specific comparison and avoid the normal object kludge.
Now that we can guarantee T
is IComparable
, we can update Contains
and implement sorting. A basic sort routine would be nice, but since I'm me, it's not basic. Before we get to that, though, here is the updated Contains
method; recall that before this, we could only use Object.Equals
; this is much faster:
public bool Contains(T t)
{
foreach (T tVal in this) { if (tVal.CompareTo(t) == 0) { return true; } }
return false;
}
It's now using CompareTo == 0
. One other thing that you could do is a quick compare of Object.getHashCode
first since that's quicker than CompareTo
(which is still faster than Equals
). But keep in mind, getHashCode
isn't a bulletproof comparison operator, and if hash codes are equal, that doesn't necessarily mean the two objects are equal, thus the second check with CompareTo
. I tested this with integers, and it sped up value checking considerably, but then when I used strings, it slowed it down- my guess is hash codes are equal more often, or it takes longer to calculate them with strings, and since this is intended to be a generic List
with as few restrictions as possible, I didn't want to set up a special string handler, a special integer handler etc., and so set it back to the most reliable comparison. If I was going to use it with strings, I would just override the comparison in a new string-targeted class inheriting basicList
.
On to sorting! At its heart, sorting only relies on a few things - that you have a reliable method to compare two values, which we do now with CompareTo
, and that you can keep track of where you are while you're flipping around. The latter is much more complex with the pointer system I described above, but it's pretty impressive on a sort op.
The sorting I used here is a combination of several parts. I'll post the actual sort handling code, although most of it won't make a ton of sense since we didn't go into that much code on the pointer system and the sort function itself calls another function which calls another. Nonetheless, here is the current _handleSortCall
(the public Sort looks at _autosort
, and if it's already autosorted, it ignores the call).
private void _handleSortCall()
{
basicItem nEnd = _firstItem, nStart = _firstItem;
int posEnd = 0;
int workingIndex = 0;
while (nEnd.Next != null)
{
basicItem thisNode = nEnd.Next;
workingIndex++;
if (thisNode.Value.CompareTo(thisNode.Previous.Value) < 0)
{
_removeNodeAtIndex(workingIndex, ref thisNode);
int newIndex = _findProperIndexByValue(thisNode.Value, 0);
_addNodeAtIndex(newIndex, ref thisNode);
posEnd++;
workingIndex++;
}
else
{
nEnd = thisNode;
posEnd++;
}
}
}
Nothing is passed to it since this is the basic sort call. The next more specific function is _findProperIndexByValue
, which, given a value, figures out the first index where the next value is greater, basically an ascending sort finder. To do this, it creates up to 10 zones, and checks the starting value one at a time, and then recurses, narrows the zones, and repeats until it has less than 20 possibilities, at which point, it calls another function _findFirstBiggerValueIndex
, whose function you can probably guess. The functions _removeNodeAtIndex
and _addNodeAtIndex
are exactly what you would think they are, the workhorses behind the AddAt
and RemoveAt
functions as well as absolutely vital for sort op since we are essentially plucking one node from its home and placing it somewhere else.
Since we are starting at the beginning and moving forward, any value behind the previous work node is guaranteed to be of lower or equal value than the current node, meaning that for each node we move forward, we only have to check if the value is larger than the last, and if it is, we can keep on moving, since a gap will be covered by later nodes being moved back if such a gap exists.
Epilogue
If you've read this far, I am quite proud of you. This is far, far longer than I anticipated it being, and I even started cutting it short once we got into the complex stuff. I do have a slight hope that somebody will find this useful despite there being obvious alternatives. There is a certain pride in building something from scratch that can compete with the established kings on nearly even ground. If not, it was a great brain exercise, hopefully for both of us.
Benchmarks!
I was curious how my little experiment stacked up against the big dogs, so I did an experiment using a 500K int
collection. I also did a 10K string
collection, but the results were so similar across the board I scrapped it until I felt good about this one. Only thing I will say about the strings is that my unsorted list's Add
time was exactly identical to all the others, and its sort time was exactly the same as List<string>
and ArrayList
, so I'm not just hiding bad results.
Code used for benchmark (in the code, plus two short verifications that the last 50 and first 50 are correct):
for (int x = 500000; x > 0; x--) { list.Add(x); }
list.Sort();
for (int x = 500000; x > 0; x--) { if (!list.Contains(x)) { list.Add(x, null); } }
I wrapped a little timer function around each of these to provide the milliseconds elapsed, and here are the results:
Collection Class | Add Test | Sort Test | Contains Test |
---|
basicList !_autoSort | 141 ms. | 203 ms. | Stopped at 10 seconds - 8022 completed. Estimated 623 needed. |
basicList _autoSort | 453 ms. | 0 ms. | Stopped at 10 seconds - 11648 completed. Estimated 429 needed. |
ArrayList | 109 ms. | 718 ms. | Stopped at 10 seconds - 652 completed. Estimated 7692 needed. |
List<int> | 31 ms. | 78 ms. | Stopped at 10 seconds - 805 completed. Estimated 6250 needed. |
Hashset | 125 ms. | NA | 765 ms. |
Hashtable(int, null) | 281 ms. | NA | 968 ms. |
SortedList<int, object> added (int, null) | added 100k in ~1min | NA | NA |
Dictionary<int, object> added (int, null) | 156 ms. | NA | 765 ms. |
SortedDictionary<int, object> added (int, null) | 703 ms. | 0 ms. | 781 ms. |
I did do the testing in 3.0, as you can see with the HashSet
, but downgraded the assembly to 2.0, since it doesn't need more than that. SortedList
couldn't even make it out of the gate, funnily enough. The unsorted list was beaten in both Add
and Sort
by the generic List
, but as you can see, managed to process 10x as many Contains
, and the sort time was much better than ArrayList
. The autoSorted
basicList
did 50% more Contains
in 10 seconds, but still not good enough for me to implement a unique function just yet. If I can get that down to Dictionary
sort times, then I finally have completely customizable competitive unique-value auto-sorted Set that is on equal grounds with the predefined ones. In my opinion, the non-unique unsorted and sorted are on pretty equal footing with the others. But I am biased!
Future plans
Currently (as of initial writing), there is also a unique attribute/public property/handler method in the class, which is commented out. You could implement it, it would require only adding another condition to the Add
/AddAt
/AddRange
for unique values. The reason I haven't implemented it is not difficulty, but because I am unsatisfied with the performance of the Contains
check. Since Unique would require a hit on the Contains
check for every single Add
, being satisfied is completely necessary for me. But, of course, I am benchmarking with 500K collections, and most applications would probably be fine.
Conclusion
Hope you found this as interesting an exercise as I did. But even a fraction is good enough for me to feel good about writing it!
Please feel free to let me know about bugs you find, I did test it pretty thoroughly, but I'm sure I didn't find everything.
As a footnote, if you are used to the normal ///writeup
on every function, I apologize - I only commented where I felt necessary, and I do not like paging through a million pages of mostly comments. If, however, this gets some downloads, and it is requested, I can update it with that. I have yet to see if anyone is even interested in downloading the actual code, the interest for me was the concept.
History