Introduction and Background
Just after writing my article about Dictionary + Locking versus ConcurrentDictionary, in which I did my own dictionary implementation, I had a situation at work that required dictionary-like collections to do many fast searches without calling the database many times, but the normal .NET dictionaries simply couldn't be used.
The initial need was to have dictionaries that supported more than 50 million records. Normal dictionaries simply couldn't handle that amount of data, throwing OutOfMemoryExceptions even when the computer had enough memory to hold them.
The problem is in .NET itself, as it does not allow allocating arrays bigger than 2GB (at work we are still using .NET 3.5, so we can't use gcAllowVeryLargeObjects), but a 64-bit .NET application is still able to allocate more than 2GB of memory as long as it is distributed across many arrays.
So, I made a BigArray class that internally allocates arrays in blocks (of 10MB, for example) but is accessible as a single extremely large array. As I already had the source code of a dictionary, I copied it and replaced the normal array with the BigArray, and the initial problem was solved. We could already create dictionaries with more than 50 million items in memory, and they were much faster to access than calling the database many times. But our requirements didn't end there, so this library was born.
Non-Exclusive Keys, Multiple Indexes and Real Out-Of-Memory Exceptions
We already had a collection with 50 million records, but we needed to index it by many different keys, some of which are not exclusive.
A dictionary instance is a single "exclusive index" in memory, so it does not naturally allow us to put two values for the same key. We can use a list as our TValue type, but then using the dictionaries starts to become complicated.
Surely a helper class could hide that extra work for us, making it simpler to use, but we also needed to index the 50 million items many times (once per key). The computer had enough memory to hold two or maybe three indexes in memory, but not all the indexes that we were going to need.
So we needed another solution that could use files if necessary, as long as it was still fast enough for most uses.
Finally, if possible, it is better to simply add an item and have it indexed by all indexes. Having to add an item to all the dictionaries by hand is inelegant and error-prone.
The solution?
We could copy the data to a local database, use some dictionaries and create a manager class where we add a single item and it does the job of "distributing" what goes to the database and what goes to memory...
Or we could create a real dictionary-like class that allows multiple indexes to be added, allows them to be non-exclusive, and supports memory or files (or has different implementations of the same interfaces so they can easily replace each other).
And I think that if you read the article's title you know the path we followed.
Before continuing
Maybe you are not really interested in the indexing solution, but to build it I created a lot of components that you can use in your own applications if you need very large arrays, lists, or a FileStream optimized to do reads at "random" positions.
So, if you are only interested in the indexing solution, click here to jump straight to it.
The components that you may find useful for many different situations are:
BigArray
The BigArray is an abstract class that looks very similar to a normal array but uses long lengths and can have 4 billion items or more, as long as you have enough space for that. And talking about enough space, let's see the three implementations:
- InMemoryBigArray: As the name says, the big array is stored in memory. On a 64-bit system with many GBs of memory it will be able to use all the computer's memory, without the 2GB size limit (or the 2 billion item limit). If you have enough memory, you can put 4 billion items in memory, and that may be all you need;
- MmfArray: MMF means MemoryMappedFile. If you look at the implementation you will notice that I always use the MemoryMappedFile.CreateFromFile() method, and that's on purpose. Using MemoryMappedFile.CreateNew(), the array size is limited by the Windows page file size, but that does not happen when putting it inside a real file, so your only limit is your HD size. For some time I thought this was the best solution, as it is pretty fast and is not limited by the computer memory, but there are two problems with this array. On 32-bit computers, you still can't allocate arrays bigger than 2GB. That's because I use a single accessor and it must have enough addresses to access the file; it's not a question of memory size, but of available memory addresses. The other problem is that accessing really big MemoryMappedFiles at random positions makes Windows remove other applications from memory... that makes all the other applications become extremely slow, and my computer became unstable in such situations, to the point that deleting the generated file after the application finished made my computer restart;
- RafArray: RAF means RandomAccessFile. Considering that the MmfArray was making all other programs extremely slow (and even crashing sometimes), I wanted something safer. Unfortunately, using normal streams and repositioning to do reads has terrible performance. So I first created the RandomAccessFile, which allows accessing random positions in files pretty fast, and then I wrote this array.
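To make the block-allocation idea concrete, here is a simplified sketch of a chunked in-memory array. It is only an illustration of the principle, with an arbitrary block size and arbitrary names; it is not the actual InMemoryBigArray source.

public sealed class ChunkedArray<T>
{
    private const long BlockLength = 1024 * 1024; // 1 million items per block
    private readonly T[][] _blocks;
    private readonly long _length;

    public ChunkedArray(long length)
    {
        _length = length;
        long blockCount = (length + BlockLength - 1) / BlockLength;
        _blocks = new T[blockCount][];
        for (long i = 0; i < blockCount; i++)
            _blocks[i] = new T[BlockLength]; // the last block may be partially unused
    }

    public long Length
    {
        get { return _length; }
    }

    public T this[long index]
    {
        get { return _blocks[index / BlockLength][index % BlockLength]; }
        set { _blocks[index / BlockLength][index % BlockLength] = value; }
    }
}

Each block stays small enough to avoid the 2GB single-array limit, yet the indexer exposes the whole thing as one array addressed by a long.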
BigList
Using the same principle as a normal array and list, the BigList doubles its capacity when there is no room for a new item. This usually means copying the contents of a smaller array to a bigger one. So, even though the BigArray has a Resize method, it is still preferable to use a BigList if the number of items is unknown.
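A simplified sketch of that doubling strategy (again, just an illustration of the principle, not the actual BigList source):

public sealed class GrowingList<T>
{
    private T[] _items = new T[16];
    private long _count;

    public long Count
    {
        get { return _count; }
    }

    public void Add(T item)
    {
        if (_count == _items.Length)
        {
            // No room left: double the capacity and copy the existing items over.
            var bigger = new T[_items.Length * 2L];
            System.Array.Copy(_items, bigger, _items.Length);
            _items = bigger;
        }

        _items[_count++] = item;
    }
}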
RandomAccessFile
This class tries to avoid the FileStream buffer that's discarded at every Seek() by creating the stream with a buffer size of 1. But the important factor is that it still reads full blocks and keeps them in a weak-dictionary. Writes also read a block (if it is not yet in memory) and overwrite the content in such blocks. Then, on garbage collections, if the application size is bigger than 1GB (that's configurable), it saves the changed blocks and allows all blocks to be collected (but still keeps the weak references just in case you need them again before the next collection). This is usually 1/3 the speed of a MemoryMappedFile, but it is much faster than not buffering and does not seem to affect other programs or cause crashes.
All the reads and writes receive the position as a parameter instead of using a Position property on the object itself, and they are thread-safe.
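To illustrate the block-caching idea, here is a simplified sketch of the read path only. It is not the real class: the real one also handles writes, flushes changed blocks on garbage collections and is thread-safe, and the block size and names here are arbitrary.

using System;
using System.Collections.Generic;
using System.IO;

public sealed class BlockCachedReader : IDisposable
{
    private const int BlockSize = 64 * 1024; // always read whole 64 KB blocks
    private readonly FileStream _stream;
    private readonly Dictionary<long, WeakReference> _blocks = new Dictionary<long, WeakReference>();

    public BlockCachedReader(string path)
    {
        // Buffer size of 1 so FileStream's own buffer is not thrown away at every reposition.
        _stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 1);
    }

    public byte ReadByteAt(long position)
    {
        byte[] block = GetBlock(position / BlockSize);
        return block[position % BlockSize];
    }

    private byte[] GetBlock(long blockIndex)
    {
        WeakReference weak;
        byte[] block = null;
        if (_blocks.TryGetValue(blockIndex, out weak))
            block = (byte[])weak.Target; // may have been collected already

        if (block == null)
        {
            block = new byte[BlockSize];
            _stream.Position = blockIndex * BlockSize;
            _stream.Read(block, 0, BlockSize); // a real implementation handles partial reads at the end of the file
            _blocks[blockIndex] = new WeakReference(block);
        }

        return block;
    }

    public void Dispose()
    {
        _stream.Dispose();
    }
}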
GuaranteedMemoryLimit
To configure how much memory the application can use before it starts to collect the RandomAccessFile buffers, you only need to set the RandomAccessFile.GuaranteedMemoryLimit property (a short example follows the list below). It is in bytes. But there are three small problems with this approach.
- I made it an int instead of a long. This is easy to solve, just change its type, but in fact it is not of much use if you don't solve the next problems;
- Windows and .NET weak references don't work very well together. If you put an item in memory to make it accessible very fast, Windows may decide to page out that memory, effectively saving to a file an item that is in memory. That is: the .NET application puts a file item in memory for fast access, and then Windows decides it should be saved to a file because it was not used for some time... pretty stupid, but...;
- In my tests with 4GB to 12GB computers, disabling the Windows page file did improve performance (I don't care if that's not recommended, I did it), but after the application starts to use more than 2GB in total, it starts to be paged out (or rather, the application itself is removed from memory without page-outs [without writes]), and then new page-ins (reads) are needed. It was possible to reduce the problem by changing the application's minimum and maximum working set, but that didn't really solve the problem.
That is: I still think Windows does a stupid job of "guessing" what is being used and what is not, but putting a cache of 6GB was less useful than a cache of 1GB... and in my tests 1GB did great. So, I will only change it if I see a real situation where it makes a difference or if Windows starts to do different "guessing" (I didn't try Windows 8, but I really, really hope they stop removing an application from memory after 3GB of total memory usage when you have 16GB of memory).
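For reference, configuring the limit described above looks something like this. The 1GB value is only an example, and the static form of the property is assumed from the naming used in the text:

RandomAccessFile.GuaranteedMemoryLimit = 1024 * 1024 * 1024; // about 1GB before changed blocks are flushed on garbage collections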
RandomAccessStream
The RandomAccessFile allows reading and writing at direct positions of the file, but it does not inherit from the Stream class, so you can't use it where a Stream is required.
This class simply implements the Stream abstract methods and redirects them to a RandomAccessFile, allowing you to replace a normal FileStream when you know seeking is going to be frequent.
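A hedged usage sketch; the constructor parameters shown here are assumptions, so check the actual signatures in the source:

using System.IO;

var file = new RandomAccessFile("data.bin");
using (Stream stream = new RandomAccessStream(file))
{
    stream.Seek(123456, SeekOrigin.Begin);
    int firstByte = stream.ReadByte(); // served from the RandomAccessFile block cache
}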
Now that the small presentation of these classes is done, let's continue with the main focus of the article.
Indexer + Indexes
To understand this library you need to understand these main points:
- There is an Indexer type to which you can add many indexes;
- There are equality indexes (they don't support searching for items greater than or less than another item, only equality comparisons) that you can add to the indexer;
- There are ordered indexes (so you can read all values correctly sorted, find all items inside a range, find items that are bigger than a specific key and so on) that you can add to the indexer;
- And then there are the 64-bit (normal), the 32-bit (small) and the persisted versions of those classes.
And after you create the indexer and add the indexes, it is enough to:
- Start a transaction;
- Add all the items you want;
- Commit the transaction.
After the commit you can start using the indexes to do as many searches as you need.
It may look strange that we add items through a transaction and then use a specific index to do the searches, but this is how it should work.
We should never add items outside a transaction. So, instead of checking if there is a transaction when adding an item, the Add method is on the transaction itself.
Then, when we search, we must know by which index we want to do the search.
Maybe in a database we can do something like:
SELECT [COLUMNS] FROM [TABLE] WHERE [KEY]=@SOMEPARAMETER
And the database is able to identify whether the [KEY] is indexed and use the right index. But why lose time searching for an index when we know which index to use?
Also, considering the indexes are generic types, you have the advantage that both the key and the result are already typed. There is no conversion of a resultset to objects, no casts and no boxing. The data is already of the right type.
Examples
Talking about how it works is usually not enough, so let's see some examples:
Creating the indexer, adding indexes and items to it:
var indexer = new Indexer<SomeType>();
var idParentIndex = indexer.AddEqualityIndex("IdParent", (x) => x.IdParent);
var nameIndex = indexer.AddOrderedIndex("Name", (x) => x.Name);

using(var transaction = indexer.StartTransaction())
{
  for(int i=0; i<1000000; i++)
    transaction.Add(new SomeType { IdParent = i/2, Name = "Child " + i });

  transaction.Commit();
}
Note: I am using i/2 as the key, so we will always have two items per key (for example, key 0 will have "Child 0" and "Child 1").
Searching items by IdParent
foreach(var item in idParentIndex.EnumerateItems(0))
{
}
Searching items by Name
foreach(var item in nameIndex.EnumerateItemsInRange("A", "B"))
{
}
foreach(var item in nameIndex.EnumerateItemsStartingWith("Some partial name"))
{
}
Searching for the position of an item by key
var position = nameIndex.BinarySearchFirst("Some name");
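BinarySearchFirst follows the usual binary-search convention (explained in more detail later in the article): a negative result means no item with that key exists, and its complement is the position where one would be. Continuing the line above:

if (position >= 0)
{
    var item = nameIndex[position];        // an item with exactly this key was found
}
else
{
    var insertionPosition = ~position;     // where an item with this key would be inserted
}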
Creating a multiple-property index
The index by itself only knows one key to be indexed. You can't give it two key-extraction delegates, but you can always create an index based on many properties by returning Tuples, KeyValuePairs or any type that you create to hold all the info you need. It is enough for such a type to implement the Equals() and GetHashCode() methods.
var idParentAndNameIndex = indexer.AddEqualityIndex("IdParentAndName", (x) => Tuple.Create(x.IdParent, x.Name));
Semi-persisted indexers and indexes
If you look at the sample application, you will see that I give two delegates to the Indexer constructor. In fact, an Indexer without read and write delegates will use an InMemoryBigArray. If the delegates are MmfReadDelegate and MmfWriteDelegate (or if you simply give methods that use an unsafe byte * to access the item memory), a MmfArray will be used by the indexer and by its indexes. If you give delegates that use a byte[] instead of unsafe pointers, then a RafArray will be used by the indexer and the indexes.
I consider this a semi-persisted approach because the arrays will use files for their storage, but not all the info is kept in the files, and they will be deleted if you dispose of your objects properly. But if you need or want to keep the data from one execution to the next, you will probably want to look at the PersistedIndexer.
Small and Persisted indexers and indexes
I abused the use of var in the sample on purpose. If you change from a normal Indexer to a SmallIndexer or to a PersistedIndexer, the index types will also change (and they have generic parameters), and I simply don't want you to have to change too many places to do tests.
Also, even the BinarySearch methods have different return types, as the small version (32-bit) returns an int while the normal and persisted versions return a long. If you want to be able to receive any of them as a parameter, use the IIndexer<TItem> interface for the indexer, and all its methods will also return an appropriate interface type.
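As a sketch of that, here is a helper method that accepts any of the three implementations through the interface. The exact generic signatures of the interface and of the transaction type are assumed from the members described in this article:

using System.Collections.Generic;

static void FillIndexer<TItem>(IIndexer<TItem> indexer, IEnumerable<TItem> items)
{
    // Works the same for Indexer, SmallIndexer and PersistedIndexer.
    using (var transaction = indexer.StartTransaction())
    {
        foreach (var item in items)
            transaction.Add(item);

        transaction.Commit();
    }
}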
I used the indexer methods AddEqualityIndex and AddOrderedIndex. If you want, you can create the indexes yourself and then add them to the indexer. This is especially useful if you create alternative indexes and don't want to change the source code of the existing classes, or if you want to use the extra parameters of the index constructors (the EqualityIndexes allow you to set the expected number of items per key, which optimizes the index size, and the persisted versions let you specify the file names to be used).
But before looking at those alternative indexes, let's see our main methods:
In any Indexer (and accessible through the IIndexer interface):

- AddIndex: Adds an already created index to the indexer;
- AddEqualityIndex: Creates and adds an equality index to the indexer;
- AddOrderedIndex: Creates and adds an ordered index to the indexer;
- StartTransaction: Starts a transaction object that you can use to add new items to the indexer. If a transaction already exists, this creates a "child" transaction.
Then, from the returned transaction, you have:
- Add: Adds an item to the indexer. At this moment the new items are not yet indexed;
- Commit: Commits the transaction if it is the main transaction. If it is not, it simply tells that the transaction finished OK. Trying to commit an outer transaction when an inner transaction did not commit throws an InvalidOperationException;
- Dispose: It is always recommended to use the transaction in a using clause. If Dispose() is called when Commit() was not called (maybe because of an exception), the main transaction is rolled back (if this is the main transaction) or it is marked as requiring rollback (if the actual transaction is an inner/child transaction).
And finally, in the indexes (which should only be used after a transaction has been committed), we have:
- EnumerateItems: This is present in all indexes, be they ordered or not. It simply enumerates all items that have a specific key.

All the other important methods are on the ordered indexes:
- EnumerateItemsInRange: Enumerates all items that have a key inside some range of values. For example, all objects where the Id is between 1 and 1000. This includes 1 and 1000 themselves;
- EnumerateAll, EnumerateItemsGreaterThan, EnumerateItemsGreaterOrEqualTo, EnumerateItemsLessThan, EnumerateItemsLessThanOrEqualTo: I think the names are pretty obvious. One important thing to note is that items are always returned ordered from the smallest keys to the greatest ones, independent of whether you ask for items greater than something or less than something;
- BinarySearch: Searches for the position of an item with a given key. As happens with the BinarySearch on a normal list, if there are no items with a given key, it returns the position where the item should be as a complement (~position), which becomes a negative value. But if there are many items with the same key, it simply returns a position where an item was found; it is not guaranteed to be the first or the last item with that key;
- BinarySearchFirst: Works mostly like BinarySearch, but if there are many items with the same key, it will always return the position of the first one;
- BinarySearchLast: Exactly the opposite of BinarySearchFirst. If many items have the same key, it will return the position of the last one;
- index[position]: This is not a method, but considering you can search for the position of items, you will probably want to get items by position;
- EnumerateItemsStartingWith: This is an extension method available for ordered indexes with string keys. As the name says, it enumerates all items that have a key starting with the given string.
Using the Persisted Indexer and Indexes
The persisted indexer always requires more parameters to create, similar to the parameters needed to create a normal Indexer using RandomAccessFiles as its storage.
To start using it, you should at least give:
- A file name or a stream where the items will be stored;
- A delegate to read an item stored on the stream;
- A delegate to write an item to the stream.
If you want, you can easily read and write items using the .NET BinaryFormatter, but I highly recommend that you write the code for the read and write delegates properly.
The normal serialization process is slow and adds a lot of extra information to the files. It is best to write only what's needed. After all, the entire idea here is to have good speed through indexing, and saving items to disk already reduces that speed a lot. We don't want to reduce it even further by using the wrong serialization method.
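As an illustration of such hand-written delegates, here is a sketch using the SomeType from the earlier examples. It assumes the delegates receive the stream and the item; the exact delegate signatures expected by the PersistedIndexer may differ, so treat it only as a guideline:

using System.IO;

static void WriteItem(Stream stream, SomeType item)
{
    // The writer is intentionally not disposed so the underlying stream stays open.
    var writer = new BinaryWriter(stream);
    writer.Write(item.IdParent);
    writer.Write(item.Name ?? string.Empty);
    writer.Flush();
}

static SomeType ReadItem(Stream stream)
{
    var reader = new BinaryReader(stream);
    return new SomeType { IdParent = reader.ReadInt32(), Name = reader.ReadString() };
}

Writing only the fields you need keeps the files small and the reads fast, which is the entire point compared to the BinaryFormatter.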
You can also create the indexes directly from the PersistedIndexer, but then the index name will be used as the name of the file where it stores its data. If you want to specify a different file name, save to an alternative stream or use one of the alternative indexes, you should create the index yourself and then add it.
The persisted indexes also have a Cache property that you can fill with a PersistedIndexerWeakCache or a PersistedIndexerWeakPlusKeepAliveCache. By using them, recently read items are probably still in memory, so a new read is not necessary. This usually improves performance for the ordered indexes, as they usually read the same items many times.
Safety
The PersistedIndexer is capable of reloading items previously saved, but it is far from being as safe as a real database.
If you delete an index file, you will start to receive exceptions. At this moment, there is no way to simply ask an index to be regenerated.
Even if you create transactions and can roll them back, if an exception occurs at any moment between the start of the transaction and its commit or rollback, there is a good chance that the contents will be corrupted.
The purpose of these transactions is not to guarantee that either all items are added or none is. The transaction is used to tell the moment when you expect to stop adding items (so the indexes don't need to be regenerated before that), and the fact that items can be stored on disk is only there to avoid out-of-memory exceptions. So remember: if you want to keep indexed data in memory and eventually save it, that's OK, but if you want to persist it and use it as a database, then don't do it.
The Sample
If you download the sample, you will have a running application that's effectively the same one I've shown at the start of the article. It also has some commented code that you can uncomment to test the different indexes.
Adding 50 million records will not be possible on 32-bit computers without using the indexer constructors that receive delegates to read and write from RandomAccessFiles, or without using the PersistedIndexer. It will be possible on 64-bit computers that have enough memory, but you can always try the alternative versions for testing purposes.
Surely, adding 50 million records still takes time, and the persisted versions are slower than the in-memory ones. But it is all about overcoming limitations. You can always create an indexer that keeps all items in memory and then create indexes that are stored in files, or exactly the opposite. It is up to you to try and discover which one best suits you.
Advanced - Understanding how they were done
Until now the article was relatively basic. You can simply download the sample code, change it, and start to use it in your own projects if you want.
Now we will explore a little of how the Indexers and the indexes themselves were made.
First, the Indexer
Even though you need an Indexer to start your work and its interface is one of the biggest, it is one of the simplest types.
All kinds of indexers simply hold all the indexes and all the items added by the transactions. When a commit is done, the indexer simply does a foreach over its indexes and asks them to "reindex" themselves.
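Conceptually, the commit looks like this. The interface and member names here are illustrative, not the actual library code:

using System.Collections.Generic;

public interface ISketchIndex<TItem>
{
    void Reindex(IList<TItem> newItems);
}

public sealed class IndexerSketch<TItem>
{
    private readonly List<ISketchIndex<TItem>> _indexes = new List<ISketchIndex<TItem>>();
    private readonly List<TItem> _uncommittedItems = new List<TItem>();

    public void Add(TItem item)
    {
        _uncommittedItems.Add(item); // nothing is indexed yet
    }

    public void Commit()
    {
        // Ask every index to incorporate the newly added items.
        foreach (var index in _indexes)
            index.Reindex(_uncommittedItems);

        _uncommittedItems.Clear();
    }
}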
The difference between the indexers is that the Indexer uses a BigList to store all items in memory or in temporary files (depending on whether you give the delegates to read and write to files or not), the SmallIndexer uses a normal List and the PersistedIndexer uses a RandomAccessStream. The PersistedIndexer is so "dumb" that it can't even find the 10th item in its file; it is the index that must give the right position in the file so the indexer can read it.
The transactions? Well, they are even simpler. They simply guarantee that Commit() or Dispose() is done only once. In fact, all their methods redirect to the Indexer.
The real work happens in the indexes themselves.
Do you remember the list of methods in the EqualityIndexes and in the OrderedIndexes?
The OrderedIndexes have many more methods... but they are simpler!
The small version holds the items as an array of KeyValuePair<TKey, TItem>. When a commit is done, it allocates a new array, copies the old values, copies the UncommittedItems (which are given by the Indexer), extracting their keys and putting them into the array... and then a Sort() is made.
Exactly, it is as simple as that. A simple Sort() is called, and now the items are sorted in memory, so the BinarySearch can be done. If you need items that are less than a given value, it is enough to do a foreach, breaking it as soon as the first key is "greater than or equal to" the given key.
If you need items that are greater than or equal to a given value, do a BinarySearchFirst to get the position of the first item with the given key or the position where it should be. If the result is negative, do a ~result to get the positive value, and then iterate through to the end. Nothing really special there.
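The following sketch mirrors that logic over a plain sorted array of KeyValuePair<TKey, TItem>. It follows the algorithm just described, but it is not the library's actual source:

using System.Collections.Generic;

public static class OrderedIndexSketch
{
    public static long BinarySearchFirst<TKey, TItem>(
        KeyValuePair<TKey, TItem>[] sortedItems, TKey key, IComparer<TKey> comparer)
    {
        long low = 0, high = sortedItems.Length - 1, found = -1;
        while (low <= high)
        {
            long middle = low + (high - low) / 2;
            int comparison = comparer.Compare(sortedItems[middle].Key, key);

            if (comparison < 0)
                low = middle + 1;
            else
            {
                if (comparison == 0)
                    found = middle; // remember it, but keep looking to the left

                high = middle - 1;
            }
        }

        return found >= 0 ? found : ~low; // ~position where an item with this key would be
    }

    public static IEnumerable<TItem> EnumerateItemsGreaterOrEqualTo<TKey, TItem>(
        KeyValuePair<TKey, TItem>[] sortedItems, TKey key, IComparer<TKey> comparer)
    {
        long position = BinarySearchFirst(sortedItems, key, comparer);
        if (position < 0)
            position = ~position; // first item with a key greater than the given one

        for (long i = position; i < sortedItems.Length; i++)
            yield return sortedItems[i].Value;
    }
}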
The normal and the persisted versions don't store a KeyValuePair. The normal one stores the index where it can find the item in the items array. It is done this way so that if the item is saved to the hard drive, it will only be saved once. The persisted version, as it allows items of different sizes to be serialized, stores the position of the item in the file. That is, the 10th item may be at position 1000, so it stores position 1000. But the sort process is the same Sort() method, only using a delegate that is capable of finding items by their position.
The Equality Indexes
The equality indexes work like a dictionary/hashtable.
That is, the hashcode of the key is used to select a bucket where the item will be put. As happens with dictionaries, the number of buckets can increase, and it is always possible that items with different keys end up in the same bucket, so the nodes must be prepared to point to other nodes and so on.
One big difference is that inserts don't check if an item with the same key already exists. They simply create a new node, referencing the current first node in the bucket, and then the new node is set as the first node in the bucket. By doing this, it is possible for duplicate keys to exist, but that's the purpose. We only want to index the items so they can be found pretty fast; there is no need to avoid duplicates.
Then, when enumerating, we get the hashcode of the key, discover the bucket that we want to read, and we read all the nodes even after an item is found (after all, we can have duplicates), but we must always revalidate that the found item's key equals the given key, as it is possible that many different keys share the same bucket.
And that's all. The most difficult part is the resizing of the buckets. Simply finding the right bucket is a matter of doing hashcode % bucketCount.
The nodes
As I presented in the Dictionary + Locking versus ConcurrentDictionary article, the nodes of the dictionary can be full objects (where the reference to the next node is a simple object reference) or they can be structs, all allocated in a single array and having the nextNode as an index into that array.
Having them as full objects is easier, but the node allocation becomes slower. So, for performance reasons and to support referencing items in the items file, I decided to make the nodes structs, and instead of keeping the key and the item, they simply point to the position of the item in the items list/file. Then, when reading, it is always necessary to load the item, re-extract its key and only then check if it is the right item or not.
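Putting the bucket selection, the chain of nodes and the item reloading together, a simplified sketch of the idea looks like this. The names and the exact layout are illustrative only (for example, the hashcode masking is a common technique to avoid negative values, not necessarily what the library does):

using System;
using System.Collections.Generic;

public struct IndexNode
{
    public long ItemPosition; // where to load the item from (list index or file position)
    public int NextNode;      // index of the next node in the same bucket, or -1
}

public sealed class EqualityIndexSketch<TKey, TItem>
{
    private readonly int[] _buckets;                 // first node index per bucket, -1 when empty
    private readonly IndexNode[] _nodes;
    private readonly Func<long, TItem> _loadItem;    // reads the item back from the list/file
    private readonly Func<TItem, TKey> _extractKey;

    public EqualityIndexSketch(int[] buckets, IndexNode[] nodes,
        Func<long, TItem> loadItem, Func<TItem, TKey> extractKey)
    {
        _buckets = buckets;
        _nodes = nodes;
        _loadItem = loadItem;
        _extractKey = extractKey;
    }

    public IEnumerable<TItem> EnumerateItems(TKey key)
    {
        int bucketIndex = (key.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;

        for (int nodeIndex = _buckets[bucketIndex]; nodeIndex != -1; nodeIndex = _nodes[nodeIndex].NextNode)
        {
            // Load the item and re-extract its key: different keys can share a bucket.
            TItem item = _loadItem(_nodes[nodeIndex].ItemPosition);
            if (EqualityComparer<TKey>.Default.Equals(_extractKey(item), key))
                yield return item; // keep walking: duplicate keys are allowed
        }
    }
}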
That's all
Even though I support many different scenarios, I think it is possible to see that the greatest complexity is in the equality indexes, which are also the fastest ones. I really hope that these indexers, or at least the other classes like the RandomAccessFile, are useful to you.
Version History
- March, 29, 2013. Corrected some errors in the text;
- March, 24, 2013. Added the reduced download, which avoids a lot of classes by not having the caches for the persisted version;
- March, 22, 2013. Added the GuaranteedMemoryLimit topic;
- March, 17, 2013. First version.