Introduction
Searching and sorting collections of data are at the heart of almost all programming projects. Fortunately, the .NET Framework includes a large number of tools that can reduce your development time while helping you write programs that are more stable and perform better. This program demonstrates many different searching and sorting techniques, and the code for each technique can easily be borrowed to speed development of your own applications.
This WinForms program visually compares the performance of nearly a dozen different approaches to searching and sorting data. The program also allows you to compare running these data manipulations using different multitasking techniques and settings. The multitasking settings let you see the degree to which the foreground program remains responsive to user interaction while performing background data processing.
Background
Almost all programs must gather and manipulate sets of data. There are many very familiar ways to store, search, sort, and iterate through these collections of data. I have found that choosing the most appropriate technique for a particular task will greatly simplify coding while also improving the program's performance. For instance, I recently rewrote a complex production scheduling and optimizing routine that had originally been written in C++ (for speed, don't you know). The original program stored its data in a tangled collection of objects and linked lists which had to be frequently searched and sorted. The program was both slow and so mind-numbingly complex that even minor changes frequently broke it in quite unexpected ways. As a result, none of my client company's developers wanted to touch the darn thing for fear it would explode. After studying the program and the problem, it became clear that a set of Queue and Hashtable collections would be a much more elegant fit for the core scheduling and optimizing problem. The program is now much smaller, far less complex, much more stable, and runs about twenty times faster even though it is no longer coded in C++.
This sample program will allow you to compare the performance and underlying code for many different data manipulation techniques running with varying amounts of data. I have found that this is a very handy workbench when choosing the best tool to use for a coding problem. It is also a good source of code snippets to help speed development.
The program presents a list of search and sorting techniques in a ListView. I have used Alan Anderson's outstanding Glacial List (see: article) in place of the standard ListView. This very nice free control supports many enhanced features that allowed me to provide the user with a richer and more interactive experience. The control makes it very easy to embed progress bars to display the current processing status. I have also replaced the standard progress bar with Alan Zhao's nifty Color Progress Bar (see: article) for much the same reasons.
The user may start a timed search or sort by clicking on the corresponding cell of the ListView. The cell will display a progress bar as the processing proceeds, and then the number of seconds needed to complete the operation. Clicking the cells one by one lets you compare the time needed for each operation. If one of the multithreading modes is set, you may click on multiple cells for concurrent processing. Naturally, this will greatly increase the individual searching and sorting times, as they will reflect the total load on the CPU from all the chosen tasks rather than that of just the single selected task. The advancing progress bars do make rather nice eye candy. More importantly, they can help you gauge the responsiveness of the foreground window as the processing load grows.
You may also click on any of the descriptions in the Method column to see the data for that object. For instance, clicking on the "ADO Detached Recordset" cell yields the following popup screen:
As you can see, our automatically generated sample data is structured as follows:
Public Structure TestDataStructure
    Dim lUnique As Integer
    Dim sUnique As String
    Dim lNonUnique As Integer
    Dim sNonUnique As String
End Structure
The lUnique field is a 32-bit Integer that serves as a unique key for the data. The sUnique field is a unique String, while lNonUnique and sNonUnique are a non-unique Integer and String, respectively. While coding for this compound structure is a bit more complicated than for the more commonly used Key and Item pair, it is also far more flexible: the data structure could easily be expanded to any number of fields of any data type. The TestDataStructure is essentially a table of data.
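As an illustration, sample rows might be generated along these lines (a sketch only; the helper name and value patterns are assumptions, not the program's actual data generator):

```vbnet
' Hypothetical sketch: filling an array of TestDataStructure
' with generated sample rows.
Public Function CreateSampleData(ByVal count As Integer) As TestDataStructure()
    Dim data(count - 1) As TestDataStructure
    For i As Integer = 0 To count - 1
        data(i).lUnique = i                         ' unique integer key
        data(i).sUnique = "Key" & i.ToString()      ' unique string key
        data(i).lNonUnique = i Mod 10               ' repeating integers
        data(i).sNonUnique = "Group" & (i Mod 5)    ' repeating strings
    Next
    Return data
End Function
```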
You will notice that our searching and sorting methods fall into three categories:
- those that rely upon ADO and ADO.NET services;
- those that use arrays; and
- those that rely upon the .NET Framework System.Collections classes.
The .NET Framework includes a remarkably broad and flexible set of tools in its System.Collections namespace, and these classes also tend to be rock solid and very high performance. They are so wide ranging that a decent argument could be made that a developer should use them almost exclusively. Still, the other techniques do have advantages for certain problems. For instance, arrays tend to be the fastest to load, to scan, and in some cases to use for retrieving a data item.
The advantages of the ADO.NET facilities may be less obvious, but to my mind they are even more compelling for certain types of problems. While ADO.NET is often an order of magnitude slower than some of the System.Collections classes, it is still a very high performance tool. Many developers would be surprised at just how fast ADO.NET is for local data manipulations. As long as we do not need to access data on a server or in some form of file, ADO.NET is a screamer. For instance, on my middle-of-the-road 2 GHz test box, it is able to perform over 100,000 keyed lookups per second and sort over 150,000 records per second. For comparison, the best performing HashTable collection was able to perform about 600,000 keyed lookups per second. Most programs manipulate much smaller working data sets.
ADO.NET is able to crunch the average sized working data table in an imperceptibly small amount of time. In the case of a WinForms application, the processing is performed exclusively on the user's workstation, so the extra processing time cannot slow any other users. In such cases, other factors should often matter more than raw performance gains, particularly when the gain would never be noticed by a user. One such consideration is the time it takes the programmer to code and maintain the solution. ADO.NET can speed development by reducing and simplifying the code needed. For instance, retrieving one or more sorted records can be as simple as:
Dim rows As DataRow() = m_dt.Select("sNonUnique = 'Fifth'", "lUnique")
Much more code is needed for this example when working with collections or arrays, and the extra code will require more time to write and debug. Having the data already in an ADO.NET structure also simplifies other tasks such as storing the data in a database server or outputting it to an XML file. It is very easy to create multiple copies or DataViews which can be manipulated independently, and fairly easy to layer in data integrity and table relationship features that are alien to the System.Collections world.
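To make the pattern concrete, here is a minimal sketch (table and column names are my own, not the sample program's) of a purely in-memory DataTable with a primary key, showing both a keyed lookup and the filtered, sorted retrieval used above:

```vbnet
' Sketch: an in-memory DataTable with a primary key.
Dim dt As New DataTable("TestData")
dt.Columns.Add("lUnique", GetType(Integer))
dt.Columns.Add("sNonUnique", GetType(String))
dt.PrimaryKey = New DataColumn() {dt.Columns("lUnique")}

dt.Rows.Add(1, "Fifth")
dt.Rows.Add(2, "First")

' Keyed lookup against the primary key index:
Dim row As DataRow = dt.Rows.Find(2)

' Filtered and sorted retrieval, as in the Select example above:
Dim rows As DataRow() = dt.Select("sNonUnique = 'Fifth'", "lUnique")
```

Setting PrimaryKey is what gives Rows.Find its index to search, which is part of why local ADO.NET lookups are so fast.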
You may wonder why I included support for the old ADO Recordsets. I have used detached ADO Recordsets in many Visual Basic 6 projects for local data manipulations, and I wanted to see how ADO.NET stacked up against trusty old ADO. The answer is: quite well. Detached ADO Recordsets required the developer to manually create indexes in order to achieve decent performance, while ADO.NET seems to dynamically create its own indexes. Loading, scanning, sorting, and most searches are much faster with ADO.NET.
One surprise was how slow the VisualBasic.Collection class is for iterating through the data. While keyed searches are fast, simply scanning the collection seemingly takes forever, and sorts are also very slow. I assume that this class exists mostly for backward compatibility with VB6 programs; I recommend avoiding it whenever possible.
Another surprise was the behavior of the SortedList collection. It performed reasonably well when loading small numbers of records, but performance degraded dramatically with larger record counts: 100,000 records took over 80 seconds, while the ArrayList collection loaded the same data in 0.01 seconds. The SortedList should only be used for fairly small sets of data.
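The difference comes down to when the sorting work happens. A sketch of the two loading strategies (record count and key order chosen for illustration):

```vbnet
' SortedList re-sorts on every Add: each insert must find the key's
' position and shift the underlying arrays, which is what degrades
' badly at large record counts.
Dim sl As New SortedList()
For i As Integer = 100000 To 1 Step -1
    sl.Add(i, "Item" & i)       ' sorted placement on each insert
Next

' ArrayList just appends, then sorts once at the end -- a single
' O(n log n) pass instead of per-insert sorting work.
Dim al As New ArrayList()
For i As Integer = 100000 To 1 Step -1
    al.Add(i)
Next
al.Sort()
```

If you only need the data sorted once after loading, the ArrayList approach is almost always the better choice.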
It was not surprising to see that the HashTable is the fastest way to perform keyed searches. I was surprised, however, that its scans are much slower than is the case with the other System.Collections classes.
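Both behaviors can be seen in a few lines (a minimal sketch, not the program's test code):

```vbnet
' Hashtable: near-constant-time keyed lookups, but scans must
' enumerate DictionaryEntry pairs and are comparatively slow.
Dim ht As New Hashtable()
ht.Add(42, "Answer")

' Keyed lookup -- the Hashtable's strong suit:
Dim value As String = CStr(ht(42))

' Scan -- every entry must be enumerated, in no particular order:
For Each entry As DictionaryEntry In ht
    Console.WriteLine("{0} = {1}", entry.Key, entry.Value)
Next
```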
The following table summarizes some key features of our data manipulation techniques:
| Method | Duplicates | Resizing | Sorting |
| --- | --- | --- | --- |
| ADO Detached RecordSet | User defined | (automatic) | Good |
| ADO Detached RecordSet (Indexed) | User defined | (automatic) | Good |
| ADO.NET DataTable | User defined | (automatic) | Fast |
| ADO.NET DataView | User defined | (automatic) | Fast |
| Array(X) of Structures | Allowed | No | Good |
| Array(X,3) of Objects | Allowed | No | Slow |
| Collection | Not Allowed | (automatic) | Slow |
| ArrayList Collection | Allowed | (automatic) | Good |
| SortedList Collection | Not Allowed | (automatic) | (automatic) |
| HashTable Collection | Not Allowed | (automatic) | (automatic) |
| Queue Collection | Allowed | (automatic) | n/a (FIFO only) |
| Method | Searching | Insert | Remove |
| --- | --- | --- | --- |
| ADO Detached RecordSet | Slow | Flexible | Flexible |
| ADO Detached RecordSet (Indexed) | Good | Flexible | Flexible |
| ADO.NET DataTable | Good | Flexible | Flexible |
| ADO.NET DataView | Very Good | Flexible | Flexible |
| Array(X) of Structures | Slow | No | No |
| Array(X,3) of Objects | Slow | No | No |
| Collection | Very Fast | Flexible | Flexible |
| ArrayList Collection | Fast | Flexible | Flexible |
| SortedList Collection | Fast | n/a (sorting always forced) | Flexible |
| HashTable Collection | Fastest | n/a (sorting always forced) | Flexible |
| Queue Collection | Slow | At the end only | From top only |
Using the code
The primary data class is TestData. This class is responsible for storing, populating, and comparing the data, and it implements IComparer routines to support the search and sort routines.
Public Class lUnique_IComparer
    Implements IComparer

    Public Function Compare(ByVal x As Object, ByVal y As Object) As Integer _
        Implements IComparer.Compare
        ' CompareTo avoids the overflow that plain subtraction can
        ' raise when the keys are far apart in value.
        Return DirectCast(x, TestData.TestDataStructure).lUnique.CompareTo( _
               DirectCast(y, TestData.TestDataStructure).lUnique)
    End Function
End Class

Public Class lNonUnique_IComparer
    Implements IComparer

    Public Function Compare(ByVal x As Object, ByVal y As Object) As Integer _
        Implements IComparer.Compare
        Return DirectCast(x, TestData.TestDataStructure).lNonUnique.CompareTo( _
               DirectCast(y, TestData.TestDataStructure).lNonUnique)
    End Function
End Class

Public Class sUnique_IComparer
    Implements IComparer

    Public Function Compare(ByVal x As Object, ByVal y As Object) As Integer _
        Implements IComparer.Compare
        Return StrComp(DirectCast(x, TestData.TestDataStructure).sUnique, _
                       DirectCast(y, TestData.TestDataStructure).sUnique, _
                       CompareMethod.Binary)
    End Function
End Class
Each data management type is handled by its own class. These classes are responsible for populating, scanning, searching, sorting, and displaying their data. Each is aware of the multitasking mode and can cancel a looping task, update the screen with progress information, and, where necessary, release control to the foreground window via a DoEvents() call. Each class exposes the following core functions:
Public Function Create() As String
Public Function DatasetScan() As String
Public Function Find_1_Key_Field() As String
Public Function Find_All_NonKey_Values() As String
Public Function Find_n_Fields() As String
Public Function Sort() As String
The Main form contains the dispatching code. Most of this is straightforward. It does contain delegates to support the multithreading operations which may confuse those who are seeing this for the first time. The code can be easily extended to support the searching and sorting method of your choice by adding just a few lines to the Main form and copying and customizing one of the existing searching/sorting classes.
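For readers new to the delegate pattern, the essential shape is marshaling UI updates from a worker thread back onto the form's thread. A minimal sketch (the control and method names here are illustrative assumptions, not the Main form's actual members):

```vbnet
' Sketch of the classic WinForms cross-thread update pattern.
Public Delegate Sub UpdateProgressDelegate(ByVal percent As Integer)

Public Sub UpdateProgress(ByVal percent As Integer)
    If Me.InvokeRequired Then
        ' Called from a worker thread: re-dispatch onto the UI thread.
        Me.Invoke(New UpdateProgressDelegate(AddressOf UpdateProgress), _
                  New Object() {percent})
    Else
        prgStatus.Value = percent   ' safe: we are now on the UI thread
    End If
End Sub
```

Worker threads simply call UpdateProgress; the InvokeRequired check routes the call to the correct thread, which is what keeps the progress bars advancing without cross-thread exceptions.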
Credits
My thanks to Alan Anderson and Alan Zhao for their handy controls. I am particularly indebted to Rob Macdonald and his excellent book Serious ADO: Universal Data Access with Visual Basic, from which some of the ideas for this program were drawn.
History
I will be adding more details to the article soon. For now, enjoy.