This article deals with the implementation, testing and practical use of a C# library implementing array tries. The library can be used by any managed-code Visual Studio application requiring efficient access to data stored in memory.
Introduction
A trie is a data structure used to access keys (strings) from a given data set. Most often, tries are built from character strings and represented as trees where the link from one node to each of its children is labeled with a unique character. According to the Wikipedia page titled “Trie”, the idea of a trie was first abstractly described by Axel Thue in 1912, and tries were described in terms of computing by Rene de la Briandais in 1959. In 1960, Edward Fredkin coined the term trie.
The following two figures illustrate tree representations of tries. The first figure was obtained from sample images of tries on the web. Each branch of the tree ends with a node labeled “EOW
” which stands for “end-of-word”. As will be seen in the implementation of the library, such nodes are not necessary.
The second figure is from the Wikipedia page titled “Trie”. In the original, the tree is missing two links: one from node “te
” to node “ten
” and one from node “in
” to node “inn
”. Those links have been added in red.
From the example figures, it can be seen that tries are multiway trees, which means that a node in a trie can have an arbitrary number of children.
In his seminal 1960 paper titled “Trie Memory” (Communications of the ACM, Vol. 3, No. 9, pp. 490-499), Edward Fredkin described the use of memory registers to store tries. (As a passing note, the author had free access through the site ResearchGate only to the first page of Fredkin’s paper. To obtain the full paper, the user not only has to pay a “joining” fee but also the fee for the paper, and those fees add up to an unreasonable amount of money for such an old paper).
Figure 1 in Fredkin’s paper shows a schematic representation of the storage of words in trie memory, and it is reproduced in the following figure:
For simplicity, Fredkin limited the character set to the first five letters of the alphabet plus a symbol (Ω in this reproduction) to denote “space” or end-of-word. The rows represent registers and the columns are labeled with the character set. Each non-empty entry in the table contains the number of the next register to be inspected. The process to retrieve a word always starts at register 1 (bottom row).
Lacking access to the rest of Fredkin’s paper, the workings of the schematic representation must be figured out somehow. As an example, entry (1,D) contains number 2; entry (2,A) contains number 3; entry (3,B) contains number 4; finally, entry (4,Ω) contains the special character “|” thus indicating the end of word “DAB”. Carrying out this process with the rest of some of the non-empty entries in register (row) 1, one obtains the following partial interpretation of the schematic.
The reader is encouraged to add the links for the words “BE
”, “BED
” and “A
”.
Implementation of the Trie Library
In this section, the C# implementation of the trie library will be described from the bottom up, that is, from elementary data structures and functions to more abstract ones. The complete code is in the ZIP file attached to this article. All the code in this section is assumed to reside within the namespace TriesLib
.
Trie Nodes
Nodes in a trie contain the information necessary to encode the characters in a string
.
public class TrieNode
{
private char ch;
private int limit, index;
private TrieNode[] subtrie;
public TrieNode( char _ch, int _limit, int _index, bool endOfString )
{
ch = _ch;
limit = _limit;
subtrie = new TrieNode[ limit ];
for ( int i = 0; i < limit; ++i )
{
subtrie[ i ] = null;
}
index = endOfString ? _index : -1;
}
The constuctor receives as arguments the character to be stored in the node, the number of subtrie nodes that the node will have, an index for the node and a flag indicating whether or not the character is at the end of the string being processed.
To access the value of the private data members of the node, there are three read-only (get
) properties: Ch
, Limit
and Index
. The following accessor function is used to get or set a subtrie element of the node.
public TrieNode this[ int i ]
{
get
{
if ( 0 <= i && i < limit )
{
return subtrie[ i ];
}
else
{
InvalidIndex( i );
return null;
}
}
set
{
if ( 0 <= i && i < limit )
{
subtrie[ i ] = value;
}
else
{
InvalidIndex( i );
}
}
}
Function InvalidIndex
simply displays a MessageBox
stating that the index passed to this is invalid.
Tries
Tries are built by creating trie nodes. In the examples of the previous section, tries were represented as trees. In the trie library being described, the implementation of tries is array-based.
public class Trie
{
private TrieNode root;
private int limit;
public Trie()
{
limit = TLconst.asciiTrieLimit;
root = new TrieNode( '@', limit, -1, false );
}
Every time the preceding constructor is called, a root TrieNode
is created with the number of its possible subtries set by the following class:
public static class TLconst
{
public static readonly
int
asciiTrieLimit = 256;
}
What this means is that each trie node can have up to 256 possible subtries. This allows handling strings containing any possible combination of characters from the standard ASCII character set.
Insertion of Words into a Trie
Assuming that a sample application has a static
class ATconst
containing an array of words to be inserted, a Trie would be created and the words would be inserted as follows:
Trie trie = new Trie();
for ( int j = 0; j < ATconst.words.Length; ++j )
{
trie.Insert( ATconst.words[ j ], j );
}
The following two functions handle the insertion of the characters of a string
into the trie.
public void Insert( string str, int _index )
{
if ( !string.IsNullOrEmpty( str ) )
{
TraverseToInsert( str, _index, root );
}
else
{
MessageBox.Show( "Trie class error: null or empty strings cannot be inserted" );
}
}
private void TraverseToInsert( string str, int _index, TrieNode node )
{
int n = str.Length, m = n - 1;
for ( int i = 0; i < n; ++i )
{
char _ch = str[ i ];
int j = (int)_ch;
if ( node[ j ] == null )
{
node[ j ] = new TrieNode( _ch, limit, _index, i == m );
}
node = node[ j ];
}
}
For example, starting with an empty trie, the insertion of the characters of the string
“CAB
” would leave the trie in the state shown in the following figure:
There are four nodes in the trie. The top-most node corresponds to the root of the trie. At index 67 (ASCII code for character ‘C’) of the subtrie array of the root node a new TrieNode
was created. At index 65 (ASCII code for character ‘A’) of the subtrie array of the second node a new TrieNode
was created. Finally, at index 66 (ASCII code for character ‘B’) of the subtrie array of the third node a new TrieNode
was created. All the elements of the subtrie array of the fourth node are null
and ready for the insertion of more characters if necessary.
Backward Listing of Entries in a Trie
In order to allow the verification that all the strings in a data set have been inserted into the trie, an application using the library can call the following function, which collects into a list the indices of the nodes in the trie. The function relies on an auxiliary function that does the actual collection. The indices are collected in reverse order for no particular reason.
public List<int> BackwardListingOfEntries()
{
List<int> indices = new List<int>();
DescListNode( root, indices );
return indices;
}
private void DescListNode( TrieNode node, List<int> indices )
{
if ( node != null )
{
if ( node.Index != -1 )
{
indices.Add( node.Index );
}
for ( int i = limit - 1; i >= 0; --i )
{
DescListNode( node[ i ], indices );
}
}
}
Retrieval of Words That Start With a Given Prefix
One important application of tries is finding all the strings in a data set that start with a given prefix. Assuming that the strings are not sorted, for small data sets exhaustive search may be tolerated. However, for large data sets such an approach is not acceptable. If the strings are stored in a trie, there is no search at all and the retrieval of the indices of target strings takes time linear in the number of strings.
The retrieval of words that start with a given prefix is as follows. Starting at the root of the trie, traverse the prefix until reaching the node corresponding to the last character of the prefix. Then collect the indices of all the nodes under that node. This algorithm is implemented by the following three functions:
public List<int> Collect( string prefix )
{
List<int> indices = new List<int>();
if ( !string.IsNullOrEmpty( prefix ) )
{
int n = prefix.Length;
TrieNode startNode;
startNode = TraverseStr( prefix, root );
if ( startNode != null )
{
CollectAll( startNode, indices );
}
}
return indices;
}
private TrieNode TraverseStr( string prefix, TrieNode startNode )
{
TrieNode node = startNode;
for ( int i = 0; i < prefix.Length; ++i )
{
char _ch = prefix[ i ];
int j = (int)_ch;
if ( node[ j ] != null )
{
node = node[ j ];
}
else
{
node = null;
break;
}
}
return node;
}
private void CollectAll( TrieNode node, List<int> indices )
{
if ( node != null )
{
if ( node.Index != -1 )
{
indices.Add( node.Index );
}
for ( int i = 0; i < limit; ++i )
{
CollectAll( node[ i ], indices );
}
}
}
An unstated assumption in the preceding functions is that the data set resides in memory and the target strings are accessed with the list of indices retrieved from the trie.
Testing the Library
The library to implement array tries can be tested by a simple console application such as the following. The code in the ZIP file contains more test cases that yield the expected results.
using TriesLib;
namespace ArrayTries
{
class Program
{
static void Main( string[] args )
{
Trie trie = new Trie();
for ( int j = 0; j < Constants.words.Length; ++j )
{
trie.Insert( Constants.words[ j ], j );
}
ShowBackwardListing( trie.BackwardListingOfEntries() );
ShowCollection( "a", trie );
ShowCollection( "beh", trie );
ShowCollection( "foo", trie );
Console.WriteLine();
}
public static void ShowBackwardListing( List<int> indices )
{
if ( indices.Count == 0 )
{
Console.WriteLine( "\nThe trie contains no entries\n" );
}
else
{
Console.WriteLine( "\nBackward listing of trie entries\n" );
foreach ( int i in indices )
{
Console.WriteLine( Constants.words[ i ] );
}
Console.WriteLine();
}
}
public static void ShowCollection( string prefix, Trie trie )
{
List<int> indices = trie.Collect( prefix );
if ( indices.Count == 0 )
{
Console.WriteLine(
string.Format( "\nThe trie contains no entries beginning with '{0}'", prefix ) );
}
else
{
Console.WriteLine( string.Format( "\nEntries beginning with '{0}'", prefix ) );
foreach ( int i in indices )
{
Console.WriteLine( Constants.words[ i ] );
}
}
}
}
}
To save space, the output from the program was organized in table form, and is shown in the following two figures:
A Simple Application: Simulation of Auto-Complete
Most, if not all, of the graphical user-interface (GUI) programs that provide a search capability have an auto-complete functionality. As the user is typing characters into the search box, the search “engine” displays suggestions to automatically complete the search text. It is very likely that such “engines” use data stored in a trie to present those suggestions.
The main program in the console application in the previous section can be extended to call the following function to simulate a user typing characters in a search box and a search engine’s list of auto-completion strings at each step.
public static void SimulateAutoComplete( string target, Trie trie )
{
int n = target.Length;
if ( n > 4 )
{
Console.WriteLine( "\nSimulating auto-complete with string '{0}'\n", target );
for ( int i = 0; i < n / 2; ++i )
{
string prefix = target.Substring( 0, i + 1 );
Console.WriteLine( "prefix: '{0}'", prefix );
ShowCollection( prefix, trie );
Console.WriteLine();
}
}
}
For example, the function call...
SimulateAutoComplete( "behemoth", trie );
...would generate the output shown in the following figure. Again, the output was organized in table form.
Implementing the Library
The ZIP file attached to this article contains three files that implement the library: TrieNode.cs, Trie.cs and TLconst.cs. In Visual Studio, create a new Visual C# Class Library with the name TriesLib
. In the Solution Explorer, right-click on the default class name chosen by Visual Studio (usually Class1.cs), re-name it as TrieNode.cs and copy the code from file TrieNode.cs. Right-click on the solution name and select Add->New Item. In the menu that appears, select Class
, name it Trie.cs and copy the code from file Trie.cs. In the Solution Explorer, right-click on References and select Add Reference. In the window that appears, select the .NET tab and add a reference to the library System.Windows.Forms
. Right-click on the solution name and select Add->New Item. In the menu that appears, select Class, name it TLconst.cs and copy the code from file TLconst.cs. Build the solution and the bin\Debug directory will contain the library file TriesLib.dll.
Using the Library
The ZIP file also contains a file named Program.cs that implements a simple console application that uses the library, and a file named ATconst.cs which contains the sample strings to be inserted in a trie. In Visual Studio, create a new console application with the name ArrayTries
. In the Solution Explorer, right-click on References and select Add Reference. In the window that appears, select the Browse tab and navigate to the bin\Debug directory containing the library file, TriesLib.dll. Select the library file and click on OK. Replace the entire contents of the empty Program.cs file with the text in the Program.cs file. Right-click on the solution name and select Add->New Item. In the menu that appears, select Class, name it ATconst.cs and copy the code from file ATconst.cs. Build the solution and run it.
Conclusion
This article has dealt with the implementation, testing, and practical use of a C# library implementing array tries. The library can be used by any managed-code Visual Studio application requiring efficient access to data stored in memory. The code can be adapted to access data in a database if the primary keys are string
s. All the code described in this article was developed with Visual Studio 2010. The code was then ported to Visual Studio 2019 and it ran without requiring any modifications.
History
- 11th June, 2021: Initial version