Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / VC9.0

Fast String Searching Using C++ STL Vectors in C#

4.92/5 (12 votes)
19 Aug 2012Zlib11 min read 47.2K   489  
C# interop using platform invoke and C++ vectors to implement fast searching and selection on index keys

 Image 1

Introduction 

 
The speed and efficiency of the C++ Standard Template Library (STL) vector class is virtually unbeatable. This article demonstrates how you can leverage the vector class in your C# application by using platform invoke (P-Invoke) with standard marshaling to call unmanaged DLL functions.

 

To open the vector class to DLL export, I wrap vector in a C++ struct called <span style="font-size: 10pt;">KVPCollection</span>. A number of public methods are used in <span style="font-size: 10pt;">KVPCollection</span>  to implement the desired functionality for my C# p-invoke calls. For example, <span style="font-size: 10pt;">GetFirst</span> and <span style="font-size: 10pt;">GetNext</span> methods implement a Java-like forward iterator pattern.  

The vector element type is a simple struct that contains a key-value pair. The key is a char* to a zero-terminated string buffer. The values are <span style="font-size: 10pt;">keySize</span> and an int index to a feature id (<span style="font-size: 10pt;">fid</span>). A managed version of this struct is created in C# and is marshaled as a reference type to and from the C++ DLL static function. It is then used in the other DLL methods as an unmanaged struct type.  

STL vector methods are used to iterate over a range of left-substring matches in a sorted list. This can be used in applications such as selecting data for instant "search suggest".   

In the case where there are no matches for the input string, I implement a version of Peter Kankowski's article  "Using Ternary DAGs for Spelling Correction".    I use the Ternary DAG data structure to collect all of the suggested spellings in a STL vector that I iterate over in the same pattern as before. Besides proposing correct spelling for misspelled dictionary words, this is useful for finding a street address, for example, where you don't know the exact spelling of the street name. 

I can index all records with string keys in the both data structures (std::vector and Ternary DAG) with the string plus the feature id. This can greatly improve the speed of simple selections in "Shapefile attribute" (.dbf) files, for example, because a "like" operator can be use for a left match, and the Ternary DAG is very efficient for finding alternative spellings in the data set. Features from the .dbf file can then be selected by fid instead of the string key.

 
 

Using the code 

CSharpTest Project  

The CSharpTest project is a console app that demonstrates how to call the exposed static functions in the C++ DLL. 

The first step is to populate the data structures.

 
C#
//Populate the data structures
SuggestWin32.ST_CreateInit();
FileStream fs = new FileStream(@"..\..\dic.txt",
                               FileMode.Open, FileAccess.Read, FileShare.Read);
StreamReader sr = new StreamReader(fs, Encoding.Default);
int iCount = 0;
while (!sr.EndOfStream)
{
  string line = sr.ReadLine();
  if (line.Length > 0)
  {
    SuggestWin32.ST_insert(line, iCount++);
  }
}
sr.Close();
SuggestWin32.ST_write_to_file(@"..\..\spellchk.dat");
 
 

The ST_CreateInit() call is used to create the underlying std::vector container (KVPCollection) and it also initializes a global counter for the Ternary DAG nodes.

ST_insert(line, iCount++) inserts a line of text from the dic.txt file into the KVPCollection and adds a node to the DAGTree structure.

Once all the data is inserted, ST_write_to_file is called to binary serialize the two data structures to files. The same path is used for both files. You can name the DAGTree serialize file what you like (in this case spellchk.dat), and the KVPCollection serialize file is named dict.dat. Both files will be replaced if they exist. The std:vector member of KVPCollection is also sorted prior to output to dict.dat.  

To use the index data sets you then call ST_open_dict as shown below.

C#
IntPtr Dict = SuggestWin32.ST_open_dict(@"..\..\spellchk.dat");
 
if (Dict == null)
{
  Console.WriteLine("Missing or bad input file");
}
 

This call deserializes the data to memory from both spellchk.dat and dict.dat. This is the reverse of what ST_write_to_file does. If anything goes wrong with deserializing the spellchk.dat file, null is returned. Otherwise an IntPtr handle to the root node of the DAGTree is returned.

If the serialized data is current, you can skip the write and go straight to ST_open_dict. This will save a few seconds in your application startup.

This C# struct is used in a reference parameter to hold the results. Note that the memory for the struct is allocated on the managed site so that default marshaling of a struct can be used.

C#
[StructLayout(LayoutKind.Sequential)]
public struct KVPIndex
{
  public int fid;
  public uint keySize;
  public string sKey;
}
 

LayoutKind.Sequential is used to insure that the order of fields will match the C++ version of this struct.  

A KVPIndex type is created like this.

C#
KVPIndex kvp = new KVPIndex();
kvp.sKey = new String(' ', 50);
   

Since this is a struct, the default constructor initializes all of the fields to their C# default values.

Memory is allocated for marshaling the String key member (sKey) with enough memory for 50 characters.  This is done by using the String constructor overload that creates a string with 50 repeated spaces in this example. The char you chose to use does not really matter.

Default marshaling automatically converts the .NET String type to a const char* buffer on the C++ side. 

Inside the do loop, each user-supplied word is tested for matches in the data sets, and the iterator interface is used to return the results.

 
C#
Console.Write("Enter a word to check: ");
word = Console.ReadLine();
if (word.Length > 0)
{
  //do a left-substring match on word and return the first instance
  SuggestWin32.ST_GetFirst(word, ref kvp);
  if (kvp.fid > -1)
  {
    for (; kvp.fid > -1; SuggestWin32.ST_GetNext(ref kvp))
    {
      Console.WriteLine("{0} fid:{1}", kvp.sKey, kvp.fid);
      kvp.sKey = new String(' ', 50);
    }
  }
  else //try to find words close to your spelling
  {
    int n;
    IntPtr KVPResults = SuggestWin32.ST_enum_suggestions(Dict, word, out n);
    if (n > 0)
    {
      for (SuggestWin32.ST_GetKVFirst(KVPResults, ref kvp); kvp.fid > -1; 
             SuggestWin32.ST_GetNext(ref kvp))
      {
        Console.WriteLine("{0} fid:{1}", kvp.sKey, kvp.fid);
        kvp.sKey = new String(' ', 50);
      }
    }
  }
}

ST_GetFirst is used to match all the words in the std::vector (which contains the dict.dat strings) that begin with the string parameter. Matching is case sensitive. If at least one match is found, the kvp object will contain the first match.  Otherwise, kvp.fid will be set to -1 to indicate no matches were found.

If matches to word were found, the match values are printed inside the for loop.

ST_GetNext is called on the next loop to get the next iteration of kvp in the range of matches. Then the check for kvp.fid > -1 is done. This time, if kvp.fid == -1, it indicates that there are no more matches left to return and the for loop terminates. 

If no matches were found in the first test, a second test is found to find other suggested spellings for the word. ST_enum_suggestions is called to create a std::vector of KVPIndex objects containing suggested spelling matches. A handle to this data structure (KVPResults) is returned along with the number of matches.

Then ST_GetKVFirst is called in the for-loop initializer to begin iterating through the results. ST_GetKVFirst takes the handle to KVPResults and the reference to kvp as parameters. Results are returned in kvp the same way as before.

All of the static extern method prototypes are defined in the SuggestWin32 static C# class.

C#
public static class SuggestWin32
{
    [DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
    public static extern void ST_CreateInit();

    [DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
    public static extern void ST_insert(string word, int index);

    [DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
    public static extern bool ST_write_to_file(string path);

    [DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
    public static extern IntPtr ST_open_dict(string filename);

    [DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
    public static extern void ST_close_dict(IntPtr dict);

    [DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
    public static extern IntPtr ST_enum_suggestions(IntPtr dict, string word, out int n);

    [DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
    public static extern void ST_GetFirst(string cStr, [MarshalAs(UnmanagedType.Struct)] ref KVPIndex kvp);

    [DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
    public static extern void ST_GetKVFirst(IntPtr KVPResults, [MarshalAs(UnmanagedType.Struct)]  ref KVPIndex kvp);

    [DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
    public static extern void ST_GetNext([MarshalAs(UnmanagedType.Struct)] ref KVPIndex kvp);
 
} 

Note that you only need to supply the path to your dll in the DLLImport statement parameter.  There are other optional settings, but they are not needed for this project.

For this project, all of the parameters for the static method prototypes are either string, int, IntPtr, or KVPIndex types. MarshalAs(UnmangedType.Struct) is the default marshaling for C# struct, but I put the statement in just to be clear on what it is doing with the struct parameter. 

DllImport can be set to two different calling conventions: StdCall or Cdecl.  StdCall is the default (except in CE.NET) and Cdecl needs to be used if the C++ function you import takes a variable number of arguments (e.g. printf).

SpellSuggest Project 

This project source has two folders with all of the source code: Header and Source.

The Header folder contains CreateTree.h and spellchecker.h.  The Source folder contains CreateTree.cpp and spellcheker.cpp.

CreateTree.h

This header file contains three function prototype exports. 

C#
 extern "C" {
 
   __declspec(dllexport) void ST_CreateInit();
 
   //insert a string
   __declspec(dllexport) void ST_insert(const char* str, int index);
 
   //balance tree and write to file
   __declspec(dllexport) bool ST_write_to_file(const char* path);
} 

It also contains the C++ unmanaged version of the KVPIndex struct and the C++ KVPCollection struct.

C#
struct KVPIndex {
   int fid;
   size_t keySize;
   char *sKey;
};
 
struct KVPCollection
{
   std::vector<KVPIndex*> KVPVect;
 
   KVPCollection() { //constructor
      KVPVect.reserve(ST_MAX_ENTRIES);
   }
 
   ~KVPCollection() {
      vector<KVPIndex*>::iterator vit;
      for (vit=KVPVect.begin(); vit < KVPVect.end(); ++vit)
      {
         free((KVPIndex*)(*vit)->sKey);
         delete(*vit);
      }
   }
   //...
}

Note that in C++ structs are the same as classes except that members are public by default.

The first line of KVPCollection creates a object, KVPVect, that is an instance of std::vector<KVPIndex*>. Thus, each cell in KVPVect contains a pointer to KVPIndex.

The constructor reserves enough cells in the vector to hold ST_MAX_ENTRIES (300000) pointers. If more space is needed the vector object will reallocate a new buffer that is larger and copy all the cell contents (the pointers) to the new buffer.  ST_CreateInit() calls this constructor to create and instance of KVPCollection;

The destructor frees memory that has been dynamically allocated for sKey and KVPIndex instances. 

ST_insert calls the InsertCStr method in the KVPCollection class. Here is the implementation of the method. 

C#
 //Insert the cStr with given index value into KVPVect if length > 0
 void InsertCStr(const char *cStr, int index)
 {
    KVPIndex* m_kvp;
    m_kvp = new KVPIndex();
    m_kvp->keySize = strlen(cStr);
 
    if (m_kvp->keySize > 0) {
       if (m_kvp->keySize >= KVPIndexC_sKey_Size) { 
          m_kvp->keySize = KVPIndexC_sKey_Size - 1; 
       }
       m_kvp->sKey = (char *)malloc(m_kvp->keySize + 1);
       strncpy(m_kvp->sKey, cStr, m_kvp->keySize);
       m_kvp->sKey[m_kvp->keySize] = '\0';
       m_kvp->fid = index;
       KVPVect.push_back(m_kvp);
    }
 }

The minimum amount of memory is allocated to each sKey pointer up to 50 chars (including the 0 terminator). The given index parameter is saved to fid.

The same data is inserted into the DAGTree. See the Ternary DAG Tree link above for additional info on the DAGTree structure.

ST_write_to_file calls the Write method in the KVPCollection class to create the dict.dat file. 

C#
 void Write(const char *fileName)
 {
    Sort();
    ofstream outFile(fileName, ios::out | ios::binary);
    char* out = (char*)malloc(sizeof(KVPIndexC));
    vector<KVPIndex*>::const_iterator vit;
    for (vit=KVPVect.begin(); vit < KVPVect.end(); ++vit)
    {
       //m_kvp = *vit; //for debug
       KVPIndex *kvpTemp = (KVPIndex*)*vit;
       memmove(out, (const void*)kvpTemp, 8);
       memmove(out+8, kvpTemp->sKey, kvpTemp->keySize + 1);
       outFile.write(out, kvpTemp->keySize + 9);
    }
    outFile.close();
 } 
 

KVPVect is sorted and vector<KVPIndex*>::const_iterator is used to iterate through the contents of KVPVect to write each element with a binary write method to outFile.

spellchecker.h 

 

This header file contains the rest of the function prototype exports that are imported in CSharpTest. 

C#
extern "C" {
 __declspec(dllexport) size_t MAX_ENUM_WORDS();
}
 
// Open a dictionary, returns NULL in case of error
extern "C" {
 __declspec(dllexport) ST_FILE_NODE * ST_open_dict (const char * filename);
}
 
// Enumerate spelling suggestions (fuzzy matches)
extern "C" {
 __declspec(dllexport) std::vector<KVPIndex*> * ST_enum_suggestions (ST_FILE_NODE * dict, const char * word, int *n);
}
 
// Close the dictionary
extern "C" {
 __declspec(dllexport) void ST_close_dict (ST_FILE_NODE * dict);
}
 
extern "C" {
 __declspec(dllexport) void ST_GetFirst(const char *cStr, KVPIndex *kvpSt);
 __declspec(dllexport) void ST_GetKVFirst(std::vector<KVPIndex*> *KVPResults, KVPIndex *m_kvp);
}
 
extern "C" {
 __declspec(dllexport) void ST_GetNext(KVPIndex *kvpSt);
} 
 


 

The ENumResults struct is used to manage another KVPVect that stores a list of spelling suggestions generated in the ST_enum_suggestions call.

C#
struct ENumResults {
   std::vector<KVPIndex*> KVPVect;
 
   ~ENumResults() {
       vector<KVPIndex*>::iterator vit;
       for (vit=KVPVect.begin(); vit < KVPVect.end(); ++vit)
       {
          free((KVPIndex*)(*vit)->sKey);
          delete(*vit);
       }
   }
 
   void RemoveEntries() {
       vector<KVPIndex*>::iterator vit;
       for (vit=KVPVect.begin(); vit < KVPVect.end(); ++vit)
       {
          free((KVPIndex*)(*vit)->sKey);
          delete(*vit);
       }
       KVPVect.clear();
   }
};

The destructor is similar to RemoveEntries except that RemoveEntries frees memory for all elements and clears KVPVect instead for deleting the KVPVect object.

spellchecker.cpp  and CreateTree.cpp

Most of the wrapper functions for dllexport are implemented in spellchecker.cpp.

Many of these functions call methods in the KVPCollection struct that are either implemented in CreateTree.h or CreateTree.cpp.

ST_GetFirst calls an overload of GetFirst that is implemented in CreateTree.cpp. 

A C function, comp, is used to compare CompStr to the range of values in the sorted std::vector.

C#
const char *CompStr; 
C#
//Finds range of keys using left-substring match
bool comp(const KVPIndex *pred, const KVPIndex *range)
{
   if (strcmp(CompStr, range->sKey))
   {
      if (strcmp(CompStr, pred->sKey)) {
         return (strcmp(pred->sKey, range->sKey) < 0) ? true : false;
      }
      //CompStr == pred->sKey
      if (range->keySize <= pred->keySize) { 
         return (strcmp(pred->sKey, range->sKey) < 0) ? true : false;
      }
      return (strncmp(pred->sKey, range->sKey, pred->keySize) < 0) ? true : false;
   }
   else //CompStr == range->sKey
   {
      if (pred->keySize <= range->keySize) { 
         return (strcmp(pred->sKey, range->sKey) < 0) ? true : false;
      }
      return (strncmp(pred->sKey, range->sKey, range->keySize) < 0) ? true : false;
   }
}

pred and range are actually interchangeable so CompStr is used to hold the search string in order to correctly left match CompStr as a substring in sKey vector elements. If CompStr length is less than the vector key item, then only the key size of the input key should be compared. The right part of the range key that is past the length of the input key can be ignored. This function took some debugging to get it working correctly for left matches.

Here is this GetFirst implementation.

C#
void KVPCollection::GetFirst(const char *cStr, KVPIndex *m_kvp, bool (*compFunc)(const KVPIndex *, const KVPIndex *))
{
   m_kvp->fid = -1;
   int len = strlen(cStr);
   if (len > 0) {
      CompStr = cStr; //save to use in comp function
      KVPIndex* kvp;
      kvp = new KVPIndex(); //create temparary search term for predicate
      kvp->keySize = len;
      if (len >= KVPIndexC_sKey_Size) { 
         kvp->keySize = KVPIndexC_sKey_Size - 1;
      }
      kvp->sKey = (char *)malloc(kvp->keySize + 1);
      strncpy(kvp->sKey, cStr, kvp->keySize);
      kvp->sKey[kvp->keySize] = '\0';
      lowBound = lower_bound(KVPVect.begin(), KVPVect.end(), kvp, compFunc);
      upBound = upper_bound(KVPVect.begin(), KVPVect.end(), kvp, compFunc);
 
      free(kvp->sKey);
      delete kvp; //delete search term;

      if (lowBound != upBound)
      {
         m_kvp->fid = ((KVPIndex*)*lowBound)->fid;
         m_kvp->keySize = ((KVPIndex*)*lowBound)->keySize;
         strcpy(m_kvp->sKey, ((KVPIndex*)*lowBound)->sKey);
      }
      kvpIter = lowBound;
   }
}

The search term must be a KVPIndex* type to match the std::vector element type. So kvp is created and set to the value of the given cStr parameter.

vector<KVPIndex*>::iterator lowBound and upBound are found using the std::lower_bound and std::upper_bound methods.

Then the member fields are set in the m_kvp reference parameter to the contents of the lowBound iterator. 

The other GetFirst overload (implemented in CreateTree.h) is much simpler. It just takes the results vector returned by ST_enum_suggestions and saves the first element pointer to kvpIter and the end() pointer to upBound.

GetNext (implemented in CreateTree.h)  is called to get the next element until upBound is reached. Once upBound is reached, final m_kvp reference parameter value for m_kvp->fid will be -1.

C#
void GetNext(KVPIndex *m_kvp)
{
   m_kvp->fid = -1;
   if (++kvpIter != upBound) {
      //return (KVPIndex*)(*kvpIter);
      m_kvp->fid = ((KVPIndex*)*kvpIter)->fid;
      m_kvp->keySize = ((KVPIndex*)*kvpIter)->keySize;
      strcpy(m_kvp->sKey, ((KVPIndex*)*kvpIter)->sKey);
   }
}

Not that the memory for m_kvp->sKey was allocated in manged code. This eliminates extra work that would be required in C# to marshal the unmanged memory buffer back to managed memory.

ST_enum_suggestions and fuzzy_match

I made most of my changes to the original Ternary DAG algorithm around these two functions.

C#
// Enumerate spelling suggestions (fuzzy matches)
std::vector<KVPIndex*> * ST_enum_suggestions (ST_FILE_NODE * dict, const char * word, int *n)
{
   static ST_ENUMINFO enuminfo; //everything initialized to 0 by default
   if (enuminfo.nresults > 0) {
      enuminfo.results.RemoveEntries();
      enuminfo.nresults = 0;
   }
   else
   {
      enuminfo.results.KVPVect.reserve(500);
	   enuminfo.dict = dict;
   }
   fuzzy_match(dict + 1, word, enuminfo.result, &enuminfo);
   *n = enuminfo.nresults;
   return &(enuminfo.results.KVPVect);
}

I added the static keyword to first line. This causes the variable to be allocated once on the heap instead of the automatic stack. All members are also initialized to 0 by default. This way I can reuse the object and clear the entries if there are any.

I also removed the (*enumproc) function parameter and added *n to hold the number of results to then caller. 

The function was also changed to return a pointer to the KPVect results vector. This pointer is then passed back from the C# interface in the ST_GetKVFirst function to begin iterating over the set of suggested spelling results.

The only thing changed in fuzzy_match is that the report_suggestion function was replaced by add_suggestion. Once I have my object vector of results, I can collect it in C#. I don't want to pass a function in to do anything with each result in C++.

C#
static void add_suggestion(ST_ENUMINFO * enuminfo) {
 
   KVPIndex *kvp = new KVPIndex();
   kvp->sKey = (char *)malloc(50);
    //find range that matches suggested spelling
   for (s_kvpList->GetFirst(enuminfo->result, kvp, compExact); kvp->fid > -1; s_kvpList->GetNext(kvp))
   {
      enuminfo->results.KVPVect.push_back(kvp);
      ++(enuminfo->nresults);
      kvp = new KVPIndex();
      kvp->sKey = (char *)malloc(50);
   }
}

Note that I need to allocate 50 bytes to hold the char buffer so everything else will work the same as when the GetFirst parameter originates from C#.

If there are duplicate matches, I still want all of them returned because they will have different fid values in my application source data. To allow for duplicate string results I still iterate over all matches to the suggested spelling the same as when I do the left-matching. Except that this time, the compare function is compExact instead of comp.

Here is the compExact implementation from CreateTree.cpp

C#
//Finds range of keys using full matach
bool compExact(const KVPIndex *pred, const KVPIndex *range)
{
  return (strcmp(pred->sKey, range->sKey) < 0) ? true : false;
} 

This is much simpler to implement than the comp function because it only compares full string matches instead of left-substring matches. 

Conclusion 

Using a sorted std::vector that contains a struct with an ID member and string key for fast left-match searching allows for adequate speed to support instant type-ahead style suggest. The Ternary DAG data structure is an efficient method for searching a word list for alternate spellings when no matches are found. In cases where the word list might contain duplicate search keys for unique features, the Ternary DAG matching can be combined with the sorted vector binary search to get the full results with the unique feature id values.

History

 
First published August 19, 2012

License

This article, along with any associated source code and files, is licensed under The zlib/libpng License