Introduction
The speed and efficiency of the C++ Standard Template Library (STL) vector class is virtually unbeatable. This article demonstrates how you can leverage the vector class in your C# application by using platform invoke (P-Invoke) with standard marshaling to call unmanaged DLL functions.
To open the vector class to DLL export, I wrap vector in a C++ struct called <span style="font-size: 10pt;">KVPCollection</span>
. A number of public methods are used in <span style="font-size: 10pt;">KVPCollection</span>
to implement the desired functionality for my C# p-invoke calls. For example, <span style="font-size: 10pt;">GetFirst</span>
and <span style="font-size: 10pt;">GetNext</span>
methods implement a Java-like forward iterator pattern.
The vector element type is a simple struct that contains a key-value pair. The key is a char* to a zero-terminated string buffer. The values are
<span style="font-size: 10pt;">keySize</span>
and an int index to a feature id (
<span style="font-size: 10pt;">fid</span>
). A managed version of this struct is created in C# and is marshaled as a reference type to and from the C++ DLL static function. It is then used in the other DLL methods as an unmanaged struct type.
STL vector methods are used to iterate over a range of left-substring matches in a sorted list. This can be used in applications such as selecting data for instant "search suggest".
In the case where there are no matches for the input string, I implement a version of Peter Kankowski's article "Using Ternary DAGs for Spelling Correction". I use the Ternary DAG data structure to collect all of the suggested spellings in a STL vector that I iterate over in the same pattern as before. Besides proposing correct spelling for misspelled dictionary words, this is useful for finding a street address, for example, where you don't know the exact spelling of the street name.
I can index all records with string keys in the both data structures (std::vector and Ternary DAG) with the string plus the feature id. This can greatly improve the speed of simple selections in "Shapefile attribute" (.dbf) files, for example, because a "like" operator can be use for a left match, and the Ternary DAG is very efficient for finding alternative spellings in the data set. Features from the .dbf file can then be selected by fid instead of the string key.
Using the code
CSharpTest Project
The CSharpTest project is a console app that demonstrates how to call the exposed static functions in the C++ DLL.
The first step is to populate the data structures.
SuggestWin32.ST_CreateInit();
FileStream fs = new FileStream(@"..\..\dic.txt",
FileMode.Open, FileAccess.Read, FileShare.Read);
StreamReader sr = new StreamReader(fs, Encoding.Default);
int iCount = 0;
while (!sr.EndOfStream)
{
string line = sr.ReadLine();
if (line.Length > 0)
{
SuggestWin32.ST_insert(line, iCount++);
}
}
sr.Close();
SuggestWin32.ST_write_to_file(@"..\..\spellchk.dat");
The ST_CreateInit()
call is used to create the underlying std::vector container (KVPCollection
) and it also initializes a global counter for the Ternary DAG nodes.
ST_insert(line, iCount++)
inserts a line of text from the dic.txt file into the KVPCollection
and adds a node to the DAGTree structure.
Once all the data is inserted, ST_write_to_file
is called to binary serialize the two data structures to files. The same path is used for both files. You can name the DAGTree serialize file what you like (in this case spellchk.dat), and the KVPCollectio
n serialize file is named dict.dat. Both files will be replaced if they exist. The std:vector member of KVPCollection
is also sorted prior to output to dict.dat.
To use the index data sets you then call ST_open_dict
as shown below.
IntPtr Dict = SuggestWin32.ST_open_dict(@"..\..\spellchk.dat");
if (Dict == null)
{
Console.WriteLine("Missing or bad input file");
}
This call deserializes the data to memory from both spellchk.dat and dict.dat. This is the reverse of what ST_write_to_file
does. If anything goes wrong with deserializing the spellchk.dat file, null is returned. Otherwise an IntPtr handle to the root node of the DAGTree is returned.
If the serialized data is current, you can skip the write and go straight to ST_open_dict
. This will save a few seconds in your application startup.
This C# struct is used in a reference parameter to hold the results. Note that the memory for the struct is allocated on the managed site so that default marshaling of a struct can be used.
[StructLayout(LayoutKind.Sequential)]
public struct KVPIndex
{
public int fid;
public uint keySize;
public string sKey;
}
LayoutKind.Sequential
is used to insure that the order of fields will match the C++ version of this struct.
A KVPIndex type is created like this.
KVPIndex kvp = new KVPIndex();
kvp.sKey = new String(' ', 50);
Since this is a struct, the default constructor initializes all of the fields to their C# default values.
Memory is allocated for marshaling the String
key member (sKey
) with enough memory for 50 characters. This is done by using the String
constructor overload that creates a string with 50 repeated spaces in this example. The char
you chose to use does not really matter.
Default marshaling automatically converts the .NET String
type to a const char*
buffer on the C++ side.
Inside the do loop, each user-supplied word is tested for matches in the data sets, and the iterator interface is used to return the results.
Console.Write("Enter a word to check: ");
word = Console.ReadLine();
if (word.Length > 0)
{
SuggestWin32.ST_GetFirst(word, ref kvp);
if (kvp.fid > -1)
{
for (; kvp.fid > -1; SuggestWin32.ST_GetNext(ref kvp))
{
Console.WriteLine("{0} fid:{1}", kvp.sKey, kvp.fid);
kvp.sKey = new String(' ', 50);
}
}
else
{
int n;
IntPtr KVPResults = SuggestWin32.ST_enum_suggestions(Dict, word, out n);
if (n > 0)
{
for (SuggestWin32.ST_GetKVFirst(KVPResults, ref kvp); kvp.fid > -1;
SuggestWin32.ST_GetNext(ref kvp))
{
Console.WriteLine("{0} fid:{1}", kvp.sKey, kvp.fid);
kvp.sKey = new String(' ', 50);
}
}
}
}
ST_GetFirst
is used to match all the words in the std::vector (which contains the dict.dat strings) that begin with the string parameter
. Matching is case sensitive. If at least one match is found, the kvp
object will contain the first match. Otherwise, kvp.fid
will be set to -1 to indicate no matches were found.
If matches to word
were found, the match values are printed inside the for loop.
ST_GetNext
is called on the next loop to get the next iteration of kvp in the range of matches. Then the check for kvp.fid > -1
is done. This time, if kvp.fid == -1
, it indicates that there are no more matches left to return and the for loop terminates.
If no matches were found in the first test, a second test is found to find other suggested spellings for the word. ST_enum_suggestions
is called to create a std::vector
of KVPIndex
objects containing suggested spelling matches. A handle to this data structure (KVPResults) is returned along with the number of matches.
Then ST_GetKVFirst
is called in the for-loop initializer to begin iterating through the results. ST_GetKVFirst
takes the handle to KVPResults
and the reference to kvp
as parameters. Results are returned in kvp
the same way as before.
All of the static extern method prototypes are defined in the SuggestWin32 static C# class.
public static class SuggestWin32
{
[DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
public static extern void ST_CreateInit();
[DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
public static extern void ST_insert(string word, int index);
[DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
public static extern bool ST_write_to_file(string path);
[DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
public static extern IntPtr ST_open_dict(string filename);
[DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
public static extern void ST_close_dict(IntPtr dict);
[DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
public static extern IntPtr ST_enum_suggestions(IntPtr dict, string word, out int n);
[DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
public static extern void ST_GetFirst(string cStr, [MarshalAs(UnmanagedType.Struct)] ref KVPIndex kvp);
[DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
public static extern void ST_GetKVFirst(IntPtr KVPResults, [MarshalAs(UnmanagedType.Struct)] ref KVPIndex kvp);
[DllImport("..\\..\\..\\Debug\\SpellSuggest.dll")]
public static extern void ST_GetNext([MarshalAs(UnmanagedType.Struct)] ref KVPIndex kvp);
}
Note that you only need to supply the path to your dll in the DLLImport statement parameter. There are other optional settings, but they are not needed for this project.
For this project, all of the parameters for the static method prototypes are either string, int, IntPtr, or KVPIndex types. MarshalAs(UnmangedType.Struct) is the default marshaling for C# struct, but I put the statement in just to be clear on what it is doing with the struct parameter.
DllImport can be set to two different calling conventions: StdCall or Cdecl. StdCall is the default (except in CE.NET) and Cdecl needs to be used if the C++ function you import takes a variable number of arguments (e.g. printf).
SpellSuggest Project
This project source has two folders with all of the source code: Header and Source.
The Header folder contains CreateTree.h and spellchecker.h. The Source folder contains CreateTree.cpp and spellcheker.cpp.
CreateTree.h
This header file contains three function prototype exports.
extern "C" {
__declspec(dllexport) void ST_CreateInit();
__declspec(dllexport) void ST_insert(const char* str, int index);
__declspec(dllexport) bool ST_write_to_file(const char* path);
}
It also contains the C++ unmanaged version of the KVPIndex struct and the C++ KVPCollection struct.
struct KVPIndex {
int fid;
size_t keySize;
char *sKey;
};
struct KVPCollection
{
std::vector<KVPIndex*> KVPVect;
KVPCollection() {
KVPVect.reserve(ST_MAX_ENTRIES);
}
~KVPCollection() {
vector<KVPIndex*>::iterator vit;
for (vit=KVPVect.begin(); vit < KVPVect.end(); ++vit)
{
free((KVPIndex*)(*vit)->sKey);
delete(*vit);
}
}
}
Note that in C++ structs are the same as classes except that members are public by default.
The first line of KVPCollection creates a object, KVPVect
, that is an instance of std::vector<KVPIndex*>
. Thus, each cell in KVPVect
contains a pointer to KVPIndex
.
The constructor reserves enough cells in the vector to hold ST_MAX_ENTRIES
(300000) pointers. If more space is needed the vector object will reallocate a new buffer that is larger and copy all the cell contents (the pointers) to the new buffer. ST_CreateInit()
calls this constructor to create and instance of KVPCollection
;
The destructor frees memory that has been dynamically allocated for sKey
and KVPIndex
instances.
ST_insert
calls the InsertCStr
method in the KVPCollection
class. Here is the implementation of the method.
void InsertCStr(const char *cStr, int index)
{
KVPIndex* m_kvp;
m_kvp = new KVPIndex();
m_kvp->keySize = strlen(cStr);
if (m_kvp->keySize > 0) {
if (m_kvp->keySize >= KVPIndexC_sKey_Size) {
m_kvp->keySize = KVPIndexC_sKey_Size - 1;
}
m_kvp->sKey = (char *)malloc(m_kvp->keySize + 1);
strncpy(m_kvp->sKey, cStr, m_kvp->keySize);
m_kvp->sKey[m_kvp->keySize] = '\0';
m_kvp->fid = index;
KVPVect.push_back(m_kvp);
}
}
The minimum amount of memory is allocated to each sKey
pointer up to 50 chars (including the 0 terminator). The given index
parameter is saved to fid
.
The same data is inserted into the DAGTree. See the Ternary DAG Tree link above for additional info on the DAGTree structure.
ST_write_to_file
calls the Write
method in the KVPCollection
class to create the dict.dat file.
void Write(const char *fileName)
{
Sort();
ofstream outFile(fileName, ios::out | ios::binary);
char* out = (char*)malloc(sizeof(KVPIndexC));
vector<KVPIndex*>::const_iterator vit;
for (vit=KVPVect.begin(); vit < KVPVect.end(); ++vit)
{
KVPIndex *kvpTemp = (KVPIndex*)*vit;
memmove(out, (const void*)kvpTemp, 8);
memmove(out+8, kvpTemp->sKey, kvpTemp->keySize + 1);
outFile.write(out, kvpTemp->keySize + 9);
}
outFile.close();
}
KVPVect
is sorted and vector<KVPIndex*>::const_iterator
is used to iterate through the contents of KVPVect
to write each element with a binary write method to outFile
.
spellchecker.h
This header file contains the rest of the function prototype exports that are imported in CSharpTest.
extern "C" {
__declspec(dllexport) size_t MAX_ENUM_WORDS();
}
extern "C" {
__declspec(dllexport) ST_FILE_NODE * ST_open_dict (const char * filename);
}
extern "C" {
__declspec(dllexport) std::vector<KVPIndex*> * ST_enum_suggestions (ST_FILE_NODE * dict, const char * word, int *n);
}
extern "C" {
__declspec(dllexport) void ST_close_dict (ST_FILE_NODE * dict);
}
extern "C" {
__declspec(dllexport) void ST_GetFirst(const char *cStr, KVPIndex *kvpSt);
__declspec(dllexport) void ST_GetKVFirst(std::vector<KVPIndex*> *KVPResults, KVPIndex *m_kvp);
}
extern "C" {
__declspec(dllexport) void ST_GetNext(KVPIndex *kvpSt);
}
The ENumResults
struct is used to manage another KVPVect
that stores a list of spelling suggestions generated in the ST_enum_suggestions
call.
struct ENumResults {
std::vector<KVPIndex*> KVPVect;
~ENumResults() {
vector<KVPIndex*>::iterator vit;
for (vit=KVPVect.begin(); vit < KVPVect.end(); ++vit)
{
free((KVPIndex*)(*vit)->sKey);
delete(*vit);
}
}
void RemoveEntries() {
vector<KVPIndex*>::iterator vit;
for (vit=KVPVect.begin(); vit < KVPVect.end(); ++vit)
{
free((KVPIndex*)(*vit)->sKey);
delete(*vit);
}
KVPVect.clear();
}
};
The destructor is similar to RemoveEntries
except that RemoveEntries
frees memory for all elements and clears KVPVect
instead for deleting the KVPVect
object.
spellchecker.cpp and CreateTree.cpp
Most of the wrapper functions for dllexport are implemented in spellchecker.cpp.
Many of these functions call methods in the KVPCollection
struct that are either implemented in CreateTree.h or CreateTree.cpp.
ST_GetFirst
calls an overload of GetFirst
that is implemented in CreateTree.cpp.
A C function, comp
, is used to compare CompStr to the range of values in the sorted std::vector.
const char *CompStr;
bool comp(const KVPIndex *pred, const KVPIndex *range)
{
if (strcmp(CompStr, range->sKey))
{
if (strcmp(CompStr, pred->sKey)) {
return (strcmp(pred->sKey, range->sKey) < 0) ? true : false;
}
if (range->keySize <= pred->keySize) {
return (strcmp(pred->sKey, range->sKey) < 0) ? true : false;
}
return (strncmp(pred->sKey, range->sKey, pred->keySize) < 0) ? true : false;
}
else
{
if (pred->keySize <= range->keySize) {
return (strcmp(pred->sKey, range->sKey) < 0) ? true : false;
}
return (strncmp(pred->sKey, range->sKey, range->keySize) < 0) ? true : false;
}
}
pred
and range
are actually interchangeable so CompStr
is used to hold the search string in order to correctly left match CompStr as a substring in sKey
vector elements. If CompStr length is less than the vector key item, then only the key size of the input key should be compared. The right part of the range key that is past the length of the input key can be ignored. This function took some debugging to get it working correctly for left matches.
Here is this GetFirst implementation.
void KVPCollection::GetFirst(const char *cStr, KVPIndex *m_kvp, bool (*compFunc)(const KVPIndex *, const KVPIndex *))
{
m_kvp->fid = -1;
int len = strlen(cStr);
if (len > 0) {
CompStr = cStr;
KVPIndex* kvp;
kvp = new KVPIndex();
kvp->keySize = len;
if (len >= KVPIndexC_sKey_Size) {
kvp->keySize = KVPIndexC_sKey_Size - 1;
}
kvp->sKey = (char *)malloc(kvp->keySize + 1);
strncpy(kvp->sKey, cStr, kvp->keySize);
kvp->sKey[kvp->keySize] = '\0';
lowBound = lower_bound(KVPVect.begin(), KVPVect.end(), kvp, compFunc);
upBound = upper_bound(KVPVect.begin(), KVPVect.end(), kvp, compFunc);
free(kvp->sKey);
delete kvp;
if (lowBound != upBound)
{
m_kvp->fid = ((KVPIndex*)*lowBound)->fid;
m_kvp->keySize = ((KVPIndex*)*lowBound)->keySize;
strcpy(m_kvp->sKey, ((KVPIndex*)*lowBound)->sKey);
}
kvpIter = lowBound;
}
}
The search term must be a KVPIndex*
type to match the std::vector
element type. So kvp
is created and set to the value of the given cStr
parameter.
vector<KVPIndex*>::iterator lowBound
and upBound
are found using the std::lower_bound and std::upper_bound methods.
Then the member fields are set in the m_kvp
reference parameter to the contents of the lowBound
iterator.
The other GetFirst
overload (implemented in CreateTree.h) is much simpler. It just takes the results vector returned by ST_enum_suggestions
and saves the first element pointer to kvpIter
and the end()
pointer to upBound.
GetNext (implemented in CreateTree.h) is called to get the next element until upBound is reached. Once upBound is reached, final m_kvp reference parameter value for m_kvp->fid will be -1.
void GetNext(KVPIndex *m_kvp)
{
m_kvp->fid = -1;
if (++kvpIter != upBound) {
m_kvp->fid = ((KVPIndex*)*kvpIter)->fid;
m_kvp->keySize = ((KVPIndex*)*kvpIter)->keySize;
strcpy(m_kvp->sKey, ((KVPIndex*)*kvpIter)->sKey);
}
}
Not that the memory for m_kvp->sKey was allocated in manged code. This eliminates extra work that would be required in C# to marshal the unmanged memory buffer back to managed memory.
ST_enum_suggestions and fuzzy_match
I made most of my changes to the original Ternary DAG algorithm around these two functions.
std::vector<KVPIndex*> * ST_enum_suggestions (ST_FILE_NODE * dict, const char * word, int *n)
{
static ST_ENUMINFO enuminfo;
if (enuminfo.nresults > 0) {
enuminfo.results.RemoveEntries();
enuminfo.nresults = 0;
}
else
{
enuminfo.results.KVPVect.reserve(500);
enuminfo.dict = dict;
}
fuzzy_match(dict + 1, word, enuminfo.result, &enuminfo);
*n = enuminfo.nresults;
return &(enuminfo.results.KVPVect);
}
I added the static keyword to first line. This causes the variable to be allocated once on the heap instead of the automatic stack. All members are also initialized to 0 by default. This way I can reuse the object and clear the entries if there are any.
I also removed the (*enumproc)
function parameter and added *n
to hold the number of results to then caller.
The function was also changed to return a pointer to the KPVect
results vector. This pointer is then passed back from the C# interface in the ST_GetKVFirst
function to begin iterating over the set of suggested spelling results.
The only thing changed in fuzzy_match
is that the report_suggestion
function was replaced by add_suggestion
. Once I have my object vector of results, I can collect it in C#. I don't want to pass a function in to do anything with each result in C++.
static void add_suggestion(ST_ENUMINFO * enuminfo) {
KVPIndex *kvp = new KVPIndex();
kvp->sKey = (char *)malloc(50);
for (s_kvpList->GetFirst(enuminfo->result, kvp, compExact); kvp->fid > -1; s_kvpList->GetNext(kvp))
{
enuminfo->results.KVPVect.push_back(kvp);
++(enuminfo->nresults);
kvp = new KVPIndex();
kvp->sKey = (char *)malloc(50);
}
}
Note that I need to allocate 50 bytes to hold the char buffer so everything else will work the same as when the GetFirst parameter originates from C#.
If there are duplicate matches, I still want all of them returned because they will have different fid values in my application source data. To allow for duplicate string results I still iterate over all matches to the suggested spelling the same as when I do the left-matching. Except that this time, the compare function is compExact
instead of comp
.
Here is the compExact implementation from CreateTree.cpp
bool compExact(const KVPIndex *pred, const KVPIndex *range)
{
return (strcmp(pred->sKey, range->sKey) < 0) ? true : false;
}
This is much simpler to implement than the comp function because it only compares full string matches instead of left-substring matches.
Conclusion
Using a sorted std::vector that contains a struct with an ID member and string key for fast left-match searching allows for adequate speed to support instant type-ahead style suggest. The Ternary DAG data structure is an efficient method for searching a word list for alternate spellings when no matches are found. In cases where the word list might contain duplicate search keys for unique features, the Ternary DAG matching can be combined with the sorted vector binary search to get the full results with the unique feature id values.
History
First published August 19, 2012