Introduction
I have always wondered why there is a strstr
function available in C/C++ (provided by the CRT library), but not the expected case-insensitive stristr
function (or in wide-character set fashion: wcsistr
).
Some searching on the web yields enough results though, but most of them poorly written and usually very slow. The most basic implementations, which give correct results, first convert both strings to lower-case (or upper-case) and then perform a regular substring search. Unfortunately the memory allocation and the case conversion routine make them painfully slow. Some of the faster alternatives use assembly code, like this one by Ralph Walden. Unfortunately it cannot be used in 64-bit builds because of the inline assembly code and cannot account for a changing locale. Other approaches follow an entirely different path and try to accelerate searches by using an advanced string search algorithm, like Knuth–Morris–Pratt (used in the GNU C library nowadays) or Boyer–Moore. However, such an algorithm is error prone and may cause a significant overhead when searching small strings.
Background
A basic implementation
Let's first write a function which performs reasonably well without optimization and gives correct results with regard to unicode characters and the selected locale.
As a starting point we'll look up the regular substring search function in the Microsoft CRT library. The file we're looking for is probably located on your computer at C:\Program Files\Microsoft Visual Studio XX\VC\crt\src\intel\wcsstr.c
. The function is split into three major parts: the first branch executes when SSE4.2 is detected, the second for SSE2 and the last part when none of these is present.
Take a look at the last part of the function, which is a very basic implementation of the regular substring search in plain C. Remember: wcs1
is the string to be searched, wcs2
is the substring that is being searched for. With some slight modifications to the code we can obtain a case-insensitive version.
The first character of the substring is converted to both lower and upper case and its values are stored in l
and u
. The outer for-loop traverses through the string and compares the characters to l
and u
. If the character matches, the rest of the substring is compared by converting the characters of both string and substring to lower case on the fly in the inner while-loop.
const wchar_t* wcsistr(const wchar_t *wcs1, const wchar_t *wcs2)
{
const wchar_t *s1, *s2;
const wchar_t l = towlower(*wcs2);
const wchar_t u = towupper(*wcs2);
if (!*wcs2)
return wcs1;
for (; *wcs1; ++wcs1)
{
if (*wcs1 == l || *wcs1 == u)
{
s1 = wcs1 + 1;
s2 = wcs2 + 1;
while (*s1 && *s2 && towlower(*s1) == towlower(*s2))
++s1, ++s2;
if (!*s2)
return wcs1;
}
}
return NULL;
}
So there we have it, an already quite usable case-insensitive substring search function. Although the function performs quite well, it is still about 8~10 times slower than it's regular counterpart wcsstr
. It becomes clear strong optimization is required.
Important note: The case conversion of the characters is done using the functions towlower and towupper. For these functions (and other CRT functions involving case conversions, like _wcsicmp) to work properly with unicode characters, you should set the locale using _wsetlocale. You can set the default system locale using:
_wsetlocale(LC_ALL, L"");
Cached case conversions
Closer inspection of the function makes it clear the functions towlower
and towupper
are critical for performance, unfortunately they are rather slow. Luckily the issue can be resolved by using a caching mechanism. A fair assumption is that most text processed by the search function will likely be of a western language, which basically means only character codes in the range 0-255 are used extensively.
Elaborating this idea further leads to the introduction of three new map
-functions, which should replace their non-map
counterparts. You are now required to call the _wsetlocale_map
function at the start of your program, because in addition to setting the locale it will initialize the cache.
wchar_t g_mapToLower[256];
wchar_t g_mapToUpper[256];
wchar_t* _wsetlocale_map(int category, const wchar_t *locale)
{
wchar_t *ret = _wsetlocale(category, locale);
wchar_t c;
for (c = 0; c < 256; ++c)
{
g_mapToLower[c] = towlower(c);
g_mapToUpper[c] = towupper(c);
}
return ret;
}
wchar_t towlower_map(wchar_t c)
{return (c < 256 ? g_mapToLower[c] : towlower(c));}
wchar_t towupper_map(wchar_t c)
{return (c < 256 ? g_mapToUpper[c] : towupper(c));}
By using this caching mechanism the speed improved drastically, the case-insensitive search is now about 3.1~3.4 times slower than wcsstr
(in case of a western language). Further optimization should be sought outside the realm of the C/C++ language.
Detecting SSE availability
Before we can use the SSE2 instructions we must make sure it is available on the system, this can be done easily using the __cpuid intrinsic. This should only be done once and since the locale will generally not change that frequently we modify the _wsetlocale_map
function to avoid having an additional initialization function.
#define __ISA_AVAILABLE_X86 0
#define __ISA_AVAILABLE_SSE2 1
#define __ISA_AVAILABLE_SSE42 2
int __isa_available = __ISA_AVAILABLE_X86;
wchar_t* _wsetlocale_map(int category, const wchar_t *locale)
{
wchar_t *ret = _wsetlocale(category, locale);
wchar_t c;
int CPUInfo[4];
for (c = 0; c < 256; ++c)
{
g_mapToLower[c] = towlower(c);
g_mapToUpper[c] = towupper(c);
}
__cpuid(CPUInfo, 1);
if (CPUInfo[2] & (1 << 20))
__isa_available = __ISA_AVAILABLE_SSE42;
else if (CPUInfo[3] & (1 << 26))
__isa_available = __ISA_AVAILABLE_SSE2;
return ret;
}
Optimizing the outer loop using SSE2 instructions
To optimize the testing of characters in the outer for-loop we first load three XMM registers. Each register has 8 available places for a character (128 bit register / 16 bit character = 8 places). The first register is loaded with l
, the second with u
and the third with zeros (the string terminating null character).
The string is now traversed in chunks of 8 characters, in each of these chunks we can quickly detect whether an interesting character is present by generating a mask. The provided comments in the following function should outline the procedure.
#define XMM_SIZE sizeof(__m128i)
#define XMM_ALIGNED(p) (0 == ((XMM_SIZE-1) & (intptr_t)(p)))
#define XMM_CHARS (XMM_SIZE / sizeof(wchar_t))
#define PAGE_SIZE ((intptr_t)0x1000)
#define PAGE_OFFSET(p) ((PAGE_SIZE-1) & (intptr_t)(p))
#define XMM_PAGE_SAFE(p) (PAGE_OFFSET(p) <= (PAGE_SIZE - XMM_SIZE))
const wchar_t* wcsistr(const wchar_t *wcs1, const wchar_t *wcs2)
{
const wchar_t *s1, *s2;
const wchar_t l = towlower_map(*wcs2);
const wchar_t u = towupper_map(*wcs2);
if (!*wcs2)
return wcs1;
if (__isa_available >= __ISA_AVAILABLE_SSE2)
{
__m128i patl, patu, zero, tmp1, tmp2, tmp3;
int mask;
unsigned long offset;
patl = _mm_cvtsi32_si128(l);
patl = _mm_shufflelo_epi16(patl, 0);
patl = _mm_shuffle_epi32(patl, 0);
patu = _mm_cvtsi32_si128(u);
patu = _mm_shufflelo_epi16(patu, 0);
patu = _mm_shuffle_epi32(patu, 0);
zero = _mm_setzero_si128();
for (;;)
{
if (XMM_PAGE_SAFE(wcs1))
{
tmp1 = _mm_loadu_si128((__m128i*)wcs1); tmp2 = _mm_cmpeq_epi16(tmp1, patl); tmp3 = _mm_cmpeq_epi16(tmp1, patu); tmp2 = _mm_or_si128(tmp2, tmp3); tmp3 = _mm_cmpeq_epi16(tmp1, zero); tmp2 = _mm_or_si128(tmp2, tmp3); mask = _mm_movemask_epi8(tmp2);
if (mask == 0)
{
wcs1 += XMM_CHARS;
continue;
}
_BitScanForward(&offset, mask);
wcs1 += offset / sizeof(wchar_t);
}
if (!*wcs1)
return NULL;
if (*wcs1 == l || *wcs1 == u)
{
s1 = wcs1 + 1;
s2 = wcs2 + 1;
while (*s1 && *s2 && towlower_map(*s1) == towlower_map(*s2))
++s1, ++s2;
if (!*s2)
return wcs1;
}
++wcs1;
}
}
else
{
for (; *wcs1; ++wcs1)
{
if (*wcs1 == l || *wcs1 == u)
{
s1 = wcs1 + 1;
s2 = wcs2 + 1;
while (*s1 && *s2 && towlower_map(*s1) == towlower_map(*s2))
++s1, ++s2;
if (!*s2)
return wcs1;
}
}
}
return NULL;
}
We have again improved speed, this time performing about 2.7~2.9 times slower than a regular substring search.
Optimizing the inner loop using SSE2
The last improvement involves optimization of the inner loop, the matching of the remainder of the substring. Although comparison of the entire substring to the current string position can be done using XMM registers by creating masks it involves continuous loading of characters. This will not result in a performance gain, for that reason we keep the original while-loop.
However most of the time is wasted in this expensive while-loop: comparing the second character, the third character maybe even the fourth character but then the match might fail at the fifth. In a large text there are many occurrences of the first character of the substring, still a lot occurrences of the first two characters in sequence, but less of the first three and even a lot less of the first four. I think you can see where this is going, after matching the first character in the outer loop we're going to test directly if the first 8 characters match the substring.
To do so we will cache the first 8 characters of the substring, one XMM register will hold the lower case version, the second register will hold the upper case version. When comparing the against the target string the combined mask should have a set bit (1) at every position (except in the case the substring is smaller than 8 characters, that is what the cache_mask
variable is for). Putting this idea into practice delivers the final version of the function.
const wchar_t* wcsistr(const wchar_t *wcs1, const wchar_t *wcs2)
{
const wchar_t *s1, *s2;
const wchar_t l = towlower_map(*wcs2);
const wchar_t u = towupper_map(*wcs2);
if (!*wcs2)
return wcs1;
if (__isa_available >= __ISA_AVAILABLE_SSE2)
{
__m128i patl, patu, zero, tmp1, tmp2, tmp3, cachel, cacheu;
int mask, i, cache_mask;
unsigned long offset;
patl = _mm_cvtsi32_si128(l);
patl = _mm_shufflelo_epi16(patl, 0);
patl = _mm_shuffle_epi32(patl, 0);
patu = _mm_cvtsi32_si128(u);
patu = _mm_shufflelo_epi16(patu, 0);
patu = _mm_shuffle_epi32(patu, 0);
zero = _mm_setzero_si128();
cachel = _mm_setzero_si128();
cacheu = _mm_setzero_si128();
cache_mask = 0;
s2 = wcs2;
for (i = 0; i < XMM_CHARS; ++i)
{
cachel = _mm_srli_si128(cachel, sizeof(wchar_t)); cacheu = _mm_srli_si128(cacheu, sizeof(wchar_t));
cachel = _mm_insert_epi16(cachel, towlower_map(*s2), XMM_CHARS-1); cacheu = _mm_insert_epi16(cacheu, towupper_map(*s2), XMM_CHARS-1);
if (*s2)
{
cache_mask |= 3 << (i << 1); ++s2;
}
}
for (;;)
{
if (XMM_PAGE_SAFE(wcs1))
{
tmp1 = _mm_loadu_si128((__m128i*)wcs1); tmp2 = _mm_cmpeq_epi16(tmp1, patl); tmp3 = _mm_cmpeq_epi16(tmp1, patu); tmp2 = _mm_or_si128(tmp2, tmp3); tmp3 = _mm_cmpeq_epi16(tmp1, zero); tmp2 = _mm_or_si128(tmp2, tmp3); mask = _mm_movemask_epi8(tmp2);
if (mask == 0)
{
wcs1 += XMM_CHARS;
continue;
}
_BitScanForward(&offset, mask);
wcs1 += offset / sizeof(wchar_t);
if (*wcs1 && XMM_PAGE_SAFE(wcs1))
{
tmp1 = _mm_loadu_si128((__m128i*)wcs1);
tmp2 = _mm_cmpeq_epi16(tmp1, cachel);
tmp3 = _mm_cmpeq_epi16(tmp1, cacheu);
tmp2 = _mm_or_si128(tmp2, tmp3);
mask = _mm_movemask_epi8(tmp2);
if (cache_mask == 0xFFFF) {
if (mask == cache_mask)
{
s1 = wcs1 + XMM_CHARS;
s2 = wcs2 + XMM_CHARS;
while (*s1 && *s2 && towlower_map(*s1) == towlower_map(*s2))
++s1, ++s2;
if (!*s2)
return wcs1;
}
}
else {
if ((mask & cache_mask) == cache_mask)
return wcs1;
}
++wcs1;
continue;
}
}
if (!*wcs1)
return NULL;
if (*wcs1 == l || *wcs1 == u)
{
s1 = wcs1 + 1;
s2 = wcs2 + 1;
while (*s1 && *s2 && towlower_map(*s1) == towlower_map(*s2))
++s1, ++s2;
if (!*s2)
return wcs1;
}
++wcs1;
}
}
else
{
for (; *wcs1; ++wcs1)
{
if (*wcs1 == l || *wcs1 == u)
{
s1 = wcs1 + 1;
s2 = wcs2 + 1;
while (*s1 && *s2 && towlower_map(*s1) == towlower_map(*s2))
++s1, ++s2;
if (!*s2)
return wcs1;
}
}
}
return NULL;
}
Final result: a case-insensitive substring search is now about 2.2~2.4 times slower than a regular substring search.
Using the code
With this article I have attached a zip file which contains the C source file and its accompanying header file. Using this code in your project is very straightforward: just copy both files to your project directory and add them to your project in Visual Studio. To make use of the function you should notify the compiler of its existence and include the header file by:
#include "wcsistr.h"
Do not forget to set the current locale and initialize the case conversion cache by calling the newly introduced _wsetlocale_map
function. The function should at least be called once and this is done preferably at program start.
_wsetlocale_map(LC_ALL, L"");
A typical case-insensitive substring search is now performed like:
const wchar_t *p = wcsistr(L"HeLLo, wOrLD!", L"world");
Points of Interest
- Comparison of unicode strings involving complex characters require extra attention. In unicode characters are usually presented either as a single precomposed character or as a decomposed sequence of a base letter plus one or more non-spacing marks. If the representation of the string and substring is different the
wcsistr
function obviously fails, it is just comparing strings using on a 16-bit basis. Which means you need to normalize both strings with NormalizeString before using the wcsistr
function. More information can be found here. But even then I highly doubt this will work in 100% of the cases, unicode is just very hard to get right. - Although SSE4.2 has specific instructions intended for string comparisons, the results I obtained using these instructions have been disappointing. At best the function still performed about 10~20% slower than the SSE2 accelerated version presented in this article.
I hope you have enjoyed this article. Please feel free to leave some feedback concerning errors, performance or other matters.
History
- May 14, 2012 - Article published.