Abstract
VC++, GCC and other compilers for x86 use inefficient implementation of strncmp()
. If this function call is part of an inner loop, your application may slow down dramatically. A set of preprocessor macros replaces this lame function, thus making your code smaller and faster.
What's the Problem?
Text search applications and interpreters typically have strncmp()
in their inner loops. This sample code interprets "PRINT" and "GOTO" commands (of course, it's just a primitive example):
char buf[] = "PRINT xyz";
if(0 == strncmp(buf, "PRINT", 5)) {
DoPrint(buf+5);
}
else if(0 == strncmp(buf, "GOTO", 4)) {
DoGoto(buf+4);
}
Let's see how VC++ 6.0 implements this function (use your favorite disassembler to view the code):
strncmp proc
push ebp
mov ebp, esp
push edi
push esi
push ebx
mov ecx, dword ptr [ebp+10]
jcxz exit
mov ebx, ecx
mov edi, dword ptr [ebp+08]
mov esi, edi
xor eax, eax
repnz scasb
neg ecx
add ecx, ebx
mov edi, esi
mov esi, dword ptr [ebp+0C]
repz cmpsb
mov al, byte ptr [esi-01]
xor ecx, ecx
cmp al, byte ptr [edi-01]
ja smaller_s1
je exit
dec ecx
dec ecx
smaller_s1:
not ecx
exit:
mov eax, ecx
pop ebx
pop esi
pop edi
leave
ret
end proc
push 5
lea ecx, DWORD PTR [esp+4]
push OFFSET PRINT
push ecx
call _strncmp
add esp, 12
test eax, eax
jne SHORT skip
skip:
Umm, tons of machine code from a simple line of your C program... Is there any easier way to compare strings? Sure, there is:
cmp DWORD PTR , 'NIRP'
jnz skip
cmp BYTE PTR , 'T'
jnz skip
skip:
These four instructions do the same comparison as the long code above, but they work much faster and take up less space. We start by comparing the first four characters of the string with 'PRIN'. 32-bit processors such as Pentium or Athlon can read and compare 4 bytes very quickly. Just put one cmp
instruction to do this.
But why 'NIRP' instead of 'PRIN'? Why are the character reversed? Remember, x86 is a little-endian processor: it reverses byte order when reading DWORD
s from memory. Thus we need to reverse our string too. The processor will read reversed DWORD
, compare it with reversed 'PRIN', and say if they are equal. If so, it will then compare fifth character with 'T'. If all conditions are met, DoPrint
will be executed. In other cases, the processor will jump to the skip
label.
You don't need to use assembler for this trick. The same can be done with C/C++:
if(('P' | ('R' << 8) | ('I' << 16) | ('N' << 24)) == *(long*)buf &&
'T' == buf[4]) {
DoPrint(buf+5);
}
Exactly the same nice and fast machine code will be generated. But such tricky code is hard to read and maintain. Let's make our work easier and write some macros.
Macros and Their Usage
#ifdef UNICODE
#define cmp2ch(s,a,b) (*(long*)(s) == \
((unsigned short)(a) | (unsigned short)(b) << 16))
#define cmp4ch(s,a,b,c,d) (*(long*)(s) == \
((unsigned short)(a) | (unsigned short)(b) << 16) && \
*(long*)(s+2) == \
((unsigned short)(c) | (unsigned short)(d) << 16))
#define set2ch(s,a,b) (*(long*)(s) = \
((unsigned short)(a) | (unsigned short)(b) << 16));
#define set4ch(s,a,b,c,d) (*(long*)(s) = \
((unsigned short)(a) | (unsigned short)(b) << 16), \
*(long*)(s+2) = \
((unsigned short)(c) | (unsigned short)(d) << 16));
#else
#define cmp2ch(s,a,b) (*(short*)(s) == \
((unsigned char)(a) | (unsigned char)(b) << 8))
#define cmp4ch(s,a,b,c,d) (*(long*)(s) == \
((unsigned char)(a) | (unsigned char)(b) << 8 | \
(unsigned char)(c) << 16 | (unsigned char)(d) << 24))
#define set2ch(s,a,b) (*(short*)(s) = \
((unsigned char)(a) | (unsigned char)(b) << 8));
#define set4ch(s,a,b,c,d) (*(long*)(s) = \
((unsigned char)(a) | (unsigned char)(b) << 8 | \
(unsigned char)(c) << 16 | (unsigned char)(d) << 24));
#endif
cmp2ch
and cmp4ch
macros compare 2 and 4 characters of a string. The following example shows how to use them:
if(cmp4ch(buf, 'P','R','I','N') && buf[4] == 'T')
DoPrint(buf+5);
Another example:
if(cmp4ch(buf, 'Y','E','S','\0'))
DoYes();
If the string is equal to "YES", DoYes()
function is invoked.
Although macros seem to be quite complicated, they are compiled to very simple machine code. Note that all shifts and ORs are evaluated during compilation, so the resulting code consists of just a fast comparison (CMP
and JNZ
instructions).
Finally, let's talk about set2ch
and set4ch
. As their names say, these macros do fast assignment of 2 or 4 characters:
set4ch(buf, 'P','R','I','N');
set2ch(buf+4, 'T','\0');
The same technique can be applied to other functions, e.g., to replace strcat()
:
TCHAR filename[MAX_PATH];
int len = SendMessage(hwnd, WM_GETTEXT, 0, (LPARAM)filename);
if(*(p + len) != '\\')
*(p + len) = '\\', len++;
set4ch(p + len, 'f', 'i', 'l', 'e');
set4ch(p + len + 4, '.', 't', 'x', 't');
*(p + len + 8) = '\0';
HANDLE hFile = CreateFile(filename, GENERIC_READ, FILE_SHARE_READ, ...);
Portability and Localization
These macros were tested on various platforms and compilers including MS VC++ 6.0, LCC 3.8, MinGW and GCC 3.2 under Linux. The tests compiled and run successfully, producing perfect and fast code. I have been using cmpXch
macros in my freeware applications for one year and had no problems with them.
When compiled with #define UNICODE
, the macros perform comparison of wide characters. So there should be no difficulties in maintaining Unicode and ANSI builds of your application. Macros also can be easily ported to Win64. (If anybody can test them on 64-bit processor, please post a message below.)
Localization seems to be much harder than porting. MS VC++ can properly compile this line with Russian characters in Unicode:
set4ch(str, TEXT('Ш'), TEXT('и'), TEXT('л'), TEXT('о'));
But other compilers generate a wrong code! LCC 3.8 can't properly convert a literal ASCII character to Unicode (not to mention it was unable to optimize shifts). MinGW also failed to select right Russian character when UNICODE
was defined.
However, macros work fine with English language (in both Unicode and ASCII modes) and with any language using 8-bit ASCII. If you are building BASIC interpreter or parsing HTML tags, this should be enough for you. When you need Unicode support for non-English characters, you have to use MS VC++ compiler.
Case-insensitive search
How to perform case-insensitive search using this technique? One of the possible solutions is to compare all possible combinations of 2 first characters (e.g., PR, pr, Pr, pR for "PRINT"). If this comparison returns true, let's use strcmpi
, lstrcmpi
or CharUpper
to check out the rest of our string:
if(cmp2ch(buf, 'P', 'R') || cmp2ch(buf, 'P', 'r') ||
cmp2ch(buf, 'p', 'R') || cmp2ch(buf, 'p', 'r')) {
char tmp[5];
*(long*)tmp = *(long*)buf;
tmp[4] = buf[4];
CharUpperBuff(tmp, 5);
if(cmp4ch(tmp, 'P', 'R', 'I', 'N') && tmp[4] == 'T')
DoPrint(buf+5);
}
Another variation (ugly, but faster):
if((cmp2ch(buf, 'P', 'R') || cmp2ch(buf, 'P', 'r') ||
cmp2ch(buf, 'p', 'R') || cmp2ch(buf, 'p', 'r')) &&
(cmp2ch(buf+2, 'I', 'N') || cmp2ch(buf+2, 'I', 'n') ||
cmp2ch(buf+2, 'i', 'N') || cmp2ch(buf+2, 'i', 'n'))
&& (buf[4] == 'T' || buf[4] == 't'))
DoPrint(buf+5);
If you need to compare only English letters, you can use a very fast trick. Note that each capital letter has the fifth bit cleared, and corresponding small letters have the same bit set:
0100 0001 � A |
0100 0010 � B |
0100 0011 � C |
0100 0100 � D |
0110 0001 � a |
0110 0010 � b |
0110 0011 � c |
0110 0100 � d |
Let's clear this bit by ANDing letters with 1101 1111 = 0xDF. Thus we get only capital letters and can compare them with a constant:
if(('P' | ('R' << 8) | ('I' << 16) | ('N' << 24)) == (*(long*)buf & 0xDFDFDFDF)
&& (buf[4] & 0xDF) == 'T') {
DoPrint(buf+5);
}
Although this code is extremely fast and small, it works only for English letters. Adding digits or brackets will introduce bugs. For example, "{" will be converted to "[" and this code won't operate correctly:
if(('[' | ('A' << 8)) == (*(short*)buf & 0xDFDF)) {
ReportError();
}
This can be rewritten as follows:
if(('[' | ('A' << 8)) == (*(short*)buf & 0xDFFF)) {
ReportError(buf+5);
}
This code converts the second letter by ANDing it with DF, but leave the first letter unchanged.
Buffer Length Warning
You should always check that string comparison doesn't exceed available buffer length. Avoid such kind of code:
char buf[256], *p = buf;
GetBuf(&buf, 256);
do {
if(cmp2ch(p, '\r', '\n'))
DoSomething(p);
} while(*p++);
Remember, you are comparing the current character and the next character by using cmp2ch
. If there are only 256 characters in string, the last iteration will try to compare 256 and 257 characters with "\r\n". So, you should add one extra character when allocating the buffer:
char buf[257], *p = buf;
GetBuf(&buf, 256);
...
Or, if you know the string length, make only (length - 1) iterations:
char *buf = (char*)malloc(len), *p = buf;
GetBuf(&buf, len);
len--;
while(len > 0) {
if(cmp2ch(p, '\r', '\n'))
DoSomething(p);
len--, p++;
}
free(buf);
Future
Of course, these macros can't be considered an elegant way to compare strings. But C standard library and CString
-like classes are even worse. strcmp()
and other string manipulation routines proved to be very large and non-effective, and they can significantly slow down your inner loops.
The problem resides in C language itself. There is no built-in string type, so the compiler can't optimize string-manipulation code. I hope some compiler writer will read this article and will think of a language with fast string processing. It would be great if any compiler could apply these tricks automatically.
History of changes
12-24-2004: