(untagged)

Optimizing String-Manipulation Functions

Peter Kankowski

0.00/5 (No votes)

26 Dec 2004

Optimizing string-manipulation functions.

Download demo project - 3.81 Kb

Abstract

VC++, GCC and other compilers for x86 use inefficient implementation of strncmp(). If this function call is part of an inner loop, your application may slow down dramatically. A set of preprocessor macros replaces this lame function, thus making your code smaller and faster.

What's the Problem?

Text search applications and interpreters typically have strncmp() in their inner loops. This sample code interprets "PRINT" and "GOTO" commands (of course, it's just a primitive example):

char buf[] = "PRINT xyz";
if(0 == strncmp(buf, "PRINT", 5)) {
    DoPrint(buf+5);
}
else if(0 == strncmp(buf, "GOTO", 4)) {
    DoGoto(buf+4);
}

Let's see how VC++ 6.0 implements this function (use your favorite disassembler to view the code):

strncmp proc ; int strncmp(const char *s1, const char *s2, size_t len)

    ; function prolog

    push ebp
    mov ebp, esp
    push edi
    push esi
    push ebx
    ; if (0 == len) goto exit

    mov ecx, dword ptr [ebp+10]
    jcxz exit

    ; search for terminating zero in s1

    mov ebx, ecx
    mov edi, dword ptr [ebp+08]
    mov esi, edi
    xor eax, eax
    repnz scasb
    ;ecx = the smaller of two numbers: len and strlen(s1)

    neg ecx
    add ecx, ebx

    ;compare ecx bytes from s1 and s2

    mov edi, esi
    mov esi, dword ptr [ebp+0C]
    repz cmpsb
    ;compare the first different character in both strings

    mov al, byte ptr [esi-01]
    xor ecx, ecx
    cmp al, byte ptr [edi-01]

    ja smaller_s1 ; s1 is less than s2, should return -1

    je exit ; s1 == s2, should return zero

    ; ecx = -2

    dec ecx
    dec ecx
smaller_s1:
    ; if s1 was less than s2, return ~0 = -1

    ; if s1 was greater than s2, return ~(-2) = 1

    not ecx

exit:
    ; epilog: return the value

    mov eax, ecx
    pop ebx
    pop esi
    pop edi
    leave
    ret
end proc


; Using strncmp

; if(0 == strncmp(buf, "PRINT", 5)) {

    push 5
    lea    ecx, DWORD PTR [esp+4] ; buf

    push OFFSET PRINT
    push ecx
    call _strncmp
    add    esp, 12
    test eax, eax
    jne    SHORT skip
; DoPrint(buf+5)

    ;some printing stuff here

skip:

Umm, tons of machine code from a simple line of your C program... Is there any easier way to compare strings? Sure, there is:

     cmp DWORD PTR , 'NIRP'
     jnz skip
     cmp BYTE PTR , 'T'
     jnz skip
; DoPrint

     ;printing

skip:

These four instructions do the same comparison as the long code above, but they work much faster and take up less space. We start by comparing the first four characters of the string with 'PRIN'. 32-bit processors such as Pentium or Athlon can read and compare 4 bytes very quickly. Just put one cmp instruction to do this.

But why 'NIRP' instead of 'PRIN'? Why are the character reversed? Remember, x86 is a little-endian processor: it reverses byte order when reading DWORDs from memory. Thus we need to reverse our string too. The processor will read reversed DWORD, compare it with reversed 'PRIN', and say if they are equal. If so, it will then compare fifth character with 'T'. If all conditions are met, DoPrint will be executed. In other cases, the processor will jump to the skip label.

You don't need to use assembler for this trick. The same can be done with C/C++:

if(('P' | ('R' << 8) | ('I' << 16) | ('N' << 24)) == *(long*)buf &&
    'T' == buf[4]) {
    DoPrint(buf+5);
}

Exactly the same nice and fast machine code will be generated. But such tricky code is hard to read and maintain. Let's make our work easier and write some macros.

Macros and Their Usage

#ifdef UNICODE
#define cmp2ch(s,a,b) (*(long*)(s) == \
    ((unsigned short)(a) | (unsigned short)(b) << 16))
#define cmp4ch(s,a,b,c,d) (*(long*)(s) == \
    ((unsigned short)(a) | (unsigned short)(b) << 16) && \
    *(long*)(s+2) == \
    ((unsigned short)(c) | (unsigned short)(d) << 16))
#define set2ch(s,a,b) (*(long*)(s) = \
    ((unsigned short)(a) | (unsigned short)(b) << 16));
#define set4ch(s,a,b,c,d) (*(long*)(s) = \
    ((unsigned short)(a) | (unsigned short)(b) << 16), \
    *(long*)(s+2) = \
    ((unsigned short)(c) | (unsigned short)(d) << 16));
#else
#define cmp2ch(s,a,b) (*(short*)(s) == \
    ((unsigned char)(a) | (unsigned char)(b) << 8))
#define cmp4ch(s,a,b,c,d) (*(long*)(s) == \
     ((unsigned char)(a) | (unsigned char)(b) << 8 |   \
     (unsigned char)(c) << 16 | (unsigned char)(d) << 24))
#define set2ch(s,a,b) (*(short*)(s) = \
    ((unsigned char)(a) | (unsigned char)(b) << 8));
#define set4ch(s,a,b,c,d) (*(long*)(s) = \
     ((unsigned char)(a) | (unsigned char)(b) << 8 |   \
     (unsigned char)(c) << 16 | (unsigned char)(d) << 24));
#endif

cmp2ch and cmp4ch macros compare 2 and 4 characters of a string. The following example shows how to use them:

if(cmp4ch(buf, 'P','R','I','N') && buf[4] == 'T')
    DoPrint(buf+5);

Another example:

if(cmp4ch(buf, 'Y','E','S','\0'))
    DoYes();

If the string is equal to "YES", DoYes() function is invoked.

Although macros seem to be quite complicated, they are compiled to very simple machine code. Note that all shifts and ORs are evaluated during compilation, so the resulting code consists of just a fast comparison (CMP and JNZ instructions).

Finally, let's talk about set2ch and set4ch. As their names say, these macros do fast assignment of 2 or 4 characters:

set4ch(buf, 'P','R','I','N'); // Equivalent of strcpy(buf, "PRINT"),

set2ch(buf+4, 'T','\0');      // but much shorter and faster

The same technique can be applied to other functions, e.g., to replace strcat():

TCHAR filename[MAX_PATH];
// Get path from control

int len = SendMessage(hwnd, WM_GETTEXT, 0, (LPARAM)filename);
if(*(p + len) != '\\')
   *(p + len) = '\\', len++;
// Append file name

set4ch(p + len, 'f', 'i', 'l', 'e');
set4ch(p + len + 4, '.', 't', 'x', 't');
*(p + len + 8) = '\0';
// Open the file

HANDLE hFile = CreateFile(filename, GENERIC_READ, FILE_SHARE_READ, ...);

Portability and Localization

These macros were tested on various platforms and compilers including MS VC++ 6.0, LCC 3.8, MinGW and GCC 3.2 under Linux. The tests compiled and run successfully, producing perfect and fast code. I have been using cmpXch macros in my freeware applications for one year and had no problems with them.

When compiled with #define UNICODE, the macros perform comparison of wide characters. So there should be no difficulties in maintaining Unicode and ANSI builds of your application. Macros also can be easily ported to Win64. (If anybody can test them on 64-bit processor, please post a message below.)

Localization seems to be much harder than porting. MS VC++ can properly compile this line with Russian characters in Unicode:

set4ch(str, TEXT('&#1064;'), TEXT('&#1080;'), TEXT('&#1083;'), TEXT('&#1086;'));

But other compilers generate a wrong code! LCC 3.8 can't properly convert a literal ASCII character to Unicode (not to mention it was unable to optimize shifts). MinGW also failed to select right Russian character when UNICODE was defined.

However, macros work fine with English language (in both Unicode and ASCII modes) and with any language using 8-bit ASCII. If you are building BASIC interpreter or parsing HTML tags, this should be enough for you. When you need Unicode support for non-English characters, you have to use MS VC++ compiler.

Case-insensitive search

How to perform case-insensitive search using this technique? One of the possible solutions is to compare all possible combinations of 2 first characters (e.g., PR, pr, Pr, pR for "PRINT"). If this comparison returns true, let's use strcmpi, lstrcmpi or CharUpper to check out the rest of our string:

if(cmp2ch(buf, 'P', 'R') || cmp2ch(buf, 'P', 'r') ||
   cmp2ch(buf, 'p', 'R') || cmp2ch(buf, 'p', 'r')) {
        char tmp[5];
        *(long*)tmp = *(long*)buf; // Fast copy 4 chars from buf to tmp

        tmp[4] = buf[4];           // or: memcpy(tmp, buf, 5 * sizeof(TCHAR));

        CharUpperBuff(tmp, 5); // Convert 5 chars to uppercase

        if(cmp4ch(tmp, 'P', 'R', 'I', 'N') && tmp[4] == 'T')
            DoPrint(buf+5);    // Compare converted characters

}

Another variation (ugly, but faster):

if((cmp2ch(buf, 'P', 'R') || cmp2ch(buf, 'P', 'r') ||
    cmp2ch(buf, 'p', 'R') || cmp2ch(buf, 'p', 'r')) &&
   (cmp2ch(buf+2, 'I', 'N') || cmp2ch(buf+2, 'I', 'n') ||
    cmp2ch(buf+2, 'i', 'N') || cmp2ch(buf+2, 'i', 'n'))
    && (buf[4] == 'T' || buf[4] == 't'))
        DoPrint(buf+5);

If you need to compare only English letters, you can use a very fast trick. Note that each capital letter has the fifth bit cleared, and corresponding small letters have the same bit set:

0100 0001 � A	0100 0010 � B	0100 0011 � C	0100 0100 � D
0110 0001 � a	0110 0010 � b	0110 0011 � c	0110 0100 � d

Let's clear this bit by ANDing letters with 1101 1111 = 0xDF. Thus we get only capital letters and can compare them with a constant:

if(('P' | ('R' << 8) | ('I' << 16) | ('N' << 24)) == (*(long*)buf & 0xDFDFDFDF)
    && (buf[4] & 0xDF) == 'T') {
        DoPrint(buf+5);
    }

Although this code is extremely fast and small, it works only for English letters. Adding digits or brackets will introduce bugs. For example, "{" will be converted to "[" and this code won't operate correctly:

if(('[' | ('A' << 8)) == (*(short*)buf & 0xDFDF)) {
        ReportError(); // Will call ReportError() for "[a", "{a", "[A", "{A"

    }

This can be rewritten as follows:

if(('[' | ('A' << 8)) == (*(short*)buf & 0xDFFF)) {
        ReportError(buf+5); // Will call ReportError() for both "[a" and "[A"

    }

This code converts the second letter by ANDing it with DF, but leave the first letter unchanged.

Buffer Length Warning

You should always check that string comparison doesn't exceed available buffer length. Avoid such kind of code:

char buf[256], *p = buf;
GetBuf(&buf, 256); // Read 256 characters from anywhere

do {
    if(cmp2ch(p, '\r', '\n'))
        DoSomething(p);
} while(*p++);

Remember, you are comparing the current character and the next character by using cmp2ch. If there are only 256 characters in string, the last iteration will try to compare 256 and 257 characters with "\r\n". So, you should add one extra character when allocating the buffer:

char buf[257], *p = buf;
GetBuf(&buf, 256); // Read 256 characters from anywhere

...

Or, if you know the string length, make only (length - 1) iterations:

char *buf = (char*)malloc(len), *p = buf;
GetBuf(&buf, len); // Read len characters from anywhere

len--; // Don't compare the last character

while(len > 0) {
    if(cmp2ch(p, '\r', '\n'))
        DoSomething(p);
    len--, p++;
}
free(buf);

Future

Of course, these macros can't be considered an elegant way to compare strings. But C standard library and CString-like classes are even worse. strcmp() and other string manipulation routines proved to be very large and non-effective, and they can significantly slow down your inner loops.

The problem resides in C language itself. There is no built-in string type, so the compiler can't optimize string-manipulation code. I hope some compiler writer will read this article and will think of a language with fast string processing. It would be great if any compiler could apply these tricks automatically.

History of changes

12-24-2004:

The first version.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here