Advanced String Processing

27 Jun 2010 | CPOL | 5 min read
An advanced article about string processing

Introduction

When writing software for Windows today, particularly commercial applications, one area can become a real bottleneck, both in execution speed and in the effort of implementation.

This is in the area of string processing!

I don't mean the simple string processing found in most applications. I mean processing extremely large amounts of string data (e.g. from files) quickly, while doing some complex parsing of that data.

String data does not mean just text characters either. It could be large amounts of data in binary format which needs to be treated as a large string.

Let's say that your normal programming language (the one you use all the time) doesn't handle some complex string manipulation well. You decide that you are willing to use another programming language to write a DLL, which can be called from your current programming language. You want speed too!

While there are a number of programming languages that may fit the bill, let's look at one of them that may offer you a far more extensive array of tools to produce exactly what you are looking for.

Powerbasic (see: http://powerbasic.com).

Never heard of it?

Well that's OK. I won't go into the long history of this compiler, but have you ever heard of Turbo Basic? To make a long story short, Powerbasic is the grandchild of the famous Turbo Basic, but a true 32-bit Windows compiler.

There are several reasons to seriously consider this compiler for writing DLLs that interface with applications written in your current programming language.

  1. It's BASIC, and a BASIC that follows the Microsoft standard for BASIC syntax. Anyone can use it.
  2. It is designed for optimal speed, and the people there (I don't work for Powerbasic) are Intel machine language experts who love to count CPU cycles.
  3. It has a true variable length native string data type which can hold huge amounts of data in a single string.

Note: The Powerbasic string data type uses the OLE engine in Windows for data storage, rather than null terminated strings like other languages. This allows you to store any byte value (including zero) in the string, and it is truly variable length. Powerbasic also provides other useful string formats, such as fixed length strings and AsciiZ (null terminated) strings.
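
To make the note above concrete, here is a minimal sketch of a string that carries zero bytes. It assumes the Windows (PB/Win) compiler, where MSGBOX is available for output; the variable name Buffer is just for illustration.

```powerbasic
#COMPILE EXE
#DIM ALL

FUNCTION PBMAIN () AS LONG
    LOCAL Buffer AS STRING

    ' Build a 5-byte string that contains two zero bytes in the middle
    Buffer = "AB" + CHR$(0) + CHR$(0) + "C"

    ' LEN reports the full length, not the length up to the first zero
    MSGBOX "Length =" + STR$(LEN(Buffer))                             ' Length = 5

    ' INSTR can search for a zero byte like any other character
    MSGBOX "First zero at position" + STR$(INSTR(Buffer, CHR$(0)))    ' position 3
END FUNCTION
```

Only the numbers are displayed here, because a message box itself would stop drawing the text at the first zero byte.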

Now this is where things get interesting. I have over ten years of practical experience with this compiler, so I know what it can do when it comes to string handling.

Here is a list of string functions not often found in other languages. Some may not exist in any compiler other than Powerbasic (which is why it solves problems).

Of course, it supports the standard string functions like ASC, MID$, SPACE$, LEFT$, RIGHT$, INSTR, TRIM$, LTRIM$, RTRIM$.

But it's the more advanced string functions which really get used.

So let's say you have this compiler and want to dig into working with strings. Where do you start?

The following are my favorites and the ones I use a lot:

ARRAY SCAN 

It is used like this:

ARRAY SCAN MyData$(1), = SomeString$, TO Match&

(There are a number of other useful ARRAY commands for strings like ARRAY SORT which are also very fast.)

This command searches an array of strings for a matching string, and it is fast. I mean fast.
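
A complete, minimal sketch of ARRAY SCAN in use follows. The array name and its contents are made up for illustration, and output again assumes PB/Win's MSGBOX. As I understand the command, the result variable receives the position of the first match relative to the starting element, or 0 when nothing matches.

```powerbasic
#COMPILE EXE
#DIM ALL

FUNCTION PBMAIN () AS LONG
    LOCAL Match AS LONG
    DIM Names(1 TO 4) AS STRING     ' hypothetical sample data

    Names(1) = "alpha"
    Names(2) = "beta"
    Names(3) = "gamma"
    Names(4) = "delta"

    ' Scan from element 1 for an exact (case sensitive) match
    ARRAY SCAN Names(1), = "gamma", TO Match

    IF Match = 0 THEN
        MSGBOX "Not found"
    ELSE
        MSGBOX "Found at relative position" + STR$(Match)    ' position 3
    END IF
END FUNCTION
```

Because the position is relative to the element where the scan starts, it only equals the array subscript when the scan starts at subscript 1, as it does here.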

PEEK$ 

This function reads a block of bytes starting at a memory address and returns it as a string.
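
As a small sketch of PEEK$, the code below copies part of one string into another by address. The sample text and variable names are assumptions for illustration; in real code the address would more likely come from an API call or a memory block handed to your DLL.

```powerbasic
#COMPILE EXE
#DIM ALL

FUNCTION PBMAIN () AS LONG
    LOCAL Source AS STRING
    LOCAL Chunk  AS STRING

    Source = "HEADER:payload"

    ' STRPTR returns the address of the first byte of the string data.
    ' Skip the 7-byte "HEADER:" prefix and read the next 7 bytes.
    Chunk = PEEK$(STRPTR(Source) + 7, 7)

    MSGBOX Chunk        ' payload
END FUNCTION
```

PEEK$ does no bounds checking, so the address and byte count must stay inside memory you own.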

PARSE$ 

I use this one all the time, and it is very powerful for breaking up strings into variable length records. For example, let's say you use a comma (,) for separating records. You can go through a string like this:

CT&=PARSECOUNT(BigString$, ",")
FOR I&=1 TO CT&
    SmallString$=PARSE$(BigString$,",", I&)
NEXT I&

What could be simpler? And it is fast!
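
One caveat worth sketching: as far as I know, PARSE$ has to count delimiters from the start of the string on every call, so the loop above can slow down when a string holds many thousands of fields. Recent compiler versions (PB/Win 8 and later, if I recall correctly) also have a PARSE statement that splits the whole string into an array in one pass. The sketch below assumes that statement is available; the sample data is made up.

```powerbasic
#COMPILE EXE
#DIM ALL

FUNCTION PBMAIN () AS LONG
    LOCAL BigString AS STRING
    LOCAL Count     AS LONG

    BigString = "red,green,blue"        ' hypothetical sample data

    ' Size the array to the number of comma-delimited fields
    Count = PARSECOUNT(BigString, ",")
    DIM Fields(1 TO Count) AS STRING

    ' Split every field into the array in a single statement
    PARSE BigString, Fields(), ","

    MSGBOX Fields(2)                    ' green
END FUNCTION
```

For short strings the PARSE$ loop is perfectly fine; the array form is only worth it when the field count gets large.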

REMOVE$ 

This one I use a lot.

For example, let's say you have a file which could use either carriage returns or carriage return/linefeed pairs for its line endings. You don't know which the file will use, but you need to parse both. How would you do it?

Like this:

BigString$ = REMOVE$(BigString$, CHR$(10))     ' remove line feeds
CT&=PARSECOUNT(BigString$, CHR$(13))
FOR I&=1 TO CT&
    SmallString$=PARSE$(BigString$,CHR$(13), I&)
NEXT I&

REPLACE

This command is also a favorite. Now let's say that you have a strange file format which uses an unusual end of line character. You want a carriage return instead. It's simple. It's done like this:

REPLACE StrangeCharacter$ WITH CHR$(13) IN BigString$

Now you can move through the string and parse out the data.
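
REPLACE also covers a case the REMOVE$ approach does not: a file that uses a bare linefeed as its line ending, where removing every linefeed would glue all the lines together. A minimal sketch, with made-up sample data that mixes all three conventions:

```powerbasic
#COMPILE EXE
#DIM ALL

FUNCTION PBMAIN () AS LONG
    LOCAL BigString AS STRING

    ' Mixed line endings: CR+LF, bare LF, bare CR
    BigString = "one" + CHR$(13) + CHR$(10) + "two" + CHR$(10) + _
                "three" + CHR$(13) + "four"

    ' First collapse CR+LF pairs into a single CR ...
    REPLACE CHR$(13) + CHR$(10) WITH CHR$(13) IN BigString

    ' ... then turn any remaining bare LF into CR
    REPLACE CHR$(10) WITH CHR$(13) IN BigString

    MSGBOX "Lines:" + STR$(PARSECOUNT(BigString, CHR$(13)))    ' Lines: 4
END FUNCTION
```

The order matters: replacing the bare linefeeds first would turn every CR+LF pair into two carriage returns and produce empty lines.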

This is not an exhaustive list of the string functions and commands in Powerbasic, but suffice it to say that it is one of the richest languages when it comes to string processing.

Now this is where things get very interesting.

Despite the compiler's speed, I still find at times that things are not fast enough. I need optimized speed, but I am not a machine language (or assembler) programmer. So what do I do?

Pointers!

Yes, you can work with pointers even within string data in variable length strings.

You can treat the data in the string as bytes (or ASCII characters) and move through the string at lightning fast speeds using pointers like this:

LOCAL B AS BYTE PTR
L&=LEN(BigString$)
B=STRPTR(BigString$)  ' get a pointer to start of string
FOR I&=1 TO L&
    Test&=@B      ' access data via pointer as a byte
    INCR B       ' increment pointer 1 byte
NEXT I&
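
Here is the same loop doing some actual work, as a self-contained sketch: it counts the carriage returns in a string. The sample data and variable names are assumptions for illustration.

```powerbasic
#COMPILE EXE
#DIM ALL

FUNCTION PBMAIN () AS LONG
    LOCAL BigString AS STRING
    LOCAL B         AS BYTE PTR
    LOCAL I         AS LONG
    LOCAL Count     AS LONG

    BigString = "a" + CHR$(13) + "b" + CHR$(13) + "c"

    B = STRPTR(BigString)           ' pointer to the first byte of the data

    FOR I = 1 TO LEN(BigString)
        IF @B = 13 THEN INCR Count  ' compare the byte under the pointer
        INCR B                      ' advance the pointer by one byte
    NEXT I

    MSGBOX "Carriage returns:" + STR$(Count)    ' Carriage returns: 2
END FUNCTION
```

One thing to keep in mind: the pointer should be treated as valid only while the string is left unchanged. Assigning to the string can move its data, so call STRPTR again after any modification.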

Now you are not just limited to accessing the string as bytes only.

Let's say you have a string which holds 1000 Floating point numbers (SINGLE). Each number (binary) takes up four bytes. I can move through the string as if it were actually binary floating point numbers like this:

LOCAL S AS SINGLE PTR
L&=LEN(BigString$)\4 ' four bytes per Single floating point number (integer division)
S=STRPTR(BigString$)  ' get a pointer to start of string
FOR I&=1 TO L&
    Test!=@S      ' access data via pointer as a Single
    INCR S       ' increment pointer 4 bytes
NEXT I&

Now you can even treat one big string as multiple data types using pointers.
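
As a sketch of mixing data types, assume a made-up layout: a 4-byte LONG holding a count, followed by that many SINGLE values. MKL$ and MKS$ are used only to build the sample data; in practice the string would come from a file or another program.

```powerbasic
#COMPILE EXE
#DIM ALL

FUNCTION PBMAIN () AS LONG
    LOCAL Packet AS STRING
    LOCAL pCount AS LONG PTR
    LOCAL pValue AS SINGLE PTR
    LOCAL I      AS LONG
    LOCAL Total  AS SINGLE

    ' Hypothetical layout: LONG count, then that many SINGLEs
    Packet = MKL$(3) + MKS$(1.5) + MKS$(2.5) + MKS$(4)

    pCount = STRPTR(Packet)         ' first 4 bytes viewed as a LONG
    pValue = STRPTR(Packet) + 4     ' the SINGLEs start after the count

    FOR I = 1 TO @pCount
        Total = Total + @pValue     ' read 4 bytes as a SINGLE
        INCR pValue                 ' advance to the next SINGLE
    NEXT I

    MSGBOX "Sum =" + STR$(Total)    ' Sum = 8
END FUNCTION
```

Two pointers of different types look at the same string, and neither one copies any data.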

In the rare case where you really need maximum speed, this compiler even lets you use inline assembler. If your company has any assembler experts, you can get them to write the speed critical code while you write the rest.

The richness of this compiler's string command set, plus the ability to handle data via pointers, makes it a powerful tool when you need optimal string handling in your applications. The beauty of this is that you don't have to switch programming languages. Just write your speed critical string handling code using Powerbasic, compile it to a DLL and then call the DLL in the speed critical areas.
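
A minimal sketch of what such a DLL might look like is shown below. The function name CountByte and its parameters are hypothetical, chosen only for illustration. The caller passes a plain address and a length rather than a Powerbasic string, which keeps the interface usable from languages that know nothing about OLE strings.

```powerbasic
#COMPILE DLL
#DIM ALL

' Hypothetical export: counts how many bytes in a caller-supplied
' buffer are equal to Target (0 to 255).
FUNCTION CountByte ALIAS "CountByte" (BYVAL pData AS BYTE PTR, _
                                      BYVAL nBytes AS LONG, _
                                      BYVAL Target AS LONG) EXPORT AS LONG
    LOCAL I     AS LONG
    LOCAL Count AS LONG

    ' A null pointer returns 0 (the default function result)
    IF pData = 0 THEN EXIT FUNCTION

    FOR I = 1 TO nBytes
        IF @pData = Target THEN INCR Count
        INCR pData
    NEXT I

    FUNCTION = Count
END FUNCTION
```

The ALIAS clause fixes the capitalization of the exported name. To my knowledge the compiler uses the standard (stdcall) calling convention by default, so the declaration on the calling side would need to match that.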

Will It Make That Much of a Difference?

You won't know until you try it, but I have read reports from programmers who saw speed increases of tenfold or more simply by using this compiler for speed critical string code.

History

  • 27th June, 2010: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)