Regular Expression based tools

Tibor Blazko

14 Dec 2002

A quick introduction to regular expression based tools such as sed and awk for Win32 platforms.

A Quick Note

This article points to some tools that use regular expressions (RE). Most of them are free, and many come with source code.

Introduction

This article follows the Introduction to RE and gives an overview of some RE-based tools.

Grep

Grep is used for simple text-file line-filtering.

This

grep "www\.codeproject\.com" *.htm

searches for www.codeproject.com in the .htm files of the current directory. The pattern in quotation marks is a regular expression, which is why the dots are escaped.

You can read more about grep at the grep-manual.
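
As a minimal sketch of why the dots are escaped, the following Unix-shell snippet feeds two made-up lines to grep and counts the matches. The sample text and the variable names are assumptions for illustration only; single quotes are used because that is the usual Unix quoting.

```shell
# Two hypothetical input lines; only the first contains the literal host name.
sample='<a href="http://www.codeproject.com/">CodeProject</a>
<a href="http://wwwXcodeprojectYcom/">not the same host</a>'

# Escaped dots match only a literal '.'.
escaped=$(printf '%s\n' "$sample" | grep -c 'www\.codeproject\.com')

# Unescaped dots match any single character, so both lines are counted.
unescaped=$(printf '%s\n' "$sample" | grep -c 'www.codeproject.com')

echo "escaped: $escaped, unescaped: $unescaped"
```

The unescaped pattern still finds the real address, it just accepts more than intended.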

Sed - The Stream Editor

Sed allows you to modify a stream of text. We'll demonstrate basic sed possibilities by filtering the output of dir. On my PC, dir displays this:

 Volume in drive C has no label
 Volume Serial Number is 2547-1C06
 Directory of C:\US\TIBOR!\unixtools

.                    <DIR>  12-21-01  8:08p .
..                   <DIR>  12-21-01  8:08p ..
README   TXT             0  12-23-01  7:01p readme.txt
RETOOLS  HTM         1,054  12-23-01  7:00p retools.htm
SUBMIT~1 COM           107  12-23-01  6:31p submit@codeproject.com
SED-FAQ  HTM       136,956  12-20-01  9:07a SED-FAQ.htm
         4 file(s)        138,117 bytes
         2 dir(s)     122,454,016 bytes free

We'll use sed to transform this output into what would be produced by

dir /b *.txt

which will look like this:

readme.txt

First we delete the lines starting with a space:

dir | sed "/^ /d"

(If you do not know what | means then jump to the pipe appendix)

The next step is deleting the empty lines:

dir | sed "/^ /d;/^$/d;"

Then we remove the characters up to the (long) filename by substituting them with nothing:

dir | sed "/^ /d;/^$/d;s/^.\{44\}//"

And finally we filter by *.txt (this could also be done as the penultimate step):

dir | sed "/^ /d;/^$/d;s/^.\{44\}//;/\.[tT][xX][tT]$/!d"
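
The steps above can be tried without a Windows dir command. The sketch below, written for a Unix shell, rebuilds a listing shaped like the sample and runs the same sed script on it. The helper name row and the fixed column widths are assumptions taken from the sample output, where the long filename starts after 44 characters.

```shell
# Print one listing row; the widths add up to 44 characters before the
# long filename (an assumption based on the sample listing).
row() { printf '%-8s %-3s%14s  %-8s  %5s %s\n' "$@"; }

listing=$(
  printf '%s\n' ' Volume in drive C has no label'
  printf '%s\n' ' Directory of C:\US\TIBOR!\unixtools'
  printf '\n'
  row .        ''  '<DIR>'   12-21-01 8:08p .
  row ..       ''  '<DIR>'   12-21-01 8:08p ..
  row README   TXT 0         12-23-01 7:01p readme.txt
  row RETOOLS  HTM 1,054     12-23-01 7:00p retools.htm
  row SED-FAQ  HTM 136,956   12-20-01 9:07a SED-FAQ.htm
  printf '%s\n' '         4 file(s)        138,117 bytes'
)

# Same script as in the article: drop lines starting with a space, drop
# empty lines, cut the first 44 characters, keep only names ending in .txt.
result=$(printf '%s\n' "$listing" |
  sed '/^ /d;/^$/d;s/^.\{44\}//;/\.[tT][xX][tT]$/!d')

echo "$result"
```

Only the readme.txt line survives all four commands.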

Each sed command (here separated by semicolons) can contain:

line specifier(s)
A few examples:
3 - processes third line
1,3 - first three lines
1,3! - all lines except 1, 2 and 3
/^A/,$ - from line starting with 'A' till last one

A command without a line address, like the substitution in our example, applies to all lines (1,$).

Sed has no way to address the n-th line from the end.

command
In our example we used d and s commands.

command modifiers
For example, the s command by default substitutes only the first occurrence found in the line. To replace all (non-overlapping) occurrences, use the g modifier:
"s/reg/sub/g"
If we are sure our filenames do not contain spaces, we can use this to make our example independent of where the long filename begins (usually at column 44, but this can vary) and delete all words that are followed by a space:
dir | sed "/^ /d;/^$/d;s/[^ ]* //g;/\.[tT][xX][tT]$/!d"
(Don't worry about runs of several spaces: [^ ]* also matches the empty string, so with the g modifier every remaining space is removed one by one. Many sed versions know nothing about the + or {,} meta-characters and treat them like ordinary characters; in basic regular expressions an interval is written \{m,n\}.)
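
The address forms and the g modifier described above can be checked on a few made-up lines. This is a sketch for a Unix shell; the input words and variable names are hypothetical, and only the d and s commands from the article are used.

```shell
lines=$(printf '%s\n' one two Apple four five)

# 3!d       - delete everything except the third line
third=$(printf '%s\n' "$lines" | sed '3!d')

# 1,3!d     - delete everything except lines 1 to 3
first_three=$(printf '%s\n' "$lines" | sed '1,3!d')

# 1,3d      - delete lines 1 to 3, keep the rest
rest=$(printf '%s\n' "$lines" | sed '1,3d')

# /^A/,$!d  - keep from the first line starting with 'A' to the last line
from_a=$(printf '%s\n' "$lines" | sed '/^A/,$!d')

# Without g only the first match in the line is replaced; with g all are.
once=$(printf '%s\n' 'a-b-c' | sed 's/-/+/')
all=$(printf '%s\n' 'a-b-c' | sed 's/-/+/g')

# The space-followed-words trick applied to one listing-like line.
name=$(printf '%s\n' 'README   TXT   0  12-23-01  7:01p readme.txt' |
  sed 's/[^ ]* //g')

echo "$third | $once | $all | $name"
```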

Sed can also fetch the contents of the next line or jump to another place in its script.

More can be found at the sed-faq.

Awk

Awk knows much more than sed. You can use function calls, variables etc. Here's a small example that prints the #includes found in C sources (ignoring possible leading spaces or comments):

awk "/^#include/ {printf(\"%%s %%s\n\", FILENAME, $2)}" *.c
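
To see what this prints, here is a sketch for a Unix shell that creates a throwaway C file and runs the same awk program on it. The file name demo.c and its contents are made up. In single quotes a plain %s is enough; the doubled %% in the command above is presumably there because it is meant to run from a .bat file, where %% stands for a single %.

```shell
# Work in a temporary directory so no real sources are touched.
workdir=$(mktemp -d)

printf '%s\n' \
  '#include <stdio.h>' \
  '#include "util.h"' \
  'int main(void) { return 0; }' > "$workdir/demo.c"

# For every line starting with #include, print the file name and the
# second whitespace-separated field (the header name).
includes=$(cd "$workdir" &&
  awk '/^#include/ { printf("%s %s\n", FILENAME, $2) }' *.c)

echo "$includes"
rm -rf "$workdir"
```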

Unless there is a reason not to, I personally would rather use perl, of which awk is a smaller brother. Generally we have:

grep < sed < awk < perl

in possibilities and

grep >= sed >= awk >= perl

in performance (for the same task, grep doesn't waste time on perl's syntax).

More can be found at the awk-faq.

Where to get them?

Thanks to markkuk, I am using the files from unxutils. Other links can be found in the FAQs mentioned above.

Be aware: the tools can differ in their possibilities, especially if you are using regular expression extensions. Some may also have their own handling of non-English letters.

Generally, a command's output can also depend on the system's regional settings. time and dir are good examples. That's why we didn't search for any text like 'Volume' in our example.

You can find many script examples. Some of them may not run well under some OSs because of the different path separators (Unix: /, MS: \). In Unix versions it is more common to use ' instead of ".

In many distributions you can find other tools that solve common administration script problems, from mailing a file to displaying a date for logging purposes. (Don't be surprised by the tool named 'date': running 'date' from a shell other than the tools' own (Unix-like) one will run the system's date command, which waits for user input. Just rename the tool or call it with 'start date'.)

And what about Microsoft?

Microsoft offers the findstr tool. If it is not present in your Windows setup, you may be able to get it by installing the appropriate Resource Kit.

Finally

Try to imagine having to write even these easy cases as their C/C++ equivalents. As you can see, there are tools that save you time in many common situations.

Appendices

The Pipe Command

dir | sed "/^ /d"

This is the so-called 'pipe'. Here we run the dir command and pipe (redirect) its (text) output to the sed command. Sed receives "/^ /d" as its parameter and, because no input file is specified, starts to read standard input (normally the keyboard). Through the pipe, that standard input is dir's output.

You can run

sed "/^ /d"

alone and it will repeat every line you type (and end with Enter), except those that start with a space. You can end it with Ctrl-Z or Ctrl-C.

The system contains the 'more' command, which displays one page of input text and waits for a key before displaying the next one. Again, without a parameter it reads from standard input. Because of this it is mostly used in a pipe:

dir | more

Redirection

dir > dir.txt

writes dir's output to the file dir.txt (if dir.txt already exists, it is overwritten)

dir >> dir.txt

appends dir's output to the file dir.txt (if dir.txt already exists, it is extended; if not, it is created and filled)

dir > nul

sends dir's output to 'nowhere'. (In Unix you can use /dev/null, which physically exists as a file; you can 'ls' it.) Once I typed something like 'mv d:\file nul' on NTFS and, by a mistake I could not reproduce, it created a 'd:\nul' file with 0 length. You could copy whatever you wanted to it and it all disappeared, but when I wanted to delete it, it failed with a 'can't remove device' error.

dir > more

writes dir's output to a file named more. It does not start the more command; for that, use the pipe (dir | more).

Similarly you can 'push' a file into a command:

more < dir.txt

If you have a file 'enter.txt' with one newline inside (two empty lines) and run

time < enter.txt

it will not wait for your keyboard input.

You can also chain these constructions:

dir /b | grep "\.[tT][xX][tT]$" | more

(Some programs write to the 'console' and not (only) to 'stdout', so redirecting them can 'fail'.)
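
For comparison, here is a sketch of the same operators in a Unix shell, where /dev/null plays the role of nul. The file name dir.txt is reused from the text, but its contents and the variable names are made up for illustration.

```shell
workdir=$(mktemp -d)
out="$workdir/dir.txt"

# '>' creates the file, or overwrites it if it already exists.
printf '%s\n' 'readme.txt' 'retools.htm' > "$out"

# '>>' appends to the file, creating it if necessary.
printf '%s\n' 'notes.TXT' >> "$out"

# Output sent to /dev/null is discarded.
printf '%s\n' 'discarded' > /dev/null

# '<' feeds the file to the command's standard input.
count=$(grep -c '.' < "$out")

# Redirection and pipes can be chained: filter *.txt, then strip the extension.
txt_names=$(grep '\.[tT][xX][tT]$' < "$out" | sed 's/\.[^.]*$//')

echo "$count lines; txt files: $txt_names"
rm -rf "$workdir"
```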

How to set an environment variable using sed

In this example we will show how to process a .bat file's input (%1). We assume it is a filename with a path, and by using sed we will remove the path, keeping only the filename.

The first problem is how to pass %1 to sed when sed works with files:

echo %1 > temp.file.1
sed -e "s/^.*\\\//" temp.file.1
del temp.file.1

(To be precise: some sed versions can read environment variables, mainly as input for their compatibility settings.)

The second problem is how to pass sed's output to the set command, because set works with input parameters only:

echo %1 > temp.file.1
sed -e "s/^.*\\\/set myvar=/" temp.file.1 > temp.file.2.bat
call temp.file.2.bat
del temp.file.2.bat
del temp.file.1

Using a pipe we can avoid creating temp.file.1:

echo %1 | sed -e "s/^.*\\\/set myvar=/" > temp.file.bat
call temp.file.bat
del temp.file.bat

If you are working on a read-only disc, try placing the temporary files in the %TEMP% directory:

echo %1 | sed -e "s/^.*\\\/set myvar=/" > %TEMP%\temp.file.bat
call %TEMP%\temp.file.bat
del %TEMP%\temp.file.bat

(And generally: avoid sharing the same temp file between more than one process.)
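
For readers who also work in a Unix shell, the same substitution can be tried there. This is only a sketch: the path in arg is a made-up stand-in for %1, and command substitution takes the place of the temporary .bat file, which is a different mechanism from the one described above.

```shell
# Hypothetical stand-in for the .bat parameter %1.
arg='C:\US\TIBOR!\unixtools\readme.txt'

# Remove everything up to and including the last backslash.
# printf is used instead of echo so the backslashes are passed unchanged.
myvar=$(printf '%s\n' "$arg" | sed 's/^.*\\//')

# The line that the article's script writes into the temporary .bat file.
generated=$(printf '%s\n' "$arg" | sed 's/^.*\\/set myvar=/')

echo "$myvar"
echo "$generated"
```

Because .* is greedy, the match runs to the last backslash, so only the filename remains.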

You can get into trouble with a .bat like this (simplified here):

set myvar = abc
set my
echo %myvar%

Surprisingly, echo gets no input (that's why it displays its own state), but the second set shows that the variable exists (in this example it displays all variables starting with 'my'). The whole problem is in the first line: it sets a variable named 'myvar ' (the last character is a space), and its value is ' abc' (with a leading space).

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
