A Quick Note
This article links to some tools that use regular expressions (RE). Most of
them are free, many with sources included.
Introduction
This article follows the Introduction to RE and gives an
overview of some RE-based tools.
Grep
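Grep is used for simple line filtering of text files.
The command
grep "www\.codeproject\.com" *.htm
searches the .htm files in the current directory for lines that contain www.codeproject.com. The pattern in quotation
marks is a regular expression; the backslashes make each dot match a literal dot.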
You can read more about grep at the grep-manual.
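To see what the escaped dots buy us, here is a minimal sketch for a Unix-style shell (single quotes instead of double ones). The files a.htm and b.htm and their contents are made up for the illustration. An unescaped dot matches any character, so the second pattern also reports a file that only resembles the address.

```shell
# Scratch directory with two hypothetical sample files.
work=$(mktemp -d)
printf '%s\n' '<a href="http://www.codeproject.com">CP</a>' > "$work/a.htm"
printf '%s\n' 'wwwXcodeprojectYcom is not an address' > "$work/b.htm"

# -l lists only the names of files that contain at least one matching line.
escaped=$(cd "$work" && grep -l 'www\.codeproject\.com' *.htm)
unescaped=$(cd "$work" && grep -l 'www.codeproject.com' *.htm)

echo "escaped dots:"
echo "$escaped"
echo "unescaped dots:"
echo "$unescaped"
rm -r "$work"
```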
Sed - The Stream Editor
Sed allows you to modify a stream of text. We'll demonstrate basic sed
features by filtering the output of dir. On my PC it displays
this:
Volume in drive C has no label
Volume Serial Number is 2547-1C06
Directory of C:\US\TIBOR!\unixtools
. <DIR> 12-21-01 8:08p .
.. <DIR> 12-21-01 8:08p ..
README TXT 0 12-23-01 7:01p readme.txt
RETOOLS HTM 1,054 12-23-01 7:00p retools.htm
SUBMIT~1 COM 107 12-23-01 6:31p submit@codeproject.com
SED-FAQ HTM 136,956 12-20-01 9:07a SED-FAQ.htm
4 file(s) 138,117 bytes
2 dir(s) 122,454,016 bytes free
We'll use sed to turn this output into what
dir /b *.txt
would print, which looks like this:
readme.txt
First we delete the lines that start with a space:
dir | sed "/^ /d"
(If you do not know what | means then jump to the pipe appendix)
The next step is deleting the empty lines:
dir | sed "/^ /d;/^$/d;"
Then we remove the characters before the (long) filename by substituting
nothing for them (^.\{44\} matches the first 44 characters of a line):
dir | sed "/^ /d;/^$/d;s/^.\{44\}//"
And finally we filter by *.txt (this could also be done as the penultimate step):
dir | sed "/^ /d;/^$/d;s/^.\{44\}//;/\.[tT][xX][tT]$/!d"
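The finished command can be tried without a Windows console. The sketch below, for a Unix-style shell, feeds sed a hand-made imitation of the dir listing. The sample lines, and the assumption that the long filename starts right after column 44, are made up for the illustration and are not real dir output.

```shell
# Imitation of dir output: header, empty line, two entries, footer.
# printf '%-44s' pads the short-name part to 44 characters (an assumed
# layout), so the long filename starts right after it.
listing=$(
  printf ' Volume in drive C has no label\n\n'
  printf '%-44s%s\n' 'README   TXT         0  12-23-01  7:01p' 'readme.txt'
  printf '%-44s%s\n' 'RETOOLS  HTM     1,054  12-23-01  7:00p' 'retools.htm'
  printf '         4 file(s)        138,117 bytes\n'
)

# The same four sed commands, single-quoted for a Unix shell.
names=$(printf '%s\n' "$listing" |
  sed '/^ /d;/^$/d;s/^.\{44\}//;/\.[tT][xX][tT]$/!d')

echo "$names"
```

Only the entry whose long filename ends in .txt survives all four commands.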
Each sed command (here separated by semicolons) can contain:
- line specifier(s)
A few examples:
3 - processes the third line
1,3 - the first three lines
1,3! - all lines except 1, 2 and 3
/^A/,$ - from the first line starting with 'A' to the last one
No line address, as in our substitute example, means all lines (1,$).
Sed has no way to address the n-th line from the end.
- command modifiers
For example, the s command by default substitutes only the first occurrence found in the line. To replace all (non-overlapping) occurrences, use the g
modifier:
"s/reg/sub/g"
If we are sure our filenames do not contain spaces, we can use the g modifier to make our example independent of where the long filename begins (usually at column 44, but this can vary): we delete every word that is followed by a space:
dir | sed "/^ /d;/^$/d;s/[^ ]* //g;/\.[tT][xX][tT]$/!d"
(Don't worry about runs of several spaces: [^ ]* also matches an empty word, so each extra space is deleted as well. Many sed versions know nothing about the + or {,} meta-characters and treat them like ordinary characters.)
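The line specifiers and the g modifier can be tried on a few made-up lines. This is a minimal sketch for a Unix-style shell; the sample text and variable names are only illustrative.

```shell
lines=$(printf '%s\n' one two Apple four five)

# -n suppresses automatic printing; p prints only the addressed lines.
third=$(printf '%s\n' "$lines" | sed -n '3p')
first_three=$(printf '%s\n' "$lines" | sed -n '1,3p')
all_but_first_three=$(printf '%s\n' "$lines" | sed -n '1,3!p')
from_a_to_end=$(printf '%s\n' "$lines" | sed -n '/^A/,$p')

# Without g only the first occurrence in the line is replaced.
first_only=$(echo 'reg reg reg' | sed 's/reg/sub/')
every_one=$(echo 'reg reg reg' | sed 's/reg/sub/g')

# Deleting every space-followed word leaves only the last word of the line.
last_word=$(echo 'README   TXT   0  12-23-01  7:01p readme.txt' | sed 's/[^ ]* //g')

printf '%s\n' "$third" "$first_only" "$every_one" "$last_word"
```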
Sed can also fetch the next input line or jump (branch) within a script.
More can be found at the sed-faq.
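As a minimal sketch of fetching the next line and of jumping (for a Unix-style shell, with made-up input): the N command appends the next input line to the current one, and t jumps back to a label after a successful substitution. The label name a is arbitrary.

```shell
# Join lines in pairs: N appends the next line, s replaces the newline between them.
# $!N skips N on the last line, where there is no next line to fetch.
pairs=$(printf '%s\n' a b c d e | sed '$!N;s/\n/ /')

# Join all lines: after each successful substitution, t jumps back to label a.
joined=$(printf '%s\n' a b c d e | sed -e ':a' -e '$!N;s/\n/ /;ta')

echo "$pairs"
echo "$joined"
```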
Awk
Awk knows much more than sed. You can use function calls, variables etc.
Here's a small example that prints the #includes found in C sources (not
thinking about possible spaces or comments):
awk "/^#include/ {printf(\"%%s %%s\n\", FILENAME, $2)}" *.c
(The doubled %% is meant for use inside a .bat file; at the command prompt type a single %.)
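For comparison, here is a sketch of the same awk program in a Unix-style shell, where single quotes make the \" and %% escapes unnecessary. The file main.c and its contents are hypothetical.

```shell
# Scratch directory with one made-up C source file.
work=$(mktemp -d)
printf '%s\n' '#include <stdio.h>' 'int main(void) { return 0; }' > "$work/main.c"

# For every line starting with #include, print the file name and the second field.
includes=$(cd "$work" && awk '/^#include/ {printf("%s %s\n", FILENAME, $2)}' *.c)

echo "$includes"
rm -r "$work"
```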
Unless there is a reason not to, I personally would rather use perl, of which awk is a smaller
brother. Generally we have:
grep < sed < awk < perl
in capabilities and
grep >= sed >= awk >= perl
in performance (a task that grep can handle doesn't waste time on parsing
perl syntax).
More can be found at the awk-faq.
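To illustrate the overlap in capabilities, the following sketch runs one simple filter through grep, sed and awk in a Unix-style shell. The file names in the sample input are made up; all three commands should select the same lines.

```shell
names=$(printf '%s\n' readme.txt retools.htm NOTES.TXT)

# Keep only names ending in .txt, in any letter case.
with_grep=$(printf '%s\n' "$names" | grep '\.[tT][xX][tT]$')
with_sed=$(printf '%s\n' "$names" | sed -n '/\.[tT][xX][tT]$/p')
with_awk=$(printf '%s\n' "$names" | awk '/\.[tT][xX][tT]$/')

echo "$with_grep"
```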
Where to get them?
I am, thanks to markkuk, using files from unxutils. Other links can be found
in the FAQs linked above.
Be aware: the tools can differ in capabilities, especially if you are using regular
expression extensions. Some may also have their own handling of
non-English letters.
In general, a command's output can also depend on the system's regional settings; time
and dir are good examples. That's why we didn't search for
any text like 'Volume' in our example.
You can find many script examples. Some of them may not run well under some OSs
because of different path separators (Unix: /, MS: \). In Unix versions it is
more common to use ' instead of ".
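The preference for ' on Unix has a practical reason: inside double quotes a Unix shell expands $ variables before the tool sees the script. A minimal sketch, with a made-up variable x:

```shell
x=cat   # a shell variable; the sample text below contains the characters $x

# Single quotes: sed receives s/$x/dog/ unchanged and replaces the literal text $x.
single=$(echo 'cat $x' | sed 's/$x/dog/')

# Double quotes: the shell expands $x first, so sed receives s/cat/dog/.
double=$(echo 'cat $x' | sed "s/$x/dog/")

echo "$single"
echo "$double"
```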
In many distributions you can find other tools that solve common administration
scripting problems, from mailing a file to displaying a date for logging purposes.
(Do not be surprised by the tool named 'date': running 'date' in a shell other than its own (Unix-like)
shell will run the system's date command (which waits for user input). Just rename the tool
or call it with 'start date'.)
And what about Microsoft?
Microsoft offers the findstr tool. If it is not present in your
Windows setup, you may be able to get it by installing the appropriate Resource
Kit.
Finally
Try to imagine having to write these simple cases as their C/C++ equivalents.
You can see that these tools save you time in many common situations.
Appendices
The Pipe Command
dir | sed "/^ /d"
This is the so-called 'pipe'. Here we run the dir command and
pipe (redirect) its (text) output to the sed command. Sed receives "/^ /d" as
its input parameter and (because it is designed that way) knows that no input file
was specified, so it starts to read standard input (normally entered from the
keyboard). Through our pipe it gets dir's output instead.
You can run
sed "/^ /d"
alone and it will repeat every line you type (and end with Enter)
except those that start with a space. You can end it with Ctrl-Z or Ctrl-C.
Systems include the 'more' command, which displays one page of
input text and waits for a key before displaying the next one. Again: without a parameter it
reads from standard input. Because of this it is most often used in a
pipe:
dir | more
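That sed treats piped input and a named input file the same way can be checked with a small sketch for a Unix-style shell. The temporary file and its two lines are made up for the illustration.

```shell
# A made-up input file with one indented and one normal line.
input=$(mktemp)
printf '%s\n' ' indented' 'kept' > "$input"

# No file argument: sed reads standard input, which the pipe supplies.
from_pipe=$(cat "$input" | sed '/^ /d')

# With a file argument sed reads the file itself.
from_file=$(sed '/^ /d' "$input")

rm "$input"
echo "$from_pipe"
echo "$from_file"
```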
Redirection
dir > dir.txt
writes dir's output to the file dir.txt (if dir.txt already exists,
its contents are completely replaced).
dir >> dir.txt
appends dir's output to the file dir.txt (if dir.txt already
exists it is extended; if not, it is created and filled).
dir > nul
sends dir's output to 'nowhere'. (In Unix you can use
/dev/null, which physically exists as a file: you can 'ls' it. Once
I typed something like 'mv d:\file nul' on NTFS and it (by a
mistake I could not repeat) created a zero-length 'd:\nul' file, to which it was
possible to copy whatever you wanted and make it all disappear. But when I wanted
to delete it, it failed with the error message 'can't remove device'.)
dir > more
writes dir's output to a file named more; it does not start the more command (use a pipe for that).
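The difference between >, >> and redirecting to 'nowhere' can be sketched in a Unix-style shell, with /dev/null standing in for nul. The temporary file is created only for this illustration.

```shell
out=$(mktemp)

echo first  > "$out"     # the file now holds: first
echo second > "$out"     # > replaces the previous contents
echo third >> "$out"     # >> appends to what is already there
content=$(cat "$out")
rm "$out"

# Output sent to /dev/null simply disappears, so nothing is captured here.
discarded=$(echo hello > /dev/null)

echo "$content"
```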
Similarly, you can feed a file to a command as its input:
more < dir.txt
If you have a file 'enter.txt' containing a single newline (an editor shows it as two empty
lines) and run
time < enter.txt
it will not wait for your keyboard input.
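The same trick works in a Unix-style shell. In this sketch the shell's read command, which normally waits for a typed line, takes its answer from a prepared file instead; the file and its content are made up.

```shell
# Prepare the "typed" answer in advance.
answers=$(mktemp)
printf 'yes\n' > "$answers"

# With < the line comes from the file, so read does not wait for the keyboard.
read -r reply < "$answers"

rm "$answers"
echo "$reply"
```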
You can also chain these constructions:
dir /b | grep "\.[tT][xX][tT]$" | more
(Some programs write to the 'console' rather than (only) to 'stdout', so
redirecting them 'can fail'.)
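A chain can be as long as needed. As a sketch for a Unix-style shell, with made-up file names as input, each stage below reads the previous stage's output:

```shell
base_names=$(printf '%s\n' readme.txt retools.htm notes.TXT |
  grep '\.[tT][xX][tT]$' |        # keep only the .txt names
  sed 's/\.[^.]*$//' |            # strip the extension
  tr '[:lower:]' '[:upper:]')     # convert to upper case

echo "$base_names"
```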
How to set an environment variable using sed
In this example we will show how to transform a .bat file's input parameter (%1). We
assume it will be a filename with a path, and by using sed we will remove the
path, keeping the filename only.
The first problem is how to pass %1 to sed when sed works with
files:
echo %1 > temp.file.1
sed -e "s/^.*\\\//" temp.file.1
del temp.file.1
(To be honest: some sed versions can use environment variables,
especially as input for their compatibility settings.)
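The substitution itself can be tried in a Unix-style shell. This is only a sketch: the path is a made-up example, and with single quotes the pattern is written ^.*\\ (everything up to the last backslash), since quoting rules differ from the .bat version above.

```shell
# A hypothetical Windows-style path; single quotes keep the backslashes intact.
path='C:\US\unixtools\readme.txt'

# .* is greedy, so ^.*\\ matches up to and including the last backslash.
name=$(printf '%s\n' "$path" | sed 's/^.*\\//')

echo "$name"
```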
The second problem is how to pass sed's output to the set command,
because set works with input parameters only:
echo %1 > temp.file.1
sed -e "s/^.*\\\/set myvar=/" temp.file.1 > temp.file.2.bat
call temp.file.2.bat
del temp.file.2.bat
del temp.file.1
Using a pipe saves creating temp.file.1:
echo %1 | sed -e "s/^.*\\\/set myvar=/" > temp.file.bat
call temp.file.bat
del temp.file.bat
If you are working on a read-only disk, try placing the temporary files in the
%TEMP% directory:
echo %1 | sed -e "s/^.*\\\/set myvar=/" > %TEMP%\temp.file.bat
call %TEMP%\temp.file.bat
del %TEMP%\temp.file.bat
(And generally: avoid sharing the same temp file between more than one
process.)
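The same generate-and-run idea, including a temp file that no other process shares, can be sketched in a Unix-style shell. Here mktemp picks a unique file name and the dot command plays the role of call. The path is a made-up example, and the sketch assumes the filename contains no spaces or other characters special to the shell.

```shell
path='C:\US\unixtools\readme.txt'   # hypothetical input, standing in for %1

# mktemp creates a uniquely named file, so two runs never share it.
script=$(mktemp)

# Turn the path into a one-line script: myvar=readme.txt
printf '%s\n' "$path" | sed 's/^.*\\/myvar=/' > "$script"

. "$script"      # run the generated line in the current shell (like call)
rm "$script"

echo "$myvar"
```

In a Unix shell the temporary file is not really needed, because command substitution can assign the output directly; the sketch keeps it only to mirror the .bat approach.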
You can get into trouble with this (simplified) .bat:
set myvar = abc
set my
echo %myvar%
Surprisingly, echo gets no input (that's why it displays its own state), but the
second set shows that the variable exists (in this example it displays all
variables starting with 'my'). The whole problem is in the first line, which actually sets the variable
%myvar %
(the last character of the name is a space), and its value is ' abc' (with a leading space).