This document describes the theory behind regular expressions
(RE) as well as their practical usage.
Table of Contents
- What are Regular Expressions?
- Why would you use Regular Expressions
- Where to use Regular Expressions
- How to use the Regular Expression Syntax Basics
- Regular Expression's Syntax Basics
- More Examples
- Summary
- Literature Resources
Regular expressions are a way to search for substrings ("matches")
in strings. This is done by searching with "patterns" through the
string.
You probably know the '*' and '?' characters used in the dir command on the
DOS command line. The '*' character means "zero or more arbitrary characters"
and the '?' means "one arbitrary character".
When using a pattern like "text?.*", it will find files like
textf.txt
text1.asp
text9.html
But it will not find files like
text.txt
text.asp
text.html
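This is the basic idea behind REs. While '*' and '?' are a very limited way of
pattern matching, REs supply a much broader spectrum for describing patterns.
Note that inside a RE the characters '*' and '?' have a different meaning than
on the DOS command line (see the quantifiers below).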
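As a rough sketch of how the wildcard idea carries over, the DOS pattern "text?.*" can be written as the RE ^text.\..*$, where '.' stands for "one arbitrary character" and '.*' for "zero or more arbitrary characters". The translation is my own illustration and only approximates what the dir command does.

```perl
use strict;
use warnings;

# File names taken from the two lists above.
my @files = qw(textf.txt text1.asp text9.html text.txt text.asp text.html);

# Rough RE counterpart of the wildcard "text?.*":
#   ^text   the name starts with "text"
#   .       exactly one arbitrary character (the wildcard "?")
#   \.      a literal dot
#   .*      zero or more arbitrary characters (the wildcard "*")
my @found = grep { m/^text.\..*$/ } @files;

print "$_\n" for @found;
```

Only the first three names are printed, because "text.txt" and its siblings have no character between "text" and the dot.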
Example usages could be:
- Remove all occurrences of a specific tag from an HTML file
- Check whether an e-mail address is well-formed
Basically you can do the following operations on a string with REs:
- Test for a pattern
I.e. search through a string and check whether a pattern matches a substring,
returning true or false.
- Extract a substring
I.e. search for a substring and return that substring.
- Replace a substring
I.e. search for a substring that matches a pattern and replace it by another
string.
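The three operations can be sketched in Perl, the language used for the examples later in this document. The input line and the variable names are made up for this illustration, and the e-mail pattern is deliberately naive (it does not check that an address is well-formed).

```perl
use strict;
use warnings;

# Hypothetical input line, invented for this illustration.
my $line = 'Contact: alice@example.com (work)';

# 1. Test for a pattern: is there an "@" anywhere in the string?
my $has_at = ($line =~ m/\@/) ? 1 : 0;

# 2. Extract a substring: a match in list context returns the captured part.
my ($address) = $line =~ m/(\S+\@\S+)/;

# 3. Replace a substring: work on a copy so the original stays untouched.
(my $masked = $line) =~ s/\S+\@\S+/<hidden>/;

print "has_at  = $has_at\n";
print "address = $address\n";
print "masked  = $masked\n";
```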
REs are one of the foundations of the Perl programming language and are
therefore built into the language itself. Many other languages can use REs
through third-party libraries or add-ons.
Following are some other languages for which RE libraries exist:
Although slightly different to use (because of the design of each language),
all are quite similar to Perl's implementation of REs. Therefore I use Perl code
snippets in this document to describe examples.
The RE syntax is not completely standardized. POSIX defines a standard RE
syntax, but Perl's RE implementation is much more flexible than POSIX's, so a
library that is as Perl-compatible as possible is normally what you want.
The syntax itself can differ between languages. E.g. one library implements only
a subset of the POSIX RE syntax, while another implements nearly all of the
Perl RE syntax.
As stated, I do all examples in Perl. Therefore, here is a quick overview of the
most common ways to execute a regular expression in Perl.
expression =~ m/pattern/[switches]
Searches the string expression for the occurrence(s) of a substring that matches
'pattern' and returns the recognized subexpressions ($1, $2, $3, ...). "m" stands
for "match".
For example:
$test = "this is just one test";
$test =~ m/(o.e)/;
Would return "one" in $1.
expression =~ s/pattern/new text/[switches]
Searches the string "expression" for the occurrence(s) of a substring that
matches 'pattern' and replaces the found substrings with "new text". "s" stands
for "substitute".
For example:
$test = "this is just one test";
$test =~ s/one/my/;
Would replace "one" by "my", resulting in the string "this is just my test",
stored in $test.
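Both operators can be tried out in a short script. The sketch relies on two details of Perl that are not described above: a match in list context returns the captured substrings directly, and s/// returns the number of replacements it made. The g switch ("global", replace every occurrence) serves as one example of the optional switches.

```perl
use strict;
use warnings;

my $test = "this is just one test";

# A match in list context returns the captured substrings directly.
my ($found) = $test =~ m/(o.e)/;

# Without a switch, s/// replaces only the first occurrence.
(my $first_only = $test) =~ s/t/T/;

# With the "g" switch it replaces every occurrence and returns the count.
my $all   = $test;
my $count = ($all =~ s/t/T/g);

print "$found\n";
print "$first_only\n";
print "$all ($count replacements)\n";
```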
This chapter is not trying to be a reference of all characters that can be
used inside a RE pattern. There are other documents that do this quite well.
Instead, the basic meta characters are shown and explained.
Meta characters that you want to use literally must be escaped with a
backslash, just as in C++ strings. E.g. to use the square bracket [ literally,
write \[. (Remember that this is so for the Perl language and can be different
for other languages.)
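A small sketch of escaping in practice. Besides the manual backslash it uses Perl's built-in quotemeta function and the \Q...\E sequence, which escape every meta character of a string for you. The sample strings are made up for this illustration.

```perl
use strict;
use warnings;

my $text = "items[3] = 42";

# Escaped "[" and "]" are matched literally instead of opening a class.
my ($index) = $text =~ m/\[([0-9]+)\]/;

# Hypothetical search text that happens to contain meta characters.
my $literal = "a.b*c";
my $escaped = quotemeta($literal);    # backslash before "." and "*"

# Unescaped, "a.b*c" is a pattern and also matches "aXbbc".
my $unescaped_hit = ("xaXbbcy" =~ m/$literal/)     ? 1 : 0;
# Escaped with \Q...\E, only the literal text "a.b*c" matches.
my $escaped_hit   = ("xaXbbcy" =~ m/\Q$literal\E/) ? 1 : 0;
my $literal_hit   = ("xa.b*cy" =~ m/\Q$literal\E/) ? 1 : 0;

print "index = $index, escaped = $escaped\n";
```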
Following are the most important meta characters, as listed in the chapter
"Regular Expression Syntax" on MSDN:
Character | Description
\ | Marks the next character as either a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
. | Matches any single character except "\n". To match any character including the '\n', use a pattern such as '[\s\S]'.
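How the dot treats a newline can be checked with a short sketch. Note that inside square brackets the dot loses its special meaning in Perl, so a class such as [.\n] only means "a dot or a newline". The class [\s\S] ("whitespace or non-whitespace") is one common way to really match every character.

```perl
use strict;
use warnings;

my $two_lines = "first\nsecond";

# "." stops at the newline, so ".+" only covers the first line.
my ($dot_only) = $two_lines =~ m/(.+)/;

# [\s\S] matches every character, including the newline.
my ($everything) = $two_lines =~ m/([\s\S]+)/;

# Inside a class the dot is literal: [.\n] does not match a letter.
my $class_matches_letter = ("a" =~ m/[.\n]/) ? 1 : 0;
my $class_matches_dot    = ("." =~ m/[.\n]/) ? 1 : 0;

print "dot only: $dot_only\n";
```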
A character class is a group of one or more characters, written in square
brackets '[...]'. E.g. the construct "B[iu]rma" matches "Birma" or "Burma",
i.e. a "B" followed by either an "i" or a "u", followed by "rma".
In other words, a character class means "match any single character of
that class".
There is also the opposite of a character class, the negated character class,
which means "match any single character that is not in the class". E.g.
'[^1-6]' matches any character except the digits "1" to "6".
See more examples at "Character Matching" on MSDN.
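Both kinds of character class can be tried out on a few made-up strings. The word list and the digit string below are only illustrations.

```perl
use strict;
use warnings;

# Keep the words from a hypothetical list that contain "B[iu]rma".
my @names   = ("Birma", "Burma", "Barma", "Brma");
my @matched = grep { m/B[iu]rma/ } @names;

# Negated class: delete every character that is NOT a digit from 1 to 6.
(my $kept = "a1b2c7d6e9") =~ s/[^1-6]//g;

print "matched: @matched\n";
print "kept:    $kept\n";
```

"Barma" fails because "a" is not in the class, and "Brma" fails because a character class always stands for exactly one character.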
If you don't know exactly how many characters are coming, you can use
quantifiers to specify the number of times a character can occur. E.g. you can
say "Hel+o", which means "He" followed by one or more "l"s followed by an "o".
More quantifiers, as listed in the chapter "Quantifiers" on MSDN, are:
Character | Description
* | Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". '*' is equivalent to '{0,}'.
+ | Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". '+' is equivalent to '{1,}'.
? | Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches the "do" in "do" or "does". '?' is equivalent to '{0,1}'.
{n} | n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the "o" in "Bob", but matches the two "o"s in "food".
{n,} | n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob" and matches all the "o"s in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m} | m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three "o"s in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers.
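The quantifiers can be compared side by side with a small sketch. It wraps each pattern in the anchors ^ and $ so that the pattern has to cover the whole sample string. The sample strings and the %matches hash are made up for this illustration.

```perl
use strict;
use warnings;

my @samples = ("z", "zo", "zoo", "zooo");

# For each pattern, collect the samples it matches from start to end.
my %matches;
for my $pattern ('zo*', 'zo+', 'zo?', 'zo{2}', 'zo{2,}', 'zo{1,2}') {
    $matches{$pattern} = join ",", grep { m/^$pattern$/ } @samples;
    print "$pattern => $matches{$pattern}\n";
}
```

For example, 'zo?' accepts only "z" and "zo", while 'zo{2,}' accepts "zoo" and "zooo".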
An important fact about quantifiers is that '*' and '+' are "greedy", i.e.
they match as much as they can, not as little. E.g.
$test = "hello out there, how are you";
$test =~ m/h.*o/;
means "find an 'h', followed by any number of arbitrary characters, followed by
an 'o'". The author probably thought it matches "hello", but actually it
matches "hello out there, how are yo", since the RE is greedy and searches until
the last "o", which is the "o" in "you".
You can explicitly say that a quantifier should be "ungreedy" by
appending a '?'. E.g.
$test = "hello out there, how are you";
$test =~ m/h.*?o/;
Would actually find "hello", as intended, since it now means "find an 'h',
followed by any number of arbitrary characters, followed by the first
occurrence of an 'o'".
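To see the difference yourself, capture what each variant matched. This is a minimal sketch. The added parentheses only make the matched text available and do not change what is matched.

```perl
use strict;
use warnings;

my $test = "hello out there, how are you";

# Greedy: ".*" runs to the end and backs up to the last "o".
my ($greedy) = $test =~ m/(h.*o)/;

# Ungreedy: ".*?" stops at the first "o" it can reach.
my ($lazy) = $test =~ m/(h.*?o)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```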
To check for the beginning or the end of a line (or string), you use the meta
characters ^ and $. E.g. "^thing" matches a line starting with "thing".
"thing$" matches a line ending with "thing".
The meta characters '\b' and '\B' are used for testing word boundaries and
non-word boundaries. E.g.
$test =~ m/out/;
would match not only the "out" in "speak out loud" but also the "out" in
"please don't shout at me". To avoid this, you can precede the pattern with
a word boundary anchor:
$test =~ m/\bout/;
Now it only finds "out" if it starts at a word boundary, not inside a word.
Alternation uses the '|' character to allow a choice between two or more
alternatives. Used together with parentheses '(...|...|...)', it allows you to
group the alternatives.
Parentheses themselves are used for "capturing" substrings for later use and
store them in the Perl built-in variables $1, $2, ..., $9. (See Backreferences,
below).
E.g.
$test = "I like apples a lot";
$test =~ m/like (apples|pines|bananas)/;
Will match, since "apples" is one of the three alternatives to match and
therefore "like apples" is found. The parentheses will also "capture" the
"apples" as a backreference in $1.
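The anchors, word boundaries and alternation described above can be tried out together in one sketch. The sample strings are made up. The last two lines show why the parentheses matter: without them the alternatives are "like apples", "pines" and "bananas".

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen to show where each pattern applies.
my @samples = ("thing one", "one thing", "speak out loud", "don't shout");

# Anchors: ^ ties the pattern to the start, $ to the end of the string.
my @starts_with_thing = grep { m/^thing/ } @samples;
my @ends_with_thing   = grep { m/thing$/ } @samples;

# Word boundaries: \bout\b only accepts "out" as a whole word.
my @contains_out = grep { m/out/ } @samples;
my @word_out     = grep { m/\bout\b/ } @samples;

# Alternation: the parentheses limit the choice to the three nouns
# and capture the one that was found.
my ($liked) = "I like pines" =~ m/like (apples|pines|bananas)/;

# Grouped versus ungrouped alternation on a sentence without "like".
my $grouped   = ("I hate bananas" =~ m/like (apples|pines|bananas)/) ? 1 : 0;
my $ungrouped = ("I hate bananas" =~ m/like apples|pines|bananas/)   ? 1 : 0;

print "liked: $liked, grouped: $grouped, ungrouped: $ungrouped\n";
```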
One of the most important features of REs is the ability to store ("capture")
a part of the matched substring for later reuse. This is done by placing the
substring in parentheses (...). These are stored in the Perl built-in variables
$1, $2, ..., $9.
If you don't want to capture a substring but need parentheses to group the
substring, use the '?:' operator to avoid capturing.
E.g.
$test = "Today is monday the 18th.";
$test =~ m/([0-9]+)th/;
will store "18" in $1, whereas
$test = "Today is monday the 18th.";
$test =~ m/[0-9]+th/;
will store nothing in $1 since the parentheses are not present.
$test = "Today is monday the 18th.";
$test =~ m/(?:[0-9]+)th/;
will store nothing in $1 either, since the parentheses are used with the '?:'
operator. Another example of the direct use in a replace operation:
$test = "Today is monday the 18th.";
$test =~ s/ the ([0-9]+)th/, and the day is $1/;
will result in $test being "Today is monday, and the day is 18.".
You can also use backreferences inside the pattern itself to refer to
previously found substrings by using \1, \2, ..., \9.
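A minimal sketch of how group numbering and \1 behave. The word list and the quoted sentence are made up for this illustration.

```perl
use strict;
use warnings;

# \1 stands for whatever the first group has just captured, so (\w)\1
# only matches the same character twice in a row.
my @words   = qw(book tree sun letter idea);
my @doubled = grep { m/(\w)\1/ } @words;

# A (?:...) group is not counted, so the number ends up in $1.
my ($day_only) = "monday the 18th" =~ m/(?:the )([0-9]+)th/;
# With plain parentheses both groups are captured and returned.
my @both = "monday the 18th" =~ m/(the )([0-9]+)th/;

# \1 makes sure the closing quote is the same character as the opening
# one, so the apostrophe inside the quoted text does not end the match.
my $sentence = q{He said "don't" twice};
my ($quote, $inner) = $sentence =~ m/(["'])(.*?)\1/;

print "doubled: @doubled\n";
print "inner:   $inner\n";
```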
E.g. the following RE will remove duplicate words:
$test = "the house is is big";
$test =~ s/\b(\S+)\b(\s+\1\b)+/$1/;
Will result in $test being "the house is big".
Sometimes it is necessary to say "match this, but only if it is not preceded
by that" or "match this, but only if it is not followed by that". When just
single characters are concerned, you can use the negated character class [^...].
But when it comes to more than just a single character, you need to use the
so-called lookahead-condition or lookbehind-condition. There are four possible
types:
- Positive lookahead-condition '(?=re)'
Match only when followed by the RE re.
- Negative lookahead-condition '(?!re)'
Match only when not followed by the RE re.
- Positive lookbehind-condition '(?<=re)'
Match only when preceded by the RE re.
- Negative lookbehind-condition '(?<!re)'
Match only when not preceded by the RE re.
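The four conditions can be tried out side by side. The sample text is made up for this sketch. Note that the text inside a condition is only checked, it does not become part of the match, which is why the replacements below leave "-juice" and "apple-" untouched.

```perl
use strict;
use warnings;

# Hypothetical sample text for this sketch.
my $menu = "apple-pie, apple-juice, orange-juice";

# Positive lookahead: words that are followed by "-juice".
my @juices = $menu =~ m/\w+(?=-juice)/g;

# Positive lookbehind: words that are preceded by "apple-".
my @apple_things = $menu =~ m/(?<=apple-)\w+/g;

# Negative lookahead: replace "apple" unless "-juice" follows.
(my $no_apple_pie = $menu) =~ s/apple(?!-juice)/cherry/g;

# Negative lookbehind: replace "juice" unless "apple-" precedes it.
(my $no_orange_juice = $menu) =~ s/(?<!apple-)juice/soda/g;

print "juices:       @juices\n";
print "apple things: @apple_things\n";
print "$no_apple_pie\n";
print "$no_orange_juice\n";
```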
Examples:
$test = "HTML is a document description-language and not a programming-language";
$test =~ m/(?<=description-)language/;
Will match the first "language" ("description-language"), since it is preceded by
"description-", whereas
$test = "HTML is a document description-language and not a programming-language";
$test =~ m/(?<!description-)language/;
Will match the second "language" ("programming-language"), since it is not
preceded by "description-".
Here are some more real-world examples from the last chapter of the RE
section of [3]. These more advanced REs can be used as a starting point for
your own REs, or just as examples you can study in more detail.
Swap the first two words:
s/(\S+)(\s+)(\S+)/$3$2$1/
Find name=value pairs:
m/(\w+)\s*=\s*(.*?)\s*$/
Now name is in $1, value is in $2.
Read a date in the form YYYY-MM-DD:
m/(\d{4})-(\d\d)-(\d\d)/
Now YYYY is in $1, MM is in $2, DD is in $3.
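The three patterns above can be wrapped into a small script to see what ends up in the captured groups. The input strings and the parse_date helper are made up for this sketch. Note that the date pattern only checks the shape of the text, not whether the month or day is in a valid range.

```perl
use strict;
use warnings;

# Swap the first two words; the captured whitespace ($2) is kept as it was.
(my $swapped = "hello big world") =~ s/(\S+)(\s+)(\S+)/$3$2$1/;

# Split a hypothetical "name = value" line; surrounding blanks are dropped.
my ($name, $value) = "color =  dark blue  " =~ m/(\w+)\s*=\s*(.*?)\s*$/;

# Read a date; returns an empty list if the text contains no date.
sub parse_date {
    my ($text) = @_;
    return unless $text =~ m/(\d{4})-(\d\d)-(\d\d)/;
    return (year => $1, month => $2, day => $3);
}

my %date = parse_date("released on 2003-04-18");

print "$swapped\n";
print "$name => '$value'\n";
print "$date{year} / $date{month} / $date{day}\n";
```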
Remove the leading path from a filename:
s/^.*\
This document tried to give you a brief introductory overview of what REs are
and where and how to use them.
Although it is straightforward to get started with REs, there are quite a lot of
traps and errors you will probably meet in "real life". It is highly
recommended to refer to additional literature and examples to understand and use
the full power of REs. Especially [4] is a very valuable (but somewhat demanding)
resource you should read.
Topics that were not covered in this document include:
- Modifiers to REs (also known as "switches")
These can be used for setting things like case-sensitivity, single-line and
multiline mode, extended mode for better readability, etc.
- Internals of a RE engine
Different types of RE engines (namely NFA and DFA) behave differently.
- Using REs in languages other than Perl
There are language-specific details that differ from Perl's RE
implementation.
- Optimizations
There is always more than one way of writing a RE. Some are faster,
others are easier to read.
For these and many others, please take a look at the resources below.
[1] Learning Perl (2nd Edition)
Randal L. Schwartz, Tom Christiansen, Larry Wall (Foreword)
[2] Programming Perl (3rd Edition)
Larry Wall, Tom Christiansen, Jon Orwant
[3] Perl Cookbook
Tom Christiansen, Nathan Torkington, Larry Wall
[4] Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools (O'Reilly Nutshell)
Jeffrey E. Friedl (Editor), Andy Oram (Editor)
[5] Introduction to Regular Expressions
Microsoft Developer Network (MSDN), Microsoft Corporation
[6] Perl 5 Pocket Reference, 3rd Edition: Programming Tools (O'Reilly Perl)
Johan Vromans, Larry Wall, Linda Mui