Introduction
Regexps (Perl-compatible regular expressions) are great, no doubt (refer to this wonderful article for a tutorial). The one small problem is that every regular expression pattern has to be written as a single string.
For example, suppose we want to specify a pattern for a phone number with the following rules:
- Digits are in groups of 1 or more
- Spaces and minus signs are used as separators
- At least one group of digits should be present
How would the appropriate pattern look? Something like this:
// the @ sign is used in C# to prevent parsing \ as escape sequence
@"(\d+[\s\-])*\d+"
This means that we have groups of digits (\d+) followed by a separator, either a minus sign or a whitespace character ([\s\-]). Such groups can occur any number of times (maybe 0), but at least one group of digits must be present (the final \d+). Well, not very difficult, but not very nice either.
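A quick way to try the pattern above is a small console sketch like the following. The ^ and $ anchors are an addition made for this demo (assumed here so that the whole input has to match, not just a part of it), and the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhonePatternDemo
{
    // The pattern from the text, wrapped in ^...$ (an assumption of this demo)
    // so that the whole input has to match rather than a substring.
    public const string Pattern = @"^(\d+[\s\-])*\d+$";

    public static bool IsPhoneNumber(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    public static void Main()
    {
        string[] samples =
        {
            "123 568-99", // True: three digit groups with separators
            "42",         // True: a single digit group is enough
            "12--34",     // False: two separators in a row
            "-123"        // False: starts with a separator
        };
        foreach (string s in samples)
        {
            Console.WriteLine("{0,-12} {1}", s, IsPhoneNumber(s));
        }
    }
}
```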
Suppose that, at some point, the customer says the number may include capital letters (like 1-800-GO-TO-THE-HELL-NOW). We then have to change our digit group specification in two places.
And what if we have a regex for, say, a real number in exponential format? Something like this...
@"[1-9][\d]*[.,][\d]*[1-9][Ee][+\-](0|[1-9][\d]*)"
... and that covers only one (full) form of notation, like 123.456E+120. But the integer or the fractional part may be omitted, and then our regex becomes really complex:
// line breaks are for readability only (they would need RegexOptions.IgnorePatternWhitespace)
@"([1-9][\d]*[.,][\d]*[1-9][Ee](0|([+\-]?[1-9][\d]*)))
|
([.,][\d]*[1-9][Ee](0|([+\-]?[1-9][\d]*)))
|
([1-9][\d]*[.,][Ee](0|([+\-]?[1-9][\d]*)))"
Brrr, really?
A Dream
For a long time, I had a dream (:-). A dream to write something like this:
SIGNIFICANT_DIGIT = @"[1-9]";
DIGIT = @"[0-9]";
// ` quote is the rare special character
// not having its own meaning in regex syntax
INT_PART = @"`SIGNIFICANT_DIGIT` `DIGIT`*";
FLOAT_PART = @"`DIGIT`* `SIGNIFICANT_DIGIT`";
EXP_PART = @"[Ee](0|[+\-]?`INT_PART`)";
FULL_EXP = @"`INT_PART` [.,] `FLOAT_PART` `EXP_PART`";
NO_INT_EXP = @"[.,] `FLOAT_PART` `EXP_PART`";
NO_FLOAT_EXP = @"`INT_PART` [.,] `EXP_PART`";
// and finally
PATTERN = @"`FULL_EXP` | `NO_INT_EXP` | `NO_FLOAT_EXP`";
Well, many more lines of code, but:
- Each group of symbols is defined once and then reused, so no group is duplicated in different parts of the pattern.
- Each line is much shorter and contains named literals, which makes the expression easier to understand.
This article describes a class that brings a similar syntax to C# programs. It handles such expressions and returns a Regex object created from the expanded pattern.
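To check that the named pieces above really compose into a working pattern, here is a sketch that uses ordinary C# constant concatenation as a stand-in for the backquote syntax. It is not the class described in this article. The constant names mirror the ones above, the spaces between the pieces are dropped, and the ^(?:...)$ wrapper is an assumption added so that the whole input must match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExponentPatternDemo
{
    // Same building blocks as in the text, composed by plain concatenation.
    const string SignificantDigit = "[1-9]";
    const string Digit = "[0-9]";
    const string IntPart = SignificantDigit + Digit + "*";
    const string FloatPart = Digit + "*" + SignificantDigit;
    const string ExpPart = @"[Ee](0|[+\-]?" + IntPart + ")";
    const string FullExp = IntPart + "[.,]" + FloatPart + ExpPart;
    const string NoIntExp = "[.,]" + FloatPart + ExpPart;
    const string NoFloatExp = IntPart + "[.,]" + ExpPart;

    // The anchors and the non-capturing group are additions for this demo.
    public const string Pattern =
        "^(?:" + FullExp + "|" + NoIntExp + "|" + NoFloatExp + ")$";

    public static bool IsExponential(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    public static void Main()
    {
        string[] samples =
        {
            "123.456E+120", // True: full form
            ".5E0",         // True: integer part omitted
            "12.E-3",       // True: fractional part omitted
            "123.450E1"     // False: fractional part ends with 0
        };
        foreach (string s in samples)
        {
            Console.WriteLine("{0,-14} {1}", s, IsExponential(s));
        }
    }
}
```

Concatenation works, but every reference has to be spelled out as a C# expression, which is the noise the backquote syntax is meant to remove.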
Idea
OK, the idea is as simple as possible. We create a class that allows adding "variables". Each variable can be a plain regular expression or a regex-like expression with references to previously added variables. The pattern is then set in the same form. After that, we get a ready-made Regex object and use it however we like.
We use the ` quote to mark variables. If we need the quote character itself (maybe someone still needs it :), we can write "\`".
Implementation Details
The class is called VarRegex. It has a nested enumerable class, VariablesCollection, built around a Dictionary<String, String>. This class allows adding and modifying variables through an indexer property, retrieving their Count, emptying the variable list with Clear, and enumerating variable values. The main method of VariablesCollection is Expand. It receives a string to be "expanded", looks for occurrences of variable names, and replaces each variable reference with the variable's expanded value.
The method is implemented in the following way:
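public String Expand(String pattern)
{
    if (String.IsNullOrEmpty(pattern))
        return "";
    string p = pattern;
    // Hide escaped quotes (\`) behind a placeholder character
    p = p.Replace("\\`", "" + (char)1);
    // Find every `name` reference
    Regex r = new Regex("`([^`]+)`");
    MatchCollection ms = r.Matches(p);
    foreach (Match match in ms)
    {
        string t = match.Groups[1].Value;
        // Replace the reference with the recursively expanded variable value
        p = p.Replace("`" + t + "`", Expand(variables[t]));
    }
    // Put the escaped quotes back as plain ` characters
    p = p.Replace("" + (char)1, "`");
    return p;
}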
First, we hide the "fake" (escaped) quotes together with their backslashes. Then we look for all quoted variable names and replace each one with the expanded value of that variable. Finally, we put the "fake" quotes back (without the backslashes). Well, rather easy. Each time we make changes to the variables or the pattern, the Regex object inside our VarRegex object is recreated. The VariablesCollection class also uses a nested enumerable class, ExpandedVariablesCollection, which allows enumerating expanded variable values or retrieving them by name.
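To make the moving parts concrete, here is a minimal, self-contained sketch of the same idea. It is not the original VarRegex source. The name MiniVarRegex is hypothetical, the variables live in a plain Dictionary instead of a VariablesCollection, the Regex is rebuilt on every access rather than on every change, and the expansion uses a single Regex.Replace pass with a MatchEvaluator instead of repeated String.Replace calls. Since the substitution is purely textual, the sample group is parenthesized so that * applies to all of it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the article's VarRegex (illustration only).
public class MiniVarRegex
{
    private const string Placeholder = "\u0001"; // stands in for an escaped `
    private static readonly Regex Reference = new Regex("`([^`]+)`");

    public Dictionary<string, string> Variables { get; private set; }
    public string Pattern { get; set; }
    public RegexOptions Options { get; set; }

    public MiniVarRegex()
    {
        Variables = new Dictionary<string, string>();
        Pattern = "";
        Options = RegexOptions.None;
    }

    // Simplification: build a fresh Regex from the expanded pattern each time.
    public Regex Regex
    {
        get { return new Regex(Expand(Pattern), Options); }
    }

    public string Expand(string text)
    {
        if (string.IsNullOrEmpty(text))
            return "";
        // 1. Hide escaped quotes (\`) behind the placeholder.
        string hidden = text.Replace("\\`", Placeholder);
        // 2. Replace each `name` with its recursively expanded value.
        string expanded = Reference.Replace(hidden, delegate(Match m)
        {
            string name = m.Groups[1].Value;
            string value;
            if (!Variables.TryGetValue(name, out value))
                throw new ArgumentException("Unknown variable: " + name);
            return Expand(value);
        });
        // 3. Restore the escaped quotes as plain ` characters.
        return expanded.Replace(Placeholder, "`");
    }

    public static void Main()
    {
        MiniVarRegex vr = new MiniVarRegex();
        vr.Variables["int"] = @"\d+";
        vr.Variables["sep"] = @"[\s\-]";
        vr.Variables["gr"] = @"(`int``sep`)";
        vr.Pattern = @"`gr`*`int`";
        Console.WriteLine(vr.Expand(vr.Pattern));              // (\d+[\s\-])*\d+
        Console.WriteLine(vr.Regex.Match("123 568-99").Success); // True
    }
}
```

This sketch does not guard against a variable that refers to itself, directly or through other variables. Such a definition would recurse until the stack overflows.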
Using
Now the code that generates the regex for the phone number from the introduction looks like this:
VarRegex vr = new VarRegex();
vr.Variables["int"] = @"\d+";
vr.Variables["sep"] = @"[\s\-]";
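// expansion is plain text substitution, so parenthesize the group
// to make * in the pattern apply to all of it
vr.Variables["gr"] = @"(`int``sep`)";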
vr.Pattern = @"`gr`*`int`";
vr.Options = RegexOptions.IgnoreCase;
string str = @"123 568-99";
Match m = vr.Regex.Match(str);
Console.WriteLine("Result for string {1}: {0}\n", m.Success, str);
Limitations
The main limitation is that variables must be added in the order in which they are referenced. That is, a variable should be added to the VarRegex only after all the variables it references have been added.
History
- 10th January, 2007: Initial post