Introduction
In fact I do like regular expressions: they do the job well. Even too well, since
every developer ends up using them and there is no way to get rid of them.
Unfortunately, whenever I need a new one I face the same issue: I have forgotten
almost everything about their damned syntax... If I wrote one every day
I would probably remember it easily, but that is not the case, as I barely
write a couple of them a year...
Fed up with reading and learning that documentation again and again, I decided to
implement the following String extension methods...
Background
Regular expressions are a powerful and concise means of processing large amounts
of text in order to validate, extract, edit, replace or delete the parts of
a text that match a predefined pattern (e.g. an email address).
In order to make proper use of Regex you need:
- a text to analyse
- a regular expression engine
- a regular expression (the pattern to look for in the text to analyse)
The regular expression syntax varies depending on the regular expression engine
you use. In the Microsoft world the class that serves as the regular expression
engine is System.Text.RegularExpressions.Regex and its syntax
is described here: http://msdn.microsoft.com/en-us/library/az24scfc.aspx
If you are looking for an introduction to regular expression syntax, please read
this excellent article: http://www.codeproject.com/Articles/9099/The-30-Minute-Regex-Tutorial
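As a rough sketch of those three ingredients working together, here is a small example that uses the Regex class directly. The class name, method name and sample texts are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexIngredients
{
    // Returns true when the text contains something shaped like NN/NN/NNNN.
    public static bool ContainsDateLikeText(string text)
    {
        // The regular expression: two digits, a slash, two digits, a slash, four digits.
        const string pattern = @"[0-9]{2}/[0-9]{2}/[0-9]{4}";

        // The engine: System.Text.RegularExpressions.Regex.
        return Regex.IsMatch(text, pattern);
    }

    public static void Main()
    {
        // The texts to analyse.
        Console.WriteLine(ContainsDateLikeText("Invoice issued on 25/12/2013")); // True
        Console.WriteLine(ContainsDateLikeText("No date here"));                 // False
    }
}
```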
The problem with regular expressions
They have the drawbacks of their advantages: the syntax (concise and powerful)
is meant to be friendly to regular expression engines, but not really to human
beings.
When you are not familiar with the syntax you can spend a long time writing a valid expression.
You can spend another long time testing that expression and making it bulletproof.
It is one thing to make sure your regular expression is matching what you expect
but it is another thing to make sure it is matching ONLY what you expect.
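To illustrate the difference, here is a minimal sketch with a loose pattern and an anchored one. The names and sample strings are assumptions chosen for the example. The loose pattern is happy as soon as a date-shaped fragment appears anywhere, while the anchored one accepts only a string that is exactly that shape.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoringDemo
{
    // Matches a date-shaped fragment anywhere in the text.
    const string Loose = @"[0-9]{2}/[0-9]{2}/[0-9]{4}";

    // ^ anchors at the start of the string and \z at its very end.
    const string Strict = @"^[0-9]{2}/[0-9]{2}/[0-9]{4}\z";

    public static bool LooseMatch(string s) => Regex.IsMatch(s, Loose);

    public static bool StrictMatch(string s) => Regex.IsMatch(s, Strict);

    public static void Main()
    {
        Console.WriteLine(LooseMatch("25/12/2013"));      // True
        Console.WriteLine(StrictMatch("25/12/2013"));     // True
        Console.WriteLine(LooseMatch("xx25/12/2013yy"));  // True: matches more than intended
        Console.WriteLine(StrictMatch("xx25/12/2013yy")); // False
    }
}
```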
The idea
If you are familiar with SQL you know the LIKE operator. Why not bring that
operator to C#?
Why not have a simplified syntax for the most frequent operations
you would ask your Regex engine to perform?
A simplified syntax
... means fewer operators. Here is the list that I have, quite arbitrarily, come up
with:
- ? = Any single character
- % = Zero or more characters of any kind
- * = Zero or more word characters (letters, digits, underscore), so no white space (basically a word)
- # = Any single digit (0-9)
Examples of simple expressions:
- a Guid can be expressed as : ????????-????-????-????-????????????
- an email address could be : *?@?*.?*
- as for a date : ##/##/####
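To make the meaning of each operator concrete, here is one possible way to translate a simplified expression into a .NET regular expression, character by character. This is only a sketch: ToRegexPattern is a hypothetical helper, not the code of this article, and it leans on Regex.Escape for every character that is not one of the four operators.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class SimplifiedSyntax
{
    // Hypothetical helper: expands the four operators and escapes everything else.
    public static string ToRegexPattern(string simplified)
    {
        var sb = new StringBuilder();
        foreach (var c in simplified)
        {
            switch (c)
            {
                case '?': sb.Append('.'); break;      // any single character
                case '%': sb.Append(".*"); break;     // zero or more characters
                case '*': sb.Append(@"\w*"); break;   // zero or more word characters
                case '#': sb.Append("[0-9]"); break;  // a single digit
                default: sb.Append(Regex.Escape(c.ToString())); break; // literal character
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToRegexPattern("##/##/####")); // [0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]
        Console.WriteLine(ToRegexPattern("*?@?*.?*"));   // \w*.@.\w*\..\w*
    }
}
```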
Regular expression aficionados are already jumping out of their chairs: obviously
nothing guarantees that the last expression matches a valid date, and they are right (that
expression would match 99/99/9999). But in no way does that
syntax replace the regular expression one. It is far from offering
the same level of capabilities, especially in terms of validation.
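When real validation is needed, a shape check can be followed by a proper parse. The sketch below assumes a day/month/year order, which the pattern itself does not specify. The names are made up, and the shape check is written as a plain regular expression equivalent to ##/##/####.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DateCheck
{
    // Shape only: two digits, slash, two digits, slash, four digits.
    public static bool LooksLikeDate(string s) =>
        Regex.IsMatch(s, @"^[0-9]{2}/[0-9]{2}/[0-9]{4}\z");

    // Actual validation, assuming the dd/MM/yyyy order.
    public static bool IsRealDate(string s) =>
        DateTime.TryParseExact(s, "dd/MM/yyyy", CultureInfo.InvariantCulture,
                               DateTimeStyles.None, out _);

    public static void Main()
    {
        Console.WriteLine(LooksLikeDate("99/99/9999")); // True
        Console.WriteLine(IsRealDate("99/99/9999"));    // False
        Console.WriteLine(IsRealDate("25/12/2013"));    // True
    }
}
```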
Frequent operations
What are the frequent operations you need a regular expression engine for?
- determining if the text to analyse matches a given pattern: Like
- finding an occurrence of a given pattern in the text to analyse: Search
- retrieving string(s) from the text to analyse: Extract
These 3 operations, 'Like', 'Search' and 'Extract', have been implemented as
extension methods on string, as an alternative to using a regular expression engine directly.
Let's start by describing their usage; the code will follow...
1. Determining if a string is 'like' a given pattern
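If you know SQL then you know what I am talking about...
The Like extension simply returns true when the input string matches the
given pattern. The match is anchored at the start of the string only: put % in front
of the pattern to allow leading text, while text after the end of the pattern is accepted anyway.
All the following examples return true, meaning the input strings are like their
patterns.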
example: a string is a guid
var result0 = "A0E02391-A0DF-4772-B39A-C11F7D63C495".Like("????????-????-????-????-????????????");
example: a string ends with a guid
var result1 = "This is a guid A0E02391-A0DF-4772-B39A-C11F7D63C495".Like("%????????-????-????-????-????????????");
example: a string starts with a guid
var result2 = "A0E02391-A0DF-4772-B39A-C11F7D63C495 is a guid".Like("????????-????-????-????-????????????%");
example: a string contains a guid
var result3 = "this string A0E02391-A0DF-4772-B39A-C11F7D63C495 contains a guid".Like("%????????-????-????-????-????????????%");
example: % also matches zero characters, so a bare guid "ends with a guid" too
var result4 = "A0E02391-A0DF-4772-B39A-C11F7D63C495".Like("%????????-????-????-????-????????????");
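For comparison, here is roughly what the "contains a guid" check looks like when written by hand against the Regex class. It is a sketch under the assumption that ? expands to "any character" and % to "zero or more characters", with repeated ? collapsed into quantifiers for readability. The class and method names are invented, and the check only looks at the shape, not at whether the characters are hexadecimal.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LikeEquivalent
{
    // Hand-written counterpart of the simplified pattern
    // %????????-????-????-????-????????????% anchored at the start of the text.
    static readonly Regex ContainsGuidShape =
        new Regex(@"^.*.{8}-.{4}-.{4}-.{4}-.{12}.*", RegexOptions.IgnoreCase);

    public static bool ContainsGuidShapedText(string text) =>
        ContainsGuidShape.IsMatch(text);

    public static void Main()
    {
        Console.WriteLine(ContainsGuidShapedText(
            "this string A0E02391-A0DF-4772-B39A-C11F7D63C495 contains a guid")); // True
        Console.WriteLine(ContainsGuidShapedText("no guid here"));                // False
    }
}
```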
2. 'Searching' for a particular pattern in a string
The Search extension method retrieves the first occurrence of
the given pattern inside the provided text, or null when the pattern is not found.
example: search for a guid inside a text
var result5 = "this string [A0E02391-A0DF-4772-B39A-C11F7D63C495] contains a string matching".Search("[????????-????-????-????-????????????]");
Console.WriteLine(result5); // prints [A0E02391-A0DF-4772-B39A-C11F7D63C495]
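The same lookup written directly against the Regex class could look like the sketch below. The names are hypothetical. It shows the two things Search hides from the caller: escaping the square brackets, and turning a failed match into null.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchEquivalent
{
    // Returns the first bracketed, guid-shaped substring, or null when there is none.
    public static string FindBracketedGuid(string text)
    {
        // [ and ] are regex metacharacters, so they must be escaped to be matched literally.
        var match = Regex.Match(text, @"\[.{8}-.{4}-.{4}-.{4}-.{12}\]");
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        var found = FindBracketedGuid(
            "this string [A0E02391-A0DF-4772-B39A-C11F7D63C495] contains a string matching");
        Console.WriteLine(found); // [A0E02391-A0DF-4772-B39A-C11F7D63C495]
        Console.WriteLine(FindBracketedGuid("nothing to find") == null); // True
    }
}
```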
3. 'Extracting' values out of a string given a known pattern
Almost like searching, except that it does not bring back the whole string that matches the pattern but a list of the parts of that string found between the literal characters of the pattern.
example: retrieving the constituents of a guid inside a text
var result6 = "this string [A0E02391-A0DF-4772-B39A-C11F7D63C495] contains a string matching".Extract("[????????-????-????-????-????????????]");
// result6 contains: A0E02391, A0DF, 4772, B39A, C11F7D63C495
example: retrieving the constituents of an email inside a text
var result7 = "this string contains an email: toto@domain.com".Extract("*?@?*.?*");
// result7 contains: toto, domain, com
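With plain regular expressions the usual way to get the same parts is capture groups. The sketch below is an assumption about how one might write it by hand, not the way Extract is implemented, and the names are invented. Each parenthesised group covers one run of operators from the simplified email pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExtractEquivalent
{
    // Returns the parts around '@' and '.', or null when no email-shaped text is found.
    public static List<string> EmailParts(string text)
    {
        // Three capture groups, separated by the literal '@' and '.'.
        var match = Regex.Match(text, @"(\w*.)@(.\w*)\.(.\w*)");
        if (!match.Success) return null;

        var parts = new List<string>();
        // Group 0 is the whole match, so the captured parts start at index 1.
        for (var i = 1; i < match.Groups.Count; i++)
        {
            parts.Add(match.Groups[i].Value);
        }
        return parts;
    }

    public static void Main()
    {
        var parts = EmailParts("this string contains an email: toto@domain.com");
        Console.WriteLine(string.Join(", ", parts)); // toto, domain, com
    }
}
```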
Here's the code
The simple trick here is that the 3 public methods rely on GetRegex, which transforms the simplified expression into a valid .NET one.
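using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StringExt
{
    // True when the string matches the pattern, starting at its first character.
    // Only the start is anchored, so text after the end of the pattern is accepted.
    public static bool Like(this string item, string searchPattern)
    {
        var regex = GetRegex("^" + searchPattern);
        return regex.IsMatch(item);
    }

    // Returns the first substring matching the pattern, or null when there is none.
    public static string Search(this string item, string searchPattern)
    {
        var match = GetRegex(searchPattern).Match(item);
        return match.Success ? match.Value : null;
    }

    // Returns the parts of the first match located between the literal
    // fragments of the pattern, or null when nothing matches.
    public static List<string> Extract(this string item, string searchPattern)
    {
        var result = item.Search(searchPattern);
        if (string.IsNullOrWhiteSpace(result)) return null;

        // The literal fragments of the pattern act as separators.
        var separators = searchPattern.Split(new[] { '?', '%', '*', '#' }, StringSplitOptions.RemoveEmptyEntries);
        var temp = result;
        var final = new List<string>();
        foreach (var x in separators)
        {
            // The regex ignores case, so the literal lookup must do the same.
            var pos = temp.IndexOf(x, StringComparison.OrdinalIgnoreCase);
            if (pos < 0) continue; // fragment not found verbatim: skip it rather than cut at random
            if (pos > 0)
            {
                final.Add(temp.Substring(0, pos));
                temp = temp.Substring(pos);
            }
            temp = temp.Substring(x.Length);
        }
        if (temp.Length > 0) final.Add(temp);
        return final;
    }

    // Translates the simplified syntax into a .NET regular expression.
    // Order matters: regex metacharacters are escaped first, then the
    // simplified operators (#, ?, *, %) are expanded.
    static Regex GetRegex(string searchPattern)
    {
        return new Regex(searchPattern
            .Replace("\\", "\\\\")
            .Replace(".", "\\.")
            .Replace("{", "\\{")
            .Replace("}", "\\}")
            .Replace("[", "\\[")
            .Replace("]", "\\]")
            .Replace("(", "\\(")
            .Replace(")", "\\)")
            .Replace("|", "\\|")
            .Replace("+", "\\+")
            .Replace("$", "\\$")
            .Replace(" ", "\\s")
            .Replace("#", "[0-9]")
            .Replace("?", ".")
            .Replace("*", "\\w*")
            .Replace("%", ".*")
            , RegexOptions.IgnoreCase);
    }
}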
Conclusion
As stated above, the intent is not to replace Regex but to provide a very simple approach that solves about 80% of the cases for which I previously
needed Regex. This approach keeps basic tasks very simple and makes the client code very easy to write and obvious to understand
for anyone who is not an expert in Regex syntax.