Regular expressions are a remarkably handy tool. A regular expression (regex or regexp for short) is a special text string that describes a search pattern. You can think of regular expressions as wildcards on steroids. Almost all languages support them, and with a little understanding you can get very good at them. What’s more, they can condense dozens of lines of matching logic into a line or two. I’ve used regular expressions both in the .NET Framework and in JavaScript and found them immensely helpful. This blog post focuses on them, and here’s a little outline of how I’ll approach the topic.
- A problem – We engineers are problem solvers, so we start with a simple problem and see how regular expressions solve it in an even simpler way
- Dissecting the solution – An explanation of how that simpler solution works
- Links to regular expression resources – Unless you can write patterns yourself, this language feature won’t get any easier to use
- The Regex class – The power of regular expressions in the .NET Framework, and what you can achieve with them
Starting with this, followed by that, then this and then a dozen of that – A Problem!
I was given the problem statement below and a laptop to code a solution on. Developer instincts are hard to ignore, and I soon started scribbling an elegant algorithm, or so I thought.
Domestic passport numbers either begin with the letters TW followed by 12 digits, or begin with 6 digits followed by 4 other characters and end with the letters OFTW.
Pseudo logic:
- Is Starting With TW?
  - Are the remaining 12 characters – digits?
    i. Return “Domestic”
- Is Ending with OFTW?
  - Does it begin with 6 digits?
    i. Return “Domestic”
- If all above fails
  - Return “Foreigner”
So with all the logic and intricacies figured out, I translated the pseudo code directly into C#:
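public bool IsDomestic(string PassportNumber)
{
    if (PassportNumber.StartsWith("TW"))
    {
        // Note: Substring throws if fewer than 12 characters follow "TW",
        // and nothing rejects extra characters after them.
        string Rest = PassportNumber.Substring(2, 12);
        Int64 Result;
        // Note: Int64.TryParse also accepts a leading sign or surrounding whitespace.
        if (Int64.TryParse(Rest, out Result))
        {
            return true;
        }
    }
    else if (PassportNumber.EndsWith("OFTW"))
    {
        // Note: the 4 characters between the digits and "OFTW" are never checked.
        string First6 = PassportNumber.Substring(0, 6);
        Int64 Result;
        if (Int64.TryParse(First6, out Result))
        {
            return true;
        }
    }
    return false;
}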
That’s when I was asked: don’t you keep up with all the language features? Especially something called regular expressions?
Well, I don’t claim to be an all-knowing buff, but I knew what regular expressions were and I knew perfectly well how to use them. Surprisingly, they just never occurred to me. Given a problem, my first instinct was to approach it analytically and code the steps by hand. The regular expression way was the exact opposite: it was elegant, and I kicked myself for not thinking of it.
The Solution – Simpler one
So here’s the code I would have written had I used Regex:
public bool IsDomestic(string PassportNumber)
{
    return Regex.Match(PassportNumber,
        "(TW)(\\d{12})($)|(\\d{6})([A-Za-z0-9]{4})(OFTW)($)").Success;
}
Surprisingly few lines of code, isn’t it?
Let’s dissect the solution
Without going into much depth, let me explain what the pattern (yes, I called it a pattern) means:
(TW)(\\d{12})($)|(\\d{6})([A-Za-z0-9]{4})(OFTW)($)
You see the pipe (|) symbol in the middle? Let me split the pattern at that part.
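Left part: (TW)(\\d{12})($)
- (TW) – Matches the literal characters TW. Note that nothing here pins them to the start of the string; a leading ^ would do that
- (\\d{12}) – Says that exactly 12 digits follow (the backslash is doubled only because the pattern sits inside a C# string literal; the regex engine sees \d)
- ($) – The string should end at this point, with no more characters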
Right part: (\\d{6})([A-Za-z0-9]{4})(OFTW)($)
- (\\d{6}) – 6 digits, meant to be the first 6 characters of the string
- ([A-Za-z0-9]{4}) – 4 alphanumeric characters (A-Z, a-z or 0-9) follow
- (OFTW) – The next four characters are OFTW
- ($) – The string should end at this point, with no more characters, as we saw earlier
Did I miss the pipe (|)? You know what it is: it’s just an OR condition. The string in question, say TW012345678901, can match either the left part of the pattern or the right part.
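One thing the dissection glosses over: the pattern as written has no start anchor, so Regex.Match accepts any input that merely ends the right way, such as XXTW012345678901. Below is a minimal sketch of a stricter check, not the code from the original exercise. The PassportChecker class and the DomesticPattern field are names made up for illustration, and using \A, \z and [0-9] instead of $ and \d is my assumption about how strict the check should be (in .NET, $ also matches just before a trailing newline, and \d accepts non-ASCII digits).

```csharp
using System;
using System.Text.RegularExpressions;

public static class PassportChecker
{
    // Assumed tightening of the post's pattern: \A and \z pin the match to the
    // whole string, and [0-9] restricts the digits to ASCII.
    private static readonly Regex DomesticPattern = new Regex(
        @"\A(?:TW[0-9]{12}|[0-9]{6}[A-Za-z0-9]{4}OFTW)\z");

    public static bool IsDomestic(string passportNumber)
    {
        return passportNumber != null && DomesticPattern.IsMatch(passportNumber);
    }

    public static void Main()
    {
        Console.WriteLine(IsDomestic("TW012345678901"));   // TW + 12 digits
        Console.WriteLine(IsDomestic("123456AB12OFTW"));   // 6 digits + 4 alphanumerics + OFTW
        Console.WriteLine(IsDomestic("XXTW012345678901")); // extra leading characters
    }
}
```

Run as a console program, this should print True, True and False, whereas the unanchored pattern would accept the third input as well.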
Some links before we proceed
This should have given you a basic idea of how to formulate a simple regex pattern; if you need further help, you can visit http://www.regular-expressions.info/reference.html. It gives you a good overview of regex syntax.
Or if you want a quick one-stop solution or pattern, visit Mark’s http://txt2re.com/; it’s an amazing site.
The almighty Regex class of .NET
Let’s step back a bit, focus on the Regex class in the .NET Framework, and see how we can put it to better use!
Check if a string looks as it should
The solution above does exactly that. The Match and IsMatch methods of the Regex class try to fit the given string into a pattern. IsMatch simply returns true when the string matches the pattern and false when it looks nothing like it; Match returns a Match object whose Success property tells you the same thing. In their simplest form, both methods ask only for the string and the pattern.
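To make the difference between the two methods concrete, here is a small sketch. The MatchDemo class and the ExtractDigits helper are hypothetical names of my own, and the anchored TW pattern is a simplified stand-in for the passport pattern above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchDemo
{
    // Hypothetical helper: pulls the 12-digit part out of a TW-style number,
    // or returns null when the input does not have that shape.
    public static string ExtractDigits(string passportNumber)
    {
        Match m = Regex.Match(passportNumber, @"^TW(\d{12})$");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // IsMatch answers only yes or no.
        Console.WriteLine(Regex.IsMatch("TW012345678901", @"^TW\d{12}$"));
        // Match returns an object that also exposes what was captured.
        Console.WriteLine(ExtractDigits("TW012345678901"));
    }
}
```

IsMatch is the natural choice when all you need is a yes or no; Match is worth the extra object when you also want the captured groups.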
Replace a questionable part in a string
Take, for instance, the exclaimer. He gets overexcited about nothing, and you just need to dial his excitement down. How?
string SampleText = "This is Outrageous!!!!!!!! Regex can’t solve all my problems!!!! What if it can!!!!!";
You can never predict how many exclamation marks (!) he will use, nor does he use a specific number every time. The Replace method of Regex helps you there.
SampleText = Regex.Replace(SampleText, "(!)+", "!");
Console.WriteLine(SampleText);
P.S.: The plus (+) in the pattern says that the (!) can appear one or more times in a row. The last argument is the replacement: every part of the string that matches the pattern is replaced with it.
This gives me clean output like this:
This is Outrageous! Regex can’t solve all my problems! What if it can!
Pattern matching, only simpler:
Imagine how hard it would be to find every substring that matches a pattern you’re looking for, or even just to count them. Regex offers you an elegant way to do both, using Regex.Matches. Let’s take the statement below:
“Are you working on any special projects at work? I am not reading any books right now. Aren’t you teaching at the university now?”
I need to fish out all the present-continuous verb forms in it, i.e., the words ending with “ing” (strictly, the pattern finds any word ending in “ing”, which is good enough for this statement). Normally this would take a fair amount of code. But with Regex, we can do it faster and more simply using the pattern string “(\\b)(\\w+)(ing)(\\b)”.
This says – in sequence:
\\b | a word boundary (the zero-width position where a word starts or ends, not an actual blank space) |
\\w+ | one or more word characters |
ing | the specific set of characters “ing” |
\\b | a word boundary again |
// TestString holds the statement quoted above
MatchCollection collObj = Regex.Matches(TestString, "(\\b)(\\w+)(ing)(\\b)");
foreach (Match item in collObj)
{
    Console.WriteLine(item.Value);
}
This prints all occurrences of the pattern, exactly as they appear in the text:
working
reading
teaching
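For anyone who wants to run this end to end, here is a self-contained sketch that also counts the matches. The IngWords class and its Find method are names I made up for illustration, the pattern drops the capture groups since they aren’t needed here, and the sample text uses a plain ASCII apostrophe.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IngWords
{
    // Collects every whole word that ends in "ing".
    public static List<string> Find(string text)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\b\w+ing\b"))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        string testString = "Are you working on any special projects at work? " +
            "I am not reading any books right now. " +
            "Aren't you teaching at the university now?";
        List<string> words = Find(testString);
        Console.WriteLine(words.Count);
        Console.WriteLine(string.Join(", ", words));
    }
}
```

The count comes for free from the collected list, which is handy when you care about how many matches there are rather than what they are.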
Summary
Through this post, I’ve summarized most of the key uses of regular expressions, with special emphasis on .NET code. Beyond this point, it depends on your creativity how you tailor a Regex solution to your problem. In many cases you can do away with ugly for loops or unnecessary if-else constructs in your code.