Regular Expressions – Big daddy o’ string manipulation

2 Feb 2012, CPOL, 5 min read

Regular expressions are a remarkably effective tool. A regular expression (regex or regexp for short) is a special text string that describes a search pattern; you can think of it as a wildcard on steroids. Almost every language supports them, and with a little understanding you can get very good at them. What’s more, they can condense dozens of lines of hand-written string logic into one or two. I’ve used regular expressions both in the .NET Framework and in JavaScript and found them immensely helpful. This post is mostly about them, and here’s a short outline of how I’ll approach the topic.

  • A problem – We engineers are problem solvers, so we start with a simple problem and see how regular expressions solve it even more simply
  • Dissecting the solution – A walk through that simpler solution
  • Links to regular expression resources – Unless you can write patterns yourself, this language feature never gets any easier to use
  • The Regex class – What the Regex class in the .NET Framework lets you achieve

Starting with this, followed by that, then this and then a dozen of that – A Problem!

I was given the problem statement below and a laptop to solve it on. Developer instincts are hard to ignore, and I soon started scribbling an elegant algorithm, or so I thought.

Domestic passport numbers either begin with the letters TW followed by 12 digits, or begin with 6 digits followed by 4 other characters and end with the letters OFTW.

Pseudo logic:

  1. Does it start with TW?
    1. Are the remaining 12 characters digits?
      i. Return “Domestic”
  2. Does it end with OFTW?
    1. Does it begin with 6 digits?
      i. Return “Domestic”
  3. If all of the above fail
    1. Return “Foreigner”

So, with all the logic and intricacies figured out, I translated the pseudo code into C#:

C#
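// Needs: using System; using System.Globalization;
public bool IsDomestic(string PassportNumber)
{
    // Both valid forms are exactly 14 characters long; checking this first
    // also keeps Substring from throwing on short input.
    if (PassportNumber == null || PassportNumber.Length != 14)
    {
        return false;
    }
    Int64 Result;
    if (PassportNumber.StartsWith("TW", StringComparison.Ordinal))
    {
        // "TW" followed by 12 digits. NumberStyles.None accepts digits only,
        // so a sign or surrounding whitespace does not slip through.
        string Rest = PassportNumber.Substring(2, 12);
        if (Int64.TryParse(Rest, NumberStyles.None, CultureInfo.InvariantCulture, out Result))
        {
            return true;
        }
    }
    else if (PassportNumber.EndsWith("OFTW", StringComparison.Ordinal))
    {
        // 6 digits, then 4 other characters, then "OFTW".
        string First6 = PassportNumber.Substring(0, 6);
        if (Int64.TryParse(First6, NumberStyles.None, CultureInfo.InvariantCulture, out Result))
        {
            return true;
        }
    }
    return false;
}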

That’s when I was asked: don’t you keep up with all the language features? Especially something called regular expressions?

Well, I don’t claim to be an all-knowing buff, but I did know what regular expressions were and I knew perfectly well how to use them. Surprisingly, they just never occurred to me. Given a problem, my first instinct was to work through it analytically, step by step. The regular expression approach was the opposite, and it was elegant; I kicked myself for not thinking of it.

The Solution – Simpler one

So here’s the code I would have written with Regex:

C#
// Needs: using System.Text.RegularExpressions;
public bool IsDomestic(string PassportNumber)
{
    return Regex.Match(PassportNumber,
        "(^)(TW)(\\d{12})($)|(^)(\\d{6})([A-Za-z0-9]{4})(OFTW)($)").Success;
}

Surprisingly few lines of code, isn’t it?

Let’s dissect the solution

Without going into much depth, let me explain what the pattern (yes, I called it a pattern) means:

(^)(TW)(\\d{12})($)|(^)(\\d{6})([A-Za-z0-9]{4})(OFTW)($)

The doubled backslashes are C# string escaping; the regex engine itself sees \d. You see the pipe (|) symbol in the middle? Let me split the pattern at that point.

Left part: (^)(TW)(\\d{12})($)

  • (^) – The match must begin at the start of the string; without it, the pattern could match in the middle of a longer string
  • (TW) – The first two characters are TW
  • (\\d{12}) – Exactly 12 digits follow
  • ($) – The string must end at this point, with no more characters

Right part: (^)(\\d{6})([A-Za-z0-9]{4})(OFTW)($)

  • (^) – Again, the match must begin at the start of the string
  • (\\d{6}) – The first 6 characters are digits
  • ([A-Za-z0-9]{4}) – 4 alphanumeric (A-Z or a-z or 0-9) characters follow
  • (OFTW) – The next four characters are OFTW
  • ($) – The string must end at this point, with no more characters, as we saw earlier

Did I miss the pipe (|)? It’s simply an OR condition. The string in question, say TW012345678901, can match either the left pattern or the right one.
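
To see what the anchors contribute, here is a minimal sketch that runs a few sample inputs through two variants of the pattern: one with only `$`, and one that also pins the start of the string with `^`. The class name `PassportPatternDemo`, its members, and the sample inputs are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PassportPatternDemo
{
    // Variant with end anchors only (hypothetical name).
    public const string EndAnchoredOnly =
        "(TW)(\\d{12})($)|(\\d{6})([A-Za-z0-9]{4})(OFTW)($)";

    // Variant with both alternatives anchored at the start and the end.
    public const string FullyAnchored =
        "(^)(TW)(\\d{12})($)|(^)(\\d{6})([A-Za-z0-9]{4})(OFTW)($)";

    // True if the pattern matches anywhere in the input.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    // Prints one line per sample showing how each variant judges it.
    public static void Run()
    {
        string[] samples =
        {
            "TW012345678901", "123456ab12OFTW", "XXTW012345678901", "TW01234"
        };
        foreach (string s in samples)
        {
            Console.WriteLine("{0,-18} end-only: {1,-5} fully anchored: {2}",
                s, Matches(EndAnchoredOnly, s), Matches(FullyAnchored, s));
        }
    }
}
```

With only `$`, the input XXTW012345678901 is accepted, because the engine is free to start matching at the third character. With `^` as well, it is rejected.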

Some links before we proceed

This should give you a basic idea of how to formulate a simple regex pattern. If you need more help, visit http://www.regular-expressions.info/reference.html, which offers a good overview of regex syntax.

Or, if you want a quick one-stop way to generate a pattern, visit Mark’s http://txt2re.com/; it’s an amazing site.

The almighty Regex class of .NET

Let’s step back a bit, focus on the Regex class in the .NET Framework, and see how to put it to better use!

Check if a string looks as it should

The solution above does exactly that. The Match and IsMatch methods of the Regex class try to fit the given string to a pattern. IsMatch returns true if the string matches the pattern and false if it looks nothing like it; Match returns a Match object whose Success property tells you the same thing. The simplest overloads ask only for the string and the pattern.
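
As a rough illustration of the difference between the two methods, the sketch below uses `Regex.IsMatch` for a yes/no check and `Regex.Match` to inspect what was found. The class `MatchDemo`, its method names, and the sample strings are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchDemo
{
    // IsMatch answers only "does the pattern occur in the input?"
    public static bool HasNumber(string input)
    {
        return Regex.IsMatch(input, "\\d+");
    }

    // Match returns a Match object: Success says whether anything was found,
    // Value holds the matched text and Index its position in the input.
    public static string DescribeFirstNumber(string input)
    {
        Match m = Regex.Match(input, "\\d+");
        return m.Success
            ? string.Format("found '{0}' at index {1}", m.Value, m.Index)
            : "no match";
    }
}
```

For example, `MatchDemo.DescribeFirstNumber("Order 66 shipped")` returns "found '66' at index 6", while an input without digits returns "no match". Reach for IsMatch when a boolean is all you need, and for Match when you want the matched text itself.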

Replace a questionable part in a string

Take, for instance, the exclaimer. He gets overexcited about nothing and you just need to dial his excitement down. How?

string SampleText = "This is Outrageous!!!!!!!! Regex can’t solve all my problems!!!! What if it can!!!!!";

You can’t predict how many exclamation marks (!) he will use, and he doesn’t use the same number every time. The Replace method of Regex helps here.

C#
SampleText = Regex.Replace(SampleText, "(!)+", "!");
Console.WriteLine(SampleText);

P.S.: The plus (+) in the pattern says that the (!) can appear one or more times in a row. The last argument is the replacement text: every run that matches the pattern is replaced with it.

This gives an elegant output like this:

This is Outrageous! Regex can’t solve all my problems! What if it can!
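
The same idea stretches to other punctuation. The sketch below is a variation, not the code above: it assumes you also want to collapse runs of `?` and `.`, and it uses a capture group with the backreference `\1` so that each run is replaced by a single copy of whichever character was repeated. `PunctuationDemo` and `CollapseRepeats` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationDemo
{
    public static string CollapseRepeats(string text)
    {
        // ([!?.]) captures one punctuation mark; \1+ requires one or more
        // further copies of that same mark. "$1" puts back a single copy.
        return Regex.Replace(text, "([!?.])\\1+", "$1");
    }
}
```

Calling `PunctuationDemo.CollapseRepeats("Really??? Yes!!!! Fine.")` gives "Really? Yes! Fine.". A mixed pair such as "?!" is left alone because the backreference only matches a repeat of the same character. Note that this would also shorten a deliberate ellipsis, so drop the `.` from the character class if that matters.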

Pattern matching, only simpler:

Imagine how hard it would be to find every substring matching a pattern, or to count them, by hand. Regex offers an elegant way to do this with Regex.Matches. Take the passage below:

“Are you working on any special projects at work? I am not reading any books right now. Aren’t you teaching at the university now?”

I need to fish out all the present-continuous verb forms in it, i.e., the words ending with “ing”. (Strictly, the pattern finds any word ending in “ing”, which is good enough for this passage.) Normally this would take a fair amount of code, but with Regex we can do it faster and more simply using the pattern string "(\\b)(\\w+)(ing)(\\b)".

This says – in sequence:

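  • \\b – A word boundary: the position between a word character and a non-word character (it matches no characters itself)
  • \\w+ – One or more word characters
  • ing – The specific set of characters “ing”
  • \\b – Another word boundary, so the word must end right after “ing”

C#
// TestString holds the passage quoted above.
MatchCollection collObj = Regex.Matches(TestString, "(\\b)(\\w+)(ing)(\\b)");
foreach (Match item in collObj)
{
    Console.WriteLine(item.Value);
}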

This prints all occurrences of the pattern, which are:

working
reading
teaching
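
Since the passage above also mentions counting matches, here is a small sketch that collects the matched words into a list and reads the count from it. `IngWords` and `Find` are hypothetical names. `Cast<Match>()` is used because on the .NET Framework `MatchCollection` only exposes the non-generic enumerator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class IngWords
{
    // Returns every word in the text that ends with "ing", in order.
    public static List<string> Find(string text)
    {
        // \b is zero-width, so each Value is the bare word with no spaces.
        return Regex.Matches(text, "\\b\\w+ing\\b")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

For the sample passage, `IngWords.Find(passage).Count` is 3. Passing "Good morning" returns "morning", a reminder that the pattern matches word endings rather than verb forms.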

Summary

In this post, I’ve covered the key uses of Regex, with special emphasis on .NET code. Beyond this point, it’s down to your creativity to tailor a regex solution to your problem. In many cases you can do away with ugly for loops and unnecessary if-else constructs.



License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)