
I don't like Regex...

17 Jan 2013, CPOL, 4 min read
This article introduces a set of 3 simple extension methods that can help you get rid of Regex in many situations.
Download The Source Files

Introduction

In fact I do like Regex: they do the job well. Maybe even too well, as every developer ends up having to use them and there is no way to get rid of them.

Unfortunately, whenever I need a new one I face the same issue: I have forgotten almost everything about their damned syntax... If I were to write one every day I would probably remember it easily, but that's not the case, as I barely need to write a couple of them a year...

Being fed up with reading and learning that documentation again and again, I decided to implement the following String extension methods...

Background

Regular expressions are a powerful and concise means of processing large amounts of text in order to validate, extract, edit, replace or delete parts of a text given a predefined pattern (e.g. an email address).

In order to make proper use of Regex you need:

  • a text to analyse
  • a regular expression engine
  • a regular expression (the pattern to look for in the text to analyse)

The regular expression syntax varies depending on the regular expression engine you use. In the Microsoft world the class that serves as the regular expression engine is System.Text.RegularExpressions.Regex and its syntax is described here: http://msdn.microsoft.com/en-us/library/az24scfc.aspx

If you are looking for an introduction to regular expression syntax, please read this excellent article: http://www.codeproject.com/Articles/9099/The-30-Minute-Regex-Tutorial

The problem with regular expressions

They have the drawbacks of their advantages: the syntax (concise and powerful) is intended to be friendly to regular expression engines, but not really to human beings.

When you are not familiar with the syntax, you can spend a long time writing a valid expression.

You can then spend another long time testing that expression and making it bulletproof. It is one thing to make sure your regular expression matches what you expect, but it is another to make sure it matches ONLY what you expect.
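
As a small illustration of that last point, consider a check for a four-digit code. This sketch is not part of the article's download: the class name AnchoringDemo, its method names and the sample inputs are made up for the illustration, and only Regex.IsMatch is a real .NET call.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class AnchoringDemo
{
    // Unanchored: true as soon as four consecutive digits occur anywhere in the input.
    public static bool ContainsFourDigits(string input)
    {
        return Regex.IsMatch(input, "[0-9]{4}");
    }

    // Anchored at both ends: true only when the whole input is four digits.
    public static bool IsFourDigits(string input)
    {
        return Regex.IsMatch(input, "^[0-9]{4}$");
    }
}
```

Both methods accept "1234", but only the unanchored one accepts "12345" or "abc1234def": it matches what you expect, and a lot more.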

The idea

If you are familiar with SQL you know the LIKE operator. Why not bring that operator to C#?

Why not have a simplified syntax for the most frequent operations you would ask your Regex engine to perform?

A simplified syntax

... means fewer operators. Here is the list that I have, quite arbitrarily, come up with:

  • ? = Any single character
  • % = Zero or more characters
  • * = Zero or more word characters (letters, digits or underscore), so no white space: basically a word
  • # = Any single digit (0-9)

Examples of simple expressions:

  • a Guid can be expressed as: ????????-????-????-????-????????????
  • an email address could be: *?@?*.?*
  • as for a date: ##/##/####

Regular expression aficionados are already jumping out of their chairs: obviously nothing guarantees that the last expression matches a valid date, and they are right (that expression would also match 99/99/9999). But this syntax is in no way meant to replace the regular expression syntax. It is far from offering the same level of capabilities, especially in terms of validation.
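
To make the operators concrete, here is roughly what the three examples look like once written in native .NET syntax, following the same mapping as GetRegex in the code further down (? becomes ., % becomes .*, * becomes \w* and # becomes [0-9]). This is only a hand-translated sketch: SimplifiedSyntaxDemo, HasShape and the constant names are made up for the illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class SimplifiedSyntaxDemo
{
    // ????????-????-????-????-????????????
    public const string GuidShape = ".{8}-.{4}-.{4}-.{4}-.{12}";

    // *?@?*.?*
    public const string EmailShape = @"\w*.@.\w*\..\w*";

    // ##/##/####
    public const string DateShape = "[0-9]{2}/[0-9]{2}/[0-9]{4}";

    // True when the whole input has the shape described by the native pattern.
    public static bool HasShape(string input, string regexPattern)
    {
        return Regex.IsMatch(input, "^" + regexPattern + "$");
    }
}
```

SimplifiedSyntaxDemo.HasShape("99/99/9999", SimplifiedSyntaxDemo.DateShape) returns true, which is exactly the weakness mentioned above: the expression checks the shape of the text, not its validity.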

Frequent operations

What are the frequent operations you need a regular expression engine for?

  1. determining whether the text to analyse matches a given pattern: Like
  2. finding an occurrence of a given pattern in the text to analyse: Search
  3. retrieving string(s) from the text to analyse: Extract

These 3 operations, 'Like', 'Search' and 'Extract', have been implemented as extension methods on string, as an alternative to writing regular expressions by hand.

Let's start by describing their usage; the code will follow...

1. Determining if a string is 'like' a given pattern  

If you know SQL then you know what I am talking about...

The Like extension simply returns true when the input string matches the given pattern.

All the following examples return true, meaning the input strings are like their patterns.

example: a string is a guid

C#
var result0 = "A0E02391-A0DF-4772-B39A-C11F7D63C495".Like("????????-????-????-????-????????????");

example: a string ends with a guid

C#
var result1 = "This is a guid A0E02391-A0DF-4772-B39A-C11F7D63C495".Like("%????????-????-????-????-????????????");

example: a string starts with a guid 

C#
var result2 = "A0E02391-A0DF-4772-B39A-C11F7D63C495 is a guid".Like("????????-????-????-????-????????????%");

example: a string contains a guid  

C#
var result3 = "this string A0E02391-A0DF-4772-B39A-C11F7D63C495 contains a guid".Like("%????????-????-????-????-????????????%");

example: a string ends with a guid (% also matches zero characters)

C#
var result4 = "A0E02391-A0DF-4772-B39A-C11F7D63C495".Like("%????????-????-????-????-????????????");

2. 'Searching' for a particular pattern in a string  

The Search extension method retrieves the first occurrence of the given pattern inside the provided text.

example: Search for a guid inside a text

C#
var result5 = "this string [A0E02391-A0DF-4772-B39A-C11F7D63C495] contains a string matching".Search("[????????-????-????-????-????????????]");
Console.WriteLine(result5); // output: [A0E02391-A0DF-4772-B39A-C11F7D63C495]

3. 'Extracting' values out of a string given a known pattern

This is almost like searching, but instead of bringing back the whole string that matches the pattern, it brings back a list of the strings matched by the wildcard groups of the pattern.

example: retrieving the constituents of a guid inside a text

C#
var result6 = "this string [A0E02391-A0DF-4772-B39A-C11F7D63C495] contains a string matching".Extract("[????????-????-????-????-????????????]");
// result is a list containing each part matched by the wildcards: {"A0E02391", "A0DF", "4772", "B39A", "C11F7D63C495"}

example: retrieving the constituents of an email inside a text

C#
var result7 = "this string contains an email: toto@domain.com".Extract("*?@?*.?*");
// result is a list containing each part matched by the wildcards: {"toto", "domain", "com"}
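
For comparison, this is roughly what the email example looks like when written directly against the Regex class, with one capture group per part. It is a sketch under assumptions rather than the article's code: NativeExtractDemo and ExtractEmailParts are hypothetical names, and the pattern is a hand translation of *?@?*.?* with parentheses added.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class NativeExtractDemo
{
    // Returns the three parts of the first email-like match, or null when there is none.
    public static List<string> ExtractEmailParts(string text)
    {
        var match = Regex.Match(text, @"(\w*.)@(.\w*)\.(.\w*)");
        if (!match.Success) return null;

        var parts = new List<string>();
        // Group 0 is the whole match; the captured parts start at index 1.
        for (var i = 1; i < match.Groups.Count; i++)
        {
            parts.Add(match.Groups[i].Value);
        }
        return parts;
    }
}
```

The native version works just as well, but the pattern is the kind of thing I have to look up again every time I need it.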

Here's the code

The simple trick here is that the 3 public methods rely on GetRegex, which transforms the simplified expression into a valid .NET one.

C#
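using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StringExt
{
    // Returns true when the whole string matches the simplified pattern (as with SQL LIKE).
    public static bool Like(this string item, string searchPattern)
    {
        return GetRegex(searchPattern, true).IsMatch(item);
    }

    // Returns the first occurrence of the pattern in the string, or null when there is none.
    public static string Search(this string item, string searchPattern)
    {
        var match = GetRegex(searchPattern).Match(item);
        return match.Success ? match.Value : null;
    }

    // Returns the parts of the first occurrence that were matched by the wildcards,
    // or null when the pattern is not found.
    public static List<string> Extract(this string item, string searchPattern)
    {
        var result = item.Search(searchPattern);
        if (string.IsNullOrWhiteSpace(result)) return null;

        // The literal parts of the pattern act as separators between the values to extract.
        var separators = searchPattern.Split(new[] { '?', '%', '*', '#' }, StringSplitOptions.RemoveEmptyEntries);
        var temp = result;
        var final = new List<string>();
        foreach (var x in separators)
        {
            // Case-insensitive, like the regex that produced the match.
            var pos = temp.IndexOf(x, StringComparison.OrdinalIgnoreCase);
            if (pos < 0) continue; // separator not found as typed (e.g. a space matched by a tab): skip it
            if (pos > 0) final.Add(temp.Substring(0, pos));
            temp = temp.Substring(pos + x.Length);
        }
        if (temp.Length > 0) final.Add(temp);
        return final;
    }

    // Private method which accepts the simplified pattern and transforms it into a valid .NET regex pattern:
    // it escapes the reserved characters of the standard regex syntax
    // and then translates the simplified operators into their native Regex equivalents.
    // When wholeString is true the pattern is anchored at both ends.
    static Regex GetRegex(string searchPattern, bool wholeString = false)
    {
        var pattern = searchPattern
                .Replace("\\", "\\\\")
                .Replace(".", "\\.")
                .Replace("{", "\\{")
                .Replace("}", "\\}")
                .Replace("[", "\\[")
                .Replace("]", "\\]")
                .Replace("(", "\\(")
                .Replace(")", "\\)")
                .Replace("|", "\\|")
                .Replace("^", "\\^")
                .Replace("+", "\\+")
                .Replace("$", "\\$")
                .Replace(" ", "\\s")
                .Replace("#", "[0-9]")
                .Replace("?", ".")
                .Replace("*", "\\w*")
                .Replace("%", ".*");
        if (wholeString) pattern = "^" + pattern + "$";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }
}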
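
The chain of Replace calls has to list every reserved character by hand. One possible alternative, given here only as a sketch and not as the method used in this article, is to walk the pattern character by character and hand everything that is not a wildcard to Regex.Escape. PatternTranslator and ToRegexPattern are hypothetical names. Note one difference in behaviour: Regex.Escape turns a space into an escaped literal space, whereas the code above turns it into \s (any white space character).

```csharp
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical alternative to the Replace chain, for illustration only.
public static class PatternTranslator
{
    // Translates the simplified syntax into a native .NET regex pattern string.
    public static string ToRegexPattern(string simplified)
    {
        var sb = new StringBuilder();
        foreach (var c in simplified)
        {
            switch (c)
            {
                case '?': sb.Append('.'); break;      // any single character
                case '%': sb.Append(".*"); break;     // zero or more characters
                case '*': sb.Append(@"\w*"); break;   // zero or more word characters
                case '#': sb.Append("[0-9]"); break;  // any single digit
                default:
                    // Everything else is a literal: let the framework escape it if needed.
                    sb.Append(Regex.Escape(c.ToString()));
                    break;
            }
        }
        return sb.ToString();
    }
}
```

With this version a pattern such as (a|b) is treated as seven literal characters, without having to remember which characters need escaping.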

Conclusion

As stated above, the intent is not to replace Regex but to provide a very simple approach that covers about 80% of the cases in which I previously needed Regex. This approach keeps basic tasks very simple and makes the client code easy to write and easy to understand for anyone who is not an expert in Regex syntax.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)