Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / programming / regular-expression

CollectionRegex -- Regular Expressions for Collections

5.00/5 (3 votes)
16 Apr 2012CPOL4 min read 25.7K   186  
Find patterns in a sequence of objects using a familiar language.

Image 1

Prerequirements

A reader of this article is supposed to know basics of regular expressions. The article does not cover explaining regular expressions. The Regular Expression .info website is a fine regex resource, both as a starting point and a reference.

Introduction

This article presents a simple, strong-typed class library which provides methods for finding patterns in sequences of objects. Patterns are described with regular expressions, and a sequence is any instance of a class which implements the IEnumerable<T> generic interface [msdn]. This includes well known .NET framework classes like a List<T>, HashSet<T> or a Dictionary<K,V> and objects returned by LINQ queries.

Using the Code

A functionality of the library can be accessed by invoking a CollectionRegex.Match method. Both instance and static version of this method are availible. To make them usable, the following information is required:

  1. A pattern expressed as a regular expression,
  2. A relation between characters which appear in a pattern and collection items,
  3. (optionally) options to pass to the .NET regex engine.

When all of the above parameters are known, a Match method can be invoked and results acquired. The Match method returns an enumeration of all matches, that is, an IEnumerable<CollectionMatch<T>>. An implementation uses a yield return [msdn] construct, so operations are performed on-demand. The library supports both named and unnamed capturing groups.

Example 1

The following code analyses a sequence of numbers to find places where at least three consecutive numbers exceed a specified range. Each number is classified into one of three categories: "too low" (a), "ok" (b) or "too high" (c). A regular expression which will be used is:

[^b]{3,}

The expression finds a subsequence of three or more characters which are not b. In terms of a collection, it matches every three or more consecutive numbers that are not in a category "ok". Since defined categories are a, b and c, the expression is equivalent to [ac]{3,}. A required .NET code is very straightfoward. Of course, a relation between characters in a pattern and categories needs to be defined. In this example it is done with lambda expressions. They are anymous methods which take every number from a collection and return a boolean value, which is true when a number falls into a particular category.
C#
// create a regex object
var reg = new CollectionRegex<double>(@"[^b]{3,}");
double low = 3,
        high = 7;
// define predicates for "low", "ok" and "high"
// with lambda expressions
reg.AddPredicate(s => s <= low, 'a');
reg.AddPredicate(s => s > low && s < high, 'b');
reg.AddPredicate(s => s >= high, 'c');
Testing code:
// create a test sequence
var collection = new List<double>
{
4, 5, 9, 6,
7, 8, 9, 8,
6, 4, 3, 2,
4, 2, 2, 3, 3, 5, 5, 5,
3, 2, 7, 5
};
// run the regex
var actual = reg.Match(collection).ToList();
// check if results are as expected
Assert.AreEqual(3, actual.Count());
Assert.AreEqual(4, actual[0].Index);
Assert.AreEqual(4, actual[0].Count);
Assert.AreEqual(13, actual[1].Index);
Assert.AreEqual(4, actual[1].Count);
Assert.AreEqual(20, actual[2].Index);
Assert.AreEqual(3, actual[2].Count);

A test collection is internally represented by a string "

bbcb cccc bbaa baaa
                abbb aacb
" (without spaces). The regex matches three subsequences: {7, 8, 9, 8} (cccc), {2, 2, 3, 3} (aaaa) and {3, 2, 7} (aac).

LINQ alternative

For those who think in C#, it can be easier to understand the idea by egzamining an implementation which does not use the CollectionRegex library.
List<IEnumerable<double>> results = new List<IEnumerable<double>>();
for (int i = 0; i < collection.Count; i++)
{
    IEnumerable<double> enu = collection.Skip(i);
    enu = enu.TakeWhile(x => x <= low || x >= high);
    if (enu.Count() >= 3)
    {
        results.Add(enu);
        i += enu.Count() - 1;
    }
}

Example 2

The following code finds requests in a production process which weren't finished successfully. A collection contains objects of type ProductionEvent, which represent a request for producing an item (r), a success report (s) or a failure report (f). If an item couldn't be produced, a failure report is put into a collection. An item can be successfuly produced even if it failed a few times. The regex is supposed to find items which wasn't produced at all.

(?<item>r)f+(?=r|$)

  1. (?<item>r) matches a request for an item and defines a named group item.
  2. f+ matches one or more failure report.
  3. Finally, the last part (?=r|$) does a positive look-ahead for either another request (r) or the end of a collection ($), to ensure that there wasn't any success report after the analysed request.

For example, in a sequence rs rfff rffffs rff, it finds the second and the fourth request (rfff and rff). However, it does not match the third one (rffffs) because it was finished successfully eventually, despite there were four failures on the way. The source code assumes that there is a class ProductionEvent which exposes two properties: Type and ItemName.

C#
var reg = new CollectionRegex<ProductionEvent>(@"(?<item>r)f+(?=r|$)");
reg.AddPredicate(s => s.Type == ProductionEventTypes.Success, 's');
reg.AddPredicate(s => s.Type == ProductionEventTypes.ItemRequest, 'r');
reg.AddPredicate(s => s.Type == ProductionEventTypes.Failure, 'f');
And a testing code:
var collection = new List<ProductionEvent>
{
    new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="chocolade" },
    new ProductionEvent { Type= ProductionEventTypes.Success },
    new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="impossible1" },
    new ProductionEvent { Type= ProductionEventTypes.Failure },
    new ProductionEvent { Type= ProductionEventTypes.Failure },
    new ProductionEvent { Type= ProductionEventTypes.Failure },
    new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="problematic" },
    new ProductionEvent { Type= ProductionEventTypes.Failure },
    new ProductionEvent { Type= ProductionEventTypes.Failure },
    new ProductionEvent { Type= ProductionEventTypes.Failure },
    new ProductionEvent { Type= ProductionEventTypes.Failure },
    new ProductionEvent { Type= ProductionEventTypes.Success },
    new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="impossible2" },
    new ProductionEvent { Type= ProductionEventTypes.Failure },
};
var actual = reg.Match(collection).ToList();
foreach (CollectionMatch<ProductionEvent> e in actual)
{
    // access a named group "item"
    Assert.IsTrue(e.Groups["item"].Single().ItemName.StartsWith("impossible"));
}

LINQ alternative

In the below snippet, a requests list simulates a capture group "item".

List<IEnumerable<ProductionEvent>> results = new List<IEnumerable<ProductionEvent>>();
List<ProductionEvent> requests = new List<ProductionEvent>();
for (int i = 0; i < collection.Count; i++)
{
    IEnumerable<ProductionEvent> begin = collection.Skip(i);
    IEnumerable<ProductionEvent> enu = begin;
    if (enu.Count() >= 2)
    {
        var request = enu.First();
        if (request.Type == ProductionEventTypes.ItemRequest)
        {
            enu = enu.Skip(1);
            var x = enu.First();
            if (x.Type == ProductionEventTypes.Failure)
            {
                enu = enu.SkipWhile(p => p.Type == ProductionEventTypes.Failure);
                if (enu.Count() == 0)
                {
                    results.Add(begin);
                    requests.Add(request);
                    i += begin.Count();
                }
                else if (enu.First().Type == ProductionEventTypes.ItemRequest)
                {
                    // (dirty)
                    var result = begin.TakeWhile(p => !object.ReferenceEquals(p, enu.First()));
                    results.Add(result);
                    requests.Add(request);
                    i += result.Count();
                }
            }
        }
    }
}

Remarks

Predicates must be mutually exclusive. If any item matches more than one predicate, then the Match method would throw an ArgumentException.

Predicate is an instance of a class which inherits from an abstract class CollectionRegexPredicate. If you want to provide a custom mechanism of checking if an objects falls into a category, all you have to do is to implement a method IsMatch(T obj).

Items which wasn't matched by any predicate are represented by a comma (',') in a regular expression. Therefore, a negated group like [^ab] will match these items. A comma was selected after a long meditation and no other character could be accurate here. (a first printable, not-problematic and easily countable by a human character in an ASCII table).

Along with a regular class library, I have written an extension method for IEnumerable class, which allows using constructs like this:

C#
var predicates = new CollectionRegexDelegatePredicate<double>[] { 
            new CollectionRegexDelegatePredicate<double>(s => s <= low, 'a'),
            new CollectionRegexDelegatePredicate<double>(s => s > low && s < high, 'b'),
            new CollectionRegexDelegatePredicate<double>(s => s >= high, 'c'),
        };
var actual = collection.MatchRegex(predicates, @"[^b]{3,}")s;

Please use a message board below the article to report bugs or post feature requests.

History

  • 2012-04-14 -- article and API update
  • 2012-04-11 -- the original version posted

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)