Prerequirements
A reader of this article is supposed to know basics of regular expressions. The
article does not cover explaining regular expressions. The
Regular Expression .info website is a fine regex resource, both as a starting
point and a reference.
Introduction
This article presents a simple, strong-typed class library which provides methods
for finding patterns in sequences of objects. Patterns are described with regular
expressions, and a sequence is any instance of a class which implements the IEnumerable<T>
generic interface [msdn]. This includes well known .NET framework classes like a
List<T>
, HashSet<T>
or a Dictionary<K,V>
and objects returned by
LINQ queries.
Using the Code
A functionality of the library can be accessed by invoking a CollectionRegex.Match
method. Both instance and static version of this method are availible. To make them
usable, the following information is required:
- A pattern expressed as a regular expression,
- A relation between characters which appear in a pattern and collection items,
- (optionally)
options to pass to the
.NET regex engine.
When all of the above parameters are known, a Match
method can be invoked
and results acquired. The Match
method returns an enumeration of all
matches, that is, an IEnumerable<CollectionMatch<T>>
. An
implementation uses a yield return
[msdn]
construct, so operations are performed on-demand. The library supports both named
and unnamed capturing groups.
Example 1
The following code analyses a sequence of numbers to find places where at least
three consecutive numbers exceed a specified range. Each number is classified into
one of three categories: "too low" (a
), "ok" (b
)
or "too high" (c
). A regular expression which will be used
is:
[^b]{3,}
The expression finds a subsequence of three or more characters which are not
b
.
In terms of a collection, it matches every three or more consecutive numbers that
are
not in a category "ok". Since defined categories
are a, b and c, the expression is equivalent to
[ac]{3,}
. A required
.NET code is very straightfoward. Of course, a relation between characters in a
pattern and categories needs to be defined. In this example it is done with
lambda expressions. They are anymous methods which take every number from
a collection and return a boolean value, which is true when a number falls into a particular
category.
var reg = new CollectionRegex<double>(@"[^b]{3,}");
double low = 3,
high = 7;
reg.AddPredicate(s => s <= low, 'a');
reg.AddPredicate(s => s > low && s < high, 'b');
reg.AddPredicate(s => s >= high, 'c');
Testing code:
var collection = new List<double>
{
4, 5, 9, 6,
7, 8, 9, 8,
6, 4, 3, 2,
4, 2, 2, 3, 3, 5, 5, 5,
3, 2, 7, 5
};
var actual = reg.Match(collection).ToList();
Assert.AreEqual(3, actual.Count());
Assert.AreEqual(4, actual[0].Index);
Assert.AreEqual(4, actual[0].Count);
Assert.AreEqual(13, actual[1].Index);
Assert.AreEqual(4, actual[1].Count);
Assert.AreEqual(20, actual[2].Index);
Assert.AreEqual(3, actual[2].Count);
A test collection is internally represented by a string "
bbcb cccc bbaa baaa
abbb aacb
" (without spaces). The regex matches three subsequences:
{7, 8, 9, 8} (
cccc
), {2, 2, 3, 3} (
aaaa
) and {3, 2, 7}
(
aac
).
LINQ alternative
For those who think in C#, it can be easier to understand the idea by egzamining
an implementation which does not use the CollectionRegex library.
List<IEnumerable<double>> results = new List<IEnumerable<double>>();
for (int i = 0; i < collection.Count; i++)
{
IEnumerable<double> enu = collection.Skip(i);
enu = enu.TakeWhile(x => x <= low || x >= high);
if (enu.Count() >= 3)
{
results.Add(enu);
i += enu.Count() - 1;
}
}
Example 2
The following code finds requests in a production process which weren't finished
successfully. A collection contains objects of type ProductionEvent
,
which represent a request for producing an item (r
), a success report
(s
) or a failure report (f
). If an item couldn't be produced,
a failure report is put into a collection. An item can be successfuly produced even
if it failed a few times. The regex is supposed to find items which wasn't produced
at all.
(?<item>r)f+(?=r|$)
(?<item>r)
matches a request for an item and defines a named
group item
. f+
matches one or more failure report. - Finally, the last part
(?=r|$)
does a
positive look-ahead for either another request (r
) or the end of
a collection ($
), to ensure that there wasn't any success report after
the analysed request.
For example, in a sequence rs rfff rffffs rff
, it finds the second
and the fourth request (rfff
and rff
). However, it does
not match the third one (rffffs
) because it was finished successfully
eventually, despite there were four failures on the way. The source code assumes
that there is a class ProductionEvent
which exposes two properties:
Type
and ItemName
.
var reg = new CollectionRegex<ProductionEvent>(@"(?<item>r)f+(?=r|$)");
reg.AddPredicate(s => s.Type == ProductionEventTypes.Success, 's');
reg.AddPredicate(s => s.Type == ProductionEventTypes.ItemRequest, 'r');
reg.AddPredicate(s => s.Type == ProductionEventTypes.Failure, 'f');
And a testing code:
var collection = new List<ProductionEvent>
{
new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="chocolade" },
new ProductionEvent { Type= ProductionEventTypes.Success },
new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="impossible1" },
new ProductionEvent { Type= ProductionEventTypes.Failure },
new ProductionEvent { Type= ProductionEventTypes.Failure },
new ProductionEvent { Type= ProductionEventTypes.Failure },
new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="problematic" },
new ProductionEvent { Type= ProductionEventTypes.Failure },
new ProductionEvent { Type= ProductionEventTypes.Failure },
new ProductionEvent { Type= ProductionEventTypes.Failure },
new ProductionEvent { Type= ProductionEventTypes.Failure },
new ProductionEvent { Type= ProductionEventTypes.Success },
new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="impossible2" },
new ProductionEvent { Type= ProductionEventTypes.Failure },
};
var actual = reg.Match(collection).ToList();
foreach (CollectionMatch<ProductionEvent> e in actual)
{
Assert.IsTrue(e.Groups["item"].Single().ItemName.StartsWith("impossible"));
}
LINQ alternative
In the below snippet, a requests
list simulates a capture group "item".
List<IEnumerable<ProductionEvent>> results = new List<IEnumerable<ProductionEvent>>();
List<ProductionEvent> requests = new List<ProductionEvent>();
for (int i = 0; i < collection.Count; i++)
{
IEnumerable<ProductionEvent> begin = collection.Skip(i);
IEnumerable<ProductionEvent> enu = begin;
if (enu.Count() >= 2)
{
var request = enu.First();
if (request.Type == ProductionEventTypes.ItemRequest)
{
enu = enu.Skip(1);
var x = enu.First();
if (x.Type == ProductionEventTypes.Failure)
{
enu = enu.SkipWhile(p => p.Type == ProductionEventTypes.Failure);
if (enu.Count() == 0)
{
results.Add(begin);
requests.Add(request);
i += begin.Count();
}
else if (enu.First().Type == ProductionEventTypes.ItemRequest)
{
var result = begin.TakeWhile(p => !object.ReferenceEquals(p, enu.First()));
results.Add(result);
requests.Add(request);
i += result.Count();
}
}
}
}
}
Remarks
Predicates must be mutually exclusive. If any item matches more than one predicate,
then the Match
method would throw an ArgumentException.
Predicate is an instance of a class which inherits from an abstract class CollectionRegexPredicate
.
If you want to provide a custom mechanism of checking if an objects falls into a
category, all you have to do is to implement a method IsMatch(T obj)
.
Items which wasn't matched by any predicate are represented by a comma (','
)
in a regular expression. Therefore, a negated group like [^ab]
will
match these items. A comma was selected after a long meditation and no other character
could be accurate here. (a first printable, not-problematic and easily countable
by a human character in an ASCII table).
Along with a regular class library, I have written an extension method for IEnumerable
class, which allows using constructs like this:
var predicates = new CollectionRegexDelegatePredicate<double>[] {
new CollectionRegexDelegatePredicate<double>(s => s <= low, 'a'),
new CollectionRegexDelegatePredicate<double>(s => s > low && s < high, 'b'),
new CollectionRegexDelegatePredicate<double>(s => s >= high, 'c'),
};
var actual = collection.MatchRegex(predicates, @"[^b]{3,}")s;
Please use a message board below the article to report bugs or post feature requests.
History
- 2012-04-14 -- article and API update
- 2012-04-11 -- the original version posted