Short But Very Useful regex – lookbehind, lazy, group, and backreference

19 Aug 2013
How do you match text that is preceded by some other text? How do you reference matched text to find a closing tag? Read this post if you want to know the answers to these and a few other questions.

Recently, I wanted to extract calls to an external system from log files and do some LINQ to XML processing on the obtained data. Here’s a sample log line (simplified; the real log was way more complicated, but that doesn’t matter for this post):

Call:<getName seqNo="56789"><id>123</id></getName> Result:<getName seqNo="56789">John Smith</getName>

I was interested in the XML data of the call:

XML
<getName seqNo="56789">
  <id>123</id>
</getName>

Quick tip: A super-easy way to get such nicely formatted XML in .NET 3.5 or later is to invoke the ToString method on an XElement object:

C#
var xml = System.Xml.Linq.XElement.Parse(someUglyXmlString);     
Console.WriteLine(xml.ToString());

When it comes to the log, some things were certain:

  • the call’s XML will be logged after the “Call:” text at the beginning of the line
  • the call’s root element name will contain only alphanumeric characters or underscores
  • there will be no line breaks in the call’s data
  • the call’s root element name may also appear in the “Result” section

Getting to the proper information was quite easy, thanks to the Regex class:

C#
Regex regex = new Regex(@"(?<=^Call:)<(\w+).*?</\1>");
string call = regex.Match(logLine).Value;

This short regular expression has a couple of interesting parts. It may not be perfect, but it proved really helpful in log analysis. If this regex is not entirely clear to you, read on: you will need something similar sooner or later.

Here’s the same regex with comments (RegexOptions.IgnorePatternWhitespace is required to process an expression commented this way):

C#
string pattern = @"(?<=^Call:) # Positive lookbehind for call marker
                   <(\w+)      # Capturing group for opening tag name
                   .*?         # Lazy wildcard (everything in between)
                   </\1>       # Backreference to opening tag name";   
Regex regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
string call = regex.Match(logLine).Value;
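
To see the whole flow from log line to LINQ to XML in one place, here is a minimal sketch. The class and method names (CallExtractor, ExtractCall) are made up for illustration, and the log line is the simplified sample from this post rather than a real log entry.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class CallExtractor
{
    // Same pattern as above: lookbehind, capturing group, lazy wildcard, backreference.
    private static readonly Regex CallRegex = new Regex(@"(?<=^Call:)<(\w+).*?</\1>");

    // Hypothetical helper: returns the call's XML, or null when the line has no "Call:" entry.
    public static XElement ExtractCall(string logLine)
    {
        Match match = CallRegex.Match(logLine);
        return match.Success ? XElement.Parse(match.Value) : null;
    }

    public static void Main()
    {
        // Simplified sample log line from this post.
        string logLine = "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
                         "Result:<getName seqNo=\"56789\">John Smith</getName>";

        XElement call = ExtractCall(logLine);
        Console.WriteLine(call);                            // indented XML via ToString
        Console.WriteLine((string)call.Attribute("seqNo")); // attribute value
        Console.WriteLine((string)call.Element("id"));      // child element value
    }
}
```

Checking match.Success before parsing matters here: for a line without a call, Match returns an empty match, and XElement.Parse would throw on an empty string.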

Positive lookbehind

(?<=^Call:) is a lookaround, or more precisely a positive lookbehind. It’s a zero-width assertion that lets us check whether some text is preceded by another text. Here “Call:” at the start of the input (that is what the ^ anchor is for) is the preceding text we are looking for. (?<=something) denotes a positive lookbehind. There is also a negative lookbehind, expressed by (?<!something), which lets us match text that doesn’t have a particular string before it. A lookaround checks a fragment of the text but doesn't become part of the match value. So the result of this:

C#
Regex.Match("X123", @"(?<=X)\d*").Value

will be "123" rather than "X123".

The .NET regex engine has lookaheads too. Check this awesome page if you want to learn more about lookarounds.
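
For comparison, here is a small sketch of the two lookarounds that are mentioned but not shown above: a negative lookbehind and a positive lookahead. The inputs and the names (LookaroundDemo, NotAfterX, BeforePx) are invented for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Negative lookbehind: digits NOT preceded by "X". The \d inside the
    // lookbehind stops the match from starting in the middle of "123".
    public static string NotAfterX(string input)
    {
        return Regex.Match(input, @"(?<![X\d])\d+").Value;
    }

    // Positive lookahead: digits followed by "px"; "px" itself stays out of the match.
    public static string BeforePx(string input)
    {
        return Regex.Match(input, @"\d+(?=px)").Value;
    }

    public static void Main()
    {
        Console.WriteLine(NotAfterX("X123 Y456")); // 456
        Console.WriteLine(BeforePx("12px 34em"));  // 12
    }
}
```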

Note: In some cases (like in our log examination example), instead of using a positive lookaround we may use a non-capturing group...

Capturing Group

<(\w+) will match a less-than sign followed by one or more characters from the \w class (letters, digits or underscores). The \w+ part is surrounded with parentheses to create a group containing the XML root name (getName for the sample log line). We later use this group to find the closing tag with a backreference. (\w+) is a capturing group, which means that the text it matches is added to the Groups collection of the Match object. If you want to put part of the expression into a group but you don’t want to push the result into the Groups collection, you may use a non-capturing group by adding a question mark and colon after the opening parenthesis, like this: (?:something)
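
The difference between the two kinds of groups is easy to check on the Groups collection. This is a minimal sketch; GroupDemo and its methods are hypothetical names, and the input is just the opening tag from the sample log line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Text captured by group 1, i.e. the element name after "<".
    public static string TagName(string input)
    {
        return Regex.Match(input, @"<(\w+)").Groups[1].Value;
    }

    // Number of entries in Groups; index 0 is always the whole match.
    public static int GroupCount(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    public static void Main()
    {
        string input = "<getName seqNo=\"56789\">";

        Console.WriteLine(TagName(input));                  // getName
        Console.WriteLine(GroupCount(input, @"<(\w+)"));    // 2: whole match + group 1
        Console.WriteLine(GroupCount(input, @"<(?:\w+)"));  // 1: whole match only
    }
}
```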

Lazy wildcard

.*? matches any characters except newline (because we are not using RegexOptions.Singleline) in lazy (or non-greedy) mode, thanks to the question mark after the asterisk. By default the * quantifier is greedy, which means that the regex engine will try to match as much text as possible. In our case, the greedy default would match too much text:

XML
<getName seqNo="56789"><id>123</id></getName> Result:<getName seqNo="56789">John Smith</getName>
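
A quick way to convince yourself is to run both variants against the same line. The sketch below assumes the simplified sample log line from this post; GreedyVsLazy and its method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Lazy: stops at the first closing tag that matches the captured name.
    public static string Lazy(string logLine)
    {
        return Regex.Match(logLine, @"(?<=^Call:)<(\w+).*?</\1>").Value;
    }

    // Greedy: runs to the last matching closing tag on the line.
    public static string Greedy(string logLine)
    {
        return Regex.Match(logLine, @"(?<=^Call:)<(\w+).*</\1>").Value;
    }

    public static void Main()
    {
        string logLine = "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
                         "Result:<getName seqNo=\"56789\">John Smith</getName>";

        Console.WriteLine(Lazy(logLine));   // only the call's XML
        Console.WriteLine(Greedy(logLine)); // call's XML plus the Result section
    }
}
```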

Backreference

</\1> matches the XML closing tag whose element name is provided by the \1 backreference. Remember the (\w+) group? This group has number 1, and by using the \1 syntax we reference the text matched by this group. So for our sample log, </\1> gives us </getName>. If a regex is complex, it may be a good idea to ditch numbered references and use named references instead. You can name a group with the (?<name>...) or (?'name'...) syntax and reference it with \k<name> or \k'name'. So your expression could look like this:

C#
@"(?<=^Call:)<(?<tag>\w+).*?</\k<tag>>"

or like this:

C#
@"(?<=^Call:)<(?'tag'\w+).*?</\k'tag'>"

The latter version is better for our purpose: using < > signs as group delimiters while matching XML is confusing. The regex engine will do just fine with the < > version, but keep in mind that source code is written for humans…

Regular expressions look intimidating, but do yourself a favor and spend a few hours practicing them. They are extremely useful (and not only for quick log analysis)!
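
A named group also reads better on the consuming side, because the captured text can be fetched by name instead of by index. Here is a minimal sketch using the apostrophe syntax; NamedGroupDemo and RootName are hypothetical names, and the second input is an invented line with a mismatched closing tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Apostrophe-delimited group name, so the only < > in the pattern belong to the XML.
    private static readonly Regex CallRegex =
        new Regex(@"(?<=^Call:)<(?'tag'\w+).*?</\k'tag'>");

    // Returns the root element name of the call, or null when nothing matches.
    public static string RootName(string logLine)
    {
        Match match = CallRegex.Match(logLine);
        return match.Success ? match.Groups["tag"].Value : null;
    }

    public static void Main()
    {
        string logLine = "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
                         "Result:<getName seqNo=\"56789\">John Smith</getName>";

        Console.WriteLine(RootName(logLine));                        // getName
        Console.WriteLine(RootName("Call:<a><b>1</b></c>") == null); // True: no closing </a>
    }
}
```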

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)