Recently, I wanted to extract calls to an external system from log files and do some LINQ to XML processing on the obtained data. Here’s a sample log line (simplified; the real log was way more complicated, but that doesn’t matter for this post):
Call:<getName seqNo="56789"><id>123</id></getName> Result:<getName seqNo="56789">John Smith</getName>
I was interested in the XML data of the call:
<getName seqNo="56789">
  <id>123</id>
</getName>
Quick tip: A super-easy way to get such nicely formatted XML in .NET 3.5 or later is to invoke the ToString method on an XElement object:
var xml = System.Xml.Linq.XElement.Parse(someUglyXmlString);
Console.WriteLine(xml.ToString());
When it comes to the log, some things were certain:
- the call’s XML will be logged after the “Call:” text at the beginning of the line
- the call’s root element name will contain only alphanumeric characters or underscores
- there will be no line breaks in the call’s data
- the call’s root element name may also appear in the “Result” section
Getting to the proper information was quite easy, thanks to the Regex class:
Regex regex = new Regex(@"(?<=^Call:)<(\w+).*?</\1>");
string call = regex.Match(logLine).Value;
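To tie this back to the LINQ to XML goal, here is a minimal sketch that runs the regex against the sample log line and parses the result with XElement. The class and method names (CallExtractor, ExtractCall) are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class CallExtractor
{
    // Same pattern as above: lookbehind for "Call:" at the start of the input,
    // then the root element up to its first matching closing tag.
    private static readonly Regex CallRegex = new Regex(@"(?<=^Call:)<(\w+).*?</\1>");

    // Hypothetical helper: returns the call's XML, or null when nothing matches.
    public static XElement ExtractCall(string logLine)
    {
        Match match = CallRegex.Match(logLine);
        return match.Success ? XElement.Parse(match.Value) : null;
    }

    public static void Main()
    {
        string logLine = "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
                         "Result:<getName seqNo=\"56789\">John Smith</getName>";

        XElement call = ExtractCall(logLine);
        Console.WriteLine(call.Name.LocalName);             // getName
        Console.WriteLine((string)call.Attribute("seqNo")); // 56789
        Console.WriteLine((string)call.Element("id"));      // 123
    }
}
```

Checking match.Success before parsing matters here, because a failed match returns an empty Value and XElement.Parse would throw on it.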
This short regular expression has a couple of interesting parts. It may not be perfect but proved really helpful in log analysis. If this regex is not entirely clear to you - read on, you will need to use something similar sooner or later.
Here’s the same regex with comments (RegexOptions.IgnorePatternWhitespace is required to process an expression commented this way):
string pattern = @"(?<=^Call:) # Positive lookbehind for call marker
<(\w+) # Capturing group for opening tag name
.*? # Lazy wildcard (everything in between)
</\1> # Backreference to opening tag name";
Regex regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
string call = regex.Match(logLine).Value;
Positive lookbehind
(?<=^Call:) is a lookaround or, more precisely, a positive lookbehind. It’s a zero-width assertion that lets us check whether some text is preceded by other text. Here “Call:” at the start of the input (that is what the ^ anchor adds) is the preceding text we are looking for. (?<=something) denotes a positive lookbehind. There is also a negative lookbehind, expressed by (?<!something). With a negative lookbehind, we can match text that doesn’t have a particular string before it. A lookaround checks a fragment of the text but doesn't become part of the match value. So the result of this:
Regex.Match("X123", @"(?<=X)\d*").Value
Will be "123" rather than "X123".
The .NET regex engine has lookaheads too. Check this awesome page if you want to learn more about lookarounds.
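The difference between the positive and negative lookbehind can be seen in a small sketch. The class and method names are hypothetical, and the negative pattern also excludes a preceding digit so that the engine cannot simply start the match one digit later.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Positive lookbehind: digits directly preceded by "X".
    // The "X" is checked but is not part of the match value.
    public static string DigitsAfterX(string input)
    {
        return Regex.Match(input, @"(?<=X)\d+").Value;
    }

    // Negative lookbehind: a run of digits NOT preceded by "X".
    // \d is in the negated class too; with plain (?<!X) the engine
    // would happily match "23" inside "X123".
    public static string DigitsNotAfterX(string input)
    {
        return Regex.Match(input, @"(?<![X\d])\d+").Value;
    }

    public static void Main()
    {
        Console.WriteLine(DigitsAfterX("X123"));         // 123
        Console.WriteLine(DigitsNotAfterX("X123 Y456")); // 456
    }
}
```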
Note: In some cases (like in our log examination example), instead of using positive lookaround we may use non-capturing group...
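One possible reading of this note, sketched under my own assumptions (the pattern and names below are not from the original code): consume “Call:” with a non-capturing group and wrap the XML in a capturing group of its own. The marker then becomes part of the match value, so the XML has to be read from Groups rather than from Value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonCapturingAlternativeDemo
{
    // "Call:" is consumed by a non-capturing group, so it is part of the match
    // value but gets no entry in Groups. Group 1 wraps the whole XML, which makes
    // the tag name group 2, so the backreference changes from \1 to \2.
    private static readonly Regex CallRegex = new Regex(@"(?:^Call:)(<(\w+).*?</\2>)");

    // Hypothetical helper: returns the call's XML, or an empty string when nothing matches.
    public static string ExtractCall(string logLine)
    {
        Match match = CallRegex.Match(logLine);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string logLine = "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
                         "Result:<getName seqNo=\"56789\">John Smith</getName>";

        // Prints: <getName seqNo="56789"><id>123</id></getName>
        Console.WriteLine(ExtractCall(logLine));
    }
}
```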
Capturing Group
<(\w+) will match a less-than sign followed by one or more characters from the \w class (letters, digits or underscores). The \w+ part is surrounded with parentheses to create a group containing the XML root name (getName for the sample log line). We later use this group to find the closing tag with a backreference. (\w+) is a capturing group, which means that the text matched by this group is added to the Groups collection of the Match object. If you want to put part of the expression into a group but you don’t want to push results into the Groups collection, you may use a non-capturing group by adding a question mark and colon after the opening parenthesis, like this: (?:something)
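A quick way to see the difference is to dump the Groups collection for both variants. This is only a sketch; GroupsDemo and GroupValues are illustrative names. Keep in mind that Groups[0] always holds the whole match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupsDemo
{
    // Hypothetical helper: returns the values of all groups of the first match.
    // Index 0 is the whole match; capturing groups start at index 1.
    public static string[] GroupValues(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);
        var values = new string[match.Groups.Count];
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = match.Groups[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        string input = "<getName seqNo=\"56789\">";

        // Capturing group: prints "<getName | getName"
        Console.WriteLine(string.Join(" | ", GroupValues(input, @"<(\w+)")));

        // Non-capturing group: prints "<getName"
        Console.WriteLine(string.Join(" | ", GroupValues(input, @"<(?:\w+)")));
    }
}
```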
Lazy wildcard
.*? matches all characters except newline (because we are not using RegexOptions.Singleline) in lazy (or non-greedy) mode, thanks to the question mark after the asterisk. By default the * quantifier is greedy, which means that the regex engine will try to match as much text as possible. In our case, the default greedy mode would match too much text:
<getName seqNo="56789"><id>123</id></getName> Result:<getName seqNo="56789">John Smith</getName>
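The two modes can be compared side by side on the sample log line. In this sketch the only difference between the patterns is the question mark after the asterisk; the class and member names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    public const string LazyPattern = @"(?<=^Call:)<(\w+).*?</\1>";
    public const string GreedyPattern = @"(?<=^Call:)<(\w+).*</\1>";

    // Hypothetical helper: value of the first match, or an empty string.
    public static string Extract(string logLine, string pattern)
    {
        return Regex.Match(logLine, pattern).Value;
    }

    public static void Main()
    {
        string logLine = "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
                         "Result:<getName seqNo=\"56789\">John Smith</getName>";

        // Lazy: stops at the first </getName>, i.e. at the end of the call's XML.
        Console.WriteLine(Extract(logLine, LazyPattern));

        // Greedy: runs to the last </getName>, swallowing the Result section too.
        Console.WriteLine(Extract(logLine, GreedyPattern));
    }
}
```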
Backreference
</\1> matches the XML closing tag, where the element's name is provided by the \1 backreference. Remember the (\w+) group? This group has number 1, and by using the \1 syntax we are referencing the text matched by this group. So for our sample log, </\1> gives us </getName>. If the regex is complex, it may be a good idea to ditch numbered references and use named references instead. You can name a group with the (?<name>...) or (?'name'...) syntax and reference it by using \k<name> or \k'name'. So your expression could look like this:
@"(?<=^Call:)<(?<tag>\w+).*?</\k<tag>>"
or like this:
@"(?<=^Call:)<(?'tag'\w+).*?</\k'tag'>"
The latter version is better for our purpose. Using < > signs while matching XML is confusing. In this case, the regex engine will do just fine with the < > version, but keep in mind that source code is written for humans…
Regular expressions look intimidating, but do yourself a favor and spend a few hours practicing them. They are extremely useful (not only for quick log analysis)!