Overview
Some data formats, such as MHTML, contain parts that are encoded as a quoted-printable data stream. The format is quite simple:
- All printable ASCII characters may be represented by themselves, except the equal sign
- Space and tab may remain as plain text unless they appear at the end of a line
- All other bytes are represented by an equal sign followed by two hex digits representing the byte value
- No line may be longer than 76 characters: longer lines are split by a trailing equal sign (a soft line break), which is removed again when decoding
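To make the rules above concrete, here is a minimal encoder sketch. It is not part of the original trick (which only decodes), and the names QuotedPrintableSketch and EncodeLine are made up for illustration. It assumes the input is a single line without CR/LF and that the text is converted to bytes as UTF-8.

```csharp
using System;
using System.Text;

static class QuotedPrintableSketch
{
    // Hypothetical helper: encodes one line of text (no CR/LF inside)
    // following the four rules listed above.
    // Assumption: the text is turned into bytes as UTF-8 first.
    public static string EncodeLine(string line)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(line);
        var result = new StringBuilder();
        int lineLength = 0;

        for (int i = 0; i < bytes.Length; i++)
        {
            byte b = bytes[i];
            bool isLast = i == bytes.Length - 1;
            bool printable = b >= 33 && b <= 126 && b != (byte)'=';
            bool blank = b == (byte)' ' || b == (byte)'\t';

            // Space and tab stay literal unless they would end the text.
            string token = printable || (blank && !isLast)
                ? ((char)b).ToString()
                : "=" + b.ToString("X2");

            // Leave room for the trailing "=" so that no line exceeds 76 characters.
            if (lineLength + token.Length > 75)
            {
                result.Append("=\r\n");
                lineLength = 0;
            }

            result.Append(token);
            lineLength += token.Length;
        }

        return result.ToString();
    }
}
```

For instance, QuotedPrintableSketch.EncodeLine("a=b") yields a=3Db, and a line of 100 letters is split into 75 letters plus a trailing equal sign, followed by the remaining 25 letters.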
Example
The following quoted-printable encoded text...
This is a long text with some line break and some encoding of the equal sig=
n (=3D). Any line longer than 76 characters are broken up into lines of 76 =
characters with a trailing equal sign.
...results in the following after decoding...
This is a long text with some line break and some encoding of the equal sign (=).
Any line longer than 76 characters are broken up into lines of 76 characters with a trailing equal sign.
The Trick
I came up with the following Regex since I could not find a suitable class in the .NET framework to decode quoted-printable data.
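string raw = ...;
// "=XX" (two hex digits) becomes the character with that code;
// a soft line break ("=" followed by CRLF or a bare LF) is removed.
// Note: each escaped byte maps to one char (0-255), so multi-byte
// sequences such as UTF-8 are not reassembled.
string txt = Regex.Replace(raw, @"=([0-9a-fA-F]{2})|=\r?\n",
    m => m.Groups[1].Success
        ? Convert.ToChar(Convert.ToInt32(m.Groups[1].Value, 16)).ToString()
        : "");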
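The regex turns every escaped byte into exactly one character, which is fine for ASCII and Latin-1 content. If the data is UTF-8, a character such as "é" arrives as two escaped bytes (=C3=A9) and would come out as two characters. The following sketch is one possible way around that, not the method used in this tip: it collects the bytes first and decodes them afterwards. The class name, the method name and the UTF-8 default are assumptions made for this example; the real charset would have to come from the part's headers.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class QuotedPrintableDecoder
{
    // Hypothetical helper: decodes quoted-printable text to bytes first,
    // then converts the bytes to a string using the given charset.
    // Assumption: UTF-8 is only a default picked for this sketch.
    public static string Decode(string raw, Encoding charset = null)
    {
        charset = charset ?? Encoding.UTF8;
        var bytes = new List<byte>(raw.Length);

        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            if (c == '=' && i + 1 < raw.Length && raw[i + 1] == '\n')
            {
                i += 1; // soft line break "=\n": skip it
            }
            else if (c == '=' && i + 2 < raw.Length
                     && raw[i + 1] == '\r' && raw[i + 2] == '\n')
            {
                i += 2; // soft line break "=\r\n": skip it
            }
            else if (c == '=' && i + 2 < raw.Length
                     && Uri.IsHexDigit(raw[i + 1]) && Uri.IsHexDigit(raw[i + 2]))
            {
                // "=XX": two hex digits give one byte value
                bytes.Add(Convert.ToByte(raw.Substring(i + 1, 2), 16));
                i += 2;
            }
            else
            {
                // Quoted-printable text is expected to be plain ASCII here.
                bytes.Add((byte)c);
            }
        }

        return charset.GetString(bytes.ToArray());
    }
}
```

Usage would look like string txt = QuotedPrintableDecoder.Decode(raw); or, for Latin-1 data, QuotedPrintableDecoder.Decode(raw, Encoding.GetEncoding("iso-8859-1")). The trade-off is more code than the one-line regex in exchange for correct handling of multi-byte characters.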
Where to go from here
Once you have the decoded text, you can, for example, strip all HTML tags:
string textonly = HttpUtility.HtmlDecode(Regex.Replace(txt, @"<[\S\s]*?>", ""));
Console.WriteLine("{0}", textonly);
Input:
<a href=""#print_link"">Expression&lt;Action&lt;T&gt;&gt; expr = s =&gt; Console.WriteLine("{0}", s);
Output:
Expression<Action<T>> expr = s => Console.WriteLine("{0}", s);
Finally, the plain text can be searched for some pattern, e.g.:
var q = from m in Regex.Matches(textonly,
@"Expression\s*<\s*Action\s*<\s*\w+\s*>\s*>\s*(\w+)\s*=")
.Cast<Match>()
select m.Groups[1].Value;
q.Aggregate(0, (n, v) => { Console.WriteLine("{0}: Expression<Action<T>> {1}", ++n, v); return n; });
Possible output:
1: Expression<Action<T>> calculate
2: Expression<Action<T>> print
3: Expression<Action<T>> store
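The Aggregate call above is used only for its side effect of printing a running counter. As an alternative sketch (the names ExpressionFinder and FindNames are made up for this example), the matching can be separated from the printing so that the result can be reused:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class ExpressionFinder
{
    // Hypothetical helper: returns the names of all variables that are
    // declared as Expression<Action<...>> in the given text.
    public static List<string> FindNames(string text)
    {
        return Regex.Matches(text,
                @"Expression\s*<\s*Action\s*<\s*\w+\s*>\s*>\s*(\w+)\s*=")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }
}
```

The list can then be printed with a plain loop, for example for (int i = 0; i < names.Count; i++) Console.WriteLine("{0}: Expression<Action<T>> {1}", i + 1, names[i]); where names is the result of ExpressionFinder.FindNames(textonly).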
Summary
Performance may not be optimal, but it keeps me going with my other tasks... ;-)