Introduction
This article and its code sample aim to decouple MIME parsing from any particular mail protocol, i.e. to implement RFC 2045 without tying it too tightly to either the POP3 or the IMAP protocol.
Background
My original motivation for writing this code was that I needed to automate mail downloads. I started to look around for free or open source alternatives. However, the projects and solutions I found either lacked full support for attachments or were not modular enough. I therefore decided to write my own POP3 client implementation. After struggling for a while with a quick hack, I soon realised that I had to read the relevant RFCs. I then realised that POP3 (RFC 1939) in turn relies on MIME (RFC 2045, 2046 etc.) for attachments. This gave me the idea of writing a parser which could be used with both IMAP and POP3.
The main feature, as stated briefly above, is that the code separates MIME functionality from any mail transfer protocol.
It is also an attempt to parse MIME messages on the fly, i.e. it reads a portion of the stream and then parses it. This behaviour should reduce memory consumption. As one might notice, the code uses a StringBuilder to accumulate the whole message source, which works against this goal. However, the StringBuilder could easily be removed if one does not need access to the complete message source.
The library is also written with the aim of keeping it as "pluggable" as possible, i.e. I have tried to keep the library and its classes as loosely coupled as possible. To achieve this, I have exposed all functionality through interfaces and used dependency injection wherever possible.
MIME
When I first started out with this project, I read many articles. Among the ones I read was this one written by Peter Huber SG here at Code Project. It covers much of the topic on MIME. However I found it too tightly coupled with the POP3 protocol to fit my needs. Nevertheless Peter explains the MIME concept in detail which helped me a lot in starting to grasp the concept. Other excellent sources of information are sites such as this and this.
Using the Code
When reading the RFC 2045 specification, one soon recognizes that everything revolves around a concept called the entity. Since the entity is so central to MIME, I have tried to model a class hierarchy which reflects concepts such as "Message", "Entity", "Body part" and "Body" as they are described in the RFC 2045 specification.
The main entry point for the library is MIMER.RFC2045.MailReader, which implements MIMER.IMailReader. The IMailReader interface contains only one method signature, Read:
IMailMessage Read(ref System.IO.Stream dataStream, IEndCriteriaStrategy
endOfMessageCriteria);
The Read function requires a System.IO.Stream and a MIMER.IEndCriteriaStrategy. The IEndCriteriaStrategy should reference an object with a method which can determine when the stream has reached the end of a mail message. Hence it should be fairly easy (even if not implemented yet) to extend this MIME parser to work with IMAP as well. In theory, adding IMAP support would only require writing a class which implements the MIMER.IEndCriteriaStrategy interface and passing an instance of it to the Read method. In the worst case, one would have to write a new IMailReader. Nevertheless, much of the functionality spread among the supporting classes could probably be reused.
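To make this extension point more concrete, here is a minimal sketch of what an end-of-message strategy could look like. The members of MIMER.IEndCriteriaStrategy are not listed in this article, so the interface below (IEndCriteriaSketch with a single IsEndReached method) is a hypothetical stand-in and not the library's actual contract. The POP3 terminator itself, a line containing only a dot, comes from RFC 1939.

```csharp
using System;

// Hypothetical stand-in: the real MIMER.IEndCriteriaStrategy members are not
// shown in this article, so this signature is an assumption.
public interface IEndCriteriaSketch
{
    // Returns true when the text read so far ends with the protocol's
    // end-of-message marker.
    bool IsEndReached(string textReadSoFar);
}

// POP3 (RFC 1939) ends a multi-line response with a line holding a single ".".
public class Pop3EndCriteriaSketch : IEndCriteriaSketch
{
    private const string Terminator = "\r\n.\r\n";

    public bool IsEndReached(string textReadSoFar)
    {
        return textReadSoFar != null
            && textReadSoFar.EndsWith(Terminator, StringComparison.Ordinal);
    }
}
```

An IMAP variant would implement the same interface with a different rule, for example by counting the octets announced in an IMAP literal, while the parsing code stays unchanged.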
The IMailReader interface is the most general (RFC 822) abstraction of a mail reader. Since the RFC 822 specification predates the MIME specifications (RFC 2045 etc.), this interface and its Read method return an IMailMessage which does not support attachments.
public interface IMailMessage
{
    MailAddress From { get; set; }
    MailAddressCollection To { get; set; }
    MailAddressCollection CarbonCopy { get; set; }
    MailAddressCollection BlindCarbonCopy { get; set; }
    string Subject { get; set; }
    string Source { get; set; }
    string TextMessage { get; set; }
    bool IsNull();
}
However, MIMER.RFC2045.MailReader also has a ReadMimeMessage method. It returns an IMimeMailMessage, which is a specialization of the IMailMessage interface that supports attachments.
IMimeMailMessage ReadMimeMessage(ref System.IO.Stream dataStream,
IEndCriteriaStrategy endOfMessageCriteria);
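public interface IMimeMailMessage : IMailMessage
{
    // Accessors were abbreviated as {} in the original listing;
    // getters are assumed here.
    IDictionary<string, string> Body { get; }
    IList<IAttachment> Attachments { get; }
    IList<IMimeMailMessage> Messages { get; }
    IList<AlternateView> Views { get; }
    System.Net.Mail.MailMessage ToMailMessage();
}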
Decoders
The library implements decoders for the base64 and quoted-printable encodings. The IDecoder interface publishes the signatures expected by MIMER.RFC2045.MailReader, which can therefore easily be extended with more decoders to support more encodings.
public interface IDecoder
{
    bool CanDecode(string encoding);
    byte[] Decode(ref System.IO.Stream dataStream);
    byte[] Decode(ref string data);
}
The decoders are handed to the reader through the MailReader constructor:
public MailReader(IList<IDecoder> decoders)
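As an illustration of how a further encoding could be plugged in, the sketch below implements IDecoder for the "7bit" identity encoding defined in RFC 2045. The class name SevenBitDecoderSketch is made up for this example and is not part of the library. The interface is repeated only so that the sketch compiles on its own.

```csharp
using System;
using System.IO;
using System.Text;

// Copied from the article so the sketch compiles on its own.
public interface IDecoder
{
    bool CanDecode(string encoding);
    byte[] Decode(ref Stream dataStream);
    byte[] Decode(ref string data);
}

// Hypothetical decoder for the "7bit" identity encoding of RFC 2045:
// the content is already plain US-ASCII, so decoding is a pass-through.
public class SevenBitDecoderSketch : IDecoder
{
    public bool CanDecode(string encoding)
    {
        // Encoding names are compared case-insensitively.
        return string.Equals(encoding, "7bit", StringComparison.OrdinalIgnoreCase);
    }

    public byte[] Decode(ref Stream dataStream)
    {
        // Copy whatever remains in the stream into a byte array.
        using (var buffer = new MemoryStream())
        {
            dataStream.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    public byte[] Decode(ref string data)
    {
        return Encoding.ASCII.GetBytes(data);
    }
}
```

Given the constructor shown above, such a decoder would presumably be registered by adding it to the IList<IDecoder> that is passed to MailReader.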
Header Fields
Much of the work in parsing mail messages is done by reading and parsing fields. The most basic field is defined in the RFC 822 specification. Conceptually it contains a "name" and a "body". This definition is implemented in MIMER.RFC822.Field.
From the RFC822 specification:
field = field-name ":" [ field-body ] CRLF
field-name = 1*<any CHAR, excluding CTLs, SPACE, and ":">
field-body = field-body-contents [CRLF LWSP-char field-body]
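public class Field
{
    // Accessors were abbreviated as {} in the original listing;
    // get/set are assumed here.
    public string Name { get; set; }
    public string Body { get; set; }
}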
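The grammar above can be turned into code fairly directly. The following is a minimal sketch, not the library's FieldParser: the helper FieldSketch.TryParse is a hypothetical name, and it only handles unfolding and the name/body split.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of MIMER: splits one raw header field into
// its name and body following the RFC 822 grammar quoted above.
public static class FieldSketch
{
    // A CRLF followed by a space or tab marks a folded (continued) line;
    // unfolding removes the CRLF and keeps the whitespace.
    private static readonly Regex Fold = new Regex(@"\r\n(?=[ \t])");

    // field-name: one or more characters except CTLs, SPACE and ":".
    private static readonly Regex FieldPattern = new Regex(
        @"^(?<name>[^\x00-\x20\x7F:]+):(?<body>.*)$",
        RegexOptions.Singleline);

    public static bool TryParse(string rawField, out string name, out string body)
    {
        name = null;
        body = null;
        if (rawField == null) return false;

        // Unfold first, then drop the CRLF that terminates the field.
        string unfolded = Fold.Replace(rawField, string.Empty).TrimEnd('\r', '\n');
        Match match = FieldPattern.Match(unfolded);
        if (!match.Success) return false;

        name = match.Groups["name"].Value;
        body = match.Groups["body"].Value.Trim();
        return true;
    }
}
```

A folded field such as "Subject: Hello" continued on the next line with " world" comes out as the name Subject and the body "Hello world".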
The RFC 2045 specification does, however, extend the RFC 822 field definition with fields such as Content-Type. These definitions are implemented in MIMER.RFC2045.ContentTypeField and MIMER.RFC2045.ContentTransferEncodingField.
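public class ContentTypeField : MIMER.RFC822.Field
{
    // Accessors were abbreviated as {} in the original listing;
    // get/set are assumed here.
    public string Type { get; set; }
    public string SubType { get; set; }
    public StringDictionary Parameters { get; set; }
}
public class ContentTransferEncodingField : MIMER.RFC822.Field
{
    public string Encoding { get; set; }
}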
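To show what ends up in Type, SubType and Parameters, here is a small sketch that breaks a Content-Type body into those three parts. ContentTypeSketch is a hypothetical helper written for this example; the library's own ContentTypeFieldParser works differently and validates subtypes against known lists.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not the library's ContentTypeFieldParser: shows how a
// Content-Type body breaks down into type, subtype and parameters.
public class ContentTypeSketch
{
    public string Type { get; private set; }
    public string SubType { get; private set; }
    public IDictionary<string, string> Parameters { get; private set; }

    // Matches the leading "type/subtype" pair.
    private static readonly Regex MediaType = new Regex(
        @"^\s*(?<type>[^\s/;]+)/(?<sub>[^\s;]+)");

    // Matches ";name=value" where the value is either quoted or a bare token.
    private static readonly Regex Parameter = new Regex(
        @";\s*(?<key>[^\s=;]+)\s*=\s*(?:""(?<val>[^""]*)""|(?<val>[^\s;""]+))");

    public static ContentTypeSketch Parse(string fieldBody)
    {
        Match media = MediaType.Match(fieldBody ?? string.Empty);
        if (!media.Success) return null;

        var result = new ContentTypeSketch
        {
            Type = media.Groups["type"].Value,
            SubType = media.Groups["sub"].Value,
            // Parameter names are case-insensitive.
            Parameters = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        };
        foreach (Match p in Parameter.Matches(fieldBody))
        {
            result.Parameters[p.Groups["key"].Value] = p.Groups["val"].Value;
        }
        return result;
    }
}
```

For a multipart entity, the boundary parameter extracted this way is the delimiter the reader later looks for in the stream.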
FieldParser
The logic of the field parsing is divided among the FieldParser classes, all of which implement the IFieldParser interface.
public interface IFieldParser
{
    void Parse(ref IList<RFC822.Field> fields, ref string fieldString);
}
The parsing is implemented with regular expressions as far as possible. The patterns mirror the definitions found in the RFCs as closely as possible.
public class FieldParser : IFieldParser
{
    protected readonly string m_QuotedPairPattern = "\x5C\x5C[\x00-" +
        "\x7F]";
    protected readonly string m_DtextPattern =
        "[^]\x0D\x5B\x5C\x5C\x80-\xFF]";
    protected readonly string m_AtomPattern = "[^][()<>@,;:." +
        "\x5C\x5C\x22\x00-\x20\x7F]+";
    protected readonly string m_UnfoldPattern = "\x0D\x0A\x5Cs";
    protected readonly string m_FieldPattern = "[^\x00-\x20\x7F:]{1,}:{1,1}.+";
    protected readonly string m_FieldNamePattern = "[^\x00-\x20\x7F:]{1,}(?=:)";
    protected readonly string m_QuotedStringPattern = "\x22(?:(?:(?:\x5C\x5C" +
        "{2})+|\x5C\x5C[^\x5C\x5C]|[^\x5C\x5C\x22])*)\x22";
    protected readonly string m_CtextPattern = "[^()\x5C\x5C]+";
    ...
Since the RFC 2045 specification leaves room for future media subtypes, the parsing functionality needed an easy way to be extended. I have attempted to solve this by defining a virtual CompilePattern() method.
public class FieldParser : IFieldParser
{
    public virtual void CompilePattern(){}
    ...
public class ContentTypeFieldParser : RFC822.FieldParser, IFieldParser
{
    protected IList<string> m_ApplicationSubtypes;
    public override void CompilePattern()
    {
        m_ApplicationSubtypes.Add("octet-stream");
        m_ApplicationSubtypes.Add("PostScript");
        m_ApplicationSubtypes.Add("pdf");
        …
        m_SubType = new Regex("((?<=multipart/)"
            + m_MultipartSubtypesBuilder.ToString() + "|" +
            "(?<=text/)" +
            m_TextSubtypesBuilder.ToString() + "|" + "(?<=image/)" +
            m_ImageSubtypesBuilder.ToString() + "|" +
            "(?<=application/)" +
            m_ApplicationSubtypesBuilder.ToString() + "|" +
            "(?<=message/)" +
            m_MessageSubtypesBuilder.ToString() + "|" +
            "(?<=audio/)" +
            m_AudioSubtypesBuilder.ToString() + ")",
            RegexOptions.Compiled);
        base.CompilePattern();
    }
Because IList<string> m_ApplicationSubtypes is declared as protected, it can be accessed by child classes, which means they can add new application subtypes without rewriting the whole parsing logic. A child implementation might then look something like this:
public class ExtendedContentTypeFieldParser : RFC2045.ContentTypeFieldParser
{
    public override void CompilePattern()
    {
        m_ApplicationSubtypes.Add("ms-word");
        base.CompilePattern();
    }
}
Additions
Since I first wrote this article, a few issues with the code have surfaced. Among them was one pointed out by fellow coder "Lex1985". It turns out that I had, embarrassingly enough, forgotten to implement support for embedded messages (message/rfc822).
Embedded messages are essentially messages within a message, i.e. a message can contain any number of other messages, which can in turn contain further messages. This is truly recursive behaviour. Since the parser treats an embedded message (message/rfc822) as a kind of multipart entity, it looked for a boundary when parsing it from the stream. However, the 'Content-Type' header field of such an entity does not have to carry a boundary parameter. It was this assumption that made the code throw an exception stating that it "could not find the mandatory delimiter in multi part entity". Aside from this, parsing an embedded message differs from parsing other multipart entities. Like all other entities, an embedded message has descriptive 'Content-' headers, but it also has its own message headers.
// These are the descriptive content headers of the entity
------_=_NextPart_006_01C7E5C1.06454400
Content-Type: message/rfc822
Content-Transfer-Encoding:7bit
// These are the message headers
Received: by server.smithimage.com
id 01C7E35C.64B60917@server.smithimage.com;
Mon, 20 Aug 2007 21:00:36 +0200
Content-class: urn:content-classes:message
Subject: VB:
Date: Mon, 20 Aug 2007 21:00:28 +0200
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----_=_NextPart_004_01C7E35C.64B60917"
Message-ID: <13176CE1A8A2C4428E514E5E603A56C0039BC7@
server.smithimage.com>
Thread-Index: AcfjWkdexeNWdEWXRm6O87G7fcacpwAAhhFE
References: <13176CE1A8A2C4428E514E5E603A56C06802@
server.smithimage.com>
From: "client" <client@smithimage.com>
To: client@smithimage.com
This is a multi-part message in MIME format.
------_=_NextPart_004_01C7E35C.64B60917
This forces the flow of parsing an embedded message to be a bit different from the parsing of a 'normal' multipart entity. When the parser finds a multipart entity with its content type defined as "Content-Type: message/rfc822", it must create a new message and recursively call upon itself. The call trace of the parsing is as follows:
public IMimeMailMessage ReadMimeMessage(ref System.IO.Stream dataStream,
IEndCriteriaStrategy endOfMessageCriteria)
calls:
private string ParseMessage(ref Stream dataStream,
ref Message message, IList<RFC822.Field> fields)
calls:
private string CreateEntity(ref Stream dataStream,
ref IMultipartEntity parent, out IEntity entity)
It is within the CreateEntity method that we have to call ourselves recursively if we come upon a message/rfc822 entity.
private string CreateEntity(ref Stream dataStream,
    ref IMultipartEntity parent, out IEntity entity)
{
    entity = null;
    IList<RFC822.Field> fields;
    int cause = ParseFields(ref dataStream, out fields);
    if (cause > 0)
    {
        foreach (RFC822.Field contentField in fields)
        {
            if (contentField is ContentTypeField)
            {
                ContentTypeField contentTypeField =
                    contentField as ContentTypeField;
                if (m_FieldParser.CompositeType.IsMatch
                    (contentTypeField.Type))
                {
                    MultipartEntity mEntity = new MultipartEntity();
                    mEntity.Fields = fields;
                    entity = mEntity;
                    entity.Parent = parent;
                    parent.BodyParts.Add(entity);
                    if (Regex.IsMatch(contentTypeField.Type,
                        "(?i)message") &&
                        Regex.IsMatch(contentTypeField.SubType,
                        "(?i)rfc822"))
                    {
                        // Embedded message: read its own message headers,
                        // then parse it recursively as a new message.
                        Message message = new Message();
                        IList<RFC822.Field> messageFields;
                        cause = ParseFields(ref dataStream,
                            out messageFields);
                        message.Fields = messageFields;
                        mEntity.BodyParts.Add(message);
                        message.Parent = mEntity;
                        if (cause > 0)
                            return ParseMessage(ref dataStream,
                                ref message, messageFields);
                        break;
                    }
                    else
                    {
                        // Ordinary multipart entity: remember its boundary.
                        mEntity.Delimiter = ReadDelimiter
                            (ref contentTypeField);
                        return parent.Delimiter;
                    }
                }
                else if (m_FieldParser.DescriteType.IsMatch
                    (contentTypeField.Type))
                {
                    entity = new Entity();
                    entity.Fields = fields;
                    entity.Parent = parent;
                    parent.BodyParts.Add(entity);
                    return parent.Delimiter;
                }
            }
        }
    }
    return string.Empty;
}
It is this recursive call that has been added. However, some changes were also needed in the RFC2045.IMimeMailMessage definition to support embedded messages. The RFC2045.IMimeMailMessage interface now looks like this:
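public interface IMimeMailMessage : IMailMessage
{
    // Accessors were abbreviated as {} in the original listing;
    // getters are assumed here.
    IDictionary<string, string> Body { get; }
    IList<IAttachment> Attachments { get; }
    IList<IMimeMailMessage> Messages { get; }
    IList<AlternateView> Views { get; }
    System.Net.Mail.MailMessage ToMailMessage();
}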
This design makes it possible to read any number of recursively embedded messages. For example, a recursive MIME message structure like the one below can be accessed through code:
1Message
-1:1Message
--1:1:1Message
---1:1:1:1Message
----etc.
IMimeMailMessage m = ReadMimeMessage(ref s, endCriteria);
string subject = m.Messages[0].Messages[0].Messages[0].Subject;
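Indexing by hand only works when the nesting depth is known in advance. For arbitrary depth, a recursive walk is more practical. The sketch below uses a trimmed-down, hypothetical stand-in for IMimeMailMessage (INestedMessageSketch, with only Subject and Messages) so that it compiles without the library. With the real interface the same traversal should apply, assuming Messages holds the directly embedded messages.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down stand-in for IMimeMailMessage: only the two
// members needed to walk the tree are declared here.
public interface INestedMessageSketch
{
    string Subject { get; }
    IList<INestedMessageSketch> Messages { get; }
}

// Simple in-memory implementation used to try the traversal out.
public class SimpleMessageSketch : INestedMessageSketch
{
    public SimpleMessageSketch(string subject)
    {
        Subject = subject;
        Messages = new List<INestedMessageSketch>();
    }

    public string Subject { get; private set; }
    public IList<INestedMessageSketch> Messages { get; private set; }
}

public static class MessageTreeSketch
{
    // Depth-first walk yielding each subject together with its nesting depth
    // (0 for the outermost message).
    public static IEnumerable<KeyValuePair<int, string>> Subjects(
        INestedMessageSketch root, int depth = 0)
    {
        yield return new KeyValuePair<int, string>(depth, root.Subject);
        foreach (INestedMessageSketch child in root.Messages)
        {
            foreach (var entry in Subjects(child, depth + 1))
            {
                yield return entry;
            }
        }
    }
}
```

Each yielded pair carries the depth, so a caller can print an indented outline similar to the structure shown above.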
To Sum Up
This article and its code have aimed to explain my attempt at implementing a MIME-capable parser which is not too tightly coupled to the POP3 mail protocol. Hopefully you can use the source, completely or partially, in your own code. The code is not yet stable or thoroughly tested, and it can most definitely be improved extensively with regard to both architecture and performance. I do, however, think its overall architecture and idea are worth studying. Also look out for my next article, which will describe the use of this library in a POP3-compliant client library.
History
- 2007-07-27: Article created
- 2007-08-10: Zip file updated
- 2007-09-03: Article and Zip file updated
- 2008-06-13: Zip file updated
- 2008-07-08: Zip file updated