Introduction
The Expat XML Parser is a fine, widely used, event-based XML parser. One of its nicer features is an API that can be used from C programs. Even though many programmers use Expat in a C++ environment, the C-based API makes it easy to export the API from a DLL.
However, Expat being a C based API doesn't mean we have to live without our C++ classes. Luckily, Expat was designed with the ability to be augmented with classes.
(Definition: Event Based XML Parser - An XML parser which invokes methods (a.k.a. events) when XML constructs are parsed. This differs from the DOM (Document Object Model) style parsers that parse the XML and then present the application with XML data in its logical hierarchical format.)
Design Rationale
The primary considerations when designing the Expat wrapper classes were completeness, simplicity, and extensibility. For completeness, almost all Expat API routines have been wrapped in the classes, including routines such as XML_ExpatVersionInfo. For simplicity, the wrapper classes only wrap the Expat API and provide no other features. For extensibility, the wrapper classes make it easy to derive new classes that provide enhanced functionality.
Basics
The Expat wrappers consist of two classes: a template-based class (CExpatImpl <class _T>) and a virtual-function-based class (CExpat). Each class has features that lend themselves to specific solutions.
The following table illustrates the relationship between the API and the two classes.
CExpat |
CExpatImpl <class _T> |
Expat C API |
The template class CExpatImpl <class _T> provides the base layer of translation between C++ and the Expat C API. The benefit of the template design is that if the application only needs a few of the Expat event routines, the code for the unused event routines is not compiled into the final executable. Admittedly, the amount of space saved is minimal, but why waste it?
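To make the idea of a template translation layer concrete, here is a minimal, self-contained sketch of the general technique (often called the curiously recurring template pattern). It is not the actual CExpatImpl source: MiniImpl, StartThunk, FakeDispatch, and CountingParser are hypothetical names, and a plain function-pointer call stands in for Expat invoking a registered C callback with its user-data pointer.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the callback signature a C library would use.
typedef void (*StartCallback) (void *pvUserData, const char *pszName);

template <class T>
class MiniImpl
{
public:
    // Static thunk with a C-compatible signature. It recovers the derived
    // object from the user-data pointer and forwards to its OnStartElement.
    static void StartThunk (void *pvUserData, const char *pszName)
    {
        T *pThis = static_cast <T *> (static_cast <MiniImpl <T> *> (pvUserData));
        pThis ->OnStartElement (pszName);
    }

    // Simulates the C library firing the event through the function pointer.
    void FakeDispatch (const char *pszName)
    {
        StartCallback pfn = &MiniImpl <T>::StartThunk;
        pfn (this, pszName);
    }

    // Default handler, used only when the derived class does not supply one.
    void OnStartElement (const char *) {}
};

class CountingParser : public MiniImpl <CountingParser>
{
public:
    int nStarts = 0;
    std::string strLast;

    void OnStartElement (const char *pszName)
    {
        ++nStarts;
        strLast = pszName;
    }
};

// Fires two simulated events and returns how many the derived class saw.
int DemoStaticDispatch ()
{
    CountingParser sParser;
    sParser .FakeDispatch ("root");
    sParser .FakeDispatch ("child");
    assert (sParser .strLast == "child");
    return sParser .nStarts;
}
```

Because the call to OnStartElement is resolved at compile time through the template argument, no virtual table is involved, and member functions of the template that are never used are never instantiated.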
The CExpat class is derived from the CExpatImpl <class _T> template class. Apart from the default constructor, the only methods in this class are the event methods, all declared as virtual functions. CExpat is intended for situations where virtual functions are preferable to templates.
Within reason, the two classes are interchangeable. A class derived from CExpat could easily be modified to use CExpatImpl <class _T>, or vice versa, without having to modify any other source. See the "Implementation Notes" for more information about some implementation pitfalls with more complex derived classes.
For the rest of this document, only the CExpatImpl <class _T> class will be discussed. As stated previously, the two wrapper classes are almost 100 percent interchangeable. Documenting both would be redundant.
Getting Started
The first step in using CExpatImpl <class _T> is deriving a new class that provides the application-specific implementation. Deriving a class is required: as with Expat itself when no handlers are set, a parser without a derived class would only verify that the XML is well formed.
As a starting point, let us define an XML parser that reports when an element begins, when it ends, and how much character data it contains.
class CMyXML : public CExpatImpl <CMyXML>
{
public:
    CMyXML ()
    {
    }
    // Invoked once the Expat parser exists; enable the events we want
    void OnPostCreate ()
    {
        EnableStartElementHandler ();
        EnableEndElementHandler ();
        EnableCharacterDataHandler ();
    }
    // Start tag
    void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs)
    {
        printf ("We got a start element %s\n", pszName);
        return;
    }
    // End tag
    void OnEndElement (const XML_Char *pszName)
    {
        printf ("We got an end element %s\n", pszName);
        return;
    }
    // Character data; pszData is not NUL terminated, so use nLength
    void OnCharacterData (const XML_Char *pszData, int nLength)
    {
        printf ("We got %d bytes of data\n", nLength);
        return;
    }
};
The CMyXML::OnPostCreate method will be invoked by CExpatImpl <class _T> after the Expat parser has been created. This provides an easy way of enabling event routines. The CMyXML::OnStartElement, CMyXML::OnEndElement, and CMyXML::OnCharacterData methods will be invoked by Expat while the XML text is being parsed. These routines will not be invoked unless they are enabled. The code inside CMyXML::OnPostCreate enables the three event routines.
Creating a Parser
Now that we have a derived class, we can use it to create an Expat parser. Creating the parser is very easy. First create an instance of the parser class, then invoke the Create method.
The Create method has two arguments: the document encoding and the character used to separate a namespace from a name. The encoding is the default encoding that will be used while parsing the XML document unless an encoding is specified in the XML document itself. The namespace separator is used to separate the namespace from the name in calls such as OnStartElement.
For example, if the XML document contained the name SOAP_ENC:Envelope, the SOAP_ENC prefix was defined as "http://schemas.xmlsoap.org/soap/envelope/", and "#" was specified to Create, then OnStartElement would be invoked with the string "http://schemas.xmlsoap.org/soap/envelope/#Envelope".
bool ParseSomeXML (LPCTSTR pszXMLText)
{
    CMyXML sParser;
    // Create the underlying Expat parser; the text itself is not parsed yet
    return sParser .Create ();
}
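An event routine that receives a combined name such as the one above will often want to split it back into its namespace and local parts. The following is a small sketch of one way to do that. SplitExpandedName is a hypothetical helper, not part of the wrapper classes, and it assumes plain char strings and a separator such as "#" that cannot occur in an XML name.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <utility>

// Splits "uri<sep>local" into (uri, local). When the separator is absent,
// the name had no namespace and the uri part is left empty.
std::pair <std::string, std::string> SplitExpandedName (const char *pszName, char chSep)
{
    // Search from the right: the local part cannot contain the separator
    // assumed here, but a namespace URI might.
    const char *pszSep = std::strrchr (pszName, chSep);
    if (pszSep == nullptr)
        return std::make_pair (std::string (), std::string (pszName));
    return std::make_pair (std::string (pszName, pszSep), std::string (pszSep + 1));
}
```

For the example above, the first element of the returned pair would hold the namespace URI and the second would hold "Envelope".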
Parsing a Simple Text String
Next, we need to send the XML document to the parser. There are two ways to do this: directly, or through the parser's internal buffers. Sending the data directly is the easier of the two, but it is also slightly slower.
To send a simple string to the parser, the application invokes the Parse (LPCTSTR pszBuffer, int nLength = -1, bool fIsFinal = true) method. The first argument is a pointer to the string of data to be parsed. A routine has been defined for both ANSI and UNICODE strings. The second parameter is the length of the string in characters (char or wchar_t depending on ANSI or UNICODE). If nLength is less than zero, the string pointed to by pszBuffer must be NUL terminated and the length will be determined from the string. If nLength is greater than or equal to zero, the string need not be NUL terminated, and the length should not include the NUL character if one exists. The third parameter lets the XML parser know when there is no more data. If the whole XML document can be contained within one simple string, then fIsFinal can be set to true on the first call. Otherwise, fIsFinal should remain false while there is more data to be parsed. Parse can be invoked with nLength set to zero and fIsFinal set to true after all data has been read in.
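bool ParseSomeXML (LPCTSTR pszXMLText)
{
    CMyXML sParser;
    // Parsing cannot proceed if the Expat parser could not be created
    if (!sParser .Create ())
        return false;
    // nLength defaults to -1 (NUL terminated) and fIsFinal to true
    return sParser .Parse (pszXMLText);
}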
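When the document arrives in several pieces, the fIsFinal rule described above can be sketched as follows. FeedInChunks and RecordingParser are hypothetical names used only for illustration. The helper works with any object that has a Parse (pszBuffer, nLength, fIsFinal) method of the shape described in this section, and RecordingParser is a stand-in that merely records what it is given rather than parsing XML.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Feeds a NUL terminated string to a parser-like object in fixed-size
// pieces, passing fIsFinal = true only with the last piece.
template <class TParser>
bool FeedInChunks (TParser &sParser, const char *pszText, int nChunk)
{
    if (nChunk <= 0)
        return false;
    int nRemaining = (int) std::strlen (pszText);
    do
    {
        int nLength = nRemaining < nChunk ? nRemaining : nChunk;
        nRemaining -= nLength;
        // An explicit length is passed, so the piece need not be NUL terminated
        if (!sParser .Parse (pszText, nLength, nRemaining == 0))
            return false;
        pszText += nLength;
    } while (nRemaining > 0);
    return true;
}

// Hypothetical stand-in for a parser; it only records what it was given.
struct RecordingParser
{
    std::string strSeen;
    int nCalls = 0;
    int nFinalCalls = 0;

    bool Parse (const char *pszBuffer, int nLength, bool fIsFinal)
    {
        strSeen .append (pszBuffer, (size_t) nLength);
        ++nCalls;
        if (fIsFinal)
            ++nFinalCalls;
        return true;
    }
};
```

An empty string still results in one call with a length of zero and fIsFinal set to true, which matches the closing call described above.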
Parsing Using Internal Buffers
To reduce the number of extra memory copies, buffers internal to the Expat parser can be used instead of passing data into the parser only to have Expat copy it into its internal buffers. Using internal buffers takes three steps: requesting a buffer, reading data into the buffer, and submitting the data to the parser.
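bool ParseSomeXML (LPCSTR pszFileName)
{
    CMyXML sParser;
    if (!sParser .Create ())
        return false;
    // Binary mode hands the bytes to Expat untranslated
    FILE *fp = fopen (pszFileName, "rb");
    if (fp == NULL)
        return false;
    bool fSuccess = true;
    bool fIsFinal = false;
    while (!fIsFinal && fSuccess)
    {
        // Step 1: request a buffer owned by the parser
        LPSTR pszBuffer = (LPSTR) sParser .GetBuffer (256);
        if (pszBuffer == NULL)
            fSuccess = false;
        else
        {
            // Step 2: read directly into that buffer
            int nLength = (int) fread (pszBuffer, 1, 256, fp);
            // A short read means end of file (or a read error), so this is
            // the last block; testing feof alone could end the loop without
            // ever telling the parser that the document is complete
            fIsFinal = nLength < 256;
            // Step 3: submit what was read
            fSuccess = sParser .ParseBuffer (nLength, fIsFinal) && !ferror (fp);
        }
    }
    fclose (fp);
    return fSuccess;
}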
As you can see, this method is more complicated than the other, but once the example in the previous section is modified to read a file, the difference in complexity is minimal.
Working With Event Routines
Event routines provide the actual information about what has been parsed to the application. The method names inside the CExpatImpl <class _T> class have been selected to make it easy to know which routine applies to what Expat event.
In Expat:
Set the event handler routine | XML_Set[Event Name]Handler |
Name of the event handler | Application specific |
In CExpatImpl <class _T>:
Enable the event handler routine | Enable[Event Name]Handler |
Name of the event handler | On[Event Name] |
Name of the internal event handler | [Event Name]Handler |
So, if you wish to receive StartElement events, you define a method called OnStartElement with the proper arguments and invoke EnableStartElementHandler with true as the only argument. The event routine can later be disabled by invoking EnableStartElementHandler again with false as the only argument.
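The naming scheme above can be illustrated with a small mock. The sketch below is not the CExpatImpl source and does not call Expat: FakeParser, MiniImpl, FakeStartTag, and CMyCounter are hypothetical, and the single callback slot merely imitates what a call such as XML_SetStartElementHandler would set inside a real parser. It shows how an Enable[Event Name]Handler method can install or remove an internal [Event Name]Handler routine that forwards to On[Event Name].

```cpp
#include <cassert>

// Hypothetical stand-in for a C parser handle with one callback slot.
struct FakeParser
{
    void *pvUserData;
    void (*pfnStart) (void *pvUserData, const char *pszName);
};

template <class T>
class MiniImpl
{
public:
    MiniImpl ()
    {
        m_sParser .pvUserData = this;
        m_sParser .pfnStart = nullptr;
    }

    // Enable[Event Name]Handler: install the internal handler, or remove it.
    void EnableStartElementHandler (bool fEnable = true)
    {
        m_sParser .pfnStart = fEnable ? &MiniImpl <T>::StartElementHandler : nullptr;
    }

    // Simulates the C library reaching a start tag.
    void FakeStartTag (const char *pszName)
    {
        if (m_sParser .pfnStart != nullptr)
            m_sParser .pfnStart (m_sParser .pvUserData, pszName);
    }

protected:
    // [Event Name]Handler: internal routine that forwards to On[Event Name].
    static void StartElementHandler (void *pvUserData, const char *pszName)
    {
        T *pThis = static_cast <T *> (static_cast <MiniImpl <T> *> (pvUserData));
        pThis ->OnStartElement (pszName);
    }

private:
    FakeParser m_sParser;
};

class CMyCounter : public MiniImpl <CMyCounter>
{
public:
    int nStarts = 0;
    void OnStartElement (const char *) { ++nStarts; }
};

// Returns the number of start tags seen while the handler was enabled.
int DemoEnableDisable ()
{
    CMyCounter sParser;
    sParser .FakeStartTag ("ignored");          // not enabled yet
    assert (sParser .nStarts == 0);
    sParser .EnableStartElementHandler (true);
    sParser .FakeStartTag ("counted");
    sParser .EnableStartElementHandler (false);
    sParser .FakeStartTag ("ignored");          // disabled again
    return sParser .nStarts;
}
```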
The specifics of each event routine are beyond the scope of this document. For more information about the events and the Expat parser itself, see http://www.xml.com/pub/a/1999/09/expat/index.html. Almost everything described in that document has a counterpart of the same name in CExpatImpl <class _T>.
Implementation Notes
As stated earlier, there are some pitfalls that applications must be aware of when creating complex derived class hierarchies. Let us consider the example of an XML parser consisting of two classes, CMyXMLBase and CMyXML. CMyXML is derived from CMyXMLBase and CMyXMLBase is derived from one of the Expat class wrappers.
Consider the case where the classes are derived from the CExpatImpl <class _T> template class.
class CMyXMLBase : public CExpatImpl <CMyXMLBase>
{
public:
    CMyXMLBase ()
    {
    }
    void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs)
    {
        return;
    }
};
class CMyXML : public CMyXMLBase
{
public:
    CMyXML ()
    {
    }
    void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs)
    {
        return;
    }
};
In this case, the programmer expects CMyXML::OnStartElement to be invoked by the Expat parser. However, due to the design of the CExpatImpl <class _T> class, only the methods of the class specified in the template argument list, here CMyXMLBase, are invoked. This is by design.
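The effect can be reproduced without Expat. In the sketch below, MiniImpl and FakeStartTag are hypothetical stand-ins for the template wrapper and for an event arriving from the parser, and the event routines are reduced to a single argument. Only the class hierarchy mirrors the example above.

```cpp
#include <cassert>
#include <string>

template <class T>
class MiniImpl
{
public:
    // Simulates an event arriving; the call is resolved statically through T.
    void FakeStartTag (const char *pszName)
    {
        static_cast <T *> (this) ->OnStartElement (pszName);
    }
};

class CMyXMLBase : public MiniImpl <CMyXMLBase>
{
public:
    std::string strCalled;
    void OnStartElement (const char *) { strCalled = "CMyXMLBase"; }
};

class CMyXML : public CMyXMLBase
{
public:
    // Hides the base version, but the template layer never looks here
    void OnStartElement (const char *) { strCalled = "CMyXML"; }
};

// Returns the name of the class whose OnStartElement was actually invoked.
std::string DemoPitfall ()
{
    CMyXML sParser;
    sParser .FakeStartTag ("root");
    return sParser .strCalled;
}
```

DemoPitfall returns "CMyXMLBase": the template argument names CMyXMLBase, the method is not virtual, and so the override in CMyXML is never reached.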
There are three different ways to fix this problem. The first is to declare OnStartElement as virtual in CMyXMLBase. The second is to derive CMyXMLBase from CExpat instead of CExpatImpl <class _T>. The third requires changing CMyXMLBase from a normal class to a template. This change provides CExpatImpl <class _T> with the name of the class in which to locate the event routines.
template <class _T>
class CMyXMLBase : public CExpatImpl <_T>
{
public:
    CMyXMLBase ()
    {
    }
    void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs)
    {
        return;
    }
};
class CMyXML : public CMyXMLBase <CMyXML>
{
public:
    CMyXML ()
    {
    }
    void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs)
    {
        return;
    }
};
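To check that the template form of CMyXMLBase behaves as intended, the same hierarchy can be exercised against a mock. As before, MiniImpl and FakeStartTag are hypothetical stand-ins for the template wrapper and for an event arriving from Expat, and the event routines take a single argument for brevity.

```cpp
#include <cassert>
#include <string>

template <class T>
class MiniImpl
{
public:
    // Simulates an event arriving; the call is resolved statically through T.
    void FakeStartTag (const char *pszName)
    {
        static_cast <T *> (this) ->OnStartElement (pszName);
    }
};

// The intermediate class passes the most derived type down to the wrapper.
template <class T>
class CMyXMLBase : public MiniImpl <T>
{
public:
    std::string strCalled;
    void OnStartElement (const char *) { strCalled = "CMyXMLBase"; }
};

class CMyXML : public CMyXMLBase <CMyXML>
{
public:
    void OnStartElement (const char *) { strCalled = "CMyXML"; }
};

// Returns the name of the class whose OnStartElement was actually invoked.
std::string DemoTemplateFix ()
{
    CMyXML sParser;
    sParser .FakeStartTag ("root");
    return sParser .strCalled;
}
```

Here DemoTemplateFix returns "CMyXML", because the wrapper now casts to the most derived class. If CMyXML did not define OnStartElement, ordinary name lookup would fall back to the version in CMyXMLBase.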