
C++ Wrappers for the Expat XML Parser

17 Feb 2002
The included class definitions provide complete and easy-to-use C++ wrappers for the Expat C API.

Introduction

The Expat XML Parser is a fine and widely used event-based XML parser.  One of the nicer features of Expat is that its API can be used from C programs.  Even though many programmers use Expat in a C++ environment, the C-based API makes it easy to export the API from a DLL.

However, Expat having a C-based API doesn't mean we have to live without our C++ classes.  Luckily, Expat was designed with the ability to be augmented with classes.

(Definition: Event Based XML Parser - An XML parser which invokes methods (a.k.a. events) when XML constructs are parsed.  This differs from the DOM (Document Object Model) style parsers that parse the XML and then present the application with XML data in its logical hierarchical format.)
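To make the definition concrete, here is a toy sketch of the event style.  It is not Expat and not a real XML parser: ToyEvents, ToyParse and ToyTrace are hypothetical names, and the scanner only understands plain start tags, end tags and text.  The point is only that the application receives callbacks in document order and never sees a tree.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy illustration only: handles <name>, </name> and text.  No attributes,
// comments, entities or error reporting.
struct ToyEvents
{
    std::function <void (const std::string &)> onStart;
    std::function <void (const std::string &)> onEnd;
    std::function <void (const std::string &)> onText;
};

void ToyParse (const std::string &xml, const ToyEvents &ev)
{
    std::string::size_type pos = 0;
    while (pos < xml.size ())
    {
        if (xml [pos] == '<')
        {
            std::string::size_type close = xml.find ('>', pos);
            if (close == std::string::npos)
                return; // malformed input: this toy simply stops
            std::string tag = xml.substr (pos + 1, close - pos - 1);
            if (!tag.empty () && tag [0] == '/')
                ev.onEnd (tag.substr (1));
            else
                ev.onStart (tag);
            pos = close + 1;
        }
        else
        {
            std::string::size_type next = xml.find ('<', pos);
            if (next == std::string::npos)
                next = xml.size ();
            ev.onText (xml.substr (pos, next - pos));
            pos = next;
        }
    }
}

// Records the events in the order they are raised
std::vector <std::string> ToyTrace (const std::string &xml)
{
    std::vector <std::string> log;
    ToyEvents ev;
    ev.onStart = [&log] (const std::string &n) { log.push_back ("start:" + n); };
    ev.onEnd   = [&log] (const std::string &n) { log.push_back ("end:" + n); };
    ev.onText  = [&log] (const std::string &t) { log.push_back ("text:" + t); };
    ToyParse (xml, ev);
    return log;
}
```

A DOM style parser would instead return a tree of nodes once the whole document had been read.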

Design Rationale

The primary considerations when designing the Expat wrapper classes were completeness, simplicity, and extensibility.  For completeness, almost all Expat API routines have been wrapped in the classes.  This includes even routines such as XML_ExpatVersionInfo.  For simplicity, the wrapper classes only wrap the Expat API and provide no other features.  For extensibility, the wrapper classes make it easy to derive new classes that provide enhanced functionality.

Basics

The Expat wrappers consist of two classes, a template-based class (CExpatImpl <class _T>) and a virtual-function-based class (CExpat).  Each class has features that lend themselves to specific solutions.

The following table illustrates the relationship between the API and the two classes.

CExpat

CExpatImpl <class _T>

Expat C API

The template class CExpatImpl <class _T> provides the base layer of translation between C++ and the Expat C API.  The benefit of the template design is that if the application only needs a few of the Expat event routines, then the code for the unused event routines is not compiled into the final executable.  Admittedly, the amount of space wasted is minimal, but why waste it?
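The following is a minimal sketch of the template technique, assuming the wrapper uses the common pattern where the base class casts itself to its template argument.  CMiniImpl and CMiniCounter are hypothetical stand-ins, not the real CExpatImpl, and no Expat code is involved.  Because member functions of a class template are only instantiated when they are used, routines that are never called add nothing to the executable.

```cpp
#include <string>

// Hypothetical stand-in for the wrapper's base layer; not the real CExpatImpl
template <class T>
class CMiniImpl
{
public:
    // Simulates the parser reporting a start element.  The call is resolved
    // at compile time against T, so no virtual function table is needed.
    void FireStartElement (const char *pszName)
    {
        static_cast <T *> (this) ->OnStartElement (pszName);
    }

    // Default handler, used only when T does not supply its own
    void OnStartElement (const char *)
    {
    }
};

class CMiniCounter : public CMiniImpl <CMiniCounter>
{
public:
    int m_nElements = 0;
    std::string m_strLast;

    // Hides the default handler in the base class
    void OnStartElement (const char *pszName)
    {
        ++m_nElements;
        m_strLast = pszName;
    }
};
```

Calling FireStartElement on a CMiniCounter reaches CMiniCounter::OnStartElement directly, without any virtual call.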

The CExpat class is derived from the CExpatImpl <class _T> template class.  However, apart from the default constructor, the only methods contained within this class are the event methods, all declared as virtual functions.  CExpat is intended for situations where virtual functions are preferable to templates.

Within reason, the two classes are interchangeable.  If you have a class that is derived from CExpat, it could easily be modified to use CExpatImpl <class _T>, or vice versa, without having to modify any other source.  See the "Implementation Notes" for more information about some implementation pitfalls with regard to more complex derived classes.

For the rest of this document, only the CExpatImpl <class _T> class will be discussed.  As stated previously, the two wrapper classes are almost 100 percent interchangeable.  Documenting both would be redundant.

Getting Started

The first step in using CExpatImpl <class _T> is deriving a new class that provides the application-specific implementation.  Deriving a class is required.  As with Expat itself when no handlers are set, a parser without a derived class would only verify that the XML is well formed.

As a starting point, let us define an XML parser that reports when an element begins, when it ends, and how much data is contained within the element.

class CMyXML : public CExpatImpl <CMyXML> 
{
public:

	// Constructor 
	CMyXML () 
	{
	}
	
	// Invoked by CExpatImpl after the parser is created
	void OnPostCreate ()
	{
		// Enable all the event routines we want
		EnableStartElementHandler ();
		EnableEndElementHandler ();
		// Note: EnableElementHandler will do both start and end
		EnableCharacterDataHandler ();
	}
	
	// Start element handler
	void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs)
	{
		printf ("We got a start element %s\n", pszName);
		return;
	}

	// End element handler
	void OnEndElement (const XML_Char *pszName)
	{
		printf ("We got an end element %s\n", pszName);
		return;
	}

	// Character data handler
	void OnCharacterData (const XML_Char *pszData, int nLength)
	{
		// note, pszData is NOT null terminated
		printf ("We got %d bytes of data\n", nLength);
		return;
	}
};

The CMyXML::OnPostCreate method will be invoked by CExpatImpl <class _T> after the Expat parser has been created.  This provides an easy method of enabling event routines.  The CMyXML::OnStartElement, CMyXML::OnEndElement, and CMyXML::OnCharacterData methods will be invoked by Expat while the XML text is being parsed.  These routines will not be invoked unless they are enabled.  The code inside CMyXML::OnPostCreate enables the three event routines.
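Since the data passed to OnCharacterData is not NUL terminated, and Expat may deliver a single run of text in several calls, an application that wants the text itself usually has to accumulate it.  The sketch below shows one way to do that.  CTextCollector is a hypothetical helper that is not part of the wrapper classes, and it assumes XML_Char is char.

```cpp
#include <string>

// Hypothetical helper, not part of the wrapper classes.
// Assumes XML_Char is char.
class CTextCollector
{
public:
    // Same shape as OnCharacterData: a pointer plus an explicit length
    void OnCharacterData (const char *pszData, int nLength)
    {
        // append (pointer, count) copies exactly nLength characters and
        // never looks for a terminating NUL
        m_strText.append (pszData, static_cast <std::string::size_type> (nLength));
    }

    // Call when an element starts or ends to begin a fresh run of text
    void Reset ()
    {
        m_strText.clear ();
    }

    const std::string &GetText () const
    {
        return m_strText;
    }

private:
    std::string m_strText;
};
```

A derived parser class could forward its own OnCharacterData to such a collector and read the text back in OnEndElement.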

Creating a Parser

Now that we have a derived class, we can use it to create an Expat parser.  Creating the parser is very easy.  First create an instance of the parser class, then invoke the Create method. 

The Create method has two arguments, the document encoding and the character used to separate the namespace from a name.  The encoding is the default encoding that will be used while parsing the XML document unless an encoding is specified in the XML document itself.  The namespace separator is used to separate the namespace from the name in calls such as OnStartElement.

For example, if the XML document contained the name SOAP_ENC:Envelope, the prefix SOAP_ENC was defined as "http://schemas.xmlsoap.org/soap/envelope/", and "#" was specified to Create, then OnStartElement would be invoked with the string "http://schemas.xmlsoap.org/soap/envelope/#Envelope".

bool ParseSomeXML (LPCTSTR pszXMLText)
{
	CMyXML sParser;
	if (!sParser .Create ())
		return false;
	
	// do something useful

	return true;
}
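When a namespace separator is given to Create, a handler often needs to take the combined name apart again.  The helper below is a sketch of that step.  SplitQualifiedName is a hypothetical name, and the code assumes XML_Char is char and that names arrive as the namespace URI, the separator, and the local name, as in the example above.  It splits at the last separator because the URI itself might contain the separator character.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits "uri<sep>local" into (uri, local).
// A name with no separator is returned with an empty namespace.
std::pair <std::string, std::string> SplitQualifiedName (const std::string &strName, char chSep)
{
    // Search from the end: the local name cannot contain '#', but a URI can
    std::string::size_type nPos = strName.rfind (chSep);
    if (nPos == std::string::npos)
        return std::make_pair (std::string (), strName);
    return std::make_pair (strName.substr (0, nPos), strName.substr (nPos + 1));
}
```

Applied to "http://schemas.xmlsoap.org/soap/envelope/#Envelope" with "#" as the separator, the first member holds the namespace URI and the second holds "Envelope".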

Parsing a Simple Text String

Next, we actually need to send the XML document to the parser.  There are two different methods of sending the document to the XML parser, directly or by internal buffers.  The easier of the two is sending the data directly to the parser.  However, it is also just a bit slower.

To send a simple string to the parser, the application invokes the Parse (LPCTSTR pszBuffer, int nLength = -1, bool fIsFinal = true) method.  The first argument is a pointer to the string of data to be parsed.  A routine has been defined for each of ANSI and UNICODE strings.  The second argument is the length of the string in characters (char or wchar_t depending on ANSI or UNICODE).  If nLength is less than zero, then the string pointed to by pszBuffer must be NUL terminated and the length will be determined from the string.  If nLength is greater than or equal to zero, then the string need not be NUL terminated and the length should not include the NUL character if it exists.  The third argument lets the XML parser know when there is no more data.  If the whole XML document is contained within one string, then fIsFinal can be set to true on the first call.  Otherwise, fIsFinal should remain false while there is more data to be parsed.  Once all the data has been read in, Parse can be invoked with nLength set to zero and fIsFinal set to true.
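The fIsFinal rule is easiest to see when a document is fed to the parser in pieces.  The sketch below is written against any object with a Parse (const char *, int, bool) method, which is an assumption about the ANSI form of the wrapper's Parse.  FeedInChunks and CRecordingParser are hypothetical names, and the recording class only logs the calls, it does not parse anything.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Feeds strText to sParser in pieces of at most nChunk characters.
// Only the last piece is marked as final.
template <class TParser>
bool FeedInChunks (TParser &sParser, const std::string &strText, std::size_t nChunk)
{
    if (nChunk == 0)
        return false;

    // Empty input: still tell the parser that no more data is coming
    if (strText.empty ())
        return sParser.Parse ("", 0, true);

    std::size_t nPos = 0;
    while (nPos < strText.size ())
    {
        std::size_t nLen = std::min (nChunk, strText.size () - nPos);
        bool fIsFinal = (nPos + nLen == strText.size ());
        if (!sParser.Parse (strText.data () + nPos, static_cast <int> (nLen), fIsFinal))
            return false;
        nPos += nLen;
    }
    return true;
}

// Stand-in used only to show the call sequence; it is not CExpatImpl
struct CRecordingParser
{
    std::vector <int> m_anLengths;
    std::vector <bool> m_afFinal;
    std::string m_strSeen;

    bool Parse (const char *pszBuffer, int nLength, bool fIsFinal)
    {
        m_strSeen.append (pszBuffer, static_cast <std::size_t> (nLength));
        m_anLengths.push_back (nLength);
        m_afFinal.push_back (fIsFinal);
        return true;
    }
};
```

Each piece is passed with an explicit length, so none of them needs to be NUL terminated.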

bool ParseSomeXML (LPCTSTR pszXMLText)
{
	CMyXML sParser;
	if (!sParser .Create ())
		return false;
	
	// Send this simple string to the parser
	return sParser .Parse (pszXMLText);
}

Parsing Using Internal Buffers

To reduce the number of extra memory copies, buffers internal to the Expat parser can be used directly, instead of passing data to the parser only to have it copied into those internal buffers.  Using internal buffers takes three steps: requesting a buffer, reading data into the buffer, and submitting the data to the parser.

bool ParseSomeXML (LPCSTR pszFileName)
{

	// Create the parser 

	
	CMyXML sParser;
	if (!sParser .Create ())
		return false;
	
	// Open the file

	
	FILE *fp = fopen (pszFileName, "r");
	if (fp == NULL)
		return false;
	
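	// Loop until the final block has been submitted

	bool fSuccess = true;
	bool fDone = false;
	while (!fDone && fSuccess)
	{
		LPSTR pszBuffer = (LPSTR) sParser .GetBuffer (256); // REQUEST
		if (pszBuffer == NULL)
			fSuccess = false;
		else
		{
			int nLength = (int) fread (pszBuffer, 1, 256, fp); // READ

			// A short read means end of file (or a read error), so this
			// is the final block.  Testing feof before reading would skip
			// the final call whenever the last block is only partly filled.
			fDone = nLength < 256;
			fSuccess = sParser .ParseBuffer (nLength, fDone); // PARSE
		}
	}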

	// Close the file

	
	fclose (fp);
	return fSuccess;
}

As you can see, this method is more complicated than the other, but once the example in the previous section is modified to read from a file, the difference in complexity is minimal.

Working With Event Routines

Event routines provide the actual information about what has been parsed to the application.  The method names inside the CExpatImpl <class _T> class have been selected to make it easy to know which routine applies to what Expat event.

In Expat:

Set the event handler routine: XML_Set[Event Name]Handler
Name of the event handler: Application specific

In CExpatImpl <class _T>:

Enable the event handler routine: Enable[Event Name]Handler
Name of the event handler: On[Event Name]
Name of the internal event handler: [Event Name]Handler

So, if you wish to receive StartElement events, you define a method called OnStartElement with the proper arguments and invoke EnableStartElementHandler with true as the only argument.  The event routine can later be disabled by invoking EnableStartElementHandler again with false as the only argument.
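The three names fit together roughly as sketched below.  This is an illustration of the pattern and not the wrapper's actual source: FakeParser stands in for the C API side, CFakeImpl for the template class, and CNameLogger for the application class, and all three names are hypothetical.  Enabling installs a static routine with a C-compatible signature, and that routine forwards to the On method of the derived class.  Disabling installs a null handler.

```cpp
#include <string>

// Stand-in for the C API side: one callback plus a user data pointer
typedef void (*FakeStartHandler) (void *pvUserData, const char *pszName);

struct FakeParser
{
    void *m_pvUserData = nullptr;
    FakeStartHandler m_pfnStart = nullptr;

    // Called by the "parser" when it sees a start tag
    void ReportStart (const char *pszName)
    {
        if (m_pfnStart != nullptr)
            m_pfnStart (m_pvUserData, pszName);
    }
};

template <class T>
class CFakeImpl
{
public:
    CFakeImpl ()
    {
        m_sParser.m_pvUserData = this;
    }

    // Enable[Event Name]Handler: install or remove the internal handler
    void EnableStartElementHandler (bool fEnable = true)
    {
        m_sParser.m_pfnStart = fEnable ? &CFakeImpl::StartElementHandler : nullptr;
    }

    FakeParser m_sParser;

protected:
    // [Event Name]Handler: static routine the C side can call
    static void StartElementHandler (void *pvUserData, const char *pszName)
    {
        CFakeImpl *pBase = static_cast <CFakeImpl *> (pvUserData);
        static_cast <T *> (pBase) ->OnStartElement (pszName);
    }
};

class CNameLogger : public CFakeImpl <CNameLogger>
{
public:
    std::string m_strLog;

    // On[Event Name]: the application's handler
    void OnStartElement (const char *pszName)
    {
        m_strLog += pszName;
        m_strLog += ';';
    }
};
```

Events reported while the handler is disabled never reach OnStartElement.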

The specifics of each of the event routines are beyond the scope of this document.  For more information about the events and the Expat parser itself, see http://www.xml.com/pub/a/1999/09/expat/index.html.  Almost everything described in that document has a counterpart of the same name in CExpatImpl <class _T>.

Implementation Notes

As stated earlier, there are some pitfalls that applications have to be aware of when creating complex derived class hierarchies.  Let us consider the example of an XML parser consisting of two classes, CMyXMLBase and CMyXML.  CMyXML is derived from CMyXMLBase, and CMyXMLBase is derived from one of the Expat wrapper classes.

Consider the case where the classes are derived from the CExpatImpl <class _T> template class.

class CMyXMLBase : public CExpatImpl <CMyXMLBase> 
{
public:

	CMyXMLBase () 
	{
	}
	
	void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs) 
	{
		// do useful stuff here... 

		return;
	}
};

class CMyXML : public CMyXMLBase
{ 
public:

	CMyXML ()
	{
	}
	
	void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs) 
	{
		// do derived useful stuff here...

		return;
	}
};

In this case, the programmer expects CMyXML::OnStartElement to be invoked by the Expat parser.  However, due to the design of the CExpatImpl <class _T> class, only the methods of the class specified in the template argument list (here CMyXMLBase) are invoked.  This is by design.
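The effect can be reproduced without Expat.  The sketch below assumes the wrapper dispatches by casting to its template argument.  CDispatchImpl, CBaseParser, CDerivedParser and WhichHandlerRuns are hypothetical names used only for this illustration.

```cpp
#include <string>

// Hypothetical stand-in for the template wrapper
template <class T>
class CDispatchImpl
{
public:
    // Simulates the parser raising an event
    std::string Fire ()
    {
        return static_cast <T *> (this) ->OnStartElement ();
    }
};

// Plays the role of CMyXMLBase: it names itself as the template argument
class CBaseParser : public CDispatchImpl <CBaseParser>
{
public:
    std::string OnStartElement ()
    {
        return "CBaseParser";
    }
};

// Plays the role of CMyXML: its handler only hides the base handler
class CDerivedParser : public CBaseParser
{
public:
    std::string OnStartElement ()
    {
        return "CDerivedParser";
    }
};

std::string WhichHandlerRuns ()
{
    CDerivedParser sParser;
    return sParser.Fire (); // returns "CBaseParser"
}
```

The template only knows about CBaseParser, so the handler in CDerivedParser is never reached.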

There are three different ways to fix this problem.  The first is to declare OnStartElement as virtual in CMyXMLBase.  The second is to derive CMyXMLBase from CExpat instead of CExpatImpl <class _T>.  The third requires changing CMyXMLBase from a normal class to a template.  This change provides CExpatImpl <class _T> with the name of the class in which to locate the event routines.

template <class _T>
class CMyXMLBase : public CExpatImpl <_T> 
{
public:

	CMyXMLBase () 
	{
	}
	
	void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs) 
	{
		// do useful stuff here... 

		return;
	}
};

class CMyXML : public CMyXMLBase <CMyXML>
{ 
public:

	CMyXML ()
	{
	}
	
	void OnStartElement (const XML_Char *pszName, const XML_Char **papszAttrs) 
	{
		// do derived useful stuff here...

		return;
	}
};
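The first fix listed above, declaring the handler virtual in the intermediate class, can be sketched in the same Expat-free way.  CVDispatchImpl, CVBaseParser, CVDerivedParser and WhichHandlerRunsVirtual are hypothetical names, and the dispatch by cast is an assumption about how the wrapper works.  The template still calls through the intermediate class, but the virtual call then reaches the most derived override.

```cpp
#include <string>

// Hypothetical stand-in for the template wrapper
template <class T>
class CVDispatchImpl
{
public:
    // Simulates the parser raising an event
    std::string Fire ()
    {
        return static_cast <T *> (this) ->OnStartElement ();
    }
};

// Intermediate class: the handler is now virtual
class CVBaseParser : public CVDispatchImpl <CVBaseParser>
{
public:
    virtual ~CVBaseParser ()
    {
    }

    virtual std::string OnStartElement ()
    {
        return "CVBaseParser";
    }
};

// Most derived class: overrides the virtual handler
class CVDerivedParser : public CVBaseParser
{
public:
    std::string OnStartElement () override
    {
        return "CVDerivedParser";
    }
};

std::string WhichHandlerRunsVirtual ()
{
    CVDerivedParser sParser;
    return sParser.Fire (); // returns "CVDerivedParser"
}
```

This keeps CMyXMLBase a normal class at the cost of one virtual call per event.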

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
