CExtensibleMarkupLanguageDocument

$Revision: 44 $

Description

This class is the mother of all XML classes. It holds things like the element tree and the settings that apply to the entire document. It is designed to help application developers handle XML-like data. It will parse (and construct) well-formed, standalone XML documents. It will also let you loosen the parsing rules when dealing with XML from sources you can't control.

Construction

CExtensibleMarkupLanguageDocument()
CExtensibleMarkupLanguageDocument( const CExtensibleMarkupLanguageDocument& source )

Creates another CExtensibleMarkupLanguageDocument.

Methods

BOOL AddCallback( const char * element_name, XML_ELEMENT_CALLBACK callback, void * callback_parameter )

Allows you to specify a function (and a parameter for that function) that will be called when an element with a tag matching element_name has been successfully parsed. The element_name comparison is not case sensitive.
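
The matching rule can be pictured with a small, self-contained sketch. This is not the WFC implementation: ToyElement, ToyCallbackList and count_hits are hypothetical names invented for illustration, and the only behavior taken from the text above is that a callback plus its parameter are registered under a name and that the name comparison ignores case.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for an element; only the tag name matters here.
struct ToyElement
{
   std::string tag;
};

// Same shape as the callback in the Example section: user parameter first, element second.
typedef void (*ToyCallback)( void * parameter, ToyElement * element_p );

struct ToyCallbackEntry
{
   std::string name; // stored in lower case so lookups ignore case
   ToyCallback callback;
   void *      parameter;
};

static std::string to_lower( std::string text )
{
   std::transform( text.begin(), text.end(), text.begin(),
      []( unsigned char c ) { return static_cast<char>( std::tolower( c ) ); } );
   return text;
}

class ToyCallbackList
{
   public:

      bool Add( const char * element_name, ToyCallback callback, void * parameter )
      {
         if ( element_name == nullptr || callback == nullptr )
         {
            return false;
         }

         ToyCallbackEntry entry;
         entry.name      = to_lower( element_name );
         entry.callback  = callback;
         entry.parameter = parameter;
         m_Entries.push_back( entry );
         return true;
      }

      // Calls every callback registered for this element's tag; returns how many were called.
      int Execute( ToyElement * element_p ) const
      {
         int calls = 0;
         const std::string tag = to_lower( element_p->tag );

         for ( const ToyCallbackEntry& entry : m_Entries )
         {
            if ( entry.name == tag )
            {
               entry.callback( entry.parameter, element_p );
               ++calls;
            }
         }

         return calls;
      }

   private:

      std::vector<ToyCallbackEntry> m_Entries;
};

// Example callback: counts invocations through the user parameter.
static void count_hits( void * parameter, ToyElement * )
{
   ++( *static_cast<int *>( parameter ) );
}
```

Registering count_hits under "stanza" and then executing the list for an element tagged "STANZA" would call it once; an element tagged "line" would call nothing.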

void Append( const CExtensibleMarkupLanguageDocument& source )

Appends the elements of source to this document.

void Copy( const CExtensibleMarkupLanguageDocument& source )

Copies the contents of source to this object. It will not copy the callback functions as this may cause unintentional results.

void CopyCallbacks( const CExtensibleMarkupLanguageDocument& source )

Copies the callback functions from source to this object. If you are a careful programmer, this is perfectly safe to do. Generally speaking, you shouldn't have to copy the callbacks of source because parsing should have already taken place.

DWORD CountElements( const CString& element_name ) const

Counts the number of elements. element_name takes much the same form as used in the GetElement() method. Consider the following XML snippet:

<Southpark>
   <Characters>
      <Boy>Cartman</Boy>
      <Boy>Kenny</Boy>
      <Boy>Kyle</Boy>
      <Boy>Stan</Boy>
   </Characters>
   <Characters>
      <Girl>Wendy</Girl>
      <Boy>Chef</Boy>
      <Girl>Ms. Ellen</Girl>
   </Characters>
</Southpark>

If you wanted to know how many "Boy" elements there are in the first set of characters, you would use an element name of "Southpark.Characters.Boy". If you wanted to know how many "Girl" elements there are in the second set of characters, you would use this for element_name: "Southpark.Characters(1).Girl"
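
To make the counting rule concrete, here is a minimal sketch that runs against a toy tree rather than the real classes. ToyNode, PathStep, count_elements and make_southpark are all hypothetical names; the sketch assumes the path has already been split into parent steps plus a final name, and it only models the idea of walking to a parent and counting its same-named children.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an element: a name plus child elements.
struct ToyNode
{
   std::string          name;
   std::vector<ToyNode> children;
};

// One parent step of a path such as "Characters(1)".
struct PathStep
{
   std::string name;
   std::size_t index; // which same-named sibling to descend into
};

// Returns the index-th child of parent that is called name, or nullptr.
static const ToyNode * find_child( const ToyNode& parent, const std::string& name, std::size_t index )
{
   for ( const ToyNode& child : parent.children )
   {
      if ( child.name == name )
      {
         if ( index == 0 )
         {
            return &child;
         }

         --index;
      }
   }

   return nullptr;
}

// Walks the parent steps, then counts children of the final parent called leaf_name.
static std::size_t count_elements( const ToyNode& document, const std::vector<PathStep>& parents, const std::string& leaf_name )
{
   const ToyNode * node = &document;

   for ( const PathStep& step : parents )
   {
      node = find_child( *node, step.name, step.index );

      if ( node == nullptr )
      {
         return 0;
      }
   }

   std::size_t count = 0;

   for ( const ToyNode& child : node->children )
   {
      if ( child.name == leaf_name )
      {
         ++count;
      }
   }

   return count;
}

static ToyNode make_group( const std::vector<std::string>& child_names )
{
   ToyNode group;
   group.name = "Characters";

   for ( const std::string& child_name : child_names )
   {
      ToyNode child;
      child.name = child_name;
      group.children.push_back( child );
   }

   return group;
}

// Builds the Southpark snippet shown above under an unnamed document node.
static ToyNode make_southpark()
{
   ToyNode root;
   root.name = "Southpark";
   root.children.push_back( make_group( { "Boy", "Boy", "Boy", "Boy" } ) );
   root.children.push_back( make_group( { "Girl", "Boy", "Girl" } ) );

   ToyNode document;
   document.children.push_back( root );
   return document;
}
```

In this toy model, the steps Southpark(0) and Characters(1) followed by the name "Girl" count the two girls in the second group.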

void Empty( void )

Empties the contents of the document. The object is reset to an initial state. All elements are deleted. All callbacks are deleted.

BOOL EnumerateCallbacks( DWORD& enumerator ) const

Initializes the enumerator in preparation for calling GetNextCallback(). If there are no callbacks (i.e. AddCallback() has not been called), FALSE will be returned. If there are callbacks, TRUE will be returned.

void ExecuteCallbacks( CExtensibleMarkupLanguageElement * element_p )

This is generally called during the parsing of a document by the CExtensibleMarkupLanguageElement that just parsed itself. However, you can pull an element out of the document and call ExecuteCallbacks() yourself.

void GetAutomaticIndentation( BOOL& automatically_indent, DWORD& indentation_level, DWORD& indent_by ) const

Retrieves the automatic indentation parameters. Automatic indentation does nothing but make the XML output look pretty. It makes it easier for humans to read. If your application is sensitive to white space, don't use automatic indentation.

DWORD GetConversionCodePage( void ) const

Returns the code page that will be used for conversion from UNICODE.

CExtensibleMarkupLanguageElement * GetElement( const CString& element_name ) const

Searches for and finds the specified element in the document. The element_name is in the form of "Parent(0).Child(0)". Consider the following XML snippet:

<Southpark>
   <Characters>
      <Boy>Cartman</Boy>
      <Boy>Kenny</Boy>
      <Boy>Kyle</Boy>
      <Boy>Stan</Boy>
   </Characters>
   <Characters>
      <Girl>Wendy</Girl>
      <Boy>Chef</Boy>
      <Girl>Ms. Ellen</Girl>
   </Characters>
</Southpark>

To retrieve the element for Cartman, element_name should be "Southpark.Characters.Boy". As that example shows, an omitted index is the same as (0). If you want Ms. Ellen (even though she doesn't play for the home team) you would use "Southpark.Characters(1).Girl(1)".
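
The element_name syntax can be sketched as a small splitter. This is an illustration only, not the library's parser: PathStep and split_element_name are hypothetical names, the default '.' separator is taken from the examples above, and treating a missing "(n)" as index 0 follows the Cartman example.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// One step of a path such as "Characters(1)".
struct PathStep
{
   std::string name;
   std::size_t index;
};

// Splits "Parent(0).Child(1)" into steps. A step without "(n)" gets index 0.
static std::vector<PathStep> split_element_name( const std::string& element_name, char separator = '.' )
{
   std::vector<PathStep> steps;
   std::size_t start = 0;

   while ( start <= element_name.size() )
   {
      std::size_t end = element_name.find( separator, start );

      if ( end == std::string::npos )
      {
         end = element_name.size();
      }

      const std::string part = element_name.substr( start, end - start );

      PathStep step;
      step.name  = part;
      step.index = 0;

      const std::size_t open = part.find( '(' );

      if ( open != std::string::npos )
      {
         step.name  = part.substr( 0, open );
         step.index = std::strtoul( part.c_str() + open + 1, nullptr, 10 );
      }

      // Empty pieces (for example from a trailing separator) are skipped.
      if ( ! step.name.empty() )
      {
         steps.push_back( step );
      }

      start = end + 1;
   }

   return steps;
}
```

Passing a different separator character mirrors what SetParentChildSeparatorCharacter() is described as allowing, so "Southpark/Characters(1)" with '/' would split the same way.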

void GetEncoding( CString& encoding ) const

Returns the encoding of the document.

const CExtensibleMarkupLanguageEntities& GetEntities( void ) const

Returns a const reference to the entities for this document. Basically all you can do with it is enumerate the entries.

BOOL GetIgnoreWhiteSpace( void ) const

Returns whether or not the document will suppress the output of elements that contain only space characters. This output occurs when you call WriteTo().

BOOL GetNextCallback( DWORD& enumerator, CString& element_name, XML_ELEMENT_CALLBACK& callback, void *& callback_parameter )

Retrieves the next callback. It will return TRUE if the callback has been retrieved or FALSE if you are at the end of the list. If FALSE is returned, all parameters are set to NULL. Callbacks are added via the AddCallback() method.
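
The EnumerateCallbacks()/GetNextCallback() pair follows a common enumerator pattern, sketched below on a plain list of names. ToyNameList and collect are hypothetical; the sketch assumes the enumerator is simply a position in the list, which the documentation does not state.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class ToyNameList
{
   public:

      void Add( const std::string& name )
      {
         m_Names.push_back( name );
      }

      // Resets the cursor; false means there is nothing to enumerate.
      bool Enumerate( std::size_t& enumerator ) const
      {
         enumerator = 0;
         return ! m_Names.empty();
      }

      // Copies the next entry out and advances. At the end, clears the output and returns false.
      bool GetNext( std::size_t& enumerator, std::string& name ) const
      {
         if ( enumerator >= m_Names.size() )
         {
            name.clear();
            return false;
         }

         name = m_Names[ enumerator ];
         ++enumerator;
         return true;
      }

   private:

      std::vector<std::string> m_Names;
};

// Typical calling loop: initialize the enumerator, then pull entries until false comes back.
static std::vector<std::string> collect( const ToyNameList& list )
{
   std::vector<std::string> found;
   std::size_t enumerator = 0;
   std::string name;

   if ( list.Enumerate( enumerator ) )
   {
      while ( list.GetNext( enumerator, name ) )
      {
         found.push_back( name );
      }
   }

   return found;
}
```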

DWORD GetNumberOfElements( void ) const

Returns the number of elements in this document.

TCHAR GetParentChildSeparatorCharacter( void ) const

Returns the character that will be used to separate parent element names from child element names in the GetElement() method.

DWORD GetParseOptions( void ) const

Returns the parse options. This is a bit field (32 bits wide) that controls the sloppiness of the parser.

void GetParsingErrorInformation( CString& tag_name, CParsePoint& beginning, CParsePoint& error_location, CString * error_message = NULL ) const

If Parse() returns FALSE, you can call this method to find out interesting information as to where the parse failed. This will help you correct the XML. If error_message is not NULL, it will be filled with a human readable error message. The beginning parameter is filled with the location in the document where the element began. The error_location parameter is filled with the location where the parser encountered the fatal problem.

CExtensibleMarkupLanguageElement * GetRootElement( void ) const

Returns the pointer to the ultimate parent element. This will be the element that contains the data from the <?xml ... ?> line.

void GetVersion( CString& version ) const

Returns the version of the document.

DWORD GetWriteOptions( void ) const

Returns the writing options. This is a bit field (32 bits wide) that controls how the XML documents are written.

BOOL IsStandalone( void ) const

Returns TRUE if this is a standalone document.

BOOL Parse( const CDataParser& source )

Parses the data from source. This will construct the document tree.

BOOL RemoveCallback( const char * element_name, XML_ELEMENT_CALLBACK callback, void * callback_parameter )

This will remove the specified callback from the list. All parameters must match for the callback to be removed.

BOOL ResolveEntity( const CString& entity, CString& resolved_to ) const

This method will resolve the entity and put the result into resolved_to. If the entity cannot be resolved, it will return FALSE.
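
A rough sketch of the resolve-or-fail contract, using a plain lookup table. ToyEntities is a hypothetical class; the sketch assumes entities are looked up in their full reference form (ampersand and semicolon included) and that the output is cleared on failure, neither of which the documentation confirms.

```cpp
#include <map>
#include <string>

class ToyEntities
{
   public:

      // Starts with the five default entities.
      ToyEntities()
      {
         m_Table[ "&amp;"  ] = "&";
         m_Table[ "&apos;" ] = "'";
         m_Table[ "&gt;"   ] = ">";
         m_Table[ "&lt;"   ] = "<";
         m_Table[ "&quot;" ] = "\"";
      }

      void Add( const std::string& entity, const std::string& text )
      {
         m_Table[ entity ] = text;
      }

      // Returns true and fills resolved_to when the entity is known.
      // On failure resolved_to is cleared (a choice made for this sketch).
      bool Resolve( const std::string& entity, std::string& resolved_to ) const
      {
         const std::map<std::string, std::string>::const_iterator found = m_Table.find( entity );

         if ( found == m_Table.end() )
         {
            resolved_to.clear();
            return false;
         }

         resolved_to = found->second;
         return true;
      }

   private:

      std::map<std::string, std::string> m_Table;
};
```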

void SetAutomaticIndentation( BOOL automatically_indent = TRUE, DWORD starting_column = 0, DWORD indent_by = 2 )

This will turn automatic indentation on or off.
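
How the three parameters might combine can be sketched as follows. This is a guess at the arithmetic, not the library's writer: indent_prefix and write_chain are hypothetical, and the sketch assumes an element nested depth levels deep starts at column starting_column + depth * indent_by, with one tag per line.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Leading spaces for a tag nested depth levels deep (assumed formula, see above).
static std::string indent_prefix( bool automatically_indent, std::size_t starting_column, std::size_t indent_by, std::size_t depth )
{
   if ( ! automatically_indent )
   {
      return std::string();
   }

   return std::string( starting_column + ( depth * indent_by ), ' ' );
}

// Writes tags nested inside one another, e.g. { "a", "b" } becomes <a><b></b></a>.
static std::string write_chain( const std::vector<std::string>& tags, bool automatically_indent, std::size_t starting_column = 0, std::size_t indent_by = 2 )
{
   std::string output;
   const std::string newline = automatically_indent ? "\n" : "";

   // Opening tags, outermost first.
   for ( std::size_t depth = 0; depth < tags.size(); ++depth )
   {
      output += indent_prefix( automatically_indent, starting_column, indent_by, depth );
      output += "<" + tags[ depth ] + ">" + newline;
   }

   // Closing tags, innermost first.
   for ( std::size_t depth = tags.size(); depth-- > 0; )
   {
      output += indent_prefix( automatically_indent, starting_column, indent_by, depth );
      output += "</" + tags[ depth ] + ">" + newline;
   }

   return output;
}
```

With indentation off the sketch emits no extra characters at all, which is why a white space sensitive consumer would want it off: every space and newline added for readability becomes part of the data.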

BOOL SetConversionCodePage( DWORD new_page )

When you must convert from UNICODE to something else, this is the code page that will be used. See the WideCharToMultiByte() Win32 API for more information. If the code is run on a real operating system (NT), the default code page is CP_UTF8. If you are running on a piece of crap (Windows 95) the default code page is CP_ACP.

void SetEncoding( LPCTSTR encoding )

Sets the encoding of the document. You will usually do this when you are about to write the document.

BOOL SetIgnoreWhiteSpace( BOOL ignore_whitespace )

Tells the document whether or not to ignore text segments that contain only space characters. It returns what the previous setting was.

BOOL SetParentChildSeparatorCharacter( TCHAR separator )

Allows you to specify the character that will separate parent and child names in the GetElement() call.

DWORD SetParseOptions( DWORD new_options )

Sets the parsing options. This allows you to customize the parser to be as loose or as strict as you want. The default is to be as strict as possible when parsing. SetParseOptions() returns the previous options. Here are the current parse options that can be set:

WFC_XML_IGNORE_CASE_IN_XML_DECLARATION - When set, this option will allow uppercase letters in the XML declaration. For example: <?XmL ?> will be allowed even though it does not conform to the specification.
WFC_XML_ALLOW_REPLACEMENT_OF_DEFAULT_ENTITIES - Though the XML specification doesn't talk about it, what should a parser do if default entities are replaced? If you set this option, the parser will allow replacement of the default entities. Here is a list of the default entities:
- &amp;
- &apos;
- &gt;
- &lt;
- &quot;
WFC_XML_FAIL_ON_ILL_FORMED_ENTITIES - Not yet implemented. It will allow the parser to ignore ill formed entities such as
<!ENTITY amp "&">
WFC_XML_IGNORE_ALL_WHITE_SPACE_ELEMENTS - Tells the parser to ignore elements (of type typeTextSegment) that contain nothing but white space characters. WARNING! If you use this option, it will not be possible to reproduce that input file exactly. Elements that contain nothing but white spaces will be deleted from the document.
WFC_XML_IGNORE_MISSING_XML_DECLARATION - Tells the parser to ignore the fact that the <?xml ?> element is missing. If it was not specified in the data stream, one will be automatically added to the document. This is the default behavior.
WFC_XML_DISALLOW_MULTIPLE_ELEMENTS - Tells the parser to reject documents that contain more than one root element. The first rule (Rule 1) of the XML specification says (like Connor MacLeod of the clan MacLeod) there can be only one element in an XML document. That element can have a billion child elements but there can be only one root element. If this option is set (it is not set by default), the parser will strictly enforce this rule. When it is not set, multiple root elements are allowed, because this rule really gets in the way of using XML for things like log files (where you want to open the file, append a record to it and close the file).
WFC_XML_LOOSE_COMMENT_PARSING - Tells the parser to allow double dashes (--) to appear in comment tags.
WFC_XML_ALLOW_AMPERSANDS_IN_ELEMENTS - Tells the parser to allow &'s to appear in the contents of an element without being a reference of some kind.
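
Because the options are a bit field and SetParseOptions() returns the previous value, options are normally combined with bitwise OR and can be restored afterwards. The sketch below shows that pattern on a toy class. The TOY_* constants and their numeric values are made up for illustration; the real WFC_XML_* values are defined in the WFC headers and are not reproduced here.

```cpp
#include <cstdint>

// Hypothetical bit assignments, for illustration only.
const std::uint32_t TOY_IGNORE_CASE_IN_XML_DECLARATION = 0x00000001;
const std::uint32_t TOY_LOOSE_COMMENT_PARSING          = 0x00000002;
const std::uint32_t TOY_ALLOW_AMPERSANDS_IN_ELEMENTS   = 0x00000004;

class ToyParserSettings
{
   public:

      std::uint32_t GetParseOptions() const
      {
         return m_Options;
      }

      // Replaces the whole bit field and hands back the old one.
      std::uint32_t SetParseOptions( std::uint32_t new_options )
      {
         const std::uint32_t previous = m_Options;
         m_Options = new_options;
         return previous;
      }

      bool IsSet( std::uint32_t option ) const
      {
         return ( m_Options & option ) != 0;
      }

   private:

      std::uint32_t m_Options = 0; // nothing relaxed in this toy
};

// Turns one option on without disturbing the others; returns the previous bit field.
static std::uint32_t enable_option( ToyParserSettings& settings, std::uint32_t option )
{
   return settings.SetParseOptions( settings.GetParseOptions() | option );
}
```

Keeping the value returned by the setter makes it easy to loosen the parser for one untrusted input and then put the original options back.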

void SetParsingErrorInformation( const CString& tag_name, const CParsePoint& beginning, const CParsePoint& error_location, LPCTSTR error_message = NULL )

This method is usually called by the element that cannot parse itself. There is logic that prevents the information from being overwritten by subsequent calls to SetParsingErrorInformation(). This means you can call SetParsingErrorInformation() as many times as you want but only information from the first call will be recorded (and reported via GetParsingErrorInformation()) for each call to Parse().
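
The "first call wins" rule can be sketched with a small record class. ToyErrorRecord is hypothetical and much simpler than the real thing (no CParsePoint locations); the Reset() call stands in for whatever Parse() does at the start of each run, which is an assumption.

```cpp
#include <string>

class ToyErrorRecord
{
   public:

      // Assumed to happen at the start of every parse so a new first error can be recorded.
      void Reset()
      {
         m_HasError = false;
         m_Tag.clear();
         m_Message.clear();
      }

      // Records the error only if nothing has been recorded since the last Reset().
      void Set( const std::string& tag_name, const std::string& message )
      {
         if ( m_HasError )
         {
            return;
         }

         m_HasError = true;
         m_Tag      = tag_name;
         m_Message  = message;
      }

      // Returns false when no error has been recorded.
      bool Get( std::string& tag_name, std::string& message ) const
      {
         tag_name = m_Tag;
         message  = m_Message;
         return m_HasError;
      }

   private:

      bool        m_HasError = false;
      std::string m_Tag;
      std::string m_Message;
};
```

The point of the guard is that the innermost element that failed reports first, and the parents that fail as a consequence cannot overwrite the more useful information.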

void SetStandalone( BOOL standalone )

Sets the standalone attribute of the document. This is usually done just before you start writing the document.

void SetVersion( LPCTSTR version )

Sets the version of the document. This is usually done just before you start writing the document.

DWORD SetWriteOptions( DWORD new_options )

Sets the writing options. This allows you to customize how the XML is written. The default is to be as strict as possible when writing. SetWriteOptions() returns the previous options. Here are the current options that can be set:

WFC_XML_INCLUDE_TYPE_INFORMATION - Not Yet Implemented. XML is woefully inept at handling data. They use things called DTD's but they have a "the world is flat" outlook on life. DTD's lack the ability to scope. It would be like programming where all variables have to have unique names no matter what function they exist in. DTD's are used to give HTML browsers the ability to correctly display XML. They also give you the ability to do some lame data validation. In the future, I will include the ability to write type information in a programmer friendly fashion. This type information is intended to be a programmer-to-programmer communication medium.
WFC_XML_DONT_OUTPUT_XML_DECLARATION - This allows you to skip writing the XML declaration when you output. For example, this XML document:
```
<?xml version="1.0" ?>
<TRUTH>Sam loves Laura.</TRUTH>
```
Would look like this when WFC_XML_DONT_OUTPUT_XML_DECLARATION is set:
```
<TRUTH>Sam loves Laura.</TRUTH>
```
WFC_XML_WRITE_AS_UNICODE - This tells the document to write output as UNICODE (two bytes per character). It will default to writing in little endian format.
WFC_XML_WRITE_AS_BIG_ENDIAN - This tells the document to write UNICODE or UCS4 characters in big endian format.
WFC_XML_WRITE_AS_UCS4 - This will write the data in UCS4 (four bytes per character). The default is to write in little endian format. For example, the < character would come out as these bytes 3C 00 00 00
WFC_XML_WRITE_AS_UCS4_UNUSUAL_2143 - This will format the output in an unusual 4 bytes per character format. For example, the < character would come out as these bytes 00 00 3C 00
WFC_XML_WRITE_AS_UCS4_UNUSUAL_3412 - This will format the output in another unusual 4 bytes per character format. For example, the < character would come out as these bytes 00 3C 00 00
WFC_XML_WRITE_AS_ASCII - This will format the output in ASCII format. This is the default.
WFC_XML_WRITE_AS_UTF8 - This will write the data out in UTF-8 format.
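
The four UCS4 byte layouts described above differ only in the order of the same four bytes. The sketch below reproduces the byte sequences quoted for the < character. Ucs4Order and encode_ucs4 are hypothetical names, not part of WFC, and the digit names assume byte 1 is the most significant byte and byte 4 the least significant.

```cpp
#include <array>
#include <cstdint>

enum class Ucs4Order
{
   BigEndian1234,
   LittleEndian4321,
   Unusual2143,
   Unusual3412
};

// Lays out one code point as four bytes in the requested order.
static std::array<std::uint8_t, 4> encode_ucs4( std::uint32_t code_point, Ucs4Order order )
{
   const std::uint8_t b1 = static_cast<std::uint8_t>( ( code_point >> 24 ) & 0xFF ); // most significant
   const std::uint8_t b2 = static_cast<std::uint8_t>( ( code_point >> 16 ) & 0xFF );
   const std::uint8_t b3 = static_cast<std::uint8_t>( ( code_point >>  8 ) & 0xFF );
   const std::uint8_t b4 = static_cast<std::uint8_t>(   code_point         & 0xFF ); // least significant

   std::array<std::uint8_t, 4> bytes = { { b1, b2, b3, b4 } }; // big endian by default

   switch ( order )
   {
      case Ucs4Order::LittleEndian4321:
         bytes = { { b4, b3, b2, b1 } };
         break;

      case Ucs4Order::Unusual2143:
         bytes = { { b2, b1, b4, b3 } };
         break;

      case Ucs4Order::Unusual3412:
         bytes = { { b3, b4, b1, b2 } };
         break;

      case Ucs4Order::BigEndian1234:
         break;
   }

   return bytes;
}
```

For the < character (0x3C) the little endian layout gives 3C 00 00 00, the 2143 layout gives 00 00 3C 00 and the 3412 layout gives 00 3C 00 00, matching the byte sequences listed above.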

void WriteTo( CByteArray& destination )

Writes the data to destination in XML form.

Operators

CExtensibleMarkupLanguageDocument& operator = ( const CExtensibleMarkupLanguageDocument& source ): Calls Copy().
CExtensibleMarkupLanguageDocument& operator += ( const CExtensibleMarkupLanguageDocument& source ): Calls Append().

Example

#include <wfc.h>
#pragma hdrstop

BOOL get_bytes( const CString& filename, CByteArray& bytes )
{
   WFCTRACEINIT( TEXT( "get_bytes()" ) );

   bytes.RemoveAll();

   CFile file;

   if ( file.Open( filename, CFile::modeRead ) == FALSE )
   {
      return( FALSE );
   }

   bytes.SetSize( file.GetLength() );

   file.Read( bytes.GetData(), bytes.GetSize() );

   file.Close();

   return( TRUE );
}

BOOL parse_document( const CString& filename, CExtensibleMarkupLanguageDocument& document )
{
   WFCTRACEINIT( TEXT( "parse_document()" ) );

   CByteArray bytes;

   if ( get_bytes( filename, bytes ) != TRUE )
   {
      return( FALSE );
   }

   CDataParser parser;

   parser.Initialize( &bytes, FALSE );

   if ( document.Parse( parser ) == TRUE )
   {
      _tprintf( TEXT( "Parsed OK\n" ) );
   }
   else
   {
      // Report failure so the caller does not write out a half-parsed document.
      _tprintf( TEXT( "Can't parse document\n" ) );
      return( FALSE );
   }

   return( TRUE );
}

void stanza_callback( void * parameter, CExtensibleMarkupLanguageElement * element_p )
{
   WFCTRACEINIT( TEXT( "stanza_callback()" ) );

   _tprintf( TEXT( "Got a stanza with %lu children\n" ), (DWORD) element_p->GetNumberOfChildren() );
}

int _tmain( int number_of_command_line_arguments, LPCTSTR command_line_arguments[] )
{
   WFCTRACEINIT( TEXT( "_tmain()" ) );

   CExtensibleMarkupLanguageDocument document;

   document.AddCallback( TEXT( "stanza" ), stanza_callback, NULL );

   if ( parse_document( TEXT( "poem.xml" ), document ) == TRUE )
   {
      CByteArray bytes;

      document.SetWriteOptions( WFC_XML_DONT_OUTPUT_XML_DECLARATION );

      document.WriteTo( bytes );

      _tprintf( TEXT( "Wrote %d bytes\n" ), bytes.GetSize() );

      CFile file;

      if ( file.Open( TEXT( "xml.out" ), CFile::modeCreate | CFile::modeWrite ) != FALSE )
      {
         file.Write( bytes.GetData(), bytes.GetSize() );
         file.Close();
      }
   }

   return( EXIT_SUCCESS );
}

Copyright, 2000, Samuel R. Blackburn
$Workfile: CExtensibleMarkupLanguageDocument.cpp $
$Modtime: 1/17/00 9:01a $