Introduction
Presented is a small, fast, non-validating DOM XML parser, contained in a single
header, having no dependencies other than the standard C libraries, and <iostream>
(KERNEL32.DLL
with WIN32). This XML parser segments a given string
in situ (like strtok
), performing scanning/tokenization, and
parsing in a single pass. Preliminary analysis shows a best-case of 22 X86 CPU
clock cycles average per input byte, and a worst-case of 108 CPU cycles, for an
executable built using VC7 with the /Og /Ob2 /Oi /Ot /Oy
compiler optimization
flags. This may be further improved by modifying the parser to discard
comments, processing instruction (PI), DOCTYPE
, and CDATA
sections, as well as using faster memory management routines, (notice how the Microsoft malloc
,
realloc
and free
implementations seem to be rather
slow).
Why Write Another XML Parser?
Of the many XML parsers available, most present significant concerns, such as
complexity, size, portability, etc. It is no small task to correctly implement
XML-reading capabilities in a C or C++ application. Of the C, C++ and COM
implementations one must gain familiarity with, and work with a large set of
functions, structures, and/or classes. In addition, the basic function of many
of these implementations is opaque, or obscured by a hairy call graph, so
implementation-specific modification or optimization is not realistically
possible.
As the use of XML gains ground in diverse industries, year after year, one finds
people doing all kinds of bizarre things with XML that were perhaps never
intended by the original working group (see for example:
HumanML, XML markup for human emotions, etc.; and
SuperX++, an XML-based programming language). The bottom line is that
these diverse needs are not best met by one 'universal' XML parsing solution,
but occasionally the requirements may indicate a customized approach.
The implementation presented here served as a study model for an XML-parsing
algorithm that was later rewritten in assembly, for an embedded implementation.
The intention here was to develop a lightweight, fully transparent XML parser,
from which the essential process of parsing XML markup could be examined, and
then pared down as needed. The source code and description may be used for
didactic purposes, as a drop-in solution for light-weight DOM-style XML parsing
in your own application, or may be used as a model or starting point for
developing your own custom parser.
Approach
The PugXML parser performs string scanning, tokenization, parsing, and
construction of the document tree structure in a single pass. A common approach
in designing a parser would be to scan and tokenize the string in an initial
pass, build the parse tree structure in the next pass, then validate in
additional passes. While this approach has many advantages in terms of
simplicity of implementation, error detection/tolerance/recovery, etc., it can
easily result in a larger, slower, more opaque implementation, as the
construction and maintenance of a token table requires additional memory and
instruction steps. Eliminating the use of the tokenization pass, while
significantly smaller, and faster, results in code that can be difficult to
follow, as some goto
statements are needed.
In our approach we avoid traversing the tree as it is being built, in order to
improve performance, although this has certain drawbacks. The XML format is
cleverly designed with redundancy and error-correction features. In developing
a single-pass XML parser we rely heavily on these in order to
minimize look-ahead and look-back scanning of the input string. A brief
examination of the nature of XML markup will permit us to address this
approach in further detail. The structure of an XML document is that of a
tree, where each tree node or branch may have certain associated data. In
our implementation, each branch may be one of the following standalone or
hierarchical types:
Standalone Types
1.) <!ATTLIST ...>
2.) <!ELEMENT ...>
3.) <!ENTITY ...>
4.) <!NOTATION ...>
5.) <?name ... ?>
6.) <name ... />
7.) <!--text-->
8.) <![CDATA[text]]>
9.) <![INCLUDE[text]]>
10.) text
Hierarchical
11.) <!DOCTYPE [...]>
12.) <name ...>...</name>
Representation
All branches, and their attributes are represented using the structures:
-
XMLATTR
is a name=value attribute structure. The *_insitu
(in situ) flags keep track of whether the string is a segment of the original
parse string (this helps with garbage collection).
-
XMLENTITY
specifies the kind of branch, as in the above discussion of "standalone" and
"hierarchical" types.
-
XMLBRANCH
is a the branch itself, which may have zero or more
attributes (pointers to XMLATTR
structures), zero or more children
(pointers to XMLBRANCH
structures), an element name, data, and a
classification. The *_space
scalars keep track of how much pointer
space is available in the array.
During the parsing process we maintain a cursor to our current branch (like a
stack). Upon hitting a standalone region, we just append a new branch as a
child of the cursor branch. Upon hitting an hierarchical region, we append a
new branch as a child of the cursor branch, then set our cursor to this new
branch (like a stack push). On leaving a hierarchical region (e.g. '</name>
'),
we set our cursor to its parent (like a stack pop). To save time, and avoid
traversing the tree as we build it (such as verifying that we are popping the
right branch), we ignore the end-tag name (e.g. '</name>
'),
and just step to the parent of the current branch. This should be OK if the
document is well-formed, but may cause problems otherwise (for example, parsing
HTML). Using this approach, the following malformed structure:
<example>
<foo>Oops!</bar>
<bar>Ouch!</foo>
<strange>Huh?</>
<goo><gaa></goo></gaa>
</example>
would parse as:
<example>
<foo>Oops!</foo>
<bar>Ouch!</bar>
<strange>Huh?</strange>
<goo><gaa /></goo>
</example>
This seems a half-way usable interpretation, however parsers checking for
well-formedness would choke on it. Because this is not compliant behavior, the
PugXML parser, and parsers taking a similar approach would be
indicated for situations requiring the parsing of machine-generated,
well-formed fully-compliant XML documents, and not human-generated documents.
See W3C Compliance below.
The Inner Loop
In developing the PugXML parser, the intention was to try to condense everything
into a single function with a single loop, removing all non-essential
operations (within a reasonable margin). The PugXML parser inner loop may be
understood most easily using the finite state machine (FSM) formalism, as
depicted in the simplified diagram below. The parser functions as a giant while()
loop on the condition that a string zero-terminator is not hit. Within this
loop a number of states are possible. Depending upon the current state:
-
A new node is appended to the current node's list of children and the cursor is
moved to this position (
::GraftBranch()
).
-
A new attribute is appended to the current node's list of attributes (
::AddAttribute()
).
-
The starting position of a string segment is stored in the current context
(e.g. a node or attribute), then the end position is located and
zero-terminated.
-
The cursor is moved to the current node's parent (
pCursor =
pCursor->parent
).
-
Any time we hit a sequence we don't like, we
continue
, in which
case we obliviously gobble up the string until we enter another valid segment.
Note: if the document is very badly formed, we may end up rejecting the entire
input.
Each vertex in the diagram below represents a character, or set of characters we
expect to encounter (see the inline static BOOL Is*(TCHAR c)
member
functions of the CPugXmlParser
class). Edges between vertices
represent control flow as each successive character, or classified segment of
characters is matched. The stippled arrows signify that you can get to the
indicated vertex from just about anywhere.
Legend |
|
|
# |
Symbol, e.g.: '<symbol ' -or- '<... symbol='
-or- '<... ...=symbol '
|
$ |
Data segment; e.g. '>data<' -or- '<... ...="data"'
-or- '<![CDATA[data]]>'
|
" |
' -or- "
|
_ |
Whitespace |
0 |
Null character |
! |
All other chars match themselves |
String Segmentation Process
Below is a graphical depiction and narration of the incremental string
segmentation process for a sample XML string, with reference to the above
diagram. Each numbered rectangle represents the state of the input string as
the parsing proceeds.
For the following string:
<example version="1.0">Text</example>
segmentation would proceed:
such that, at:
-
A
'<'
, then a symbol char is hit (see ::IsSymbol()
)
so we are now in the '<#'
state. A new node is grafted onto the
tree (pCursor= ::GraftBranch()
). The position is stored in the LPTSTR
XMLBRANCH::name
member.
-
Whitespace is hit, so the parser is now in the
'<#_'
state. The
character at this offset is set to null. We expect to hit an attribute, but we
could just as well hit a '/'
or a '>'
.
-
A symbol char is hit so we assume an attribute (
'<#_#'
) state.
An attribute is appended to the XMLATTR** XMLBRANCH::attribute
member.
The string position is stored in the attribute char* XMLATTR::name
member.
-
We scan ahead for a
'='
, ' '
, '/'
, or '>'
.
Finding an '='
, we expect attribute data to follow.
-
Finding a
'"'
, we remember this quote character, then store the
next character position in the LPTSTR XMLATTR::value
member of the current attribute.
-
We now scan ahead until we hit the terminating
'"'
, then mark
it as null.
-
We have hit a
'>'
, and there is following non-whitespace data
so we graft a new node onto the tree (pCursor= ::GraftBranch()
),
store the following position in the LPTSTR
XMLBRANCH::data
member, then restore the cursor to the current node's parent (pCursor =
pCursor->parent
).
-
We now search ahead for a
'<'
, then mark it as null,
terminating the PCDATA
segment. Finding a following '/'
,
we are in a '</'
state, so we step to the current node's parent
(pCursor = pCursor->parent
).
resulting in the document tree:
Usage
First and foremost, because the generated document tree structure is directly
indexed to the input string, you must ensure that the input string will exist
for as long as you wish to traverse the tree. In addition, the string is
altered by the parsing process (it is zero-terminated for every element name,
attribute name, PCDATA
section, etc.), so you cannot expect to use
it for anything after the parsing pass is completed. If you want to alter any
string segment (e.g. element name, attribute value, etc.) the wrapper class, CPugXmlBranch
, will allocate memory for that segment if the resulting string will not fit in
the existing space. This can result in greater memory use.
Following is a listing for a minimal application using the GnatXMLParser. Note:
See the demo project for more extensive examples.
#include <iostream>
#include "pugxml.h"
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
if(argc < 2) return 0;
CPugXmlParser* pXml = new CPugXmlParser();
pXml->ParseFile(argv[1]);
cout << pXml->GetRoot() << endl;
delete pXml;
return 0;
}
Classes (Note: See also 'pugxml.h' or the provided manual)
The PugXML parser uses only a limited set of classes, the methods of which are
all implemented inline in order to shoehorn everything into a single header.
The classes you need mainly be concerned with are CPugXmlParser
,
which provides an interface to the underlying parser function (static LPTSTR
CPugXmlParser::Parse()
), and CPugXmlBranch
, which
provides a lightweight interface to the generated document tree. If you would
like to minimize code size, cut & paste the inline functions to a new .cpp
file, and remove the inline
declarations. If you would like to use
the basic parsing facility in a C
project, dispense with the
classes, and copy the static
member functions of CPugXmlParser
,
the defines, enums, structures, and the global functions, NewAttribute()
,
NewBranch()
, GraftBranch()
, AddAttribute()
,
and FreeTree()
to your C
implementation file. The
entire implementation uses only the following externs, so it should be
relatively easy to port:
-
malloc
, memset
, realloc
, free
,
memcpy
.
-
strlen
, isalnum
, strncpy
(or their wide
equivalents via <wtypes.h>
and <tchar.h>
).
-
strtol
, strtod
, strtok
, strcpy
(ditto, for CPugXmlBranch
).
-
CreateFile
, ReadFile
, CloseHandle
, ZeroMemory
(for CPugXmlParser::ReadFileData()
, could as easily be fopen
,
fread
, fclose
).
-
ostream
(for CPugXmlBranch::Serialize()
- wanton profligacy: who needs
streams, when you have handles!).
CPugXmlBranch Class - Provides a
light-weight wrapper for manipulating XMLBRANCH structures.
|
|
|
Public Member |
Summary |
|
|
Construction/Destruction |
|
|
|
CPugXmlBranch() |
Default constructor. |
CPugXmlBranch(XMLBRANCH* p) |
Construct, wrapping the given XMLBRANCH pointer. |
CPugXmlBranch(const CPugXmlBranch& r) |
Copy constructor. |
~CPugXmlBranch() |
Destructor. |
|
|
Overloaded Operators |
|
|
|
operator XMLBRANCH*() |
Cast as XMLBRANCH pointer. |
operator LPVOID() |
Cast as void pointer. |
CPugXmlBranch& operator=(const CPugXmlBranch& r) |
Assign internal branch pointer, with no copy. |
BOOL operator==(const CPugXmlBranch& r) |
Compare internal branch pointer. |
CPugXmlBranch operator[](UINT_PTR i) |
Access the child at subscript. |
|
|
Branch Classification |
|
|
|
BOOL IsNull() |
Branch pointer is not null.
|
BOOL IsElement() |
Branch is an element.
|
BOOL IsComment() |
Branch is a comment.
|
BOOL IsPCDATA() |
Branch is PCDATA .
|
BOOL IsCDATA() |
Branch is CDATA .
|
BOOL IsPI() |
Branch is a processing instruction.
|
BOOL IsDOCTYPE() |
Branch is DOCTYPE . If TRUE , parameter
symbols will be stored as name-only attributes (e.g. SYSTEM or PUBLIC ),
and quoted sections as value-only attributes. If an inline DTD section is
present, it will be stored in the XMLBRANCH::data member (access
this using GetData() or SetData() ). You can then
parse this with a fresh CPugXmlParser to get access to the DTD
entities (see IsDTD*() ), or set the parser flags PARSE_DOCTYPE|PARSE_DTD . |
BOOL IsDTD()
BOOL IsDTD_ATTLIST()
BOOL IsDTD_ELEMENT()
BOOL IsDTD_ENTITY()
BOOL IsDTD_NOTATION() |
Branch is a DTD element, e.g. ATTLIST , ELEMENT ,
or ENTITY . If TRUE , the name, e.g. '<!ATTLIST name
...> ' will be stored in the XMLBRANCH::name member,
and all other string up to the terminating '> ' ('...> '
as above), will be stored in the XMLBRANCH::data member. An ENTITY
name of the form '<!ENTITY % name... ', or of the form '<!ENTITY
%name... ' will be stored as '%name '.
Further parsing of this data is up to you.
|
BOOL IsNamed(LPCTSTR szName) |
Branch is named szName .
|
BOOL IsRoot() |
Branch is tree root.
|
|
|
Member Inventory |
|
|
|
BOOL HasData() |
Branch has data (comment, CDATA or PCDATA).
|
BOOL HasChildren() |
Branch has 1 or more children.
|
BOOL HasAttributes() |
Branch has 1 or more attributes.
|
BOOL HasSiblings() |
Branch has one or more siblings.
|
BOOL HasName() |
Branch has a name.
|
BOOL HasAttribute(LPCTSTR szName) |
Test if branch has an attribute named szName . |
|
|
Member Accessors |
|
|
|
LPCTSTR GetName() |
Access pointer to branch name if any, else empty string.
|
XMLENTITY GetType() |
Access branch entity type.
|
LPCTSTR GetData() |
Access pointer to data if any, else empty string.
|
UINT_PTR GetChildrenCount() |
Access branch's child count.
|
CPugXmlBranch GetChildAt(UINT_PTR i) |
Access child branch at subscript as CPugXmlBranch .
|
UINT_PTR GetAttributesCount() |
Access branch's attribute count.
|
const XMLATTR* GetAttributeAt(UINT_PTR i) |
Access attribute at subscript if any, else empty attribute.
|
UINT_PTR GetSiblingsCount() |
Access branch's sibling count (parent's child count).
|
CPugXmlBranch GetSiblingAt(UINT_PTR i) |
Access sibling branch at subscript as CPugXmlBranch .
|
CPugXmlBranch GetParent() |
Access branch's parent if any, else null CPugXmlBranch . |
LPCTSTR GetAttribute(LPCTSTR szName,LPCTSTR tDefault) |
Access attribute value as LPCTSTR for attribute named
szName . If not found return tDefault . |
LONG GetAttribute(LPCTSTR szName,LONG tDefault) |
Access attribute value as LONG for attribute named szName .
If not found return tDefault . |
DOUBLE GetAttribute(LPCTSTR szName,DOUBLE tDefault) |
Access attribute value as DOUBLE for attribute named
szName . If not found return tDefault . |
BOOL GetAttribute(LPCTSTR szName,BOOL tDefault) |
Access attribute value as BOOL for attribute named szName .
If not found return tDefault . |
|
|
Name-To-Object Mapping |
|
|
|
XMLATTR* MapStringToAttributePtr(LPCTSTR szName) |
Brute force map an attribute name to a pointer to that attribute,
if found. |
UINT_PTR MapStringToAttributeIndex(LPCTSTR szName) |
Brute force map an attribute name to the index of that attribute,
if found. |
XMLBRANCH* MapStringToChildPtr(LPCTSTR szName) |
Brute force map a child name to a pointer to the first matching
child, if found. |
UINT_PTR MapStringToChildIndex(LPCTSTR szName) |
Brute force map a child name to the index of the first matching
child, if found. |
|
|
Traversal Helpers |
|
|
|
CPugXmlBranch FindFirstElement(LPCTSTR szName) |
Recursively-implemented depth-first find the first matching child
element having name szName . Use for shallow drill-downs. Returns
branch, or CPugXmlBranch(NULL) if not found. |
CPugXmlBranch FindFirstBranch(XMLENTITY eType) |
Recursively-implemented depth-first find the first matching
child branch having type eType . Use for shallow drill-downs.
Returns branch, or CPugXmlBranch(NULL) if not found. |
CPugXmlBranch FindFirstElementWhere(LPCTSTR szName,LPCTSTR
szAttr,LPCTSTR szValue) |
Recursively-implemented depth-first find the first matching
child element having name szName , an attribute named szAttr ,
and value of said attribute being szValue . Returns branch, or CPugXmlBranch(NULL)
if not found. |
CPugXmlBranch FindFirstElemData(LPCTSTR szName,LPCTSTR szData) |
Recursively-implemented depth-first find the first matching
element also having matching PCDATA. |
CPugXmlBranch FindFirstElemAttr(LPCTSTR szName,LPCTSTR
szAttr,LPCTSTR szValue) |
Recursively-implemented depth-first find the first matching
element also having matching attribute. |
void FindAllElements(LPCTSTR szName,CPugXmlBranchArray&
rFound) |
Find all elements having the given name. |
LPTSTR CompilePath(LPCTSTR szDelim) |
Compile an absolute branch directory-style path from root as a
text string (e.g. '/document/branch '). |
CPugXmlBranch FindByPath(LPCTSTR szPath,LPCTSTR szDelim) |
Search for a branch by directory-style path (e.g. '/document/branch '
-or- '../foo ' -or- './../bar ', etc.). |
BOOL Traverse(CPugXmlFilter& rFilter)
|
Recursively traverse the tree, receiving notification in CPugXmlFilter -derived
class OnBranch() callback. |
BOOL MoveToRoot() |
Move the cursor (_m_pRoot ) to the absolute root of
the document tree. |
BOOL MoveToParent() |
Move the cursor (_m_pRoot ) to the current branch's
parent. |
BOOL MoveToSibling(UINT_PTR i) |
Move the cursor (_m_pRoot ) to the current branch's
sibling at subscript. Equivalent to MoveToChild following MoveToParent . |
BOOL MoveToFirstSibling(LPCTSTR szName) |
Move the cursor (_m_pRoot ) to the current branch's
sibling matching given name. |
BOOL MoveToNextSibling(LPCTSTR szName) |
Move to the current branch's next sibling by position and name. |
BOOL MoveToNextSibling() |
Move to the current branch's next sibling by position. |
BOOL MoveToChild(UINT_PTR i) |
Move the cursor (_m_pRoot ) to the current branch's
child at subscript. |
BOOL MoveToChild(LPCTSTR szName) |
Move the cursor (_m_pRoot ) to the current branch's
child matching given name. |
|
|
Editorial Helpers |
|
|
|
void Create() |
Create a new, empty root, for de novo tree construction. |
static BOOL SetStringMember(LPTSTR* pDest,LPCTSTR szSrc,LPBOOL
pInSitu) |
Set structure string member to given value. |
BOOL SetAttributeName(UINT_PTR i,LPCTSTR newVal) |
Set attribute name at subscript. |
BOOL SetAttributeName(LPCTSTR szName,LPCTSTR newVal) |
Set attribute name where name is now szName . |
BOOL SetAttributeValue(UINT_PTR i,LPCTSTR newVal) |
Set attribute value at subscript. |
BOOL SetAttributeValue(LPCTSTR szName,LPCTSTR newVal) |
Set attribute value to newVal where name is szName . |
BOOL SetAttributeValue(LPCTSTR szName,LONG newVal) |
Set attribute value to newVal where name is szName . |
BOOL SetAttributeValue(LPCTSTR szName,DOUBLE newVal) |
Set attribute value to newVal where name is szName . |
BOOL SetAttributeValue(LPCTSTR szName,BOOL newVal) |
Set attribute value to newVal where name is szName . |
BOOL SetName(LPCTSTR newVal) |
Set element name. |
BOOL SetData(LPCTSTR newVal) |
Set branch data (PCDATA , CDATA , or
comment). |
BOOL DeleteAttributeAt(UINT_PTR i) |
Remove attribute at the given subscript. |
BOOL DeleteAttributeAt(LPCTSTR szName) |
Remove attribute having the given name. |
BOOL AddAttribute(LPCTSTR szName,LPCTSTR szValue) |
Append a new attribute to the branch list of attributes. |
BOOL AddAttribute(LPCTSTR szName,LONG lValue) |
Append a new attribute of type LONG to the branch
list of attributes. |
BOOL AddAttribute(LPCTSTR szName,DOUBLE dValue) |
Append a new attribute of type DOUBLE to the branch
list of attributes. |
BOOL AddAttribute(LPCTSTR szName,BOOL bValue) |
Append a new attribute of type BOOL to the branch
list of attributes.
|
XMLENTITY SetType(XMLENTITY eType) |
Set the current branch entity type. |
CPugXmlBranch AppendChild(XMLENTITY eType) |
Allocate & append a child branch of the given type at the end
of the current branch array of children.
|
CPugXmlBranch InsertChildAt(UINT_PTR i,XMLENTITY eType) |
Allocate & insert a child branch of the given type at
subscript. |
BOOL DeleteChildAt(UINT_PTR i) |
Delete the child branch at the given subscript. |
|
|
Output Helpers |
|
<hr noShade SIZE="1">
|
|
static void Serialize(ostream& rOs,CPugXmlIndent&
rIndent,XMLBRANCH* pBranch,BOOL bLineBreaks) |
Stream output. Recursively writes the given XMLBRANCH
structure to the given stream. |
void Serialize(ostream& rOs,TCHAR cIndentChar,BOOL
bLineBreaks) |
Stream output. Recursively writes the internal XMLBRANCH
structure to the given stream. |
friend ostream& operator<<(ostream&
rOs,CPugXmlBranch& rBranch) |
Stream output operator. Wraps Serialize . |
CPugXmlParser Class - Provides a
high-level interface to the XML parser.
|
|
|
Public Member |
Summary |
|
|
Construction/Destruction |
|
|
|
CPugXmlParser(LONG lGrowSize,BOOL bAutoDelete) |
Default constructor. |
CPugXmlParser(LPTSTR s,LONG lGrowSize,BOOL
bAutoDelete) |
Direct parse constructor. |
~CPugXmlParser()
|
Destructor.
|
|
|
Accessors/Operators |
|
|
|
operator XMLBRANCH*()
|
Cast as XMLBRANCH pointer to root. |
operator CPugXmlBranch()
|
Cast as CPugXmlBranch (same as GetRoot() ). |
CPugXmlBranch GetRoot()
|
Returns the root wrapped by an CPugXmlBranch . |
void Clear() |
Clear any existing tree or string data.
|
XMLBRANCH* Attach(XMLBRANCH* pRoot) |
Attach an externally-generated root to the parser. |
XMLBRANCH* Detach() |
Detach the current root from the parser. |
|
|
Parsing Helpers |
|
|
|
LPTSTR Parse(LPTSTR s,DWORD dwOpts)
|
Parse the given XML string in-situ, using the specified options. Returns any remaining string or null.
|
BOOL ParseFile(LPCTSTR szPath,DWORD dwOpts) |
Load into memory and parse the contents of the file at the given path, using
the specified options. |
|
|
Utility Functions |
|
|
|
static LPTSTR Parse(register LPTSTR s,XMLBRANCH*
pRoot,LONG lGrow,DWORD dwOpts)
|
The parser inner loop. Parse the given XML string in-situ, over the given root,
using the specified growth increment, and options. Returns any remaining string not parsed, or null.
See Parser Options below. |
inline static void FreeTreeRecursive(XMLBRANCH*
pRoot)
|
Recursively free a tree. |
static BOOL ReadFileData(LPCTSTR
szPath,LPTSTR*,LPDWORD,DWORD) |
Read data from the file at at the given path into a buffer. |
|
|
Parser Options |
|
|
|
PARSE_MINIMAL |
No trim or normalize, no DTD, no PI, no CDATA, no comments. |
PARSE_PI |
Parse '<?...?> '. |
PARSE_DOCTYPE |
Parse '<!DOCTYPE name ... [...]> ' |
PARSE_COMMENTS |
Parse '<!----> ' |
PARSE_CDATA |
Parse '<![CDATA[...]]> ' -or- '<![INCLUDE[...]]> '. |
PARSE_TRIM_PCDATA |
Trim '>...< '. |
PARSE_TRIM_ATTRIBUTE |
Trim 'foo="..." '. |
PARSE_TRIM_CDATA |
Trim '<![CDATA[...]]> ' -or- '<![INCLUDE[...]]> '. |
PARSE_TRIM_ENTITY |
Trim '<!ENTITY name ...> ', etc. |
PARSE_TRIM_DOCTYPE |
Trim '<!DOCTYPE name ... [...]> '. |
PARSE_TRIM_COMMENT |
Trim '<!----> '. |
PARSE_NORMALIZE |
Normalize all that are flagged to be trimmed (e.g. translate multiple
whitespace segments to single space char). |
PARSE_DTD |
If PARSE_DOCTYPE set, then parse whatever is in data member (' [...]> '). |
PARSE_DTD_ONLY |
If (PARSE_DOCTYPE|PARSE_DTD) set, then abort after parsing DTD. If using CPugXmlParser::ParseFile()
access the final string position through CPugXmlParser::GetParseFilePos() . If using CPugXmlParser::Parse() ,
access the final string position as the value returned from the function call. |
PARSE_DEFAULT |
Parse all, trim and normalize. |
Global Methods - Used primarily by the parser
inner loop.
|
|
|
Method |
Summary |
static BOOL StrCatGrow(LPTSTR* pLhs,LPCTSTR pRhs) |
Concatenate pRhs to pLhs , growing pRhs
if neccessary. |
static BOOL StrWtrim(LPTSTR* s) |
In situ trim leading and trailing whitespace. |
static BOOL StrWnorm(LPTSTR* s) |
In situ trim leading and trailing whitespace, then convert all consecutive whitespace to a single space char. |
inline static XMLATTR* NewAttribute(void) |
Allocate & init an XMLATTR structure. |
inline static XMLBRANCH* NewBranch(XMLENTITY eType) |
Allocate & init an XMLBRANCH structure. |
inline static XMLBRANCH* GraftBranch(XMLBRANCH* pParent,LONG
lGrow,XMLENTITY eType) |
Allocate & graft a new XMLBRANCH onto the given
parent. |
inline static XMLATTR* AddAttribute(XMLBRANCH* pBranch,LONG
lGrow) |
Allocate & append a new attribute to the given XMLBRANCH . |
inline static void FreeTree(XMLBRANCH* pRoot) |
Non-recursively free a tree, given its root (also frees the root). |
Performance
Note: For all performance tests, the parser was configured with following flags
set: (PARSE_PI|PARSE_DOCTYPE|PARSE_COMMENTS|PARSE_CDATA|PARSE_DTD)
.
Operating Efficiency
The "http://www.codeproject.com/datetime/ccputicker.asp">CCPUTicker
class by J.M. McGuiness and P.J. Naughter was used to measure the elapsed count
of non-exclusive Pentium clock cycles for the static Parse
inner
loop. This value was divided by the count input bytes. In this summary test, a
set of seven well-formed XML documents were parsed, representing a random mix
of sizes, features and complexity. Following is a table of these measurements,
and two plots, Input Bytes/Elapsed CPU Cycles (cn-c0),
and Input Bytes/Elapsed Milliseconds (tn-t0).
Note: See Test Platform below for the test platform specs.
Bytes |
tn-t0 |
(tn-t0)/Bytes |
cn-c0 |
(cn-c0)/Bytes |
670
|
0
|
0.00000
|
34,707
|
51
|
29,504
|
0
|
0.00000
|
2,144,866
|
72
|
97,000
|
10
|
0.00010
|
2,711,453
|
27
|
161,000
|
20
|
0.00012
|
6,264,292
|
38
|
281,772
|
40
|
0.00014
|
15,667,883
|
55
|
1,054,981
|
170
|
0.00016
|
79,446,024
|
75
|
6,998,822
|
350
|
0.00005
|
157,064,600
|
22
|
An identical test performed using the MSXML 4.0, and .NET parsers showed a 4.5X
increase in elapsed milliseconds.
Memory Overhead
In an additional test, the parser's memory overhead was measured as a function
of input bytes, and peak memory usage. This gains cogency in the context
of the input data markup-to-data ratio. In the table below, 'Saturated' markup
refers to markup that has a high markup-to-data ratio, e.g.:
<a><b><c><d><e/></d></c></a></a>
or:
<a b="b" c="c" d="d" e="e">f</a>
'Sparse' markup refers to the opposite, e.g.:
<aaaaaaaaaaaaaaaaaaaaaaaaaaaaa/>
or:
<a>bbbbbbbbbbbbbbbbbbbbbbbbb</a>
and, 'Mixed' markup would be somewhere in between.
Bytes |
Markup Type |
Peak (Bytes) |
Overhead |
68,039 |
Saturated |
1,798,144 |
26 X |
281,772 |
Mixed |
1,826,816 |
6 X |
1,819,636 |
Mixed |
8,724,480 |
5 X |
1,958,486 |
Mixed |
10,395,648 |
5 X |
6,998,822 |
Sparse |
16,158,720 |
2 X |
As you can see, 'Saturated' markup has the highest memory overhead
(understandably, since the parser must allocate many more XMLBRANCH
or XMLATTR
structures). Although it is not implemented in PugXML,
it should be possible to take a random sampling of the input string by counting
the markup character to data character ratio, and thereby estimate both memory
overhead, and knowing the input string length, peak memory usage. This might
prove useful for embedded implementations where one would desire to know
a priori whether the potential document tree could fit into memory.
An identical test performed using the MSXML 4.0, and .NET parsers showed a
memory overhead of approximately 3.5X that of the PugXML parser.
Interpretation
OK, so the PugXML parser is comparatively fast, and doesn't use absurd amounts of memory,
but what does this really mean? By analogy, if you took your average gas guzzling SUV
(MSXML 4.0), and stripped everything off the chassis, except for one seat, the steering
wheel, and the engine, you could probably get the thing going really fast. The question you have to ask
yourself is: would you drive your children to school in such a vehicle? You bet!
(Note: For some serious performance in a fully-validating XML parser, take a look at
RXP).
OS Name Microsoft Windows 2000
Version 5.0.2195 Service Pack 3, RC 3.136 Build 2195
Processor x86 Family 5 Model 9 Stepping 1 AuthenticAMD ~451 Mhz
Total Physical Memory 261,680 KB
Total Virtual Memory 1,024,820 KB
PugXML was designed to perform robust parsing of potentially malformed XML
markup, and as such is non-compliant. Using a fully W3C-compliant parser
adhering to the strictures of well-formedness, validity, DTDs, XSDs, etc., one often
hits a document that simply cannot be parsed with significant manual editing. PugXML never rejects
a document. PugXML will parse a wide variety of malformed fragments, and
attempt to generate a well-formed document tree. Using the W3C Conformance Test
Suite 20020606, PugXML will parse all well-formed (valid) stand-alone documents
correctly (without entity expansion). Of the invalid tests, PugXML will, in 99%
of the cases yield a well-formed document tree from which one can actually
extract some data. Despite not being fully W3C compliant, PugXML could easily be used as a
starting point to build a fully compliant parser (although it would be sure to bite you if you tried).
Known Problems
-
Entities: Although the DOCTYPE section and subsidiary DTD
entities may be parsed using this parser, entities declared therein are not
expanded in the document. In addition, build-in entities (e.g.
' ') are not expanded however they are retained in whatever name, data
or value section, so it would not be difficult to apply them after the document
tree is constructed. Alternatively, you could first parse the DTD section,
build an entity table, expand instances of the entities in the input string,
then parse the input string as normally.
-
Character Encodings: Although the parser can be compiled and
will run correctly with
UNICODE
or _MBCS
defined, it
does not understand character encoding as specified in the XML PI section. If
you are attempting to parse a non-standard codepage, the parser may fail, or
corrupt the data. Currently the parser will work correctly for UTF-8 (65001),
and various flavors of Latin and Western European ISO/ASCII with _MBCS
defined, and for UNICODE (1200) with UNICODE
defined (but not both
in the same build). If your requirements include the need to parse
ad-hoc data that may use one or more different codepages, you might consider
using another parser.
-
Recursion: Some functions such as the traversal helpers, and
the static stream output function,
CPugXmlBranch::Serialize
, are
implemented using recursion. In some cases, particularly if you are traversing
a very deep tree, and depending on your compiler, you may encounter a stack
overflow. It would be relatively easy to make safer non-recursive
implementations (see the ::FreeTree
function for an example of
non-recursive traversal).
-
XML Namespaces: The parser does not tokenize '
<ns:tag...
'
or '<... ns:attribute...
'. If namespace prefixes are
present they will be left as is, as part of the object name, e.g. 'ns:tag
'
or 'ns:attribute
'.
Errata
Since the initial performance measurements, there have been some minor changes made to the static CPugXmlParser::Parse()
, NewBranch()
and GraftBranch()
functions (verification of memory allocation, and the string normalization option). Until new tests can be performed it is hard to say whether or how these changes may impact
the overall performance.
History
-
01/15/2003: Bug Fix: A memory exception at
CPugXmlParser::~CPugXmlParser()
,
or CPugXmlParser::Clear(),
when parser option PARSE_DTD_ONLY
is not set. Note: The source download, but not the demo project, has been updated with this fix.
-
01/10/2003: Initial Release.