An XML parser and editor with shades of a Design Pattern

Posted 16 Aug 2010. Licensed under CPOL.
A very generic XML parser whose internal implementation can be changed without affecting the rest of the source code.

screenshot.GIF

Introduction

XML has become a very popular language for data representation, GUI design, communication between servers, and so on, but the tools used to parse it vary from COM-based parsers (e.g., MS-XML) to SAX parsers, to Xerces, and beyond. Every other week or so, a new parsing tool arrives that promises a better and more efficient solution, but unfortunately you cannot keep rewriting your parsing code just because the new kid (sorry, I meant kit) in parsing has arrived. Hence, I came up with (if I may dare call it) a variation of the Bridge Pattern: you use a simple parser that acts as a wrapper for any XML parsing library, and you don't have to touch the rest of your source code if the internal implementation of the parser changes.

The Problem

At first glance, this may seem a trivial issue, since anybody with a good grasp of Design Patterns can easily come up with a generic parser whose internal changes are not reflected in the upper layers. Looking deeper, however, you will find a problem: an XML parser provided by any third-party library has a good number of classes to represent the document, elements, nodes, and attributes. If you want to build your own parser the GoF (Gang of Four Design Patterns) way, you have to provide a wrapper for each of these classes, write a number of methods for all of those wrappers, and export an interface that is independent of the wrapped classes, where a call to an interface method ends up invoking the necessary internal methods.

For example, the Xerces parser has classes called XercesDOMParser, DOMDocument, and DOMNode, with methods such as virtual DOMDocument* getDocument() of XercesDOMParser, DOMElement *getDocumentElement() of DOMDocument, and virtual DOMNodeList *getChildNodes() const of DOMNode. You would have to provide corresponding wrappers for these classes with mappings to the above-mentioned methods, which means you would effectively create another heavily loaded parser, except that the actual parsing is done by the internal (wrapped) parser. Then there remains the small problem of how to align your own classes, with minimal effort, so that they actually represent the XML parse tree.

The examples I have given assume that I would be using some combination of the Abstract Factory and Bridge Patterns to represent the XML data structures. Maybe somebody can point me to a better, easier-to-use pattern for building a simple generic XML parser that I cannot think of at the moment.

The Solution

The solution lies in keeping everything super simple. Instead of writing wrappers for the individual classes such as XercesDOMParser, DOMDocument, DOMNode etc., what needs to be done is create some useful data structures such as TXMLNode to represent a node, TXMLAttrib to represent an attribute, TXMLRoot to represent the root element, and last but not least, TXMLParser, a fully abstract class (with pure virtual methods only) to represent the parser. The parser will contain most of the methods for setting/getting nodes, setting/getting underlying properties/values etc.

Following this idea, the virtual DOMElement *getDocumentElement() of DOMDocument will map to a method such as errno_t GetRoot(/*OUT*/TXMLRoot& root) of TXMLParser itself, where TXMLRoot represents the document element.

We all know that every XML file must have a root item, which is usually referred to as the document element. When parsing any XML file, we first look for the document element (or root), which is exposed through virtual errno_t GetRoot(/*OUT*/TXMLRoot& root) of the parser. On the other hand, the creation of the internal document, as well as the document itself, is fully hidden inside the parser. The document gets created by invoking methods such as:

C++
// creates a new document
errno_t CreateNewDocument(); 

// loads and parses the XML file
errno_t OpenDocument(const TCHAR* path);

of our simple XML parser.

The other operations follow the same approach as GetRoot(...): each one is a method of the parser itself rather than of a wrapped document or node class.
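As a rough illustration of this mapping, here is a minimal, portable sketch. It is not taken from the uploaded source: SketchNode, SketchParser, and InMemoryParser are hypothetical names, std::string and int stand in for TCHAR and errno_t so that it compiles outside Windows, CreateNewDocument takes a root name purely to keep the example short, and the in-memory document stands in for a real library such as Xerces.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, simplified stand-in for TXMLNode/TXMLRoot.
struct SketchNode {
    std::string name;
    void* internalNode = nullptr;   // opaque handle, owned by the parser
};

// Hypothetical, trimmed-down version of the abstract parser interface.
class SketchParser {
public:
    virtual ~SketchParser() = default;
    // Assumption: takes the root name to keep the sketch short.
    virtual int CreateNewDocument(const std::string& rootName) = 0;
    virtual int GetRoot(SketchNode& root) = 0;   // 0 means success
};

// Toy backend: the "document" type is private, so callers never see it.
class InMemoryParser : public SketchParser {
    struct Doc { std::string rootName; };
    std::unique_ptr<Doc> doc_;
public:
    int CreateNewDocument(const std::string& rootName) override {
        assert(!rootName.empty());
        doc_.reset(new Doc{rootName});
        return 0;
    }
    int GetRoot(SketchNode& root) override {
        if (!doc_) return 1;             // no document has been created yet
        root.name = doc_->rootName;
        root.internalNode = doc_.get();  // only the parser interprets this
        return 0;
    }
};

// Client code depends on the abstract interface only.
std::string RootNameOf(SketchParser& parser) {
    SketchNode root;
    return parser.GetRoot(root) == 0 ? root.name : std::string();
}
```

The point of the sketch is that RootNameOf never mentions a document class; swapping InMemoryParser for another backend would not change it.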

Now, let's have a somewhat detailed look at the simple data structures we need to represent an XML tree:

C++
class TXMLNode
//In our simplified view, an element and a node more or less means the same
{
public:
    enum NodeType
    {
        UNKNOWN                        = 0,
        ELEMENT_NODE                   = 1,
        ATTRIBUTE_NODE                 = 2,
        TEXT_NODE                      = 3,
        CDATA_SECTION_NODE             = 4,
        ENTITY_REFERENCE_NODE          = 5,
        ENTITY_NODE                    = 6,
        PROCESSING_INSTRUCTION_NODE    = 7,
        COMMENT_NODE                   = 8,
        DOCUMENT_NODE                  = 9,
        DOCUMENT_TYPE_NODE             = 10,
        DOCUMENT_FRAGMENT_NODE         = 11,
        NOTATION_NODE                  = 12,        
        NODE_TYPE_COUNT                = 13
    };    

public:
    TXMLNode()
    {
        _nodeName = NULL;
        _nodeValueString = NULL;
        _nodeType = 0;
        _internalNode = NULL;
    }        
    
    ~TXMLNode()
    {
        if(_nodeName != NULL)
        {
            delete[] _nodeName;            
        }
        if(_nodeValueString != NULL)
        {
            delete[] _nodeValueString;            
        }                
    }    

    const TCHAR* GetName()
    {
        return _nodeName;
    }
    void SetName(const TCHAR* name)
    {        
        CopyString(_nodeName, name);
    }
    
    const TCHAR* GetValueString() 
    {
        return _nodeValueString;
    }
    void SetValueString(const TCHAR* val)
    {
        CopyString(_nodeValueString, val);        
    }
    
    unsigned short GetNodeType()
    {
        return _nodeType;
    }
    
    const TCHAR* GetTextContent();
    
    void* GetInternalNode()
    {
        return _internalNode;
    }
    void SetInternalNode(void* internalNode)
    {
        _internalNode = internalNode;
    }
    
    void SetNodeType(unsigned short type)
    {
        _nodeType = type;
    }    

    void CopyString(TCHAR*& des, const TCHAR* src)
    {
        if(des != NULL )
        {
            delete[] des;
            des = NULL;
        }

        if(src != NULL)
        {
            size_t len = _tcslen(src);        
            des = new TCHAR[len + 1];            
            memcpy(des, src, sizeof(TCHAR) * len);
            des[len] = 0;
        }        
    }
    
    TCHAR* _nodeName;
    TCHAR* _nodeValueString;    
    unsigned short _nodeType;
    
    void* _internalNode;    
};

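class TXMLRoot : public TXMLNode
{
public:
    // The base-class constructor runs automatically. Writing "TXMLNode();"
    // in the body would only create and discard a temporary object.
    TXMLRoot()
    {
    }    
};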

class TXMLAttrib : public TXMLNode
{
};

As you can see, these classes do not act as direct wrappers for the Xerces parser's DOMNode or DOMElement. Rather, each of them has a void pointer that holds the corresponding internal data (for example, TXMLNode stores a DOMNode pointer as a void pointer). The responsibility falls upon the TXMLParser-derived classes to make sense of the void pointer data and to populate the TXMLNode, TXMLRoot, or TXMLAttrib info according to the internal implementation. Here is an example of how the derived class uses the internal data:

C++
bool TXMLParserXerces::GetParentNode(TXMLNode* childNode, TXMLNode& node)
{
    bool bRet = false;
    INTERNAL_NODE* internalNode = (INTERNAL_NODE*)childNode->GetInternalNode();
    if(internalNode != NULL)        
    {
        INTERNAL_NODE* xmlNode = internalNode->getParentNode();    
        if(xmlNode != NULL)
        {
            PopulateNodeData(xmlNode, &node);
            bRet = true;
        }
    }        
    return bRet;
}

You can guess from the above example that TXMLParserXerces is a TXMLParser derived class.
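The article does not show PopulateNodeData itself, so the following is only a guess at its general shape, written as a portable sketch. BackendNode is a hypothetical stand-in for a library type such as Xerces' DOMNode, FacadeNode is a simplified stand-in for TXMLNode, and std::string replaces the TCHAR buffers.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a backend node type such as Xerces' DOMNode.
struct BackendNode {
    std::string name;
    std::string value;
    BackendNode* parent = nullptr;
};

// Simplified stand-in for TXMLNode: plain data plus an opaque handle.
struct FacadeNode {
    std::string name;
    std::string value;
    void* internalNode = nullptr;
};

// Copies the visible data out of the backend node and remembers the handle
// so that later calls can navigate from this node again.
void PopulateNodeData(BackendNode* src, FacadeNode* dst) {
    assert(src != nullptr && dst != nullptr);
    dst->name = src->name;
    dst->value = src->value;
    dst->internalNode = src;
}

// Same flow as GetParentNode above: recover the backend pointer from the
// void*, ask the backend for the parent, then fill the output node.
bool GetParentNode(const FacadeNode* child, FacadeNode& out) {
    if (child == nullptr) return false;
    BackendNode* internal = static_cast<BackendNode*>(child->internalNode);
    if (internal == nullptr || internal->parent == nullptr) return false;
    PopulateNodeData(internal->parent, &out);
    return true;
}
```

Only the parser implementation performs the cast from void*, so the facade types never need to include the backend's headers.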

The TXMLParser (abstract) class has the following structure, and the method names are explicit enough to understand what their functionalities are so that they can be easily implemented in the concrete classes:

C++
class TXMLParser
{
public: 
    enum ParserError
    {
        NO_XML_ERROR = 0, 
        INVALID_XML_DOCUMENT,
        INVALID_ROOT_ELEMENT,
        DETACHED_XML_NODE,
        INVALID_XML_ELEMENT, 
        INVALID_NODE_NAME,
        CREATE_IMPL_FAILED,
        CREATE_DOC_FAILED,
        SAVETO_FILE_FAILED,
        CREATE_ELEMENT_FAILED,
        SET_ATTRIB_FAILED, 
        REMOVE_CHILD_FAILED,
        INSERT_BEFORE_FAILED,
        REPLACE_CHILD_FAILED,
        REMOVE_ATTRIB_FAILED,
        UPDATE_DATA_FAILED
    };

    TXMLParser(bool simplified = true)
    {
    }

    virtual ~TXMLParser(void)
    {
    }

    virtual void Release() = 0;
    virtual errno_t CreateNewDocument() = 0;
    //loads and parses the XML file
    virtual errno_t OpenDocument(const TCHAR* path) = 0;
    virtual errno_t SaveDocument(const TCHAR* path) = 0;
    virtual errno_t SaveToStream(TCHAR*& buf, unsigned int& len) = 0;
    virtual errno_t CloseDocument() = 0;
    virtual errno_t GetRoot(/*OUT*/TXMLRoot& root) = 0;

    virtual unsigned int GetChildCount(TXMLNode* parentNode) = 0;
    virtual bool GetChild(unsigned int index, 
            TXMLNode* parentNode, /*OUT*/TXMLNode& node) = 0; 

    // to use in conjunction with GetPrevSibling and GetNextSibling
    virtual bool GetFirstChild(TXMLNode* parentNode, /*OUT*/TXMLNode& node) = 0;
    // to use in conjunction with GetPrevSibling and GetNextSibling
    virtual bool GetLastChild(TXMLNode* parentNode, /*OUT*/TXMLNode& node) = 0;

    virtual bool GetNextSibling(TXMLNode* curNode, /*OUT*/TXMLNode& node) = 0; 
    virtual bool GetPrevSibling(TXMLNode* curNode, /*OUT*/TXMLNode& node) = 0; 

    virtual bool GetParentNode(TXMLNode* childNode, /*OUT*/TXMLNode& node) = 0; 
    virtual unsigned int GetAttribCount(TXMLNode* node) = 0;
    virtual bool GetAttrib(unsigned int index, 
            TXMLNode* node, /*OUT*/TXMLAttrib& attrib) = 0;

    virtual bool FindNode(const TCHAR* nodeName, /*OUT*/TXMLNode& node) = 0;
    virtual bool FindNode(const TCHAR* nodeName, 
            TXMLNode* curNode, /*OUT*/TXMLNode& node) = 0;

    //The following two functions are very unlikely 
    //to be used; they are included just in case
    virtual bool FindNodeByNameVal(const TCHAR* nodeName, 
            const TCHAR* nodeVal, /*OUT*/TXMLNode& node) = 0;
    virtual bool FindNodeByNameVal(const TCHAR* nodeName, 
            const TCHAR* nodeVal, TXMLNode* curNode, /*OUT*/TXMLNode& node) = 0; 

    virtual bool FindAttrib(const TCHAR* nodeName, 
            const TCHAR* attribName, TXMLAttrib& attrib) = 0;
    virtual bool FindAttrib(TXMLNode* node, 
            const TCHAR* attribName, /*OUT*/TXMLAttrib& attrib) = 0;

    virtual bool IsLoaded() = 0;

    // Do not delete the returned node externally, parser will take care of it
    virtual TXMLNode* CreateNodeInstance() = 0;
    // Do not delete the returned node externally, parser will take care of it 
    virtual TXMLRoot* CreateRootInstance() = 0;
    // Do not delete the returned node externally, parser will take care of it
    virtual TXMLAttrib* CreateAttribInstance() = 0; 
    virtual errno_t AddRoot(TXMLRoot* rootNode) = 0;
    virtual errno_t RemoveRoot() = 0;
    virtual errno_t AddChild(TXMLNode* parentNode, TXMLNode* childNode) = 0;
    virtual errno_t RemoveChild(TXMLNode* parentNode, 
                    /*INOUT*/TXMLNode*& childNode) = 0; 
    virtual errno_t InsertBefore(TXMLNode* parentNode, 
                    TXMLNode* newNode, TXMLNode* refNode) = 0;
    virtual errno_t ReplaceChild(TXMLNode* parentNode, 
                    TXMLNode* newNode, TXMLNode* oldNode) = 0; 
    virtual errno_t ReplaceNodeName(TXMLNode* node, const TCHAR* name) = 0;
    virtual errno_t ReplaceNodeValue(TXMLNode* node, const TCHAR* value) = 0;
    //if attrib already exists then sets the attrib otherwise adds the attrib
    virtual errno_t SetAttrib(TXMLNode* node, TXMLAttrib* attrib) = 0;
    virtual errno_t RemoveAttrib(TXMLNode* node, const TCHAR* attribName) = 0;

protected:
    virtual const TCHAR* GetTextContent(TXMLNode* xmlNode) = 0;
    virtual bool UpdateUnderlyingData(TXMLNode* xmlNode) = 0;
};

Throughout the code, you use only a reference (or a pointer) to the TXMLParser class, and you define an implementation class such as TXMLParserXerces to do the internal processing and retrieval of XML data.
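One common way to keep the concrete class name out of the rest of the code is a small factory function. The sketch below is an assumption about how that could look, not something taken from the uploaded source: ParserFacade, XercesBackedParser, MsxmlBackedParser, Backend, and CreateParser are all hypothetical names, and the two toy backends only report their own name instead of calling a real library.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, minimal stand-in for the abstract TXMLParser interface.
class ParserFacade {
public:
    virtual ~ParserFacade() = default;
    virtual std::string BackendName() const = 0;
};

// Toy implementations; a real one would wrap Xerces or MS-XML calls.
class XercesBackedParser : public ParserFacade {
public:
    std::string BackendName() const override { return "xerces"; }
};

class MsxmlBackedParser : public ParserFacade {
public:
    std::string BackendName() const override { return "msxml"; }
};

enum class Backend { Xerces, Msxml };

// The only place in the program that names a concrete parser class.
std::unique_ptr<ParserFacade> CreateParser(Backend backend) {
    if (backend == Backend::Msxml)
        return std::unique_ptr<ParserFacade>(new MsxmlBackedParser());
    return std::unique_ptr<ParserFacade>(new XercesBackedParser());
}

// Client code: works with any backend through the abstract type.
std::string Describe(const ParserFacade* parser) {
    assert(parser != nullptr);
    return "parsing via " + parser->BackendName();
}
```

With this arrangement, switching libraries means changing one argument to CreateParser (or one line inside it), while functions like Describe stay untouched.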

Using the Code

Following is a quick look at opening an XML format document and saving it using our simple parser:

C++
// CTXMLEditorDoc commands
BOOL CTXMLEditorDoc::OnOpenDocument(LPCTSTR lpszPathName)
{
    if (!CDocument::OnOpenDocument(lpszPathName))
        return FALSE;

    if(GetXMLParser() == NULL)
        return FALSE;

    // Load and parse the file through the abstract parser interface
    bool bRet = (GetXMLParser()->OpenDocument(lpszPathName) == 
                     TXMLParser::NO_XML_ERROR);

    if(!bRet)
        AfxMessageBox(_T("Sorry, unable to open file"));

    return bRet ? TRUE : FALSE;
}

BOOL CTXMLEditorDoc::OnSaveDocument(LPCTSTR lpszPathName)
{
    if(GetXMLParser() == NULL)
        return FALSE;

    bool bRet = (GetXMLParser()->SaveDocument(lpszPathName) == 
                       TXMLParser::NO_XML_ERROR);

    if(!bRet)
    {
        AfxMessageBox(_T("Sorry, unable to save file")); 
    } 
    else
    {
        CDocument::SetModifiedFlag(FALSE); 
    }
    return bRet ? TRUE : FALSE;
}

Following is the snippet to populate a tree control with the XML data of an opened document:

C++
void CTXMLEditorView::OnUpdate(CView* /*pSender*/, LPARAM /*lHint*/, CObject* /*pHint*/)
{
    // TODO: Add your specialized code here and/or call the base class
    GetTreeCtrl().SetBkColor(RGB(255, 255, 255));
    GetTreeCtrl().SetTextColor(RGB(80, 60, 240)); 
    GetTreeCtrl().SetLineColor(RGB(200, 100, 100)); 
    GetTreeCtrl().SetItemHeight(16);
    CTXMLEditorDoc* pDoc = GetDocument();
    ASSERT_VALID(pDoc);

    if (!pDoc)
        return;

    TXMLParser* pParser = pDoc->GetXMLParser();
    if(pParser && pParser->IsLoaded())
    {
        //Populate Tree Ctrl from the root
        TXMLRoot* root = pParser->CreateRootInstance();
        if(pParser->GetRoot(*root) == TXMLParser::NO_XML_ERROR) 
        {
            PopulateTreeCtrl(*pParser, root, NULL);
        }         
    } 
}

C++
void CTXMLEditorView::PopulateTreeCtrl(TXMLParser& parser, 
                      TXMLNode* curNode, HTREEITEM hParent)
{
    TXMLNode* orgNode = curNode;
    CString nodeName = orgNode->GetName(); 
    CString nodeValue = orgNode->GetValueString();
    TCHAR buf[2] = {0, 0};

    if(nodeValue.GetLength() > 1)
    {
        buf[0] = nodeValue.GetAt(0);
        buf[1] = nodeValue.GetAt(1);
    }

    CString nodeNameVal = nodeName;

    if(nodeValue.GetLength() && buf[0] != '\n' && buf[1] != '\r')
    {
        nodeNameVal += NAME_VALUE_SEPARATOR;
        nodeNameVal += nodeValue; 
    } 

    TVINSERTSTRUCT tvInsert; 
    tvInsert.hParent = hParent;
    tvInsert.hInsertAfter = TVI_LAST; 
    tvInsert.item.mask = TVIF_TEXT;
    tvInsert.item.pszText = (TCHAR*)(LPCTSTR)nodeNameVal; 
    tvInsert.item.lParam = NULL;
    HTREEITEM hItem = GetTreeCtrl().InsertItem(&tvInsert);

    tvInsert.hParent = hItem;
    tvInsert.hInsertAfter = TVI_LAST; 
    tvInsert.item.mask = TVIF_TEXT;
    tvInsert.item.pszText = ATTRIBUTES_TAG; 
    HTREEITEM hAttrib = GetTreeCtrl().InsertItem(&tvInsert); 

    unsigned int attribCount = parser.GetAttribCount(orgNode);
    if(attribCount) 
    {
        TXMLAttrib* attrib = parser.CreateAttribInstance(); 

        for(unsigned int i = 0; i < attribCount; i++)
        {
            if(parser.GetAttrib(i, orgNode, *attrib))
            {
                tvInsert.hParent = hAttrib; 

                CString attribNameVal;
                attribNameVal = attrib->GetName();
                attribNameVal += NAME_VALUE_SEPARATOR;
                attribNameVal += attrib->GetValueString(); 
                tvInsert.item.pszText = (TCHAR*)(LPCTSTR)attribNameVal;

                // The returned item handle is not needed here
                GetTreeCtrl().InsertItem(&tvInsert); 
            }
        }
    }

    tvInsert.hParent = hItem;
    tvInsert.hInsertAfter = TVI_LAST; 
    tvInsert.item.mask = TVIF_TEXT;
    tvInsert.item.pszText = CHILD_NODES_TAG;
    HTREEITEM hChildren = GetTreeCtrl().InsertItem(&tvInsert); 
    unsigned int childCount = parser.GetChildCount(curNode); 

    if(childCount)
    { 
        TXMLNode* nextNode = parser.CreateNodeInstance(); 
        bool bRet = parser.GetFirstChild(curNode, *nextNode);

        while(bRet)
        {
            PopulateTreeCtrl(parser, nextNode, hChildren);
            curNode = nextNode; 
            bRet = parser.GetNextSibling(curNode, *nextNode);
        }         
    }
}

The full implementation of the pure virtual methods of TXMLParser can be found in the TXMLParserXerces class which is available in the uploaded source code.
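The first-child/next-sibling traversal used in PopulateTreeCtrl can also be shown without MFC. The following is a portable sketch under stated assumptions: RawNode is a hypothetical backend node, Cursor is a simplified stand-in for TXMLNode, and the free functions GetFirstChild and GetNextSibling only mimic the shape of the parser methods of the same name.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical backend node with first-child / next-sibling links.
struct RawNode {
    std::string name;
    RawNode* firstChild = nullptr;
    RawNode* nextSibling = nullptr;
};

// Simplified stand-in for TXMLNode.
struct Cursor {
    std::string name;
    void* internalNode = nullptr;
};

// Fills 'out' from a backend node; leaves 'out' untouched when raw is null.
bool Fill(RawNode* raw, Cursor& out) {
    if (raw == nullptr) return false;
    out.name = raw->name;
    out.internalNode = raw;
    return true;
}

bool GetFirstChild(const Cursor& parent, Cursor& out) {
    RawNode* raw = static_cast<RawNode*>(parent.internalNode);
    return raw != nullptr && Fill(raw->firstChild, out);
}

// Safe when 'cur' and 'out' are the same object: the backend pointer is
// read before 'out' is overwritten.
bool GetNextSibling(const Cursor& cur, Cursor& out) {
    RawNode* raw = static_cast<RawNode*>(cur.internalNode);
    return raw != nullptr && Fill(raw->nextSibling, out);
}

// Depth-first walk that reuses one Cursor per level, as PopulateTreeCtrl
// reuses nextNode.
void CollectNames(const Cursor& node, std::vector<std::string>& names) {
    assert(node.internalNode != nullptr);
    names.push_back(node.name);
    Cursor child;
    for (bool ok = GetFirstChild(node, child); ok;
         ok = GetNextSibling(child, child)) {
        CollectNames(child, names);
    }
}
```

Each recursion level owns its own Cursor, so advancing to the next sibling at one level does not disturb the levels above it.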

Points of Interest

While implementing XML file reading, I also added a mechanism for saving XML files, so in the end the parser has become more of an XML editor than anything else. In the uploaded source code, a tree control represents the XML nodes, and each tree node has two default child items: Attribute and Child Nodes. Under the Attribute item, the attributes of the node are listed in the format AttributeName=Value, and under the Child Nodes item, the child XML nodes are represented in the same way as the parent node. If a node has a value, the node is shown in the tree item in the nodename=value format.
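To make the layout concrete, here is a small console sketch that produces the same structure as indented text lines instead of tree-control items. DemoNode and Render are hypothetical names, the item labels are taken from the description above, and "=" is assumed as the name/value separator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory node used only for this illustration.
struct DemoNode {
    std::string name;
    std::string value;
    std::vector<std::pair<std::string, std::string>> attribs;
    std::vector<DemoNode> children;
};

// Appends one line per tree item: the node itself, an "Attribute" item with
// one name=value line per attribute, and a "Child Nodes" item under which
// each child is rendered recursively in the same way.
void Render(const DemoNode& node, std::size_t depth,
            std::vector<std::string>& out) {
    assert(!node.name.empty());
    const std::string pad(depth * 2, ' ');
    out.push_back(pad + (node.value.empty() ? node.name
                                            : node.name + "=" + node.value));
    out.push_back(pad + "  Attribute");
    for (const auto& a : node.attribs)
        out.push_back(pad + "    " + a.first + "=" + a.second);
    out.push_back(pad + "  Child Nodes");
    for (const auto& c : node.children)
        Render(c, depth + 2, out);
}
```

Every node gets both default items even when it has no attributes or children, which mirrors the behaviour described above.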

Acknowledgements

Jacques Raphanel: For introducing me to the Xerces parser library (easily portable to operating systems other than Windows), which I found a little easier to work with than the COM-based MS-XML library.

John Adams: For helping me correct my mistaken use of the term Adapter instead of Bridge. Wherever you see the words Bridge Pattern in this article, the text originally (and wrongly) said Adapter Pattern.

History

  • Article uploaded: 16 August, 2010.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)