Parsing Markup to Represent it as Objects

28 Mar 2009 | CPOL
Introduction

An interesting problem is parsing a markup document into an object model. This is useful, for example, when you want to generate valid markup, such as quoting HTML attribute values and adding end tags. Another use is replacing sections of elements. Applying whatever logic you need is much easier when you can traverse an object model instead of working on raw text. This article presents a project that takes any markup and turns it into a MarkupDocument with elements, attributes, content, and so on. The screenshot below shows markup displayed as a tree, which was built from the MarkupDocument.

[Image 1: markup displayed as a tree built from a MarkupDocument]

Using the code

Here is the class diagram for the markup representation:

[Image 2: class diagram of the markup representation]

The MarkupDocument class holds the whole document, made up of ChildElements and Content, along with the methods that parse a markup string into the object representation. ToString is overridden in every class to return the markup text of the element it represents. Below is the parsing method: a static method that takes a MarkupDocument and a markup string and loads the parsed result into the document. It splits the markup on '<' and uses regular expressions to extract the element names, the attributes, and so on.

C#
public static void ParseString(MarkupDocument document, string markup)
{
    List<string> result = new List<string>();
    document.ChildElements.Clear();
    Regex r;
    Match m;
    string[] markups = markup.Split('<');
    MarkupElement parentElement = null;
    foreach (string str in markups)
    {
        string workingMarkup = str;
        if (str.Trim().Length == 0)
            continue;
        #region Closing Tag
        //Check if this is a closing element or not
        if (workingMarkup.TrimStart().StartsWith("/"))
        {
            //Check if a parent element exists or not
            if (parentElement != null)
            {
                //Navigate up one level
                if (document.IsSpecial(workingMarkup,">",1))
                {
                    document.InsertContent(parentElement, workingMarkup, ">", 1);
                    continue;
                }
                parentElement = parentElement.ParentElement;
                //Insert Markup in the parentElement content
                document.InsertContent(parentElement, workingMarkup, ">", 1);
            }
            else
            {
                if (document.IsSpecial(workingMarkup, ">", 1))
                {
                    document.InsertContent(workingMarkup, ">", 1);
                    continue;
                }
                //Add an element when a closing tag appears at the beginning of the document
                #region Adding The Element
                r = new Regex("^\\s*\\w*", RegexOptions.IgnoreCase 
                | RegexOptions.Compiled);
                m = r.Match(workingMarkup);
                if (m.Success && m.Groups[0].Value.Trim().Length > 0)
                {
                    MarkupElement initElement = new MarkupElement();
                    initElement.ParentElement = parentElement;
                    initElement.Name = m.Groups[0].Value;
                    initElement.Document = document;
                    initElement.IsSelfClosed = true;
                    document.ChildElements.Add(initElement);
                }
                #endregion
                //Insert Markup in the document content
                document.InsertContent(workingMarkup, ">", 1);

            }
            continue;
        }
        #endregion

        MarkupElement currentElement = new MarkupElement();
        currentElement.Document = document;
        #region Element Name
        currentElement.ParentElement = parentElement;
        //This regular expression will extract the element name from the tag.
        r = new Regex("^\\s*\\w*", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        m = r.Match(workingMarkup);
        if (m.Success && m.Groups[0].Value.Trim().Length > 0)
            currentElement.Name = m.Groups[0].Value.Trim();
        else
            continue;
        //Remove only the leading name; Replace would also strip every later occurrence of it
        workingMarkup = workingMarkup.Substring(m.Length);
        #endregion

        #region Retrieve Element Attributes
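        //This regular expression extracts one attribute with its value at a time.
        //A quoted value may contain spaces; an unquoted value ends at whitespace.
        r = new Regex("[^\\s=]+\\s*=\\s*(?:\"[^\"]*\"|\\S+)",
                      RegexOptions.IgnoreCase | RegexOptions.Compiled);
        //Scan only the tag itself, not the text content after the first '>'
        int tagEnd = workingMarkup.IndexOf('>');
        string tagText = tagEnd < 0 ? workingMarkup : workingMarkup.Substring(0, tagEnd);
        for (m = r.Match(tagText); m.Success; m = m.NextMatch())
        {
            string tag = m.Groups[0].Value;
            //Split at the first '=' only, so a value may itself contain '='
            string[] tagSplit = tag.Split(new char[] { '=' }, 2);
            MarkupAttribute attribute = new MarkupAttribute();
            attribute.Name = tagSplit[0].Trim();
            attribute.Value = tagSplit[1].Trim();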
            currentElement.Attributes.Add(attribute);
        }

        #endregion

        //Setting the element parent
        currentElement.ParentElement = parentElement;
        #region Add Element
        if (parentElement == null)
            document.ChildElements.Add(currentElement);
        else
            parentElement.ChildElements.Add(currentElement);
        #endregion

        #region Add Content
        if (!str.Contains("/>"))
        {
            if (!document.SpecialElements.Contains(currentElement.Name))
            {
                parentElement = currentElement;
                document.InsertContent(currentElement, workingMarkup, ">", 1);
            }
            else if (parentElement != null)
            {
                document.InsertContent(parentElement, workingMarkup, ">", 1);
            }
            else
            {
                document.InsertContent(workingMarkup, ">", 1);
            }
        }
        else
        {
            currentElement.IsSelfClosed = true;
            document.InsertContent(parentElement, workingMarkup, "/>", 2);
        }
        #endregion
    }
}
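
To see the two extraction steps in isolation, here is a minimal, self-contained sketch that reads the element name and the attributes from a single fragment, such as one produced by markup.Split('<'). It is not part of MarkupLibrary: TagFragmentDemo, ReadName, and ReadAttributes are hypothetical names, and the attribute pattern is written in the same spirit as the one in ParseString, not copied from it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of MarkupLibrary.
public static class TagFragmentDemo
{
    // Element name: optional leading whitespace, then word characters.
    private static readonly Regex NameRegex = new Regex(@"^\s*\w+");

    // One attribute: a name, '=', then a quoted value or a run of non-space characters.
    private static readonly Regex AttributeRegex =
        new Regex(@"([^\s=]+)\s*=\s*(?:""([^""]*)""|(\S+))");

    // Returns the element name at the start of a fragment, or "" if there is none.
    public static string ReadName(string fragment)
    {
        Match m = NameRegex.Match(fragment);
        return m.Success ? m.Value.Trim() : "";
    }

    // Returns the attribute name/value pairs found before the first '>'.
    public static Dictionary<string, string> ReadAttributes(string fragment)
    {
        int tagEnd = fragment.IndexOf('>');
        string tagText = tagEnd < 0 ? fragment : fragment.Substring(0, tagEnd);
        // Skip the element name so it is not mistaken for part of an attribute.
        tagText = tagText.Substring(NameRegex.Match(tagText).Length);

        var attributes = new Dictionary<string, string>();
        foreach (Match m in AttributeRegex.Matches(tagText))
        {
            // Group 2 holds a quoted value, group 3 an unquoted one.
            string value = m.Groups[2].Success ? m.Groups[2].Value : m.Groups[3].Value;
            attributes[m.Groups[1].Value] = value;
        }
        return attributes;
    }
}
```

For the fragment `a href="x.html" title="Go home">Home`, ReadName returns the name a, and ReadAttributes returns the href and title values without their quotes. Cutting the fragment at the first '>' keeps text content such as `1 = 2` from being read as an attribute.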

Some markup elements, such as br, hr, and img, are meant to be single elements with no closing tag. To have the parser treat them that way, add their names to the SpecialElements list of the document object:

C#
MarkupLibrary.MarkupDocument document = new MarkupLibrary.MarkupDocument();
document.SpecialElements.Add("br");
document.SpecialElements.Add("hr");
document.SpecialElements.Add("img");
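
To illustrate why this list matters, the following is a simplified stand-in for the nesting logic, not the library's code. It splits on '<' as ParseString does, but only tracks depth: an element becomes the new parent unless it is self-closed or its name is in the special set. NestingDemo and Outline are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of MarkupLibrary.
public static class NestingDemo
{
    // Returns one "depth:name" entry per opening tag, in document order.
    public static List<string> Outline(string markup, ICollection<string> specialElements)
    {
        var outline = new List<string>();
        int depth = 0;
        string[] fragments = markup.Split('<');
        // fragments[0] is the text before the first '<', so start at 1.
        for (int i = 1; i < fragments.Length; i++)
        {
            string tag = fragments[i].Trim();
            if (tag.Length == 0)
                continue;
            if (tag.StartsWith("/"))
            {
                // Closing tag: move up one level, unless it closes a special element.
                string closing = ReadName(tag.Substring(1));
                if (!specialElements.Contains(closing) && depth > 0)
                    depth--;
                continue;
            }
            string name = ReadName(tag);
            if (name.Length == 0)
                continue;
            outline.Add(depth + ":" + name);
            // Only a normal element that is not self-closed becomes the new parent.
            bool selfClosed = tag.Contains("/>");
            if (!selfClosed && !specialElements.Contains(name))
                depth++;
        }
        return outline;
    }

    // Reads the leading run of word characters from a tag fragment.
    private static string ReadName(string tag)
    {
        int end = 0;
        while (end < tag.Length && (char.IsLetterOrDigit(tag[end]) || tag[end] == '_'))
            end++;
        return tag.Substring(0, end);
    }
}
```

With `<p>one<br>two<b>bold</b></p>` and br in the special set, b is reported at the same depth as br. With an empty set, b ends up nested inside br, because nothing ever closes it.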

To load the document with the markup string, call the Load method:

C#
document.Load(markup);
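
Once the document is loaded, it can be walked like any tree or turned back into markup. The sketch below uses a small stand-in class so that it compiles on its own. SketchElement and TreeDemo are hypothetical; only the member names Name, Content, IsSelfClosed, Attributes, and ChildElements are borrowed from the parsing code, and the real ToString overrides may behave differently. With the library you would start from document.ChildElements instead.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for MarkupElement; only the members used below.
public class SketchElement
{
    public string Name = "";
    public string Content = "";
    public bool IsSelfClosed;
    public List<KeyValuePair<string, string>> Attributes =
        new List<KeyValuePair<string, string>>();
    public List<SketchElement> ChildElements = new List<SketchElement>();

    // Serializes the element with quoted attribute values and an explicit end tag.
    // Simplification: the element's own content is written before its children.
    public override string ToString()
    {
        var sb = new StringBuilder("<" + Name);
        foreach (var attribute in Attributes)
            sb.Append(" " + attribute.Key + "=\"" + attribute.Value.Trim('"') + "\"");
        if (IsSelfClosed)
            return sb.Append(" />").ToString();
        sb.Append(">").Append(Content);
        foreach (SketchElement child in ChildElements)
            sb.Append(child);
        return sb.Append("</" + Name + ">").ToString();
    }
}

public static class TreeDemo
{
    // Collects one indented line per element, depth first.
    public static void Describe(SketchElement element, int depth, List<string> lines)
    {
        lines.Add(new string(' ', depth * 2) + element.Name);
        foreach (SketchElement child in element.ChildElements)
            Describe(child, depth + 1, lines);
    }
}
```

The recursive Describe method is the kind of traversal that can feed a tree view, and the ToString override shows how an object model makes it straightforward to emit quoted attribute values and matching end tags.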

Points of interest

I hope this helps you build your own object model from a markup representation.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)