Introduction
Parsing a markup document into an object model is an interesting and useful problem. With an object model you can regenerate valid markup (for example, quoting HTML attribute values and adding missing end tags), replace individual element sections, and traverse the document to apply whatever logic you need far more easily than by working on the raw text. Here I present a project that takes any markup and turns it into a MarkupDocument with elements, attributes, content, and so on. Below is a screenshot of some markup shown in tree format; I used the MarkupDocument to build the tree.
Using the code
Here is the class diagram for the markup representation:
The MarkupDocument class holds the whole document, made up of ChildElements and Content, along with the methods that parse a markup string into the object representation. The ToString method is overridden in every class to return the markup text of the represented element. Below is the method I wrote for parsing the markup. It is a static method that takes a MarkupDocument and a markup string and loads the parsed result into the document. It uses regular expressions to retrieve the element names, the attributes, and so on.
// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public static void ParseString(MarkupDocument document, string markup)
{
    List<string> result = new List<string>();
    document.ChildElements.Clear();
    Regex r;
    Match m;
    // Each fragment after a '<' starts with an element name, or with '/' for an end tag.
    string[] markups = markup.Split('<');
    MarkupElement parentElement = null;
    foreach (string str in markups)
    {
        string workingMarkup = str;
        if (str.Trim().Length == 0)
            continue;
        #region Closing Tag
        if (workingMarkup.TrimStart().StartsWith("/"))
        {
            if (parentElement != null)
            {
                if (document.IsSpecial(workingMarkup, ">", 1))
                {
                    document.InsertContent(parentElement, workingMarkup, ">", 1);
                    continue;
                }
                // The end tag closes the current element, so step back up one level.
                parentElement = parentElement.ParentElement;
                document.InsertContent(parentElement, workingMarkup, ">", 1);
            }
            else
            {
                if (document.IsSpecial(workingMarkup, ">", 1))
                {
                    document.InsertContent(workingMarkup, ">", 1);
                    continue;
                }
                #region Adding The Element
                r = new Regex("^\\s*\\w*", RegexOptions.IgnoreCase
                    | RegexOptions.Compiled);
                m = r.Match(workingMarkup);
                if (m.Success && m.Groups[0].Value.Trim().Length > 0)
                {
                    MarkupElement initElement = new MarkupElement();
                    initElement.ParentElement = parentElement;
                    initElement.Name = m.Groups[0].Value.Trim();
                    initElement.Document = document;
                    initElement.IsSelfClosed = true;
                    document.ChildElements.Add(initElement);
                }
                #endregion
                document.InsertContent(workingMarkup, ">", 1);
            }
            continue;
        }
        #endregion
        MarkupElement currentElement = new MarkupElement();
        currentElement.Document = document;
        #region Element Name
        currentElement.ParentElement = parentElement;
        r = new Regex("^\\s*\\w*", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        m = r.Match(workingMarkup);
        if (m.Success && m.Groups[0].Value.Trim().Length > 0)
            currentElement.Name = m.Groups[0].Value.Trim();
        else
            continue;
        // Strip only the leading name; Replace() would also remove the name
        // wherever it recurs inside attribute values or content.
        workingMarkup = workingMarkup.Substring(m.Length);
        #endregion
        #region Retrieve Element Attributes
        r = new Regex("\\S*\\s*=\\s*\\S*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);
        for (m = r.Match(workingMarkup); m.Success; m = m.NextMatch())
        {
            string tag = m.Groups[0].Value;
            // Split on the first '=' only, so values that contain '=' stay intact.
            string[] tagSplit = tag.Split(new[] { '=' }, 2);
            MarkupAttribute attribute = new MarkupAttribute();
            attribute.Name = tagSplit[0].Trim();
            attribute.Value = tagSplit[1].Trim();
            currentElement.Attributes.Add(attribute);
        }
        #endregion
        currentElement.ParentElement = parentElement;
        #region Add Element
        if (parentElement == null)
            document.ChildElements.Add(currentElement);
        else
            parentElement.ChildElements.Add(currentElement);
        #endregion
        #region Add Content
        if (!str.Contains("/>"))
        {
            if (!document.SpecialElements.Contains(currentElement.Name))
            {
                // A normal element becomes the parent of whatever follows it.
                parentElement = currentElement;
                document.InsertContent(currentElement, workingMarkup, ">", 1);
            }
            else if (parentElement != null)
            {
                document.InsertContent(parentElement, workingMarkup, ">", 1);
            }
            else
            {
                document.InsertContent(workingMarkup, ">", 1);
            }
        }
        else
        {
            currentElement.IsSelfClosed = true;
            document.InsertContent(parentElement, workingMarkup, "/>", 2);
        }
        #endregion
    }
}
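The two extraction steps inside ParseString (element name, then attributes) may be easier to follow in isolation. The following is only a sketch and not part of the library: FragmentReader, ReadName, and ReadAttributes are hypothetical names, and the attribute pattern is a simplified one assumed for illustration (named groups, quoted or unquoted values) rather than the expression used above. It works on a single fragment as produced by markup.Split('<'), that is, the text that follows a '<'.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the article's MarkupLibrary.
public static class FragmentReader
{
    // Leading element name of a fragment such as: div class="box">text
    private static readonly Regex NamePattern =
        new Regex(@"^\s*(?<name>\w+)", RegexOptions.Compiled);

    // Assumed simplified pattern: name="value", name='value' or name=value
    private static readonly Regex AttributePattern = new Regex(
        @"(?<name>[\w:-]+)\s*=\s*(?:""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^\s>]+))",
        RegexOptions.Compiled);

    // Returns the element name, or null when the fragment does not start with one
    // (for example an end-tag fragment such as "/div>").
    public static string ReadName(string fragment)
    {
        Match m = NamePattern.Match(fragment);
        return m.Success ? m.Groups["name"].Value : null;
    }

    // Returns the name/value pairs found before the first '>' of the fragment.
    // Simplification: a '>' inside a quoted value would cut the tag short.
    public static List<KeyValuePair<string, string>> ReadAttributes(string fragment)
    {
        int end = fragment.IndexOf('>');
        string tagText = end >= 0 ? fragment.Substring(0, end) : fragment;

        var attributes = new List<KeyValuePair<string, string>>();
        foreach (Match m in AttributePattern.Matches(tagText))
        {
            attributes.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["value"].Value));
        }
        return attributes;
    }
}
```

For the fragment div class="box" id=main>text, ReadName returns div and ReadAttributes returns the pairs class/box and id/main, with the quotes already stripped from the values.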
Now suppose you know that some markup elements are meant to be single elements, written without an end tag. All you have to do is add their names to the SpecialElements list of the document object:
MarkupLibrary.MarkupDocument document = new MarkupLibrary.MarkupDocument();
document.SpecialElements.Add("br");
document.SpecialElements.Add("hr");
document.SpecialElements.Add("img");
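To see why this list matters, here is a small standalone sketch (not the library's IsSpecial or InsertContent logic, which is not shown in this article). NestingSketch and Depths are hypothetical names, and the tag pattern is an assumed simplification. The sketch only tracks how deep each opening tag ends up, depending on whether its name is treated as special.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical illustration, not part of the article's MarkupLibrary.
public static class NestingSketch
{
    // Matches an opening, closing, or self-closed tag and captures its name.
    private static readonly Regex TagPattern = new Regex(
        @"<(?<close>/)?(?<name>\w+)[^>]*?(?<self>/)?>", RegexOptions.Compiled);

    // Returns "name:depth" for every opening tag, in document order.
    public static List<string> Depths(string markup, ISet<string> specialElements)
    {
        var result = new List<string>();
        int depth = 0;
        foreach (Match tag in TagPattern.Matches(markup))
        {
            if (tag.Groups["close"].Success)
            {
                depth = Math.Max(0, depth - 1); // an end tag leaves the current level
                continue;
            }
            string name = tag.Groups["name"].Value;
            result.Add(name + ":" + depth);
            bool selfClosed = tag.Groups["self"].Success;
            // Only a normal element becomes the parent of what follows it.
            if (!selfClosed && !specialElements.Contains(name))
                depth++;
        }
        return result;
    }
}
```

For the markup `<p>one<br>two<img src="a.png"></p><hr><p>next</p>` and a set containing br, hr, and img, the result is p:0, br:1, img:1, hr:0, p:0. With an empty set, every unclosed br, img, and hr swallows the elements after it, and the result becomes p:0, br:1, img:2, hr:2, p:3.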
To load the document with the markup string, call the Load method:
document.Load(markup);
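Once a document is loaded, the two uses mentioned in the introduction (walking the tree and regenerating valid markup) come down to simple recursion. The sketch below uses a stand-in class, since the full source of MarkupElement is not reproduced here: SketchElement and TreeWalker are hypothetical, and only the member names Name, Attributes, ChildElements, and IsSelfClosed mirror what the parsing code above uses. Holding Content as a plain string written before the children is an assumption made to keep the example short.

```csharp
using System.Collections.Generic;
using System.Text;

// Stand-in for MarkupElement; hypothetical, for illustration only.
public class SketchElement
{
    public string Name = "";
    public string Content = "";   // assumption: text content kept as one string
    public bool IsSelfClosed;
    public List<KeyValuePair<string, string>> Attributes =
        new List<KeyValuePair<string, string>>();
    public List<SketchElement> ChildElements = new List<SketchElement>();

    // Serializes the element with quoted attribute values and a matching end tag.
    // Attribute values are written as-is; escaping is left out of this sketch.
    public override string ToString()
    {
        var sb = new StringBuilder();
        sb.Append('<').Append(Name);
        foreach (var attribute in Attributes)
            sb.Append(' ').Append(attribute.Key)
              .Append("=\"").Append(attribute.Value).Append('"');
        if (IsSelfClosed)
            return sb.Append(" />").ToString();
        sb.Append('>').Append(Content);
        foreach (SketchElement child in ChildElements)
            sb.Append(child.ToString());
        return sb.Append("</").Append(Name).Append('>').ToString();
    }
}

public static class TreeWalker
{
    // Depth-first walk producing one indented line per element, like a tree view.
    public static List<string> Outline(SketchElement element, int depth = 0)
    {
        var lines = new List<string> { new string(' ', depth * 2) + element.Name };
        foreach (SketchElement child in element.ChildElements)
            lines.AddRange(Outline(child, depth + 1));
        return lines;
    }
}
```

A div element with a class attribute, the content "hi", and one self-closed img child serializes to `<div class="box">hi<img src="a.png" /></div>`, and Outline returns the two lines "div" and "  img". With the real library, the same recursion would run over the ChildElements of the loaded document.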
Points of interest
Finally, I hope this helps you create your own object model from a markup representation.