A Simple SGML Parser and Visitor Pattern Loveliness

Sacha Barber

4.95/5 (23 votes)

11 Jan 2009CPOL8 min read

96.2K

356

A look at the Visitor pattern, and a Reflective version.

Download demo project - 49.6 KB

Introduction
Standard Visitor Pattern
Reflective Visitor
Conclusion
So What Do You Think ?

Introduction

As software developers, it is essentially our job to try and familiarize ourselves with the tips/tricks that make our development easier. Patterns are one area that can aid in the development process. This article will discuss a Design Pattern called the "Visitor" pattern. Now, some of you may know about the Visitor already, but for those that don't, this article may be of use.

I also want to mention the reason I am choosing to write about the Visitor pattern right now. The reason to that actually lies in the next article that I will be writing, which will have something to do with extending LINQ. You see, I didn't want to embark on a LINQ Extension mission before I had covered what I would consider to be a fundamental building block in the way certain parts of LINQ work under the surface.

To this end, this article will describe both the standard Visitor pattern and a somewhat better, more flexible solution.

Standard Visitor Pattern

For those of you that don't know what the Visitor pattern is all about, this excerpt from Wikipedia does a pretty good job of explaining the pros/cons of the Visitor pattern:

"In object-oriented programming and software engineering, the Visitor Design Pattern is a way of separating an algorithm from an object structure upon which it operates. A practical result of this separation is the ability to add new operations to existing object structures without modifying those structures. Thus, using the Visitor pattern helps conformance with the open/closed principle.

In essence, the visitor allows one to add new virtual functions to a family of classes without modifying the classes themselves; instead, one creates a visitor class that implements all of the appropriate specializations of the virtual function. The visitor takes the instance reference as input, and implements the goal through double dispatch.

While powerful, the Visitor pattern does have limitations as compared with conventional virtual functions. It is not possible to create visitors for objects without adding a small callback method inside each class, and the callback method in each of the classes is not inheritable to the level of the new subclass."

-- http://en.wikipedia.org/wiki/Visitor_pattern

So how does that translate into code? Well, a class diagram for a simple implementation looks like this, where the implementation is a number of different cheeses that need to be examined, but one could imagine a better structure that required visiting, such as elements within an SGML document, or parts of an Expression within an Expression body (possibly for an Expression within one of the System.Linq.Expressions Expression classes... possibly).

The basic idea is that there is some object that implements the Visitor interface. And there are n-many other objects all of which implement a Visitable interface which allows the acceptance of a Visitor type. When a Visitor type is accepted, a call is made back to Visitor where the Visit method within the Visitor object is passed the current Visitable object. This mechanism is usually repeated for all items within some structure, such as a List/Tree of Visitable objects.

The beauty of this pattern is that you can simply visit, and you can be sure the structure of the initial object(s) that are being visited are intact. Thus this pattern is quite well suited to traversing some sort of structure object, such as SGML for example.

Let's examine some code.

Let's start with looking at some objects that implement a common Visitable interface. It can be seen that these classes (very simple classes, for the sake of this article) all have a public void Accept(Visitor v) { v.Visit(this); }, so what happens when the Accept() method is called is that the correct Visit() method (the one that matched the called signature) within the Visitor object is then called.

/// <summary>
/// A Cheese Interface
/// </summary>
interface Visitable 
{ 
    void Accept(Visitor v); 
}

/// <summary>
/// Wensleydale cheese that accepts a Visitor
/// </summary>
sealed class Wensleydale : Visitable 
{
    public String CheeseName { get { return "This is Wensleydale"; } }
    public void Accept(Visitor v) { v.Visit(this); }
}

/// <summary>
/// Gouda cheese that accepts a Visitor
/// </summary>
sealed class Gouda : Visitable 
{
    public String CheeseName { get { return "This is Gouda"; } }
    public void Accept(Visitor v) { v.Visit(this); }
}

/// <summary>
/// Brie cheese that accepts a Visitor
/// </summary>
sealed class Brie : Visitable 
{
    public String CheeseName { get { return "This is Brie"; } }
    public void Accept(Visitor v) { v.Visit(this); }
}

And this is an implementation of a Visitor type. In this small example, I have used a simple class, but more commonly, it may be a Form or some sort of crazy DSL creator that implements the Visitor interface.

/// <summary>
/// I have used a simple class here, but you can imagine that
/// this could be a form, or some more meaningful class
/// that does more than print the Name of the Visited item
/// </summary>
class VisitorImplementation : Visitor 
{
    /// <summary>
    /// Visit Wensleydale
    /// </summary>
    public void Visit(Wensleydale w) { Console.WriteLine(w.CheeseName); }

    /// <summary>
    /// Visit Gouda
    /// </summary>
    public void Visit(Gouda g) { Console.WriteLine(g.CheeseName); }

    /// <summary>
    /// Visit Brie
    /// </summary>
    public void Visit(Brie b) { Console.WriteLine(b.CheeseName); }
}

And here is a small test, to put all this to the test.

class Program
{
    static void Main(string[] args)
    {
        //Create some items
        List<Visitable> cheeses = new List<Visitable>()
        {
            new Wensleydale(),
            new Gouda(),
            new Brie(),

        };

        //Now Visit them
        Visitor v = new VisitorImplementation();
        cheeses.ForEach(c => c.Accept(v));


        Console.ReadLine();
    }
}

And here is the output:

Now this all looks pretty cool, so what are the downfalls of this approach? Well, using the standard Visitor pattern, you need to create a Visitable class per item you need to Visit, and a small callback code section within the Visitor implementing object.

This may not sound like much work, but imagine you are working against a huge SGML document with literally thousands of possible tag types, this approach would not be that fun to implement in that case, would it?

So what can we do about it? That is something that I myself asked some time ago. Actually, it was when I was at Uni and we had to write an HTML parser, and we looked into it and found Reflection had the answer. I have decided to resurrect the concept that we used in order to show you another possible approach which would not require the use of so many Visitable classes and so many callbacks. As I also said, the reason that I am writing this article is that it fits in with another article I will be writing soon.

Reflective Visitor

You have now seen the standard Visitor pattern in all its glory, how about seeing another go at the Visitor pattern, but this time using Reflection?

For this example, I have chosen to create a small Parser/Reflective Visitor that is used to parse and visit a small SGML document, that has limited support. The Tags the SGML will support are SGML_Tag/H_Tag/P_Tag/Pre_Tag/Title_Tag/Text_Tag.

SGML_Tag is the base class for all other supported tags. SGML_Tag looks like this:

/// <summary>
/// Base class for all SGML_Tag subclasses
/// Provides a very minimal implementation of
/// a SGML_Tag
/// </summary>
public class SGML_Tag
{
    #region Ctor
    public SGML_Tag(String name)
    {
        this.Name = name;
        Children = new List<SGML_Tag>();
    }
    #endregion

    #region Public Properties
    public String Name { get; private set; }
    public List<SGML_Tag> Children { get; private set; }
    public String CodeType
    {
        get
        {
            return String.Empty;
        }
    }
    #endregion

    #region Public Methods
    /// <summary>
    /// Adds a child to the current SGML_Tag
    /// </summary>
    public void AddChild(SGML_Tag child)
    {
        this.Children.Add(child);
    }
    #endregion

    #region Overrides
    public override string ToString()
    {
        return String.Format(Name);
    }
    #endregion
}

Where a typical subclass may look like this:

/// <summary>
/// A Title Tag
/// </summary>
public class Title_Tag : SGML_Tag
{
    #region Ctor
    public Title_Tag(String name) : base(name)
    {

    }
    #endregion

    #region Public Properties
    public String Text { get; set; }
    #endregion
}

It can be seen that SGML_Tag has a list of child elements. This is inline with an SGML structure where we essentially have a tree. The parser will actually create a root SGML_Tag that will serve as the source of items that need to be visited.

What the Parser is Not

The parser I have written for this task is deliberately dumb, and is by no means a workable parser. The only thing it does is recognize supported tags as outlined above and create a tree of these as the SGML document is parsed. That's it. The reason for this is that this is all that is required in order to illustrate the use of Reflection and the Visitor pattern.

The basic idea is that the input SGML document is parsed into a tree of SGML_Tag objects that are then visited using Reflection by some other object (the Visitor in the standard Visitor pattern), where there is always only one root which holds the child SGML_Tag objects.

Let us have a look at the parser.

public class SimpleGMLParser
{
    private SGML_Tag root;

    /// <summary>
    /// Constructor
    /// </summary>
    public SimpleGMLParser() 
    {
        root = null;
    }

    /// <summary>
    /// Gets the parsed tree of SGML_Tag(s) by returning
    /// the root SGML_Tag which holds child SGML_Tag(s)
    /// </summary>
    [MethodImpl(MethodImplOptions.Synchronized)]
    public SGML_Tag GetParsedTree() 
    {
        return root;
    }

    /// <summary>
    /// Parse input document
    /// </summary>
    [MethodImpl(MethodImplOptions.Synchronized)]
    public Boolean Parse() 
    {
        XElement rawHtml = null;
        using (StreamReader fileReader = new StreamReader(Program.CurrentFile))
            rawHtml = XElement.Parse(fileReader.ReadToEnd());

        root = new SGML_Tag(rawHtml.Name.LocalName);
        
        //loop through 1st level elements
        foreach (XElement xmlItem in rawHtml.Elements())
        {
            DealWithSingleNode(xmlItem, root);
        }

        return true;
    }

    /// <summary>
    /// Does a recursive call by obtaining the elements for the current
    /// XElements Descendants
    /// </summary>
    private void GetElements(XElement xmlParent, SGML_Tag tagParent)
    {
        foreach (XElement xmlItem in xmlParent.Descendants())
        {
            DealWithSingleNode(xmlItem, tagParent);
        }
    }

    /// <summary>
    /// Creates a Tag for a single element, and adds it to the tree of created
    /// SGML_Tag(s)
    /// </summary>
    private void DealWithSingleNode(XElement xmlItem, SGML_Tag tagParent)
    {
        if (xmlItem.HasElements)
        {
            SGML_Tag child = CreateTag(xmlItem.Name.LocalName);
            tagParent.Children.Add(child);
            GetElements(xmlItem, child);
        }
        else
        {
            SGML_Tag child = CreateTag(xmlItem.Name.LocalName);
            tagParent.Children.Add(child);
        }
    }

    /// <summary>
    /// Attempts to create a SGML_Tag or one of its subclasses by examining
    /// the incoming tag value. The tag value is used to do a lookup
    /// against method names in this class using reflection, and if an 
    /// appropriate method is found it is called, 
    /// otherwise a Text_Tag is returned
    /// </summary>
    private SGML_Tag CreateTag(String tag)
    {
        String methodName = tag.ToLower(); ;
        MethodInfo mi;

        // Turn the input into a potential method name.
        methodName = methodName.ReplaceAll(new string[] {"\\D"}, "").ToLower();
        methodName = methodName.Substring(0, 1).ToUpper() 
            + methodName.Substring(1);
        methodName = "Make_" + methodName + "_Tag";

        //Attempt to find the method in this class with methodName name.
        mi = (from MethodInfo m in this.GetType().GetMethods(
                  BindingFlags.Instance | BindingFlags.NonPublic)
              where m.Name.Equals(methodName)
              select m).SingleOrDefault();

        if (mi  == null)
            return Make_Text_Tag(tag);
        else
        {
            //Do a Reflective method invocation call to 
            //correct Make_XXX_Tag method
            return (SGML_Tag)mi.Invoke(this, new Object[] { tag });
        }
    }

    /// <summary>
    /// Creates a new Title_Tag object which is added the current parent
    /// </summary>
    private Title_Tag Make_Title_Tag(String tag)
    {
        Title_Tag newTag = new Title_Tag(tag);
        return newTag;
    }

    /// <summary>
    /// Creates a new H_Tag object which is added the current parent
    /// </summary>
    private H_Tag Make_H_Tag(String tag)
    {
        int level = Int32.Parse(
            tag.ReplaceAll(new string[] { "<","h", ">" }, ""));
        H_Tag newTag = new H_Tag(tag, level);
        return newTag;
    }

    /// <summary>
    /// Creates a new Pre_Tag object which is added the current parent
    /// </summary>
    private Pre_Tag Make_Pre_Tag(String tag)
    {
        Pre_Tag newTag = new Pre_Tag(tag);
        return newTag;
    }

      //Other Make_XXX_Tag methods left out for clarity
    ....
    ....
    ....
 
    #endregion
}

As I say, it is a very naive implementation of an SGML parser. But that's OK, as that is not the point of this article. All that you need to know is that at the time of the parse process, the root SGML_Tag property will end up with a tree of parsed SGML_Tag(s), which are then ready to be visited. To those of you that are interested, it can be seen that the parser also uses Reflection to work out what SGML_Tag tag(s) to create.

In order to reflectively visit this root SGML_Tag property and all its associated SGML_Tag children, we need something that will actually do something with the root SGML_Tag.

I am using a standard WinForms app, but there is a ViewController that deals with most of the form interaction. The ViewController is pretty simple, let's see its code:

/// <summary>
/// A simple controller for FormParsedItems
/// form
/// </summary>
public class ViewController
{
    private TreeNode root = null;

    /// <summary>
    /// Ctor
    /// </summary>
    public ViewController(IView currentView)
    {
        this.CurrentView = currentView;
        this.TreeOfVisitedNodes = this.CurrentView.GetControlByName(
                                       "tvParsedItems") as TreeView;

    }

    public IView CurrentView { get; private set; }
    public TreeView TreeOfVisitedNodes { get; private set; }

    /// <summary>
    /// Create a Parser and parse input document, then
    /// Reflectively visit the tree of parsed SGML_Tag(s)
    /// </summary>
    public void Run()
    {
        SimpleParser.SimpleGMLParser parser = 
        new SimpleParser.SimpleGMLParser();

        if (parser.Parse())
        {
            SGML_Tag rootTag = parser.GetParsedTree();
            if (rootTag != null)
            {
                root = TreeOfVisitedNodes.Nodes.Add("Root Node", rootTag.Name);
                TreeOfVisitedNodes.ExpandAll();
                this.TraverseVisitableNodes(rootTag, root);
            }
        }
    }

    /// <summary>
    /// Recursivly traverses current SGML_Tag children and Visits
    /// each in turn
    /// </summary>
    private void TraverseVisitableNodes(SGML_Tag htmlParent, TreeNode treeParent)
    {
        foreach (SGML_Tag htmlTag in htmlParent.Children)
        {
            if (htmlTag.Children.Count > 0)
            {
                TreeNode childNode = Visit(htmlTag, treeParent);
                TraverseVisitableNodes(htmlTag, childNode);
            }
            else
            {
                Visit(htmlTag, treeParent);
            }
        }
    }

    /// <summary>
    /// Attempts to visit a SGML_Tag or one of its subclasses by examining
    /// the incoming tag value. The tag value is used to do a lookup
    /// against method names in this class using reflection, and if an
    /// appropriate method is found it is called.
    /// </summary>
    private TreeNode Visit(SGML_Tag tag, TreeNode parent)
    {
        String methodName = tag.Name.ToLower();
        MethodInfo mi;

        // Turn the input into a potential method name.
        methodName = methodName.ReplaceAll(
                         new string[] { "\\D" }, "").ToLower();
        methodName = methodName.Substring(0, 1).ToUpper() + methodName.Substring(1);
        methodName = "Visit_" + methodName + "_Tag";

        //Attempt to find the method in this class with methodName name.
        mi = (from MethodInfo m in this.GetType().GetMethods(
            BindingFlags.Instance | BindingFlags.NonPublic)
              where m.Name.Equals(methodName)
              select m).SingleOrDefault();

        if (mi == null)
            return null;
        else
        {
            return (TreeNode)mi.Invoke(this, new Object[] { tag, parent});
        }
    }

    /// <summary>
    /// Visits a H_Tag
    /// </summary>
    private TreeNode Visit_H_Tag(H_Tag ht, TreeNode parent)
    {
        TreeNode newParent = parent.Nodes.Add(ht.Name, ht.Name);
        TreeOfVisitedNodes.ExpandAll();
        Console.WriteLine(String.Format("Visiting H_Tag {0}", ht.ToString()));
        return newParent;
    }

    /// <summary>
    /// Visits a Text_Tag
    /// </summary>
    private TreeNode Visit_Text_Tag(Text_Tag tt, TreeNode parent)
    {
        TreeNode newParent = parent.Nodes.Add(tt.Name, tt.Name);
        TreeOfVisitedNodes.ExpandAll();
        Console.WriteLine(String.Format("Visiting Text_Tag {0}", tt.ToString()));
        return newParent;
    }

       //Other Make_XXX_Tag methods left out for clarity
    ....
    ....
    ....
}

As you can see, there was no Visitor/Visitable interfaces; instead, we rely on the types themselves to give us a name of a method that should be used to do the Visiting. Now, this approach may not suit everyone, but it is worth having a think to see if this approach would work for you.

Running the app for the following SGML document:

XML

<html>
    <p>
        <h1>Heading1</h1>
        <h2>Heading2</h2>
    </p>
    <h2>Heading2</h2>
</html>

We end up with the following:

Other Nice Stuff

There is a small extension method that I wrote that I found quite useful for replacing all occurrences of certain values within a string; this is as shown below:

/// <summary>
/// String Extension methods, and associayed helpers
/// </summary>
public static class StringExtensionMethods
{
    #region String.ReplaceAll(..)
    /// <summary>
    /// Replaces all instances of the strings in the unwanted string 
    /// array within the inputString with the replacement string 
    /// </summary>
    public static String ReplaceAll(this String inputString, 
        String[] unwanted, String replacement)
    {
        for (int i = 0; i < unwanted.Length; i++)
        {
            if (unwanted[i].Equals("\\D"))
            {
                inputString =StringExtensionMethods.StripNumbers(inputString);
            }
            else
            {
                inputString = inputString.Replace(unwanted[i], replacement);
            }
        }
        return inputString;
    }
    #endregion

    #region Private Helper Methods
    /// <summary>
    /// Strips all numbers from an input string, and returns
    /// the stripped string
    /// </summary>
    private static String StripNumbers(string input)
    {
        Regex regEx = new Regex("[0-9]+");
        StringBuilder sb = new StringBuilder();
        foreach (char a in input)
        {
            if (!regEx.IsMatch(a.ToString()))
            {
                sb.Append(a);
            }
        }

        return sb.ToString();
    }
    #endregion

}

Conclusion

As you can see, we can simplify the standard Visitor pattern somewhat by using Reflection. Next time, we are going to look at the role the Visitor pattern has in LINQ. In particular, we will examine how the System.Linq.Expressions namespace makes use of the Visitor pattern.

Anyway, that is pretty much all I wanted to say about this pattern this time, but I hope that if you liked this article, you will want to read more about it and some LINQness in the next article I write.

So What Do You Think?

I would just like to ask, if you liked this article, or think it would be useful, leave a vote and a comment. Many thanks.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)