Parser Schemas: Easy and Powerful Parsing of XML-Based Languages

Tomas Deml


18 Oct 2005


An article on parsing XML files according to the specified schema.

Introduction

This article focuses on parsing custom XML-based languages using a user-defined schema. (Sorry for possible language mistakes, I am not a native speaker).

If you want to parse an XML document with a specific structure, you don't need to load each value through the XmlDocument or XPathNavigator class using XPath expressions (and then check it for a null reference, validate it, and so on); all you have to do is define a parser schema. You can specify the elements and attributes an element must have, the maximum number of optional nodes that should be parsed (or the minimum number that must be present), and you can validate a node’s value at schema evaluation time. The most advanced feature of parser schemas is the ability to transform parsed mark-up into an object during schema evaluation. You provide a transformer (an instance of a class implementing a specific interface), and when the evaluation is done, you get the results.

Here comes Wayku

I decided to develop my own XML-based language for a specific purpose. To parse the mark-up, I used the XPathNavigator class, so I had to load every value with an XPath query before using it. And here comes the problematic part: I had to check for missing nodes and then validate the ones that were present. With all that validation logic, the resulting code was quite ugly.
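To make the problem concrete, here is a minimal sketch of that manual style. It uses only the standard System.Xml.XPath classes; the class name, the method name and the choice of FormatException are assumptions made for illustration, not code from the original project.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

static class ManualParsingSketch
{
    // Reads one value by hand: an XPath query, a null check, then a value check.
    // Every further value in the document repeats the same three steps.
    public static string ReadOwner(string xml)
    {
        XPathNavigator root = new XPathDocument(new StringReader(xml)).CreateNavigator();

        XPathNavigator owner = root.SelectSingleNode("/bookshelf/information/owner");
        if (owner == null)
            throw new FormatException("Missing element: owner");

        string value = owner.Value.Trim();
        if (value.Length == 0)
            throw new FormatException("Element owner must not be empty");

        return value;
    }

    static void Main()
    {
        string xml = "<bookshelf><information><owner>Bob</owner></information></bookshelf>";
        Console.WriteLine(ReadOwner(xml)); // prints "Bob"
    }
}
```

Multiply this by every element and attribute in a document and the checks soon outweigh the code that actually uses the values.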

So I brainstormed a little on how to simplify document parsing and node validation. Voilà, Wayku was born.

I am sure you are wondering what "Wayku" means. It is a nation (house) from the Dune universe. Waykus look after passengers on space ships while staying free on them. Parser schemas work the same way: they will serve you if you provide them with a schema. You pay (provide), they serve.

Possible usage

You may use parser schemas for:

parsing of custom XML-based languages.
validation of the XML document structure.
XML document transformation.
parsing of custom XML-based configuration files.

Parser schemas principles

General information

A parser schema consists of a set of rules. A rule specifies the node name, the namespace and the type of XML mark-up it can match (parse).

Some rules hold information about the nodes that must or should be present as their children. These rules are called parental rules. Parental rules match elements. All parental rules are, surprisingly, derived from the ParentalRule class.

The other kind is non-parental rules. These rules hold no information about child nodes, so they are suitable for parsing attributes, processing instructions and text nodes. All non-parental rules derive from the NonParentalRule class.

The base of all rules, parental or non-parental, is the Rule class.
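The hierarchy can be pictured roughly as follows. This is only a sketch: the three class names come from the text above, while every member, the constructors and the two concrete classes (ElementSketch, AttributeSketch) are assumptions added so that the example compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.XPath;

namespace RuleHierarchySketch
{
    // Common base: what to match (name, namespace, node type).
    abstract class Rule
    {
        public readonly string Name;
        public readonly string NamespaceUri;
        public readonly XPathNodeType NodeType;

        protected Rule(string name, string namespaceUri, XPathNodeType nodeType)
        {
            Name = name;
            NamespaceUri = namespaceUri;
            NodeType = nodeType;
        }
    }

    // Parental rules match elements and know about their children.
    abstract class ParentalRule : Rule
    {
        public readonly Dictionary<string, Rule> RequiredChildren = new Dictionary<string, Rule>();
        public readonly Dictionary<string, Rule> OptionalChildren = new Dictionary<string, Rule>();

        protected ParentalRule(string name)
            : base(name, string.Empty, XPathNodeType.Element) { }
    }

    // Non-parental rules match leaf mark-up such as attributes or text.
    abstract class NonParentalRule : Rule
    {
        protected NonParentalRule(string name, XPathNodeType nodeType)
            : base(name, string.Empty, nodeType) { }
    }

    // Hypothetical concrete rules, present only to make the sketch runnable.
    sealed class ElementSketch : ParentalRule
    {
        public ElementSketch(string name) : base(name) { }
    }

    sealed class AttributeSketch : NonParentalRule
    {
        public AttributeSketch(string name) : base(name, XPathNodeType.Attribute) { }
    }

    static class Program
    {
        static void Main()
        {
            ElementSketch book = new ElementSketch("book");
            book.RequiredChildren.Add("isbn", new AttributeSketch("isbn"));

            Rule child = book.RequiredChildren["isbn"];
            Console.WriteLine("{0} requires {1} '{2}'", book.Name, child.NodeType, child.Name);
        }
    }
}
```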

Usage

To construct a parser schema, you have to define a rule that will represent (and parse) the parent (usually the root) node of your XML document (note that you do not need to parse the document from its root). The parent node must be an instance of a class derived from the ParentalRule class.

Then you assign child rules to the parent node. Child rules can have their own child rules, and those can have theirs, ad infinitum. (Actually, the limit is the depth of the call stack, because child rules are evaluated recursively. Don’t worry, the stack is deep enough to handle even bizarrely structured documents.)

After you construct the schema, you evaluate it. If your document does not conform to the schema, an exception is thrown. Several kinds of failure can occur: the document does not contain a required node, it contains fewer optional elements than you asked for, or a node value is not valid (your handler judges the validity).

When the schema evaluation is finished (it takes only a moment, trust me), you can retrieve the required information.

Advanced principles

Say you want to validate the node value and if you find it invalid, stop the evaluation – done; say you want to transform evaluated data into an object when the evaluation of the node is finished – done; say you want to parse exotic kinds (for example comments) of XML structures but there are no rules matching them – done, just define your own rule.

Node value validation

If you want to validate the node value at the evaluation time, you can use a rule implementing the IValidatingRule interface. Then you can provide an instance of a class implementing the IValueValidator interface and that’s all.

The IValueValidator interface declares a single method with the signature bool IsValidValue(string). I considered implementing the validator as a delegate, but I chose an interface because other validation functionality may be needed in the future.

When the evaluation engine calls your validator, it passes in the node value for you to validate. If your method returns true, everybody is happy and the evaluation continues. If you decide to punish that nasty element with an invalid value and return false, the evaluation aborts.

Rules supporting the value validation are rules parsing attributes, text content and text elements. All these rules are derived from the ValidatingNonParentalRule class (implementing the IValidatingRule interface). As you can see, these rules cannot demand any child nodes.
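A validator can be very small. The sketch below assumes only the method signature quoted above; the interface is re-declared locally as a stand-in so that the example runs without the library, and EnumNameValidator is a hypothetical class, not the built-in validator that ships with parser schemas.

```csharp
using System;

// Stand-in for the library interface, declared here only so the sketch compiles alone.
interface IValueValidator
{
    bool IsValidValue(string value);
}

enum Rating { OneStar, TwoStars, ThreeStars, FourStars, FiveStars }

// Hypothetical validator: accepts only names defined by the given enum type.
sealed class EnumNameValidator : IValueValidator
{
    private readonly Type enumType;

    public EnumNameValidator(Type enumType)
    {
        if (enumType == null) throw new ArgumentNullException("enumType");
        if (!enumType.IsEnum) throw new ArgumentException("Type must be an enum.", "enumType");
        this.enumType = enumType;
    }

    public bool IsValidValue(string value)
    {
        // Enum.IsDefined compares names case-sensitively and rejects null, so guard first.
        return value != null && Enum.IsDefined(enumType, value);
    }
}

static class ValidatorDemo
{
    static void Main()
    {
        IValueValidator validator = new EnumNameValidator(typeof(Rating));
        Console.WriteLine(validator.IsValidValue("FiveStars")); // prints "True"
        Console.WriteLine(validator.IsValidValue("SixStars"));  // prints "False"
    }
}
```

Returning false for "SixStars" is what would make the evaluation abort for a document carrying that rating.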

Node transformation

If you want to make your life easier, you can use the so-called transforming rules. Transforming rules represent a special sort of rules; when these rules are evaluated, the evaluation engine calls the methods from the INodeTransformer interface assigned to them. You provide the transformation logic and the engine does the rest. Note that the transformation is initiated after all the child rules of the transforming rule are evaluated so that you can access them and their value from your transformation code.

Transforming rules are generic classes, so after evaluation you get strongly typed results.

Instead of retrieving and interpreting values of child rules of ordinary rules, you can get a single, ready-to-use object from the transforming rule.

Imagine that your XML document describes a bookshelf. You can construct a transforming rule describing the structure of the mark-up that represents a book and then evaluate it. If everything goes right, you get an instance of the Book class (which you designed beforehand). An easy-to-use, powerful solution, isn’t it?
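The ordering matters: the transformer runs only after the children are evaluated, so it can simply read their values. The following sketch illustrates that idea with stand-in types. EvaluatedNode, ITransformerSketch<T>, ShelfInfo and the capacity child are all hypothetical and exist only for this illustration; the library's own interface is INodeTransformer.

```csharp
using System;
using System.Collections.Generic;

namespace TransformationSketch
{
    // Stand-in for an evaluated rule: a value plus already evaluated children.
    class EvaluatedNode
    {
        public string Value;
        public readonly Dictionary<string, EvaluatedNode> Children =
            new Dictionary<string, EvaluatedNode>();

        public EvaluatedNode(string value) { Value = value; }
    }

    // Stand-in for the transformer contract (hypothetical signature).
    interface ITransformerSketch<T>
    {
        T Transform(EvaluatedNode node);
    }

    class ShelfInfo
    {
        public string Owner;
        public int Capacity;
    }

    class ShelfInfoTransformer : ITransformerSketch<ShelfInfo>
    {
        public ShelfInfo Transform(EvaluatedNode node)
        {
            // The children were evaluated before this call, so their values are ready.
            ShelfInfo info = new ShelfInfo();
            info.Owner = node.Children["owner"].Value;
            info.Capacity = int.Parse(node.Children["capacity"].Value);
            return info;
        }
    }

    static class Program
    {
        // Mimics the engine: the transformer is called once the node's children are in place.
        public static T Finish<T>(EvaluatedNode node, ITransformerSketch<T> transformer)
        {
            return transformer.Transform(node);
        }

        static void Main()
        {
            EvaluatedNode information = new EvaluatedNode(null);
            information.Children.Add("owner", new EvaluatedNode("Bob"));
            information.Children.Add("capacity", new EvaluatedNode("40"));

            ShelfInfo info = Finish<ShelfInfo>(information, new ShelfInfoTransformer());
            Console.WriteLine("{0} owns a shelf for {1} books", info.Owner, info.Capacity);
        }
    }
}
```

Because the result type is a generic parameter, the caller receives a ShelfInfo directly and needs no cast.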

Custom rules

As I wrote earlier, you can define your own rules. If you are not satisfied with built-in rules, you can define your own parental, non-parental, validating or transforming rules.

Just derive your rule from one of the abstract classes parser schemas come with. For parental and non-parental rules, as you know, you can derive your class from the ParentalRule class and NonParentalRule class, respectively.

The only thing that must be specified when defining a custom rule is the type of node that the rule will match. For example, if you would like to match comments, you specify the comment node type. You can find a list of all the available node types in the System.Xml.XPath.XPathNodeType enumeration.
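To see which node types a custom rule could target, it may help to list what XPathNavigator reports for a small document. The sketch below uses only standard System.Xml.XPath calls; the class name, the method name and the sample mark-up are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

static class NodeTypeSketch
{
    // Returns the XPathNodeType of each direct child of the root element.
    public static List<XPathNodeType> ChildNodeTypes(string xml)
    {
        XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
        nav.MoveToChild(XPathNodeType.Element); // step from the document root to the root element

        List<XPathNodeType> types = new List<XPathNodeType>();
        if (nav.MoveToFirstChild())
        {
            do
            {
                types.Add(nav.NodeType);
            } while (nav.MoveToNext());
        }
        return types;
    }

    static void Main()
    {
        string xml = "<shelf><!-- note --><?sort by-title?><book>Dune</book></shelf>";

        // Prints Comment, ProcessingInstruction and Element, one per line.
        foreach (XPathNodeType type in ChildNodeTypes(xml))
            Console.WriteLine(type);
    }
}
```

A custom rule for comments would be the one that declares XPathNodeType.Comment as its node type.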

After evaluation

When a rule is evaluated, you get an instance of either the EvaluatedParentalRule or the EvaluatedNonParentalRule class, depending on whether the rule is derived from the ParentalRule or the NonParentalRule class, respectively.

Under the hood

Parser schemas use the XPathNavigator class to navigate through the document. When the evaluation engine iterates through a node's children, it calls the bool IsMatch(XPathNavigator) method to determine whether the child can be parsed (matched) by the rule. If so, the engine calls the T Evaluate(XPathNavigator) method of the EvaluableRule<T> class (where T : EvaluatedRule), which is itself derived from the Rule class.
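The match-then-evaluate loop can be sketched in a few lines. This is a deliberately tiny model, not the real engine: MiniRule and EngineSketch are hypothetical, only required child elements are handled, and Evaluate returns a plain count where the library returns an evaluated rule object.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

// Hypothetical miniature of the engine; only the IsMatch/Evaluate split follows the article.
class MiniRule
{
    public readonly string Name;
    public readonly List<MiniRule> RequiredChildren = new List<MiniRule>();

    public MiniRule(string name) { Name = name; }

    // True when the navigator is positioned on an element this rule can parse.
    public bool IsMatch(XPathNavigator nav)
    {
        return nav.NodeType == XPathNodeType.Element && nav.LocalName == Name;
    }

    // Evaluates the current element and, recursively, its required children.
    // Returns the number of elements matched; throws when a required child is missing.
    public int Evaluate(XPathNavigator nav)
    {
        int matched = 1;
        foreach (MiniRule childRule in RequiredChildren)
        {
            bool found = false;
            XPathNavigator child = nav.Clone(); // keep the caller's position intact
            if (child.MoveToFirstChild())
            {
                do
                {
                    if (childRule.IsMatch(child))
                    {
                        matched += childRule.Evaluate(child);
                        found = true;
                    }
                } while (child.MoveToNext());
            }
            if (!found)
                throw new FormatException("Missing required element: " + childRule.Name);
        }
        return matched;
    }
}

static class EngineSketch
{
    public static int Run(string xml)
    {
        MiniRule information = new MiniRule("information");
        information.RequiredChildren.Add(new MiniRule("owner"));
        MiniRule bookshelf = new MiniRule("bookshelf");
        bookshelf.RequiredChildren.Add(information);

        XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
        nav.MoveToChild(XPathNodeType.Element);
        if (!bookshelf.IsMatch(nav))
            throw new FormatException("Unexpected root element");
        return bookshelf.Evaluate(nav);
    }

    static void Main()
    {
        string xml = "<bookshelf><information><owner>Bob</owner></information></bookshelf>";
        Console.WriteLine(Run(xml)); // prints "3"
    }
}
```

The recursion in Evaluate is also where the call-stack limit on nesting depth comes from.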

Using the code

Bookshelf

Imagine you are developing a bookshelf catalogue. To store the book metadata, you save it as an XML document with a defined structure:


<?xml version="1.0" ?>
<bookshelf>
    <information>
        <owner>Bob</owner>
    </information>
    <books>
        <book>
            <title>God emperor of Dune</title>
            <author>Frank Herbert</author>
            <rating>FiveStars</rating>
        </book>
        <book>
            <title>Heretics of Dune</title>
            <author>Frank Herbert</author>
            <rating>FiveStars</rating>
        </book>
        <book>
            <title>Neutron star</title>
            <author>Larry Niven</author>
            <rating>FiveStars</rating>
        </book>
    </books>
</bookshelf>

We also have the Book class representing the book metadata.

using System;

namespace TomasDeml.Samples.Bookshelf
{
    /// <summary>
    /// Defines book rating.
    /// </summary>
    enum Rating
    {
        /// <summary>
        /// *
        /// </summary>
        OneStar,

        /// <summary>
        /// **
        /// </summary>
        TwoStars,

        /// <summary>
        /// ***
        /// </summary>
        ThreeStars,

        /// <summary>
        /// ****
        /// </summary>
        FourStars,

        /// <summary>
        /// *****
        /// </summary>
        FiveStars
    }

    /// <summary>
    /// Represents book metadata.
    /// </summary>
    class Book
    {
        /// <summary>
        /// Gets the book title.
        /// </summary>
        public string Title; 

        /// <summary>
        /// Gets the book author.
        /// </summary>
        public string Author;

        /// <summary>
        /// Gets the book rating.
        /// </summary>
        public Rating Rating;

        /// <summary>
        /// Initializes a new instance of the 
        /// <see cref="Book"/> class.
        /// </summary>
        public Book() { }

        /// <summary>
        /// Initializes a new instance of the <see cref="Book"/>
        /// class with provided information.
        /// </summary>
        /// <param name="title">Book title.</param>
        /// <param name="author">Book author.</param>
        /// <param name="rating">Book rating.</param>
        public Book(string title, string author, Rating rating)
        {
            this.Title = title;
            this.Author = author;
            this.Rating = rating;
        }

        /// <summary>
        /// Returns the string representation
        /// of the book metadata.
        /// </summary>
        /// <returns></returns>
        public override string ToString()
        {
            return String.Format(
               "Book '{0}' by '{1}' rated {2}.", 
               this.Title, this.Author, this.Rating);
        }
    }    
}

Now imagine you want to load and validate the book metadata. Just create a schema describing the document structure, evaluate it and get the results.

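using System;
using System.Collections.Generic;
using System.Xml.XPath;
// A using directive for the parser schemas library namespace is also needed here.

namespace TomasDeml.Samples.Bookshelf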
{
    class Program
    {
        /// <summary>
        /// Parses the books xml using ordinary 
        /// (non-transforming) schema.
        /// </summary>
        /// <param name="navigator">Document navigator.</param>
        /// <param name="booksList">Output for Book objects.
        /// </param>
        private static void LoadBooksUsingOrdinarySchema(
                XPathNavigator navigator, List<Book> booksList)
        {
            // Create the schema
            // Root element named bookshelf
            ElementRule bookshelf = new ElementRule("bookshelf");

            // Element named information
            ElementRule information = new ElementRule("information");

            // Element information requires 
            // child element(s) containing
            // text-only content with name owner
            information.RequiredChildren.Add("owner", 
                                       new TextElementRule("owner"));

            // Element bookshelf requires child 
            // element(s) named information
            bookshelf.RequiredChildren.Add("information", 
                                                   information);

            // Element books
            ElementRule books = new ElementRule("books");

            // Element bookshelf requires child 
            // element(s) named books
            bookshelf.RequiredChildren.Add("books", books);

            // Element book
            ElementRule book = new ElementRule("book");

            // book element required children
            book.RequiredChildren.Add("title", 
                             new TextElementRule("title"));
            book.RequiredChildren.Add("author", 
                            new TextElementRule("author"));

            // At evaluation time, the value 
            // of the rating element
            // will be validated with an 
            // instance of a built-in class
            // implementing the IValueValidator interface
            book.RequiredChildren.Add("rating", 
               new TextElementRule("rating", 
               EnumValueValidator.Create(typeof(Rating))));

            // Element books doesn’t have to 
            // contain any book elements
            books.OptionalChildren.Add("book", book);

            // Construct the schema
            ParserSchema schema = new ParserSchema(bookshelf);

            // Evaluate it
            EvaluatedParserSchema results = 
              schema.Evaluate(navigator, 
               ParserSchema.ParentNodeMatchOption.NavigatorPointsToParentNode);

            // Get evaluated books element
            EvaluatedParentalRule eBooks = 
              results.ParentNode.LookupParentalChild("books", 
              EvaluatedParentalRule.LookupLocation.RequiredChildren);

            // Publish all books...
            foreach (EvaluatedParentalRule eBook in 
                     eBooks.LookupChildren("book", 
                     EvaluatedParentalRule.LookupLocation.OptionalChildren))
                booksList.Add(new Book(eBook.RequiredChildren["title"].Value,
                    eBook.RequiredChildren["author"].Value, 
                      (Rating)Enum.Parse(typeof(Rating), 
                      eBook.RequiredChildren["rating"].Value)));
        }
    }
}
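The listing above expects a navigator that is already positioned on the parent node. Judging by the option name NavigatorPointsToParentNode, that means the bookshelf element itself; this reading is an assumption. One way to obtain such a navigator with the standard System.Xml.XPath classes is sketched below; the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

static class NavigatorSetupSketch
{
    // Builds a navigator positioned on the document's root element.
    public static XPathNavigator CreateRootElementNavigator(TextReader xml)
    {
        XPathNavigator navigator = new XPathDocument(xml).CreateNavigator();
        if (!navigator.MoveToChild(XPathNodeType.Element))
            throw new FormatException("The document has no root element.");
        return navigator;
    }

    static void Main()
    {
        XPathNavigator navigator = CreateRootElementNavigator(
            new StringReader("<bookshelf><books /></bookshelf>"));
        Console.WriteLine(navigator.LocalName); // prints "bookshelf"

        // From here the navigator could be handed to a method such as
        // LoadBooksUsingOrdinarySchema(navigator, new List<Book>()).
    }
}
```

For a file on disk, a StreamReader or the XPathDocument constructor that takes a path can replace the StringReader.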

You can also use transforming rules and create Book objects at evaluation time. To do this, you have to implement the INodeTransformer interface.

namespace TomasDeml.Samples.Bookshelf
{
    /// <summary>
    /// Transforms the 'book' element into 
    /// the Book object at evaluation time.
    /// </summary>
    class BookTransformer : INodeTransformer<Book>
    {
        /// <summary>
        /// Transforms the 'book' element.
        /// </summary>
        /// <param name="evaluatedNodeRule">Evaluated
        ///                   'book' element.</param>
        /// <returns></returns>
        public Book Transform(EvaluatedRule evaluatedNodeRule)
        {
            // Cast it
            EvaluatedParentalRule rule = 
                      (EvaluatedParentalRule)evaluatedNodeRule;

            // New book
            Book book = new Book();

            // Title
            book.Title = rule.RequiredChildren["title"].Value;

            // Author
            book.Author = rule.RequiredChildren["author"].Value;

            // Get transformed rating
            book.Rating = 
              ((EvaluatedElementTransformationRule<Rating>)
                rule.RequiredChildren["rating"]).TransformationResult;

            // Return book object
            return book;
        }
    }
}

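Then you can construct and evaluate the schema with transforming rules. Compared with the previous listing, the changes are the two ElementTransformationRule<T> rules (for book and rating) and the final loop, which now reads TransformationResult directly.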

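using System;
using System.Collections.Generic;
using System.Xml.XPath;
// A using directive for the parser schemas library namespace is also needed here.

namespace TomasDeml.Samples.Bookshelf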
{
    class Program
    {
        /// <summary>
        /// Parses the books xml using transforming schema.
        /// </summary>
        /// <param name="navigator">Document navigator.</param>
        /// <param name="booksList">Output for Book objects.</param>
        private static void LoadBooksUsingTransformingRules(
                   XPathNavigator navigator, List<Book> booksList)
        {
            // Create the schema
            // Root element named bookshelf
            ElementRule bookshelf = new ElementRule("bookshelf");

            // Element named information
            ElementRule information = new ElementRule("information");

            // Element information requires child element(s)
            // containing text-only content with name owner
            information.RequiredChildren.Add("owner", 
                        new TextElementRule("owner"));

            // Element bookshelf requires child 
            // element(s) named information
            bookshelf.RequiredChildren.Add("information", information);

            // Element books
            ElementRule books = new ElementRule("books");

            // Element bookshelf requires child element(s) named books
            bookshelf.RequiredChildren.Add("books", books);

            // Rule transforming element 'book' into the Book object
            ElementTransformationRule<Book> book = 
                 new ElementTransformationRule<Book>("book", 
                 new BookTransformer());

            // book element required children
            book.RequiredChildren.Add("title", 
                          new TextElementRule("title"));
            book.RequiredChildren.Add("author", 
                         new TextElementRule("author"));

            // At evaluation time, the value of the rating element
            // will be validated and transformed with 
            // an instance of a built-in class
            // implementing the INodeTransformer interface
            book.RequiredChildren.Add("rating", 
                 new ElementTransformationRule<Rating>("rating", 
                 NodeValueOptions.GetValue | NodeValueOptions.RequireValue, 
                 new EnumValueTransformer<Rating>()));

            // Element books doesn’t have to contain any book elements
            books.OptionalChildren.Add("book", book);

            // Construct the schema
            ParserSchema schema = new ParserSchema(bookshelf);

            // Evaluate it
            EvaluatedParserSchema results = schema.Evaluate(navigator, 
              ParserSchema.ParentNodeMatchOption.NavigatorPointsToParentNode);

            // Get evaluated books element
            EvaluatedParentalRule eBooks = 
              results.ParentNode.LookupParentalChild("books", 
              EvaluatedParentalRule.LookupLocation.RequiredChildren);

            // Publish all books...
            foreach (EvaluatedElementTransformationRule<Book> eBook 
                     in eBooks.LookupChildren("book", 
                     EvaluatedParentalRule.LookupLocation.OptionalChildren))
                booksList.Add(eBook.TransformationResult);
        }
    }
}

There are also rules parsing processing instructions, attributes and text content.

Building the code

To build the code, you need at least Microsoft Visual Studio 2005 or Microsoft Visual C# Express 2005 Beta 2 installed.

To build the code without Visual Studio, you can use the msbuild tool supplied in the Microsoft .NET Framework 2.0 package. Just open cmd.exe and type:

msbuild /p:Configuration=Release ParserSchemas.sln

Licence

Parser schemas are distributed under the LGPL license.

History

7/30/05 - first alpha.
9/1/05 - v1.0.0.0 released.
9/1/05 - v1.0.0.1 released (minor bug fix in the evaluation engine).
9/3/05 - v1.0.0.2 released (minor perf improvement).
10/18/05 - v1.1.0.0 released (added the rule linking support and some events).

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
