Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

Programming XPath with VTD-XML

0.00/5 (No votes)
10 Jun 2010 1  
Learn to take advantage of VTD-XML's stateless XPath capability to achieve unmatched efficiency and flexibility

Introduction

Originally seen as an element for W3C's XSLT and XPointer, XPath has been gaining popularity as an effective tool for developing XML applications. A significant benefit of XPath is its simplicity: Instead of manually navigating the hierarchical data structure, you can use compact, "file-system"-like expressions to address any nodes in XML documents. However, most of existing XPath engines work with DOM trees or similar object models, which are slow to build and modify, and consume excessive amount of memory. This presents a dilemma for anyone looking to take advantage of XPath for SOA applications that either (1) are performance sensitive or (2) routinely deal with large XML documents.

My last several articles on VTD-XML (http://vtd-xml.sourceforge.net) as the next generation XML processing model that goes beyond DOM and SAX in performance, memory usage and ease of use. VTD-XML is simultaneously:

  • Memory-efficient: VTD-XML typically consumes an amount of memory somewhere between 1.3~1.5 times the size of the XML document, meaning it works very well with large XML documents.
  • High-performance: VTD-XML typically outperforms DOM parsers by 5~10 times, and it typically outperforms SAX parsers with null content handler by about 100%.
  • Easy to use: Applications written in VTD-XML are more compact and readable than those written in DOM or SAX.

What is VTD-XML's not-so-secret sauce? Unlike traditional XML parsers which create string-based tokens as the first step of parsing, VTD-XML internally uses linear buffers to store 64-bit integers containing the starting offsets and lengths of XML tokens, while keeping XML document intact and undecoded. All VTD-XML's benefits result, one way or the other, from this "non-extractive" tokenization. At the API level, VTD-XML consists of the following core classes:

  • VTDGen is the class that encapsulates the main parsing, index writing and index loading functions.
  • VTDNav exports a cursor-based API that contains the methods that navigate the XML hierarchy.
  • AutoPilot is the class that allows for document-order element traversal similar to Xerces' NodeIterator.

UnderStanding VTD-XML's XPath Capability

VTD-XML's XPath implemenation, which supports the full W3C XPath 1.0 spec, was introduced since version 1.0. It builds upon VTDNav's concept of cursor-based navigation. AutoPilot is the main class that exports all the XPath related methods. As described in one of the early articles, to manually navgiate an XML document's hiearchical structure, you obtain an instance of VTDNav and repeatedly call "toElement()" to move the cursor to various parts of the documents. Using XPath, you essentially ask AutoPilot to *automatically* move VTDNav's cursor to all the qualified nodes in the document.

AutoPilot's XPath-related methods are:

  • declareXPathNameSpace(...): This method binds a namespace prefix (used in an XPath expression) to a URL
  • selectXPath(...): This method compiles a XPath expression into internal representation.
  • evalXPath(...): This method moves the cursor to a qualified node in the node set.
  • evalXPathToBoolean(...), evalXPathToNumber(...), evalXpathToString(...): These three methods evalutes an XPath expression respectively to a boolean, a double and a string.
  • resetXPath(): Reset the internal state so the XPath can be re-used.
  • getExprString(): Call this method to verify the correctness of the compiled expression.

In addtion, two more exception classes are introduced for XPath:

  • XPathParseException: It is thrown when there is a syntax error in the XPath expression.
  • XPathEvalException: It is thrown when there is an exception condition during XPath evalution.

The most important innovation of VTD-XML's XPath implementation is its stateless, non-blocking evaluation of the node set. Unlike other XPath engines (e.g. XmlDocuemnt, XPathDocument JAXEN, XALAN, etc) that return the entire nodeset all at once, VTD-XML returns a qualified node immediately after the XPath engine finds a match. The key method is AutoPilot's evalXPath(), which moves the cursor to the next node (if there is at least one remaining in the nodeset) every time it is called. To illustrate the differences between those two styles of XPath evalution, let us look at the two short snippets written respectively in JAXEN and VTD-XML that execute the same application logic on "catalog.xml" as shown below.

<CATALOG>
    <CD>
        <TITLE>Empire Burlesque</TITLE>
        <ARTIST>Bob Dylan</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>Columbia</COMPANY>
        <PRICE>10.90</PRICE>
        <YEAR>1985</YEAR>
     </CD>
     <CD>
        <TITLE>Still got the blues</TITLE>
        <ARTIST>Gary More</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>Virgin redords</COMPANY>
        <PRICE>10.20</PRICE>
        <YEAR>1990</YEAR>
     </CD>
     <CD>
        <TITLE>Hide your heart</TITLE>
        <ARTIST>Bonnie Tyler</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>CBS Records</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1988</YEAR>
     </CD>
     <CD>
        <TITLE>Greatest Hits</TITLE>
        <ARTIST>Dolly Parton</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>RCA</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1982</YEAR>
     </CD>
</CATALOG>

The XPathDocument based code is shown below:

using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;
using System.Xml.XPath;

namespace app0
{
    class Program
    {
        static void Main(string[] args)
        {
            XPathDocument myXPathDocument = new XPathDocument(args[0]);
            XPathNavigator myXPathNavigator = myXPathDocument.CreateNavigator();
            XPathExpression myXPathExpr = myXPathNavigator.Compile(
                "/CATALOG/CD/TITLE/text()");
            XPathNodeIterator myXPathNodeIterator = myXPathNavigator.Select(myXPathExpr);

            // iterate and print out the nodes
            while (myXPathNodeIterator.MoveNext())
                Console.WriteLine(myXPathNodeIterator.Current.Value);
        }
    }
}

The same application written in VTD-XML is shown below:

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace app1
{
    class Program
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            int i;
            AutoPilot ap = new AutoPilot();
            ap.selectXPath("/CATALOG/CD/TITLE/text()");
            if (vg.parseFile("catalog.xml", false))
            {
                VTDNav vn = vg.getNav();
                ap.bind(vn);
                //XPath eval returns one node at a time
                while ((i = ap.evalXPath()) != -1)
                {
                    Console.WriteLine(" text ==> " +
                           vn.toString(i));
                }
                ap.resetXPath();
            }
        }
   }
}

The code sample shown above deals with only one XPath expression. If you deal with multiple XPath expressions, you will need to instantiate multiple AutoPilot objects, each of which corresponds to a different XPath expression. If you call selectXPath(...) multiple times, the AutoPilot object remembers only the last XPath expression. Once an AutoPilot object selects an XPath expression, you call evalXPath() which moves the cursor of the embedded VTDNav instance to the next available node and returns its VTD record index. If no more node remains in the node set, evalXPath() returns -1.

What if you want to execute same XPath query over multiple XML documents? The key, in this case, is to combine XPath reuse with delayed binding for AutoPilot. After executing a query on one XML document, you need to first bind (by calling bind(...) ) the same AutoPilot object to a new VTDNav object corresponding to a different XML document. After evaluating the XPath, you also need to call resetXPath() which resets the internal state of XPath so it is ready for reuse.

The following example shows you how apply two XPath expressions over two separate XML documents.

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace app2
{
   class test {
     static void Main(string[] args)
     {
         VTDGen vg = new VTDGen();
         int i;
         AutoPilot ap = new AutoPilot();
         ap.selectXPath("/CATALOG/CD/TITLE/text()");
         AutoPilot ap2 = new AutoPilot();
         ap2.selectXPath("/CATALOG/CD[PRICE < 10]/TITLE/text()");

         // apply both XPath expressions to "catalog.xml"
         if (vg.parseFile("catalog.xml", false))
         {
             VTDNav vn = vg.getNav();
             ap.bind(vn);
             //XPath eval returns one node at a time
             while ((i = ap.evalXPath()) != -1)
             {
                 Console.WriteLine(" text ==> " +
                        vn.toString(i));
             }
             ap.resetXPath();

             ap2.bind(vn);
             while ((i = ap2.evalXPath()) != -1)
             {
                 Console.WriteLine(" text ==> " +
                        vn.toString(i));
             }
             ap2.resetXPath();
         }
         // apply both XPath expressions to a different file
         if (vg.parseFile("another.xml", false))
         {
             VTDNav vn = vg.getNav();
             ap.bind(vn);
             //XPath eval returns one node at a time
             while ((i = ap.evalXPath()) != -1)
             {
                 Console.WriteLine(" text ==> " +
                        vn.toString(i));
             }
             ap.resetXPath();

             ap2.bind(vn);
             while ((i = ap2.evalXPath()) != -1)
             {
                 Console.WriteLine(" text ==> " +
                        vn.toString(i));
             }
             ap2.resetXPath();
         }
     }
  } 
}

Evaluate XPath to String, Number and Boolean

XPath 1.0 also defines that an expression may be evaluated to a string, a double, or a boolean. VTD-XML's AutoPilot class contains evalXPathToString(), evalXPathToNumber() and evalXPathToBoolean(). Notice that those methods don't need resetXPath() in order to be reused as they perform reset internally.

Benefits

When comparing with other XPath engines, VTD-XML's non-blocking XPath evaluation approach has a number of benefits. Without sacraficing any inherent benefits of XPath, VTD-XML offers unsurpassed flexibility and puts more control in your hands. Consider the use case in which you want to use XPath to process a large XML document and the node set contains thousands of nodes. With other XPath engines, your application logic will have to wait until the entire node set to be returned before doing anything. With VTD-XML, you application logic can start as soon as the first node gets returned. If your application only processes a small number of nodes in the node set, or you have prior knowledge of the XML document (e.g. you know there is only one occurrence of the matching node at the beginning of a large XML file), why wait until the XPath finishes searching the entire document? In this case, other XPath engine's node evaluation unnecessarily blocks the program flow and seem more like the wrong approach! In other use cases, VTD-XML's non-blocking XPath can even be used to walk the nodes, which simply isn't possible with other types of XPath engines.

When combined with other benefits/features of VTD-XML, the comparison with DOM and SAX is about to get lopsided in a hurry. Regardless of the file sizes, whether it is parsing, updating or XPath evaluation (shown later), VTD-XML delivers the kind of across-the-board performance improvements never before thought possible by many. As the processing model of choice for your transactional SOA applications dealing with small to medium size XML files, VTD-XML will easily blow away DOM . And the benefit only gets bigger for large XML files. You literally moves from long, ugly, and unmaintainable SAX code to clean, short VTD-XML code, while enjoying the full power of VTD-XML's random access and a 10 to 50 times speedup along the way. If you regularly deal with large XML documents or XPath expressions returning huge node sets, VTD-XML is a must-have!

Common Usage Patterns

There are a few common patterns to follow in order to base your application on VTD-XML.

Embed evalXPath(..) in while's control statement

The most common pattern is to embed evalXPath() in the control statement of a while loop. So far all the code uses this pattern to navigate to the qualified nodes in the document. The return value of EvalXPath(), if equal to -1, indicates there is no node remaining in the node set. Otherwise, the return value equals to the VTD record index of the node selected by the XPath, to which the cursor of the VTDNav object (bound to the AutoPilot object) also points. The following two short code fragments are completely equivalent.

int i;

ap.bind(vn);
ap.selectXPath(...); 
while( (i=ap.evalXPath())!=-1){ 

}
int i;

ap.bind(vn); 
ap.selectXPath(...);

while(ap.evalXPath()!>=0){ 
i = vn.getCurrentIndex();

}

Combining with Manual Navigation

But sometimes you want to navigate the cursor a bit further within the while loop. To do that, the rule is that the code segment nested in the loop must not alter the node position. You can either move the cursor back to the original location, or use the VTDNav object's push() and pop() to enforce the rule. For the same XML data shown above, the first example, performing a little manual navigation in the while loop, moves the cursor back to the starting position before the next iteration.

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace app3
{
    class Program
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            int i;
            AutoPilot ap = new AutoPilot();

            ap.selectXPath("/CATALOG/CD[PRICE < 10]");
            if (vg.parseFile("catalog.xml", false))
            {
                VTDNav vn = vg.getNav();
                ap.bind(vn);
                //XPath eval returns one node at a time
                while ((i = ap.evalXPath()) != -1)
                {// get to the first child
                    if (vn.toElement(VTDNav.FIRST_CHILD, "TITLE"))
                    {
                        int j = vn.getText();

                        if (j != -1)

                        Console.WriteLine(" text node ==>" + vn.toString(j));
                        vn.toElement(VTDNav.PARENT); // move the cursor back 
                    }
                }

                ap.resetXPath();
            }
        }
    }
}

The second example uses push() and pop(), which is slightly less efficient.

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace app4
{
    class Program
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            int i;
            AutoPilot ap = new AutoPilot();
            ap.selectXPath("/CATALOG/CD[PRICE < 10]");
            if (vg.parseFile("catalog.xml", false))
            {
                VTDNav vn = vg.getNav();
                ap.bind(vn);
                //XPath eval returns one node at a time
                while ((i = ap.evalXPath()) != -1)
                {
                    vn.push();
                    // get to the first child
                    if (vn.toElement(VTDNav.FIRST_CHILD, "TITLE"))
                    {
                        int j = vn.getText();
                        if (j != -1)
                            Console.WriteLine(" text node ==>" + vn.toString(j));
                    }
                    vn.pop();
                }
                ap.resetXPath();
            }
        }
    }
}

Nested XPath support

If the navigation you want to perform is more complicated, you can in fact nest XPath queries. The above code re-written in XPath is shown below. The rule that ensures the consistency of the cursor position still applies.

using System;
using System.Collections.Generic;
using System.Text;
using com.ximpleware;
namespace app6
{
    class Program
    {
        static void Main(string[] args)
        {
            VTDGen vg = new VTDGen();
            int i;
            AutoPilot ap = new AutoPilot();
            ap.selectXPath("/CATALOG/CD[PRICE < 10]");
            AutoPilot ap2 = new AutoPilot();
            ap2.selectXPath("TITLE/text()");
            if (vg.parseFile("catalog.xml", false))
            {
                VTDNav vn = vg.getNav();
                ap.bind(vn);
                ap2.bind(vn);
                //XPath eval returns one node at a time
                while ((i = ap.evalXPath()) != -1)
                {
                    vn.push();
                    // get to the first child 
                    int j;
                    while ((j = ap2.evalXPath()) != -1)
                    {
                        Console.WriteLine(" text node ==>" + vn.toString(j));
                    }
                    // doesn't forget this since the next iteration will reuse
                    // ap2's XPath!!! 
                    ap2.resetXPath();
                    vn.pop();
                }
                ap.resetXPath();
            }
        }
    }
}

Ad-Hoc tree walker mode

Of course, there is nothing preventing you from calling evalXPath() just to move the cursor to the desired destination(s) in the XML document.

import com.ximpleware.*; 
public class vtdXPath5{
 public static void main(String args[]) throws Exception{
  VTDGen vg =new VTDGen();
  int i;
  AutoPilot ap = new AutoPilot(); 
  ap.selectXPath("/CATALOG/CD[PRICE < 10]"); 

  if (vg.parseFile("catalog.xml", false)){
   VTDNav vn = vg.getNav();
   ap.bind(vn);
    //XPath eval returns one node at a time
    if (ap.evalXPath()>=0){ 
      // do what ever you like here with vn 
    }    
  }
 }
}

XPath Evaluation Performance BenchMarks

This section will show you the VTD-XML's XPath evaluation performance and compare it with Xalan and JAXEN, both which works with Xerces DOM. The benchmark programs precompile a set of XPath expressions and then repetitively execute the queries over the same set of documents (three documents of similar structure, but differing in size, are chosen). The tables and charts below report average latencies of query executions. For a complete description and discussion of the test set up, methodology and code, please visit Latency Number

/*/*/*[position() mod 2 = 0]

  Jaxen (ms) JDK (ms) VTD-XML (ms)
po_small.xml 0.0395 1.123 0.00994
po_medium.xml 0.524 6.552 0.115
po_big.xml 8.812 94.535 1.292

/purchaseOrder/items/item[USPrice <100]

Jaxen (ms) JDK (ms) VTD-XML (ms)
po_small.xml 0.0857 1.204 0.048
po_medium.xml 1.498 9.381 0.854
po_big.xml 22.262 113.831 9.238

/*/*/*/quantity/text()

Jaxen (ms) JDK (ms) VTD-XML (ms)
po_small.xml 0.107 1.0756 0.0317
po_medium.xml 1.922 7.0268 0.537
po_big.xml 32.181 106.7 6.339

//item/comment

Jaxen (ms) JDK (ms) VTD-XML (ms)
po_small.xml 0.356 1.208 0.089
po_medium.xml 9.824 9.922 1.505
po_big.xml 197.319 138.45 17.825

//item/comment/../quantity

Jaxen (ms) JDK (ms) VTD-XML (ms)
po_small.xml 0.374 1.274 0.105
po_medium.xml 11.147 10.298 1.787
po_big.xml 204.714 140.06 21.06

/*/*/*/* [position() mod 2 = 0]

Jaxen (ms) JDK(ms) VTD-XML(ms)
po_small.xml 0.136 1.21 0.0353
po_medium.xml 2.7 8.216 0.604
po_big.xml 65.469 122.83 6.559

/purchaseOrder/items/item[USPrice > 100]/comment/text()

Jaxen (ms) JDK (ms) VTD-XML (ms)
po_small.xml 0.113 1.262 .06
po_medium.xml 2.082 9.154 1.061
po_big.xml 32.892 121.07 12.35

Graphical View

/*/*/*[position() mod 2 = 0]

/purchaseOrder/items/item[USPrice <100]

/*/*/*/quantity/text()

//item/comment

//item/comment/../quantity

/*/*/*/* [position() mod 2 = 0]

/purchaseOrder/items/item[USPrice > 100]/comment/text()

Conclusion

Hopefully this article has provided you with both an introduction to, and an overview of, VTD-XML's XPath implementation. When combined the other inherent benefits, VTD-XML empowers you to write clean, readable code and acheive peerless performance, efficiency and flexibility. Because of it, some of the most deeply entrenched perceptions and assumptions about XML programming are about to change. Consequently, the infrastructure built on top of XML, such as Web Services and SOA, is on the verge of positive changes as well. My next articles will show you how to use VTD-XML to significantly improve Java-XML databinding.

History

  • 7/29/08 Initial Article posted
  • 8/11/08 Added subsection covering EvalXPathToString(), EvalXPathToNumber(), EvalXPathToBoolean().
  • 6/9/10 Article edited

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here