Flat DOM

24 Oct 2005 | CPOL | 2 min read
A simpler way to process XML

Introduction

[Note -- The concept below and the supplied code also work for HTML.]

When processing XML documents, one usually has three choices:

  1. Use a stateless, event-driven technique such as SAX. It is fast and efficient but very low level, sitting just above raw parsing.
  2. Use a document object model (DOM) approach. This technique has the advantage of being standardized, but it is slow and memory intensive. Surprisingly, parsing an XML document into a hierarchical data structure is less useful than it seems: a lot of work remains before the data can be put to use.
  3. Use XSLT or XPath on top of the standard DOM. This technique is often considered the easiest to use, but it is the least efficient. In addition, many kinds of manipulation do not fit within these higher-level models, and dropping down to lower-level code is difficult.

We offer a fourth approach that runs at close to the efficiency of SAX and comes close to the ease of use of XPath, while still letting us drop down to a lower programming level with ease. We call our solution "Flat DOM" for reasons that will soon become obvious.

Very simply, we use SAX to build a list of key/value pairs. The key is the complete path from the document root to that particular entry. The value is the text content or the attribute value. We use the special character @ to mark attributes and #text to mark text segments; element boundaries appear as entries whose key ends in / and whose value is #START# or #END#. Sequential text segments are merged together for ease of processing. The example below illustrates what we mean.

If we take the XML document example below...

XML
<?xml version="1.0" encoding="UTF-8"?>
<content xmlns="http:XMLSerialization">
    <object id="2">
        <class flag="3" id="0" name="NSArray" suid="-3789592578296478260">
            <field name="objects" type="java.lang.Object[]"/>
        </class>
        <array field="objects" id="4" ignoreEDB="1" length="3" type="java.lang.Object[]">
            <string id="5">The Chestry Oak</string>
            <string id="6">A Tree for Peter</string>
            <string id="7">The White Stag</string>
        </array>
    </object>
</content>

... and process it into a Flat DOM, we get the following list of key/value pairs (shown as key: value):

/content/: #START#
/content/@xmlns: http:XMLSerialization
/content/object/: #START#
/content/object/@id: 2
/content/object/class/: #START#
/content/object/class/@flag: 3
/content/object/class/@id: 0
/content/object/class/@name: NSArray
/content/object/class/@suid: -3789592578296478260
/content/object/class/field/: #START#
/content/object/class/field/@name: objects
/content/object/class/field/@type: java.lang.Object[]
/content/object/class/field/: #END#
/content/object/class/: #END#
/content/object/array/: #START#
/content/object/array/@field: objects
/content/object/array/@id: 4
/content/object/array/@ignoreEDB: 1
/content/object/array/@length: 3
/content/object/array/@type: java.lang.Object[]
/content/object/array/string/: #START#
/content/object/array/string/@id: 5
/content/object/array/string/#text: The Chestry Oak
/content/object/array/string/: #END#
/content/object/array/string/: #START#
/content/object/array/string/@id: 6
/content/object/array/string/#text: A Tree for Peter
/content/object/array/string/: #END#
/content/object/array/string/: #START#
/content/object/array/string/@id: 7
/content/object/array/string/#text: The White Stag
/content/object/array/string/: #END#
/content/object/array/: #END#
/content/object/: #END#
/content/: #END#
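
One way to produce a listing like the one above is a small SAX handler that keeps track of the current path. The following is a minimal sketch, not the article's XMLVectorImp: the class name FlatDomSketch, the flatten helper, and the decision to trim text and drop whitespace-only segments are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: builds {path, value} pairs from SAX events.
final class FlatDomSketch extends DefaultHandler {
    private final List<String[]> entries = new ArrayList<>();
    private final StringBuilder path = new StringBuilder("/");
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        path.append(qName).append('/');
        entries.add(new String[] {path.toString(), "#START#"});
        for (int i = 0; i < atts.getLength(); i++) {
            entries.add(new String[] {path + "@" + atts.getQName(i), atts.getValue(i)});
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        entries.add(new String[] {path.toString(), "#END#"});
        // Remove the trailing "qName/" segment to return to the parent path.
        path.setLength(path.length() - qName.length() - 1);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text run in several chunks; collect them here.
        text.append(ch, start, length);
    }

    // Emits the merged text collected so far (assumption: whitespace-only text is dropped).
    private void flushText() {
        String merged = text.toString().trim();
        text.setLength(0);
        if (!merged.isEmpty()) {
            entries.add(new String[] {path + "#text", merged});
        }
    }

    static List<String[]> flatten(String xml) throws Exception {
        FlatDomSketch handler = new FlatDomSketch();
        // The default factory is not namespace-aware, so xmlns shows up as a plain attribute.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.entries;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<content><string id=\"5\">The Chestry Oak</string></content>";
        for (String[] e : flatten(xml)) {
            System.out.println(e[0] + ": " + e[1]);
        }
    }
}
```

Running main prints the entries in the same key: value shape as the listing above. Buffering text in characters and flushing it at the next element boundary is what merges sequential text segments into a single #text entry.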

Iterating through this list and extracting the data we need is now relatively trivial. We can match keys with regular expressions or apply more complex matching criteria as needed.
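
As an illustration of a slightly more involved matching rule, the sketch below walks a flat list once and pairs each string element's @id with its #text. It works on a plain list of {path, value} pairs rather than on the article's library; the class name FlatDomWalk and the method titlesById are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: a single pass over {path, value} pairs with a little state.
final class FlatDomWalk {
    private static final Pattern ELEMENT = Pattern.compile(".*/string/");
    private static final Pattern ID = Pattern.compile(".*/string/@id");
    private static final Pattern TEXT = Pattern.compile(".*/string/#text");

    // Returns id -> text for every <string> element that has both, in document order.
    static Map<String, String> titlesById(List<String[]> entries) {
        Map<String, String> result = new LinkedHashMap<>();
        String currentId = null;
        for (String[] e : entries) {
            String key = e[0];
            String value = e[1];
            if (ELEMENT.matcher(key).matches() && value.equals("#START#")) {
                currentId = null;            // a new element begins: forget the previous id
            } else if (ID.matcher(key).matches()) {
                currentId = value;           // remember the id until its text arrives
            } else if (TEXT.matcher(key).matches() && currentId != null) {
                result.put(currentId, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> entries = Arrays.asList(new String[][] {
            {"/content/object/array/string/", "#START#"},
            {"/content/object/array/string/@id", "5"},
            {"/content/object/array/string/#text", "The Chestry Oak"},
            {"/content/object/array/string/", "#END#"},
            {"/content/object/array/string/", "#START#"},
            {"/content/object/array/string/@id", "6"},
            {"/content/object/array/string/#text", "A Tree for Peter"},
            {"/content/object/array/string/", "#END#"},
        });
        for (Map.Entry<String, String> e : titlesById(entries).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Because the list is in document order, a single variable is enough to carry context from one entry to the next, with no tree navigation involved.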

We have chosen to separate the interface from its implementation in the code that represents the structure, so that the underlying list can be represented in different ways. For example, there is a lot of repetition in the keys, so we have written an implementation that stores the key parts separately in order to compress the in-memory representation. We have also written an implementation for handling HTML. One could also write an implementation that stores and accesses the data in a file or a database. The interface is:

Java
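import java.util.regex.Pattern;

public interface XMLVector {
   // Number of key/value entries in the list
   int length();
   // Path (key) and value of the entry at index i
   String getPath(int i);
   String getValue(int i);
   // Indices of the entries whose key matches
   int[] getPosition(String key);
   int[] getPosition(Pattern keyRe);
   // Value lookup by key or by key pattern
   String getValue(String key);
   String getValue(Pattern keyRe);
}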
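
To make the contract more concrete, here is a minimal list-backed sketch of such an interface. It is not the article's XMLVectorImp. The interface is repeated so the block compiles on its own, and the class name ListXMLVector, the add method, and the choices that a String key must match exactly and that getValue returns the first match (or null when nothing matches) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.IntStream;

// Repeated from the article so this sketch is self-contained.
interface XMLVector {
    int length();
    String getPath(int i);
    String getValue(int i);
    int[] getPosition(String key);
    int[] getPosition(Pattern keyRe);
    String getValue(String key);
    String getValue(Pattern keyRe);
}

// Hypothetical implementation: two parallel lists, one for paths and one for values.
final class ListXMLVector implements XMLVector {
    private final List<String> paths = new ArrayList<>();
    private final List<String> values = new ArrayList<>();

    // Appends one entry; a parser would call this in document order.
    void add(String path, String value) {
        paths.add(path);
        values.add(value);
    }

    public int length() { return paths.size(); }
    public String getPath(int i) { return paths.get(i); }
    public String getValue(int i) { return values.get(i); }

    // Assumption: a String key must equal the path exactly.
    public int[] getPosition(String key) {
        return getPosition(Pattern.compile(Pattern.quote(key)));
    }

    // Indices of all entries whose whole path matches the pattern.
    public int[] getPosition(Pattern keyRe) {
        return IntStream.range(0, paths.size())
                .filter(i -> keyRe.matcher(paths.get(i)).matches())
                .toArray();
    }

    // Assumption: return the first matching value, or null if there is none.
    public String getValue(String key) {
        int[] pos = getPosition(key);
        return pos.length == 0 ? null : values.get(pos[0]);
    }

    public String getValue(Pattern keyRe) {
        int[] pos = getPosition(keyRe);
        return pos.length == 0 ? null : values.get(pos[0]);
    }

    public static void main(String[] args) {
        ListXMLVector vec = new ListXMLVector();
        vec.add("/content/", "#START#");
        vec.add("/content/object/", "#START#");
        vec.add("/content/object/@id", "2");
        vec.add("/content/object/", "#END#");
        vec.add("/content/", "#END#");
        System.out.println(vec.getValue("/content/object/@id"));
        System.out.println(vec.getPosition(Pattern.compile(".*@.*")).length);
    }
}
```

A more compact implementation, such as one that stores the key parts separately, could sit behind the same interface without changing the calling code.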

And below is an example of using the library to extract all of the attributes from an XML file:

Java
import java.io.File;
import java.util.regex.Pattern;

public class Main {
   public static void main(String[] args) throws Throwable {
      File f = new File("sample1.xml");
      XMLVector vec = new XMLVectorImp(f);
      // Print the vector via its toString()
      System.out.println(vec);
      // Extract all attributes: their keys are the ones containing '@'
      int[] pos = vec.getPosition(Pattern.compile(".*@.*"));
      for (int i = 0; i < pos.length; ++i) {
         System.out.println(vec.getPath(pos[i]) + ": " + vec.getValue(pos[i]));
      }
   }
}

Hopefully, you will find this as useful and easy to use as I do. The code is written in Java, but the concept is very simple, and it shouldn't be too difficult to translate it into C# or other languages.

History

  • 24th October, 2005: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer
Canada
Software developer for Simple Software.

Currently developing a Java client application that allows you to view and interact with real-time web metrics sent from your web server.

Comments and Discussions

 
Question: Iterators? (Davy Boy, 12-Dec-05 4:08)
Answer: Re: Iterators? (Ian Schumacher, 12-Dec-05 5:59)
General: Re: Iterators? (Davy Boy, 12-Dec-05 6:07)
General: Re: Iterators? (Ian Schumacher, 12-Dec-05 6:12)
General: Nice Idea (Doron Barak, 24-Oct-05 18:13)

During my past projects, I've found that I need to read and write XML documents, sometimes from within an Applet, which means that depending on SAX/DOM parsers would increase the Applet's footprint. Flattening XML with SAX or DOM doesn't work for every case: if you have to modify an XML file, you still need it in a tree structure so that you can write it out and keep the XML structure intact.

Other times, like when I was using XML files as scripts with my scripting engine, I had to parse and execute the instructions according to the XML elements because of features like recursion, branching and functions. Flattening these highly structured scripts from XML to key/value pairs would complicate things rather than simplify them.

So... I wrote my own XML parser. It's about 580 lines long, in just one class file, and it lets me read and write XML files. Of course, it is simplified and does not support certain W3C XML standard features. Its API is rather simple as well and does not closely follow either SAX or DOM. But it works great in my Applets and other applications, even the ones that used to depend on the DOM parsers. :-D


General: Re: Nice Idea (Ian Schumacher, 24-Oct-05 18:28)
