Introduction
[Note -- The concept below and the supplied code also work for HTML.]
When processing XML documents, one usually has three choices:
- Use a stateless, event-driven technique such as SAX. It is fast and efficient, but very low level: only one step above raw parsing.
- Use a document object model (DOM) approach. This technique has the advantage of being standardized, but is slow and memory intensive. Surprisingly, parsing an XML document into a hierarchical data structure is not as useful as it seems and still leaves a huge amount of work to do anything useful.
- Use XSLT or XPath on top of the standard DOM. This technique is often considered the easiest to use, but it is the least efficient. In addition, many types of manipulation do not fit within these higher-level models, and dropping down to lower-level code is difficult.
We offer a fourth approach that runs at close to the efficiency of SAX and comes close to the ease of use of XPath, while still making it easy to drop down to a lower programming level. We call our solution "Flat DOM" for reasons that will soon become obvious.
Very simply, we use SAX to build a list of key/value pairs. The key is the complete path up to that point for that particular entry. The value is the text or the attribute value. We use the special character @ to represent attributes and #text to represent text segments. The start and end of each element appear as entries whose key ends in / and whose value is #START# or #END#. Sequential text segments are merged together for ease of processing. The example below illustrates what we mean.
If we take the XML document example below...
<?xml version="1.0" encoding="UTF-8"?>
<content xmlns="http:XMLSerialization">
  <object id="2">
    <class flag="3" id="0" name="NSArray" suid="-3789592578296478260">
      <field name="objects" type="java.lang.Object[]"/>
    </class>
    <array field="objects" id="4" ignoreEDB="1" length="3" type="java.lang.Object[]">
      <string id="5">The Chestry Oak</string>
      <string id="6">A Tree for Peter</string>
      <string id="7">The White Stag</string>
    </array>
  </object>
</content>
... and process it into a Flat DOM, we get the list of key/value pairs (represented as key: value)
/content/: #START#
/content/@xmlns: http:XMLSerialization
/content/object/: #START#
/content/object/@id: 2
/content/object/class/: #START#
/content/object/class/@flag: 3
/content/object/class/@id: 0
/content/object/class/@name: NSArray
/content/object/class/@suid: -3789592578296478260
/content/object/class/field/: #START#
/content/object/class/field/@name: objects
/content/object/class/field/@type: java.lang.Object[]
/content/object/class/field/: #END#
/content/object/class/: #END#
/content/object/array/: #START#
/content/object/array/@field: objects
/content/object/array/@id: 4
/content/object/array/@ignoreEDB: 1
/content/object/array/@length: 3
/content/object/array/@type: java.lang.Object[]
/content/object/array/string/: #START#
/content/object/array/string/@id: 5
/content/object/array/string/#text: The Chestry Oak
/content/object/array/string/: #END#
/content/object/array/string/: #START#
/content/object/array/string/@id: 6
/content/object/array/string/#text: A Tree for Peter
/content/object/array/string/: #END#
/content/object/array/string/: #START#
/content/object/array/string/@id: 7
/content/object/array/string/#text: The White Stag
/content/object/array/string/: #END#
/content/object/array/: #END#
/content/object/: #END#
/content/: #END#
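A listing like the one above could be produced by a small SAX handler. The sketch below is only an illustration under our own assumptions, not the supplied implementation: the class name FlatDomSketch and the parse helper are hypothetical, entries are held as plain {path, value} string pairs, and whitespace-only text is dropped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a Flat DOM builder; not the article's XMLVectorImp.
class FlatDomSketch extends DefaultHandler {
    // Each entry is a {path, value} pair, kept in document order.
    final List<String[]> entries = new ArrayList<>();
    private final StringBuilder path = new StringBuilder("/");
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        flushText();
        path.append(qName).append('/');
        entries.add(new String[] {path.toString(), "#START#"});
        for (int i = 0; i < atts.getLength(); i++) {
            entries.add(new String[] {path + "@" + atts.getQName(i), atts.getValue(i)});
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one run of text in several chunks, so buffer and merge them.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        flushText();
        entries.add(new String[] {path.toString(), "#END#"});
        // Remove the trailing "qName/" to return to the parent path.
        path.setLength(path.length() - qName.length() - 1);
    }

    // Emits the buffered text (if any) as a #text entry under the current path.
    private void flushText() {
        String s = text.toString().trim();
        if (!s.isEmpty()) {
            entries.add(new String[] {path + "#text", s});
        }
        text.setLength(0);
    }

    // Assumed convenience helper: parses an XML string with the default JAXP SAX parser.
    static List<String[]> parse(String xml) {
        try {
            FlatDomSketch handler = new FlatDomSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.entries;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Calling FlatDomSketch.parse(xml) returns the entries in document order, so each pair can be printed as path + ": " + value. The handler keeps only the current path and a text buffer as state, which is why the approach should stay close to plain SAX in cost.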
Iterating through this list and extracting the data we need is now straightforward. We can use regular expressions or more complex matching criteria, as needed.
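As a rough illustration of that kind of iteration, the sketch below pulls the book titles out of a flat list with one regular expression. It works on plain {path, value} string pairs rather than on the library itself; the class name FlatDomQuery and the method valuesMatching are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper operating on {path, value} pairs; not part of the supplied code.
class FlatDomQuery {
    // Returns the values of all entries whose full path matches the pattern.
    static List<String> valuesMatching(String[][] entries, Pattern pathRe) {
        List<String> out = new ArrayList<>();
        for (String[] entry : entries) {
            if (pathRe.matcher(entry[0]).matches()) {
                out.add(entry[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A few entries copied from the listing above.
        String[][] entries = {
            {"/content/object/array/string/@id", "5"},
            {"/content/object/array/string/#text", "The Chestry Oak"},
            {"/content/object/array/string/@id", "6"},
            {"/content/object/array/string/#text", "A Tree for Peter"},
            {"/content/object/array/string/@id", "7"},
            {"/content/object/array/string/#text", "The White Stag"},
        };
        // Prints [The Chestry Oak, A Tree for Peter, The White Stag]
        System.out.println(valuesMatching(entries, Pattern.compile(".*/string/#text")));
    }
}
```

Because every key carries its complete path, a single pattern is enough to select entries at any depth. No tree traversal is needed.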
We have chosen to take an interface and implementation approach for the code representing the structure. The reason for this is to allow for alternate representations of the list structure underneath. For example, there is a lot of repetition in the key part. We have written an implementation that stores the key parts separately in order to compress the representation in memory. We have also written an implementation for handling HTML. One could also write an implementation that stores and accesses the data from a file or a database. The interface is:
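import java.util.regex.Pattern;

public interface XMLVector {
    // Number of key/value entries in the list.
    public int length();
    // Key (full path) and value of the entry at index i.
    public String getPath(int i);
    public String getValue(int i);
    // Positions of the entries whose key matches.
    public int[] getPosition(String key);
    public int[] getPosition(Pattern keyRe);
    // Value lookup by key instead of by index.
    public String getValue(String key);
    public String getValue(Pattern keyRe);
}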
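To give an idea of what an alternate representation behind this interface might look like, here is a minimal in-memory sketch that stores each distinct path only once. It is a simplification of the compression idea mentioned above (whole paths are shared, not individual path segments). The class name InternedXMLVector and the add and distinctPaths methods are hypothetical. The lookup semantics are also assumptions: a String key is matched exactly, a Pattern must match the whole key, and getValue by key returns the first match or null.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.IntStream;

// The interface from the article, repeated so the sketch is self-contained.
interface XMLVector {
    public int length();
    public String getPath(int i);
    public String getValue(int i);
    public int[] getPosition(String key);
    public int[] getPosition(Pattern keyRe);
    public String getValue(String key);
    public String getValue(Pattern keyRe);
}

// Hypothetical implementation; not the article's XMLVectorImp.
class InternedXMLVector implements XMLVector {
    private final List<String> paths = new ArrayList<>();          // distinct paths, stored once
    private final Map<String, Integer> pathIds = new HashMap<>();  // path -> index into paths
    private final List<Integer> entryPath = new ArrayList<>();     // entry -> index into paths
    private final List<String> values = new ArrayList<>();         // entry -> value

    // Appends one entry, reusing the stored path if it has been seen before.
    void add(String path, String value) {
        Integer id = pathIds.get(path);
        if (id == null) {
            id = paths.size();
            paths.add(path);
            pathIds.put(path, id);
        }
        entryPath.add(id);
        values.add(value);
    }

    int distinctPaths() { return paths.size(); }

    public int length() { return values.size(); }
    public String getPath(int i) { return paths.get(entryPath.get(i)); }
    public String getValue(int i) { return values.get(i); }

    // Assumption: a String key must equal the path exactly.
    public int[] getPosition(String key) {
        return getPosition(Pattern.compile(Pattern.quote(key)));
    }

    // Assumption: the pattern must match the whole path.
    public int[] getPosition(Pattern keyRe) {
        // Test each distinct path once, then collect the entries that use a matching path.
        boolean[] hit = new boolean[paths.size()];
        for (int p = 0; p < hit.length; p++) {
            hit[p] = keyRe.matcher(paths.get(p)).matches();
        }
        return IntStream.range(0, length()).filter(i -> hit[entryPath.get(i)]).toArray();
    }

    // Assumption: value lookups by key return the first match, or null if there is none.
    public String getValue(String key) { return first(getPosition(key)); }
    public String getValue(Pattern keyRe) { return first(getPosition(keyRe)); }

    private String first(int[] pos) {
        return pos.length == 0 ? null : values.get(pos[0]);
    }
}
```

A side benefit of sharing paths is that a regular expression only has to be evaluated once per distinct path, not once per entry. For documents with many repeated elements, such as the three string entries in the example, that saves both memory and matching work.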
And below is an example of using the library to extract all of the attributes from an XML file:
import java.io.File;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) throws Throwable {
        File f = new File("sample1.xml");
        XMLVector vec = new XMLVectorImp(f);
        // Print the vector itself (via its toString()).
        System.out.println(vec);
        // Attribute keys are the ones containing '@'.
        int[] pos = vec.getPosition(Pattern.compile(".*@.*"));
        for (int i = 0; i < pos.length; ++i) {
            System.out.println(vec.getPath(pos[i]) + ": " + vec.getValue(pos[i]));
        }
    }
}
We hope you find this as useful and easy to use as we do. The code is written in Java, but the concept is simple enough that translating it to C# or other languages should not be difficult.
History
- 24th October, 2005: Initial post