What is XML Optimization
It is a set of techniques for auditing the design metadata of any XML stream. Its purpose is to help XML producers minimize the side effects of using XML, such as size overhead or versioning lockdown.
For instance, an increase in size means more network bandwidth to send or retrieve the same XML content, more memory to store the XML locally, and more time for the XML parser to process the stream.
XML optimization provides a report showing relevant figures to play with (see screen capture above). With this report in hand, XML producers may choose to use dedicated XML automation tools to transform their XML streams according to defined rules, or may find it more appropriate to redesign the whole XML metadata.
The figures calculated and displayed in the report were chosen because they are meaningful for almost any kind of XML stream, i.e., each can point to a substantial change in size or design. I tested over 50 XML files before settling on these figures.
XML optimization is new territory. Before writing this article, I browsed public internet sites, newsgroups and even quite a few research papers, and I did not find a single topic addressing it. Amazingly enough, I believe this is not only of interest in the real world - when you know that every company out there in the high-tech industry now uses XML somehow - it is as crucial as database tuning tools or network tuners. Why isn't this part of leading XML tools (Xmlspy, Xmetal, Msxml, .NET API)? I don't know; maybe developers are content enough with their use of XML without really seeing the impact of using XML instead of binary file formats and standard databases.
What is Not XML Optimization
XML optimization is not about compressing XML into a proprietary binary format. For that purpose, don't hesitate to check out XMill (AT&T) and XMLPPM (SourceForge). Their intent is to produce a binary format from XML by shrinking XML patterns, and XML indeed compresses well for either of these reasons:
- element and attribute names appear many times, thus can be replaced with short tokens
- lists of values may contain a lot of duplicated data, by analogy with SQL join records
Binary XML may be fine for some applications, but the XML immediately stops being human readable. That is the reason why such tools are usually applied at the transport level, not at the application level. XML compression does not steal interest from XML optimization: compression is the last thing to use when no smarter code or design principle can help - brute force, in other words. XML optimization, on the other hand, reveals best practices and caveats, and is thus bound to help XML producers learn about their own metadata.
A Real World Sample
Before going into details, I would like to point to an actual source XML stream, and the report obtained by applying the tool to it:
In the HTML report, don't hesitate to click on the ? question marks for further information.
The remainder of this article can be broken down into the following sections (reflecting the sections in the HTML report):
XML Optimization: Structure in General
The report begins with simple, general figures about the XML stream.
Though the meaning of nb lines, nb elements and nb comments is obvious, it is worth knowing the effect of a high comment ratio on an XML stream. XML producers usually add comments above, inside, or below the actual XML elements to explain the hierarchy and underlying design. What they often don't know is that in a lot of content management server (CMS) software, the XML is left as is and sent to clients without removing these unnecessary comments, resulting in data transport often being 10% larger than it would be without comments. In this case, XML producers are more than encouraged to slim down their XML code. NB: CDATA sections and nb processing instructions play a similar role to nb comments.
Nb namespaces used is interesting as it reflects whether elements, attributes, and even the data itself use a lot of prefixes, which in turn may significantly increase the size of the XML stream. For the report to be really useful, figures are often displayed both as absolute values and as percentages.
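To give an idea of how such figures can be collected in a single pass, below is a minimal sketch - not the tool's actual code - that counts elements and comments with Expat (the parser the tool is built on) and reports the share of the stream spent on comments:

#include <stdio.h>
#include <string.h>
#include <expat.h>

struct GeneralStats
{
    long elements;      // nb elements
    long comments;      // nb comments
    long commentBytes;  // bytes spent inside <!-- ... -->
};

static void XMLCALL onStartElement(void *ud, const XML_Char *name,
                                   const XML_Char **atts)
{
    ((GeneralStats *)ud)->elements++;
}

static void XMLCALL onEndElement(void *ud, const XML_Char *name)
{
}

static void XMLCALL onComment(void *ud, const XML_Char *data)
{
    GeneralStats *s = (GeneralStats *)ud;
    s->comments++;
    s->commentBytes += (long)strlen(data) + 7;  // + "<!--" and "-->"
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f)
        return 1;

    GeneralStats stats = { 0, 0, 0 };
    XML_Parser p = XML_ParserCreate(NULL);
    XML_SetUserData(p, &stats);
    XML_SetElementHandler(p, onStartElement, onEndElement);
    XML_SetCommentHandler(p, onComment);

    char buf[8192];
    long total = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
    {
        total += (long)n;
        if (!XML_Parse(p, buf, (int)n, 0))
            break;  // malformed XML, stop here
    }
    XML_Parse(p, NULL, 0, 1);  // signal end of stream
    fclose(f);
    XML_ParserFree(p);

    printf("nb elements: %ld\nnb comments: %ld\n",
           stats.elements, stats.comments);
    if (total > 0)
        printf("comment overhead: %.1f%%\n",
               100.0 * stats.commentBytes / total);
    return 0;
}

The later sketches in this article reuse the same skeleton and only swap the handlers.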
XML Optimization: Structure in Details
Fasten your seatbelt, there are many topics here.
Structure Pattern
This reverse engineers the XML stream hierarchy just by processing the stream (it never reads the DTD, if any), giving both parent/children relationships and datatypes when they are recognized (including floats, integers, currencies, dates, URLs, and emails).
What for? Reverse engineering the structure pattern is not only a unique feature, it also reveals whether the XML is designed "vertically" (lots of elements), "horizontally" (lots of attributes), or somewhere in between. The structure pattern is displayed before the next topics because it makes the design easier to figure out.
Flattening the Structure Pattern
Distinct patterns tells whether there is more than one main pattern in the XML stream. Pattern occurrences, Pattern height (number of lines) and Pattern size (in bytes) show the key characteristics of the main structure pattern. These figures are worth mentioning by themselves, but they are also preliminary to the next one.
Now, what is flattening a pattern? It is what you obtain by replacing child elements with attributes, where possible. Below is a sample before and after flattening:
Original XML
<person>
<firstname>John</firstname>
<lastname>Lepers</lastname>
</person>
Modified XML
<person firstname="John" lastname="Lepers"/>
Flattening the patterns makes use of what the W3C XML norm calls empty-element tags, i.e., tags with no separate end tag, thus reducing the size by a significant amount. Flattening patterns has other interesting effects: 1. because the hierarchy is flat, parsing is faster; 2. it is much easier to diff XML streams with flattened patterns.
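As an illustration, here is a minimal sketch of the flattening rule, written against a toy in-memory tree (the Node structure below is hypothetical, not the tool's internal representation):

#include <map>
#include <string>
#include <vector>

// A toy in-memory element, for illustration only.
struct Node
{
    std::string name;
    std::string text;                          // direct text content
    std::map<std::string, std::string> attrs;
    std::vector<Node> children;
};

// Flattening rule: a child that carries only text (no attributes,
// no children of its own) becomes an attribute of its parent.
// Caveat: assumes child names are unique within a parent.
static void flatten(Node &n)
{
    std::vector<Node> kept;
    for (size_t i = 0; i < n.children.size(); i++)
    {
        Node &c = n.children[i];
        flatten(c);  // flatten bottom-up
        if (c.children.empty() && c.attrs.empty())
            n.attrs[c.name] = c.text;  // <firstname>John</firstname>
                                       // becomes firstname="John"
        else
            kept.push_back(c);
    }
    n.children.swap(kept);
}

Applied to the <person> sample above, firstname and lastname carry only text, so both are folded into attributes and <person> becomes an empty-element tag.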
Structure Depth
The depth we are talking about is the element depth in the hierarchy, i.e., "1" for the root element, "2" for its direct children, and so on. A measure usually comes with figures such as: the minimum value over the whole XML stream, the maximum value, the average value, and the standard deviation. A large standard deviation means that the XML stream makes intensive use of indentation, "<", ">" and end tags, which in turn increases the size.
To better reveal the depth, we also list the number of elements at each depth.
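As a sketch of how these figures can fall out of a single pass (again hypothetical, not the tool's code), a pair of element handlers plugged into the earlier Expat skeleton is enough; zero-initialize the struct before parsing:

#include <map>
#include <math.h>
#include <expat.h>

struct DepthStats
{
    int depth;                      // current nesting level
    int minDepth, maxDepth;
    long count;                     // nb elements seen
    double sum, sumSq;              // for average and standard deviation
    std::map<int, long> histogram;  // nb elements at each depth
};

static void XMLCALL onStartDepth(void *ud, const XML_Char *name,
                                 const XML_Char **atts)
{
    DepthStats *s = (DepthStats *)ud;
    s->depth++;  // root element is depth 1
    if (s->count == 0 || s->depth < s->minDepth)
        s->minDepth = s->depth;
    if (s->depth > s->maxDepth)
        s->maxDepth = s->depth;
    s->count++;
    s->sum += s->depth;
    s->sumSq += (double)s->depth * s->depth;
    s->histogram[s->depth]++;
}

static void XMLCALL onEndDepth(void *ud, const XML_Char *name)
{
    ((DepthStats *)ud)->depth--;
}

// After parsing: average = sum / count, and
// standard deviation = sqrt(sumSq / count - average * average).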
The depth measure is visually displayed using a bar chart (numerical figures in a list often hide the trend). For those interested, the chart is built with the following JavaScript code:
var tabheight = 120;
var tabdata = new Array(1, 15, 83, 159);
var tabtips = new Array("01", "02", "03", "04");
showChart_Max("<p class='m1'>Depth histogram chart</p>", tabheight,
    "#4488DD", tabdata, tabtips);

// Renders a vertical bar chart as a plain HTML table.
function showChart_Max(title, height, color, data, datatips)
{
    if (data.length == 0 || data.length != datatips.length)
        return;

    // Compute min, max and average of the data series.
    var max = data[0];
    var min = data[0];
    var sum = 0;
    for (var i = 0; i < data.length; i++)
    {
        var c = data[i];
        if (max < c)
            max = c;
        if (min > c)
            min = c;
        sum += c;
    }
    var average = sum / data.length;
    average = Math.floor(100 * average) / 100;  // keep two decimals

    document.writeln("<table height='" + height + "' cellpadding='0' " +
        "cellspacing='0' border='0'>");
    document.writeln("<tr><td valign='middle'><font size='-1'>max=" +
        max + "</font></td>");

    // One inner two-row table per bar: an empty cell on top, then a
    // colored cell whose height is proportional to the value.
    for (var i = 0; i < data.length; i++)
    {
        var dataportion = height * data[i] / max;
        var voidportion = height - dataportion;
        document.writeln("<td height='129' width='15' rowspan='5'>&nbsp;</td>");
        document.writeln("<td width='15' rowspan='5'>");
        document.writeln("  <table width='100%' cellpadding='0' " +
            "cellspacing='0' border='0'>");
        document.writeln("  <tr><td height='" + voidportion + "'></td></tr>");
        document.writeln("  <tr><td height='" + dataportion + "' " +
            "bgcolor='" + color + "'></td></tr></table>");
        document.writeln("</td>");
    }
    document.writeln("</tr>");
    document.writeln("<tr><td>&nbsp;</td></tr>");
    document.writeln("<tr><td><font size='-1'>avg=" + average +
        "</font></td></tr>");
    document.writeln("<tr><td>&nbsp;</td></tr>");
    document.writeln("<tr><td><font size='-1'>min=" + min + "</font></td></tr>");

    // Bottom row: one tip label under each bar.
    document.writeln("<tr><td valign='middle'></td>");
    for (var i = 0; i < data.length; i++)
    {
        document.writeln("<td width='15'>&nbsp;</td>");
        document.writeln("<td width='15'><font size='-1'>" +
            datatips[i] + "</font></td>");
    }
    document.writeln("</tr>");
    if (title != "")
        document.writeln("<caption align='bottom'>" + title + "</caption>");
    document.writeln("</table><br><br>");
}
Structure Node Naming Strategy
Element and attribute names are usually chosen to be self-descriptive. While this looks like an advantage, it has an overhead on size: even in English, the keywords enclosing the content statistically take significant space, contributing a great deal to the overall stream size. This can be avoided by enforcing the naming strategy described below. An element or attribute name is essentially a combination of letters and digits, so why not make these names as short as possible? Let us take an example:
="1.0"="ISO-8859-1"
<!DOCTYPE Bookstore SYSTEM "bookshop.dtd">
<Bookstore>
<Book Genre="Thriller" In_Stock="Yes">
<Title>The Round Door</Title>
</Book>
</Bookstore>
Let's build a map of name pairs:
Bookstore becomes A
Book becomes B
Genre becomes C
In_Stock becomes D
Title becomes E
So we get the following equivalent XML document:
="1.0"="ISO-8859-1"
<!DOCTYPE A SYSTEM "bookshop_A.dtd">
<A>
<B C="Thriller" D="Yes">
<E>The Round Door</E>
</B>
</A>
As with depth, the node naming strategy is also visually reflected in a bar chart, so the trend is easy to see.
The gain resulting from applying this node naming strategy to the XML stream is calculated. It is often 30% or more, which is very significant.
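A minimal sketch of building such a name map follows (hypothetical, not the tool's code): aliases are assigned in order of first appearance, in bijective base 26, so the sequence runs A, B, ..., Z, AA, AB, ...

#include <map>
#include <string>

// Returns the short alias for a name, assigning A, B, ..., Z, AA, AB, ...
// in order of first appearance.
static std::string shortName(const std::string &name,
                             std::map<std::string, std::string> &aliases)
{
    std::map<std::string, std::string>::iterator it = aliases.find(name);
    if (it != aliases.end())
        return it->second;  // already mapped

    // Encode the next index in bijective base 26: 0 -> A, 25 -> Z, 26 -> AA.
    long n = (long)aliases.size();
    std::string alias;
    do
    {
        alias.insert(alias.begin(), (char)('A' + n % 26));
        n = n / 26 - 1;
    } while (n >= 0);

    aliases[name] = alias;
    return alias;
}

The gain is then simply the sum, over all occurrences, of the difference between the original name length and the alias length.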
Structure Attributes
The Structure attributes indicator reveals how uniformly attributes are distributed among elements. Besides the standard number of attributes per element (with min, max, mean and standard deviation), there is the disorder ratio. The disorder ratio attempts to show whether attributes are listed in the same order across element occurrences. It is of course an average, because each element may have any number of associated attributes. According to the W3C XML norm, there is no special ordering between attributes; it is simply a good habit to have attributes always follow the same order.
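The report's exact formula is not detailed here, but one plausible way to measure disorder (a sketch with a hypothetical definition) is to record the attribute order of the first occurrence of each element, then count later occurrences that deviate from it:

#include <map>
#include <string>
#include <vector>
#include <expat.h>

struct OrderStats
{
    // Attribute order seen on the first occurrence of each element.
    std::map<std::string, std::vector<std::string> > firstOrder;
    long occurrences;  // later occurrences carrying attributes
    long disordered;   // of those, how many deviate from the reference
};

static void XMLCALL onStartOrder(void *ud, const XML_Char *name,
                                 const XML_Char **atts)
{
    if (!atts[0])
        return;  // no attributes on this element
    OrderStats *s = (OrderStats *)ud;

    std::vector<std::string> order;
    for (int i = 0; atts[i]; i += 2)  // atts holds name/value pairs
        order.push_back(atts[i]);

    std::map<std::string, std::vector<std::string> >::iterator it =
        s->firstOrder.find(name);
    if (it == s->firstOrder.end())
        s->firstOrder[name] = order;  // first occurrence is the reference
    else
    {
        s->occurrences++;
        if (it->second != order)  // different set or different order
            s->disordered++;
    }
}

// disorder ratio = disordered / occurrences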
Structure Namespaces
XML namespaces are declared using a special attribute of the form xmlns:supplier="http://www.namespaces.com/supplier" and refer to a set of element and attribute names with a dedicated semantic meaning. Elements and attributes with namespaces are prefixed by the namespace, for instance supplier:orderID. Namespaces are not required in XML streams, but they carry special meanings and may simplify data binding, as long as the namespaces' real meanings are made public and available to everyone. Any number of namespaces can be used, not just one. A namespace must always be declared before it is used; the URL in the declaration is a fake URL whose only purpose is global uniqueness. Below is a sample for the supplier namespace:
="1.0"="ISO-8859-1"
<Orders xmlns:supplier="http://www.namespaces.com/supplier">
<Order date="AA/45/10" supplier:id="UIYBAB47KDIU75">
<Id>NBZYSJSGSIAUSYGHBXNBJDUIUYE</Id>
</Order>
</Orders>
When namespaces are used, the report shows the ratio of namespace use and the list of namespaces.
Whether or not namespaces are used not only strongly changes the underlying XML design, it also affects the node naming strategy, and in turn the overall size of the XML stream.
Content Itself
Even though the content itself is not part of the XML metadata, there are many ways for it to produce size overhead. The simplest, of course, is dumping data in XML format from a relational database system without factorizing duplicate values. It is easy to see that there is a lot of gain to be had here.
Raw Content
Content sizes in element and attribute values exhibit a trend which can be described using the minimum size, maximum, average, and standard deviation.
In addition, the ratio of elements and attributes with no values is shown. If the ratio is high, it is fair to question whether the metadata design is good.
A somewhat odd indicator is the Ratio of multiple part values. Below are two samples of multiple part values for the <book> element:
<book>
The name of this book is so inadequate for a general audience
that it has been decided not to print it.
</book>
...
<book>The Round Door
<year>1999</year>
<price>20$</price>Part II
</book>
Content Correlation
Content correlation is an in-depth examination of lists of values that reveals valuable things. The first indicator is related to duplication, or how often the same values appear again and again; it includes max, average and standard deviation. The second indicator is a ranking: it shows the most frequently seen value across all lists of values.
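A sketch of how duplication can be tallied (hypothetical, not the tool's code) is a simple value-to-count map fed by the character data handler of the earlier Expat skeleton:

#include <map>
#include <string>
#include <expat.h>

// Tallies every text value seen. Caveat: Expat may split one text
// node into several calls; a real implementation would buffer them.
struct ValueStats
{
    std::map<std::string, long> counts;
};

static void XMLCALL onTextValue(void *ud, const XML_Char *s, int len)
{
    std::string value(s, len);
    ((ValueStats *)ud)->counts[value]++;
}

// After parsing, the most frequent value gives the ranking figure.
static std::string topValue(const ValueStats &v, long &count)
{
    std::string best;
    count = 0;
    for (std::map<std::string, long>::const_iterator it = v.counts.begin();
         it != v.counts.end(); ++it)
    {
        if (it->second > count)
        {
            count = it->second;
            best = it->first;
        }
    }
    return best;
}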
Content Spacing and Indentation
Indentation is often used in XML streams, as they are often designed and read by humans. But indentation produces a significant increase in size. The report shows the new size of the XML stream with all indentation removed; the difference is often 30%.
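As a sketch of the transformation itself (hypothetical, and deliberately simplified: attribute values are not re-escaped, and CDATA sections and comments are dropped), the Expat handlers below re-emit the stream while discarding whitespace-only text nodes:

#include <stdio.h>
#include <ctype.h>
#include <expat.h>

static void XMLCALL startTag(void *ud, const XML_Char *name,
                             const XML_Char **atts)
{
    printf("<%s", name);
    for (int i = 0; atts[i]; i += 2)
        printf(" %s=\"%s\"", atts[i], atts[i + 1]);
    printf(">");
}

static void XMLCALL endTag(void *ud, const XML_Char *name)
{
    printf("</%s>", name);
}

static void XMLCALL textNode(void *ud, const XML_Char *s, int len)
{
    // Keep the text as soon as one non-whitespace character is found.
    for (int i = 0; i < len; i++)
    {
        if (!isspace((unsigned char)s[i]))
        {
            fwrite(s, 1, len, stdout);
            return;
        }
    }
    // All whitespace: this is indentation, drop it.
}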
Summary of Important Measures
Out of the many figures in the HTML report, several deserve an introductory explanation:
- Flattening patterns: the design rule of replacing cardinality-1 elements with attributes. It sounds awful, but a lot of space is gained here.
- Indentation and multiple spaces: beautifying your XML stream is fine, as long as you are dealing with tiny streams. Indented XML streams are, simply put, twice as large. Keep this in mind if your server-side component does not scale and you are wrecking the entire network bandwidth.
- Disorder ratios: the kind of measures that by themselves help improve the schema design, and along the way may reduce XML bug fixing.
- Correlation in content: statistically speaking, an XML stream carries a lot of size overhead simply because content is duplicated rather than factorized.
How to Use the Tool
Syntax:
single file : betterxml <your file>
betterxml bookshop.xml
betterxml c:\mydir\bookshop.xml
betterxml http://www.mysite.com/xml/bookshop.xml
whole directory : betterxml -d <your directory>
betterxml -d c:\tmp\repository
Technical Details
Technically, the tool is based on James Clark's Expat (royalty-free SAX parser). The executable, a report generator on top of a static library, can be divided into three parts:
- betterxml.dsp (betterxml.exe), a report generator. It mostly contains the HTMLWriter class, which is straightforward and reuses HTML templates stored in the .rc resource file. All strings are localized and ready for a foreign release, if anyone is interested. The HTML reports have a built-in chart library (limited to bar charts) that displays charts using JavaScript.
- SaxAnalyzer.dsp (SaxAnalyzer.lib), an XML extraction library, with the following set of classes:
- IXmlStats: API to expose measures. Inherits IUnknown.
- AppLogic.cpp: callbacks from the XML parser, calculations of all measures
- Element.cpp: element + attribute API
- HtmlParser.cpp: general purpose HTML parser, used to extract details that expat does not see
- XmlFileManager.cpp: manages XML stream reading, including async monikers for URL-based XML streams
- xmlparser.dsp (xmlparser.lib), the expat library itself. Both VC6 and VC7 workspaces are provided.
History
- 12th September, 2002: Initial version
License
This article has no explicit license attached to it, but may contain usage terms in the article text or the download files themselves. If in doubt, please contact the author via the discussion board below.