Some of the project files created by Cyotek Sitemap Creator and WebCopy are fairly large, and the load performance of such files is poor. The files are saved using an XmlWriter class, which is nice and fast. When reading the files back, however, the whole file is currently loaded into an XmlDocument and then XPath expressions are used to pull out the values. This article describes our effort at converting the load code to use an XmlReader instead.
Sample XML
The following XML snippet can be used as a base for testing the code in this article, if required.
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<cyotek.webcopy.project version="1.0.0.0"
generator="Cyotek WebCopy 1.0.0.2 (BETA))" edition="">
<uri lastCrawled="-8589156546443756722"
includeSubDomains="false">http://saturn/cyotekdev/</uri>
<additionalUri>
<uri>first url</uri>
<uri>second url</uri>
</additionalUri>
<authentication doNotAskForPasswords="false">
<credential uri="/" userName="username"
password="password" />
</authentication>
<saveFolder path="C:\Downloaded Web Sites"
emptyBeforeCrawl="true" createFolderForDomain="true"
flattenWebsiteDirectories="false" remapExtensions="true" />
<crawler removeFragments="true" followRedirects="true"
disableUriRemapping="false" slashedRootRemapMode="1"
sort="false" acceptDeflate="true" acceptGZip="true"
bufferSize="0" crawlAboveRoot="false" />
<defaultDocuments />
<linkInfo save="true" clearBeforeCrawl="true" />
<stripQueryString>false</stripQueryString>
<useHeaderChecking>true</useHeaderChecking>
<userAgent useDefault="true"></userAgent>
<rules>
<rule options="1" enabled="true">trackback\?id=</rule>
<rule options="1" enabled="false">/downloads/get</rule>
<rule options="1" enabled="false">/article</rule>
<rule options="1" enabled="false">/sitemap</rule>
<rule options="1" enabled="false">image/get/</rule>
<rule options="1" enabled="false">products</rule>
<rule options="1" enabled="false">zipviewer</rule>
</rules>
<domainAliases>
<alias>(?:http(?:s?):\/\/)?saturn/cyotekdev/</alias>
</domainAliases>
<forms>
<page name="" uri="login"
enabled="true" method="POST">
<parameters>
<parameter name="rememberMe">true</parameter>
<parameter name="username">username</parameter>
<parameter name="password">password</parameter>
</parameters>
</page>
</forms>
<linkMap>
<link id="b1b85626f9984279b5e033c30a0a3f65" uri=""
source="1" contentType="text/html" httpStatus="200"
lastDownloaded="-8589156550177150260"
hash="0333961593BD555C49ABF2355140225A07DA9297" fileName="index.htm">
<title>Cyotek</title>
<incomingLinks>
<link id="b1b85626f9984279b5e033c30a0a3f65" />
</incomingLinks>
<outgoingLinks>
<link id="96a358d21135449eb6561f25399e24de" />
</outgoingLinks>
<headers>
<header key="Content-Encoding" value="gzip" />
<header key="Vary" value="Accept-Encoding" />
<header key="X-AspNetMvc-Version" value="1.0" />
<header key="Content-Length" value="3415" />
<header key="Cache-Control" value="private" />
<header key="Content-Type" value="text/html; charset=utf-8" />
<header key="Date" value="Fri, 01 Oct 2010 16:51:07 GMT" />
<header key="Expires" value="Fri, 01 Oct 2010 16:51:07 GMT" />
<header key="ETag" value="" />
<header key="Server" value="Microsoft-IIS/7.5" />
<header key="X-Powered-By" value="UrlRewriter.NET 2.0.0" />
</headers>
</link>
</linkMap>
</cyotek.webcopy.project>
Writing XML using an XmlWriter
Before I start discussing how to load the data, here is a quick overview of how it is originally saved. For clarity, I'm only showing the bare bones of the method.
string workFile;
workFile = Path.GetTempFileName();
using (FileStream stream = File.Create(workFile))
{
XmlWriterSettings settings;
settings = new XmlWriterSettings { Indent = true, Encoding = Encoding.UTF8 };
using (XmlWriter writer = XmlWriter.Create(stream, settings))
{
writer.WriteStartDocument(true);
writer.WriteStartElement("uri");
if (this.LastCrawled.HasValue)
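// WriteAttributeString expects string values, so convert explicitly
writer.WriteAttributeString("lastCrawled", XmlConvert.ToString(this.LastCrawled.Value.ToBinary()));
writer.WriteAttributeString("includeSubDomains", XmlConvert.ToString(_includeSubDomains));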
writer.WriteValue(this.Uri);
writer.WriteEndElement();
writer.WriteEndDocument();
}
}
File.Copy(workFile, fileName, true);
File.Delete(workFile);
The above code creates a new temporary file and opens it as a FileStream. An XmlWriterSettings object is created to specify some options (by default the writer won't indent, making the output files difficult to read if you open them in a text editor), and then an XmlWriter is created from both the settings and the stream.
Once you have a writer, you can quickly save data in a compliant format, with the caveat that you must ensure that every WriteStart call has a corresponding WriteEnd call, that you only have a single document element, and so on.
Assuming the writer gets to the end without any errors, the stream is closed, then the temporary file is copied to the final destination before being deleted. (This is a good tip in its own right, as it means you won't destroy the user's existing file if an error occurs, which you would if you wrote directly to the destination file.)
Reading XML using an XmlDocument
As discussed above, we currently use an XmlDocument to load data. The following snippet shows an example of this.
Note that the code below won't work "out of the box" as we use a number of extension methods to handle data type conversion, which makes the code a lot more readable!
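XmlDocument document;
XmlElement documentElement;
document = new XmlDocument();
document.Load(fileName);
documentElement = document.DocumentElement;
// AsString, AsDate and AsBoolean are our own extension methods (not shown)
_uri = documentElement.SelectSingleNode("uri").AsString();
_lastCrawled = documentElement.SelectSingleNode("uri/@lastCrawled").AsDate();
_includeSubDomains =
documentElement.SelectSingleNode("uri/@includeSubDomains").AsBoolean(false);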
So, as you can see, we load an XmlDocument with the contents of our file. We then call SelectSingleNode several times, each with a different XPath expression.
And in the case of a crawler project, we do this a lot, as there is a large amount of information stored in the file.
I haven't tried to benchmark XPath, but I would assume that we could have optimized this by first getting the appropriate element (uri in this case) and then running additional XPath against it to read text and attributes. But then this article would be rather pointless, as we want to discuss the XmlReader!
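For reference, the untested optimization mentioned above might look something like the following. This is only a sketch of the idea, not code from WebCopy: the UriElementLoader class and its Load method are made up for illustration, and it returns raw strings rather than using our conversion extension methods.

```csharp
using System;
using System.Xml;

public static class UriElementLoader
{
    // Hypothetical helper: returns the uri text plus its two attributes as raw strings.
    public static string[] Load(string xml)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(xml);

        // A single XPath query to locate the element...
        XmlElement uriElement = (XmlElement)document.DocumentElement.SelectSingleNode("uri");
        if (uriElement == null)
            return null;

        // ...followed by plain DOM access instead of further XPath expressions.
        // XmlElement.GetAttribute returns an empty string when the attribute is missing.
        return new[]
        {
            uriElement.InnerText,
            uriElement.GetAttribute("lastCrawled"),
            uriElement.GetAttribute("includeSubDomains")
        };
    }

    public static void Main()
    {
        string xml = "<project><uri lastCrawled=\"42\" includeSubDomains=\"false\">http://example.com/</uri></project>";
        Console.WriteLine(string.Join(" | ", Load(xml)));
    }
}
```

Whether this is measurably faster than repeated SelectSingleNode calls is something we have not benchmarked.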
As an example, we have a 2 MB project file which represents the development version of cyotek.com. Using System.Diagnostics.Stopwatch, we timed how long it took to load this project 10 times, and it averaged 25 seconds per load, which is definitely unacceptable.
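A timing harness along these lines is enough for this kind of rough measurement. It is a minimal sketch rather than the exact code we used: LoadTimer and AverageTime are illustrative names, and the Thread.Sleep call is just a placeholder for the real project load.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class LoadTimer
{
    // Runs the action the given number of times and returns the mean duration per run.
    public static TimeSpan AverageTime(Action action, int iterations)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        stopwatch.Stop();

        return TimeSpan.FromTicks(stopwatch.Elapsed.Ticks / iterations);
    }

    public static void Main()
    {
        // Placeholder workload; substitute the call that loads the project file.
        TimeSpan average = AverageTime(() => Thread.Sleep(10), 10);
        Console.WriteLine("Average: {0:F1} ms", average.TotalMilliseconds);
    }
}
```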
Reading Using an XmlReader
Which brings us to the point of this article: doing the job using an XmlReader and hopefully improving the performance dramatically.
Before we continue though, a caveat:
This is the first time I've tried to use the XmlReader class, therefore it is possible this article doesn't take the best approach. I also wrote this article at the same time as getting the reader to work in my application, so I've gone back and forth already correcting errors and misconceptions, which at times (and possibly still) left the article a little disjointed. If you spot any errors in this article, please let us know.
The XmlReader seems to operate on the same principle as the XmlWriter, in that you need to read the data in more or less the same order as it was written. I suppose the most convenient analogy is a forward cursor in SQL Server, where you can only move forward through the records and not back.
Creating the Reader
So, first things first - we need to create an object. But the XmlReader (like the XmlWriter) is abstract. Fortunately, exactly like the writer, there is a static Create method we can use.
Continuing in the reader-is-just-like-writer vein, there is also an XmlReaderSettings class which you can use to fine-tune certain aspects.
Let's get the document opened then. XmlReader.Create can accept a file name directly, much like XmlDocument.Load, but here we open a stream ourselves and pass that to the reader.
using (FileStream fileStream = File.OpenRead(fileName))
{
XmlReaderSettings settings;
settings = new XmlReaderSettings();
settings.ConformanceLevel = ConformanceLevel.Document;
using(XmlReader reader = XmlReader.Create(fileStream, settings))
{
}
}
This sets us up nicely. Continuing my analogy from earlier, if you're familiar with record sets, there's usually a MoveNext or a Read method you call to read the next record in the set. The XmlReader doesn't seem to be different in this respect, as there's a dedicated Read method for stepping through every node in the document. In addition, there are a number of other read methods for performing more specific actions.
There's also a NodeType property which lets you know what the current node type is, such as the start of an element, or the end of an element.
I'm going to use the IsStartElement
method to work out if the current node is the start of an element, then perform processing based on the element name.
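To get a feel for what Read and NodeType report, it can help to dump every node the reader visits. The sketch below does that for a tiny inline document; NodeDumper is a made-up name, and the XML is a cut-down stand-in for the sample at the start of the article. It also prints the Depth property of each node.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeDumper
{
    // Returns one line per node visited: node type, name and depth.
    public static List<string> Dump(string xml)
    {
        List<string> lines = new List<string>();
        XmlReaderSettings settings = new XmlReaderSettings { IgnoreWhitespace = true };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            // Read returns false once there are no more nodes
            while (reader.Read())
                lines.Add(string.Format("{0} '{1}' depth={2}", reader.NodeType, reader.Name, reader.Depth));
        }

        return lines;
    }

    public static void Main()
    {
        string xml = "<project><uri includeSubDomains=\"false\">http://example.com/</uri></project>";
        foreach (string line in Dump(xml))
            Console.WriteLine(line);
    }
}
```

Note that attributes do not show up as separate nodes in this loop; they have to be visited explicitly from their owning element.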
Enumerating Elements, Regardless of Their Position in the Hierarchy
The following snippet will iterate all nodes and check to see if they are the start of an element. Note that this includes top level elements and child elements.
while (reader.Read())
{
if (reader.IsStartElement())
{
}
}
The Name
property will return the name of the active node. So I'm going to compare the name against the names written into the XML and do custom processing for each.
switch (reader.Name)
{
case "uri":
break;
}
Reading Attributes on the Active Element
I mentioned above that there are a number of Read*
methods. There are also several Move*
methods. The one that caught my eye is MoveToNextAttribute
, which I'm going to use for converting attributes to property values.
The Value
property will return the value of the current node. If MoveToNextAttribute
returns true
, then I know I'm in a valid attribute and I can use the aforementioned Name
property and the Value
property to update property assignments.
The following snippet demonstrates the MoveToNextAttribute
method and Value
property:
while (reader.MoveToNextAttribute())
{
switch (reader.Name)
{
case "lastCrawled":
if (!string.IsNullOrEmpty(reader.Value))
_lastCrawled = DateTime.FromBinary(Convert.ToInt64(reader.Value));
break;
case "includeSubDomains":
if (!string.IsNullOrEmpty(reader.Value))
_includeSubDomains = Convert.ToBoolean(reader.Value);
break;
}
}
This is actually quite a lot of work. An alternative is to use the GetAttribute method, which reads an attribute value without moving the reader. I found this very handy when I was loading an object whose identifying property wasn't the first attribute in the XML block. It also takes up a lot less code.
entry.Headers.Add(reader.GetAttribute("key"), reader.GetAttribute("value"));
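Applied to the uri element from earlier, the GetAttribute approach could look roughly like this. It is a sketch under a few assumptions: UriSettings and UriAttributeReader are illustrative types rather than part of WebCopy, and XmlConvert is used for parsing, which expects lower case true and false as found in the sample file.

```csharp
using System;
using System.IO;
using System.Xml;

public sealed class UriSettings
{
    public string Uri;
    public DateTime? LastCrawled;
    public bool IncludeSubDomains;
}

public static class UriAttributeReader
{
    public static UriSettings Read(string xml)
    {
        UriSettings result = new UriSettings();

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.Name != "uri")
                    continue;

                // GetAttribute returns null when the attribute is absent
                string lastCrawled = reader.GetAttribute("lastCrawled");
                if (!string.IsNullOrEmpty(lastCrawled))
                    result.LastCrawled = DateTime.FromBinary(XmlConvert.ToInt64(lastCrawled));

                string includeSubDomains = reader.GetAttribute("includeSubDomains");
                if (!string.IsNullOrEmpty(includeSubDomains))
                    result.IncludeSubDomains = XmlConvert.ToBoolean(includeSubDomains);

                // The reader is still positioned on the element, so its text can be read directly
                result.Uri = reader.ReadElementContentAsString();
                break; // only the first uri element matters in this sketch
            }
        }

        return result;
    }

    public static void Main()
    {
        string xml = "<project><uri lastCrawled=\"-8589156546443756722\" includeSubDomains=\"false\">http://saturn/cyotekdev/</uri></project>";
        UriSettings settings = Read(xml);
        Console.WriteLine("{0} (include sub domains: {1})", settings.Uri, settings.IncludeSubDomains);
    }
}
```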
Reading the Content Value of an Element
I've now got two values out of hundreds in the file loaded and I'm finished with that element. Or am I? Actually I'm not - the original save code demonstrates that in addition to a pair of attributes, we're also saving data directly into the element.
As we have been iterating attributes, the active node type is the last attribute, not the original element. Fortunately, there's another method we can use - MoveToContent
. This time though, we can't use the Value
property. Instead, we'll call the ReadString
method, giving us the following snippet:
if (reader.IsStartElement() || reader.MoveToContent() == XmlNodeType.Element)
_uri = reader.ReadString();
I've included a call to IsStartElement
in the above snippet as I found if I called MoveToContent
when I was already on a content node (for example if no attributes were present), then it skipped the current node and moved to the next one.
If required, you can call ReadElementContentAsString
instead of ReadString
.
Some node values aren't strings though - in this case, the XmlReader offers a number of strongly typed methods to return and convert the data for you, such as ReadElementContentAsBoolean, ReadElementContentAsDateTime, etc.
case "useHeaderChecking":
_useHeaderChecking = reader.ReadElementContentAsBoolean();
break;
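One thing to watch with the ReadElementContentAs methods is that they leave the reader positioned after the element's end tag. If the surrounding loop then calls Read again, the node the reader has just landed on is stepped over; in an indented file that node is usually whitespace, but in an unindented one it can be the next element. The sketch below shows one possible way of structuring the loop so that nothing is skipped. TypedValueReader is a hypothetical name and only two of the sample elements are handled.

```csharp
using System;
using System.IO;
using System.Xml;

public static class TypedValueReader
{
    public static void Read(string xml, out bool stripQueryString, out bool useHeaderChecking)
    {
        stripQueryString = false;
        useHeaderChecking = false;

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    switch (reader.Name)
                    {
                        case "stripQueryString":
                            // Moves past </stripQueryString>, so skip the Read call below
                            stripQueryString = reader.ReadElementContentAsBoolean();
                            continue;
                        case "useHeaderChecking":
                            useHeaderChecking = reader.ReadElementContentAsBoolean();
                            continue;
                    }
                }

                // Only advance manually when nothing above has already moved the reader
                reader.Read();
            }
        }
    }

    public static void Main()
    {
        // Deliberately unindented, so the two elements are directly adjacent
        string xml = "<project><stripQueryString>false</stripQueryString><useHeaderChecking>true</useHeaderChecking></project>";
        bool strip, header;
        Read(xml, out strip, out header);
        Console.WriteLine("stripQueryString={0}, useHeaderChecking={1}", strip, header);
    }
}
```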
Processing Nodes Where the Same Names are Reused for Different Purposes
In the sample XML document at the start of this article, we have two different types of nodes named uri
. The top level one has one purpose, and the children of additionalUri
have another.
The problem we now face is that, as we have a single loop which processes all elements, the case statement for uri will be triggered multiple times. We're going to need some way of determining which is which.
There are a few ways we could do this, for example:
- Continue to use the main processing loop, and just add a means of identifying which type of element is being processed.
- Add another loop to process the children of the additionalUri element.
- Use the ReadSubtree method to create a brand new XmlReader containing the children and process that accordingly.
As we already have a loop which handles the elements we should probably reuse this - there'll be a lot of duplicate code if we suddenly start adding new loops.
Unfortunately there doesn't seem to be an equivalent of the parent functionality of the XmlDocument class; the closest thing I could see was the Depth property. This returned 1 for the top level uri node, and 2 for the child versions. You need to be careful at what point you read this property, as it also returned 2 when iterating the attributes of the top level uri node.
One workaround would be to use boolean flags to identify the type of node you are loading. This would also mean checking to see if the NodeType was XmlNodeType.EndElement, doing another name comparison, and resetting flags as appropriate. This might be more reliable (or understandable) than simply checking node depths; your mileage may vary.
Another alternative could be to combine depth and element start/end in order to push and pop a stack which would represent the current node hierarchy.
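For comparison, the stack idea might look something like the following. This is an untested-in-production sketch, not what the crawler does: PathTrackingReader is an illustrative name, a List is used as the stack so the path is easy to join, and values are collected from text nodes so the reader position is never disturbed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PathTrackingReader
{
    // Returns the text of every element whose full path (for example "project/uri") matches.
    public static List<string> ReadValues(string xml, string targetPath)
    {
        List<string> values = new List<string>();
        List<string> path = new List<string>(); // used as a stack of open element names

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        path.Add(reader.Name); // push
                        // An empty element such as <defaultDocuments /> has no EndElement node
                        if (reader.IsEmptyElement)
                            path.RemoveAt(path.Count - 1);
                        break;
                    case XmlNodeType.Text:
                        if (string.Join("/", path.ToArray()) == targetPath)
                            values.Add(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        path.RemoveAt(path.Count - 1); // pop
                        break;
                }
            }
        }

        return values;
    }

    public static void Main()
    {
        string xml = "<project><uri>root</uri><additionalUri><uri>first url</uri><uri>second url</uri></additionalUri></project>";
        Console.WriteLine(string.Join(", ", ReadValues(xml, "project/additionalUri/uri").ToArray()));
    }
}
```

Building the path string for every text node is wasteful, so a real implementation would probably cache it or compare depth and name instead.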
In order to get my converted code running, I went with the boolean flag route. I suspect a future version of the crawler format is going to ensure the nodes have unique names so I don't have to do this hoop jumping again though!
Combined together, the load data code now looks like this:
while (reader.Read())
{
if (reader.IsStartElement())
{
switch (reader.Name)
{
case "uri":
if (!isLoadingAdditionalUris)
{
while (reader.MoveToNextAttribute())
{
switch (reader.Name)
{
case "lastCrawled":
if (!string.IsNullOrEmpty(reader.Value))
_lastCrawled = DateTime.FromBinary(Convert.ToInt64(reader.Value));
break;
case "includeSubDomains":
if (!string.IsNullOrEmpty(reader.Value))
_includeSubDomains = Convert.ToBoolean(reader.Value);
break;
}
}
if (reader.IsStartElement() || reader.MoveToContent() == XmlNodeType.Element)
_uri = reader.ReadString();
}
else if (reader.IsStartElement() || reader.MoveToContent() == XmlNodeType.Element)
_additionalRootUris.Add(new Uri(UriHelpers.CombineUri(
this.GetBaseUri(), reader.ReadString(), this.SlashedRootRemapMode)));
break;
case "additionalUri":
isLoadingAdditionalUris = true;
break;
}
}
else if (reader.NodeType == XmlNodeType.EndElement)
{
switch (reader.Name)
{
case "additionalUri":
isLoadingAdditionalUris = false;
break;
}
}
}
This is significantly more code than the original version, and it's only handling a few values.
Using the ReadSubtree Method
The save functionality of crawler projects isn't centralized; child objects such as rules perform their own loading and saving via the following interface:
public interface IXmlPersistance
{
void Write(string fileName, XmlWriter writer);
void Read(string fileName, XmlNode reader);
}
And the current XmlDocument
based code will call it like this:
_rules.Clear();
foreach (XmlNode child in documentElement.SelectNodes("rules/rule"))
{
Rule rule;
rule = new Rule();
((IXmlPersistance)rule).Read(fileName, child);
_rules.Add(rule);
}
None of this code will work now with the switch to XmlReader, so it all needs changing. For this, I'll create a new interface:
public interface IXmlPersistance2
{
void Write(string fileName, XmlWriter writer);
void Read(string fileName, XmlReader reader);
}
The only difference is that the Read method now takes an XmlReader rather than an XmlNode.
The next issue is that if I pass the original reader to this interface, the implementer will be able to read outside the boundaries of the element it is supposed to be reading, which could prevent the rest of the document from loading successfully.
We can resolve this particular issue by calling the ReadSubtree
method which returns a brand new XmlReader
object that only contains the active element and its children. This means our other settings objects can happily (mis)use the passed reader without affecting the underlying load.
Note that in the snippet below we have wrapped the new reader in a using statement. The MSDN documentation states that the result of ReadSubtree should be closed before you continue reading from the original reader.
case "rule":
Rule rule;
rule = new Rule();
// the child reader is limited to the current rule element and its children
using (XmlReader childReader = reader.ReadSubtree())
((IXmlPersistance2)rule).Read(fileName, childReader);
_rules.Add(rule);
break;
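The self-contained sketch below illustrates that containment. The inner loop plays the part of a greedy implementer that simply reads until its reader runs out, yet the outer loop still finds the next rule. SubtreeDemo and the inline XML are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SubtreeDemo
{
    public static List<string> ReadRules(string xml)
    {
        List<string> rules = new List<string>();

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.Name != "rule")
                    continue;

                using (XmlReader child = reader.ReadSubtree())
                {
                    // Reads to the end of the child reader, which stops at </rule>
                    while (child.Read())
                    {
                        if (child.NodeType == XmlNodeType.Text)
                            rules.Add(child.Value);
                    }
                }
                // Once the child is closed, the outer reader carries on from the end of this rule
            }
        }

        return rules;
    }

    public static void Main()
    {
        string xml = "<rules><rule>/downloads/get</rule><rule>/article</rule></rules>";
        foreach (string rule in ReadRules(xml))
            Console.WriteLine(rule);
    }
}
```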
Getting an XmlDocument from an XmlReader
One issue I did have was with classes which extend the load behaviour of an existing class. For example, one abstract class has a number of base properties, which I easily converted to use XmlReader. However, this class is inherited by other classes which load additional properties. Using the loop method outlined above, it wasn't possible for these child classes to read their data, as the reader had already been fully read. I didn't want these derived classes to have to load the base properties themselves, and I didn't want to implement any half thought out idea. So, instead, these classes continue to use the original XmlDocument based loading. Given an XmlReader as the source, then, how do you get an XmlDocument?
Turns out this is also very simple - the Load method of the XmlDocument can accept a reader. The only disadvantage is that the constructor of the XmlDocument doesn't support this, which means you have to explicitly declare a document, load it, then pass it on, as demonstrated below.
void IXmlPersistance2.Read(string fileName, XmlReader reader)
{
XmlDocument document;
document = new XmlDocument();
document.Load(reader);
((IXmlPersistance)this).Read(fileName, document.DocumentElement);
}
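The same trick can be combined with ReadSubtree so that only the current element is turned into a DOM, leaving the rest of the file to the fast reader. The following is a small sketch of that combination; FragmentLoader and LoadCurrentElement are hypothetical names, and the caller is assumed to have positioned the reader on an element first.

```csharp
using System;
using System.IO;
using System.Xml;

public static class FragmentLoader
{
    // Loads the element the reader is currently positioned on into its own XmlDocument.
    public static XmlElement LoadCurrentElement(XmlReader reader)
    {
        XmlDocument document = new XmlDocument();

        using (XmlReader subtree = reader.ReadSubtree())
            document.Load(subtree);

        return document.DocumentElement;
    }

    public static void Main()
    {
        string xml = "<rules><rule options=\"1\" enabled=\"true\">/sitemap</rule></rules>";

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // ReadToFollowing returns true when a matching element was found
            if (reader.ReadToFollowing("rule"))
            {
                XmlElement rule = LoadCurrentElement(reader);
                Console.WriteLine("{0}: enabled={1}", rule.InnerText, rule.GetAttribute("enabled"));
            }
        }
    }
}
```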
Fortunately, these classes aren't used frequently and so they shouldn't adversely affect the performance tuning I'm trying to do.
I could have used the GetAttribute method I discussed earlier, as this doesn't move the reader. However, I didn't discover that method until after I wrote this section of the article, which I thought had enough value to remain, and I don't think there is an equivalent for elements.
The Final Verdict
Using the XmlReader is certainly long-winded compared to the original code. The core of the original code is around 100 lines. The core of the new code is more than triple this. I'll probably replace all the "move to next attribute" loops with direct calls to GetAttribute, which will cut down the amount of code a fair bit. I may also try to do a generic approach using reflection, although this will then have its own performance drawback.
However, the XML load performance increase was certainly worth the extra code - the average went from 25 seconds down to 12 seconds. This is still quite slow and I certainly want to improve it further, but at less than half the original load time, I'm pleased with the result.
You also need to be careful when writing the document. In Cyotek crawler projects, as we are using XPath to query an entire document, we can load values no matter where they are located. When using an XmlReader, the values are read in the same order as they were written - so if you have saved a critical piece of information near the end of the document, but you require it when loading information at the start, you're going to run into problems.