A Simple XML Parser from Scratch

rj45

4.38/5 (9 votes)

8 Jul 2011CPOL3 min read

44.7K

468

For performance reasons, I decided to write my own XML parser.

Download source - 2.93 KB

Introduction

This is an implementation of a subset of the XML classes provided in .NET. This may be useful since it does run faster than the .NET classes, although it has less features and has some bugs. If you know exactly what data you need from a document, this is a fast approach to pull it out. It is also useful if you don't have access to these .NET libraries for some reason such as if you are using silverlight or embedded devices, with limited libraries, XBox, windows phone 7 etc etc.

If you've never used the .NET classes XmlDocument and XPathNavigator, you will need to know the select syntax. If you know, then feel free to skip this section. If not, first read this page.

Essentially, Select just lets you navigate through the trees of XML nodes. However, there are also attributes in XML such as ex.

XML

<shape><triangle sid="1/></shape> .

The sid is an attribute and to select this, the path would be:

ex1)

SQL

Select("shape/triangle") 
Evaluate( "@sid" )

returns 1

ex2) (no attribute)

XML

<car><wheel>spinning</wheel></car> :
Select = "car"
evaluate "wheel" // no @ needed

returns "spinning"

OR the fastest way, but with no scope protection:

Just simply evaluate("@sid") with no select at all will search out the first occurrence @sid and MoveNext() does what you'd think. This example is not the same as .NET and obviously increases performance massively but you can only do it when you know the document.

Scope

When I say Scope detection, I really mean limiting the evaluation to the node scope. The .NET classes have protection and will return String.Empty in cases where the scope is wrong. This class disables it so you can optimize your search without specifying any scope. In reality, it doesn't really disable it, inherent in the parsing algorithm is a key difference. This algorithm uses searches by simple indexof(). It does not compile and cache anything except two integers begin and end index. This approach uses almost no memory and is in every case faster but in cases with ambiguous scope it CAN return the wrong data. Therefore, you need to know the scope is unique in your selects.

Usage

Loading up some values from a document is similar to .NET:

public static void GetSingleValueFXmlDoc()
{
    string xml = "<car value=\"1\"><wheel sid=\"1\" bolts=\"4\">
	ddd</wheel><wheel sid=\"2\" bolts=\"4\">dddd</wheel></car>";

    DateTime dt = DateTime.Now;

    FXMLDocument doc = new FXMLDocument();
    doc.LoadXml(xml);
    doc.Select( "car/wheel" );
    string sid = doc.GetValueString("@sid"); // or doc.Evaluate(
    double df = doc.GetValueDouble("@bolts");
    
    string diff = DateTime.Now.Subtract(dt).ToString();
    Console.WriteLine("single values FXML Time\t\t={0}", diff);
}

It's basically the same as .NET but no navigator. Here, we load the document, select the path in syntax the same as xpath navigator and call get value. There is also an evaluate method but getvalue just returns the type you want with a simple wrapper.

Reading all attributes in a selected node

for (int i = 0; i < count; ++i)
{
    FXMLDocument doc = new FXMLDocument();
    doc.LoadXml(xml);
    doc.Select("car/wheel");

    foreach (FXMLDocument x in doc)
    {    
        string sid = x.GetValueString("@sid");
        double df = x.GetValueDouble("@bolts");
    }
}

How It Works

   for (int i = 0; i < nodes.Length; i++)
225      {
226          string nodeStart = String.Empty;
227          string nodeEnd = String.Empty;
228  
229          if ((nodes.Length > i + 1) && nodes[i + 1].IndexOf("@") == 0)
230          {
231              nodeStart = string.Format("{0}=\"", nodes[i + 1].Replace("@", ""));
232              nodeEnd = "\"";
233              isAttribute = true;
234          }
235          else if ((nodes.Length == 1) && nodes[i].IndexOf("@") == 0)
236          {
237              nodeStart = string.Format("{0}=\"", nodes[i].Replace("@", ""));
238              nodeEnd = "\"";
239              isAttribute = true;
240          }
241          else
242          {
243              // not last node
244              nodeStart = string.Format("{0}{1}", "<", nodes[i]);
245              nodeEnd = string.Format("{0}{1}{2}", "</", nodes[i], ">");
246              isAttribute = false;
247          }
248  
249          GetBetweenExcludeTokens(xmlDocument, nodeStart, nodeEnd, ref begin, out end);
250  
251          if (isAttribute)
252          {
253              return; //xmlbuffer;
254          }
255          int indexNextNode = (xmlDocument.IndexOf(">", begin)) + 1; // +1 to include >
256          int indexofEquals = xmlDocument.IndexOf("=", begin);
257          if (indexofEquals != -1 && (indexofEquals < indexNextNode))
258          {
259              if (begin < indexNextNode)
260              {
261                  begin += indexNextNode - begin;
262              }
263              else
264              {
265                  begin += begin - indexNextNode;
266              }
267          }
268          else
269          {
270              if (xmlDocument[begin] == ('>'))
271              {
272                  ++begin;
273              }
274          }
275      }
276  }

This is the main loop of the entire process. Basically, it just loops through all the nodes of the select statement and gets the text between them and that's it. Of course, there is special handling for attributes which really makes what would have been an elegant system a real nightmare to handle. I can bet you that there were some arguments in the office at Microsoft the day some smart guy wanted attributes. In any case, GetBetweenExcludeTokens is doing the meat of the entire operation, everything else is just syntax support.

Points of Interest

This runs only around 30% faster than the .NET classes which for most people are not going to be worth it since it is not fully compatible. However in example GetSingleValueFXmlDoc, there is a secret to speed things up. Because we know the document contains only wheels->bolts ( bolts is unique), why bother selecting car/wheel, with this you can just call doc.Evaluate( @bolts ) without selecting anything which will boost performance by at least 100% over .NET classes. This was actually all it was intended for but, as I went I hit a lot of gotchas and this minimizes them slightly when you turn ValidateSyntax on.

My first version was just a static method which was probably better and I wish I saved it, however you can't MoveNext very nicely in a static implementation (although it's nice not to have to new it). You could modify getSubNodes to be static quite easily.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)