.NET technique for converting large text files into XML structures intended for Android objects

Mario Ghecea

5 Jul 2012, CPOL, 12 min read

This article describes the technique I used to ingest large text files to create an e-book publishing platform on the Android Market

Introduction

Many methods are available on the web for ingesting large files and translating the resulting data sets into XML. The goal here is XML that maps quickly onto the Android Object Models (AOM) used on mobile devices, without further post-processing. That matters because the resulting data sets can be so large and cumbersome that refining the XML afterwards by other means would be futile and time consuming. The parser must therefore be smart enough to emit XML that fits the form of the underlying Android objects the first time around. It also follows that if the objects depending on the ingested XML change, the XML structure may have to change with them over time.

With this small set of objectives in mind, I will show you in this tutorial how to implement a simple yet efficient parsing strategy that achieves them relatively quickly, without relying on frameworks such as XML DOM, SAX or other known parsers. I chose not to use these frameworks mainly for the sake of simplicity, availability and portability. I just wanted a simple model, one that can easily be transferred to other languages and platforms. Open source is prevalent throughout the industry thanks to its cost savings and wide availability, and I wanted a strategy that others and I can easily modify without much dependency on external libraries. With that said, let's dive into the nuts and bolts of this technique!

Background

Figure 1. Android e-book reader on my Android phone featuring the XML parser discussed in this article

I have developed a simple e-book publishing framework on the Android Market. My objective was a framework that was lightweight, could handle large documents and could load large XML files relatively quickly and effortlessly! I assumed that the documents themselves would have to change over time. But why? Quite simple, actually: our mindset changes as our end users' expectations change, as technologies advance and as the feature set of our applications expands. This implies that the XML model would also have to change over time! To test these assumptions I set out to experiment with my first large e-book on Android. I wanted something that was copyright free and generated very large XML data for my Android classes to ingest. I did some googling and found a copyright-free text of the KJV. This was just what I needed to build a parsing framework for text metadata that could later work for a variety of other documents to be converted to e-books.

Figure 2. Android e-book reader on my Android NOOK tablet

In a nutshell, this is what I started with: a plain text file, from which the following XML structures were derived.

For the INDEX.XML file the following element structure will be created by the parser:

1. An XML root element called <KJV>

2. An XML element called <BOOK>

3. An XML subelement called <BOOKID>

4. An XML subelement called <BOOKNAME>

Figure 3. Loading the copyright free KJV document in notepad.exe

From those initial XML elements I decided to split each BOOK section into its own indexable file: every BOOK element in the index carries a BOOKID, and that BOOKID names the XML file holding the book's contents. In a sense, the file name is a link back into the INDEX file. That was a good decision, as object loading in Android was much faster with smaller individual sections than with one enormous text blob! This is the main point of this exercise and my TIP of the day: when creating large object models from XML metadata, break large XML sections down into smaller ones, or the end user will be left waiting while your parser objects load, not to mention the memory issues, timeouts and so on that you might face in those objects. We all learn from our mistakes and end up refining the cumbersome processes that hinder measurable progress. So if you ever experience waiting times that are unacceptable to the end user, go back and refine your XML model: break it down into fast-loading chunks and proceed from there. This increases the class granularity of your final object model, but that is a small price to pay for the speed you gain later on.

Here is the resulting INDEX.XML from the C# .NET parser:

XML

<?xml version="1.0" encoding="utf-8"?>
<KJV>
  <BOOK>
    <BOOKID>BOOK1</BOOKID>
    <BOOKNAME>Genesis</BOOKNAME>
  </BOOK>
  <BOOK>
    <BOOKID>BOOK2</BOOKID>
    <BOOKNAME>Exodus</BOOKNAME>
  </BOOK>
  <BOOK>
    <BOOKID>BOOK3</BOOKID>
    <BOOKNAME>Leviticus</BOOKNAME>
  </BOOK>
...
 
</KJV>
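To illustrate how the index links to the per-book files, here is a minimal sketch that reads an index document and derives a file name from each BOOKID. It is not part of the original parser: the class name IndexSketch and the method ReadIndex are hypothetical, LINQ to XML is used only as a convenient way to check the output (the parser itself needs no XML library), and the lower-cased file name is an assumption based on the "book" + n + ".xml" paths the parser writes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for inspecting the generated index; not the article's code.
public static class IndexSketch
{
    // Returns (BOOKID, BOOKNAME, file name) for every <BOOK> element in the index.
    public static List<Tuple<string, string, string>> ReadIndex(string indexXml)
    {
        XDocument doc = XDocument.Parse(indexXml);
        return doc.Root
                  .Elements("BOOK")
                  .Select(b =>
                  {
                      string id = (string)b.Element("BOOKID");
                      string name = (string)b.Element("BOOKNAME");
                      // Assumption: each book file is named after the lower-cased BOOKID.
                      return Tuple.Create(id, name, id.ToLowerInvariant() + ".xml");
                  })
                  .ToList();
    }
}
```

With the index shown above, the first entry would pair BOOK1 and Genesis with the file name book1.xml, so a reader application only has to load the small index up front and can open an individual book file on demand.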

For the BOOK#.XML files the following structure was chosen:

1. An XML root element called <BOOK#>

2. An XML subelement called <BOOKNAME>

3. An XML subelement called <SECTIONS>

4. An XML subelement called <SECTION>

5. An XML subelement called <VERSES>

6. An XML subelement called <VERSE>

7. An XML subelement called <TEXT>

Here is the BOOK1.XML which is derived from the <BOOKID> element referenced in INDEX.XML:

XML

<?xml version="1.0" encoding="utf-8"?>
<BOOK1>
  <BOOKNAME>Genesis</BOOKNAME>
  <SECTIONS>
    <SECTION id="001">
      <VERSES>
        <VERSE id="001">
          <TEXT>In the beginning God created the heaven and the earth.</TEXT>
        </VERSE>
        <VERSE id="002">
          <TEXT>And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.</TEXT>
        </VERSE>
...
 
       </VERSES>
     </SECTION>
  </SECTIONS>
</BOOK1>

So what is happening here? To recapitulate, we are creating a hierarchy for the Android object model! Imagine a BOOK object. As we flip through its virtual pages, the BOOK holds a SECTIONS collection; each entry in it is a SECTION object holding a VERSES collection; and each VERSE object in turn holds a TEXT field. Thinking in terms of classes, the XML elements map onto the member variables of those classes, which expose them through properties and methods. There is a relatively painless way to achieve this hierarchy abstraction, so refer to the part of this article further down where I mention the SIMPLE framework.
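The hierarchy described above can be sketched as a handful of small classes. This is only an illustration written in C# to match the parser code; the real Android classes are Java and are not shown in this article, so every class, property and method name below is an assumption.

```csharp
using System.Collections.Generic;

// Illustrative model only; names are hypothetical and mirror the XML elements.
public class Verse
{
    public string Id { get; set; }    // VERSE id attribute, e.g. "001"
    public string Text { get; set; }  // contents of the <TEXT> element
}

public class Section
{
    public string Id { get; set; }    // SECTION id attribute, e.g. "001"
    public List<Verse> Verses { get; } = new List<Verse>();
}

public class Book
{
    public string Name { get; set; }  // contents of <BOOKNAME>
    public List<Section> Sections { get; } = new List<Section>();

    // Looks up a verse by SECTION id and VERSE id; returns null when not found.
    public Verse Find(string sectionId, string verseId)
    {
        foreach (Section s in Sections)
            if (s.Id == sectionId)
                foreach (Verse v in s.Verses)
                    if (v.Id == verseId)
                        return v;
        return null;
    }
}
```

The Find method shows why the id attributes are worth keeping in the XML: they give the reader application a direct way to address a verse inside a book.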

Each BOOK object is thus represented by a SECTIONS collection broken down into SECTION, VERSES, VERSE and TEXT elements that follow each other in sequence and together fill in the entire contents of the BOOK. It also helps to think of the possibilities this offers the user: the hierarchy of the e-book can now be browsed using the SECTION id and VERSE id attributes! This is an important level of abstraction, because it lets the user navigate the contents of the BOOK object model more effectively. The screenshot below shows the object model derived from the XML metadata discussed above, in the KJV e-book reader application running on my NOOK tablet.

Figure 4. Android e-book reader on my Android tablet showing the Verse Reader

So we have covered the resulting XML model in detail! But not enough has been said about the platform used to ingest all this data into a final Android Object Model (AOM). What follows is a breakdown of the pluses and minuses I encountered while migrating XML metadata to such a model, along with an attractive option that helps the developer get around the shortcomings of the alternatives. Yes, it can be painful to learn the hard way, and that is exactly what I went through...

Some of the strategies for XML object parsing on Android evolved from the DOM and SAX parsing frameworks in Java. Personally, I found these frameworks limiting because they did not allow reading, writing and saving a modified XML structure uniformly from within the same parser object model. I sometimes found myself reaching for additional libraries or creating my own from scratch, mixing and matching the various technologies.

All this changed when I ran across a lightweight framework called SIMPLE. I will not go into an in-depth discussion of SIMPLE, because the tutorials on its website cover anything I might add in great detail. However, I highly recommend that you check out this small framework and give it a spin, as it allows a fast, low-overhead transition without a great cost in size for your mobile application! Check out SIMPLE for Android at:

http://simple.sourceforge.net/articles.php

Before I get too carried away with using SIMPLE to map XML data to your AOM, I must come back to the purpose of this article, which is the first step towards a successful transition to XML ingestion frameworks such as SIMPLE. Let's look at some XML pre-processing C# code and elaborate further on how to do this!

Using the code

Initially, I decided to write a lightweight pre-parser, so I rolled my own in C# .NET! I have Visual Studio 10 installed, so why not? It's simple enough to do, so let's start with the class diagram, member variables, methods and so on. Here we go, diving into some details of how this was first conceived!

Figure 5. Fields of kjvParser

We start out with the using section and, as promised, nothing striking or special jumps out at us: only standard .NET namespaces, plus System.IO for file manipulation.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
 
// kjv_parser
// created by Mario Ghecea on 6/05/2012
// a simple XML bible writing utility
// that takes a text bible and converts it to xml elements
// by parsing each line as text...

// format is as follows:
// chapters starting with the keyword "Book"
// verses starting with numerics i.e. "001:001"
// everything else is considered a verse
// or ignored completely in case of whitespace.

In the following section we declare the class that will perform the parsing. Since XML is tag based, I have described the structure of the documents as const string tags. The most notable part of this code is the enum ELEMENT, which gives the state machine its states and so tracks where we are within the document hierarchy!

Figure 6. The ELEMENT enum of the state machine

It basically lets the parser know where it is within the document. The values that drive the parsing are IS_BOOK, IS_VERSE and IS_SOMETHING_ELSE; the last is returned when a line is neither a book marker nor a verse marker, in which case the line reader is assumed to be inside a verse. IS_IGNORED is returned for empty or single-character lines, which the parser skips altogether. It is also worth noting that the IS_EOF state is not used at all in this version of the code, because EOF detection is not performed in the state machine logic. It could be removed completely or put to other uses not shown in this implementation.

namespace kjv_parser
{
    class kjv_parser
    {
        // XML Tags
        const string ENCODING = @"<?xml version=""1.0"" encoding=""utf-8""?>";
        const string ROOT = @"<KJV>";
        const string END_ROOT = @"</KJV>";
        const string BOOKNAME = @"<BOOKNAME>";
        const string SECTIONS = @"<SECTIONS>";
        const string END_SECTIONS = @"</SECTIONS>";
        const string SECTION = @"<SECTION id =""";
        const string END_SECTION = @"</SECTION>";
        const string VERSES = @"<VERSES>";
        const string END_VERSES = @"</VERSES>";
        const string BOOK = @"<BOOK";
        const string END_BOOK = @"</BOOK";
        const string END_BOOKNAME = @"</BOOKNAME>";
        const string VERSE = @"<VERSE id =""";
        const string END_VERSE = @"</VERSE>";
        const string TEXT = @"<TEXT>";
        const string END_TEXT = @"</TEXT>";
        const string BOOKID = @"<BOOKID>";
        const string BOOKID_END = @"</BOOKID>";
 
        enum ELEMENT {IS_EOF = -1, IS_SOMETHING_ELSE, IS_BOOK, IS_VERSE, IS_IGNORED};         
 
        static StreamWriter writer;
        static StreamWriter indexer;
        static int bookCount = 0;
        static int verseCount = 0;
        static int lineCount = 0;
        static bool verseTagOpen = false;
        static bool sectionTagOpen = false;
        static string oldVerseRef;
        static string oldSectionRef = "";
        static bool bookStart = false;

Since this is a simple console app with no need for an interface, I read the text file one line at a time and write each line to the console, just to show the progress as it flies by. If you would rather not show the text at all, comment out the Console.WriteLine call. The state machine is shown next! The function IsChapterOrVerse(input) takes each line read from the StreamReader object and first looks for a BOOK marker at the start of the text. If it finds one, we know for certain that a BOOK section begins here, and the line can be parsed into a <BOOK> XML tag. Otherwise the parser runs a numeric check on the first character of the line; if it is a digit, the line is the start of a <VERSE> tag and carries its id attribute! Failing both checks, the parser concludes that the line is neither a BOOK marker nor a VERSE marker, so it is probably the continuation of a verse, and the line is appended, with leading whitespace stripped, to the text of the current <VERSE> tag.

static void Main(string[] args)
   {
       int i = 0;
       using (StreamReader sr = File.OpenText(@"C:\Bible\kjv12\KJV12.TXT"))
       {

           string input = null;
           OpenIndexFile(@"C:\Bible\kjv12\index.xml");
           while ((input = sr.ReadLine()) != null)
           {
               lineCount++;
               Console.WriteLine(input);
               switch (IsChapterOrVerse(input))
               {
                   case ELEMENT.IS_BOOK:
                       bookStart = !bookStart;

                       if (!bookStart)
                       {
                           if (verseTagOpen)
                           {
                               verseTagOpen = false;
                               CloseVerse();
                           }
                           CloseBook();

                           CloseOutputFile();
                           bookStart = true;
                       }

                       if (bookStart)
                       {
                           OpenOutputFile(@"C:\Bible\kjv12\book" + ++i + ".xml");
                           bookCount++;
                           WriteBook(input);
                       }

                       break;
                   case ELEMENT.IS_VERSE:
                       if (verseCount > 0 && verseTagOpen)
                       {
                           CloseVerse();
                           verseTagOpen = false;
                       }
                       WriteVerseTag(input);
                       WriteVerse(input.Substring(8, input.Length - 8));
                       verseTagOpen = true;
                       verseCount++;
                       break;
                   case ELEMENT.IS_SOMETHING_ELSE: // This is still part of the verse even if blank line
                       WriteVerse(input);
                       break;
                   default:
                   case ELEMENT.IS_IGNORED:
                       break;
               }
           }
           // Close whatever is still open, but only if at least one book was started;
           // otherwise no output file exists and there is nothing to close.
           if (bookStart)
           {
               if (verseTagOpen)
                   CloseVerse();
               CloseBook();
               CloseOutputFile();
           }

           CloseIndexFile();
       }

       //Console.ReadLine();
   }

Here is the heart of the parser, the main decision maker that tells us which kind of section the current line falls in! More complicated text document parsers would probably need more evolved logic here, and more sophisticated state machines can be built in the same fashion: look for textual clues, make a decision and advance the logic to the next state!

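static ELEMENT IsChapterOrVerse(string inputLine)
{
    // Empty and single-character lines carry no content and are skipped.
    if (inputLine.Length <= 1)
        return ELEMENT.IS_IGNORED;

    // A book marker starts with the keyword "Book" in any letter case.
    // StartsWith is safe on lines shorter than four characters, unlike Substring(0, 4).
    if (inputLine.StartsWith("BOOK", StringComparison.OrdinalIgnoreCase))
        return ELEMENT.IS_BOOK;

    // A verse marker starts with a digit, as in "001:001".
    if (inputLine[0] >= '0' && inputLine[0] <= '9')
        return ELEMENT.IS_VERSE;

    // Anything else is treated as the continuation of the current verse.
    return ELEMENT.IS_SOMETHING_ELSE;
}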
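As one possible example of the "more evolved logic" mentioned above, the sketch below checks the whole verse reference with a regular expression instead of looking only at the first character. It is a hypothetical variant, not the article's code: the names LineKind and LineClassifier are invented for illustration, and the pattern assumes that every verse line starts with three digits, a colon and three digits, as in "001:001".

```csharp
using System;
using System.Text.RegularExpressions;

public enum LineKind { Ignored, Book, Verse, Continuation }

// Hypothetical variant of IsChapterOrVerse; names and rules are assumptions.
public static class LineClassifier
{
    // Assumed reference format: "001:001" followed by whitespace or end of line.
    static readonly Regex VerseRef = new Regex(@"^\d{3}:\d{3}(\s|$)");

    public static LineKind Classify(string line)
    {
        if (string.IsNullOrWhiteSpace(line))
            return LineKind.Ignored;          // blank lines carry no content
        if (line.StartsWith("Book", StringComparison.OrdinalIgnoreCase))
            return LineKind.Book;             // start of a new book
        if (VerseRef.IsMatch(line))
            return LineKind.Verse;            // start of a new verse
        return LineKind.Continuation;         // more text for the current verse
    }
}
```

The trade-off is stricter input checking at the cost of a dependency on System.Text.RegularExpressions: a continuation line that merely begins with a digit, such as "12 baskets", is no longer mistaken for the start of a verse.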

The worker bees of the parser class are the static methods shown below.

These methods are quite self-explanatory, so I'll leave them to the reader to decipher as a mental exercise. Suffice it to say that the code simply carries out the task of opening and closing the XML tags and then saves the INDEX.XML and BOOK#.XML files to disk in a particular directory! This code is not perfect by any means, but it goes to show what can be done quite quickly and with relative flexibility in mind. An alternative would be to use the XML DOM framework to achieve all this, but I find it to be a kludge and unnecessary for such a simple task.

static void OpenOutputFile(string filePath)
  {
      writer = File.CreateText(filePath);
      writer.Write(ENCODING);
      //writer.Write(ROOT);
  }

  static void OpenIndexFile(string filePath)
  {
      indexer = File.CreateText(filePath);
      indexer.Write(ENCODING);
      indexer.Write(ROOT);
  }

  static void CloseIndexFile()
  {
      indexer.Write(END_ROOT);
      indexer.Close();
  }

  static void WriteBook(string chapterLine)
  {
      writer.Write(BOOK + bookCount + ">");
      indexer.Write(BOOK + ">");
      indexer.Write(BOOKID);
      indexer.Write("BOOK" + bookCount);
      indexer.Write(BOOKID_END);
      writer.Write(BOOKNAME);
      indexer.Write(BOOKNAME);
      int modifier = 0;
      if (bookCount > 9) modifier = 1;
      int start = bookCount.ToString().Length + 7 - modifier;
      string tempLine = chapterLine.Substring(start, chapterLine.Length - start);
      writer.Write(tempLine);
      indexer.Write(tempLine);
      writer.Write(END_BOOKNAME);
      indexer.Write(END_BOOKNAME);
      writer.Write(SECTIONS);
  }

  static void WriteVerse(string verseLine)
  {
      if (verseTagOpen)
          verseLine = " " + verseLine.TrimStart();

      writer.Write(verseLine);
  }

  static void WriteVerseTag(string verseRef)
  {
      String tempRef = verseRef.Substring(0, 3);
      verseRef = verseRef.Substring(4,3);


      if ((oldSectionRef == "") || (oldSectionRef != tempRef))
      {
          if (sectionTagOpen)
          {
              writer.Write(END_VERSES);
              writer.Write(END_SECTION);
              sectionTagOpen = false;
          }

          writer.Write(SECTION + tempRef + "\">");
          writer.Write(VERSES);
          oldSectionRef = tempRef;
          sectionTagOpen = true;
      }
      writer.Write(VERSE + verseRef + "\">");
      writer.Write(TEXT);
      oldVerseRef = verseRef;
  }


  static void CloseBook()
  {
      if (sectionTagOpen)
      {
          writer.Write(END_VERSES);
          writer.Write(END_SECTION);
          oldSectionRef = "";
          sectionTagOpen = false;
      }

      writer.Write(END_SECTIONS);
      writer.Write(END_BOOK + bookCount + ">");
      indexer.Write(END_BOOK + ">");
  }

  static void CloseVerse()
  {
      writer.Write(END_TEXT);
      writer.Write(END_VERSE);
      oldVerseRef = "";
  }

  static void CloseOutputFile()
  {
      writer.Close();
  }
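The helper methods above slice the verse reference out of each line by fixed positions and write the verse text as is. A more defensive version might look like the sketch below, which refuses lines that are too short and escapes characters that are special in XML. The class VerseLine and the method TrySplit are hypothetical names, and the sketch assumes the "001:001 text" layout described in the header comments; SecurityElement.Escape is a standard .NET method used here so that no XML framework is needed.

```csharp
using System;
using System.Security;

// Hypothetical helper; not part of the original parser.
public static class VerseLine
{
    // Splits "001:001 In the beginning..." into section id, verse id and text.
    // Returns false instead of throwing when the line is too short or has no colon.
    public static bool TrySplit(string line, out string section, out string verse, out string text)
    {
        section = verse = text = null;
        if (line == null || line.Length < 7 || line[3] != ':')
            return false;

        section = line.Substring(0, 3);   // e.g. "001"
        verse = line.Substring(4, 3);     // e.g. "001"
        // The rest of the line is verse text; escape &, <, > and quotes for XML.
        text = SecurityElement.Escape(line.Substring(7).Trim());
        return true;
    }
}
```

For a plain text source that contains no markup characters the escaping changes nothing, but it keeps the output well formed if another document ever contains an ampersand or an angle bracket.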

Points of Interest

You can read my recent blog on Posterous http://post.ly/84uut regarding the e-book reader and follow me on twitter to keep track of releases and updates on this XML framework and other Android frameworks currently under development! In the next article I will discuss Android XML based objects and meta-data ingestion on this mobile platform. I hope to hear from you and let me know if this helps you in developing your own e-book reader on Android or other mobile platforms!

History

First revision released on 4th of July, 2012!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)