Introduction
There are many methods available on the web for ingesting large files and translating the resulting data sets into XML that maps directly onto the Android Object Models (AOM) used on mobile devices, with no further post-processing required. The resulting data sets can be so large and cumbersome that refining the generated XML afterwards by other means would be futile and time consuming. This implies that the parser must be smart enough to create XML that fits the form of the underlying Android objects the first time around. It also implies that when the objects that depend on the ingested XML change, the XML structure may have to change with them. With this small set of objectives in mind, I will show you in this tutorial how to implement a simple yet efficient parsing strategy that achieves them relatively quickly, one that does not rely on frameworks such as XML DOM, SAX or other known parsers. I chose not to use these frameworks mainly for the sake of simplicity, availability and portability. I just wanted to create a simple model, one which can be easily transferred to other languages and platforms. Open source is prevalent throughout the industry due to cost savings and wide availability, and I wanted a strategy that others and I can easily modify without much dependency upon external libraries. With that said, let's dive into the nuts and bolts of this technique!
Background
Figure 1. Android e-book reader on my Android phone featuring the XML parser discussed in this article
I have developed a simple e-book publishing framework on the Android Market. My objective was to create a framework that was lightweight, could handle large documents and could load large XML files relatively quickly and effortlessly. I assumed that the documents themselves would have to change over time. But why? Quite simple, actually: our mindset changes as our end users' expectations change, as technologies advance and as the feature set of our applications expands. This implies that the XML model would also have to change over time. To test these assumptions I set out to experiment with my first large e-book on Android. I wanted something that was copyright free and that generated very large XML data for my Android classes to ingest. I did some googling and found a copyright-free KJV text. This was just what I needed to build a parsing framework for text metadata that could also work for other documents to be converted to e-books in the future.
Figure 2. Android e-book reader on my Android NOOK tablet
In a nutshell, this is what I started with: a plain text file, from which the following XML structure was derived.
For the INDEX.XML file the parser creates the following element structure:
1. An XML root element called <KJV>
2. An XML element called <BOOK>
3. An XML subelement called <BOOKID>
4. An XML subelement called <BOOKNAME>
Figure 3. Loading the copyright free KJV document in notepad.exe
From those initial XML elements I decided to split every BOOK section into its own file, named after its BOOKID, so that each BOOKID in the INDEX file links to the corresponding book file. That was a good decision, as object loading in Android was much faster when loading smaller individual sections rather than one unwieldy text blob! This is the main point of this exercise and my TIP of the day: when creating large object models from XML metadata, break large XML sections down into smaller ones, or the end user will be left waiting for your parsed objects to load, not to mention the memory issues and timeouts you might face in your parser objects. We all learn from our mistakes and end up refining the cumbersome processes that hinder progress. If you ever experience loading times that are unacceptable to the end user, go back and refine your XML model: break it down into fast-loading chunks and proceed from there. This increases the class granularity of your final object model, but that is a small price to pay for the speed you gain.
Here is the resulting INDEX.XML from the C# .NET parser:
<?xml version="1.0" encoding="utf-8"?>
<KJV>
    <BOOK>
        <BOOKID>BOOK1</BOOKID>
        <BOOKNAME>Genesis</BOOKNAME>
    </BOOK>
    <BOOK>
        <BOOKID>BOOK2</BOOKID>
        <BOOKNAME>Exodus</BOOKNAME>
    </BOOK>
    <BOOK>
        <BOOKID>BOOK3</BOOKID>
        <BOOKNAME>Leviticus</BOOKNAME>
    </BOOK>
    ...
</KJV>
For the BOOK#.XML files the following structure was chosen:
1. An XML root element called <BOOK#>
2. An XML subelement called <BOOKNAME>
3. An XML subelement called <SECTIONS>
4. An XML subelement called <SECTION>
5. An XML subelement called <VERSES>
6. An XML subelement called <VERSE>
7. An XML subelement called <TEXT>
Here is the BOOK1.XML which is derived from the <BOOKID> element referenced in INDEX.XML:
<?xml version="1.0" encoding="utf-8"?>
<BOOK1>
    <BOOKNAME>Genesis</BOOKNAME>
    <SECTIONS>
        <SECTION id="001">
            <VERSES>
                <VERSE id="001">
                    <TEXT>In the beginning God created the heaven and the earth.</TEXT>
                </VERSE>
                <VERSE id="002">
                    <TEXT>And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.</TEXT>
                </VERSE>
                ...
            </VERSES>
        </SECTION>
    </SECTIONS>
</BOOK1>
So what is happening here? To recap, we are creating a hierarchy for the Android object model. Imagine a BOOK object: as we flip through its virtual pages, we find a SECTIONS collection; each SECTION object in it holds a VERSES collection of VERSE objects, each of which carries a TEXT field. In terms of classes, each XML element maps to a class whose member variables hold its child elements and attributes. There is an easier way to achieve this hierarchy abstraction in a relatively painless fashion, so refer to the section further down in this article where I mention the SIMPLE framework.
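The hierarchy described above can be pictured as a handful of plain classes. The following is only a minimal C# sketch under that assumption: the class and member names (Book, Section, Verse, ModelDemo) are hypothetical, and they are not the classes used in the Android application, which this article does not show.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical classes mirroring the BOOK#.XML layout; all names are illustrative.
class Verse
{
    public string Id;    // VERSE id attribute, e.g. "001"
    public string Text;  // contents of the TEXT element
}

class Section
{
    public string Id;                               // SECTION id attribute
    public List<Verse> Verses = new List<Verse>();  // the VERSES collection
}

class Book
{
    public string Name;                                   // BOOKNAME element
    public List<Section> Sections = new List<Section>();  // the SECTIONS collection
}

static class ModelDemo
{
    // Builds the first verse of the BOOK1.XML sample by hand.
    public static Book BuildSample()
    {
        var book = new Book { Name = "Genesis" };
        var section = new Section { Id = "001" };
        section.Verses.Add(new Verse
        {
            Id = "001",
            Text = "In the beginning God created the heaven and the earth."
        });
        book.Sections.Add(section);
        return book;
    }

    static void Main()
    {
        Book book = BuildSample();
        Section first = book.Sections[0];
        // prints "Genesis 001:001"
        Console.WriteLine(book.Name + " " + first.Id + ":" + first.Verses[0].Id);
    }
}
```

Keeping one class per XML element means a change to the XML structure touches only the matching class, which fits the expectation that the model will evolve over time.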
Each layer of the BOOK object is thus represented as a SECTIONS object broken down into SECTION, VERSES, VERSE and TEXT elements that follow each other in order and fill in the entire contents of the BOOK object. It also helps to think of the possibilities this offers: users can browse the hierarchy of the e-book using the SECTION id and VERSE id attributes. This is an important level of abstraction, as it makes it convenient for the user to navigate the contents of the BOOK object model. The screenshot below shows the object model derived from the XML metadata in the KJV e-book reader application running on my NOOK tablet.
Figure 4. Android e-book reader on my Android tablet showing the Verse Reader
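Navigation by SECTION id and VERSE id can be tried out on the desktop before any Android code exists. The sketch below is an assumption-laden illustration, not part of the parser in this article: it uses LINQ to XML (System.Xml.Linq) from .NET purely as a convenient way to query a generated book file, and the VerseLookup class and FindVerse method are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class VerseLookup
{
    // Returns the TEXT of the verse with the given ids, or null when it is absent.
    public static string FindVerse(string bookXml, string sectionId, string verseId)
    {
        XDocument doc = XDocument.Parse(bookXml);
        return doc.Root
            .Descendants("SECTION")
            .Where(s => (string)s.Attribute("id") == sectionId)
            .Descendants("VERSE")
            .Where(v => (string)v.Attribute("id") == verseId)
            .Select(v => (string)v.Element("TEXT"))
            .FirstOrDefault();
    }

    // A cut-down copy of the BOOK1.XML sample shown earlier.
    public const string Sample =
        @"<BOOK1><BOOKNAME>Genesis</BOOKNAME><SECTIONS><SECTION id=""001""><VERSES>" +
        @"<VERSE id=""001""><TEXT>In the beginning God created the heaven and the earth.</TEXT></VERSE>" +
        @"</VERSES></SECTION></SECTIONS></BOOK1>";

    static void Main()
    {
        // prints "In the beginning God created the heaven and the earth."
        Console.WriteLine(FindVerse(Sample, "001", "001"));
    }
}
```

A lookup that finds nothing returns null, so a reader application built on the same idea can tell a missing verse from an empty one.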
So we have covered the resulting XML object model in detail, but not enough has been said about the platform used to ingest all this data into a final Android Object Model (AOM). Below is a breakdown of the pluses and minuses I encountered while migrating XML metadata to such a model, along with an attractive option that helps the developer overcome the shortcomings of the alternatives. Yes, it can be painful to learn the hard way, and that is exactly what I went through...
Some of the strategies for XML object parsing on Android evolved from the DOM and SAX parsing frameworks in Java. Personally I have found these frameworks limiting because they did not allow reading, writing and saving a modified XML structure uniformly from within the same parser object model. I sometimes found myself reaching for additional libraries or creating my own from scratch, mixing and matching the various technologies.
All this changed when I ran across a lightweight framework called SIMPLE. I will not go into an in-depth discussion of SIMPLE, because the tutorials on its website cover in great detail anything I might add. However, I highly recommend that you check out this small framework and give it a spin, as it allows a fast, low-overhead transition without a great cost in application size for your mobile application. Check out SIMPLE for Android at:
http://simple.sourceforge.net/articles.php
Before I get too carried away about using SIMPLE to map XML data to your AOM, I must get back to the purpose of this article, which is the first step towards a successful transition to XML ingestion frameworks such as SIMPLE. Let's look at some XML pre-processing C# code and elaborate further on how to do this!
Using the code
Initially, I decided to write a lightweight pre-parser, so I rolled my own in C# .NET. I have Visual Studio 10 installed and decided, why not! It's simple enough to do, so let's start with the class diagram, member variables, methods and so on.
Figure 5. Fields of kjvParser
We start out with the using section and, as promised, nothing special jumps out at us: just standard .NET namespaces, plus System.IO for file manipulation.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
In the following section we declare the class that performs the parsing. Since XML is tag based, I have described the structure of the documents as const string tags. The most notable part of this code is the enum ELEMENT, which the state machine uses to track where it is within the document hierarchy.
Figure 6. The ELEMENT Enum of the state machine
It basically lets the parser know where it is within the document. The values that drive the output are IS_BOOK, IS_VERSE and IS_SOMETHING_ELSE; the last is returned when a line is identified as neither a book marker nor a verse marker, in which case the line reader is assumed to be inside a verse. IS_IGNORED is returned for empty or single-character lines, which the main loop simply skips. Note also that the IS_EOF state is not used in this version of the code, because EOF detection is not performed in the state machine logic. It could be removed completely or put to other uses not shown in this implementation.
namespace kjv_parser
{
    class kjv_parser
    {
        // XML declaration and tag fragments, written verbatim to the output files.
        const string ENCODING = @"<?xml version=""1.0"" encoding=""utf-8""?>";
        const string ROOT = @"<KJV>";
        const string END_ROOT = @"</KJV>";
        const string BOOKNAME = @"<BOOKNAME>";
        const string SECTIONS = @"<SECTIONS>";
        const string END_SECTIONS = @"</SECTIONS>";
        const string SECTION = @"<SECTION id =""";
        const string END_SECTION = @"</SECTION>";
        const string VERSES = @"<VERSES>";
        const string END_VERSES = @"</VERSES>";
        const string BOOK = @"<BOOK";
        const string END_BOOK = @"</BOOK";
        const string END_BOOKNAME = @"</BOOKNAME>";
        const string VERSE = @"<VERSE id =""";
        const string END_VERSE = @"</VERSE>";
        const string TEXT = @"<TEXT>";
        const string END_TEXT = @"</TEXT>";
        const string BOOKID = @"<BOOKID>";
        const string BOOKID_END = @"</BOOKID>";

        // Classification of an input line, as returned by IsChapterOrVerse.
        enum ELEMENT {IS_EOF = -1, IS_SOMETHING_ELSE, IS_BOOK, IS_VERSE, IS_IGNORED};

        static StreamWriter writer;   // the BOOK#.XML file currently being written
        static StreamWriter indexer;  // the INDEX.XML file
        static int bookCount = 0;
        static int verseCount = 0;
        static int lineCount = 0;
        static bool verseTagOpen = false;
        static bool sectionTagOpen = false;
        static string oldVerseRef;
        static string oldSectionRef = "";
        static bool bookStart = false;
Since this is a simple console app with no need for a user interface, I read the text file one line at a time and write each line to the console, just to show the progress as it flies by. It could be worthwhile to implement something better here and not show the text at all; in that case, you can comment out the Console.WriteLine. The state machine is shown next. The function IsChapterOrVerse(input) takes each line read from the StreamReader object and first looks for clues as to whether the line is a BOOK marker. If it is, we know for certain that we have landed at the start of a BOOK section and can write this information out as a <BOOK> XML tag. Next, the parser runs a numeric check on the first character of the line; if it finds a digit, we know this is the start of a <VERSE> tag and its id attribute. Finally, the parser gives up and says: it's neither a BOOK marker nor a VERSE marker, so we are probably on the continuation of a verse, and it appends this whitespace-stripped line to the open <VERSE> tag.
        static void Main(string[] args)
        {
            int i = 0;
            using (StreamReader sr = File.OpenText(@"C:\Bible\kjv12\KJV12.TXT"))
            {
                string input = null;
                OpenIndexFile(@"C:\Bible\kjv12\index.xml");
                while ((input = sr.ReadLine()) != null)
                {
                    lineCount++;
                    Console.WriteLine(input);   // progress echo; comment out for a quiet run
                    switch (IsChapterOrVerse(input))
                    {
                        case ELEMENT.IS_BOOK:
                            // A new BOOK marker ends the previous book, if one is open.
                            if (bookStart)
                            {
                                if (verseTagOpen)
                                {
                                    verseTagOpen = false;
                                    CloseVerse();
                                }
                                CloseBook();
                                CloseOutputFile();
                            }
                            bookStart = true;
                            OpenOutputFile(@"C:\Bible\kjv12\book" + ++i + ".xml");
                            bookCount++;
                            WriteBook(input);
                            break;
                        case ELEMENT.IS_VERSE:
                            // Finish the previous verse before starting the next one.
                            if (verseCount > 0 && verseTagOpen)
                            {
                                CloseVerse();
                                verseTagOpen = false;
                            }
                            WriteVerseTag(input);
                            WriteVerse(input.Substring(8, input.Length - 8));
                            verseTagOpen = true;
                            verseCount++;
                            break;
                        case ELEMENT.IS_SOMETHING_ELSE:
                            // Continuation of the current verse.
                            WriteVerse(input);
                            break;
                        case ELEMENT.IS_IGNORED:
                        default:
                            break;
                    }
                }
                // Close whatever is still open once the input is exhausted.
                // Guarding on bookStart avoids touching a writer that was never opened.
                if (bookStart)
                {
                    if (verseTagOpen)
                    {
                        verseTagOpen = false;
                        CloseVerse();
                    }
                    CloseBook();
                    CloseOutputFile();
                }
                CloseIndexFile();
            }
        }
Here is the heart of the parser, the main decision maker that indicates which kind of line we have landed on. Parsers for more complicated text documents would have more evolved logic here; more sophisticated state machines are built in the same fashion, looking for textual clues to make a decision and advance to the next state.
        static ELEMENT IsChapterOrVerse(string inputLine)
        {
            ELEMENT el = ELEMENT.IS_IGNORED;
            if (inputLine.Length > 0)
            {
                // StartsWith is safe for lines shorter than four characters,
                // where Substring(0, 4) would throw.
                if (inputLine.StartsWith("BOOK", StringComparison.OrdinalIgnoreCase))
                    el = ELEMENT.IS_BOOK;
                else
                {
                    int number1;
                    if (inputLine.Length <= 1)
                        el = ELEMENT.IS_IGNORED;
                    else if (int.TryParse(inputLine.Substring(0, 1), out number1))
                    {
                        // A leading digit marks the start of a verse reference.
                        el = ELEMENT.IS_VERSE;
                    }
                    else
                        el = ELEMENT.IS_SOMETHING_ELSE;
                }
            }
            return el;
        }
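To see the classification rules in isolation, here is a small standalone sketch that can be run without any input file. It is a simplified restatement, not the class above: LineClassifierDemo, Element and Classify are hypothetical names, and the sample lines are made up to match the layout the parser appears to expect (a marker line starting with "Book", and verse lines starting with a digit). The real KJV12.TXT layout may differ.

```csharp
using System;

static class LineClassifierDemo
{
    public enum Element { Ignored, Book, Verse, SomethingElse }

    // Same decision order as IsChapterOrVerse: book marker, too short, digit, anything else.
    public static Element Classify(string line)
    {
        if (line.Length == 0) return Element.Ignored;
        if (line.StartsWith("BOOK", StringComparison.OrdinalIgnoreCase)) return Element.Book;
        if (line.Length <= 1) return Element.Ignored;
        if (line[0] >= '0' && line[0] <= '9') return Element.Verse;
        return Element.SomethingElse;
    }

    static void Main()
    {
        // Assumed sample lines, for illustration only.
        string[] samples =
        {
            "Book 01 Genesis",
            "001:001 In the beginning God created the heaven and the earth.",
            "        and void; and darkness was upon the face of the deep.",
            ""
        };
        foreach (string line in samples)
            Console.WriteLine(Classify(line) + " <- \"" + line + "\"");
    }
}
```

Because the function looks only at the current line, the state that ties lines together (which book, section and verse are open) lives in the static fields of the parser class rather than in the classifier.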
The worker bees of the parser class are the static methods shown below.
These methods are quite self-explanatory, so I'll leave them to the reader to decipher as a mental exercise. Suffice it to say that the code carries out the task of opening and closing the XML tags and saves the INDEX.XML and BOOK#.XML files to disk in a particular directory. This code is not perfect by any means, but it goes to show what can be done quite quickly and with relative flexibility in mind. An alternative would be to use the XML DOM framework to achieve all this, but I find it a kludge and unnecessary for such a simple task.
        // Starts a new BOOK#.XML file with the XML declaration.
        static void OpenOutputFile(string filePath)
        {
            writer = File.CreateText(filePath);
            writer.Write(ENCODING);
        }

        // Starts INDEX.XML with the XML declaration and the <KJV> root.
        static void OpenIndexFile(string filePath)
        {
            indexer = File.CreateText(filePath);
            indexer.Write(ENCODING);
            indexer.Write(ROOT);
        }

        static void CloseIndexFile()
        {
            indexer.Write(END_ROOT);
            indexer.Close();
        }
        // Writes the opening tags of a book file and the matching INDEX.XML entry.
        static void WriteBook(string chapterLine)
        {
            writer.Write(BOOK + bookCount + ">");
            indexer.Write(BOOK + ">");
            indexer.Write(BOOKID);
            indexer.Write("BOOK" + bookCount);
            indexer.Write(BOOKID_END);
            writer.Write(BOOKNAME);
            indexer.Write(BOOKNAME);
            // Offset of the book name within the marker line;
            // this evaluates to 8 for book numbers below 100.
            int modifier = 0;
            if (bookCount > 9) modifier = 1;
            int start = bookCount.ToString().Length + 7 - modifier;
            string tempLine = chapterLine.Substring(start, chapterLine.Length - start);
            writer.Write(tempLine);
            indexer.Write(tempLine);
            writer.Write(END_BOOKNAME);
            indexer.Write(END_BOOKNAME);
            writer.Write(SECTIONS);
        }
        // Appends verse text; continuation lines are joined with a single space.
        static void WriteVerse(string verseLine)
        {
            if (verseTagOpen)
                verseLine = " " + verseLine.TrimStart();
            writer.Write(verseLine);
        }

        // Opens a <VERSE> tag, and a new <SECTION> when the section number changes.
        static void WriteVerseTag(string verseRef)
        {
            // Characters 0-2 hold the section number, characters 4-6 the verse number.
            String tempRef = verseRef.Substring(0, 3);
            verseRef = verseRef.Substring(4,3);
            if ((oldSectionRef == "") || (oldSectionRef != tempRef))
            {
                if (sectionTagOpen)
                {
                    writer.Write(END_VERSES);
                    writer.Write(END_SECTION);
                    sectionTagOpen = false;
                }
                writer.Write(SECTION + tempRef + "\">");
                writer.Write(VERSES);
                oldSectionRef = tempRef;
                sectionTagOpen = true;
            }
            writer.Write(VERSE + verseRef + "\">");
            writer.Write(TEXT);
            oldVerseRef = verseRef;
        }
        // Closes any open section, then the book in both the book file and the index.
        static void CloseBook()
        {
            if (sectionTagOpen)
            {
                writer.Write(END_VERSES);
                writer.Write(END_SECTION);
                oldSectionRef = "";
                sectionTagOpen = false;
            }
            writer.Write(END_SECTIONS);
            writer.Write(END_BOOK + bookCount + ">");
            indexer.Write(END_BOOK + ">");
        }

        static void CloseVerse()
        {
            writer.Write(END_TEXT);
            writer.Write(END_VERSE);
            oldVerseRef = "";
        }

        static void CloseOutputFile()
        {
            writer.Close();
        }
    }
}
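One thing to keep in mind when writing XML by hand like this: the verse text is written out verbatim, so a source line containing &, < or > would produce a file that is not well-formed. If your own documents may contain such characters, the text can be escaped before it is written. The sketch below is one possible approach, assuming the .NET SecurityElement.Escape helper is acceptable for the job; XmlTextDemo and EscapeForXml are hypothetical names, and the sample string is made up.

```csharp
using System;
using System.Security;

static class XmlTextDemo
{
    // Replaces &, <, >, " and ' with their XML entities.
    public static string EscapeForXml(string raw)
    {
        return SecurityElement.Escape(raw);
    }

    static void Main()
    {
        // Made-up sample text containing reserved characters.
        string raw = "Bread & wine <offered>";
        // prints "<TEXT>Bread &amp; wine &lt;offered&gt;</TEXT>"
        Console.WriteLine("<TEXT>" + EscapeForXml(raw) + "</TEXT>");
    }
}
```

In the parser above, the natural place for such a call would be inside WriteVerse and WriteBook, just before the text reaches the StreamWriter.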
Points of Interest
You can read my recent blog post on Posterous, http://post.ly/84uut, regarding the e-book reader, and follow me on Twitter to keep track of releases and updates on this XML framework and other Android frameworks currently under development. In the next article I will discuss Android XML-based objects and metadata ingestion on this mobile platform. I hope to hear from you; let me know if this helps you in developing your own e-book reader on Android or other mobile platforms!
History
First revision released on 4th of July, 2012!