Introduction
At Trezorix, we often need to open and read large (100 MB+) XML files. Usually we open XML files in Notepad, Internet Explorer (IE), or some other text editor. With a large XML file, however, these programs can take hours to open it, if they do not crash along the way. Since we work with huge XML files and want to view their content reasonably quickly, we searched the web for existing software. We could not find a tool that covered our needs, so we decided to develop one ourselves.
Approach
The main goal of the tool is to read large XML files and present them on screen quickly. Most tools that read XML (except for Notepad) first read the entire file and then use an interpreter to put the XML document's structure together. We see that as their weakness: they must read the entire XML file before they can display anything. We wanted to run through the document and display data as quickly as possible, so we developed an on-the-fly interpreter. Its output may not be as polished as you're used to, but in my opinion the gain in performance far outweighs that.
Presentation
Although tools like IE are not really capable of opening large XML files, they do have one big advantage: presentation. Because the XML file is fully interpreted, opening and closing tags are matched, and IE lets you expand and collapse elements, which makes the XML data easier and more pleasant to read. A second advantage is highlighting, which lets the user quickly identify elements, attributes, and values. For performance reasons, we decided to drop the ability to expand and collapse elements. For highlighting the XML, we decided to use RTF.
The code
The code is simple and to the point. We developed two classes: one for reading and interpreting the XML, and one for searching through the XML data that has been read. A third class ties these two classes together. Both the reading and searching methods are implemented asynchronously. For reading the XML, a simple while loop does the trick.
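using (FileStream streamSource = new FileStream(m_sFilename,
    FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    using (XmlReader xmlReader = XmlReader.Create(streamSource))
    {
        // Buffer for the line of marked-up output currently being built.
        StringBuilder sbMarkedUp = new StringBuilder();
        xmlReader.MoveToElement();
        // Read() advances one node at a time, so output is available
        // before the whole file has been read.
        while (xmlReader.Read())
        {
            // Interpret the current node here, based on xmlReader.NodeType.
        }
    }
}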
The interpreter
The interpreter is very simple. It checks the NodeType and handles the XML accordingly. If an element was found, it will write the element tag to a StringBuilder object. Each line of XML will be written to a generic list of strings. The interpreter decides when to write a line of XML to the list. After the line is added to the generic list, the StringBuilder is cleared and the process repeats itself until the while loop is finished.
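To make the description above more concrete, here is a minimal sketch of such a NodeType-driven loop. It is not the tool's actual interpreter: the class name XmlLineInterpreter, the method Interpret, the two-space indentation per depth level, and the choice to emit one line per element, text, or end-element node are all assumptions made for illustration. It also leaves out the RTF markup and writes attribute and text values without re-escaping them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical sketch, not the article's actual reader class.
public static class XmlLineInterpreter
{
    // Walks the XML node by node and returns one display line per node.
    public static List<string> Interpret(TextReader source)
    {
        var lines = new List<string>();
        var sb = new StringBuilder();

        using (XmlReader reader = XmlReader.Create(source))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Capture these before moving onto the attributes.
                        bool isEmpty = reader.IsEmptyElement;
                        sb.Append(' ', reader.Depth * 2);
                        sb.Append('<').Append(reader.Name);
                        while (reader.MoveToNextAttribute())
                        {
                            sb.Append(' ').Append(reader.Name)
                              .Append("=\"").Append(reader.Value).Append('"');
                        }
                        sb.Append(isEmpty ? " />" : ">");
                        break;

                    case XmlNodeType.Text:
                        sb.Append(' ', reader.Depth * 2).Append(reader.Value);
                        break;

                    case XmlNodeType.EndElement:
                        sb.Append(' ', reader.Depth * 2);
                        sb.Append("</").Append(reader.Name).Append('>');
                        break;

                    default:
                        // This sketch skips whitespace, comments, and declarations.
                        continue;
                }

                // The line is complete: store it and reuse the builder.
                lines.Add(sb.ToString());
                sb.Clear();
            }
        }

        return lines;
    }
}
```

Calling XmlLineInterpreter.Interpret(new StringReader(xml)) returns the list of lines. Reading from a file works the same way with a StreamReader.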
Reading portions of XML
The reading class exposes a function called ReadFragment. This function accepts a parameter (Offset) that lets the user decide where reading starts. ReadFragment first adds a header line with RTF definitions, then adds the lines of XML from the generic string list. The VisibleLines property lets the user define the number of lines returned by the ReadFragment function.
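The paging idea behind ReadFragment can be sketched as follows. Only the names ReadFragment, Offset, and VisibleLines come from the description above. The class name FragmentReader, the contents of the RTF header, the default of 50 lines, and the decision to escape lines when the fragment is built are assumptions for the sake of a runnable example. Non-ASCII characters are not handled here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch, not the article's actual reader class.
public class FragmentReader
{
    // Assumed header: one font plus a small colour table for highlighting.
    private const string RtfHeader =
        @"{\rtf1\ansi\deff0{\fonttbl{\f0 Courier New;}}" +
        @"{\colortbl;\red0\green0\blue255;\red163\green21\blue21;}";

    private readonly List<string> _lines;

    public FragmentReader(List<string> lines)
    {
        _lines = lines;
    }

    // Number of lines returned per fragment.
    public int VisibleLines { get; set; } = 50;

    // Returns an RTF document holding up to VisibleLines lines from offset on.
    public string ReadFragment(int offset)
    {
        // Clamp the window so an out-of-range offset yields an empty fragment.
        int start = Math.Max(0, Math.Min(offset, _lines.Count));
        int end = Math.Min(_lines.Count, start + VisibleLines);

        var sb = new StringBuilder(RtfHeader);
        sb.Append('\n');
        for (int i = start; i < end; i++)
        {
            sb.Append(Escape(_lines[i])).Append(@"\par").Append('\n');
        }
        sb.Append('}');
        return sb.ToString();
    }

    // Escapes the three characters that have special meaning in RTF.
    private static string Escape(string text)
    {
        return text.Replace(@"\", @"\\").Replace("{", @"\{").Replace("}", @"\}");
    }
}
```

Because only a window of the list is converted on each call, the cost of producing a fragment depends on VisibleLines rather than on the size of the file.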
Events
The reader class exposes four events that can be used in the GUI: StartParsing, EndParsing, ErrorOccured, and ReadyForPresentation. StartParsing and EndParsing indicate that the process reading the XML file has started or ended. The ErrorOccured event is raised when reading a file fails for whatever reason. The ReadyForPresentation event is raised once a certain number of lines has been added to the generic list. Handling this event lets you display interpreted XML to the user immediately.
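As an illustration of how these four events could fit together, here is a small sketch. The event names are taken from the text above; everything else is an assumption: the class name LineCollector, the use of System.IO.ErrorEventArgs to carry the exception, raising ReadyForPresentation exactly once when a threshold is reached, and skipping EndParsing when an error occurs. To keep the example short, the XML reading is replaced by a plain sequence of lines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch, not the article's actual reader class.
public class LineCollector
{
    private readonly List<string> _lines = new List<string>();
    private readonly int _threshold;

    public event EventHandler StartParsing;
    public event EventHandler EndParsing;
    public event EventHandler<ErrorEventArgs> ErrorOccured;
    public event EventHandler ReadyForPresentation;

    // threshold: number of lines needed before the first screen can be shown.
    public LineCollector(int threshold)
    {
        _threshold = threshold;
    }

    public IReadOnlyList<string> Lines
    {
        get { return _lines; }
    }

    public void Parse(IEnumerable<string> source)
    {
        try
        {
            StartParsing?.Invoke(this, EventArgs.Empty);
            foreach (string line in source)
            {
                _lines.Add(line);
                // Assumption: signal once, as soon as enough lines exist.
                if (_lines.Count == _threshold)
                {
                    ReadyForPresentation?.Invoke(this, EventArgs.Empty);
                }
            }
            EndParsing?.Invoke(this, EventArgs.Empty);
        }
        catch (Exception ex)
        {
            ErrorOccured?.Invoke(this, new ErrorEventArgs(ex));
        }
    }
}
```

The handlers run on whichever thread calls Parse. When reading happens on a background thread, a Windows Forms GUI has to marshal the call back to the UI thread (for example with Control.BeginInvoke) before touching any controls.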
Searching
The search function finds phrases within the XML document. It loops through each line in the generic list of strings and looks for the given phrase in each line. When a match is found, a FoundItem event is raised, with the matching word and line number passed in the event arguments. The search class also maintains a list of found items, which likewise contains the matching words and line numbers. When the search process completes, a SearchComplete event is raised.
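for (int iCount = 0; iCount < iLines; iCount++)
{
    string stringToSearch = m_lstLinesToSearch[iCount];
    int foundIndex = stringToSearch.IndexOf(m_sSearchString,
        StringComparison.OrdinalIgnoreCase);
    if (foundIndex >= 0)
    {
        // Assumption: foundPhrase is the matched text as it appears in the line.
        string foundPhrase = stringToSearch.Substring(foundIndex, m_sSearchString.Length);
        // Line numbers are reported 1-based.
        AddSearchResult(foundPhrase, iCount + 1);
    }
}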
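The loop above covers the matching itself. How the FoundItem and SearchComplete events might wrap around it can be sketched like this. Apart from the two event names, everything here is assumed for illustration: the classes FoundItemEventArgs and LineSearcher, the Results property, and the guard against an empty search phrase. As in the loop above, only the first match on each line is reported.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event arguments: the matched text and its 1-based line number.
public class FoundItemEventArgs : EventArgs
{
    public FoundItemEventArgs(string phrase, int lineNumber)
    {
        Phrase = phrase;
        LineNumber = lineNumber;
    }

    public string Phrase { get; }
    public int LineNumber { get; }
}

// Hypothetical sketch, not the article's actual search class.
public class LineSearcher
{
    private readonly List<FoundItemEventArgs> _results = new List<FoundItemEventArgs>();

    public event EventHandler<FoundItemEventArgs> FoundItem;
    public event EventHandler SearchComplete;

    // All matches found by the most recent search.
    public IReadOnlyList<FoundItemEventArgs> Results
    {
        get { return _results; }
    }

    public void Search(IReadOnlyList<string> lines, string phrase)
    {
        _results.Clear();

        // An empty phrase would match every line, so it is treated as no search.
        if (!string.IsNullOrEmpty(phrase))
        {
            for (int i = 0; i < lines.Count; i++)
            {
                int index = lines[i].IndexOf(phrase, StringComparison.OrdinalIgnoreCase);
                if (index < 0)
                {
                    continue;
                }

                var hit = new FoundItemEventArgs(
                    lines[i].Substring(index, phrase.Length), i + 1);
                _results.Add(hit);
                FoundItem?.Invoke(this, hit);
            }
        }

        SearchComplete?.Invoke(this, EventArgs.Empty);
    }
}
```

Raising FoundItem per match lets a GUI fill its result list while the search is still running, instead of waiting for SearchComplete.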
Future plans
We plan to develop the software further so that it supports Find & Replace and lets users save the changes they make to XML files. We also plan to add the ability to collapse and expand elements.
Resources
The demo project uses the DockPanel suite (http://sourceforge.net/projects/dockpanelsuite/) to be able to dock windows.