Introduction
Word 2007 documents are Office Open XML documents: a combination of XML architecture and ZIP compression that stores XML and non-XML files together within a single ZIP archive. These documents usually have the DOCX extension, but there are exceptions such as macro-enabled documents and templates.
This article shows how you can read and view a DOCX file in WPF using only .NET Framework 3.0, without any third-party code.
A DOCX file is actually a zipped group of files and folders, called a package. A package consists of package parts (files that contain any type of data, such as text, images or binary content) and relationship files. Each package part has a unique URI name, and the relationship XML files contain these URIs.
When you open a DOCX file with a zipping application, you can see the document structure and its package parts.
The main content of a DOCX file is stored in the package part document.xml, which is often located in the word directory, but does not have to be. To find the URI (location) of document.xml, we read the relationships XML file inside the _rels directory and look for the relationship type http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument.
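To make the package, part and relationship terms concrete, here is a minimal sketch that builds a tiny package in memory, lists its parts and resolves the main document URI through the relationship type. The class name PackageInspector and its method names are hypothetical and not part of the article's code. System.IO.Packaging lives in WindowsBase.dll on .NET Framework; on newer .NET versions it is assumed to come from the System.IO.Packaging package.

```csharp
using System;
using System.IO;
using System.IO.Packaging;

// Hypothetical helper, used only for illustration.
static class PackageInspector
{
    const string MainDocumentRelationshipType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    // Builds a tiny in-memory package so the sketch runs without a real DOCX file.
    public static MemoryStream CreateSamplePackage()
    {
        var stream = new MemoryStream();
        using (var package = Package.Open(stream, FileMode.Create, FileAccess.ReadWrite))
        {
            var partUri = PackUriHelper.CreatePartUri(
                new Uri("word/document.xml", UriKind.Relative));
            var part = package.CreatePart(partUri,
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml");
            using (var writer = new StreamWriter(part.GetStream(FileMode.Create, FileAccess.Write)))
                writer.Write("<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"/>");

            // Package-level relationship that points at the main document part.
            package.CreateRelationship(partUri, TargetMode.Internal, MainDocumentRelationshipType);
        }
        stream.Position = 0;
        return stream;
    }

    // Prints every part and returns the URI of the main document part, or null if none is found.
    public static Uri FindMainDocumentUri(Stream stream)
    {
        using (var package = Package.Open(stream, FileMode.Open, FileAccess.Read))
        {
            foreach (var part in package.GetParts())
                Console.WriteLine("part: {0} ({1})", part.Uri, part.ContentType);

            foreach (var relationship in package.GetRelationshipsByType(MainDocumentRelationshipType))
                return PackUriHelper.ResolvePartUri(relationship.SourceUri, relationship.TargetUri);
        }
        return null;
    }
}
```

Calling PackageInspector.FindMainDocumentUri(PackageInspector.CreateSamplePackage()) resolves the relationship target against its source, which is one way to handle targets stored as relative URIs.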
The document.xml file contains XML elements defined primarily in the WordprocessingML XML namespace of the Office Open XML specification. The basic structure of document.xml consists of a document (<document>) element which contains a body (<body>) element. The body element contains block-level elements such as paragraph (<p>) elements. A paragraph contains inline-level elements such as run (<r>) elements. A run contains the document's text content elements such as text (<t>), break (<br>) and tab (<tab>) elements.
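As an illustration of this structure, the sketch below embeds a hand-written document.xml fragment (not taken from a real file) and pulls the text out of it with XmlReader. The class name DocumentXmlSample is hypothetical, and mapping <tab> to a tab character and <br> to a newline is a simplifying assumption.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, used only for illustration.
static class DocumentXmlSample
{
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written sample: document > body > paragraphs > runs > text, tab and break.
    public const string Xml =
        "<w:document xmlns:w=\"" + W + "\"><w:body>" +
        "<w:p><w:r><w:t>Hello</w:t><w:tab/><w:t>World</w:t></w:r></w:p>" +
        "<w:p><w:r><w:t>Second</w:t><w:br/><w:t>line</w:t></w:r></w:p>" +
        "</w:body></w:document>";

    // Returns the plain text, ending each non-empty paragraph element with a newline.
    public static string ExtractText(string xml)
    {
        var text = new StringBuilder();
        bool insideText = false;
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.NamespaceURI != W) break;
                        if (reader.LocalName == "t") insideText = !reader.IsEmptyElement;
                        else if (reader.LocalName == "tab") text.Append('\t');
                        else if (reader.LocalName == "br") text.Append('\n');
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        // Only character data inside <w:t> is document text.
                        if (insideText) text.Append(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        if (reader.NamespaceURI != W) break;
                        if (reader.LocalName == "t") insideText = false;
                        else if (reader.LocalName == "p") text.Append('\n');
                        break;
                }
            }
        }
        return text.ToString();
    }
}
```

For example, DocumentXmlSample.ExtractText(DocumentXmlSample.Xml) walks both sample paragraphs and returns their text with the tab and line break in place.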
Implementation
In short, to retrieve and display the text content of a DOCX file, the application uses two classes: DocxReader and its subclass DocxToFlowDocumentConverter.
DocxReader unzips the file with the help of the System.IO.Packaging namespace, finds the document.xml file through the relationship and reads it with XmlReader.
DocxToFlowDocumentConverter converts the XML elements from XmlReader into the corresponding WPF FlowDocument elements.
DocxReader
The DocxReader constructor first opens (unzips) the package from the DOCX file stream and retrieves the mainDocumentPart (document.xml) with the help of its PackageRelationship.
protected const string MainDocumentRelationshipType =
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
private readonly Package package;
private readonly PackagePart mainDocumentPart;
public DocxReader(Stream stream)
{
    if (stream == null)
        throw new ArgumentNullException("stream");
    // Open the ZIP package for reading.
    this.package = Package.Open(stream, FileMode.Open, FileAccess.Read);
    // Take the first relationship of the main document type; its target is document.xml.
    foreach (var relationship in
        this.package.GetRelationshipsByType(MainDocumentRelationshipType))
    {
        this.mainDocumentPart =
            package.GetPart(PackUriHelper.CreatePartUri(relationship.TargetUri));
        break;
    }
}
After retrieving the document.xml PackagePart, we can read it with .NET's XmlReader class, a fast forward-only XML reader that visits nodes in the same order as a depth-first traversal of a tree data structure.
The first path, 1 to 4, shows the simplest route to retrieving text from a paragraph element. The second path, 5 - …, shows more complex paragraph content. In this path, we also read paragraph properties (<pPr>) and run properties (<rPr>), which contain various formatting options.
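The depth-first visiting order can be seen with a small trace. The sketch below is an illustration only: the class name TraversalTrace is hypothetical, and the sample XML omits namespaces for brevity. It records each element name, indented by the reader's Depth, in the order XmlReader reaches it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, used only for illustration.
static class TraversalTrace
{
    // Simplified sample: a plain paragraph followed by one with pPr and rPr.
    public const string Xml =
        "<document><body>" +
        "<p><r><t>one</t></r></p>" +
        "<p><pPr/><r><rPr/><t>two</t></r></p>" +
        "</body></document>";

    // Returns the element names in visiting order, indented two spaces per depth level.
    public static List<string> Trace(string xml)
    {
        var visited = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
                if (reader.NodeType == XmlNodeType.Element)
                    visited.Add(new string(' ', reader.Depth * 2) + reader.LocalName);
        }
        return visited;
    }
}
```

Printing the result of TraversalTrace.Trace(TraversalTrace.Xml) line by line shows each parent listed before its children and the first paragraph fully visited before the second one starts.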
We create a series of reading methods for every element we wish to support in this path trajectory.
protected virtual void ReadDocument(XmlReader reader)
{
    while (reader.Read())
        if (reader.NodeType == XmlNodeType.Element && reader.NamespaceURI ==
            WordprocessingMLNamespace && reader.LocalName == BodyElement)
        {
            ReadXmlSubtree(reader, this.ReadBody);
            break;
        }
}
private void ReadBody(XmlReader reader) {...}
private void ReadBlockLevelElement(XmlReader reader) {...}
protected virtual void ReadParagraph(XmlReader reader) {...}
private void ReadInlineLevelElement(XmlReader reader) {...}
protected virtual void ReadRun(XmlReader reader) {...}
private void ReadRunContentElement(XmlReader reader) {...}
protected virtual void ReadText(XmlReader reader) {...}
A few things you will notice in the DocxReader reading methods:
- We use XmlNameTable to store XML namespace, element and attribute names. This gives us cleaner code and also better performance, because we can compare these strings by object reference rather than by the more expensive string value. This works because XmlReader uses atomized strings from XmlNameTable for its LocalName and NamespaceURI properties, and because .NET string equality checks reference equality first and only then falls back to value equality.
- We use the XmlReader.ReadSubtree method when passing the XmlReader into a specific DocxReader reading method, to create a boundary around that XML element. The reading method then has access only to that specific XML element, rather than to the entire document.xml. This has some performance penalty, which we traded for safer and more intuitive code.
private static void ReadXmlSubtree(XmlReader reader, Action<XmlReader> action)
{
    using (var subtreeReader = reader.ReadSubtree())
    {
        subtreeReader.Read();
        if (action != null)
            action(subtreeReader);
    }
}
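The atomized-string comparison from the first point above can be sketched as follows. This is an illustration under assumptions, not the article's actual code: the class name AtomizedNames and the method CountParagraphs are hypothetical. It pre-registers the names in a NameTable, hands that table to the reader through XmlReaderSettings, and then compares by reference.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, used only for illustration.
static class AtomizedNames
{
    // Counts <w:p> elements using reference comparisons on atomized names.
    public static int CountParagraphs(string xml)
    {
        var nameTable = new NameTable();
        // Add returns the atomized instance; keep it for later comparisons.
        string wNamespace = nameTable.Add(
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main");
        string paragraph = nameTable.Add("p");

        // The reader must share the same name table for references to match.
        var settings = new XmlReaderSettings { NameTable = nameTable };
        int count = 0;
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
                if (reader.NodeType == XmlNodeType.Element
                    && object.ReferenceEquals(reader.LocalName, paragraph)
                    && object.ReferenceEquals(reader.NamespaceURI, wNamespace))
                    count++;
        }
        return count;
    }
}
```

The reference check only succeeds because the reader and the lookup strings come from the same name table; with a separate table, a value comparison would be needed instead.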
DocxToFlowDocumentConverter
This class inherits from DocxReader and overrides some of its reading methods to create the corresponding WPF FlowDocument elements.
For example, while reading the document element we create a new FlowDocument, while reading a paragraph element we create a new Paragraph element, and while reading a run element we create a new Span element.
protected override void ReadDocument(XmlReader reader)
{
    this.document = new FlowDocument();
    this.document.BeginInit();
    base.ReadDocument(reader);
    this.document.EndInit();
}
protected override void ReadParagraph(XmlReader reader)
{
    using (this.SetCurrent(new Paragraph()))
        base.ReadParagraph(reader);
}
protected override void ReadRun(XmlReader reader)
{
    using (this.SetCurrent(new Span()))
        base.ReadRun(reader);
}
This class also sets some Paragraph and Span properties, which are read from the paragraph property element <pPr> and the run property element <rPr>. By the time XmlReader reaches these property elements, we have already created the new Paragraph or Span element, so all that remains is to set its properties.
Because we move from the parent element (Paragraph) to child elements (Spans) and back to the parent, we have to track our current element in the FlowDocument with a variable of type TextElement (an abstract base class for Paragraph and Span).
This is accomplished with the help of CurrentHandle and the C# using statement, which is syntactic sugar for a try-finally construct. The SetCurrent method sets the current TextElement, and the Dispose method restores the previous TextElement as the current one.
private struct CurrentHandle : IDisposable
{
    private readonly DocxToFlowDocumentConverter converter;
    private readonly TextElement previous;
    public CurrentHandle(DocxToFlowDocumentConverter converter, TextElement current)
    {
        this.converter = converter;
        this.converter.AddChild(current);
        this.previous = this.converter.current;
        this.converter.current = current;
    }
    public void Dispose()
    {
        this.converter.current = this.previous;
    }
}
private IDisposable SetCurrent(TextElement current)
{
    return new CurrentHandle(this, current);
}
Using the Code
To get a FlowDocument, all we need to do is create a new DocxToFlowDocumentConverter instance from a DOCX file stream and call the Read method on that instance.
After that, we can display the flow document content in a WPF application using the FlowDocumentReader control.
using (var stream = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    var flowDocumentConverter = new DocxToFlowDocumentConverter(stream);
    flowDocumentConverter.Read();
    this.flowDocumentReader.Document = flowDocumentConverter.Document;
    this.Title = Path.GetFileName(path);
}
Conclusion
DOCX Reader is not a complete solution and is intended for simple scenarios (without tables, lists, pictures, headers/footers, styles, etc.). The application can be enhanced to read more DOCX features, but full DOCX support with all advanced features would require a lot more time and knowledge of the DOCX file format. Hopefully, this article and the accompanying application have given you some insight into the DOCX file format and can serve as a basis for more complex DOCX-related applications.