Show Word File in WPF

JosipK

10 Sep 2013

A small WPF application that loads a DOCX file, reads it, and displays its content in WPF

Download source code - 213.2 KB

Introduction
DOCX Overview
Implementation
- DocxReader
- DocxToFlowDocumentConverter
Using the Code
Conclusion

Introduction

Word 2007 documents are Office Open XML documents: a combination of XML architecture and ZIP compression that stores XML and non-XML files together in a single ZIP archive. These documents usually have the DOCX extension, but there are exceptions such as macro-enabled documents and templates.

This article shows how you can read and view a DOCX file in WPF using only .NET Framework 3.0, without any third-party code.

DOCX Overview

A DOCX file is actually a zipped group of files and folders, called a package. A package consists of package parts (files that contain any type of data, such as text, images, or binary data) and relationship files. Each package part has a unique URI name, and the relationship XML files refer to parts by these URIs.

When you open a DOCX file with a zipping application, you can see the document structure and its package parts.

The main content of a DOCX file is stored in the package part document.xml, which is often located in the word directory, but does not have to be. To find the URI (location) of document.xml, we read the relationships XML file inside the _rels directory and look for a relationship of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument.
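The lookup described above can be tried without a real Word file. The following is a minimal sketch, not the article's code: the class name PackageSketch, its method names, and the part path /custom/main.xml are all made up for illustration. It builds a tiny package in memory, deliberately places the main part outside the word directory, and then finds it again through the relationship type alone.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;

public static class PackageSketch
{
    public const string OfficeDocumentRelationshipType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    // Builds a tiny in-memory package that mimics the layout of a DOCX file.
    // The part path "/custom/main.xml" is an assumption chosen to show that the
    // main document part does not have to be "/word/document.xml".
    public static MemoryStream CreateSamplePackage()
    {
        var stream = new MemoryStream();
        using (var package = Package.Open(stream, FileMode.Create, FileAccess.ReadWrite))
        {
            var partUri = PackUriHelper.CreatePartUri(new Uri("custom/main.xml", UriKind.Relative));
            var part = package.CreatePart(partUri,
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml");
            using (var writer = new StreamWriter(part.GetStream(FileMode.Create, FileAccess.Write)))
                writer.Write("<document/>");

            // Package-level relationship that marks the part as the main document.
            package.CreateRelationship(partUri, TargetMode.Internal, OfficeDocumentRelationshipType);
        }
        stream.Position = 0;
        return stream;
    }

    // Returns the URI of the main document part, or null if no such relationship exists.
    public static Uri FindMainDocumentUri(Stream stream)
    {
        using (var package = Package.Open(stream, FileMode.Open, FileAccess.Read))
        {
            var relationship = package
                .GetRelationshipsByType(OfficeDocumentRelationshipType)
                .FirstOrDefault();
            return relationship == null
                ? null
                : PackUriHelper.ResolvePartUri(relationship.SourceUri, relationship.TargetUri);
        }
    }
}
```

Calling PackageSketch.FindMainDocumentUri(PackageSketch.CreateSamplePackage()) should give back the part URI that was registered, regardless of which directory the part lives in.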

The document.xml file contains XML elements defined primarily in the WordprocessingML XML namespace of the Office Open XML specification. Its basic structure is a document (<document>) element that contains a body (<body>) element. The body element consists of one or more block-level elements such as paragraph (<p>) elements. A paragraph contains one or more inline-level elements such as run (<r>) elements. A run contains one or more content elements such as text (<t>), break (<br>, a line break by default or a page break when its type is page), and tab (<tab>) elements.
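To make the nesting concrete, here is a small sketch with a hand-written, heavily reduced document.xml. It is only an illustration: the sample markup, the class name DocumentXmlSketch, and the method ReadParagraphs are assumptions, and it uses LINQ to XML (which needs .NET 3.5 or later) instead of the XmlReader approach used in the rest of this article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class DocumentXmlSketch
{
    // WordprocessingML main namespace; real files usually bind it to the "w" prefix.
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written sample: document > body > paragraphs > runs > text/tab/break.
    public const string SampleXml =
        "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
          "<w:body>" +
            "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>" +
            "<w:p>" +
              "<w:pPr><w:jc w:val='center'/></w:pPr>" +
              "<w:r><w:rPr><w:b/></w:rPr><w:t>Bold</w:t></w:r>" +
              "<w:r><w:tab/><w:t>text</w:t><w:br/></w:r>" +
            "</w:p>" +
          "</w:body>" +
        "</w:document>";

    // Returns one string per paragraph. Text elements are concatenated,
    // a tab becomes '\t', a break becomes '\n', and property elements are skipped.
    public static List<string> ReadParagraphs(string xml)
    {
        var body = XDocument.Parse(xml).Root.Element(W + "body");
        return body.Elements(W + "p")
            .Select(p => string.Concat(
                p.Elements(W + "r").Elements().Select(e =>
                    e.Name == W + "t" ? e.Value :
                    e.Name == W + "tab" ? "\t" :
                    e.Name == W + "br" ? "\n" : string.Empty)))
            .ToList();
    }
}
```

For the sample above, ReadParagraphs(SampleXml) yields two strings, one per <p> element; the <pPr> and <rPr> elements carry formatting only and contribute no text.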

Implementation

In short, to retrieve and display the text content of a DOCX file, the application uses two classes: DocxReader and its subclass DocxToFlowDocumentConverter.

DocxReader unzips the file with the help of the System.IO.Packaging namespace, finds the document.xml file through its relationship, and reads it with XmlReader.

DocxToFlowDocumentConverter converts the XML elements read by XmlReader into the corresponding WPF FlowDocument elements.

DocxReader

The DocxReader constructor first opens (unzips) the package from the DOCX file stream and then retrieves the mainDocumentPart (document.xml) with the help of its PackageRelationship.

protected const string MainDocumentRelationshipType = 
   "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
private readonly Package package;
private readonly PackagePart mainDocumentPart;
 
public DocxReader(Stream stream)
{
    if (stream == null)
        throw new ArgumentNullException("stream");
 
    this.package = Package.Open(stream, FileMode.Open, FileAccess.Read);
 
    foreach (var relationship in 
       this.package.GetRelationshipsByType(MainDocumentRelationshipType))
    {
        this.mainDocumentPart = 
          package.GetPart(PackUriHelper.CreatePartUri(relationship.TargetUri));
        break;
    }
}

After retrieving the document.xml PackagePart, we can read it with .NET's XmlReader class, a fast, forward-only XML reader that visits nodes in document order, which is the same order as a depth-first traversal of the XML tree.

The first path, steps 1 to 4, shows the simplest route to the text of a paragraph element. The second path, 5 - …, shows more complex paragraph content. In this path, we also read paragraph properties (<pPr>) and run properties (<rPr>), which contain various formatting options.
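The visiting order can be checked with a few lines of code. This is a sketch under stated assumptions: the class name TraversalSketch and its method are hypothetical, and the input is any small XML string rather than a real document.xml. It records each element together with the depth XmlReader reports for it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class TraversalSketch
{
    // Returns "depth:name" for every element, in the order XmlReader reports them.
    public static List<string> ListElements(string xml)
    {
        var visited = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                // Only start tags (including empty elements such as <pPr/>) are recorded.
                if (reader.NodeType == XmlNodeType.Element)
                    visited.Add(reader.Depth + ":" + reader.LocalName);
            }
        }
        return visited;
    }
}
```

For input such as <document><body><p><r><t>A</t></r></p><p><pPr/><r><rPr/><t>B</t></r></p></body></document>, the reader descends through the first paragraph down to its text before it ever reports the second paragraph, and within the second paragraph it reports the property elements before the content that follows them.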

We create a series of reading methods for every element we wish to support in this path trajectory.

protected virtual void ReadDocument(XmlReader reader)
{
    while (reader.Read())
        if (reader.NodeType == XmlNodeType.Element && reader.NamespaceURI == 
          WordprocessingMLNamespace && reader.LocalName == BodyElement)
        {
            ReadXmlSubtree(reader, this.ReadBody);
            break;
        }
}
 
private void ReadBody(XmlReader reader) {...}
private void ReadBlockLevelElement(XmlReader reader) {...}
protected virtual void ReadParagraph(XmlReader reader) {...}
private void ReadInlineLevelElement(XmlReader reader) {...}
protected virtual void ReadRun(XmlReader reader) {...}
private void ReadRunContentElement(XmlReader reader) {...}
protected virtual void ReadText(XmlReader reader) {...}

A few things you will notice in the DocxReader reading methods:

- We use XmlNameTable to store XML namespace, element, and attribute names. This makes the code cleaner and also faster: XmlReader returns atomized strings from its XmlNameTable for the LocalName and NamespaceURI properties, so we can compare these strings by object reference instead of doing a more expensive value comparison. Even where ordinary string equality is used, .NET checks reference equality first and falls back to a value comparison only if the references differ.
- We call XmlReader.ReadSubtree when passing the XmlReader to a specific DocxReader reading method, which creates a boundary around that XML element. The reading method can then access only that element rather than the entire document.xml. This has some performance cost, which we trade for safer and more intuitive code.

private static void ReadXmlSubtree(XmlReader reader, Action<XmlReader> action)
{
    using (var subtreeReader = reader.ReadSubtree())
    {
        // Position on the first node.
        subtreeReader.Read();

        if (action != null)
           action(subtreeReader);
    }
}
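The name-table technique mentioned in the first point can be sketched in isolation as follows. The class NameTableSketch and its method are hypothetical and are not taken from the accompanying source; the sketch only shows that a name added to the reader's own NameTable can be matched by reference.

```csharp
using System;
using System.IO;
using System.Xml;

public static class NameTableSketch
{
    // Counts <p> elements by comparing atomized names by reference instead of by value.
    public static int CountParagraphs(string xml)
    {
        var nameTable = new NameTable();
        // Add returns the atomized instance; keep it for later reference comparisons.
        object paragraphName = nameTable.Add("p");

        // The reader must use the same NameTable, otherwise the references will not match.
        var settings = new XmlReaderSettings { NameTable = nameTable };
        int count = 0;
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    object.ReferenceEquals(reader.LocalName, paragraphName))
                    count++;
            }
        }
        return count;
    }
}
```

The comparison only works because the reader and the stored name share one NameTable; a string literal "p" that was never added to that table is not guaranteed to be the same object.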

DocxToFlowDocumentConverter

This class inherits from DocxReader and overrides some of its reading methods to create the corresponding WPF FlowDocument elements.

For example, while reading the document element we create a new FlowDocument, while reading a paragraph element we create a new Paragraph, and while reading a run element we create a new Span.

protected override void ReadDocument(XmlReader reader)
{
    this.document = new FlowDocument();
    this.document.BeginInit();
    base.ReadDocument(reader);
    this.document.EndInit();
}
 
protected override void ReadParagraph(XmlReader reader)
{
    using (this.SetCurrent(new Paragraph()))
        base.ReadParagraph(reader);
}
 
protected override void ReadRun(XmlReader reader)
{
    using (this.SetCurrent(new Span()))
        base.ReadRun(reader);
}
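The structure that these overrides produce can also be built by hand, which may help when picturing the result. This is only a sketch: the class FlowDocumentSketch is hypothetical, and placing the text in a WPF Run inside the Span is an assumption about how text elements are handled, not something shown in the listings above. It needs references to the WPF assemblies.

```csharp
using System.Windows.Documents;

public static class FlowDocumentSketch
{
    // Builds the FlowDocument tree for a single paragraph that contains a single run.
    public static FlowDocument Build(string text)
    {
        var span = new Span();               // corresponds to a run (<r>)
        span.Inlines.Add(new Run(text));     // assumption: text (<t>) becomes a WPF Run

        var paragraph = new Paragraph();     // corresponds to a paragraph (<p>)
        paragraph.Inlines.Add(span);

        var document = new FlowDocument();   // corresponds to the document (<document>)
        document.Blocks.Add(paragraph);
        return document;
    }
}
```

Note the naming clash: a WordprocessingML run maps to a WPF Span here, while the WPF Run class is the leaf element that actually holds the characters.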

This class also sets some Paragraph and Span properties, which are read from the paragraph properties element <pPr> and the run properties element <rPr>. By the time XmlReader reaches these property elements, we have already created the new Paragraph or Span, so all that remains is to set its properties.
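As an illustration of what reading a run property involves, the sketch below checks on/off properties such as bold (<b>) and italic (<i>). The helper IsToggleOn and the class RunPropertiesSketch are hypothetical, the sketch uses LINQ to XML rather than the article's XmlReader, and it ignores styles entirely, so treat it as a simplification.

```csharp
using System;
using System.Xml.Linq;

public static class RunPropertiesSketch
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: true when an on/off element such as <w:b/> is present
    // inside <w:rPr> and is not explicitly switched off.
    public static bool IsToggleOn(XElement runProperties, string localName)
    {
        var toggle = runProperties.Element(W + localName);
        if (toggle == null)
            return false;

        var value = (string)toggle.Attribute(W + "val");
        // A missing w:val means "on"; "false", "0" and "off" are treated here as "off".
        return value == null || !(value == "false" || value == "0" || value == "off");
    }
}
```

In a converter, a true result for "b" could then be applied to the current Span by setting its FontWeight property to FontWeights.Bold, and a true result for "i" by setting FontStyle to FontStyles.Italic.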

Because we move from the parent element (Paragraph) to its child elements (Spans) and back to the parent, we have to track the current element in the FlowDocument with a variable of type TextElement (an abstract base class of Paragraph and Span).

This is accomplished with the help of CurrentHandle and the C# using statement, which is syntactic sugar for a try-finally construct. The SetCurrent method sets the current TextElement, and the Dispose method restores the previous TextElement as the current one.

private struct CurrentHandle : IDisposable
{
    private readonly DocxToFlowDocumentConverter converter;
    private readonly TextElement previous;
 
    public CurrentHandle(DocxToFlowDocumentConverter converter, TextElement current)
    {
        this.converter = converter;
        this.converter.AddChild(current);
        this.previous = this.converter.current;
        this.converter.current = current;
    }
 
    public void Dispose()
    {
        this.converter.current = this.previous;
    }
}

private IDisposable SetCurrent(TextElement current)
{
    return new CurrentHandle(this, current);
}
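The same set-and-restore pattern can be tried outside WPF. The following sketch replaces TextElement with plain strings; the class ScopeTracker and its members are made up for illustration and are not part of the accompanying source.

```csharp
using System;
using System.Collections.Generic;

public sealed class ScopeTracker
{
    private string current = "document";
    public List<string> Log { get; } = new List<string>();

    // Remembers the previous scope on creation and restores it on Dispose.
    private sealed class Handle : IDisposable
    {
        private readonly ScopeTracker owner;
        private readonly string previous;

        public Handle(ScopeTracker owner, string next)
        {
            this.owner = owner;
            this.previous = owner.current;
            owner.current = next;
        }

        public void Dispose() { this.owner.current = this.previous; }
    }

    public IDisposable Enter(string name) { return new Handle(this, name); }

    // Records the text together with whatever scope is current at this moment.
    public void Record(string text) { this.Log.Add(this.current + ":" + text); }

    public static List<string> Demo()
    {
        var tracker = new ScopeTracker();
        using (tracker.Enter("paragraph"))
        {
            tracker.Record("a");
            using (tracker.Enter("span"))
                tracker.Record("b");
            tracker.Record("c");   // back in the paragraph scope
        }
        tracker.Record("d");       // back in the document scope
        return tracker.Log;
    }
}
```

The sketch uses a class for the handle. The article's CurrentHandle is a struct, but since SetCurrent returns it as IDisposable it is boxed anyway, so the observable behavior is the same.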

Using the Code

To get a FlowDocument, all we need to do is create a new DocxToFlowDocumentConverter instance from a DOCX file stream and call the Read method on that instance.

After that, we can display the flow document content in a WPF application using the FlowDocumentReader control.

using (var stream = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    var flowDocumentConverter = new DocxToFlowDocumentConverter(stream);
    flowDocumentConverter.Read();
    this.flowDocumentReader.Document = flowDocumentConverter.Document;
    this.Title = Path.GetFileName(path);
}

Conclusion

DOCX Reader is not a complete solution and is intended for simple scenarios (without tables, lists, pictures, headers/footers, styles, etc.). The application can be enhanced to read more DOCX features, but full DOCX support with all advanced features would require much more time and a deeper knowledge of the DOCX file format. Hopefully, this article and the accompanying application have given you some insight into the DOCX file format and can serve as a basis for more complex DOCX-related applications.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).