Basics of OpenXML (Word 2007) for Beginners

4 Jun 2009 | CPOL | 5 min read
Introduction to the OpenXML format.

Introduction

Since Microsoft unveiled OpenXML with Office 2007, many people have started to check whether they can take advantage of it. However, if you search newsgroups, communities, and forums, you will find that many of them consider it complicated and difficult to study and implement.

I think it is not difficult, but lengthy and somewhat complex.

ODF vs. OpenXML: This is another point of debate. ODF (OpenDocument Format) is simpler than MS OpenXML. Both follow the same XML + Zip approach. However, there is a lack of help, tutorials, and support for Open Source technologies compared with MS products.

This article: Here I try to explain the basics of OpenXML programming to help beginners. I have worked with Word 2007, so I cover only the Word 2007 part. However, OpenXML implementations are quite similar across Office components.

None of this is my own invention; I am collecting the basic facts scattered over the internet in one place.

Basics

Let's start with the basics. A Word file with the extension DOCX is actually a compressed archive (Zip) of some files. These files are nothing but XML files arranged in folders and subfolders, and they are linked to each other through relations.

The following figure shows the files inside a DOCX:

Screenshot - Package.jpg

To view these files, just open the DOCX file with WinZip (or any other Zip tool you have). Everything is stored as XML, with a few exceptions such as images and ActiveX controls. You need to remember the following keywords: Package, Parts, Relations.

Package: Package is nothing but your DOCX file. This zip file is called a Package.

Parts: Parts are nothing but files in the Package. E.g., the area where you type (after opening Word) is the main document part. If you insert an image, it will be another part. Everything is managed in parts (numbering [bullets], images, styles, settings etc.). If you want to insert/delete/retrieve images, then you have to play with ImageParts (a sub-class of part) and so on.

Relations: The parts are linked with relations. The main relations are maintained in the .rels file inside the _rels folder within a package. Of course, this file contains XML tags too. There are other relation files in the word/_rels folder. These hold sub-relationships, that is, the relations of an individual part. E.g., if you include an image for a bullet (picture bullets), then you can find the numbering.xml.rels file in this folder. There are many other relation files and it is hard to list them all.

Relation IDs: Each relation has a unique ID. This ID appears both in the referencing part and in the relation file. With the help of this ID, Office finds the referenced part and displays it accordingly. For instance, add a new image to your document, then save it. Open it with WinZip. Open document.xml, look for the w:drawing tag, then inside that, look for the a:blip tag and its r:embed attribute. The value of this attribute will look like rId2. Then, open the document.xml.rels file and search for rId2; you will find the path of the image in the package!
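The package, parts, and relations described above can also be inspected from code. The following is only a minimal sketch: it assumes the System.IO.Packaging classes from WindowsBase.dll (which this article does not otherwise use), the sample path C:\Test\abc.docx, and the usual Word location /word/document.xml for the main document part. It lists the parts in the package and then the relations of the main document part, so an ID such as rId2 can be matched to its target.

```vb
Imports System
Imports System.IO
Imports System.IO.Packaging   ' assumption: the project references WindowsBase.dll

Module PackageInspector
    Sub Main()
        ' Assumed sample path; point it at any DOCX file
        Using pkg As Package = Package.Open("C:\Test\abc.docx", FileMode.Open, FileAccess.Read)
            ' List the parts stored in the package
            For Each part As PackagePart In pkg.GetParts()
                Console.WriteLine("{0}  ({1})", part.Uri, part.ContentType)
            Next

            ' Relations of the main document part (kept in word/_rels/document.xml.rels)
            ' Assumption: the main document part lives at /word/document.xml
            Dim docUri As New Uri("/word/document.xml", UriKind.Relative)
            Dim docPart As PackagePart = pkg.GetPart(docUri)
            For Each rel As PackageRelationship In docPart.GetRelationships()
                Console.WriteLine("{0} -> {1}", rel.Id, rel.TargetUri)
            Next
        End Using
    End Sub
End Module
```

For a document that contains one image, the second loop should print a line pairing an ID with a target under media/, which is the same pairing you see when reading document.xml.rels by hand.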

Do it

To deal with DOCX programmatically and to simplify programming, you may want to download this SDK [Microsoft SDK for OpenXML Formats] [SDK 2.0 here] provided by Microsoft. The final release is not out yet. Download it -> Add a reference to your project -> Import it.

Code

To open a document
VB
Imports DocumentFormat.OpenXml.Packaging

' True opens the package for editing (read/write)
Dim doc As WordprocessingDocument = _
         WordprocessingDocument.Open("C:\Test\abc.docx", True)
Dim mainDoc As MainDocumentPart = doc.MainDocumentPart

mainDoc is the main document part (document.xml) and contains every line of text you typed in the document.

To load into XmlDocument

You may want to load the XML of document.xml into an XmlDocument object. Try this:

VB
' Load document.xml into an XmlDocument; Using closes the part stream afterwards
Dim xmlDoc As New System.Xml.XmlDocument
Using partStream As System.IO.Stream = mainDoc.GetStream()
    xmlDoc.Load(partStream)
End Using

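Remember, to navigate this XML with XPath, you need an XmlNamespaceManager with the required namespaces added to it. You can find the required namespaces declared on the root element of document.xml. If you want to add paragraphs, add child nodes to xmlDoc and then save it back to the part. Open the part stream with mainDoc.GetStream(FileMode.Create, FileAccess.Write) before calling xmlDoc.Save, so that the old content is truncated rather than partly overwritten.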
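The namespace handling and the save step can be sketched as follows. This is a minimal example under stated assumptions, not a complete implementation: the sample path, the module name AppendParagraph, and the text "Added from code" are placeholders, and the paragraph contains only a single plain run. It registers the w prefix, appends one paragraph to the body, and writes the XML back to the main document part.

```vb
Imports System.IO
Imports System.Xml
Imports DocumentFormat.OpenXml.Packaging

Module AppendParagraph
    ' WordprocessingML main namespace, as declared on the root element of document.xml
    Const W As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    Sub Main()
        ' Assumed sample path
        Using doc As WordprocessingDocument = WordprocessingDocument.Open("C:\Test\abc.docx", True)
            Dim mainDoc As MainDocumentPart = doc.MainDocumentPart

            Dim xmlDoc As New XmlDocument()
            Using inStream As Stream = mainDoc.GetStream()
                xmlDoc.Load(inStream)
            End Using

            ' Register the "w" prefix so XPath queries can use it
            Dim ns As New XmlNamespaceManager(xmlDoc.NameTable)
            ns.AddNamespace("w", W)

            Dim body As XmlNode = xmlDoc.SelectSingleNode("/w:document/w:body", ns)

            ' Build <w:p><w:r><w:t>Added from code</w:t></w:r></w:p>
            Dim p As XmlElement = xmlDoc.CreateElement("w", "p", W)
            Dim r As XmlElement = xmlDoc.CreateElement("w", "r", W)
            Dim t As XmlElement = xmlDoc.CreateElement("w", "t", W)
            t.InnerText = "Added from code"
            r.AppendChild(t)
            p.AppendChild(r)

            ' Keep the section properties (w:sectPr), if present, as the last child of the body
            Dim sectPr As XmlNode = body.SelectSingleNode("w:sectPr", ns)
            If sectPr Is Nothing Then
                body.AppendChild(p)
            Else
                body.InsertBefore(p, sectPr)
            End If

            ' FileMode.Create truncates the old content before the new XML is written
            Using outStream As Stream = mainDoc.GetStream(FileMode.Create, FileAccess.Write)
                xmlDoc.Save(outStream)
            End Using
        End Using
    End Sub
End Module
```

Creating the elements with the namespace URI (rather than typing the prefix into a string) keeps the new nodes in the same namespace as the rest of the document, which is what Word expects.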

Add New Part

To add a new part, you can use the AddPart and AddNewPart methods of the WordprocessingDocument and MainDocumentPart classes. These are generic methods, and you need to specify the type of part you want to add. The method returns the part you added, and you can then work with it.

Example 1:

To add a new numbering part in the main document, try the following:

VB
Dim numPart As NumberingDefinitionsPart = _
    mainDoc.AddNewPart(Of NumberingDefinitionsPart)()

Load numPart's XML using the GetStream method into the XmlDocument, do the manipulations, and then save it.
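One detail worth sketching: a part that has just been added has no content yet, so there is nothing to load until a root element is written. The example below is a hedged sketch that assumes the sample path used earlier and an otherwise empty w:numbering root; the real numbering definitions still have to be added as child nodes. It also reuses an existing numbering part, since a main document part holds at most one.

```vb
Imports System.IO
Imports System.Xml
Imports DocumentFormat.OpenXml.Packaging

Module NumberingPartSketch
    ' WordprocessingML main namespace
    Const W As String = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    Sub Main()
        ' Assumed sample path
        Using doc As WordprocessingDocument = WordprocessingDocument.Open("C:\Test\abc.docx", True)
            Dim mainDoc As MainDocumentPart = doc.MainDocumentPart

            ' Reuse the numbering part if the document already has one
            Dim numPart As NumberingDefinitionsPart = mainDoc.NumberingDefinitionsPart
            If numPart Is Nothing Then
                numPart = mainDoc.AddNewPart(Of NumberingDefinitionsPart)()

                ' The new part is empty; write a root element before anything tries to load it
                Dim xmlDoc As New XmlDocument()
                xmlDoc.AppendChild(xmlDoc.CreateElement("w", "numbering", W))
                Using outStream As Stream = numPart.GetStream(FileMode.Create, FileAccess.Write)
                    xmlDoc.Save(outStream)
                End Using
            End If
        End Using
    End Sub
End Module
```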

Example 2:

To add a new image part in the main document, try the following:

VB
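Dim doc As WordprocessingDocument = WordprocessingDocument.Open("C:\Test\abc.docx", True)
Dim mainDoc As MainDocumentPart = doc.MainDocumentPart
' Create an empty PNG image part; its relation is added to document.xml.rels
Dim iPartImage As ImagePart = mainDoc.AddImagePart(ImagePartType.Png)
' Convert the source GIF to PNG while writing it into the part
Using img As Image = Image.FromFile("C:\images\test.gif")
    Using partStream As System.IO.Stream = iPartImage.GetStream()
        img.Save(partStream, Imaging.ImageFormat.Png)
    End Using
End Using
doc.Close()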

The above code adds an image to the package. Remember, the image will not be displayed in your document unless you also add the paragraph and the required drawing nodes to mainDoc's XML. After executing the code above, open the package with WinZip and check that the image has been added under the media folder. Also, open the relation file document.xml.rels and search for media/image; you will find that a new relation tag has been added, with a new unique ID for that image.

[This article would help you to add new paragraphs.]

You can iterate through the parts related to a part using its Parts property. Try a for-each loop and inspect each part in Debug mode (put a breakpoint inside the loop). [Check the mainDoc.Parts property].
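As a rough illustration of that loop, the sketch below prints the relation ID, location, and type of every part related to the main document part. The sample path and the module name ListParts are assumptions; the document is opened read-only because nothing is changed.

```vb
Imports System
Imports DocumentFormat.OpenXml.Packaging

Module ListParts
    Sub Main()
        ' False opens the document read-only; the path is an assumed sample
        Using doc As WordprocessingDocument = WordprocessingDocument.Open("C:\Test\abc.docx", False)
            Dim mainDoc As MainDocumentPart = doc.MainDocumentPart

            ' Each entry pairs a relation ID with the part it points to
            For Each pair As IdPartPair In mainDoc.Parts
                Console.WriteLine("{0}: {1} ({2})", _
                    pair.RelationshipId, pair.OpenXmlPart.Uri, pair.OpenXmlPart.GetType().Name)
            Next
        End Using
    End Sub
End Module
```

The IDs printed here are the same ones that appear in document.xml.rels, so this is a quick way to see which ID belongs to which image, style, or numbering part.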

Delete existing part

You can find the ID of the part in document.xml. Once you have the ID, call the GetPartById method of mainDoc. This returns the part that you want to delete. Then call the DeletePart method. This deletes the part and also updates the relation file (document.xml.rels).
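The two calls can be sketched as below. The ID rId2 and the sample path are hypothetical; read the real ID from document.xml first.

```vb
Imports System
Imports DocumentFormat.OpenXml.Packaging

Module DeletePartSketch
    Sub Main()
        ' Hypothetical relation ID; take the real one from document.xml
        Dim relId As String = "rId2"

        ' Assumed sample path
        Using doc As WordprocessingDocument = WordprocessingDocument.Open("C:\Test\abc.docx", True)
            Dim mainDoc As MainDocumentPart = doc.MainDocumentPart

            ' GetPartById throws if no relation with this ID exists
            Dim part As OpenXmlPart = mainDoc.GetPartById(relId)
            Console.WriteLine("Deleting {0}", part.Uri)

            ' Removes the part and its entry in document.xml.rels
            mainDoc.DeletePart(part)
        End Using
    End Sub
End Module
```

Note that this does not touch document.xml itself. Any tag there that still refers to the deleted ID (for example, an r:embed attribute) should be removed as well, otherwise Word is likely to complain about the file.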

Saving Document

WordprocessingDocument.Close() automatically saves and closes the document. You don't need to save it explicitly.

Final Words

You need to work hard to understand OpenXML. Debugging and some R&D will help you know it better.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)