Introduction
This project started with the aim of simplifying and automating the publishing of my company's documents.
We have many Word documents that should be published on the web but must be converted first. Microsoft Word can export to HTML, but we need to apply our own style sheets, and we use different style sheets depending on the topic.
With these requirements in mind I built a small prototype that could be interesting to improve.
Architecture
This is only a prototype and it has limited features compared with everything Microsoft Word can do, but I converted several documents with different styles and the results were good. I also use it to write documents to put in a CMS or to submit to CodeProject ;-)
The conversion process consists of scanning a Word document in reading order, extracting the style properties of each paragraph or run of text, finding the matching HTML tag and then rendering the result to an HTML file.
References
Reading a docx document is not simple. It is XML, so it is readable, but… look at the size of the specifications:
Really big. But after studying for a few days, I was able to understand a bit more and achieve some good results.
First of all, the docx format is a zip-compressed package. You can use a custom library to open it, but you can also use the classes in the “System.IO.Packaging” namespace, which lives in the WindowsBase assembly provided with .NET Framework 3.0. The latter is very useful, and is what I use in this project, because it also implements all the functions needed to manage and retrieve the package “relationships”.
The package is in fact structured as a directory tree containing different folders with different files.
Files can be related to each other, and these relationships are defined and stored in the package too.
I explain this because many of the documents I converted contained embedded images and, in the package, images are parts that have a relationship with the document part.
Second, the main file I want to analyze is the one that contains the text. This is “\word\document.xml”. This file describes the content of the document, so you can scan the whole document element by element.
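As a rough illustration of the package layout described above, the sketch below opens a docx with System.IO.Packaging, fetches the main document part and lists the image parts it is related to. The module name and the use of the first command-line argument as the docx path are assumptions made for the example, not taken from the project, and the relationship type shown is the one used by transitional OOXML files.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Packaging   ' requires a reference to WindowsBase.dll

Module PackageSketch
    ' Relationship type that links document.xml to embedded images
    ' (transitional OOXML; strict documents use a different namespace).
    Private Const ImageRelType As String = _
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"

    Sub Main(ByVal args As String())
        ' Assumption: args(0) is the path of a .docx file.
        Using pkg As Package = Package.Open(args(0), FileMode.Open, FileAccess.Read)
            ' Inside the package, part names use forward slashes.
            Dim docUri As New Uri("/word/document.xml", UriKind.Relative)
            Dim docPart As PackagePart = pkg.GetPart(docUri)
            Console.WriteLine("Main part: " & docPart.Uri.ToString())

            ' Images are separate parts, reached through the relationships
            ' that belong to the document part.
            For Each rel As PackageRelationship In docPart.GetRelationshipsByType(ImageRelType)
                ' Linked (external) images have no part inside the package.
                If rel.TargetMode = TargetMode.Internal Then
                    Dim imgUri As Uri = PackUriHelper.ResolvePartUri(docPart.Uri, rel.TargetUri)
                    Console.WriteLine(rel.Id & " -> " & imgUri.ToString())
                End If
            Next
        End Using
    End Sub
End Module
```

The relationship Id printed here is the value that the drawing elements in document.xml use to point at an image, which is why the relationships have to be read together with the text.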
Third, quoting from the OOXML spec:
“The basis of a WordprocessingML document is its actual text contents. Those text contents can be stored in many contexts (tables, text boxes, etc.), but the most basic form of text contents in WordprocessingML is the paragraph, specified using the p element (§2.3.1.22). Within the paragraph, all rich formatting at the paragraph level is stored within the pPr element (§2.3.1.25; §2.3.1.26). [Note: Some examples of paragraph properties are alignment, border, hyphenation override, indentation, line spacing, shading, text direction, and widow/orphan control.] Within the paragraph, text is grouped into one or more runs, represented by the r element (§2.3.2.23), which define a region of text with a common set of properties.”
Notes
Technical choice
A paragraph can have its own style, but the runs it contains can override that style with a more specific one. So there are two levels of style. It proved useful to distinguish between an applied style (normal, heading, list number…) and text modifiers (bold, underline, italic…).
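The two levels can be seen in a minimal sketch that walks paragraphs and runs with XmlDocument. The XML fragment is hand-written for the example, the fallback to "Normal" for a paragraph without a pStyle element is an assumption, and the modifier names "b", "i" and "u" are illustrative. For brevity it also treats the mere presence of w:b, w:i or w:u as "on", ignoring a possible w:val that switches the property off.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml

Module StyleLevelsSketch
    Private Const W As String = _
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    Sub Main()
        ' Hand-written fragment standing in for the content of document.xml.
        Dim fragment As String = _
            "<w:document xmlns:w='" & W & "'><w:body>" & _
            "<w:p><w:pPr><w:pStyle w:val='Heading1'/></w:pPr>" & _
            "<w:r><w:t>Title</w:t></w:r></w:p>" & _
            "<w:p><w:r><w:t xml:space='preserve'>Plain </w:t></w:r>" & _
            "<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r></w:p>" & _
            "</w:body></w:document>"

        Dim doc As New XmlDocument()
        doc.LoadXml(fragment)
        Dim ns As New XmlNamespaceManager(doc.NameTable)
        ns.AddNamespace("w", W)

        For Each p As XmlNode In doc.SelectNodes("/w:document/w:body/w:p", ns)
            ' Level 1: the style applied to the whole paragraph.
            Dim styleName As String = "Normal" ' assumed fallback
            Dim pStyle As XmlNode = p.SelectSingleNode("w:pPr/w:pStyle/@w:val", ns)
            If pStyle IsNot Nothing Then styleName = pStyle.Value

            For Each r As XmlNode In p.SelectNodes("w:r", ns)
                ' Level 2: modifiers the run adds on top of the paragraph style.
                Dim mods As New List(Of String)
                If r.SelectSingleNode("w:rPr/w:b", ns) IsNot Nothing Then mods.Add("b")
                If r.SelectSingleNode("w:rPr/w:i", ns) IsNot Nothing Then mods.Add("i")
                If r.SelectSingleNode("w:rPr/w:u", ns) IsNot Nothing Then mods.Add("u")

                Dim runText As String = ""
                Dim t As XmlNode = r.SelectSingleNode("w:t", ns)
                If t IsNot Nothing Then runText = t.InnerText

                Console.WriteLine(styleName & "|" & String.Join("", mods.ToArray()) & " -> " & runText)
            Next
        Next
    End Sub
End Module
```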
About languages / localization
This program works by literally matching the style names found in the Word document against those specified in the map file. Opening a Word document and saving it with a different language version of Word causes all style names to be translated into that language. The demo file I attached was saved with an English version of Word, so its style names are still in English (Normal, Heading1, Quote, Subtitle...), but if I save the document with an Italian version, the style names are translated into Italian (Normale, Titolo1, Citazione, sottotitolo…).
The program writes to the console the names of all styles that were not matched during conversion.
Points of interest
I created a structure that saves in a buffer all text with the same format, and that flushes the buffer when the format changes. The StyleClass does this work. It has two public members, StyleName and Modifiers; by setting them we can detect when the format changes. It shadows the ToString() function, returning a unique string built from the values of its members. Each paragraph has its own StyleClass and each sub-element has its own too. If a sub-element specifies a different style or adds a text modifier, it sets a StyleClass and, when the style changes, it causes the buffer to be flushed.
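Private Class StyleClass
    ' Name of the Word style applied to the paragraph or run.
    Public StyleName As String
    ' Text modifiers keyed by name, so each one is stored only once.
    Private p_modifiers As Hashtable

    Public Sub New()
        p_modifiers = New Hashtable
    End Sub

    Public Sub AddModifier(ByVal modifier As String)
        ' Ignore a modifier that has already been added.
        If p_modifiers(modifier) Is Nothing Then
            Me.p_modifiers.Add(modifier, modifier)
        End If
    End Sub

    Public ReadOnly Property Modifiers() As ICollection
        Get
            Return p_modifiers.Values
        End Get
    End Property

    ' Builds a key from the style name and its modifiers;
    ' comparing two keys tells whether the format has changed.
    Public Shadows Function ToString() As String
        Dim tmp As String = ""
        For Each m As String In p_modifiers.Values
            tmp &= m
        Next
        Return StyleName & "|" & tmp
    End Function
End Class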
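The flush-on-change idea can be sketched on its own, independently of the Word parsing. In this example the Piece structure, the sample keys and the Flush routine are all hypothetical: each key stands in for the string that StyleClass.ToString() would return, and flushing simply prints the buffer instead of writing HTML.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module BufferFlushSketch
    ' Hypothetical stand-in for a piece of text plus its format key,
    ' e.g. the value returned by StyleClass.ToString().
    Private Structure Piece
        Public Key As String
        Public Content As String
        Public Sub New(ByVal k As String, ByVal c As String)
            Key = k
            Content = c
        End Sub
    End Structure

    Sub Main()
        Dim pieces As New List(Of Piece)
        pieces.Add(New Piece("Normal|", "Hello "))
        pieces.Add(New Piece("Normal|", "big "))
        pieces.Add(New Piece("Normal|b", "world"))

        Dim buffer As New StringBuilder()
        Dim currentKey As String = Nothing

        For Each p As Piece In pieces
            ' A different key means the format changed: flush what was collected.
            If currentKey IsNot Nothing AndAlso p.Key <> currentKey Then
                Flush(currentKey, buffer)
            End If
            currentKey = p.Key
            buffer.Append(p.Content)
        Next

        ' Flush whatever is left when the input ends.
        If buffer.Length > 0 Then Flush(currentKey, buffer)
    End Sub

    ' Prints the buffered text with its key, then empties the buffer.
    Private Sub Flush(ByVal key As String, ByVal buffer As StringBuilder)
        Console.WriteLine("[" & key & "] " & buffer.ToString())
        buffer.Length = 0
    End Sub
End Module
```

With the sample data the first two pieces share a key and are emitted together as "[Normal|] Hello big ", while the bold piece is emitted separately as "[Normal|b] world".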
Flushing the buffer also means wrapping the text in HTML tags. I created a table that lists Word styles with their matching “html start tag” and “html end tag”:
Style | Start tag | End tag
Normal | <p> | </p>
Heading1 | <h1> | </h1>
Code | <text style=""> | </text>
… | |
And one for the text modifiers:
Modifier | Start tag | End tag
bold | <b> | </b>
Italic | <i> | </i>
Underline | <u> | </u>
… | |
I load these matching tables into memory from an external XML file and then use them to render the text accordingly.
<?xml version="1.0" encoding="utf-8"?>
<conversion_map>
  <styles>
    <style name="Normale">
      <start_ctag>[p]</start_ctag>
      <end_ctag>[/p]</end_ctag>
    </style>
    <style name="Paragrafoelenco_l0">
      <start_ctag>[li style='list-style-type: circle; margin: 5px 0 5px 15px;']</start_ctag>
      <end_ctag>[/li]</end_ctag>
    </style>
    <style name="Code">
      <start_ctag>[pre lang="VB.NET"]</start_ctag>
      <end_ctag>[/pre]</end_ctag>
    </style>
    <style name="Grigliatabella">
      <start_ctag>[table class='feature' cellspacing='0' cellpadding='0' style='width:100%;']</start_ctag>
      <end_ctag>[/table]</end_ctag>
    </style>
    ...
  </styles>
  <modifiers>
    <modifer name="b">
      <start_ctag>[b]</start_ctag>
      <end_ctag>[/b]</end_ctag>
    </modifer>
    <modifer name="c">
      <start_ctag>[text style='color: #{0};']</start_ctag>
      <end_ctag>[/text]</end_ctag>
    </modifer>
    <modifer name="h">
      <start_ctag>[text style='background-color: {0};']</start_ctag>
      <end_ctag>[/text]</end_ctag>
    </modifer>
    ...
  </modifiers>
</conversion_map>
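Loading such a map into two lookup tables might look like the sketch below. It is not the project's loader: TagPair is a hypothetical stand-in for the ConvertionClass used later, LoadSection and ToHtml are names invented for the example, and the replacement of square brackets with angle brackets is an assumption about how the bracketed tags in the map are meant to be read. The map is embedded as a string only to keep the example self-contained.

```vbnet
Imports System
Imports System.Collections
Imports System.Xml

Module MapLoaderSketch
    ' Hypothetical stand-in for the article's ConvertionClass.
    Public Class TagPair
        Public StartTag As String
        Public EndTag As String
    End Class

    ' Reads every node selected by xpath into a name -> TagPair table.
    Function LoadSection(ByVal doc As XmlDocument, ByVal xpath As String) As Hashtable
        Dim table As New Hashtable()
        For Each node As XmlNode In doc.SelectNodes(xpath)
            Dim pair As New TagPair()
            pair.StartTag = ToHtml(node.SelectSingleNode("start_ctag").InnerText)
            pair.EndTag = ToHtml(node.SelectSingleNode("end_ctag").InnerText)
            table(node.Attributes("name").Value) = pair
        Next
        Return table
    End Function

    ' Assumption: square brackets in the map stand for angle brackets.
    Function ToHtml(ByVal s As String) As String
        Return s.Replace("[", "<").Replace("]", ">")
    End Function

    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml( _
            "<conversion_map><styles>" & _
            "<style name='Normale'><start_ctag>[p]</start_ctag><end_ctag>[/p]</end_ctag></style>" & _
            "</styles><modifiers>" & _
            "<modifer name='b'><start_ctag>[b]</start_ctag><end_ctag>[/b]</end_ctag></modifer>" & _
            "</modifiers></conversion_map>")

        ' The element name "modifer" is spelled as in the map file.
        Dim styles As Hashtable = LoadSection(doc, "/conversion_map/styles/style")
        Dim mods As Hashtable = LoadSection(doc, "/conversion_map/modifiers/modifer")

        Dim p As TagPair = DirectCast(styles("Normale"), TagPair)
        Dim b As TagPair = DirectCast(mods("b"), TagPair)
        Console.WriteLine(p.StartTag & b.StartTag & "text" & b.EndTag & p.EndTag)
    End Sub
End Module
```

Keeping styles and modifiers in separate tables mirrors the two levels of formatting: a lookup by style name gives the outer tags, and each modifier name adds an inner pair.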
The function that writes to the HTML document uses a stack to add the start tags and end tags in the right order:
Dim tmp As String
Dim cs As ConvertionClass

' Look up the tag pair for the applied style.
cs = ht_Style(style.StyleName)
If cs Is Nothing Then
    ' Unknown style: report it and emit the text without surrounding tags.
    Console.WriteLine("Style not found: " & style.StyleName)
    cs = New ConvertionClass() With {.StartTag = "", .EndTag = ""}
End If

Dim cm As ConvertionClass
Dim s As New Stack
tmp = cs.StartTag

' Open one tag per known modifier, remembering the order on the stack.
For Each m In style.Modifiers
    cm = ht_Mod(m)
    If cm IsNot Nothing Then
        s.Push(cm)
        tmp &= cm.StartTag
    End If
Next

tmp &= buffer

' Close the modifier tags in reverse order, so the tags nest correctly.
While s.Count > 0 AndAlso s.Peek IsNot Nothing
    cm = s.Pop
    tmp &= cm.EndTag
End While

tmp &= cs.EndTag
Return tmp
Limitations
This project can convert simple documents:
- it can read and understand all styles used in the document,
- it can manage a "table of contents" (which it converts into anchor names),
- it can convert simple bulleted lists,
- it can convert simple tables,
- it can convert external hyperlinks.
The conversion style XML file must be specified as the second argument on the command line.
The result of the conversion is an HTML file saved in the relative folder “.\CDATA\”. If the document contains images, they are saved in the relative folder “.\CDATA\img\” and referenced as the source of img tags.
I've tested it with documents created by Microsoft Word 2010 and 2013.
Conclusion
This article was posted using this tool.
I wrote the article in Microsoft Word, prepared a conversion table for CodeProject and finally copied and pasted the converted document. It works! :-P
This article shows only one technique for reading and converting a docx document. It is not exhaustive with respect to the full OOXML specification. If you need support, please contact me at Gekoproject.com