Introduction
The WordML spec will probably never be fully completed, but here is an attempt to extend the documentation where Microsoft is silent.
The RTF spec became a gigantic mess because it was incomplete in places, vague in other places, and wrong (i.e., didn’t match what Word did) in others. Because of that, every programmer writing code to read/write RTF files made different reasonable assumptions, and we were left with a spec that is not terribly portable.
I am hoping that in the case of WordML, we can get most developers to make the same assumptions. At worst, this will leave us with two WordML specs in practice: the common assumptions that all non-Microsoft developers use, and the assumptions used by developers in the Word group. So, as you make assumptions of your own, please email me what you did and I will add it to the spec, as long as it does not conflict with an assumption already listed.
Note: The most up to date version of this document can be found at http://dave.thielen.com/articles/The%20WordML%20Spec.htm.
Questions
- To store an image in a WordML document, the .chm file says you use <pict> and it has a <binData> subnode. It has several other subnodes listed, but they are all irrelevant for a bitmap image. WordProcessingML.doc and any XML file saved by Word shows a <v:shapetype> and a <v:shape> subnode. But neither explains any of the attributes or subnodes of those two nodes. The VML schema lists all of the attributes and sub-nodes, but has no description for any of them. To store a bitmap in a WordML file, what attributes and sub-nodes need to be set and with what values? If I just guess on this based on saving a bunch of files, there are so many undocumented elements here, I'm guaranteed to guess wrong on some (details below on how I think an image should be written).
- For <instrText>, where can I find a list of the fields Word uses it for, and how to set all attributes and sub-nodes in each case?
- For <w:fldSimple w:instr=, where can I find a list of all values Word 2003 supports? And for each value, a list of which child nodes it requires and/or uses.
- When I create a hyperlink in Word, it saves it in WordML using <fldChar>/<instrText>/<fldChar> ... <fldChar>. Why does it do this instead of using hlink (which is what the docs show)?
- Reading the WordProcessingML.doc and the OfficeXML.chm files, it looks like a horizontal merge of cells is supposed to be created using the <hmerge> node. But when Word creates the XML file, it uses gridSpan. When should you use one vs. the other? (I assume that, as it has both, there is a reason for this, and there are situations where one is required and other situations where the other is required.)
Assumptions
- The wx: elements appear to be duplications of w: or o: elements, and exist to make it easier for a program other than Word (Internet Explorer?) to make the document appear identically. As they are redundant information, and may be wrong in places (wx:bdrwidth is specified as points, but Word appears to write it in twips), my approach is to never write a wx: element or attribute, and never use one when reading the document.
- If you want just a single header or footer, as opposed to one for odd pages and one for even pages, just create an odd header. As Word does it this way and the spec is silent on the issue, I treat creating only an even header as an error. For a header on even pages only, create a blank odd header as well.
- For <w:font>, do not write/read the usb-0 … csb-1 attributes. They are not needed, and there is no guarantee that the values for the font on your system will be the same on another person’s system.
- Apparently, there is no way to set the default language for a document (unlike RTF). You can set it on a paragraph by paragraph basis.
- When I write "\u2003\u2002\u2009" (em, 1/2em, 1/6 em) to a text node in a WordML file, I get (using Courier, so it's fixed width):
- em - just slightly under 2 chars wide space.
- 1/2 em - 1 char wide space.
- 1/6 em - a box (the unknown char symbol) 1 char wide.
- It looks the same with Arial and Times New Roman. The box for 1/6 em is definitely wrong, and the other spacing is not what you normally get. No idea why, but this is how Word does it.
- The units for <w:pBrdr><w:top w:space='#'> are listed as 1/8 of a point. However, Word appears to interpret them as 1 point. It does accept and handle real numbers, so you can have <w:pBrdr><w:top w:space='3.5'>.
- Word creates <w:lvlText w:val="%1.%2.%3."/> where the formatting appears to put the level 1 number at %1, the level 2 number at %2, etc. Word only allows 9 levels, so %10 should be considered illegal.
- The character 0x2011 (a non-breaking hyphen) shows up as the unknown glyph box. If you use ToggleCharacterCode (Alt+X) twice on the "unknown glyph" box, it displays and works properly. So use <w:noBreakHyphen/> instead of writing the character itself.
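The non-breaking hyphen assumption above can be sketched in Python as follows. This is only an illustration, not how Word itself does it: the function name build_run is made up, and the namespace URI is the one that appears in the root element of files saved by Word 2003. The sketch splits text at each U+2011 and writes <w:noBreakHyphen/> between the resulting <w:t> pieces of a single run.

```python
import xml.etree.ElementTree as ET

# Namespace URI as seen in files saved by Word 2003 (assumed here).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NB_HYPHEN = "\u2011"  # non-breaking hyphen


def build_run(text):
    """Build a <w:r> whose U+2011 characters become <w:noBreakHyphen/>.

    Hypothetical helper: the name and the single-run layout are
    illustrative choices, not taken from any Microsoft documentation.
    """
    run = ET.Element(f"{{{W}}}r")
    for index, part in enumerate(text.split(NB_HYPHEN)):
        if index > 0:
            # One empty element for every hyphen that was in the text.
            ET.SubElement(run, f"{{{W}}}noBreakHyphen")
        if part:
            # Skip empty pieces so no empty <w:t/> is written.
            ET.SubElement(run, f"{{{W}}}t").text = part
    return run
```

Calling build_run("non\u2011breaking") gives a run with three children: a <w:t> holding "non", a <w:noBreakHyphen/>, and a <w:t> holding "breaking".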
Images
This is quite a bit, so I have it here in its own section. What I have here works – but I had to make so many assumptions. I am almost certain that some of them do not match Microsoft’s.
GIF
<w:pict>
<w:binData w:name="wordml://01000006.gif">R0lGODlhEAAQALMAAAAAAIAAAACAAICAAAA
AgIAAgACAgICAgMDAwP8AAAD/AP//AAAA//8A/wD//////yH5BAEAAA0ALAAAAAAQABAAAARaMJxJ
Z7u4ncf7Axm2ASRAIGCofQ7DnGk4ui+qrkSeV5WG/L+NhwM6lEijo03YGTkMhsMSoah8oFFb4wgYV
bSalBQIjBkvRm63e1gM2ENOGuBGUnnnbUxNakQAADs=
</w:binData> <v:shape id="_1" type="#_x0000_t75"
style="width:12pt;height:12pt">
<v:imagedata src="wordml://01000006.gif" o:title="networ6"/>
</v:shape>
</w:pict>
binData is the Base64 encoding of a GIF image (not uuencode). Don’t change type="#_x0000_t75". The wordml://name.gif value in w:name on <w:binData> must match the src on <v:imagedata>. style="width:12pt;height:12pt" gives the size in the doc.
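As a rough sketch of the structure described above, the following Python builds the same <w:pict> element from raw image bytes. The function name build_pict and its parameters are invented for illustration, the shape id "_1" is simply copied from the GIF sample, and the namespace URIs are the ones found in files saved by Word 2003. It rests on the same assumptions as the rest of this section, so treat it as a starting point rather than a definitive writer.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace URIs as seen in files saved by Word 2003 (assumed here).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
V = "urn:schemas-microsoft-com:vml"
O = "urn:schemas-microsoft-com:office:office"
for prefix, uri in (("w", W), ("v", V), ("o", O)):
    ET.register_namespace(prefix, uri)


def build_pict(image_bytes, name, width_pt, height_pt, title=""):
    """Return a <w:pict> element embedding image_bytes (hypothetical helper).

    name is the file-style name after wordml://, e.g. "01000006.gif".
    """
    uri = "wordml://" + name
    pict = ET.Element(f"{{{W}}}pict")

    # The image itself, Base64 encoded, keyed by the wordml:// name.
    bin_data = ET.SubElement(pict, f"{{{W}}}binData", {f"{{{W}}}name": uri})
    bin_data.text = base64.b64encode(image_bytes).decode("ascii")

    # The shape that displays it; type is left as in the samples.
    shape = ET.SubElement(pict, f"{{{V}}}shape", {
        "id": "_1",
        "type": "#_x0000_t75",
        "style": f"width:{width_pt}pt;height:{height_pt}pt",
    })
    # src must be the same wordml:// name used on binData.
    ET.SubElement(shape, f"{{{V}}}imagedata",
                  {"src": uri, f"{{{O}}}title": title})
    return pict
```

Passing the result to ET.tostring(pict, encoding="unicode") gives markup with w:, v: and o: prefixes, ready to place inside a run. Using one uri variable for both attributes keeps the binData name and the imagedata src from drifting apart.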
PNG
<w:pict>
<w:binData w:name="wordml://03000001.png">iVBORw0KGgoAAAANSUhEUgAAABAAAAAQBA
MAAADt3eJSAAAAB3RJTUUH1QEDFRoyw+VrogAAAAlwSFlzAAALEgAACxIB0t1+/AAAACdQTFRF/w
D/AAAA//8AgIAAgICAwMDAAP8A////AICAAP//AACAAAD/gAAA6ZmItQAAAAF0Uk5TAEDm2GYAAA
BoSURBVHjaNcvBDYAgDIXhugEvwUTikUVsUsIEbOAOrOAgemEFrw5mC9rTn/R79IjIRnq51up6AN
eIcH+x/tHa2X0qZXgG1M/O5jkcyVHaJS8Wk768aBCbLz3Ug0mitzkTItQLE8E83Atm2Rvlc68eJA
AAAABJRU5ErkJggk==</w:binData>
<v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:12pt;height:12pt">
<v:imagedata src="wordml://03000001.png" o:title="network"/>
</v:shape>
</w:pict>
Same shape type; only the extension in the wordml:// name differs. Also, the .NET Base64 encoder ends this data with ...BJRU5ErkJggg== rather than the ...ErkJggk== shown above. I don't know what kind of bug (if any) this is.
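One way to look at the differing endings: in Base64, when the data ends with ==, only the top two bits of the last character before the padding carry data, and the remaining four bits are unused. The characters g (100000) and k (100100) differ only in those unused bits, so both endings should decode to the same final byte. The Python sketch below checks this with the standard library's default (lenient) decoder. The constant names are made up for illustration, and a strict decoder might still reject the variant with non-zero unused bits.

```python
import base64

# Last eight Base64 characters of the PNG data in the two variants.
WORD_TAIL = "rkJggk=="    # as shown in the sample above
DOTNET_TAIL = "rkJggg=="  # as produced by the .NET encoder


def decode_tail(tail):
    """Decode a Base64 tail with Python's default, lenient decoder."""
    return base64.b64decode(tail)


# Both tails decode to the same four bytes (the end of the PNG IEND chunk).
same = decode_tail(WORD_TAIL) == decode_tail(DOTNET_TAIL)
```

If same is True, the two encodings carry identical image bytes and differ only in the unused padding bits, which would make the difference harmless for readers that decode leniently.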
JPG
<w:pict>
<w:binData w:name="wordml://02000001.jpg">/9j/4AAQSkZJRgABAQEASABIAAD/2wBDAA
UDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8SEhEP
ERETFhwXExQaFRERGCEYGh0dHx8fExciJCIeJBweHx7/2wBDAQUFBQcGBw4ICA4eFBEUHh4eHh4e
Hh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh7/wAARCAAQABADASIA
AhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQA
AAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3
ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWm
p6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEA
AwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSEx
BhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElK
U1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3
uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwDOl8Se
I5Pinonh7whceF7nRRcadamxiXR3ldBBbrcReXIv2gOJBcbiW9MbQuTmWVgtj4cikufFHhrVJ7SC
BbprHXbe9lZ2eOEPtjdnIMkijcR/ECcVf8P+PvElv4w0Yy+P7m90lIY7ldPvJPItLNoZX2x3mId9
um1I3QsCSqlvmyoONq914c0/wTc6ZpXiTwlfzw29qkUVt4jnu7q5+zSRSRxIjWUSszGJU4YYDZAb
AU/R4DN6uXVIqmoreLaTfNyzabd9b3XRdtD6HKswx3D+NSptNTbUnq9FJrRWu9fuTXof/9k=
</w:binData>
<v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:12pt;height:12pt">
<v:imagedata src="wordml://02000001.jpg" o:title="network"/>
</v:shape>
</w:pict>
This document may be freely distributed as long as it is distributed in whole.