Introduction
One of my recent projects required automated generation of
contracts for customers. A contract is a legal document about 10
pages long. One contract form applies to many customers, so
the document is a template with customer information inserted in certain places.
In this article I show how I solved this problem.
Requirements
This is an initial version of formalized requirements:
Specified data must be placed in marked places of a complex DOC/DOCX file.
The requirements were subsequently refined and expanded:
- Specified data must be placed in marked places of a complex DOCX file.
- Template markup must be scriptlet-like: ${}, <%%>, <%=%>.
- Output data may be not only strings but also hashes and objects. Field access must be an option.
- The scripting language used in templates must be brief and script-friendly: Groovy or JavaScript.
- It must be possible to display a list of objects in a table, each cell displaying a field.
Background
It turned out that the existing products in this field
(I am talking about the Java world) did not meet the initial
requirements.
A brief overview of the products:
JasperReports
JasperReports uses *.jrxml files as templates. A template file combined with input data (an SQL result set or a Map of
params) is given to a processor, which produces any of these formats:
PDF, XML, HTML, CSV, XLS, RTF, TXT.
Why it did not fit:
- It is not WYSIWYG, even with the help of iReport, a visual tool for creating jrxml templates.
- The JasperReports API must be learned well to create
and style a complex template.
- JR does not output in a suitable format. PDF might be okay,
but the ability to edit the result by hand is preferable.
Docx4j
Docx4j is a Java library for creating and manipulating
Microsoft Open XML (Word docx, Powerpoint pptx, and Excel xlsx)
files.
Why it did not fit:
- No case in the docx4j documentation meets my requirements. There is a
brief note about the XmlUtils.unmarshallFromTemplate functionality,
but it only does the simplest substitutions.
- Repeating output is done with prepared XML sources and XPath.
Apache POI
Apache POI is a Java tool for creating and manipulating
parts of *.doc, *.ppt, *.xls documents. A major use of the Apache POI
API is for text extraction applications such as web spiders, index
builders, and content management systems.
Why it did not fit:
- It does not have any options that meet my requirements.
Word Content Control Toolkit
Word Content Control Toolkit is
a stand-alone, light-weight tool that opens any Word Open XML
document and lists all of the content controls inside of it.
After I developed my own solution with scriptlets, I heard
of a solution based on a combination of this tool and
XSLT transformations. It may work for somebody, but I did not
dig into it because using my solution directly simply takes fewer steps.
Solution of the problem
It was fun!
1. The text content of a document is stored as an Open XML file inside a zip archive.
The standard JDK 6 zip classes do not support an explicit encoding
parameter, so they may produce a broken docx file.
I had to use the Groovy AntBuilder wrapper for zipping, which
does have an encoding parameter.
2. Any text you enter in MS Word may be arbitrarily
broken into parts wrapped in XML tags. So I had to clean up
the scriptlets in the template XML that were split apart by such tags. I used regular
expressions for this task. I did not try XSLT or anything else
because I thought regular expressions would be faster.
3. I decided to use Groovy as the scripting language because of its
simplicity, Java nature, and built-in
template processor. I found an interesting issue related to the processor:
even in a small 10-page document one can easily run into a
restriction on the length of a string between two scriptlets.
I had to substitute the text between each pair of scriptlets
with a UUID string, run the Groovy template processor on the
modified text, and finally switch those UUID placeholders back to
the original text fragments.
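The third step can be sketched roughly as follows. This is not the actual scriptlet4docx code: the class name PlaceholderMasking, the methods mask and unmask, and the simplified regular expression (which does not handle nested braces inside ${}) are all assumptions made for illustration. The call to the Groovy template processor is left out so that the sketch only needs the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of scriptlet4docx.
final class PlaceholderMasking {

    // Simplified matcher for ${...}, <%...%> and <%=...%>; nested braces are not handled.
    private static final Pattern SCRIPTLET =
            Pattern.compile("\\$\\{.*?\\}|<%.*?%>", Pattern.DOTALL);

    // Replaces every text fragment between scriptlets with a UUID and remembers the original.
    static String mask(String template, Map<String, String> saved) {
        StringBuilder out = new StringBuilder();
        Matcher m = SCRIPTLET.matcher(template);
        int last = 0;
        while (m.find()) {
            out.append(stash(template.substring(last, m.start()), saved));
            out.append(m.group()); // scriptlets are passed through unchanged
            last = m.end();
        }
        out.append(stash(template.substring(last), saved));
        return out.toString();
    }

    // Stores a non-empty fragment under a fresh UUID and returns that UUID.
    private static String stash(String text, Map<String, String> saved) {
        if (text.isEmpty()) {
            return "";
        }
        String key = UUID.randomUUID().toString();
        saved.put(key, text);
        return key;
    }

    // Puts the original fragments back in place of their UUID placeholders.
    static String unmask(String processed, Map<String, String> saved) {
        String result = processed;
        for (Map.Entry<String, String> e : saved.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> saved = new LinkedHashMap<String, String>();
        String template = "Dear ${name}, <% if (vip) { %>thank you<% } %> goodbye.";
        String masked = mask(template, saved);
        // The template processor would run on 'masked' here; this sketch skips that step.
        System.out.println(unmask(masked, saved).equals(template));
    }
}
```

The template processor only ever sees short UUID strings between scriptlets, so the length of the real text no longer matters to it.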
After overcoming these difficulties, I tried out the
project in real life. It turned out well!
I created a project website and published it.
Project address:
snowindy.github.com/scriptlet4docx/
Code example
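// Bindings that the template scriptlets can refer to, for example ${name}
HashMap<String, Object> params = new HashMap<String, Object>();
params.put("name", "John");
params.put("sirname", "Smith");
// Read the template and write the processed document to the result file
DocxTemplater docxTemplater = new DocxTemplater(new File("path_to_docx_template/template.docx"));
docxTemplater.process(new File("path_to_result_docx/result.docx"), params);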
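The requirements also mention hashes and objects with field access. A minimal sketch of such bindings might look like this. The class BindingExample, the Customer bean and the sample values are hypothetical. The template expressions in the comments rely on ordinary Groovy property access (a getter for a bean, a key for a Map), assuming the bindings reach the Groovy template unchanged. The DocxTemplater call is the one from the example above and is shown only as a comment, so the sketch compiles without the library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class, not part of scriptlet4docx.
final class BindingExample {

    // Hypothetical bean; in a template, customer.name would resolve to getName().
    static final class Customer {
        private final String name;
        private final boolean vip;

        Customer(String name, boolean vip) {
            this.name = name;
            this.vip = vip;
        }

        public String getName() { return name; }
        public boolean isVip() { return vip; }
    }

    static Map<String, Object> buildParams() {
        // A hash binding: in a template, address.city would resolve to the "city" key.
        Map<String, Object> address = new HashMap<String, Object>();
        address.put("city", "London");
        address.put("street", "Baker Street");

        Map<String, Object> params = new HashMap<String, Object>();
        params.put("customer", new Customer("John Smith", true)); // template: ${customer.name}
        params.put("address", address);                           // template: ${address.city}
        return params;
    }

    // With the library on the classpath, the bindings would be passed as before:
    // new DocxTemplater(new File("path_to_docx_template/template.docx"))
    //         .process(new File("path_to_result_docx/result.docx"), buildParams());
}
```

In a real project the bean would normally be a public top-level class rather than a nested one.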
Scriptlet types explanation
${ data }
Equivalent to out.print(data)
<%= data %>
Equivalent to out.print(data)
<% any_code %>
Evaluates the enclosed code without producing any output. It may be
used for conditions split across several scriptlets:
<% if (cond) { %>
This text block will be printed in case of "cond == true"
<% } else { %>
This text block will be printed otherwise.
<% } %>
$[ @listVar.field ]
This is a custom Scriptlet4docx scriptlet type designed to
output a collection of objects to docx tables. It must be used inside
a table cell.
Say, we have a list of person objects. Each has two fields:
'name' and 'address'. We want to output them to a
two-column table.
- Create a binding with key 'personList'
referencing that collection.
- Create a two-column table inside a template
docx-document: two columns, one row.
- $[@person.name] goes to the first column cell;
$[@person.address] goes to the second.
- Voila, the whole collection will be printed to the
table.
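On the Java side, the binding from the first step could be prepared like this. It is only a sketch: the class PersonListExample, the Person bean and the sample data are hypothetical, and the DocxTemplater call (taken from the code example earlier in this article) appears only as a comment so that the code compiles without the library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class, not part of scriptlet4docx.
final class PersonListExample {

    // Hypothetical bean with the two fields used in the table: name and address.
    static final class Person {
        private final String name;
        private final String address;

        Person(String name, String address) {
            this.name = name;
            this.address = address;
        }

        public String getName() { return name; }
        public String getAddress() { return address; }
    }

    static Map<String, Object> buildParams() {
        List<Person> personList = new ArrayList<Person>();
        personList.add(new Person("John Smith", "1 Main Street"));
        personList.add(new Person("Jane Doe", "2 Side Street"));

        // The key 'personList' is the binding that the table scriptlets refer to.
        Map<String, Object> params = new HashMap<String, Object>();
        params.put("personList", personList);
        return params;
    }

    // With the library on the classpath, the template would then be processed as before:
    // new DocxTemplater(new File("path_to_docx_template/template.docx"))
    //         .process(new File("path_to_result_docx/result.docx"), buildParams());
}
```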
Live template example
You can see all of the scriptlets mentioned above in use in a
demonstration template.
Project future
If I have actually developed a new approach to processing
docx templates, it would be nice to popularize it.
Project TODOs:
- Caching of preprocessed templates
- Scriptlet support in lists
- Streaming API
N.B. English is not my main language, excuse me for the mistakes.