
How to Programmatically Create HTML, ODT, DOCX & PDF Documents for Free

3 Apr 2023, CPOL, 10 min read
Learn to create documents in popular formats using only free and open-source software
Learn to create documents in popular formats for free. The entire toolchain will be free and open-source software (FOSS). This process can be easily automated and integrated with application servers.

Introduction

Governments, businesses and other organizations create a lot of documents. When they automate their manual or paper-based operations, they purchase proprietary document-creation software components that cost hundreds or even thousands of dollars per year or per computer. It seems like a colossal waste, particularly when there is free and open-source software that can do the same job at no cost.

Background

I have self-published a lot of books and they were all created using only free and open-source software (FOSS). My only expenses have been for internet access and electricity. When I listen to podcasts or see online videos featuring other self-published authors, I am horrified to learn that they have spent thousands of dollars to publish just one book. This does not include marketing.

In this article, I will describe how anyone can create ODT, DOCX and PDF files for free. The entire toolchain will be FOSS. This process can be easily automated and integrated with application servers.

Use CommonMark (Markdown) for the Source Document

You can write documents in plain text, markup, markdown or directly in rich text. If you use markdown as the source, you can easily export it to all the other forms. HTML (HyperText Markup Language) is markup; a browser displays it as rich text. Rich text is what you see in a browser or in a document editor such as Microsoft Word or LibreOffice Writer. Markup is what you see when you do View » Source on a web page. The DOCX/ODT file that you create in Word/Writer is actually a renamed ZIP archive containing mostly XML (eXtensible Markup Language) files.

Markdown is the opposite of markup. It uses ordinary plain-text enhancements to unobtrusively mark up text. Before the World Wide Web came into existence, the Internet was mostly email. Email, BBS and Usenet (newsgroups) users developed a plain-text style of formatting for their messages. For example, bold type was wrapped in **asterisks**. Italicized text was wrapped in _underscores_. When John Gruber and Aaron Swartz created Markdown in 2004, they extended this style even further. It is now the most popular form of markdown. Markdown was released as a Perl script. Markdown documents are usually saved with the extension .md. I wrote my first book in markdown and converted it like this:

Bash
perl markdown.pl jokebook.md > jokebook.html

Other implementations of Markdown were based on the Perl script (markdown.pl) but there were some differences between them. Finally, in 2014, Jeff Atwood and John MacFarlane published a standardized specification of Markdown called CommonMark. I learned about it sometime in 2020 and earned the bragging rights for the first book on CommonMark. It is called CommonMark Ready Reference. It is available for free in many ebook stores. However, I will give you a quickie intro so that you will have an idea of what it looks like.

Markdown primer

Unlike the original Markdown, CommonMark has a proper specification. It also has an implementation written in C that is blisteringly fast. (I provide Linux and Windows executables on my website.) With the C-based executable, you can convert your Markdown/CommonMark source documents like this:

Bash
commonmark --unsafe --validate-utf8 jokebook.md > jokebook.html

CommonMark can generate headings, paragraphs, blockquotes, images, links, lists, code spans and blocks, horizontal rules and line breaks. However, that is about it. It cannot generate tables and other fancy stuff. If you want those, you can write raw HTML and use the --unsafe option. By default, CommonMark omits raw HTML to protect software systems from code injection.

The HTML that the CommonMark executable or the Markdown Perl script generates is well-structured and will pass validation. However, it is only a fragment: it will not have HTML, HEAD, TITLE or BODY tags. The executable's sole purpose is to create HTML markup that can be dropped straight into a pre-existing page or an HTML template.

Imagine that this is the markdown source document.

markdown
Science Jokes
-------------

* **How many astronauts would it take to screw in a lightbulb?**  
  One to turn the bulb and several to prevent the spacecraft 
  from spinning in the same direction.
* **What did one radio wave say to another?**  
  "You are interfering with my work."
* **What's a radio engineer's favourite food?**  
  A can of tuna.

CommonMark can be used to convert it like this:

Bash
echo '<!DOCTYPE html><html><head><title>2020 Jokebook</title></head><body>' > jokebook.html
commonmark --unsafe --validate-utf8 jokebook.md >> jokebook.html
echo '</body></html>' >> jokebook.html

The output HTML will look like this:

HTML
<!DOCTYPE html><html><head><title>2020 Jokebook</title></head><body>
<h2>Science Jokes</h2>
<ul>
<li><strong>How many astronauts would it take to screw in a lightbulb?</strong><br />
One to turn the bulb and several to prevent the spacecraft from 
spinning in the same direction.</li>
<li><strong>What did one radio wave say to another?</strong><br />
&quot;You are interfering with my work.&quot;</li>
<li><strong>What's a radio engineer's favourite food?</strong><br />
A can of tuna.</li>
</ul>
</body></html>

CommonMark-generated markup starts from <h2> and ends in </ul>. The rest is the HTML template. This HTML looks like this in a browser.

Screenshot of HTML document
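
The three-step wrapping shown above can also be folded into one small shell function. This is only a sketch: the function name wrap_html is made up for this example, the meta charset tag is an addition that is not in the template above, and the title is inserted as-is, so it must not contain characters such as <, > or &.

```shell
# wrap_html TITLE  (hypothetical helper, not part of any tool)
# Reads an HTML fragment on standard input and prints a complete page.
# The title is not HTML-escaped, so keep it free of <, > and &.
wrap_html() {
    printf '<!DOCTYPE html>\n<html><head><meta charset="utf-8"><title>%s</title></head><body>\n' "$1"
    cat    # copy the fragment from standard input unchanged
    printf '</body></html>\n'
}

# Intended use, assuming the commonmark executable described above:
#   commonmark --unsafe --validate-utf8 jokebook.md | wrap_html "2020 Jokebook" > jokebook.html
```

Because the function reads from a pipe, no temporary file is needed and the template lives in one place.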

Use LibreOffice to Create ODT, DOCX and PDFs

You already know that LibreOffice is the FOSS alternative to Microsoft Office. It has a word processor Writer (the Word alternative), spreadsheet application Calc (the Excel alternative), presentation slide maker Impress (the PowerPoint alternative) and a few other applications. While LibreOffice hoots and toots like a regular GUI application, it also has a demure command-line interface.

To convert the aforementioned HTML document to ODT format:

Bash
libreoffice --convert-to "odt" jokebook.html

ODT document

You can use the same HTML document and convert it to DOCX so that Microsoft Office users can feel happy. (Microsoft Word can edit ODT files just fine.)

Bash
libreoffice --convert-to "docx:MS Word 2007 XML" jokebook.html

You can use the ODT or DOCX file that you generated and convert it to a PDF file.

Bash
libreoffice --convert-to "pdf" jokebook.odt

PDF document

Why not convert to PDF straight from the HTML? Why create the intermediate ODT or DOCX file? Because the HTML document does not have any concept of page size, margins, headers and footers.
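
For automation, the HTML to ODT to PDF chain can be wrapped in a function. The sketch below makes a few assumptions: the function name html_to_pdf and the SOFFICE variable are invented for this example, and the --headless and --outdir options are added so that no GUI is started and the output lands next to the input file. On some systems, the launcher is called soffice instead of libreoffice.

```shell
# SOFFICE names the LibreOffice launcher (assumption: override it if yours differs).
SOFFICE=${SOFFICE:-libreoffice}

# html_to_pdf FILE.html  (hypothetical helper)
# Converts FILE.html to FILE.odt, then FILE.odt to FILE.pdf.
# Both output files are written to the directory of the input file.
html_to_pdf() {
    html=$1
    outdir=$(dirname "$html")
    odt=${html%.*}.odt    # same path with the extension swapped
    "$SOFFICE" --headless --convert-to odt --outdir "$outdir" "$html" &&
    "$SOFFICE" --headless --convert-to pdf --outdir "$outdir" "$odt"
}

# Intended use:
#   html_to_pdf jokebook.html
```

The second conversion runs only if the first one succeeds, so a failed ODT export does not leave a stale PDF behind.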

Create Documents With Images

When you convert an HTML document containing images, the resultant ODT or DOCX documents will show the images all right. When you move the documents or mail them to someone, the images will disappear. This is because the images in the ODT or DOCX documents continue to be loaded from the source image files. To fix this problem, you need to encode the images as text. This is similar to how images and attachments are encoded in email messages, that is, using base64 encoding. If you view the message source of an email containing an attachment, you will find the file encoded as plain text.

Instead of using an image file like this:

HTML
<img src="lion-and-deer.png" />

… you can encode it as text like this …

HTML
<img src="data:image/png;base64,iVBORw0KGgoAAAA…" />

No, that is not all. I have truncated the actual text of the encoded image. The full text is nearly 600 lines. If you are curious, you can try a command like this:

Bash
base64 lion-and-deer.png

Text-encoded image
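
If you ever do need the text-encoded form by hand, the base64 output has to be joined into a single line and prefixed with the MIME type. The sketch below shows one way to do that; the function name data_uri is made up for this example, and the caller is assumed to know the correct MIME type of the file.

```shell
# data_uri FILE MIME_TYPE  (hypothetical helper)
# Prints a data: URI for FILE that can be used as the src of an <img> tag.
data_uri() {
    printf 'data:%s;base64,' "$2"
    base64 < "$1" | tr -d '\n'    # remove the line wrapping that base64 adds
    printf '\n'
}

# Intended use:
#   printf '<img src="%s" />\n' "$(data_uri lion-and-deer.png image/png)"
```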

You do not have to dirty your hands with manual text-encoding. LibreOffice will encode the images as text.

Bash
echo '<!DOCTYPE html><html><head><title>2020 Jokebook</title></head><body>' > jokebook.htm
commonmark  --unsafe --validate-utf8 jokebook.md >> jokebook.htm
echo '</body></html>' >> jokebook.htm

libreoffice --convert-to "html:HTML:EmbedImages" jokebook.htm

Here, the .htm file (referring to an external image) was created by CommonMark. LibreOffice consumed that .htm file and created a .html file with the text-encoded image. This self-contained HTML is now portable and not dependent on any external files. When you convert such an HTML document, then the resultant ODT or DOCX file will also be self-contained and portable. Even if you delete the source image file, you will still be able to see it in the .html, .odt or .docx files.

Bash
libreoffice --convert-to "html:HTML:EmbedImages" jokebook.htm
libreoffice --convert-to "odt" jokebook.html
libreoffice --convert-to "docx:MS Word 2007 XML" jokebook.odt
libreoffice --convert-to "pdf" jokebook.odt

Images in a document

Enhanced Document Content

As mentioned earlier, CommonMark outputs only a limited set of HTML tags. For creating content that it does not support, you will have to add raw HTML in your markdown.

markdown
Animal Jokes
------------

* **Why did the lion cross the road?**  
  Because <span style="color: white; background-color: red; 
  border-radius: 0.5em; border: 2px dashed yellow; ">the buck stops here</span>.  
  ![Lion and deer](lion-and-deer.png)

Do not go overboard with this raw HTML content. LibreOffice has its own limited set of HTML tags and CSS styles that it can convert.

Output with raw HTML in Markdown

What does this screenshot say? LibreOffice does not do rounded corners, among other things. So, temper your excitement.

This use of raw HTML is crude. It defeats the idea of markdown. The purpose of CommonMark is to create a well-structured document. Special styling can be effected by including CSS styles in the HTML template. I leave that as homework for you or your developers. (Just brush up on CSS pseudo-classes, selectors and attribute matching.) I gave the above example to make the concept easy to understand.

Unlike styles, tables are not fancy stuff. They are the nuts and bolts of financial documents. For those documents, you can use raw HTML inline. LibreOffice will convert HTML tables all right.

What about headers and footers? Does CommonMark support them? Can they be added to the HTML template, as CSS styles can? Unfortunately, no. This kind of toolchain is good for one-page documents, or for multi-page documents without headers and footers.

If you want those, then you have to use a different tool called wkhtmltopdf. This is essentially a headless WebKit browser that can convert heavily formatted HTML documents to PDF. You can specify headers and footers using separate HTML files. I created all my books using this tool. It supports lots of CSS styles that LibreOffice does not support. My books would not look rich if I relied on LibreOffice. (wkhtmltopdf is based on an old Qt WebKit codebase that is not being updated. A lone Indian developer is maintaining it. It also has a few bugs.) Also unlike LibreOffice, wkhtmltopdf can execute JavaScript. It has an option for adding a delay so that the JavaScript can do its thing before wkhtmltopdf starts printing the document to PDF.

So, for simple documents, use LibreOffice. And, for heavy-duty documents, use wkhtmltopdf. Both of these programs will use the headings in the source document to create the bookmark tree in the PDF document. LibreOffice has one advantage that wkhtmltopdf does not have: it can create EPUB ebooks and several other types of documents.
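
A wkhtmltopdf invocation with separate header and footer files might look like the sketch below. The function name book_pdf, the WKHTMLTOPDF variable, the A4 page size, the 25 mm margins and the two-second delay are all assumptions chosen for illustration. The --javascript-delay value is given in milliseconds. The header and footer options are understood to require a wkhtmltopdf build that uses the patched Qt.

```shell
# WKHTMLTOPDF names the executable (assumption: it is on the PATH).
WKHTMLTOPDF=${WKHTMLTOPDF:-wkhtmltopdf}

# book_pdf BODY.html HEADER.html FOOTER.html OUT.pdf  (hypothetical helper)
# Prints BODY.html to OUT.pdf with a header and a footer on every page.
book_pdf() {
    "$WKHTMLTOPDF" \
        --page-size A4 \
        --margin-top 25mm --margin-bottom 25mm \
        --header-html "$2" \
        --footer-html "$3" \
        --javascript-delay 2000 \
        "$1" "$4"
}

# Intended use:
#   book_pdf jokebook.html header.html footer.html jokebook.pdf
```

The top and bottom margins need to be large enough to hold the header and footer; otherwise they overlap the body text.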

Additional Document Features

If you have to merge two ODT or DOCX documents, I do not know of or care for any tool to do it. Do the merging in the markdown source document and then create the combined document.

For PDFs, there are lots of tools. For my books, I have to combine some page-size images with the PDF of the interior pages. For that, I use ImageMagick and pdftk.

Bash
magick title-page.png -resize 100% front.pdf
pdftk front.pdf jokes.pdf output book.pdf

pdftk is a very powerful tool and does more than merging PDF pages or collating them from different documents. It can also watermark and encrypt your PDF.

Bash
pdftk book.pdf output book-encrypted.pdf \
      encrypt_128bit \
      owner_pw RcHrDsTlMn^012 \
      user_pw FrSfTWrFnDtn^321
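
The watermarking mentioned above uses the pdftk operations background (the watermark goes behind the page content) and stamp (it goes on top). The following is a minimal sketch; the function name watermark_pdf, the PDFTK variable and the file names in the usage line are hypothetical. The watermark itself has to be a PDF file whose first page is applied to every page of the input.

```shell
# PDFTK names the executable (assumption: it is on the PATH).
PDFTK=${PDFTK:-pdftk}

# watermark_pdf IN.pdf MARK.pdf OUT.pdf [stamp|background]  (hypothetical helper)
# Applies the first page of MARK.pdf to every page of IN.pdf.
watermark_pdf() {
    mode=${4:-background}
    case $mode in
        stamp|background) ;;
        *) echo "watermark_pdf: mode must be stamp or background" >&2
           return 2 ;;
    esac
    "$PDFTK" "$1" "$mode" "$2" output "$3"
}

# Intended use:
#   watermark_pdf book.pdf draft-mark.pdf book-draft.pdf stamp
```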

And, to add a final touch of class, add metadata to the PDF.

Bash
echo "InfoBegin" > meta.txt
echo "InfoKey: Title" >> meta.txt
echo "InfoValue: 2020 Jokebook by V. Subhash" >> meta.txt

echo "InfoBegin" >> meta.txt
echo "InfoKey: Subject" >> meta.txt
echo "InfoValue: Fresh Clean Jokes" >> meta.txt

echo "InfoBegin" >> meta.txt
echo "InfoKey: Author" >> meta.txt
echo "InfoValue: V. Subhash (&#169; 2022 V. Subhash. All rights reserved.)" >> meta.txt

pdftk "jokebook.pdf" update_info meta.txt output 2020-jokebook.pdf

Document properties
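
When the metadata comes from a program rather than from hand-typed echo lines, a tiny helper keeps the three-line record format in one place. This is a sketch under assumptions: the function names info_record and write_meta are made up, and the values are simply the ones used in the example above.

```shell
# info_record KEY VALUE  (hypothetical helper)
# Prints one metadata record in the format that pdftk update_info reads.
info_record() {
    printf 'InfoBegin\nInfoKey: %s\nInfoValue: %s\n' "$1" "$2"
}

# write_meta FILE  (hypothetical helper)
# Writes the three records of the example book to FILE.
write_meta() {
    {
        info_record Title   "2020 Jokebook by V. Subhash"
        info_record Subject "Fresh Clean Jokes"
        info_record Author  "V. Subhash (&#169; 2022 V. Subhash. All rights reserved.)"
    } > "$1"
}

# Intended use:
#   write_meta meta.txt
#   pdftk jokebook.pdf update_info meta.txt output 2020-jokebook.pdf
```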

Whatever I cannot do with these free PDF tools, I go ahead and write my own custom utility using iText Java. For example, when I found that pdftk was unable to handle some of my bigger books, I could do it with a JAR file executable that I created using iText.

Points of Interest

  • Use in online applications: Document creation requires a lot of heavy lifting, in terms of CPU usage, disk access and memory requirements. Do not integrate these tools directly with web applications. Use a dæmon (background process or service) that is launched by the OS on startup to do the document creation. Your web applications should just queue document-creation jobs to this dæmon. When the documents are ready, the dæmon should invoke an API routine in the web application to notify it that the job has been completed. Otherwise, your online users will crash your system when their numbers build up.
  • Donate: If free software such as the tools mentioned in this article helped you to reduce costs, then make it a point to donate some money to their projects. If you are an independent software vendor (ISV), then tell your client that you did not write all the code and some part of the system relies on FOSS tools. Tell the client to make donations commensurate with the usage of the free software in that system. The creator of iText has an incredible story of how Google refused to offer anything more than a T-shirt and a mug for using his PDF library in their products such as Google Analytics, Google Docs and Google Calendar. Do not be like that. In many countries, corporate outfits have a statutory obligation to allocate a certain amount of their profits to charitable causes. Most of them will be glad to route such donations to open source projects.
  • This article was originally published in Open Source For You magazine.
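
The queue-and-dæmon arrangement from the first point can be sketched with nothing more than a spool directory. Everything named here is an assumption made for illustration: the /var/spool/docjobs path, the new, work, done and failed subdirectories, and the function names. The web application would drop HTML files into new, and the dæmon would call process_queue in a loop. The notification call back to the web application is left out.

```shell
# Hypothetical spool layout: $QUEUE_DIR/{new,work,done,failed}
QUEUE_DIR=${QUEUE_DIR:-/var/spool/docjobs}
# CONVERT names the command that turns one HTML file into a PDF.
CONVERT=${CONVERT:-convert_job}

# convert_job FILE.html OUTDIR: default converter, using LibreOffice.
convert_job() {
    libreoffice --headless --convert-to pdf --outdir "$2" "$1"
}

# process_queue: convert every job waiting in new/, one at a time.
process_queue() {
    for job in "$QUEUE_DIR"/new/*.html; do
        [ -e "$job" ] || continue    # the glob matched nothing
        name=$(basename "$job")
        # Claim the job by renaming it; skip it if another worker got there first.
        mv "$job" "$QUEUE_DIR/work/$name" 2>/dev/null || continue
        if "$CONVERT" "$QUEUE_DIR/work/$name" "$QUEUE_DIR/done"; then
            rm "$QUEUE_DIR/work/$name"
        else
            mv "$QUEUE_DIR/work/$name" "$QUEUE_DIR/failed/$name"
        fi
    done
}

# Dæmon loop (left commented out so that sourcing this file has no side effects):
#   while :; do process_queue; sleep 5; done
```

Because jobs are converted one after another, a burst of requests makes the queue longer instead of starting many LibreOffice processes at once.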

History

  • 2nd April, 2023: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).